Evaluation Metrics: Easily Evaluate an ML Model in a Project
We are thrilled to announce the official launch of our latest feature: Evaluation Metrics on Project Analytics. This powerful addition is designed to help you assess the quality of your data labeling projects. Read on to discover how this feature is set to redefine your data labeling experience, and for a more detailed explanation, see our dedicated GitBook page.
Understanding Evaluation Metrics
Evaluation Metrics on Project Analytics brings a data-driven approach to assessing the performance of your labeled data. It calculates accuracy, precision, recall, and F1 score, and it also provides a confusion matrix so you can see how answers are distributed. Here is what each metric tells you (a short code sketch follows the list):
- Accuracy: Calculates the overall correctness of your labeled data, providing a percentage of accurately labeled instances.
- Precision: Measures the accuracy of positive predictions, giving you insight into how reliable a positive label is.
  - "Of all the instances predicted as positive, how many were actually positive?"
  - Real-world example where maximizing precision matters: email spam detection, because we do not want a perfectly normal email to be incorrectly classified as spam; in other words, we want to minimize false positives.
- Recall: Assesses the ability to capture all relevant instances, giving you a sense of how well your labeling process identifies true positives.
  - "Of all the actual positive instances, how many were correctly identified?"
  - Real-world example where maximizing recall matters: medical diagnostic tools for cancer screening, because the system cannot afford to label a cancerous case as non-cancerous; in other words, we want to minimize false negatives.
- F1 Score: Strikes a balance between precision and recall, offering a holistic measure of your data labeling project's overall effectiveness.
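Under the hood, these are the standard classification metrics, so you can sanity-check the numbers yourself. The following is a minimal sketch using scikit-learn with made-up answers (it is not code pulled from Datasaur): the reviewer's answers play the role of ground truth and one labeler's answers play the role of predictions.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

# Hypothetical answers for a binary radio-button question ("spam" vs. "not_spam").
reviewer_answers = ["spam", "not_spam", "spam", "spam", "not_spam", "not_spam"]  # ground truth
labeler_answers  = ["spam", "not_spam", "not_spam", "spam", "spam", "not_spam"]  # predictions

print("Accuracy :", accuracy_score(reviewer_answers, labeler_answers))
print("Precision:", precision_score(reviewer_answers, labeler_answers, pos_label="spam"))
print("Recall   :", recall_score(reviewer_answers, labeler_answers, pos_label="spam"))
print("F1 score :", f1_score(reviewer_answers, labeler_answers, pos_label="spam"))

# Rows are the reviewer's (true) classes, columns are the labeler's (predicted) classes.
print(confusion_matrix(reviewer_answers, labeler_answers, labels=["spam", "not_spam"]))
```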
How It Works and the Feature Scope
The Evaluation Metrics process kicks in automatically after project completion. It compares the responses from labelers (from Labeler Mode) with the ground truth established by reviewers (from Reviewer Mode), generates the metrics described above, and presents them to you with filters for questions, documents, and labelers.
Currently, it is available for Row Labeling projects with dropdown, hierarchical dropdown, checkbox, or radio button questions, excluding questions that allow multiple answers.
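Because results can be filtered by question, document, and labeler, you can also reproduce the per-labeler view from any export of labeler answers and reviewer ground truth. The sketch below is an illustration under assumptions: the column names (labeler, labeler_answer, reviewer_answer) are hypothetical and do not reflect Datasaur's actual export schema.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical export: one row per (labeler, document, question) with both answers.
rows = pd.DataFrame({
    "labeler":         ["alice", "alice", "bob", "bob"],
    "document":        ["doc-1", "doc-2", "doc-1", "doc-2"],
    "labeler_answer":  ["positive", "negative", "positive", "positive"],
    "reviewer_answer": ["positive", "positive", "positive", "negative"],
})

# Macro-averaged F1 per labeler, mirroring the "filter by labeler" view.
per_labeler_f1 = rows.groupby("labeler").apply(
    lambda g: f1_score(g["reviewer_answer"], g["labeler_answer"],
                       average="macro", zero_division=0)
)
print(per_labeler_f1)
```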
Evaluating ML Model
While evaluation metrics are commonly associated with assessing machine learning model performance, our approach extends their application to human labelers. That said, the feature still offers two distinct approaches for evaluating an ML model.
- With ML-Assisted Labeling
  In this approach, each labeler account represents the inference result of one ML model. This is the recommended method, as it allows you to assess multiple ML models and gain a comprehensive understanding of performance variations.
- With Pre-labeled Data
  If ML-Assisted Labeling is not suitable for you due to the additional work of building a custom API, you can use this second approach instead. Here, the ML model's inference results are represented as pre-labeled data. While this method accommodates only one ML model per document, it does not require a custom API the way ML-Assisted Labeling does, making it especially suitable for users running customized ML models that are not currently available from our providers (the sketch below illustrates the general idea).
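If you go with the pre-labeled data route, the general workflow is to run inference with your own model offline and package the predictions so they can be uploaded as pre-labels when the project is created. The sketch below is purely illustrative: `predict` is a stand-in for your custom model, and the JSON layout is a hypothetical one rather than Datasaur's actual pre-label import format (the GitBook page documents the real format and steps).

```python
import json

def predict(text: str) -> str:
    # Stand-in for your custom model's inference call.
    return "positive" if "great" in text.lower() else "negative"

documents = [
    {"id": "doc-1", "text": "This product is great."},
    {"id": "doc-2", "text": "Shipping was slower than expected."},
]

# Collect one predicted answer per document for a hypothetical "sentiment" question.
pre_labels = [
    {"document_id": doc["id"], "question": "sentiment", "answer": predict(doc["text"])}
    for doc in documents
]

with open("pre_labels.json", "w") as f:
    json.dump(pre_labels, f, indent=2)
```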
A detailed step-by-step guide for both approaches is provided on the GitBook page.
Unlocking Insights
With Evaluation Metrics, Datasaur empowers you to evaluate the performance of your ML model in comparison to reviewer answers. This feature not only ensures the quality of your labeled data but also provides actionable insights to enhance the overall efficiency of your data labeling projects and fine-tune your ML model. At Datasaur, our commitment to innovation and user-centric solutions drives us to continually enhance your data labeling experience.