Enhancing Data Quality: Finding and Fixing Label Errors with Datasaur

Leverage Label Error Detection to help you train your models more efficiently
Datasaur
Published on July 6, 2024
July 12, 2024

The data-centric approach to machine learning holds that high-quality datasets are critical and will ultimately make or break a model. Data preparation alone, however, can take up 80% of the allocated resources, which makes labeling consistency, data cleansing, and error correction all the more important. Following the same paradigm, the Datasaur app supports these needs with a labeling system that promotes consistency through label consensus and error detection, improving overall dataset quality.

Label Error Detection

We are thrilled to announce Label Error Detection, a feature powered by Metadata Archaeology. It identifies label errors by tracking the model's loss on each individual example throughout the training process.

This technique involves creating two subsets, called probe suites: one of examples with correct labels and one of examples with incorrect labels. By comparing the training dynamics of each data point in the dataset against those of the two probe suites, we can infer the data point's metadata, that is, whether its label is likely correct or incorrect.
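To make the idea concrete, here is a minimal sketch of how a data point could be tagged from its loss trajectory. It assumes a k-nearest-neighbor vote over per-epoch losses, which is one common way to compare training dynamics and not necessarily how Datasaur implements it. The function name `nearest_probe_label`, the tag strings, and all loss values are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def nearest_probe_label(trajectory, probes, k=3):
    """Tag one loss trajectory by majority vote among its k nearest probes.

    trajectory: list of per-epoch losses for a single example.
    probes: list of (loss_trajectory, tag) pairs, tag is "correct" or "incorrect".
    """
    # Rank probe examples by Euclidean distance between loss trajectories.
    ranked = sorted(probes, key=lambda p: math.dist(trajectory, p[0]))
    votes = Counter(tag for _, tag in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy probe suites (made-up numbers): loss drops quickly for correctly
# labeled examples and stays high for mislabeled ones.
probes = [
    ([2.1, 1.2, 0.5, 0.2], "correct"),
    ([2.0, 1.0, 0.4, 0.1], "correct"),
    ([2.2, 1.3, 0.6, 0.2], "correct"),
    ([2.3, 2.4, 2.2, 2.0], "incorrect"),
    ([2.5, 2.6, 2.4, 2.1], "incorrect"),
    ([2.4, 2.3, 2.3, 1.9], "incorrect"),
]

# Loss trajectories recorded for two unlabeled-quality examples.
dataset = {
    "example_a": [2.0, 1.1, 0.5, 0.15],
    "example_b": [2.4, 2.5, 2.3, 2.0],
}

flags = {name: nearest_probe_label(traj, probes) for name, traj in dataset.items()}
print(flags)
```

In this toy setup, an example whose loss stays high across epochs lands closer to the incorrect-label probes and is flagged for review.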

Case Study

We conducted experiments using Metadata Archaeology on several classification tasks with publicly available datasets: AGnews for news articles, IMDB for movie review sentiment, Yelp_polarity for Yelp review sentiment, and Dbpedia_14 for Wikipedia content classification. Because no datasets built specifically for this purpose were available, we simulated data errors by sampling the data and altering 10% of the labels. This setup let us evaluate the effectiveness of our label error detection using the F1 score.
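The evaluation setup can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used in the whitepaper: the helper names `flip_labels` and `f1_score`, the toy label list, and the stand-in detector output are all hypothetical. It flips a fixed fraction of labels to a different class, then scores a set of flagged indices against the known flipped indices.

```python
import random


def flip_labels(labels, num_classes, fraction=0.10, seed=0):
    """Return a noisy copy of labels plus the set of indices that were altered."""
    rng = random.Random(seed)
    n_flip = round(len(labels) * fraction)
    flipped_idx = set(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in flipped_idx:
        # Always pick a class different from the original label.
        noisy[i] = rng.choice([c for c in range(num_classes) if c != labels[i]])
    return noisy, flipped_idx


def f1_score(true_errors, predicted_errors):
    """F1 of predicted error indices against the known altered indices."""
    tp = len(true_errors & predicted_errors)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_errors)
    recall = tp / len(true_errors)
    return 2 * precision * recall / (precision + recall)


# Toy dataset: 200 labels over 4 classes, 10% of which get altered.
labels = [i % 4 for i in range(200)]
noisy, flipped = flip_labels(labels, num_classes=4)

# Stand-in for a detector's output (hypothetical): it misses one true
# error and raises one false alarm.
missed = min(flipped)
false_alarm = next(i for i in range(len(labels)) if i not in flipped)
predicted = (flipped - {missed}) | {false_alarm}

score = f1_score(flipped, predicted)
print(f"altered labels: {len(flipped)}, detection F1: {score:.2f}")
```

Because the altered indices are known, precision and recall of the detector can be computed directly, which is what makes synthetic label noise convenient for this kind of evaluation.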

Find more details and insights in our Whitepaper below.

Full Whitepaper

Additionally, to learn how dataset quality affects model performance and where our Label Error Detection fits in, see our exploration below:

The Importance of Data Accuracy and How Label Error Detection Automates QA
