The Subjective Nature of Data Labeling

In the space of NLP, labeling is a subjective experience. Every annotator will see each dataset a little bit differently. Managers of labeling projects have to create a system of review and clear instruction to ensure consistent, quality output.
Anna Redbond
PUBLISHED ON
May 14, 2022
May 13, 2021

Human language is incredibly complex. Just look at words like “tear” and “bear”, which can mean different things depending on context: “bear” can mean to carry, to withstand, or the animal. Homonyms like these are just one example of how contextual and layered our language and data are. To create a truly robust ML algorithm, you need to train it to handle subjectivity, context, and the annotator disagreements that arise when a word can be taken in multiple ways. Subjectivity is a core factor that any robust ML algorithm must be able to handle, so let’s take a look at what it is, how it shows up, and some best practices to mitigate potential issues.

What is Subjectivity in Data Labeling?

Subjectivity occurs in data labeling when there is no single correct answer or ground truth for the data. For example, there is no objective or black-and-white answer to whether or not a YouTube video is “funny” and should be labeled as such. When subjectivity crops up, the data labeler’s biases (language, expertise, experiences, cultural lenses, etc.) will influence the way that they interpret the data. Subjectivity must be handled with care and accuracy to effectively train the ML algorithm and make sure that your data labeling stays robust.

Data Quality and Loss

Before we dig too deep into best practices for subjectivity in data, it’s important to talk about why it matters. And that comes down to two words: data quality. Data quality is the lynchpin for training ML (Machine Learning) algorithms. If you put poor data into your ML algorithm, poor algorithms will come out the other side. There are a few key factors that can affect data quality, one of which is subjectivity: 

  1. Volume: A large dataset means a lot of data to check, maintain, and analyze, and more opportunities for quality drop-off.
  2. Decentralized data: If you have data coming from a lot of different platforms (YouTube, TikTok, etc.), it takes a lot of work to centralize and maintain the data and uphold quality. You are also dependent on the APIs that the platforms provide, and some data can simply be lost in transit.
  3. Cleaning: Cleaning and enriching data regularly is important. Without best practices in place for this, data quality can plummet, with significant detrimental effects on your ML model.
  4. Human error: Even the best annotators and taggers can make errors or interpret data incorrectly. For example, take a simple mislabeling of words like “aberrant” and “apparent” in medical data. This could happen when handling mass volumes of data and making a simple error with words that look similar, or it could happen if subjectivity is at play and the labeler is interpreting doctors’ notes. 
  5. Subjectivity: Data can be labeled and annotated subjectively, which can affect the quality of the final AI model and the annotation process itself. 

How Does Subjectivity Show Up in Data Labeling?

In AI, subjectivity can show up in many ways, such as in sentiment analysis or emotion analysis. It can then present itself in the data labeling at a couple of different levels:

  1. Data labelers can label differently to one another depending on their biases and experiences. 
  2. The very same data labeler might give a different answer or label from one day to the next. 

When subjectivity exists in data labeling, it can create an annotator conflict or inter-annotator disagreement. This means that there’s more than one option for labeling the data, and the data needs to be flagged, reviewed, and resolved to make sure that it’s labeled correctly. Let’s look at an example: 

On Twitter, Burger King tweets about chicken fries and someone retweets with the caption “That’s sick!”. The word “sick” could be taken as “awesome” or “disgusting” and labelers may disagree on whether the sentiment is positive or negative based on their experiences and biases. To add another layer, if one labeler has seen a lot of data using “sick” in a positive light that day, that could sway their labeling even if the previous day they would have labeled it negatively. 
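The kind of disagreement in the example above can be made measurable. The sketch below is one possible way to do that, not a method this post prescribes: it lists the items two annotators labeled differently and computes Cohen's kappa, a common agreement statistic that corrects raw agreement for chance. The function names and the sample labels are hypothetical and exist only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from how often each annotator uses each label.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(item_ids, labels_a, labels_b):
    """Return the ids of items the two annotators labeled differently."""
    return [i for i, a, b in zip(item_ids, labels_a, labels_b) if a != b]


# Hypothetical sample: four tweets, two annotators.
tweet_ids = ["t1", "t2", "t3", "t4"]
annotator_a = ["positive", "negative", "positive", "negative"]
annotator_b = ["positive", "negative", "negative", "negative"]

print(disagreements(tweet_ids, annotator_a, annotator_b))  # items to flag for review
print(cohens_kappa(annotator_a, annotator_b))
```

The same comparison can be run on one labeler's answers from two different days to check how consistent that person is with themselves.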

Another Lens: the Perks to Subjectivity in Data Labeling

“Subjectivity” and “bias” are words that often carry a negative association, but they don’t necessarily have to. Sometimes subjective data can be a good thing. Take the recommendations that streaming services like Netflix create. These are based on bias and preferences, with AI models learning that people who like one TV program tend to like another. Regardless of the nature of subjectivity in your data labeling, it’s important to follow best practices and maintain data quality to keep your ML algorithm strong.

Best Practices for Subjectivity in Data Labeling

To mitigate the risks of subjectivity, it pays to follow best practices, including:

  1. Mapping the decision-making process: Create clear, repeatable instructions for annotators with guidelines. If there is potential for ambiguity or disagreement, try to create examples that provide clarity and guidance. Annotators are crucial to the success of data labeling, and setting up solid processes is vital.
  2. Managing examples with high disagreements: When annotators or labelers disagree, this can create great material to inform the decision-making process. Be careful not to leave contentious data from inter-annotator disagreement unresolved, though, since this will lead to confusion for labelers and/or the AI model when learning how to deal with subjectivity.
  3. Cleaning and enriching the data:

      • Clean the data: If there are missing data or bad values, clean the data first. Remember, “garbage in, garbage out.” 

      • Enrich the data: Some data categories aren’t sufficient alone or are too generic. Enrich the data by adding annotations and tags within more granular, custom fields.

      • Build a data catalog: Build a catalog that lists things like the data source, availability, ownership, and what transformation has been done in the lifecycle of the data. 

  4. Using a labeling tool: There is no magic solution, but there are some incredible online annotation tools and natural language processing (NLP) tools out there that can help (you might be able to guess which one we recommend 🦕). There are a lot of benefits to using data annotation software or labeling tools to handle subjectivity in data labeling, namely that they scale better when handling large volumes of data and can minimize the risks of human error. What’s more, tools like Datasaur include a review feature that will surface the inter-annotator disagreements so that you can easily see, and resolve, contentious labels.
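As a rough illustration of the second practice above (not leaving contentious data unresolved), here is a minimal sketch that accepts a label only when enough annotators agree and sends everything else to a human review queue. It is an assumption-laden example, not how Datasaur or any particular tool works: the `resolve_labels` function, the `min_agreement` threshold of 0.75, and the sample data are all hypothetical.

```python
from collections import Counter


def resolve_labels(annotations, min_agreement=0.75):
    """Split items into resolved labels and a review queue.

    annotations: dict mapping an item id to the list of labels it received.
    min_agreement: hypothetical threshold; the share of annotators that must
        agree before the majority label is accepted without review.
    """
    resolved = {}
    review_queue = []
    for item_id, labels in annotations.items():
        if not labels:
            # Nothing to vote on, so a reviewer has to look at it.
            review_queue.append(item_id)
            continue
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count / len(labels) >= min_agreement:
            resolved[item_id] = top_label
        else:
            # Too contentious to settle by vote; keep it for a reviewer.
            review_queue.append(item_id)
    return resolved, review_queue


# Hypothetical sample: two tweets, four annotators each.
sample = {
    "tweet-1": ["positive", "positive", "positive", "negative"],
    "tweet-2": ["positive", "negative", "negative", "positive"],
}
resolved, review_queue = resolve_labels(sample)
print(resolved)
print(review_queue)
```

Items that land in the review queue are also good candidates for new examples in the annotator guidelines, since they show where the instructions leave room for interpretation.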

Subjectivity is a factor to keep in mind whenever you’re labeling data and care about quality, and the good news is that there are a growing number of tools and options out there to help you manage it effectively. If you’d like to continue the conversation about subjectivity, best practices, and data labeling, feel free to reach out to info@datasaur.ai!