Human language is incredibly complex. Just look at words like “tear” and “bear”, which can mean different things depending on context: “bear” can mean to carry, to withstand, or the animal. Homonyms like these are just one example of how contextual and layered our language and data are. To create a truly robust ML algorithm, you need to train it to handle subjectivity, context, and annotator disagreements when a word can be taken in multiple ways. Let’s take a look at what subjectivity is, how it shows up, and some best practices to mitigate potential issues.
Subjectivity occurs in data labeling when there is no single correct answer or ground truth for the data. For example, there is no objective, black-and-white answer to whether a YouTube video is “funny” and should be labeled as such. When subjectivity crops up, the data labeler’s biases (language, expertise, experiences, cultural lenses, etc.) will influence the way that they interpret the data. Subjectivity must be handled with care to train the ML algorithm effectively and keep your labeled data reliable.
Before we dig too deep into best practices for subjectivity in data, it’s important to talk about why it matters. And that comes down to two words: data quality. Data quality is the lynchpin for training ML (Machine Learning) algorithms. If you put poor data into your ML algorithm, poor results will come out the other side. There are a few key factors that can affect data quality, and subjectivity is one of them.
In AI, subjectivity can show up in many tasks, such as sentiment analysis or emotion analysis, and it can surface in the data labeling at more than one level.
When subjectivity exists in data labeling, it can create an annotator conflict or inter-annotator disagreement. This means that there’s more than one option for labeling the data, and the data needs to be flagged, reviewed, and resolved to make sure that it’s labeled correctly. Let’s look at an example:
On Twitter, Burger King tweets about chicken fries and someone retweets with the caption “That’s sick!”. The word “sick” could be taken as “awesome” or “disgusting” and labelers may disagree on whether the sentiment is positive or negative based on their experiences and biases. To add another layer, if one labeler has seen a lot of data using “sick” in a positive light that day, that could sway their labeling even if the previous day they would have labeled it negatively.
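One common way to put a number on this kind of disagreement is Cohen’s kappa, which compares how often two labelers agree against how often they would agree by chance. The sketch below is only an illustration under stated assumptions: the function name `cohen_kappa` and the five example labels are hypothetical, and it covers the simple case of exactly two labelers who each labeled every item.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: estimated from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels for five tweets (illustrative data only).
annotator_1 = ["positive", "negative", "positive", "neutral", "positive"]
annotator_2 = ["negative", "negative", "positive", "neutral", "positive"]

kappa = cohen_kappa(annotator_1, annotator_2)
# Indices of the tweets the two annotators labeled differently.
disagreements = [
    i for i, (a, b) in enumerate(zip(annotator_1, annotator_2)) if a != b
]
print(f"kappa={kappa:.4f}, disagreements at indices {disagreements}")
```

A kappa near 1 means the labelers almost always agree beyond chance, while a value near 0 suggests the labels are too subjective or the guidelines too loose. The list of disagreeing indices shows which items, like the “That’s sick!” retweet, need a second look.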
“Subjectivity” and “bias” are words that often carry a negative association, but they don’t have to. Sometimes subjective data can be a good thing. Take the recommendations that streaming services like Netflix create. These are based on bias and preferences, with AI models learning that people who like one TV program tend to like another. Regardless of the nature of subjectivity in your data labeling, it’s important to follow best practices and maintain data quality to keep your ML algorithm strong.
To mitigate the risks of subjectivity, it pays to follow best practices, including:
• Clean the data: If there are missing data or bad values, clean the data first. Remember, “garbage in, garbage out.”
• Enrich the data: Some data categories aren’t sufficient alone or are too generic. Enrich the data by adding annotations and tags within more granular, custom fields.
• Build a data catalog: Build a catalog that lists things like the data source, availability, ownership, and which transformations have been applied over the lifecycle of the data.
• Use a labeling tool: There is no magic solution, but there are some incredible online annotation tools and natural language processing (NLP) tools out there that can help (you might be able to guess which one we recommend 🦕). There are a lot of benefits to using data annotation software or labeling tools to handle subjectivity in data labeling, namely that they scale better when handling large volumes of data and help minimize the risk of human error. What’s more, tools like Datasaur include a review feature that will surface the inter-annotator disagreements so that you can easily see and resolve contentious labels.
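To make the idea of surfacing contentious labels concrete, here is a minimal sketch of a review queue. It is not how Datasaur or any other tool implements its review feature. The function name `review_queue`, the `min_agreement` threshold of 0.75, and the sample data are all assumptions made for illustration. Missing labels are dropped first (the “clean the data” step), then any item whose most popular label falls below the threshold is flagged for a human reviewer.

```python
from collections import Counter


def review_queue(annotations, min_agreement=0.75):
    """Split items into accepted labels and items that need review.

    annotations: dict mapping item id -> list of labels (None = missing).
    min_agreement: assumed threshold; the share of votes the top label
    must reach for the item to be accepted without review.
    """
    accepted, needs_review = {}, []
    for item_id, labels in annotations.items():
        # Cleaning step: ignore missing labels.
        votes = [label for label in labels if label is not None]
        if not votes:
            needs_review.append((item_id, {}))
            continue
        counts = Counter(votes)
        top_label, top_count = counts.most_common(1)[0]
        if top_count / len(votes) >= min_agreement:
            accepted[item_id] = top_label
        else:
            # Keep the vote breakdown so a reviewer can see the split.
            needs_review.append((item_id, dict(counts)))
    return accepted, needs_review


# Hypothetical labels from four annotators (illustrative data only).
annotations = {
    "tweet-1": ["positive", "positive", "positive", "positive"],
    "tweet-2": ["positive", "negative", "negative", "positive"],
    "tweet-3": ["negative", None, "negative", "negative"],
}

accepted, needs_review = review_queue(annotations)
print("accepted:", accepted)
print("needs review:", needs_review)
```

The threshold is a design choice: a higher value sends more items to review and costs more reviewer time, while a lower value accepts more split decisions automatically.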
Subjectivity is a factor to keep in mind whenever you’re labeling data with quality as the goal, and the good news is that there are a growing number of tools and options out there to help you manage it effectively. If you’d like to continue the conversation about subjectivity, best practices, and data labeling, feel free to reach out to info@datasaur.ai!