Data Labeling for Natural Language Processing: a Comprehensive Guide

ML is a “garbage in, garbage out” technology. The effectiveness of the resulting model is directly tied to the input data; data labeling is therefore a critical step in training ML algorithms. Indeed, increasing the quantity and quality of training data can be the most efficient way to improve an algorithm. And with ML’s growing popularity the labeling task is here to stay. As you approach setting up or revisiting your own labeling process, review the following guide.
Ivan Lee
Published on April 29, 2022

We founded Datasaur to build the most powerful data labeling platform in the industry. We have spoken with 100+ machine learning teams around the world and compiled our learnings into the comprehensive guide below.

Table of Contents

— An Introduction to Machine Learning and Training Data

— Basic Task Types in NLP

— Raw Data

— Labeling Operations

— Labeling Tools

— Best Practices

— Conclusion

An Introduction to Machine Learning and Training Data

Machine Learning (ML) has made significant strides in the last decade. This can be attributed to parallel improvements in processing power and new breakthroughs in deep learning research. Another key contributor is the abundance of data that has been accumulated. Analysts estimate humankind sits atop 44 zettabytes of information today. OpenAI’s GPT-3 drew on a training corpus of roughly 500 billion tokens, amounting to hundreds of gigabytes of internet text! These algorithms have advanced at a phenomenal rate and their appetite for training data has kept pace.

Methods of feeding data into algorithms can take multiple forms. Unsupervised learning takes large amounts of unlabeled data and identifies its own patterns in order to make predictions for similar situations. It has been applied to large, unstructured datasets such as stock market behavior or Netflix show recommendations. This article will focus on supervised learning, in which humans apply their own set of labels to data so that a model can learn to classify other, unseen data. Supervised learning often requires less data and can be more accurate, but it does require labels to be applied. The dataset, along with its associated labels, is referred to as ground truth. We will cover common supervised learning use cases below.

Additionally, data itself can be classified under at least 4 overarching formats: text, audio, images and video. While there are interesting applications for all types of data, we will home in on text data to discuss a field called Natural Language Processing (NLP).


Business Use Cases for NLP

ML adoption has been on the rise over the past decade, but I believe NLP is particularly well-suited for immediate adoption in a broad range of industries. Customers use Datasaur for summarizing millions of academic articles and identifying patterns in COVID-related research. Others rely on NLP models in the fight against misinformation to scan through every article uploaded to the internet and flag suspicious articles for human review. NLP can also support recurring business tasks such as sorting through customer support requests or product reviews. Given humanity’s reliance on language as our primary form of communication, I firmly believe NLP will soon become ubiquitous in augmenting our everyday lives.

Basic Task Types in NLP

There is a broad spectrum of use cases for NLP. One common use case is to understand the core meaning of a sentence or text corpus by identifying and extracting key entities. This sub-branch is commonly referred to as Named Entity Recognition or Named Entity Extraction. Consider a sentence such as “Big Bird sits on the porch.”

Big Bird can be identified as a character, while the porch might be labeled as a location. With enough examples, a model may be able to start recognizing other sentences following the same pattern, such as “Elmo sits on the porch” or “Cookie Monster stands on the street”. Extrapolating beyond this toy example, companies around the world are able to use this methodology to read a doctor’s notes and understand what medical procedures were performed; an algorithm can read a business contract and understand the parties involved and how much money changed hands.
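To make the entity labels concrete, here is a minimal sketch of how such annotations are often stored and converted into per-token tags. It assumes the example sentence is “Big Bird sits on the porch.” and uses hypothetical names (Span, tokenize, to_bio) and the common BIO tagging convention; actual labeling tools may use different formats.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One labeled entity, stored as character offsets into the text."""
    start: int  # inclusive
    end: int    # exclusive
    label: str


def tokenize(text):
    """Split into words and punctuation, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def to_bio(text, spans):
    """Convert span annotations into (token, BIO tag) pairs."""
    tagged = []
    for token, start, end in tokenize(text):
        tag = "O"  # outside any entity by default
        for span in spans:
            if start >= span.start and end <= span.end:
                prefix = "B" if start == span.start else "I"
                tag = f"{prefix}-{span.label}"
                break
        tagged.append((token, tag))
    return tagged


# Assumed example sentence and illustrative label names.
text = "Big Bird sits on the porch."
spans = [Span(0, 8, "CHARACTER"), Span(17, 26, "LOCATION")]

for token, tag in to_bio(text, spans):
    print(f"{token}\t{tag}")
```

Storing character offsets rather than token indices keeps the annotations valid even if the tokenizer changes later.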

Another popular area for NLP is sentiment analysis. This allows algorithms to understand the tone of a sentence. For example, we can train a binary classifier to decide whether a sentence is positive or negative.

More advanced classifiers can be trained beyond the binary on a full spectrum, differentiating between phenomenal, good, and mediocre. Sentiment analysis has been used to understand everything from product reviews on shopping sites to posts about a political candidate on social media to customer experience surveys. Generalizing sentiment analysis further, a field called document labeling allows us to categorize entire documents: a user sending a support email about login issues can be classified separately from an email about product availability, allowing a business to route the requests to the appropriate department.
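As a rough illustration of how labeled documents turn into a router, the sketch below trains a tiny Naive Bayes classifier on a handful of invented support emails. The class name, label names and training sentences are all hypothetical, and a production system would use far more data and an established library; the point is only to show labeled examples going in and a predicted category coming out.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, examples):
        for text, label in examples:
            self.doc_counts[label] += 1
            for word in tokens(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokens(text):
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented ground truth: (document, label) pairs.
training = [
    ("I cannot log in to my account", "login"),
    ("Password reset link does not work", "login"),
    ("Is the blue jacket back in stock", "availability"),
    ("When will this item be available again", "availability"),
]
model = NaiveBayes().fit(training)
print(model.predict("I forgot my password and cannot log in"))
```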

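Other, more advanced tasks in NLP include coreference resolution, dependency parsing, and building syntax trees, which allow us to break down the structure of a sentence in order to better deal with ambiguities in human language. A single sentence about Ernie greeting a friend on the phone, for instance, can be read in two ways: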

Interpretation 1: Ernie is on the phone with his friend and says hello

Interpretation 2: Ernie sees his friend who is on the phone, and says hello

Finally, it is possible to blend the tasks above, highlighting individual words as the reason for a document label.

Raw Data

While many of the toy examples above may seem clear and obvious, labeling is not always so straightforward. You will need to start with 2 key ingredients: data and a label set.

Some companies may have to begin by finding appropriate data sources. Many academics have scraped sites like Wikipedia, Twitter and Reddit to find real-world examples. Open data sources such as Kaggle, Project Gutenberg, and Stanford’s DeepDive may be good places to start.

Thanks to the era of Big Data and advances in cloud computing, many companies already have large amounts of data. Oftentimes this data will be referred to as unstructured data, or raw data. However, before it is ready to be labeled, this data often needs to be processed and cleaned. For example, when presenting data to your labeler, how would you like to determine where one sentence begins and another ends? How are semicolons treated? Make sure you don’t accidentally treat the ‘.’ at the end of “Mrs.” as an end-of-sentence delimiter! Data may also be missing or misspelled. In certain industries like healthcare and finance, it is important or even legally required to remove personally identifiable information (PII) from the data before it is presented to labelers.
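The two cleaning steps mentioned above can be sketched in a few lines. This is a simplified illustration, not a complete solution: the abbreviation list and the email and phone patterns are assumptions chosen for the example, and real PII removal in regulated industries needs far broader coverage and review.

```python
import re

# Assumed, deliberately short list of abbreviations that end in a period.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "etc.", "e.g.", "i.e."}


def split_sentences(text):
    """Split on ., ! or ? but not after a known abbreviation such as 'Mrs.'."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = word[-1] in ".!?" and word.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


# Illustrative patterns only: simple emails and US-style phone numbers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text):
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(split_sentences("Mrs. Smith called the clinic. She asked about her bill; it was late!"))
print(redact_pii("Reach me at jane.doe@example.com or 555-123-4567."))
```

Note that the semicolon is treated here as part of a sentence rather than a boundary; that is one possible answer to the question raised above, and your project may choose differently.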

Once you have identified your training data, the next big decision is determining how you’d like to label that data. The labels you apply can lead to completely different models. One team browsing a dataset of receipts may want to focus on the prices of individual items over time and use this to predict future prices. Another may be focused on identifying the store, date and timestamp in order to understand purchase patterns.

Practitioners will refer to the taxonomy of a label set. What level of granularity is required for this task? Is it enough to understand that a customer is sending in a customer complaint and route the email to the customer support team? Or would you like to specifically understand which product the customer is complaining about? Or even more specifically, whether they are asking for an exchange/refund, complaining of a defect, an issue in shipping, etc.? Note that the more granular the taxonomy you choose, the more training data will be required: each label needs a sufficient number of examples for the algorithm to train on it adequately, so more labels means more labeled data overall.
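One way to see the trade-off between granularity and data volume is to count examples per label at different levels of the taxonomy. The sketch below assumes labels are written as slash-separated paths such as "complaint/defect"; that naming scheme, the function names and the threshold of 2 are illustrative choices, not an industry standard.

```python
from collections import Counter


def coarsen(label, depth):
    """Keep only the first `depth` levels of a slash-separated label."""
    return "/".join(label.split("/")[:depth])


def underrepresented_labels(labels, min_examples):
    """Return labels that have fewer than `min_examples` labeled examples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_examples}


# Invented labels for six support emails.
labels = [
    "complaint/defect",
    "complaint/defect",
    "complaint/shipping",
    "complaint/refund",
    "question/availability",
    "complaint/defect",
]

# At full granularity several labels have too few examples...
print(underrepresented_labels(labels, min_examples=2))
# ...while the coarser, top-level taxonomy leaves fewer gaps.
print(underrepresented_labels([coarsen(l, 1) for l in labels], min_examples=2))
```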

Labeling Operations

Okay — we’ve established the raison d’être for labeled data. How do we actually start?

Many data scientists and students begin by labeling the data themselves. This has the advantage of keeping you close to the data. You may label 100 examples and decide you need to refine your taxonomy, adding or removing labels. You also fully control your own data quality.

In order to scale to the large number of labels often required to train algorithms and to save time, companies may choose to hire a professional service. The choice in labeling service can make a big difference in the quality of your training data, the amount of time required and the amount of money you need to spend.

Crowd-sourced labeling services

Amazon Mechanical Turk was established in 2005 as a way to outsource simple tasks to a distributed “crowd” of humans around the world. Since the ascent of AI, we have also seen a rise in companies specializing in crowd-sourced services for data labeling. Some of the top companies include Appen, Scale, Samasource, and iMerit. For a fee, these companies will take your data and set up a labeling task on their platforms. Labelers around the world registered with their service can label your data.

The advantages to using these companies include elastic scalability and efficiency. Due to the number of labelers on their platform they can frequently finish labeling your data faster than any other option. They will also bring expertise to the job, advising you on how to validate data quality or suggesting how to spot check the quality of work to ensure it is up to your standards.

Disadvantages include higher price, higher variance in data quality and the potential for data leaks. The companies will often charge a sizable margin on the data labeling services and require a minimum threshold on the number of labels applied. Fully crowd-sourced solutions can also suffer from labelers who game the system and create fake accounts. We have seen data leaks publicly embarrass companies such as Facebook, Amazon and Apple as the data falls into the hands of strangers around the world.

A separate but related class of labeling companies includes CloudFactory and DataPure. Their labelers are employed full-time and fully trained. This tends to improve quality, but it also increases costs.

Bringing labeling in-house

In response to the challenges above, some companies choose to hire labelers in-house. This offers greater control over both access to the data and the quality of the output. However, this choice does come with its own disadvantages. Sometimes models need to be trained in time to meet a business deadline. It is possible to outsource 500,000 labels in 2 weeks to a professional labeling service, but such capacity is difficult to build out internally; labeling projects can be seasonal and it is difficult to maintain an appropriately sized workforce at all times. In-house teams require significantly more planning and may force compromises in project timelines. Building out operational services requires a new set of skills that don’t always coincide with the company’s expertise.

So what should I do?

The decision to outsource or to build in-house will depend on each individual situation. I would start by answering the following questions:

  • Is subject matter expertise required for this labeling?
    Some types of data cannot be handled by laypersons. A legal document may require someone with a law degree to properly understand the technical lingo. A certain level of linguistics expertise may also be required. Despite considering myself fairly fluent in the English language, I personally had to think twice before labeling a verb as a past participle.
  • What are the risks (or legal requirements) for data privacy?
    If you are considering working with an external party, talk to them about the level of privacy they can adhere to. Is HIPAA compliance required for your data? Do you need labelers working with your data to be working on air-gapped computers? What tradeoffs are you willing to make on this front?
  • What is my threshold for data quality?
    What are the repercussions if my algorithm makes a mistake? Does an email get routed to the incorrect department? Or is there a life on the line? The more critical the data quality, the more you may want to bring this in-house so you can train your own labelers to the level of accuracy required for your line of work.
  • Will this be a core part of my business in the long-term?
    If training your own AI is a core part of your company identity, it may be helpful to make the investment and learn how to set up your own labeling workforce. It will likely save you money and improve operational efficiency in the long run.

Many companies also choose to do a combination of both — using an in-house labeling workforce for recurring or mission-critical jobs, while supplementing sudden bursts of data needs with an outsourced solution.

Labeling Tools

Now that you’ve got your data, your label set and your labelers, how exactly is the sausage made? The young ML industry is still quite varied in its approach.

The most common starting point is an Excel/Google spreadsheet. This interface is serviceable, ubiquitously understood and has a relatively gentle learning curve. It handles common labeling tasks such as part-of-speech and named entity recognition labeling. The main disadvantage is that the spreadsheet interface was not created for this task. It can be error prone: typos are easy to make, and columns of cells are not the most intuitive way to read a text document. Some types of labeling such as dependency parsing are simply not viable using spreadsheets. Most importantly, this approach does not scale: as your needs grow, you will require more advanced interfaces and workforce management solutions.

A standard for more advanced NLP companies is to turn to the open source community. Tools such as brat and WebAnno are popular labeling tools. These were built with labeling in mind, offering a wide array of customizations. They can be freely set up and hosted and handle more advanced NLP tasks such as dependency labeling. The downsides are that the learning curve is higher and some level of training and adjustment is required. Direct customer support can be limited. These tools are also in various levels of maintenance as they rely on the open-source community for improvements and bug fixes.

Others still choose to build their own tools in-house. This has the benefit of full integration with your own stack. However, building in-house tools requires an investment of engineering time, not only to set up the initial tool but also to provide ongoing support and maintenance.

Commercial tools are also available. These include Prodigy and Datasaur (you can imagine our recommendation ❤️🦕). These companies offer labeling tools at various price points. Similar to the open-source tools, they offer customizability and handle advanced NLP tasks. Other features to consider include team management workflows for your labeling team, labeling performance reports/dashboards, data security and access control, on-premise deployment options and ML-assisted labeling. ML-assisted labeling is a relatively recent development that gives your labelers a head start. Instead of labeling everything from scratch, a model can be plugged in to label relatively common terms, leaving the labeler to review and correct its suggestions.
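The idea behind ML-assisted labeling can be sketched with a simple stand-in. In the example below a dictionary lookup plays the role of the plugged-in model and proposes spans for a human to accept or correct. The function name prelabel, the gazetteer contents and the label names are hypothetical, and commercial tools typically rely on trained models rather than a lookup table.

```python
import re


def prelabel(text, gazetteer):
    """Suggest (start, end, label) spans for known terms; a labeler reviews them."""
    suggestions = []
    for term, label in gazetteer.items():
        # Whole-word, case-insensitive match for each known term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            suggestions.append((match.start(), match.end(), label))
    return sorted(suggestions)


# Hypothetical list of common terms and their labels.
gazetteer = {
    "Big Bird": "CHARACTER",
    "Elmo": "CHARACTER",
    "porch": "LOCATION",
}

sentence = "Elmo sits on the porch with Big Bird."
for start, end, label in prelabel(sentence, gazetteer):
    print(f"{sentence[start:end]!r} -> {label}")
```

Whatever produces the suggestions, the labeler stays in the loop: pre-labels save time on common terms, but rare or ambiguous cases still need human judgment.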

As with many situations, choosing the right tool for the job can make a significant difference in the final output. Considerations should include the intuitiveness of the interface for your particular task. What types of labeling jobs do they specialize in? Is there sufficient customizability for your project’s unique needs? Will you be able to organize and prioritize labeling projects from a single interface? What level of support is offered when questions or issues arise? What is your budget allocation? Identify your primary pain points to find the right solution for your job.

Best Practices

We’ve interviewed 100+ data science teams around the world to better understand best practices in the industry. Below are 3 of the most common observations:

  • Iteration: if 500,000 documents need to be labeled, start with a small subset first. Review the first batch of data carefully and make sure it conforms to your expectations. As with many aspects of our industry, rarely is a project set up perfectly on the first try, and an iterative approach will save significant time and money in the long run.
  • Labeling redundancy: humans are fallible and may make mistakes, especially when labeling at the end of a long day. Additionally, there are subjective biases in each judgment. A common practice is to have 2+ labelers label the same data. For some projects, a majority consensus is sufficient for determining ground truth. For others, nothing short of unanimity and a discussion around each disagreement is acceptable.
  • Setting up comprehensive guidelines: one of the most common points of failure in the industry is lack of specificity when setting up the project. As one example, a client I work with needed to remove “inappropriate content” from their live chat. Certain content is easy to weed out, such as extreme racial slurs and death threats. However, where is the line drawn? How should sarcasm or jokes be treated? These edge cases need to be well-defined by the product and engineering team in order to avoid surprises when the labeled work is complete.
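The redundancy practice above can be sketched as two small helpers: one resolves several labelers’ votes into a single label by majority or unanimity, and one measures how well two labelers agree using Cohen’s kappa, a standard chance-corrected agreement score. The function names and sample votes are illustrative.

```python
from collections import Counter


def consensus(votes, unanimous=False):
    """Return the agreed label, or None if the votes need a discussion."""
    label, count = Counter(votes).most_common(1)[0]
    # Strict majority by default; every vote must match when unanimous=True.
    needed = len(votes) if unanimous else len(votes) // 2 + 1
    return label if count >= needed else None


def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:  # both labelers used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


print(consensus(["pos", "pos", "neg"]))                  # majority is enough
print(consensus(["pos", "pos", "neg"], unanimous=True))  # flagged for review

labeler_a = ["pos", "pos", "neg", "neg"]
labeler_b = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(labeler_a, labeler_b))
```

A low kappa on an early batch is often a sign that the guidelines, rather than the labelers, need another iteration.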


Conclusion

ML is a “garbage in, garbage out” technology. The effectiveness of the resulting model is directly tied to the input data; data labeling is therefore a critical step in training ML algorithms. Indeed, increasing the quantity and quality of training data can be the most efficient way to improve an algorithm. And with ML’s growing popularity the labeling task is here to stay. As you approach setting up or revisiting your own labeling process, review the following checklist:

Data source

  • How will you collect the data?
  • How will you clean it?

Label set

  • In order to train your model, what types of labels will you need to feed in?
  • What level of granularity in taxonomy is required for your model to make the correct predictions?
  • Can you start with a simpler model first, then refine it later?

Labeling service

  • Will you go with an external or internal workforce? Should you use a hybrid approach?
  • Are subject-matter experts required?
  • Are there any compliance or regulatory requirements to be met?

Labeling tool

  • What type of interface is needed?
  • Is semi-automated labeling applicable to your project?
  • What level of security and data permissioning is required?
  • How do you intend to manage your workforce? Should that be included in the software?

There are many options available and the industry is still figuring out its standards. But by answering the questions above you should be able to narrow down your choices quickly. Best of luck and, if you’d like to continue the conversation, feel free to reach out!
