We founded Datasaur to build the most powerful data labeling platform in the industry. We have spoken with 100+ machine learning teams around the world and compiled our learnings into the comprehensive guide below.
Table of Contents
— An Introduction to Machine Learning and Training Data
— Basic Task Types in NLP
— Raw Data
— Labeling Operations
— Labeling Tools
— Best Practices
An Introduction to Machine Learning and Training Data
Machine Learning (ML) has made significant strides in the last decade. This can be attributed to parallel improvements in processing power and new breakthroughs in deep learning research. Another key contributor is the abundance of data that has been accumulated. Analysts estimate humankind sits atop 44 zettabytes of information today. The newly released GPT-3 by OpenAI was trained on a corpus of roughly 500 billion tokens, amounting to hundreds of gigabytes of internet text! These algorithms have advanced at a phenomenal rate and their appetite for training data has kept pace.
Methods of feeding data into algorithms can take multiple forms. Unsupervised learning takes large amounts of data and identifies its own patterns in order to make predictions for similar situations. It has been applied to large, unstructured datasets such as stock market behavior or Netflix show recommendations. This article will focus on supervised learning, in which humans apply their own set of labels to data so that a model can learn to understand and classify other data. Supervised learning typically requires less data and can be more accurate, but the data must first be labeled. The dataset, along with its associated labels, is referred to as ground truth. We will cover common supervised learning use cases below.
Additionally, data itself can be classified under at least 4 overarching formats: text, audio, images and video. While there are interesting applications for all types of data, we will home in on text data to discuss a field called Natural Language Processing (NLP).
Business Use Cases for NLP
ML adoption has been on the rise over the past decade, but I believe NLP is particularly well-suited for immediate adoption in a broad range of industries. Customers use Datasaur for summarizing millions of academic articles and identifying patterns in COVID-related research. Others rely on NLP models in the fight against misinformation to scan through every article uploaded to the internet and flag suspicious articles for human review. NLP can also support recurring business tasks such as sorting through customer support requests or product reviews. Given humanity’s reliance on language as our primary form of communication, I firmly believe NLP will soon become ubiquitous in augmenting our everyday lives.
Basic Task Types in NLP
There is a broad spectrum of use cases for NLP. One common use case is to understand the core meaning of a sentence or text corpus by identifying and extracting key entities. This sub-branch is commonly referred to as Named Entity Recognition or Named Entity Extraction. In the following example:
Big Bird can be identified as a character, while the porch might be labeled as a location. With enough examples, a model may be able to start recognizing other sentences following the same pattern, such as “Elmo sits on the porch” or “Cookie Monster stands on the street”. Extrapolating beyond this toy example, companies around the world are able to use this methodology to read a doctor’s notes and understand what medical procedures were performed; an algorithm can read a business contract and understand the parties involved and how much money changed hands.
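One way to picture what a labeled example looks like for this task is as a list of character-offset spans over the sentence. The sketch below uses one of the sentences mentioned above; the label names `CHARACTER` and `LOCATION`, the helper functions, and the conversion to per-token BIO tags (a common convention, not something this guide prescribes) are all illustrative assumptions.

```python
from typing import List, Tuple

# A labeled span: (start offset, end offset, label). Offsets are end-exclusive.
Span = Tuple[int, int, str]

def find_span(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of phrase and return it as a labeled span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Convert character-offset spans into one BIO tag per whitespace token."""
    tagged = []
    cursor = 0
    for token in text.split():
        start = text.index(token, cursor)
        end = start + len(token)
        cursor = end
        tag = "O"  # default: token is outside every labeled span
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                # First token of a span gets B-, later tokens get I-
                tag = ("B-" if start == span_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged

sentence = "Elmo sits on the porch"
spans = [
    find_span(sentence, "Elmo", "CHARACTER"),     # hypothetical label name
    find_span(sentence, "the porch", "LOCATION"), # hypothetical label name
]
bio_tags = to_bio(sentence, spans)
for token, tag in bio_tags:
    print(f"{token}\t{tag}")
```

Storing offsets rather than copies of the words keeps the labels tied to an exact position in the raw text, which matters when the same word appears more than once.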
Another popular area for NLP is sentiment analysis. This allows algorithms to understand the tone of a sentence. In the following example, we can train a binary classifier to understand whether a sentence is positive or negative.
More advanced classifiers can be trained beyond the binary on a full spectrum, differentiating between phenomenal, good, and mediocre. Sentiment analysis has been applied to inputs as varied as product reviews on shopping sites, social media posts about a political candidate, and customer experience surveys. Generalizing sentiment analysis further, a field called document labeling allows us to categorize entire documents: a support email about login issues can be classified separately from an email about product availability, allowing a business to route each request to the appropriate department.
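To make the routing idea concrete, here is a minimal sketch of a document classifier trained on labeled emails. It uses a tiny Naive Bayes model with add-one smoothing, chosen only because it fits in a few lines; the training emails, the label names `login` and `availability`, and the function names are all made up for illustration, and a real project would need far more labeled data.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real pipelines need more care."""
    return text.lower().split()

def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled support emails
training_data = [
    ("I cannot log in to my account", "login"),
    ("password reset link does not work", "login"),
    ("is the blue jacket back in stock", "availability"),
    ("when will this item be available again", "availability"),
]
model = train(training_data)
print(classify("I forgot my password and cannot log in", model))
print(classify("is this jacket available", model))
```

The labels attached to the training emails are exactly the ground truth a labeling team produces; the model only generalizes from them.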
Other, more advanced tasks in NLP include coreference resolution, dependency parsing, and syntax trees, which allow us to break down the structure of a sentence in order to better deal with ambiguities in human language.
Interpretation 1: Ernie is on the phone with his friend and says hello
Interpretation 2: Ernie sees his friend who is on the phone, and says hello
Finally, it is possible to blend the tasks above, highlighting individual words as the reason for a document label.
Raw Data
While many of the toy examples above may seem clear and obvious, labeling is not always so straightforward. You will need to start with 2 key ingredients: data and a label set.
Some companies may have to begin by finding appropriate data sources. Many academics have scraped sites like Wikipedia, Twitter and Reddit to find real-world examples. Public dataset sources such as Kaggle, Project Gutenberg, and Stanford’s DeepDive may be good places to start.
Thanks to the era of Big Data and advances in cloud computing, many companies already have large amounts of data. Oftentimes this data will be referred to as unstructured data, or raw data. However, this data often needs to be processed and cleaned before it is ready to be labeled. For example, when presenting data to your labeler, how would you like to determine where one sentence begins and another ends? How are semicolons treated? Make sure you don’t accidentally treat the ‘.’ at the end of “Mrs.” as an end-of-sentence delimiter! Data may also be missing or misspelled. In certain industries like healthcare and finance, it is important or even legally required to remove personally identifiable information (PII) before the data is presented to labelers.
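The two cleaning steps described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the abbreviation list is a short, incomplete sample; semicolons are deliberately left inside sentences; and the two regular expressions only catch simple email addresses and one common phone number layout. Real PII removal needs far broader coverage and review.

```python
import re

# Assumed, incomplete list of abbreviations that end in a period
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof."}

def split_sentences(text):
    """Split on tokens ending in . ! or ? unless the token is an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

# Simplified patterns; they will miss many real-world formats
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text):
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(split_sentences("Mrs. Smith called today. She asked for a refund!"))
print(redact_pii("Reach me at jane.doe@example.com or 555-123-4567."))
```

Replacing PII with a placeholder token rather than deleting it keeps the sentence structure intact for the labeler.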
Once you have identified your training data, the next big decision is in determining how you’d like to label that data. The labels to be applied can lead to completely different algorithms. One team browsing a dataset of receipts may want to focus on the prices of individual items over time and use this to predict future prices. Another may be focused on identifying the store, date and timestamp and understanding purchase patterns.
Practitioners will refer to the taxonomy of a label set. What level of granularity is required for this task? Is it enough to understand that a customer is sending in a customer complaint and route the email to the customer support team? Or would you like to specifically understand which product the customer is complaining about? Or even more specifically, whether they are asking for an exchange/refund, complaining of a defect, an issue in shipping, etc.? Note that the more granular the taxonomy you choose, the more training data will be required for the algorithm to adequately train on each individual label; phrased differently, each label requires a sufficient number of examples, so more labels means more labeled data overall.
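The trade-off between granularity and data volume can be checked directly by counting examples per label. In the sketch below, the label names, the `parent/child` naming scheme, and the minimum of 3 examples per label are hypothetical; an appropriate threshold depends on the model and the task.

```python
from collections import Counter

MIN_EXAMPLES_PER_LABEL = 3  # hypothetical threshold for illustration only

def underrepresented(labels, minimum=MIN_EXAMPLES_PER_LABEL):
    """Return the labels that have fewer than `minimum` examples."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < minimum)

# Six emails labeled with a granular taxonomy (hypothetical label names)
fine_labels = [
    "complaint/refund",
    "complaint/defect",
    "complaint/shipping",
    "complaint/refund",
    "complaint/refund",
    "complaint/defect",
]
# The same six emails under a coarser taxonomy: keep only the parent label
coarse_labels = [label.split("/")[0] for label in fine_labels]

print(underrepresented(fine_labels))
print(underrepresented(coarse_labels))
```

The same six emails are enough for the single coarse label but leave two of the three granular labels short, which is the effect described above.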
Labeling Operations
Okay, we’ve established the raison d’être for labeled data. How do we actually start?
Many data scientists and students begin by labeling the data themselves. This has the advantage of keeping you close to the data. You may label 100 examples and decide you need to refine your taxonomy, adding or removing labels. You also fully control your own data quality.
In order to scale to the large number of labels often required to train algorithms and to save time, companies may choose to hire a professional service. The choice in labeling service can make a big difference in the quality of your training data, the amount of time required and the amount of money you need to spend.
Crowd-sourced labeling services
Amazon Mechanical Turk was established in 2005 as a way to outsource simple tasks to a distributed “crowd” of humans around the world. Since the ascent of AI, we have also seen a rise in companies specializing in crowd-sourced services for data labeling. Some of the top companies include Appen, Scale, Samasource, and iMerit. For a fee, these companies will take your data and set up a labeling task on their platforms. Labelers around the world registered with their service can label your data.
The advantages to using these companies include elastic scalability and efficiency. Due to the number of labelers on their platform, they can frequently finish labeling your data faster than any other option. They will also bring expertise to the job, advising you on how to validate data quality or suggesting how to spot check the quality of work to ensure it is up to your standards.
Disadvantages include higher price, higher variance in data quality and the potential for data leaks. The companies will often charge a sizable margin on the data labeling services and require a minimum threshold on the number of labels applied. Fully crowd-sourced solutions can also suffer from labelers who game the system and create fake accounts. We have seen data leaks publicly embarrass companies such as Facebook, Amazon and Apple as the data falls into the hands of strangers around the world.
A separate but related class of labeling companies includes CloudFactory and DataPure. Their labelers are employed full-time and fully trained. This improves quality, but at a higher cost.
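A spot check of outsourced work can be as simple as drawing a random sample, having a trusted reviewer relabel it, and measuring how often the two agree. The sketch below is one possible way to do this; the function names, document IDs, labels, and sample size are hypothetical, and what counts as an acceptable agreement rate is a project decision.

```python
import random

def draw_sample(item_ids, sample_size, seed=0):
    """Pick a reproducible random sample of item IDs for review."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return rng.sample(sorted(item_ids), sample_size)

def agreement_rate(vendor_labels, reviewer_labels, sampled_ids):
    """Fraction of sampled items where vendor and reviewer labels match."""
    matches = sum(
        vendor_labels[item] == reviewer_labels[item] for item in sampled_ids
    )
    return matches / len(sampled_ids)

# Hypothetical labels returned by a labeling service
vendor_labels = {f"doc{i}": "positive" for i in range(20)}

# A trusted reviewer relabels only the sampled items
sampled_ids = draw_sample(vendor_labels, sample_size=5, seed=7)
reviewer_labels = {item: "positive" for item in sampled_ids}
reviewer_labels[sampled_ids[0]] = "negative"  # reviewer disagrees on one item

print(agreement_rate(vendor_labels, reviewer_labels, sampled_ids))
```

Sampling at random, rather than reviewing the first few items, avoids checking only the work a labeler did while they were most attentive.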
Bringing labeling in-house
In response to the challenges above, some companies choose to hire labelers in-house. This offers greater control over access to the data and over the quality of the output. However, this choice comes with its own disadvantages. Sometimes models need to be trained in time to meet a business deadline. It is possible to outsource 500,000 labels in 2 weeks to a professional labeling service, but such capacity is difficult to build out internally; labeling projects can be seasonal, and it is difficult to maintain an appropriately sized workforce at all times. In-house teams require significantly more planning and may force compromises in project timelines. Building out operational services requires a new set of skills that don’t always coincide with the company’s expertise.
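A back-of-the-envelope calculation shows why that capacity is hard to build internally. The throughput figures below (100 labels per labeler per hour, 8 working hours a day, 10 working days in 2 weeks) are assumptions for illustration, not measurements; actual speed varies widely by task.

```python
import math

def labelers_needed(total_labels, working_days, labels_per_hour, hours_per_day):
    """Estimate the headcount required to finish a labeling job on time."""
    labels_per_labeler = working_days * labels_per_hour * hours_per_day
    # Round up: a fraction of a labeler still means hiring one more person
    return math.ceil(total_labels / labels_per_labeler)

headcount = labelers_needed(
    total_labels=500_000,
    working_days=10,      # assumed: 2 weeks of weekdays
    labels_per_hour=100,  # assumed throughput per labeler
    hours_per_day=8,      # assumed working day
)
print(headcount)
```

Under these assumptions each labeler produces 8,000 labels in the period, so the job needs 63 people, a team that would sit idle once the project ends.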
So what should I do?
The decision to outsource or to build in-house will depend on each individual situation. I would start by answering the following questions:
Many companies also choose to do a combination of both — using an in-house labeling workforce for recurring or mission-critical jobs, while supplementing sudden bursts of data needs with an outsourced solution.
Labeling Tools
Now that you’ve got your data, your label set and your labelers, how exactly is the sausage made? The young ML industry is still quite varied in its approach.
The most common starting point is an Excel/Google spreadsheet. This interface is serviceable, ubiquitously understood and has a relatively low learning curve. It handles common labeling tasks such as part-of-speech and named entity recognition labeling. The main disadvantage of the spreadsheet is that its interface was not created for this task. It can be error-prone: typos are easy to make, and columns of cells are not the most intuitive way to read a text document. Some types of labeling, such as dependency parsing, are simply not viable using spreadsheets. Most importantly, this approach is not scalable, as your needs will expand to require more advanced interfaces and workforce management solutions.
A standard for more advanced NLP companies is to turn to the open source community. Tools such as brat and WebAnno are popular labeling tools. These were built with labeling in mind, offering a wide array of customizations. They can be freely set up and hosted and handle more advanced NLP tasks such as dependency labeling. The downsides are that the learning curve is higher and some level of training and adjustment is required. Direct customer support can be limited. These tools are also at varying levels of maintenance, as they rely on the open-source community for improvements and bug fixes.
Others still choose to build their own tools in-house. This has the benefit of full integration with your own stack. However, building in-house tools requires an investment of engineering time, not only to set up the initial tool but also to support and maintain it on an ongoing basis.
Commercial tools are also available. These include Prodigy, Snorkel.ai and Datasaur.ai (you can imagine our recommendation ❤️🦕). These companies offer labeling tools at various price points. Similar to the open-source tools, they offer customizability and handle advanced NLP tasks. Other features to consider include team management workflows for your labeling team, labeling performance reports/dashboards, data security and access control, on-premises deployment options and ML-assisted labeling. ML-assisted labeling is a relatively recent development that gives your labelers a head start. Instead of labeling everything from scratch, a model can be plugged in to label relatively common terms, leaving labelers to review and correct its output.
As with many situations, choosing the right tool for the job can make a significant difference in the final output. Considerations should include the intuitiveness of the interface for your particular task. What types of labeling jobs does the tool specialize in? Is there sufficient customizability for your project’s unique needs? Will you be able to organize and prioritize labeling projects from a single interface? What level of support is offered when questions or issues arise? What is your budget allocation? Identify your primary pain points to find the right solution for your job.
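The idea behind ML-assisted labeling can be sketched without any particular product in mind. In the example below, a plain dictionary lookup stands in for the plugged-in model; the term list, label name, and `status` field are hypothetical and do not describe how any of the tools named above work.

```python
import re

# Hypothetical list of common terms standing in for a trained model
KNOWN_TERMS = {
    "Big Bird": "CHARACTER",
    "Elmo": "CHARACTER",
    "Cookie Monster": "CHARACTER",
}

def suggest_labels(text, known_terms=KNOWN_TERMS):
    """Return pre-filled span suggestions for a human labeler to confirm."""
    suggestions = []
    for term, label in known_terms.items():
        pattern = r"\b" + re.escape(term) + r"\b"  # match whole words only
        for match in re.finditer(pattern, text):
            suggestions.append({
                "start": match.start(),
                "end": match.end(),
                "text": term,
                "label": label,
                "status": "suggested",  # a labeler accepts or rejects each one
            })
    return sorted(suggestions, key=lambda s: s["start"])

text = "Cookie Monster stands on the street while Elmo sits on the porch"
suggestions = suggest_labels(text)
for s in suggestions:
    print(s["text"], s["label"], s["start"], s["end"])
```

Marking each span as a suggestion rather than a final label keeps the human in control: the labeler only has to confirm the common cases and can spend their time on the ambiguous ones.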
Best Practices
We’ve interviewed 100+ data science teams around the world to better understand best practices in the industry. Below are 3 of the most common observations:
ML is a “garbage in, garbage out” technology. The effectiveness of the resulting model is directly tied to the input data; data labeling is therefore a critical step in training ML algorithms. Indeed, increasing the quantity and quality of training data can be the most efficient way to improve an algorithm. And with ML’s growing popularity the labeling task is here to stay. As you approach setting up or revisiting your own labeling process, review the following checklist:
There are many options available and the industry is still figuring out its standards. But by answering the questions above you should be able to narrow down your choices quickly. Best of luck, and if you’d like to continue the conversation, feel free to reach out to firstname.lastname@example.org!