Natural Language Processing (NLP) helps Artificial Intelligence (AI) see language from a human perspective. Human conversation carries many different meanings depending on its flow and context. The way we humans speak is known as ‘natural language’. For a computer, the concepts of nuance and subtlety are alien. In order to understand natural language and ‘sound’ like a human, AI depends on NLP to make sense of the statements presented. (See also: How is NLP different from AI)
There are many practical applications of NLP, ranging from resume parsing for hiring to Interactive Voice Response (IVR) systems. NLP is also commonly used in chatbots, virtual assistants, and modern spam detection. But NLP isn’t perfect. Although there are over 7000 languages spoken around the globe, most NLP processes only use seven languages: English, Chinese, Urdu, Farsi, Arabic, French, and Spanish.
Even amongst these seven languages, the biggest advances in NLP technology have been in English-based systems. For instance, optical character recognition (OCR) is still limited for non-English languages. The translators in our phones have quite limited accuracy, as most languages are often translated word for word without taking the meaning of the statement or its grammar into account. The most accurate translations tend to involve these seven languages.
To understand the bias between languages in NLP, one must first understand how an NLP system learns a language. Building an NLP system usually starts with gathering and labeling data. A large bank of information is usually required, as the system needs that data both for training and for testing its algorithms.
If one is working with a language that has a smaller data bank, the data must have a strong set of patterns and fewer variables in order for the system to learn all the possibilities from limited information. For smaller datasets, augmentation tools can bulk up the data: synonym replacement swaps common words for words with similar meanings, and back-translation (translating a sentence into another language and back again) creates similarly phrased sentences.
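As a rough illustration of synonym replacement, the sketch below swaps known words for synonyms to produce extra training sentences. The `SYNONYMS` table and the function names are hypothetical and hand-written for this example; a real project would draw synonyms from a thesaurus for the target language. Back-translation is not shown because it needs a translation model.

```python
import random

# Hypothetical, hand-written synonym table (an assumption for this sketch).
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
}

def synonym_replace(sentence, synonyms, rng, prob=0.5):
    """Return a copy of `sentence` with some known words swapped for synonyms."""
    out = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        # Replace a word only if it has synonyms and the coin flip succeeds.
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

def augment(sentences, synonyms, copies=2, seed=0):
    """Return the original sentences plus `copies` variants of each one."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    augmented = list(sentences)
    for sentence in sentences:
        for _ in range(copies):
            augmented.append(synonym_replace(sentence, synonyms, rng))
    return augmented
```

Calling `augment(["the quick dog is happy"], SYNONYMS)` returns the original sentence followed by two variants. Some variants may be identical to the original, so deduplicating the result is a reasonable follow-up step.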
Languages are always evolving, so it is important that the available dataset is updated regularly. When the system is dealing with a non-English language like Chinese that uses special characters, proper Unicode normalization is typically required. Normalization replaces equivalent sequences of characters so that any two texts that are equivalent are reduced to the same sequence of code points. The text can then be encoded into a consistent binary form that NLP systems can recognize, reducing the chance of processing errors.
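A minimal sketch of this step, using Python's standard `unicodedata` module. The helper name `normalize_text` and the choice of NFC as the default form are assumptions for illustration; the article does not say which normalization form a given system should use.

```python
import unicodedata

def normalize_text(text, form="NFC"):
    """Reduce equivalent character sequences to one canonical sequence."""
    return unicodedata.normalize(form, text)

composed = "\u00e9"      # "é" stored as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings look identical on screen but compare as different...
looks_same_but_differs = composed != decomposed

# ...until both are normalized to the same sequence of code points.
equal_after_nfc = normalize_text(composed) == normalize_text(decomposed)

# NFKC additionally folds compatibility variants, such as the full-width
# Latin letters often found in Chinese and Japanese text.
fullwidth = "\uff21\uff22\uff23"           # full-width "ABC"
folded = normalize_text(fullwidth, "NFKC")  # plain ASCII "ABC"

# Once normalized, the text encodes to one consistent byte sequence.
encoded = normalize_text(decomposed).encode("utf-8")
```

NFC keeps the text visually unchanged, while NFKC also rewrites compatibility characters, which can be too aggressive for some tasks. Which form suits a dataset is a judgment call.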
Unicode normalization isn’t ideal for all languages, especially those that rely heavily on accentuation. For instance, in Japanese, a minor accent change can turn a positive word into a negative one. Therefore, these details must be manually encoded to ensure a strong dataset.
The final step after gathering the data bank for the NLP system is dividing the dataset into training and testing splits. The data is then sent through the machine learning process of feature engineering, modelling, evaluation, and refinement.
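The split itself can be sketched in a few lines of standard-library Python. The function name `train_test_split`, the 80/20 ratio, and the fixed seed are assumptions for illustration; libraries such as scikit-learn offer a more complete utility for the same job.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle `examples` and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)              # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled examples: (text, label) pairs.
data = [(f"sentence {i}", i % 2) for i in range(10)]
train, test = train_test_split(data)
```

Shuffling before splitting keeps the test set from being dominated by whatever happened to be collected last. The test examples are held back so that evaluation measures how the model handles text it has not seen.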
This is where the problem lies for languages that are not very globalized. Building a data bank for them often requires augmentation work, and cleaning and updating the data afterwards takes a monumental effort. With a limited dataset, building an entire NLP system for localized languages would not be profitable at this stage. To build any aspect of AI, the information gathered must be immense in order to be successful.
Having only a handful of languages that can be successfully used for NLP systems creates several issues. Imagine being unable to use technology just because the language one speaks and writes isn’t global. This disparity carries up the technological chain, not to mention the psychological impact it has on many communities. If NLP continues to develop without incorporating a diverse range of languages, it will become increasingly difficult to introduce those languages at later stages. This could mean losing a huge chunk of global languages.
The current assumption in NLP is that when we talk about ‘natural language’, the researcher is working on an English database. This mold must be broken, and awareness must be spread about international systems and their development. However, spreading the word alone isn’t enough. Action must be taken to further the cause.
To introduce more languages in the NLP data bank, it is important to consider the size of the dataset. If one is in the process of creating a new dataset, a significant portion of the budget must be applied to creating a dataset in another language. Moreover, additional research in optimizing current cleaning and annotation programs in other languages is also essential to broadening NLP systems around the globe.