
Data Labeling Approaches, Challenges, Tools

Data Labeling Approaches, challenges, tools in ML | by Galliot

As AI systems become more complex, data labeling has become an unavoidable stage in developing most AI and machine learning applications. But what is data labeling, and how is it done?

Published date: 11-14-2022

Note:
This article assumes familiarity with the basics of machine learning. If you are not familiar with them, you can learn them from the resources below:
1) Basic Concepts in Machine Learning by Jason Brownlee
2) Supervised vs. Unsupervised Learning by IBM Cloud
3) Supervised Learning by IBM Cloud Education
4) Machine Learning Fundamentals by Javaid Nabi – Medium

1. Introduction

Let’s begin with some everyday use cases of AI. You have almost certainly used at least one of the smart digital assistants, such as Siri, Google Assistant, or Alexa. You can simply speak your commands, and the assistant carries them out for you. When you tell Siri to “play some music,” it filters out the background noise, interprets your command, and executes it.

Another impressive use case of AI is self-driving cars. Using AI, these cars can identify pedestrians, objects, and other vehicles and read road signs in real time, so they can control the vehicle without any intervention from the passenger.

A large share of AI problems is addressed with supervised machine learning, so most of these use cases would not be feasible without labeled data.

Many ML engineers and data scientists focus only on optimizing algorithms to increase their model’s performance, even though data labeling has a critical impact on building a high-performance machine learning model at industrial scale. They might assume that labeled data is easily available; however, gathering and labeling data for supervised learning algorithms are among the most laborious tasks in large industrial or production-level projects. This is a crucial phase that needs intelligent decision-making and management, and poor planning can cause the project to fail and impose high financial and time costs on the company.

In this article, we aim to talk about data labeling, its challenges, methodologies, and intelligent approaches that can ease and speed up this process.

Data Labeling Complexities

When working on mission-critical applications such as self-driving cars or medical diagnosis models, you should create 1) precise labels with minimal error and 2) a dataset that covers a wide range of conditions and cases so the model can generalize.

Imagine you are building a human detection model for self-driving cars. In this case, you should make sure you have labeled data for different lighting (day or night), weather conditions, camera angles, etc. Otherwise, a missed or false detection can even cause loss of life.
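One simple way to check this kind of coverage is to count the labeled samples for each combination of conditions. The sketch below is only an illustration: the `lighting` and `weather` metadata fields, their values, and the sample file names are our assumptions, not part of any specific tool.

```python
from collections import Counter
from itertools import product

# Hypothetical condition values that the dataset should cover.
LIGHTING = ["day", "night"]
WEATHER = ["clear", "rain"]

def missing_conditions(samples, min_count=1):
    """Return (lighting, weather) combinations with fewer than min_count samples."""
    counts = Counter((s["lighting"], s["weather"]) for s in samples)
    return [combo for combo in product(LIGHTING, WEATHER) if counts[combo] < min_count]

# Illustrative labeled samples with condition metadata.
samples = [
    {"image": "img_001.jpg", "lighting": "day", "weather": "clear"},
    {"image": "img_002.jpg", "lighting": "day", "weather": "rain"},
    {"image": "img_003.jpg", "lighting": "night", "weather": "clear"},
]

gaps = missing_conditions(samples)
print(gaps)  # [('night', 'rain')]
```

Here the check reports that no sample was taken at night in the rain, so that condition needs more data before training.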

In medical use cases such as labeling X-ray images, you cannot rely on regular labelers for the job. Instead, you should employ experts who can precisely determine the labels for the target disease. Finding a workforce with this kind of specialty, and paying for it, increases the complexity and cost of the process.

2. Data Labeling Approaches

There are several approaches for solving your data labeling (also known as data annotation) problems. Since each method has its own advantages and disadvantages, companies should undergo a detailed assessment phase using various factors to determine the best labeling approach based on their project’s scope, size, and duration.

Here we have mentioned some of the most common methods for labeling data with a short explanation of each one:

2.1. In-house Labeling

In this approach, companies build a data labeling team from their own data scientists and ML engineers. Internal data scientists and experts know the problem and its complexities and hence understand the use case of their dataset. So, their labels (created datasets) are more suitable for the project compared to other labeling approaches. Consequently, companies use their internal data scientists to increase the quality and accuracy of labeling and make it easier to track. However, this method is time-consuming and requires extensive resources. Plus, it ties up expensive employees such as data scientists and ML engineers with simpler tasks. Whether this method is suitable therefore depends on the project’s size, duration, and available infrastructure. Figure 1 shows an approximate diagram of this trade-off.


Figure 1. Project size and in-house data labeling cost trade-off. Outsourced labeling is more beneficial in the zone between the dotted lines.

2.2. Outsourcing

In this practice, companies hire outside teams that are specially trained for manual data labeling. If you have a temporary project that does not require constant improvement, this method can be efficient for you. Gathering and managing a team of freelancers yourself is challenging and time-consuming, so companies like Scale AI or AILabelers.com can handle hiring and managing the labelers for you. Moreover, most freelancing platforms offer data labeling teams with pre-vetted staff and tools for data labeling.

Outsourcing is a good way for companies to save money and time compared to in-house labeling. Outside labeling teams are well suited to projects that need large amounts of data labeled in a short time.

Still, this method has its cons; for instance, your ML engineers have much less control over the labeling workflow than with internal labeling. Note that this is not always true: with a “hybrid internal and external workforce,” you can have more control over the process. In this scenario, you keep your own tools and workflow and hire outsourced resources to do the manual job.

2.3. Automated Labeling

Currently, there are various AI-powered data annotation tools that can automatically annotate and prepare the data for further labeling. These tools usually use pre-trained AI models to detect and annotate the elements of interest in a new dataset. For example, they use a pre-trained object detection model to annotate various objects in the images of a dataset.
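As a rough sketch of this idea, the code below runs a stand-in detector over a few images and keeps only the confident predictions as automatic annotations, sending the rest to human review. The `pretrained_detector` function, its outputs, and the 0.8 threshold are all hypothetical placeholders for a real pre-trained model and a project-specific setting.

```python
def pretrained_detector(image_name):
    """Stand-in for a real pre-trained object detector (hypothetical).

    Returns a list of (label, confidence) pairs for the given image.
    """
    fake_outputs = {
        "street_01.jpg": [("person", 0.94), ("car", 0.58)],
        "street_02.jpg": [("bicycle", 0.81)],
    }
    return fake_outputs.get(image_name, [])

def pre_annotate(image_names, threshold=0.8):
    """Split model predictions into auto-accepted labels and items for human review."""
    accepted, needs_review = [], []
    for name in image_names:
        for label, confidence in pretrained_detector(name):
            target = accepted if confidence >= threshold else needs_review
            target.append((name, label, confidence))
    return accepted, needs_review

accepted, needs_review = pre_annotate(["street_01.jpg", "street_02.jpg"])
print(accepted)      # confident predictions kept as annotations
print(needs_review)  # low-confidence predictions left for a human labeler
```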

Another type of these tools uses AI modules to assist the annotators and ease the labeling process. The AI modules can learn from the annotator’s pattern and start labeling the data autonomously. They can provide significant assistance to annotators and optimize the process.

For example, Snorkel AI offers programmatic labeling and data-centric workflows to accelerate AI development. Snorkel Flow is a data-centric platform for building AI applications to make programmatic labeling accessible and performant.
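To give a feel for what programmatic labeling means, here is a minimal plain-Python sketch. It does not use Snorkel’s API; the labeling functions, the keywords, and the simple majority vote are our own illustrative assumptions. Each function encodes one heuristic and either votes for a label or abstains.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when its rule does not apply

def lf_mentions_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_mentions_broken(text):
    return "complaint" if "broken" in text.lower() else ABSTAIN

def lf_mentions_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_refund, lf_mentions_broken, lf_mentions_thanks]

def weak_label(text):
    """Combine the labeling functions by majority vote, ignoring abstentions."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    # On a tie, Counter keeps the label that was voted first.
    return Counter(votes).most_common(1)[0][0]

for text in ["The screen arrived broken, I want a refund",
             "Thank you, works great",
             "Where is my order?"]:
    print(text, "->", weak_label(text))
```

Labels produced this way are noisy, so they are usually treated as a starting point that still needs quality control.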

Figure 2. Automated (AI-assisted) data labeling workflow. 

2.4. Crowdsourcing

Crowdsourcing is a faster and more cost-efficient approach to your data labeling problem. Similar to outsourcing, there are many crowdsourcing platforms that assign your data labeling tasks to freelancers from around the world. This way, you avoid employing hundreds of temporary workers, investing in annotation tools, and micro-managing the team. Moreover, using a multinational workforce can reduce the risk of bias. However, in this method, quality control is not guaranteed, and you may not achieve consistent results over time.

Scale and Amazon Mechanical Turk are two of the most popular crowdsourcing platforms for data gathering and annotation. There are also some specialized vendors for domain-specific data annotation services. iMerit is one of the platforms offering data labeling services for specific industries.

2.5. Synthetic Data

This approach uses pre-existing datasets to generate data for a new project. Synthetic data is scalable, improves data quality and balance, and significantly reduces the time required for data gathering. On the other hand, you should consider the extensive computing power needed for this task, which can increase the project’s costs.

You can find an example of generating and implementing synthetic data in our Edge Face Mask Detection application. In this project, we have used a synthetic method to generate a labeled face mask dataset.
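The main appeal of synthetic data is that the label is known at generation time. The toy sketch below illustrates only this principle and is not the method used in the project mentioned above: it draws a small square at a random position on an empty image, so the bounding box comes for free. The image size, patch size, and class name are arbitrary assumptions.

```python
import random

def make_synthetic_sample(size=8, patch=2, rng=None):
    """Draw a bright square on an empty image and return the image with its label."""
    if rng is None:
        rng = random.Random()
    top = rng.randrange(size - patch + 1)
    left = rng.randrange(size - patch + 1)
    image = [[0] * size for _ in range(size)]
    for row in range(top, top + patch):
        for col in range(left, left + patch):
            image[row][col] = 1
    # The box is known exactly because we placed the object ourselves.
    label = {"class": "square", "box": (left, top, patch, patch)}  # x, y, width, height
    return image, label

rng = random.Random(0)  # fixed seed so the generated dataset is reproducible
dataset = [make_synthetic_sample(rng=rng) for _ in range(100)]
print(len(dataset), dataset[0][1])
```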

Labeling Method | In-House | Outsourcing | Automated    | Crowdsource | Synthetic
Time Required   | High     | Average     | Low          | Low         | Low
Cost            | High     | Average     | Low          | Low         | Low
Labels Quality  | High     | High        | Average-Low* | Low         | Average
Security**      | High     | Average     | High         | Low         | High
Table 1. Comparing various labeling methods.
* Depending on the type of automation, it can vary.
** Security refers to keeping the data safe and avoiding data leakage.

2.6. Hybrid Labeling Approaches

Hybrid methods can be a combination of all or some of the approaches mentioned above. They can be applied based on the company’s resources and their projects’ duration and goals. A hypothetical pipeline for a hybrid method can be like this:

– Start with an AI Data Labeling tool

– Crowdsource

– External experts

– Internal experts

3. Methodology

Every company might have a different methodology to solve its data labeling problem. Here we present our own experience and that of our friends working in large companies, so what we say in this blog is not the absolute truth. To start the data labeling process, you need to define the ontology of your labels. To do this, you can prepare a handbook that describes your use case and requirements and spells out each label and what it covers.
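An ontology can also be written down in a machine-readable form so that incoming annotations can be checked against it. The following is a minimal sketch under our own assumptions: the class names, descriptions, attributes, and the `validate` helper are illustrative and do not follow the schema of any particular labeling tool.

```python
from dataclasses import dataclass, field

@dataclass
class LabelClass:
    name: str
    description: str
    attributes: dict = field(default_factory=dict)  # attribute name -> allowed values

# Hypothetical ontology for a street-scene detection project.
ONTOLOGY = {
    "person": LabelClass("person", "Any visible human, including partially occluded ones",
                         {"occluded": ["yes", "no"]}),
    "vehicle": LabelClass("vehicle", "Cars, trucks, buses, and motorcycles",
                          {"type": ["car", "truck", "bus", "motorcycle"]}),
}

def validate(annotation):
    """Return a list of problems found in one annotation; an empty list means valid."""
    label_class = ONTOLOGY.get(annotation.get("label"))
    if label_class is None:
        return [f"unknown label: {annotation.get('label')!r}"]
    problems = []
    for attr, value in annotation.get("attributes", {}).items():
        allowed = label_class.attributes.get(attr)
        if allowed is None:
            problems.append(f"unknown attribute: {attr!r}")
        elif value not in allowed:
            problems.append(f"invalid value {value!r} for {attr!r}")
    return problems

print(validate({"label": "person", "attributes": {"occluded": "yes"}}))  # []
print(validate({"label": "vehicle", "attributes": {"type": "boat"}}))
```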

After defining your labels and what you expect of your data, you should follow these steps:

1- Create instructions and a workflow
Prepare a notebook that defines the labels, states the project’s details and expectations, and explains how to use the tool to label data and submit the labels.

2- Prepare infrastructure
Specify where to store the data, e.g., cloud, in-house server, distributed storage, etc. Then prepare the infrastructure required for the chosen data storage.

3- Set up the tools
Select, provide, and set up the labeling tools you are going to work with, such as Labelbox, Label Studio, etc.

4- Labeling
Labelers start labeling the data using the provided tools.

5- Assessment
Using data assessment methods, determine whether the labels have the desired quality.

4. Labeling Challenges and Solutions

Each data labeling method has its cost and challenges for a company. These challenges may vary depending on the company’s approach to this task, but in particular, the following items are some of the most common ones:

1- Cost

2- Time or Speed of Labeling

3- Quality of Labels

Data labeling is an expensive and time-consuming task. There are various ways to speed up the process, but they usually affect the quality and cost of labeling. For example, increasing the number of labelers can save time, but it tends to decrease the consistency of the labels. Furthermore, almost all labeling approaches are prone to human errors. These errors can happen in the manual entry or coding phase, decreasing the quality of labeled data. We should also mention “Bias” in data labeling, which is a serious issue in AI nowadays.

As you can see, these challenges are tightly coupled. So, companies should know about the various solutions and consider their use case and requirements to find the best answer.

5.1. Semi-automated Labeling

This method uses a combined system for labeling the data. It employs a pre-trained model (on a similar dataset) for assisting the human labeler. This model can also learn from the labeler’s decisions, suggest labels, and automate the basic functions of the labeling platform.

Using semi-automated labeling approaches can significantly reduce the project’s time and costs. However, the quality of labels can vary depending on the model’s accuracy and level of automation.

5.2. Multi-layer Labeling

Multi-layer labeling splits the process into three phases. In the first phase, we use pre-trained AI models for automated labeling. Then we hire a less expensive workforce to quality-check and correct the labels produced in the previous step. Finally, we can recruit expensive domain specialists (experts) to do the final labeling round. This way, we can manage our financial and computational resources sensibly.
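One possible way to decide how far each item travels through these layers is to route it by the model’s confidence. The sketch below is a variant under our own assumptions, not a fixed recipe: the two thresholds, the queue names, and the example predictions are hypothetical.

```python
def route(item, auto_threshold=0.9, reviewer_threshold=0.6):
    """Pick the labeling layer for one model prediction based on its confidence."""
    confidence = item["confidence"]
    if confidence >= auto_threshold:
        return "auto"      # keep the model's label as it is
    if confidence >= reviewer_threshold:
        return "reviewer"  # general labelers check and correct it
    return "expert"        # domain specialists make the final call

# Illustrative output of the automated first phase.
predictions = [
    {"id": "scan_01", "label": "normal", "confidence": 0.97},
    {"id": "scan_02", "label": "fracture", "confidence": 0.72},
    {"id": "scan_03", "label": "fracture", "confidence": 0.41},
]

queues = {"auto": [], "reviewer": [], "expert": []}
for item in predictions:
    queues[route(item)].append(item["id"])

print(queues)  # {'auto': ['scan_01'], 'reviewer': ['scan_02'], 'expert': ['scan_03']}
```

Lowering the thresholds saves money but sends fewer items to the more careful layers, so they should be tuned against measured label quality.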

5.3. Active Learning

Active learning is a machine learning method, often grouped with semi-supervised learning, that helps us label a dataset with less manual effort. In this procedure, we only need to label a small and important subset of the available data manually. Using this manually labeled data, we train an AI model and run it on the remaining data to infer labels automatically. Along with the labels, the model returns another subset of the data: usually the samples that were harder to predict or about which the model was more uncertain. This gives us another valuable subset to label manually. Combining the two labeled subsets, we can go for another round of training and labeling.

Moreover, while re-labeling the returned subset, we can check whether the model’s predicted labels match the actual values. This way, we can assess the model’s performance in each iteration. As a result, by manually labeling only a small portion of the data, we can obtain high-quality labels after a few iterations.

Figure 3. Active Learning method for data labeling
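The loop described in this section can be sketched with a deliberately tiny example. Everything here is an assumption made for illustration: the data is one-dimensional, the `oracle` function stands in for the human labeler, and the "model" is just a threshold halfway between the two class means. Uncertainty is measured as closeness to that threshold.

```python
import random

def oracle(x):
    """Stand-in for the human labeler (hypothetical ground truth: x >= 0.5)."""
    return int(x >= 0.5)

def fit_threshold(labeled):
    """Train a toy 1-D classifier: the midpoint between the two class means."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def active_learning(pool, seed, rounds=5, batch=3):
    labeled = [(x, oracle(x)) for x in seed]
    unlabeled = [x for x in pool if x not in seed]
    for _ in range(rounds):
        threshold = fit_threshold(labeled)
        # The most uncertain samples are the ones closest to the decision boundary.
        unlabeled.sort(key=lambda x: abs(x - threshold))
        queried, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled += [(x, oracle(x)) for x in queried]  # manual labeling step
    threshold = fit_threshold(labeled)
    inferred = [(x, int(x >= threshold)) for x in unlabeled]  # automatic labels
    return labeled, inferred, threshold

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]
seed = [min(pool), max(pool)]  # one sample from each end so both classes are present
labeled, inferred, threshold = active_learning(pool, seed)

accuracy = sum(y == oracle(x) for x, y in inferred) / len(inferred)
print(len(labeled), "manual labels,", len(inferred), "inferred, accuracy", round(accuracy, 3))
```

In a real project the oracle is a person, so the accuracy of the inferred labels can only be estimated on the subset that gets re-labeled in each round.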

5.4. Quality Control

To ensure our data quality, we need to set up a series of measurements to assess and improve the quality of our labels. There are different methods for this. For example, we can give the same data to multiple labelers and aggregate their results. We can also mix in data whose correct labels we already know and compare the labelers’ results with these known labels to measure label quality.
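Both checks are easy to express in code. The sketch below is a minimal illustration with made-up item names and votes: `majority_vote` aggregates the labels of several labelers and reports how strongly they agree, and `gold_accuracy` scores one labeler against items with known answers.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label and the share of labelers who chose it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def gold_accuracy(submitted, gold):
    """Share of known-answer (gold) items that a labeler got right."""
    checked = [item for item in gold if item in submitted]
    if not checked:
        return None
    correct = sum(submitted[item] == gold[item] for item in checked)
    return correct / len(checked)

# The same items labeled by three different labelers (illustrative data).
votes_per_item = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
}
consensus = {item: majority_vote(votes) for item, votes in votes_per_item.items()}
print(consensus["img_2"])  # ('dog', 1.0)

# Items with known labels mixed into one labeler's queue.
gold = {"img_7": "cat", "img_8": "dog"}
labeler_a = {"img_7": "cat", "img_8": "cat", "img_9": "dog"}
score = gold_accuracy(labeler_a, gold)
print(score)  # 0.5
```

Items with low agreement are good candidates for a second look, and labelers with a low gold score may need clearer instructions.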

6. Data Labeling Tools

If you have decided to roll up your sleeves and start in-house labeling, there are several tools on the market to help you in this process. Considering your project’s size, type, and requirements, you should compare and choose between these data labeling (annotation) platforms. Below are a few tips based on our experience that might help you select a more suitable data labeling tool.

You should know that most labeling tools offer simple drawing features and a user-friendly graphical interface. However, your project might depend on a more complicated type of annotation. For example, you may need a segmentation tool for pathological images.

Almost all of these tools accept image and video data. So if your data is of another kind, such as audio, text, or time series, you must check that the tool supports it.

In production, you will need tools that can assess your model so you can find the deficiencies of your model. Visualization tools can help you see your model’s behavior for evaluations like confidence assessment. For example, you can find out if your model does not work well in images taken at night.

If you search the web for data labeling / annotation tools, you will find several websites comparing these tools. Here are some good examples of these tools you can use in your projects:

As we mentioned earlier, Labelbox is one of the most comprehensive data labeling platforms with several features for various use cases. Amazon SageMaker Ground Truth offers a state-of-the-art data labeling service that helps you find raw data, add labels, and implement them in your ML model. You can also find Label Studio’s open-source data labeling tool very useful for your projects.

Below is a list of some other popular data labeling tools:
LabelMe
Roboflow
LionBridge AI
Amazon Mechanical Turk
CVAT by Intel
VoTT
Dataturks
Playment
Clarifai
Datasaur

7. Beyond Data Labeling – Enabling Trustworthy AI Models with Better Data

7.1. Error Analysis

Many of us think that data labeling is all about indicating certain elements in given data for a specific task. For example, annotating individuals’ location in an image for a person detection task. However, working at the production level, we need to perform an error analysis phase after training a model. The error analysis specifies our model performance in different situations. For instance, in an object detection problem, these situations are different camera angles, different image lighting (day or night), the distance of objects from the camera, etc. Adding these parameters to our labels as metadata will significantly help us in the model’s error analysis.
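As a small illustration of why this metadata helps, the sketch below groups predictions by one metadata field and reports the accuracy per group. The record layout, the `lighting` field, and the example values are assumptions made for this example.

```python
from collections import defaultdict

def accuracy_by_slice(records, key):
    """Compute accuracy separately for each value of one metadata field."""
    totals, correct = defaultdict(int), defaultdict(int)
    for record in records:
        group = record["meta"][key]
        totals[group] += 1
        correct[group] += int(record["predicted"] == record["actual"])
    return {group: correct[group] / totals[group] for group in totals}

# Illustrative evaluation records with condition metadata attached to each label.
records = [
    {"predicted": "person", "actual": "person", "meta": {"lighting": "day"}},
    {"predicted": "person", "actual": "person", "meta": {"lighting": "day"}},
    {"predicted": "none", "actual": "person", "meta": {"lighting": "night"}},
    {"predicted": "person", "actual": "person", "meta": {"lighting": "night"}},
]

report = accuracy_by_slice(records, "lighting")
print(report)  # {'day': 1.0, 'night': 0.5}
```

A single overall accuracy of 0.75 would hide the fact that all the errors in this toy example come from night images.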

7.2. Model Card

By considering these extra parameters (metadata / labels), we will know more clearly the situations in which our model has a good or bad performance. Therefore, we can prepare and enclose a Model Card when shipping our model to increase its transparency. Model Cards are something like a drug prescription, including the instructions for using the model, explaining its limitations, and the best and worst situations of using it.

Figure 4. Google Model Cards for increasing model transparency (source: google.com)
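A model card can start as a simple structured record that is filled in from the error analysis. The sketch below is our own minimal layout, not Google’s Model Card schema; the field names and all values, including the performance numbers, are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    name: str
    intended_use: str
    limitations: list = field(default_factory=list)
    performance_by_condition: dict = field(default_factory=dict)

    def to_text(self):
        """Render the card as plain text that can be shipped with the model."""
        lines = [f"Model: {self.name}", f"Intended use: {self.intended_use}", "Limitations:"]
        lines += [f"  - {item}" for item in self.limitations]
        lines.append("Performance by condition:")
        lines += [f"  - {condition}: {score:.2f}"
                  for condition, score in self.performance_by_condition.items()]
        return "\n".join(lines)

card = ModelCard(
    name="person-detector (placeholder)",
    intended_use="Detecting pedestrians in street-level camera images",
    limitations=["Not evaluated on thermal or infrared images"],
    performance_by_condition={"day": 1.0, "night": 0.5},  # placeholder numbers
)
print(card.to_text())
```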

7.3. Datasheet for Datasets

In addition to Model Cards that increase the transparency of the model, there is a similar concept for datasets called a “Datasheet”. Using datasheets, we create comprehensive documentation of the steps and process of creating a dataset. This way, the dataset’s creators help the consumers know if it is the right choice for their use case, uncover any unintentional source of bias, etc.

Consequently, to increase transparency in your work, we suggest you consider the points mentioned above for better communication with your colleagues or external customers.

8. Conclusion

Data labeling has become an unavoidable stage in the development of most AI and machine learning applications. Developing AI (especially Deep Learning) algorithms needs an adequate amount of data, and, provided the labels are of good quality, more data generally makes the model more accurate.

In this article, we introduced the basic concepts related to different data labeling approaches and briefly explained the principles and limitations of each labeling method. We discussed the cost, time, and quality challenges of labeling and introduced some available solutions for these challenges. We also listed popular data labeling tools, providing a reference for readers who are interested in this field.

If you are interested in adopting AI in your business, contact us at hello@galliot.us for consultation. 
