Manual Data Labeling for Vision-Based Machine Learning and AI

Project Wayfinder's perception and decision-making software, aimed at enabling autonomous flight, relies heavily on vision-based machine learning (ML) algorithms, which require large amounts of data. Raw data on its own, however, is useless until it has been curated and labeled. Properly labeling data is therefore key to identifying features of interest and encoding their type and attributes into a format that can be readily processed by an ML algorithm. But how do you scale up manual, labor-intensive labeling?

In the first post of our three-part blog series on data labeling, we introduced the notions of data curation and data labeling, and how important the quality of the labels is to establishing a ground truth that will maximize the performance of the ML algorithms at the core of Wayfinder's vision-based perception and decision-making solution for autonomous flight.

The first and most well-known approach to labeling visual data is manual: people are tasked with identifying objects of interest in each image and adding metadata that captures the nature and/or position of these objects. In this post, we will look at how a manual approach can be scaled up, its quality- and scalability-related shortcomings, and the process the Wayfinder team has implemented to control label quality.

A data labeling industry is born

Data labeling to train machine learning models and algorithms has spawned a burgeoning new industry, with some predicting that data labeling for AI is set to become a billion-dollar market by 2023. Companies are increasingly turning toward outsourcing data labeling work to labor forces in China, India, and Malaysia.

Amazon Mechanical Turk is probably one of the most well-known online “crowd labeling” platforms: it takes a data set and distributes it to a myriad of online manual annotators, each paid to label a small chunk of the data. The reward scheme is usually a fixed amount of money per image, although in some cases annotators are paid an hourly rate, with a bonus for quick results or a penalty for delays. All of these solutions follow the same strategy and suffer from the same problem: they scale by prioritizing throughput over label quality. Raising the quality of the data set (e.g. by adding quality control processes) adds execution time at the expense of profitability.

While manual crowdsourced labeling may address the needs of industries dealing with smaller data sets, or those less sensitive to lower label quality, it becomes economically and logistically impractical once the data volume exceeds certain limits and high-quality labels are needed within a realistic time frame. This is equally true when pixel-accurate labels must be generated within a few weeks for a large volume of data (e.g. tens of terabytes): the exact threshold will vary with the type of data and with the volume and accuracy of the labels to be generated.

Certain parameters simply cannot be labeled manually. This is the case for Wayfinder's auto-landing function, which must predict an aircraft’s lateral and vertical deviation and distance vis-a-vis a runway.
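To make this concrete, below is a minimal sketch of the kind of per-frame ground truth such a function needs; the field names are hypothetical and do not reflect Wayfinder's actual schema. None of these values can be drawn by a human annotator; they have to be computed from time-synced aircraft state and surveyed runway geometry.

```python
from dataclasses import dataclass

@dataclass
class RunwayRelativeLabel:
    """Per-frame ground truth for an auto-landing function (hypothetical schema)."""
    timestamp_us: int            # capture time of the frame, in microseconds
    lateral_deviation_m: float   # signed offset from the runway centerline
    vertical_deviation_m: float  # signed offset from the nominal glide path
    distance_to_runway_m: float  # along-track distance to the runway threshold
```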

Controlling manual data labeling quality

The challenge of controlling the quality of manual labels for object detection is easily overlooked, in part because common vision-focused machine learning models have concentrated on detecting relatively large objects in small images. A typical example is the detection of cats and dogs occupying more than 50% of an image of 1 megapixel or less (see the left image in the figure below).

Figure: Example of a common detection task (i.e. a large object in a small image) vs. detection of a potential airborne threat from Wayfinder's collision avoidance system

Loosely adjusted bounding boxes exceeding the object size by 70 pixels in the horizontal and vertical dimensions would result in an error of only ten percent of the true object size (an object covering half of a 1-megapixel image is roughly 700 pixels across). Such an error margin remains manageable and may not translate into a significant decrease in detection performance, provided the sample population is large enough and there are no systematic biases in the background captured within the bounding boxes (e.g. not having a green background for all dogs and a white background for all cats).
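As a quick sanity check on that figure, the arithmetic can be written out directly (a rough sketch with illustrative numbers, assuming a roughly square object and image):

```python
# Back-of-the-envelope check of the error margin above (illustrative numbers).
image_side_px = 1000        # roughly a 1-megapixel image
object_area_fraction = 0.5  # the object covers about half of the image
object_side_px = (object_area_fraction * image_side_px ** 2) ** 0.5  # ~707 px

slack_px = 70               # loose bounding box: 70 extra pixels per dimension
relative_error = slack_px / object_side_px
print(f"object side ~{object_side_px:.0f} px, relative error ~{relative_error:.0%}")  # ~10%
```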

Manual data labeling for autonomous flight

However, the acquisition of a runway six kilometers away from the aircraft, or the detection of a one-meter-wide drone one kilometer away from an onboard camera, is an entirely different problem, as illustrated in the visual comparison above. The camera resolution and resulting image size required for these tasks are typically ten to sixteen times larger than in the example above. In addition, the object to be detected can be as small as 20 by 20 pixels, representing well under 0.01 percent of the total image area. A ten percent error applied to this airborne object detection case corresponds to a bounding box that shouldn't exceed the size of the object by more than 2 pixels in the horizontal and vertical dimensions. Such high precision is extremely difficult to achieve with manual labeling; annotators must zoom into each image until individual pixels are visible and a bounding box can be tightly fitted around the object with an accuracy of a few pixels.
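Running the same numbers for the small-object case shows how tight the tolerance becomes (again a rough sketch; the 12-megapixel sensor is an assumed value within the ten-to-sixteen-fold range above):

```python
# The same margin applied to the airborne-threat case (illustrative numbers).
image_area_px = 12_000_000   # e.g. a 4000 x 3000 image from an assumed 12 MP sensor
object_side_px = 20          # drone as small as 20 x 20 pixels

area_fraction = object_side_px ** 2 / image_area_px
max_slack_px = 0.10 * object_side_px  # ten percent of the object size

print(f"object covers {area_fraction:.4%} of the image")               # ~0.0033%
print(f"max bounding-box slack ~{max_slack_px:.0f} px per dimension")  # ~2 px
```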

The labeling precision required for Wayfinder's applications typically takes 40 to 60 seconds per image per manual annotator, and three passes over the data set, to obtain the right label quality. A data set of 100,000 images therefore represents approximately 1 to 1.5 months of work for a crew of 12 to 18 annotators working around the clock. Aside from the lack of scalability of this approach at this level of data quality, controlling the quality of the labels is itself a challenge. Systematic manual verification is impractical, as it suffers from the same limitations as the initial labeling work (i.e. verifying or annotating data is a long, tedious process that people perform poorly). Batch rejection and relabeling based on statistical sampling also fails to address the problem, even when accounting for the skewed label-error distribution across one or several batches processed sequentially by a single annotator (i.e. the end of a batch, or the last few batches, generally contains many more errors than the first ones). Annotators typically work over long sequences of data; extending their work hours to redo half of their batches will not improve their focus or the quality of the labels.
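For scale, a back-of-the-envelope calculation along these lines lands in the same ballpark; the overhead factor below is an assumption layered on top of the figures quoted above (per-image time, number of passes, crew size):

```python
# Rough labor estimate for the manual labeling effort described above.
num_images = 100_000
seconds_per_image = 50   # midpoint of the 40-60 s range
passes = 3
overhead_factor = 2.0    # assumed multiplier for rework, handover, breaks, coordination

annotator_hours = num_images * seconds_per_image * passes * overhead_factor / 3600
crew_size = 15           # midpoint of the 12-18 annotator crew
days_around_the_clock = annotator_hours / crew_size / 24
print(f"~{annotator_hours:,.0f} annotator-hours, "
      f"~{days_around_the_clock:.0f} days with the crew working 24/7")
```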

Our team has experimented with several different quality control approaches, and the only one that has proven effective for manual labels is the following iterative process (a simplified code sketch follows the list):

  1. The whole data set of images is annotated using a large pool of manual annotators
  2. The quality of every label is evaluated using non-ML computer vision algorithms quantifying how far off each bounding box is from the target object (e.g. using histogram analysis and traditional image segmentation techniques)
  3. A corrected bounding box is auto-generated based on the result of step 2
  4. A note for the manual annotator is auto-generated based on the difference between the corrected bounding box and the one manually drawn initially (e.g. “check if new bounding box too small”)
  5. Step 1 is repeated with the whole data set, providing the auto-corrected bounding box and the note for each image to the annotators for manual verification
  6. Steps 1 to 5 are repeated until the variance between the manual labels and the auto-corrected ones drops below a desired threshold.
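In code, this loop looks roughly like the skeleton below. It is a simplified sketch rather than Wayfinder's implementation: `refine_labels`, `check_box` and `review` are hypothetical names, boxes are plain (x, y, w, h) tuples, and the stopping criterion is a simple mean absolute disagreement standing in for the variance check of step 6.

```python
def refine_labels(images, manual_boxes, check_box, review,
                  variance_threshold, max_rounds=5):
    """Iterative quality control of manual bounding boxes (simplified sketch).

    `check_box(image, box)` stands in for the non-ML computer vision check of
    steps 2-4 (e.g. histogram analysis, classical segmentation) and returns
    (corrected_box, note); `review(image, box, note)` stands in for the
    annotation tool of step 5 that shows the proposal and note to a human
    annotator and returns the box they confirm.
    """
    boxes = list(manual_boxes)
    for _ in range(max_rounds):                                               # repeat steps 1-5
        proposals = [check_box(img, box) for img, box in zip(images, boxes)]  # steps 2-4
        reviewed = [review(img, corrected, note)                              # step 5
                    for img, (corrected, note) in zip(images, proposals)]
        # Step 6: average per-coordinate disagreement between the previous
        # labels and the reviewed, auto-corrected ones.
        disagreement = sum(abs(a - b) for new, old in zip(reviewed, boxes)
                           for a, b in zip(new, old)) / (4 * len(boxes))
        boxes = reviewed
        if disagreement < variance_threshold:
            break
    return boxes
```

The key property of this loop is that the expensive human pass in step 5 is always seeded with an auto-corrected proposal and an explanatory note, so each iteration converges toward agreement instead of starting from scratch.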

Manual versus auto data labeling

As the example above shows, manual labeling does not scale past a certain data volume and makes it difficult to achieve and maintain the accuracy and consistency of the labels required for Wayfinder's perception and decision-making applications for autonomous planes. With an automated process leveraging programmatic labeling and/or machine learning models, labeling can be drastically improved, which we'll cover in the next post of this series. More importantly, manual labeling is simply not an option when precise aircraft parameters such as position and attitude need to be associated with a time-synced image or another data sample.

In our next blog post, we will explore auto-labeling strategies and the associated adoption challenges in relation to the maturity of the ML algorithms.

- Cedric Cocaud