Data Labeling & The Secret Language of Autonomous Flight: Part 1: The Role of Data Labeling in Machine Learning and AI

Building an autonomous aircraft that is efficient, reliable, safe, and capable of functioning at scale with thousands of other airborne autonomous vehicles is no small feat. The effort requires a number of complex technical components, from GPS and inertial navigation systems to a range of sophisticated sensors that include video, radar, and laser rangefinders, to name just a few. It may come as a surprise, then, to learn that perhaps the most important technical process that goes into creating an autonomous vehicle is also a job so simple that you could hire just about any high school student to do it: data labeling.

To make autonomous aircraft a reality, today’s engineers are building the most advanced data collection systems in history, leveraging a combination of real-world and simulated data to ensure their vehicles are able to meet the highest safety standards. Engineers use this data to train the algorithms that power their vehicles, providing those algorithms with millions upon millions of images to analyze so they can slowly begin to recognize the various obstacles and pathways that surround them. However, without labeling, this data becomes effectively useless—an undifferentiated mass of information that is wholly unintelligible to the untrained algorithm.

For Part 1 of our data labeling series, we’ll take a look at the role that data labeling plays in machine learning and artificial intelligence systems writ large.

What is data labeling?

According to TechTarget, data labeling (in the context of machine learning) is simply the process of detecting and tagging data samples via manual techniques, automated techniques, or a combination of the two. This detection and tagging of data samples is a form of “data preprocessing,” a term that refers to any work that must be performed on a set of raw data before it is ready for use. Data labeling is a crucial part of machine learning because it establishes the core classifications that form the basis of the algorithm’s eventual data processing.

For example, imagine a system that is being developed, or “trained,” to recognize simple speech commands: just a few dozen keywords like “yes,” “no,” “up,” “down,” and the single-digit numbers. The developers behind this program might provide that system with a large data set comprising thousands of short audio files, each one containing a recording of a different person saying one of the various keywords. However, if that data set is filled with raw files that haven’t yet been labeled, the system’s AI will have no frame of reference for determining which audio files correspond to which keywords.
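
To make that concrete, here is a minimal sketch of what the labeled version of such a data set might look like. The labels.csv manifest, its column names, and the clip paths are all hypothetical, invented purely for illustration:

```python
import csv

# Hypothetical manifest: each row pairs a raw audio file with the keyword
# a human annotator heard in it. Without this mapping, the recordings are
# just undifferentiated waveforms to the training algorithm.
with open("labels.csv", newline="") as f:
    labeled_samples = [
        {"audio_path": row["file"], "keyword": row["keyword"]}
        for row in csv.DictReader(f)
    ]

print(labeled_samples[0])  # e.g. {"audio_path": "clips/0001.wav", "keyword": "yes"}
```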

Proper data labeling gives the machine learning algorithm the context it needs to begin analyzing the data set and identifying key correlations between the audio files and the corresponding keywords. The same is true for our work in autonomous vehicle design, where accurate labels will help the algorithm distinguish a stop sign from a tree, or differentiate between a car and a pedestrian. Put another way, effective data labeling provides the machine learning algorithm with an accurate representation of what is often referred to as “the ground truth,” i.e., information that can be observed directly in the real world, which the algorithm uses as a foundation for establishing reliable learning patterns.
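
In a supervised system, those labels become the training targets themselves. The sketch below assumes the labeled-sample format from the previous example; the keyword vocabulary and the load_audio_features() stub are stand-ins for a real pipeline, not part of any actual system:

```python
KEYWORDS = ["yes", "no", "up", "down", "zero", "one"]  # illustrative vocabulary
LABEL_TO_INDEX = {kw: i for i, kw in enumerate(KEYWORDS)}

def load_audio_features(path):
    """Stand-in for real audio preprocessing (e.g. spectrogram extraction)."""
    return [0.0] * 128  # placeholder feature vector

def to_training_pair(sample):
    """Turn one labeled sample into a (features, ground-truth target) pair."""
    features = load_audio_features(sample["audio_path"])
    target = LABEL_TO_INDEX[sample["keyword"]]  # class index the learner must predict
    return features, target

features, target = to_training_pair({"audio_path": "clips/0001.wav", "keyword": "yes"})
print(target)  # -> 0
```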

The importance of data quality

With its analysis underway, our example machine learning algorithm can then begin building a predictive model that will allow it to process new, unlabeled voice commands, identify the correct corresponding keywords, and carry out whatever task is requested. Unfortunately, training an algorithm for a task as complex as automatic speech recognition (ASR) isn’t quite so simple as compiling a few audio files and jotting down the corresponding keywords. In fact, it is exceedingly rare that a machine learning algorithm can be trained effectively on a data set that includes only one label category. In the real world, this would be considered a very low-quality data set, and it’s highly unlikely that an algorithm trained on this data would be able to accomplish much of anything.

A high-quality (and more realistic) ASR data set might include labels that distinguish speech files from supplemental data samples containing music or ambient sound, so that the algorithm doesn’t get confused by background noise. It might include labels that differentiate speech from the ambient silence heard at the beginning and end of an audio file, so that the ASR system can better detect when a command is being issued. The developers might include time-aligned phonetic labels for each phoneme within a multisyllabic word, ensuring that one can ask the algorithm for dessert recommendations without being sent to the middle of the Mojave Desert. And there are labels for elements of speech that are even more esoteric, like lexical stress, prosody, and allophonic variation. (If that sounds complicated, remember that this is a relatively simple example, and that our work in processing data from autonomous aircraft sensors is even more complex, as we’ll explain in Part 2 of this blog series.)
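
As a rough illustration, a single richly labeled sample from such a data set might carry several layers of annotation at once. Every field name, time value, and phonetic symbol below is an assumption made up for this sketch:

```python
# One hypothetical, richly labeled audio sample (all times in seconds).
annotation = {
    "audio_path": "clips/0042.wav",
    "keyword": "seven",
    # Segment labels separating speech from surrounding silence.
    "segments": [
        {"start": 0.00, "end": 0.35, "label": "silence"},
        {"start": 0.35, "end": 0.92, "label": "speech"},
        {"start": 0.92, "end": 1.20, "label": "silence"},
    ],
    # Time-aligned phonetic labels for each phoneme in the spoken word.
    "phonemes": [
        {"start": 0.35, "end": 0.50, "label": "s"},
        {"start": 0.50, "end": 0.61, "label": "eh"},
        {"start": 0.61, "end": 0.74, "label": "v"},
        {"start": 0.74, "end": 0.82, "label": "ah"},
        {"start": 0.82, "end": 0.92, "label": "n"},
    ],
    "lexical_stress": [1, 0],  # stress marker per syllable ("SEV-en")
}
```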

These additional labels would add a great deal to the overall quality of the data set by delivering more accurate feature representation, which the Google Developers site defines as the mapping of data to useful features. Other key characteristics of a high-quality data set include reliability and skew minimization. Reliability concerns the extent to which the data can be trusted, as determined by factors like the frequency of label errors, noisy features, duplicate examples, and more. Skew minimization refers to the extent to which a data set prepares an algorithm to deliver the same results both in training and in operation; any gap between those two performances is referred to as “skew.”
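
A couple of those quality signals are simple enough to sketch in code. The function below reuses the hypothetical sample format from earlier, counting duplicate examples as a crude reliability check and measuring skew as the gap between training accuracy and operational accuracy; this is a simplified illustration, not an established formula:

```python
from collections import Counter

def quality_report(samples, train_accuracy, operational_accuracy):
    """Crude data-quality checks over a list of labeled samples."""
    paths = [s["audio_path"] for s in samples]
    duplicate_examples = sum(n - 1 for n in Counter(paths).values())

    # Skew: the gap between performance in training and in operation.
    skew = round(train_accuracy - operational_accuracy, 4)
    return {"duplicate_examples": duplicate_examples, "skew": skew}

report = quality_report(
    [{"audio_path": "clips/0001.wav"}, {"audio_path": "clips/0001.wav"}],
    train_accuracy=0.97,
    operational_accuracy=0.90,
)
print(report)  # -> {'duplicate_examples': 1, 'skew': 0.07}
```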

Why the data labeling business is booming

Data labeling is essential to creating high-quality data sets, and errors in the data labeling process can easily undermine the data set as a whole. This means that data labeling will only grow more important as AI becomes a more common feature of modern life. The challenge is that data labeling is a highly labor-intensive and time-consuming process, and developers must often use a variety of approaches, some more effective than others, to get the job done.

For example, a developer team could handle data labeling internally, although they likely wouldn’t have time to do much of anything else. They could crowdsource their labeling work, collaborating with freelancers and volunteers through popular crowdsourcing platforms, though the quality of their results may vary. They could also generate labeled synthetic data with the same attributes as real data (synthetic labeling), or they could use scripts to programmatically label their data rather than doing so by hand (data programming), but both of these approaches have significant drawbacks that we’ll discuss in a future post in this series.
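
To give a flavor of data programming, here is a toy sketch in which a couple of heuristic scripts vote on a label instead of a human tagging each sample by hand. The heuristics, field names, and thresholds are all invented for illustration, and real labeling-function frameworks are considerably more sophisticated:

```python
# Toy "labeling functions": each one guesses a label or abstains (None).
def lf_has_speech_energy(clip):
    return "speech" if clip["rms_energy"] > 0.1 else "ambient"

def lf_short_clips_are_commands(clip):
    return "speech" if clip["duration_sec"] < 2.0 else None  # abstain on long clips

def programmatic_label(clip, labeling_functions):
    """Combine heuristic votes by simple majority, ignoring abstentions."""
    votes = [lf(clip) for lf in labeling_functions]
    votes = [v for v in votes if v is not None]
    return max(set(votes), key=votes.count) if votes else None

clip = {"rms_energy": 0.25, "duration_sec": 1.1}
print(programmatic_label(clip, [lf_has_speech_energy, lf_short_clips_are_commands]))
# -> "speech"
```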

It’s no surprise that data labeling for AI is set to become a billion-dollar market by 2023. Increasingly, developers are turning to outsourced labor, either through temp employees or specialized outsourcing companies, for their data labeling needs. With low-cost labor available in countries like China, India, and Malaysia, the growing data labeling market may yet prove to be an enormous boon for rural areas around the globe where work is difficult to come by.

Next up…

In Part 2 of this series, “Data Labeling in Autonomous Vehicle Development,” we’ll take a look at how data labeling helps prepare autonomous vehicles for real-world operation. Stay tuned!

- Cedric Cocaud