Automatic Data Labeling Strategies for Vision-Based Machine Learning and AI

In the first post of our four-part blog series on data labeling, we introduced the notion of data curation, the necessity of data labeling, and the importance of maintaining tight control over label accuracy and consistency. In the second post, we discussed how manual labeling stops scaling past a certain data volume, and how hard it is for a manual approach to achieve and maintain the quality level required for Wayfinder's perception and decision-making applications.

The only viable way to label data at scale with consistency and precision is to employ automated processes leveraging bootstrapped machine learning models and/or programmatic labeling. Programmatic labeling, which is also known as weak supervision, distant supervision or self-supervision, relies on external information or heuristics to label data and create usable training datasets. It is also the only viable option when precise aircraft parameters such as position and attitude need to be associated with a time-synced image or some other sensor output. In this blog post, we will explore automated labeling approaches and discuss the challenges in applying them.

A chicken and egg problem

As with almost every new player in the machine learning (ML) world, Project Wayfinder started collecting and labeling data manually. However, it quickly became obvious to us that even manual crowdsourced labeling wasn’t viable without automating part of the process, and that the key to a truly scalable and precise labeling process was fully automating the pipeline. The fundamental paradox of ML development is that while reliable ML algorithms are a powerful tool to label new data without human supervision, accurate labels are needed in the first place to train and mature ML algorithms.

It may be tempting to skip manual labeling and go straight to a fully automated labeling pipeline when beginning ML development for a new application, but that shortcut is not feasible if there is no programmatic way to label data and no pre-labeled data is available. The automotive industry benefits from having commercial and publicly available datasets, making bootstrapped ML-based labeling possible for new entrants. Such data simply isn't available yet for perception and decision-making applications in the aerospace industry.

So what's the fastest route to a scalable auto-labeling process when you start from a clean sheet?

Auto-labeling and ML development maturity

The holy grail of a scalable auto-labeling solution for autonomous vehicles is an automated solution whereby all data are automatically collected by a fleet of vehicles in operation. Low value data are screened out and high value data of cases that haven't been previously encountered are automatically identified, labeled and added to the curated dataset used for the next ML training cycle. In the absence of a purely programmatic solution that can provide labels for all vision-based autonomy functions of interest, this automated data selection and curation process can only be achieved using ML algorithms that are already very mature.

When we investigated different approaches to auto-labeling for autonomous vehicles, we noticed a strong correlation between the level of maturity of the ML algorithms being developed, the required focus of data collection efforts and the data labeling techniques best suited to each maturity level. The table below proposes a simplified summary mapping these three aspects.

A large volume of data is required to develop safety-critical applications like ours. Getting a sufficient volume of data to initiate ML development when no commercial or public dataset is available is a significant challenge. Synthetic data generation for vision-based ML development is critical at this stage. Computer graphics tools enable us to create a wide variety of scenes and situations with complete control and accurate knowledge of the position and orientation of all elements, allowing us to create hundreds of thousands of images with very accurate labels over a few hours or days (depending on the resolution and the computing capabilities).
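To illustrate why synthetic scenes yield accurate labels almost for free, here is a minimal sketch that projects known 3D positions into pixel coordinates with a pinhole camera model. It is not Wayfinder's rendering pipeline: the `Intrinsics` type, the function names, and the camera-frame convention (x right, y down, z forward) are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class Intrinsics:
    """Hypothetical pinhole camera parameters, all in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int

def project_point(point_cam, k):
    """Project a 3D point given in the camera frame onto the image.

    Assumed convention: x right, y down, z forward.
    Returns (u, v) in pixels, or None if the point is behind the
    camera or falls outside the image.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    u = k.fx * x / z + k.cx
    v = k.fy * y / z + k.cy
    if 0 <= u < k.width and 0 <= v < k.height:
        return (u, v)
    return None

def bounding_box_label(corners_cam, k):
    """Build a 2D box label (u_min, v_min, u_max, v_max) from the known
    3D corner points of an object. Returns None if any corner is not visible."""
    pixels = [project_point(p, k) for p in corners_cam]
    if not pixels or any(p is None for p in pixels):
        return None
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    return (min(us), min(vs), max(us), max(vs))
```

Because the renderer knows every object's position exactly, a label produced this way is only as inaccurate as the camera model itself, which is what makes synthetic data attractive when no labeled real data exists.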

At an early development stage, the bulk of the data collection effort will focus on quantity. The aim is to develop the first generation of ML algorithms. The resulting ML algorithms typically have low generalization capabilities, resulting in high false positive rates (e.g. spurious detections) and high false negative rates (e.g. missed detections) for any instance not seen in the training dataset. For that reason, using such immature ML algorithms to bootstrap auto-labeling is usually bound to fail.

On the other hand, programmatic labeling is a very attractive solution if it can be automated and scaled, because its labels come from external information or heuristics rather than from an immature model. For this reason, Wayfinder's auto-labeling pipeline relies heavily on programmatic labeling for the development of its auto-landing guidance algorithm.
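A minimal sketch of one form of programmatic labeling: attaching logged aircraft state to each image by nearest timestamp. The function names, the 20 ms tolerance, and the assumption that both streams share one clock are illustrative choices, not details of Wayfinder's pipeline.

```python
from bisect import bisect_left

def nearest_pose(image_time, pose_times, poses, max_gap_s=0.02):
    """Return the pose logged closest in time to image_time.

    pose_times must be sorted in ascending order and aligned with poses.
    Returns None when the closest pose is more than max_gap_s away, so
    that poorly synchronized images are left unlabeled rather than mislabeled.
    """
    if not pose_times:
        return None
    i = bisect_left(pose_times, image_time)
    # The nearest sample is either just before or just after the insertion point.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(pose_times)]
    best = min(candidates, key=lambda j: abs(pose_times[j] - image_time))
    if abs(pose_times[best] - image_time) > max_gap_s:
        return None
    return poses[best]

def label_images(image_times, pose_times, poses, max_gap_s=0.02):
    """Map each image timestamp to a pose label, skipping images with no close match."""
    labels = {}
    for t in image_times:
        pose = nearest_pose(t, pose_times, poses, max_gap_s)
        if pose is not None:
            labels[t] = pose
    return labels
```

Rejecting matches beyond a tolerance trades label volume for label accuracy. Interpolating between the two neighboring poses would be a natural refinement if the telemetry rate is low relative to the aircraft dynamics.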

Data collection can typically switch its focus from volume to diversity once the ML algorithm has achieved the desired performance for the nominal case(s) (e.g. auto-landing under a clear sky in the middle of the day), and now faces the greater challenge of achieving the same level of performance for general cases (e.g. auto-landing in a wide range of weather and light conditions). ML algorithms with intermediate levels of maturity are typically able to detect most objects of interest, but their lack of reliability makes them challenging to use in the labeling pipeline. A mixed approach can be used where, for example, images triggering low-certainty ML detections are logged while labeling is done offline through programmatic means. The usual lack of reliable means to measure the learning value of the data remains a challenge at this stage of ML development. A missed or low-certainty detection in an image doesn't necessarily mean that the ML algorithm has encountered a special case. It could simply be a shortfall in performance due to an imbalance in training cases in the dataset.
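The logging half of such a mixed approach could look like the sketch below, which flags frames containing at least one detection in an uncertain confidence band. The band limits (0.3 to 0.7) and the function names are assumptions for illustration and would need tuning for a real detector.

```python
def is_uncertain(confidences, low=0.3, high=0.7):
    """True if any detection confidence falls in the uncertain band [low, high)."""
    return any(low <= c < high for c in confidences)

def frames_to_log(frame_confidences, low=0.3, high=0.7):
    """Select the frames worth logging for offline programmatic labeling.

    frame_confidences maps a frame id to the list of detection
    confidences the model produced for that frame.
    """
    return [
        frame_id
        for frame_id, confidences in frame_confidences.items()
        if is_uncertain(confidences, low, high)
    ]
```

A trigger like this cannot catch missed detections, since a frame with no detections produces no confidence values at all, and, as noted above, a low confidence is not proof of a special case.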

Rare cases and anomalies become the most valuable data to collect when the ML algorithm has reached a high level of maturity. At this stage, the ML algorithms can usually detect all objects of interest reliably, and the confidence level provided for each output can be used as a reliable means to identify data with high learning value. A completely autonomous data collection pipeline can be implemented following an approach such as active learning, where high-value data is automatically identified to be labeled and added to the training dataset, and lower-value data is discarded before any time and effort is spent on storing and labeling it (leading to potentially substantial cost savings). Labeling can be automated using mixed approaches of programmatic algorithms and/or task-specific or object-specific ML algorithms.
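One common way to turn confidence levels into a selection rule is uncertainty sampling, sketched below. Each frame is scored by the binary entropy of its least certain detection, and only the top-scoring frames within a labeling budget are kept. The scoring rule, the `min_score` threshold, and the function names are assumptions for this example, not a description of Wayfinder's implementation.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a detection with confidence p; peaks at 1.0 when p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def frame_score(confidences):
    """Score a frame by its most uncertain detection (0.0 if there are none)."""
    return max((binary_entropy(p) for p in confidences), default=0.0)

def select_for_labeling(frame_confidences, budget, min_score=0.5):
    """Split frames into those worth labeling and those to discard.

    Keeps at most `budget` frames whose score is at least min_score,
    ranked from most to least uncertain. Returns (selected, discarded).
    """
    scored = [(frame_score(c), fid) for fid, c in frame_confidences.items()]
    candidates = [item for item in scored if item[0] >= min_score]
    candidates.sort(key=lambda item: item[0], reverse=True)
    selected = [fid for _, fid in candidates[:budget]]
    chosen = set(selected)
    discarded = [fid for _, fid in scored if fid not in chosen]
    return selected, discarded
```

For example, with a budget of two frames:

```python
frames = {"a": [0.99], "b": [0.5], "c": [0.6, 0.98], "d": [0.55]}
selected, discarded = select_for_labeling(frames, budget=2)
# selected == ["b", "d"], discarded == ["a", "c"]
```

This only works when the confidence values are well calibrated, which is why the approach is tied to mature algorithms.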

First step toward an auto-labeling pipeline

At the early stages of developing ML vision-based autonomous functions for aerospace, we quickly came to the conclusion that we couldn't jump directly to an unsupervised auto-labeling and/or automated active learning pipeline. Our data collection is still focused on volume, and our two primary means of gathering labeled data are synthetic data generation and programmatic labeling of real data. As there is no off-the-shelf solution available, our team is developing its own programmatic labeling pipeline.

Airborne object detection for detect-and-avoid applications is a good example of how such programmatic approaches can be used. General aviation aircraft are typically equipped with beacons that broadcast their position using equipment like ADS-B. After a flight, the aircraft trajectories that an online database (e.g. FlightRadar24) has logged from these broadcast positions can be compared with the position and orientation of the camera-equipped aircraft at the time each image was taken. This comparison shows which aircraft should have been in sight of the onboard cameras. A known aircraft that the algorithm failed to detect can then be flagged as a false negative, while a match confirms the position of a detected aircraft. This database matching approach can be scripted and performed entirely automatically.
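The matching step could be scripted along the lines of the sketch below. It is deliberately simplified and rests on several assumptions that are not from our pipeline: positions are already converted to a local flat east/north frame in metres, the camera looks straight along the aircraft heading, only the horizontal field of view is checked (altitude is ignored), and each detection has already been converted to a bearing relative to the heading. All names and default values are hypothetical.

```python
import math

def relative_bearing_deg(own, target):
    """Bearing from own aircraft to target, relative to own heading.

    Heading is in degrees clockwise from north. Result lies in [-180, 180).
    """
    bearing = math.degrees(
        math.atan2(target["east"] - own["east"], target["north"] - own["north"])
    )
    return (bearing - own["heading_deg"] + 180.0) % 360.0 - 180.0

def expected_in_view(own, target, hfov_deg=60.0, max_range_m=5000.0):
    """True if the target should be visible given the field of view and range limit."""
    distance = math.hypot(target["east"] - own["east"], target["north"] - own["north"])
    if distance > max_range_m:
        return False
    return abs(relative_bearing_deg(own, target)) <= hfov_deg / 2.0

def audit_frame(own, traffic, detection_bearings_deg, tol_deg=2.0):
    """Compare logged traffic against the detections made in one frame.

    traffic maps an aircraft id to its logged position at the image time.
    Returns (confirmed, false_negatives) as lists of aircraft ids.
    """
    confirmed, false_negatives = [], []
    for aircraft_id, target in traffic.items():
        if not expected_in_view(own, target):
            continue
        rel = relative_bearing_deg(own, target)
        if any(abs(rel - b) <= tol_deg for b in detection_bearings_deg):
            confirmed.append(aircraft_id)
        else:
            false_negatives.append(aircraft_id)
    return confirmed, false_negatives
```

A production version would also need to handle the vertical field of view, camera mounting angles, and the latency and accuracy of the broadcast positions.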

In our next blog post, we will cover how we went about building our automated data-labeling pipeline, and dive into the details of how we're programmatically labeling images for the specific application of providing guidance for aircraft auto-landing.

- Cedric Cocaud