The Role of Data Curation and Data Labeling in Vision-Based Machine Learning and AI

While current commercial aircraft already have a high degree of automation, with autopilots that can fly the aircraft without pilots actively controlling it, additional autonomous systems could further improve aircraft safety, and we’re on it.

For a commercial aircraft to be more autonomous, it needs to be able to “see” its environment and make tactical decisions that allow pilots to focus their attention on the overall mission, such as which route to take or which airport to divert to. At Project Wayfinder, our team is working towards providing these capabilities while meeting the high safety and performance standards of commercial aircraft through a mix of machine learning (ML) and legacy algorithms.

Leveraging machine learning algorithms

ML is a key technology enabler for autonomy. However, ML algorithms require vast amounts of data, which is a challenge given how few relevant datasets are available in the aerospace industry. Moreover, any amount of data is useless without labels that identify the features to be recognized by the algorithms. And while collecting large amounts of data is difficult, labeling it at scale may prove to be an even greater challenge.

To tackle this challenge, Project Wayfinder is developing separate ML algorithms for perception and decision-making tasks, two core capabilities that allow the aircraft to see its surroundings, understand its environment and make decisions that let it finish its mission safely and predictably. ML algorithms have the distinct advantage of being able to perform their function despite widely varying input conditions. For example, Wayfinder’s algorithm can detect a runway several kilometers away under large variations in cloud and light conditions, as demonstrated by the recent completion of closed-loop flights as part of Airbus’ autonomous taxiing, takeoff and landing (ATTOL) flight demonstrator project. In April 2020, Wayfinder’s software, embedded on an A350-1000 aircraft, was successfully used to guide the aircraft to a safe, fully autonomous landing.

Data curation

Raw data cannot be used directly to train ML algorithms. It must first be curated: curation transforms raw data into machine-understandable data that can be used programmatically (i.e. data that is ready for machine learning), then builds a training dataset by picking a specific subset of samples chosen to optimize the end performance of the trained ML model on a target function. The high-level steps of the curation process are as follows (a simplified code sketch follows the list):

  1. Screen and correct or discard data that may have been corrupted during recording
  2. Tag data with metadata, listing relevant acquisition conditions, such as the time of day, lighting conditions, location, weather conditions, vehicle speed and sensor type
  3. Label each data sample by associating it with the particular target features or parameter values that the ML algorithm will be trained to infer: e.g. the distance of a runway seen in an image
  4. Construct a training dataset by selecting a subset of data which is statistically representative of the range of situations and conditions to be encountered for the target function: e.g. detecting a runway during an approach at all times of the day for a particular set of airports
  5. Verify that the resulting training set is unbiased along one or multiple dimensions of interest: e.g. having the same number of daytime approaches versus nighttime approaches
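
To make these steps concrete, here is a minimal sketch of what such a curation pipeline could look like in Python. The Sample fields, the corruption flag and the day/night balancing criterion are illustrative assumptions made for this post, not Wayfinder’s actual implementation.

```python
from dataclasses import dataclass
import random

@dataclass
class Sample:
    image_path: str           # raw image from a data collection flight
    time_of_day: str          # metadata tag, e.g. "day" or "night" (step 2)
    weather: str              # metadata tag, e.g. "clear" or "overcast" (step 2)
    runway_distance_m: float  # label the model will be trained to infer (step 3)
    corrupted: bool           # flagged during screening (step 1)

def curate(raw: list, per_class: int) -> list:
    # Step 1: discard samples corrupted during recording.
    clean = [s for s in raw if not s.corrupted]

    # Steps 2 and 3 (metadata tagging and labeling) are assumed to have
    # been done upstream, so every Sample already carries both.

    # Step 4: build a subset representative of the target function,
    # grouped here along a single dimension of interest (time of day).
    day = [s for s in clean if s.time_of_day == "day"]
    night = [s for s in clean if s.time_of_day == "night"]
    subset = random.sample(day, per_class) + random.sample(night, per_class)

    # Step 5: verify the resulting set is unbiased along that dimension.
    n_day = sum(s.time_of_day == "day" for s in subset)
    assert n_day == len(subset) - n_day, "day/night split is unbalanced"
    return subset
```

In practice the selection and bias checks would run over many metadata dimensions at once (weather, airport, sensor type and so on), but the structure of the pipeline is the same.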

The resulting training set is what we call a curated dataset. Labeling is one of the most important parts of the curation process, as it determines both our ability to measure the learning value of the data we have and the quality of the training dataset we can ultimately create.

Data labeling

Vision-based machine learning is a key technology enabler for autonomous ground vehicles, as it can accurately single out objects of interest in an image, reliably classify them and determine their precise position within the image. Today, vision-based ML is used by leading autonomous driving systems, including Tesla and Google’s Waymo, and it is a fundamental part of the autonomous functions that Wayfinder is developing.

Wayfinder’s data labeling process consists of three steps: first, we start with all the images of runways captured during approaches and landings in our data collection campaign. Second, we draw the boundaries of the runway, the threshold, the centerline and the aiming points within each selected image, i.e. the features the ML algorithm is meant to detect (see Figure 1). Third, we save the pixel coordinates of the corners of each box and tag them to the image. The boxes’ corner points encode the location of the runway and the other objects of interest in the image in a language the algorithm can understand and use for learning. This encoding is commonly referred to as the ground truth.

Fig. 1: Image labels of objects to be recognized by Wayfinder's auto-landing algorithm
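
As an illustration, here is what a single ground-truth record might look like, written as Python data. The file name, field names and coordinate values are invented for the example and do not reflect Wayfinder’s actual label schema.

```python
# Hypothetical ground-truth record for one labeled image.
# Each feature is encoded by the pixel coordinates (x, y) of the
# corners of the box drawn around it during labeling.
label = {
    "image": "approach_000123.png",
    "features": {
        "runway":       [(412, 310), (618, 310), (655, 492), (380, 492)],
        "threshold":    [(430, 468), (610, 468), (615, 480), (425, 480)],
        "centerline":   [(512, 330), (522, 330), (530, 470), (505, 470)],
        "aiming_point": [(480, 420), (560, 420), (565, 438), (475, 438)],
    },
}
```

During training, the detector’s output for each image is compared against these stored coordinates, and the mismatch between the two is what drives the learning.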

The accuracy of the labels is critical because an ML algorithm learns from ground truth data, so the error distribution in a ground truth dataset will also be found in a well-trained algorithm. The quality of the labels therefore ultimately bounds the quality of the trained ML algorithm: any error introduced by the data labeling process will degrade its performance. In practice, much of the work in creating good machine learning models lies in creating accurate, unbiased labels.
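
A toy simulation illustrates how labeling error propagates. Below, a simple estimator is trained on distance labels that carry a systematic 2% bias; on fresh test data, the trained model reproduces that bias almost exactly. The feature, the model and the numbers are all invented for the illustration; this is not Wayfinder’s training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: infer runway distance (meters) from one image feature,
# e.g. the apparent runway size in pixels, inversely related to distance.
true_dist = rng.uniform(1000, 5000, size=2000)
feature = 1e6 / true_dist + rng.normal(0, 2, size=2000)

# Ground-truth labels with a systematic +2% bias, as a flawed
# labeling process might produce.
biased_labels = true_dist * 1.02

# Fit a linear model of distance against 1/feature.
coef = np.polyfit(1.0 / feature, biased_labels, 1)

# Evaluate on fresh data: the model inherits the labels' bias.
test_dist = rng.uniform(1000, 5000, size=500)
test_feature = 1e6 / test_dist + rng.normal(0, 2, size=500)
pred = np.polyval(coef, 1.0 / test_feature)
print(f"mean relative error: {np.mean(pred / test_dist - 1):+.2%}")  # about +2%
```

No amount of additional training data fixes this: the model faithfully learns what the labels say, bias included.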

However, manually labeling every image in a dataset of hundreds of thousands of images is simply impractical. It also introduces human error, so in many cases the results are neither perfect nor even consistent.

In our next blog post, we’ll explore the process of manual labeling, as well as the challenges of scaling up this approach for autonomous flight systems.

- Cedric Cocaud