Data Diversity for Machine Learning and AI

Large amounts of curated and labeled data are critical for the machine learning (ML) component of the perception and decision-making AI software for autonomous flight being developed at Acubed. Equally important is the diversity of those data sets, which should span the range of expected scenarios, including night time and degraded visual conditions.

One of the objectives mentioned in our previous post was to pursue variable conditions during data collection. This is why we have expanded our flight tests to acquire data at night time and with degraded visual conditions such as sun in view (glare) and low ceilings (clouds). Below are some example images from various conditions: normal, night time, sun glare, clouds and fog.

Beyond environmental conditions such as weather, other dimensions of variation include the airport, runway and flight path. To that end, we have acquired data from most of the local airports in the Bay Area, with flight paths that cover different types of approaches to the runway. The data includes multiple instances at the same airport and runway, and it also samples many different scenarios to maximize diversity for the Wayfinder AI algorithms. This matters both for the data set used to train the ML models and for the data set used to test them, where we estimate how our models perform in situations different from the ones on which they were trained. The image below shows the various flight paths overlaid on a satellite image of the Bay Area.
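
To make the train/test point concrete, here is a minimal sketch of one way to hold out an entire airport for testing, so that test performance reflects situations the model has not seen. It is not a description of the actual Wayfinder pipeline: the record fields, the placeholder airport names and the helper functions are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical frame metadata. The field names and values are
# illustrative placeholders, not real Wayfinder records.
frames = [
    {"airport": "airport_a", "condition": "normal"},
    {"airport": "airport_a", "condition": "night"},
    {"airport": "airport_b", "condition": "glare"},
    {"airport": "airport_b", "condition": "clouds"},
    {"airport": "airport_c", "condition": "fog"},
    {"airport": "airport_c", "condition": "normal"},
]


def split_by_airport(records, held_out):
    """Send every record from a held-out airport to the test set."""
    train = [r for r in records if r["airport"] not in held_out]
    test = [r for r in records if r["airport"] in held_out]
    return train, test


def condition_counts(records):
    """Count how many records fall under each visual condition."""
    return Counter(r["condition"] for r in records)


train, test = split_by_airport(frames, held_out={"airport_c"})
print("train conditions:", dict(condition_counts(train)))
print("test conditions:", dict(condition_counts(test)))
```

Splitting by airport rather than by individual frame keeps near-identical frames from the same approach out of both sets at once. The per-condition counts are a quick check on how evenly each set covers the conditions of interest.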

Below we also show a visualization of the diversity of flight approaches, in this case towards San Jose airport.

Besides providing a richer data set for ML experiments, this wealth of diverse data has also exercised our annotation pipeline, especially when labeling data under degraded visual conditions, along with other aspects of our workflow such as flight planning and data curation.

What is the Wayfinder team’s outlook for the rest of 2021?

  • Auto-labeling variable conditions data: We are extending our annotation pipeline to handle occlusion and other visibility factors in images.
  • Second-generation Acubed Flight Test Lab: We will also soon deploy a new and improved data collection platform.
  • Large-scale data acquisition plan: We have a bigger and better data acquisition project in the works.
  • Large-scale data simulation: Finally, we intend to complement our flight test data collection with simulated data generated using cutting-edge computer graphics technologies.

Stay tuned for updates on our progress with variable condition data and next-generation data acquisition!

- Kinh Tieu and Konstantinos Balafas