Unveiling our Latest Research: Multi-Object Tracking and Segmentation from Automatic Annotations

Access to high-quality training data is one of the most important requirements for pushing the boundaries of machine learning in computer vision. Today we’re unveiling our latest piece of research, where we roll out an entirely new way to generate training data for multi-object tracking and segmentation. The approach turns raw street-level videos into training data of unprecedented quality, even compared to results based on human-annotated data. By allowing machines to generate training data, the cost of training computer vision models can go down substantially. We validate our approach on multi-object tracking and segmentation and obtain new state-of-the-art results. Here is how.

We are publishing several technical reports over the next few weeks, unveiling our latest research as we push the boundaries of what is possible in computer vision and, ultimately, AI-powered mapmaking. The first paper we are publishing, titled “Learning Multi-Object Tracking and Segmentation from Automatic Annotations”, presents two major contributions to the field. First, it identifies new ways of turning street-level imagery into training data for multi-object tracking and segmentation, i.e., tracking and segmenting several street objects in images over time. Second, it rolls out a new deep-learning-based tracking-by-detection approach, named MOTSNet, for learning multi-object tracking and segmentation. The new framework, in combination with the machine-extracted training data, leads to significantly better detection rates and, thus, better data.

Putting the machines to work: how algorithms turn imagery into training data for multi-object tracking and segmentation

Access to high-quality annotated data is one of the most important variables when it comes to training and improving deep learning algorithms. To date, however, the best-performing algorithms have typically been trained on human-labeled data, and human labeling is labor-intensive, expensive, and hard to scale.

That’s why, in this paper, we set out to find a way to automatically generate training data for simultaneously tracking and segmenting numerous objects in videos. Previously, tracking has been done solely through bounding boxes, but a new approach named MOTS (Multi-Object Tracking and Segmentation) was published at CVPR earlier this year. In MOTS, each object is segmented at the pixel level rather than just enclosed in a bounding box. This means greater precision and, ultimately, better data.
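To make the difference concrete, here is a minimal sketch of what the two kinds of labels might look like. The field names are illustrative and not taken from any particular dataset format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BoxTrackLabel:
    """Classic box-based tracking label: one box per object per frame."""
    frame: int
    track_id: int
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class MOTSLabel:
    """MOTS-style label: a pixel-accurate mask per object per frame."""
    frame: int
    track_id: int
    class_id: int     # e.g. car, pedestrian, cyclist
    mask: np.ndarray  # boolean H x W mask for this object instance

# Unlike boxes, instance masks do not overlap: each pixel belongs to at
# most one object, which makes the labels strictly more precise.
```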

To automatically generate training data for multi-object tracking and segmentation, we started by training our panoptic segmentation algorithm on the Mapillary Vistas Dataset. We published Mapillary Vistas in 2017, and it remains the world’s most diverse dataset for training algorithms in street-level object recognition. Once the algorithm has been trained on Vistas, we can apply it to any raw street-level video to automatically segment objects like cars, pedestrians, and cyclists in each video frame. We proceed by linking objects detected in subsequent video frames through motion cues provided by state-of-the-art optical flow models. We also show how to automatically generate training data for the optical flow models themselves: it turns out that we can exploit the point correspondences between images produced by structure-from-motion pipelines like OpenSfM.
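As a rough illustration of the linking step, the sketch below warps each instance mask from one frame to the next with a dense optical flow field, then matches the warped masks to the next frame's detections by mask IoU. This is a simplified stand-in for the paper's actual procedure; the function names and the IoU threshold are our own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def warp_mask(mask, flow):
    """Forward-warp a boolean H x W mask with a dense H x W x 2 flow field."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    xs_new = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, w - 1)
    ys_new = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, h - 1)
    warped = np.zeros_like(mask)
    warped[ys_new, xs_new] = True
    return warped

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def link_frames(masks_t, masks_t1, flow, iou_thresh=0.5):
    """Match masks in frame t to masks in frame t+1; returns (i, j) index pairs."""
    warped = [warp_mask(m, flow) for m in masks_t]
    cost = np.array([[1.0 - mask_iou(w, m) for m in masks_t1] for w in warped])
    rows, cols = linear_sum_assignment(cost)
    # Keep only matches whose warped-mask overlap clears the threshold.
    return [(i, j) for i, j in zip(rows, cols) if 1.0 - cost[i, j] >= iou_thresh]
```

Matched pairs propagate a track ID from one frame to the next; unmatched detections start new tracks.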

With this data generation pipeline, we were able to produce high-fidelity annotations, quantitatively surpassing previous works—even some based on human-annotated data.

Improving object association over time: a new deep-learning approach for knowing what object goes where

Our second contribution in the paper rolls out a new deep-learning-based framework for tracking-by-detection. Allowing algorithms to know and understand what is where, and at what time, is one of the ultimate goals in computer vision. Researchers have made great progress over the past few years, but jointly tracking and segmenting multiple objects over time in videos remains a significant challenge.

That’s why we built a new architecture that allows for improved object association over time, developed specifically for multi-object tracking and segmentation. Called MOTSNet, it introduces a new way to segment and track each detected object over time. By introducing a novel mask-pooling layer, we improve the association of image-based detections over time. In other words, our mask-pooling layer learns how to represent objects and enables the network to follow how they move and change over time.
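The sketch below shows one way a mask-pooling layer can be realized in PyTorch: backbone features are averaged over each instance mask, yielding one embedding per detected object. The class and variable names are ours, and the paper's actual layer may differ in detail.

```python
import torch
import torch.nn as nn

class MaskPooling(nn.Module):
    """Pool a feature map into one embedding per instance mask."""

    def forward(self, features: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # features: (C, H, W) feature map from the backbone.
        # masks:    (N, H, W) 0/1 instance masks, one per detection.
        masks = masks.float()
        areas = masks.sum(dim=(1, 2)).clamp(min=1.0)  # (N,) guard against empty masks
        # Sum the features over each mask's pixels, then divide by mask area.
        pooled = torch.einsum('chw,nhw->nc', features, masks)  # (N, C)
        return pooled / areas.unsqueeze(1)

# Per-object embeddings from consecutive frames can then be compared
# (e.g. with a cosine or learned similarity) to associate detections
# over time.
```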

The combination of these contributions, where we train the new deep network, MOTSNet, with machine-generated training data, leads to significantly improved results on multi-object tracking and segmentation. We have achieved new state-of-the-art results, outperforming all previous approaches by a clear margin.

If you are interested in learning more, the paper is now live on arXiv, where you can read it in its entirety.

/Lorenzo Porzi, Computer Vision and Machine Learning Researcher
