Introducing Seamless Scene Segmentation: Allowing Machines to Understand Street Scenes Better by Turning Two Models into One
The Conference on Computer Vision and Pattern Recognition (CVPR) is one of the world’s most prominent computer vision conferences. It’s where innovators and industry leaders meet every year to publish and showcase the latest research that pushes the industry, and the world at large, forward. All four papers that were submitted by Mapillary this year were accepted for publication. One of the papers addresses an industry-wide problem, and today we want to share it with you—we call it Seamless Scene Segmentation.
One of the most basic premises for teaching machines to see and understand street scenes is that they need to both recognize objects and understand them in the context of their surroundings. This is normally done using two different models, one for each task. The model we outline in Seamless Scene Segmentation, however, joins these models together, performs both tasks at once, and allows us to save as much as 20% of the computing power as a result.
Instead of using individually trained segmentation models, the Seamless Scene Segmentation model takes an integrated approach and detects objects like people and cars, as well as map data like traffic signs, in relation to their surroundings rather than as stand-alone objects. This increases efficiency, cuts computational cost, and gives us access to more and better data.
Image from the paper, showing how the new Seamless Scene Segmentation model detects large pedestrian crossings while still picking up fine details like traffic signs in the background of the image
The key to saving a large share of the computational cost was to introduce a shared backbone when designing and training a deep convolutional neural network that jointly solves the tasks of semantic segmentation and instance segmentation. When training the resulting multi-task learning problem, we also investigated different ways of sharing information between the two prediction modules, so that ground-truth information from semantic segmentation (describing the scene and context) could be leveraged to improve the segmentation of individual objects, and vice versa. We validated this cross-pollination effect experimentally on several benchmark datasets and found consistently improved recognition results when treating these two artificially separated problems as a single, joint task.
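To make the shared-backbone idea more concrete, here is a minimal, hypothetical PyTorch sketch (not the architecture from the paper): a single ResNet trunk feeds both a semantic head and a greatly simplified instance head, and the two task losses are summed so that gradients from both objectives update the shared features. All class counts, layer sizes, and names below are placeholders for illustration.

```python
# A minimal, hypothetical sketch of the shared-backbone idea described above.
# It is NOT the architecture from the paper: the instance branch is reduced to a
# simple per-pixel head, and all class counts are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class SharedBackboneSegmenter(nn.Module):
    def __init__(self, num_semantic_classes=65, num_thing_classes=37):
        super().__init__()
        # Shared backbone: one ResNet trunk whose features feed both heads,
        # so the most expensive part of the network runs only once per image.
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop pool/fc
        feat_dim = 2048

        # Semantic head: dense per-pixel scores describing scene and context.
        self.semantic_head = nn.Sequential(
            nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_semantic_classes, kernel_size=1),
        )

        # Instance head: heavily simplified placeholder for a detection-style module.
        self.instance_head = nn.Sequential(
            nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_thing_classes, kernel_size=1),
        )

    def forward(self, images):
        feats = self.backbone(images)  # computed once, shared by both heads
        return self.semantic_head(feats), self.instance_head(feats)


if __name__ == "__main__":
    model = SharedBackboneSegmenter()
    images = torch.randn(2, 3, 512, 512)
    semantic_logits, instance_logits = model(images)

    # Joint multi-task objective: the two losses are summed, so gradients from
    # both tasks flow back into the shared backbone (dummy targets for illustration).
    n, _, h, w = semantic_logits.shape
    sem_target = torch.randint(0, 65, (n, h, w))
    inst_target = torch.randint(0, 37, (n, h, w))
    loss = F.cross_entropy(semantic_logits, sem_target) + \
           F.cross_entropy(instance_logits, inst_target)
    loss.backward()
```

The point of the sketch is the structure, not the details: because both heads read from the same features, the backbone's cost is paid once rather than twice, and training both objectives together is what allows the two tasks to inform each other.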
Another image from the paper, showing how the Seamless Scene Segmentation model correctly detects tiny details, such as traffic signs in the background of the image, while also detecting dynamic objects like cars
As with all our models, Seamless Scene Segmentation is trained on our Mapillary Vistas Dataset. It remains the world’s most diverse, publicly available dataset for teaching machines to understand street scenes, and it is free to use for research purposes. The Vistas Dataset consists of 25,000 high-resolution images from more than 190 countries with pixel-accurate and instance-specific human annotations, covering 152 object categories and accommodating a wide variety of weather conditions, seasons, times of day, cameras, and viewpoints. It’s used by players ranging from Toyota Research Institute to AID (Audi’s subsidiary focusing on autonomous vehicles) in teaching their cars to see.
Seamless Scene Segmentation will soon be live on arXiv.org. We’re looking forward to publishing Seamless Scene Segmentation alongside our three other papers at CVPR later this year. We have a range of different activities lined up for CVPR and, as always, it will be a great opportunity to meet with our fellow researchers. Let us know if you’ll be around—it would be great to see you in the crowd.
/Lorenzo Porzi, Computer Vision and Machine Learning Researcher