Winning the CVPR Semantic Segmentation Challenge: How Mapillary Makes Computer Vision Algorithms Effective and Robust

Mapillary has won the Semantic Segmentation Challenge as part of the 2018 CVPR Robust Vision Workshop. Here's how we did it.

We are proud to announce that Mapillary has won the Semantic Segmentation Challenge at the Robust Vision Workshop 2018, co-organized by leading academic institutions such as Stanford University, ETH Zurich, and the Max Planck Institute for Intelligent Systems in Tübingen. The workshop is co-located with CVPR 2018, the most important annual computer vision conference, to be held in Salt Lake City, UT, from June 18-23.

The purpose of the workshop is to assess the algorithms’ performance under adverse conditions. In other words, how well do the selected algorithms perform when they face, for instance, radically different lighting conditions, or when the images they’re looking at are distorted? This is becoming increasingly important as we push towards an autonomous future—we need autonomous cars to be able to deal with different kinds of data in a robust way.

The Semantic Segmentation Challenge leaderboard

Mapillary handles a huge number of street-level images—more than 300 million as of last week. Whether the images are taken with professional rigs or smartphone cameras, most of them are high-resolution and of high quality. That said, we still want our algorithms to perform well in all kinds of street scenarios, regardless of the quality of the imagery. That’s why we decided to go up against the best of the best and took part in the Robust Vision Challenge.

As it turns out, models trained with our technology on the Mapillary Vistas dataset, then fine-tuned on the challenge data, give the best performance on all four benchmark datasets.

Here is how we designed and trained our models.

One of the images presented during the challenge. As you can see in the top image, there’s a truck unloading and blocking the street, right ahead of a curve. This is an unusual, rarely seen scenario. The bottom image shows how our algorithm segmented the objects in the image.

Semantic segmentation—the task of assigning a semantic label (like car or pedestrian) to each pixel in an image—is very resource-hungry, but it is one of the driving workhorses of our mission to understand the world’s places. Thanks to our recently developed “In-Place Activated BatchNorm” (also one of the three papers we are presenting at CVPR ’18), we are able to cut the memory footprint of our segmentation models by roughly 50% during training, which allowed us to set new benchmarks on challenging datasets like Cityscapes and our very own Mapillary Vistas earlier this year.

The key contribution of the paper is a memory optimization for modern deep learning architectures: some of the intermediate results of the forward pass are simply dropped, and the information required for gradient computation is recovered during the backward pass by inverting the results that were kept, at only a minor (0.8-2%) increase in computation time. Freeing up GPU memory considerably improves results on dense prediction tasks like semantic segmentation, as the saved memory can be used to extend the models’ field of view and to increase the resolution and amount of training data.
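To make the mechanism concrete, here is a minimal PyTorch sketch of the underlying trick, illustrated on the activation alone (our own illustration for this post, not the paper’s fused CUDA implementation): only the output of an invertible activation like leaky ReLU is kept for the backward pass, so the pre-activation buffer can be freed.

```python
import torch

class MemorySavingLeakyReLU(torch.autograd.Function):
    """Sketch of the core idea: save only the activation *output*
    for backward and recover what is needed by inverting the
    activation, so the input buffer need not be stored."""

    @staticmethod
    def forward(ctx, z, slope=0.01):
        y = torch.where(z >= 0, z, slope * z)  # leaky ReLU
        ctx.slope = slope
        ctx.save_for_backward(y)               # keep y only, not z
        return y

    @staticmethod
    def backward(ctx, grad_y):
        (y,) = ctx.saved_tensors
        # Because slope > 0, leaky ReLU is invertible: y >= 0 exactly
        # when z >= 0, so the local derivative can be read off y alone.
        dydz = torch.where(y >= 0, torch.ones_like(y),
                           torch.full_like(y, ctx.slope))
        return grad_y * dydz, None

# Usage: gradients match the standard leaky ReLU.
z = torch.randn(2, 3, requires_grad=True)
MemorySavingLeakyReLU.apply(z).sum().backward()
```

In the paper, this inversion is fused with batch normalization, which is where the roughly 50% training-memory savings mentioned above come from.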

A dark and rainy evening at a street intersection. As you can see in the top image, our algorithms still detect every car, traffic light, and tree, even though the visibility and road conditions are far from ideal.

The availability of suitable training data is an equally important ingredient for winning any deep-learning powered challenge. Our Mapillary Vistas is one of the most comprehensive publicly available datasets, with unmatched levels of annotation richness across a total of 25,000 high-resolution street-level images. The Vistas dataset comprises street-level imagery from all over the world and exhibits large variability in lighting, object appearance, capture sensor types, noise properties, season, and weather, so it was a natural choice for us to build our prize-winning models on it.
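For readers who want to explore the dataset themselves, here is a minimal sketch of reading one annotation, assuming the public v1.x layout in which label maps ship as paletted PNGs whose pixel values are class IDs and config.json lists the class definitions; the file names below are placeholders.

```python
import json
import numpy as np
from PIL import Image

# Assumed v1.x layout: class definitions in config.json, labels as
# paletted PNGs whose pixel values are class IDs (placeholder paths).
with open("config.json") as f:
    class_names = [lbl["readable"] for lbl in json.load(f)["labels"]]

mask = np.array(Image.open("training/labels/example.png"))

# Per-class pixel counts for this annotation.
ids, counts = np.unique(mask, return_counts=True)
for i, c in zip(ids, counts):
    print(f"{class_names[i]:35s}{c:>10d} px")
```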

Finally, we applied some of the usual bells and whistles (multi-scale and multi-model testing) on the test data, and the final combination allowed us to significantly surpass all competitors on the leaderboard.
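As an illustration, multi-scale testing can be as simple as averaging the model’s logits over a few rescaled (and horizontally flipped) copies of each image. The sketch below shows this in PyTorch; the scales and flipping are typical choices for this kind of test-time augmentation, not necessarily the exact configuration we used.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_flip_logits(model, image, scales=(0.75, 1.0, 1.25)):
    """Average segmentation logits over several scales and a
    horizontal flip. `image` is an NCHW float tensor; `model`
    returns NCHW per-class logits."""
    _, _, h, w = image.shape
    total, n = 0.0, 0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear",
                          align_corners=False)
        for flip in (False, True):
            inp = torch.flip(x, dims=[3]) if flip else x
            out = model(inp)
            if flip:                      # undo the flip on the logits
                out = torch.flip(out, dims=[3])
            out = F.interpolate(out, size=(h, w), mode="bilinear",
                                align_corners=False)
            total, n = total + out, n + 1
    return total / n

# prediction = multiscale_flip_logits(net, batch).argmax(dim=1)
```

Multi-model testing works the same way, except that the averaged logits come from several independently trained networks instead of one network at several scales.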

Participating in workshops like Robust Vision is a great opportunity for us to stay at the cutting edge of research and keep in touch with fellow researchers. It helps us constantly evolve our technology for everyone building on our services and data. We’d appreciate your feedback in the comments section below, and if you are attending CVPR and want to say hi, don’t hesitate to visit us at our booth (#907) or at our posters.

/Peter and the Mapillary Research Team
