Introducing the Mapillary Planet-Scale Depth Dataset for Single-View Depth Estimation

The Mapillary Planet-Scale Depth dataset is unique in its scale and diversity, thanks to the millions of images that our community uploads to Mapillary every day.

Today we are announcing the release of Mapillary Planet-Scale Depth (MPSD), the most diverse publicly available dataset for training single-view depth networks.

Using this dataset, we have achieved a new state of the art in the task of single-view depth estimation. The work on this dataset has been accepted as an oral paper at the European Conference on Computer Vision (ECCV), where we will also present three other papers this year.

Single-view depth estimation is a machine learning task within computer vision. Single-view depth networks are convolutional neural networks that extract 3D information from single images, instead of relying on specialized hardware such as stereo cameras or LiDAR.

Single-view depth networks predict the distance to the camera for every pixel.
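
To make the idea concrete, here is a minimal sketch of such a network in PyTorch. This is an illustrative toy encoder-decoder, not the architecture used in our paper: it maps an RGB image to one positive depth value per pixel.

```python
# A toy single-view depth network: an encoder-decoder mapping an RGB image
# to a per-pixel depth map. Illustrative only; not the paper's architecture.
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: downsample the image and extract features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: upsample back to the input resolution, one channel out.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb):
        # Softplus keeps the predicted depth strictly positive.
        return nn.functional.softplus(self.decoder(self.encoder(rgb)))

net = TinyDepthNet()
depth = net(torch.randn(1, 3, 256, 320))  # -> (1, 1, 256, 320) depth map
```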

Mapillary Planet-Scale Depth (MPSD) is a large-scale dataset of RGB + depth image pairs for training single-view depth neural networks. Curated from Mapillary, it contains diverse images with varying camera intrinsics, broad geographic coverage, and varied scene characteristics.

Just like humans looking at the world with one eye shut, and given sufficient corresponding training data, single-view depth networks learn the real size of objects in an image in order to predict how far away they are.

Until now, single-view depth networks were trained using photo collections of commonly visited tourist spots or small datasets captured with professional camera + LiDAR rigs, drastically limiting variety. MPSD has images from all continents, recorded from many cameras and in varying imaging conditions, allowing networks trained on MPSD to perform very well on other datasets, even without fine-tuning.

Predictions of a network trained on MPSD. The images are from a different dataset (KITTI). This illustrates how networks trained on MPSD can ‘generalize’ to other images, including dynamic objects like the cyclist.

Building Mapillary Planet-Scale Depth

To create MPSD, we first built an image filtering module to gather image sequences from all over the world that satisfy a few criteria, e.g. sequence length and camera calibration. We then used our open-source SfM library, OpenSfM, to create a 3D model of each of the selected sequences. The GPS data available alongside the images was used to obtain the ‘metric’ depth (that is, the depth in meters), which was not available in similar datasets until now.
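
As an illustration of the scaling step, here is a simplified sketch; it is not OpenSfM's actual alignment code. An SfM reconstruction is only defined up to an unknown scale, so comparing camera-to-camera distances in the reconstruction with the corresponding GPS distances in meters gives the factor that converts reconstructed depth to metric depth.

```python
# A simplified sketch of GPS-based metric scaling (not OpenSfM's actual
# alignment code): estimate the factor converting SfM units to meters.
import numpy as np

def estimate_metric_scale(sfm_positions, gps_positions_m):
    """sfm_positions: (N, 3) camera centers in arbitrary SfM units.
    gps_positions_m: (N, 3) the same cameras from GPS, converted to a
    local metric frame (e.g. East-North-Up)."""
    sfm = np.asarray(sfm_positions, dtype=float)
    gps = np.asarray(gps_positions_m, dtype=float)
    # Distances between consecutive cameras along the sequence.
    d_sfm = np.linalg.norm(np.diff(sfm, axis=0), axis=1)
    d_gps = np.linalg.norm(np.diff(gps, axis=0), axis=1)
    valid = d_sfm > 1e-6  # skip near-zero baselines
    # The median ratio is robust to GPS outliers.
    return np.median(d_gps[valid] / d_sfm[valid])
```

Multiplying the reconstructed depths by this factor yields depth in meters.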

Some sample images included in MPSD, along with the corresponding depth points that can be used to train single-view depth networks.
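
Because the SfM depth points are sparse, training supervises only the pixels where a reconstructed 3D point projects. The snippet below is an illustrative masked loss in log-depth space; the actual loss used in the paper may differ.

```python
# An illustrative masked L1 loss in log-depth space for sparse supervision;
# not necessarily the exact loss used in the paper.
import torch

def sparse_depth_loss(pred, gt):
    """pred, gt: (B, 1, H, W) tensors; gt is 0 where no SfM point projects."""
    mask = gt > 0
    if mask.sum() == 0:
        return pred.new_zeros(())  # no supervision in this batch
    return (torch.log(pred[mask]) - torch.log(gt[mask])).abs().mean()
```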

State-of-the-art Depth Estimation

To demonstrate the strengths of MPSD, we evaluated and benchmarked networks trained on it under several setups for single-view depth estimation. To start with, it is common to fine-tune convolutional neural networks on the training set of the dataset being evaluated. We used MPSD as a pre-training dataset before fine-tuning our network on the popular KITTI benchmark, and obtained the best score to date on its public leaderboard.
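
A hedged sketch of this pre-train/fine-tune regime is shown below, reusing the toy network and loss sketched earlier; `mpsd_loader` and `kitti_loader` are hypothetical data loaders yielding (RGB, depth) batches, and the learning rates are placeholders.

```python
# A hedged sketch of pre-training on MPSD and then fine-tuning on KITTI.
# `mpsd_loader` and `kitti_loader` are hypothetical DataLoaders; the network
# and loss are the toy sketches above.
import torch

def train(net, loader, lr, epochs=1):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for rgb, depth_gt in loader:
            opt.zero_grad()
            sparse_depth_loss(net(rgb), depth_gt).backward()
            opt.step()

# net = TinyDepthNet()
# train(net, mpsd_loader, lr=1e-4)   # pre-train on large, diverse MPSD
# train(net, kitti_loader, lr=1e-5)  # fine-tune on KITTI at a lower rate
```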

More interestingly, we believe that fine-tuning on such small datasets is not representative of real-life performance. For this reason, we also evaluated our MPSD-trained network on other datasets without any fine-tuning, and it obtained remarkable results on all of them.

Networks trained on MPSD obtain the best performance on all the benchmarked datasets. An asterisk indicates a result where the network is also fine-tuned on the target dataset. The size and diversity of MPSD allow networks to generalize much better to all of the datasets we tried.

It is often the case that single-view depth networks can only predict relative depth (e.g. ‘object A is twice as far away from the camera as object B’). MPSD contains metric, not just relative, depth. This means that networks trained on MPSD can predict depth in actual meters (‘object A is 15 meters away from the camera, and object B is 30 meters away’). This chart shows the effect: a network trained only on MPSD is used to predict depth on two popular datasets (Cityscapes and KITTI). The scale of the predictions is close to 1.0 for both datasets, indicating that the depth is accurate and in meters.
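
For reference, a common way to measure this scale (an assumption about the exact metric used in the chart) is the per-image median ratio between ground-truth and predicted depth; a value near 1.0 means the predictions are already metric, with no rescaling needed.

```python
# Per-image scale of the predictions: the median ratio between ground-truth
# and predicted depth over pixels with known depth. A value near 1.0 means
# the network predicts metric depth directly.
import numpy as np

def prediction_scale(pred_depth, gt_depth):
    """pred_depth, gt_depth: (H, W) arrays; gt_depth is 0 where unknown."""
    mask = gt_depth > 0
    return np.median(gt_depth[mask] / pred_depth[mask])
```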

The massive scale and diversity of MPSD allow it to be used as a training set for single-view depth networks, obtaining state-of-the-art results in both the fine-tuned and non-fine-tuned regimes. This would not have been possible without the contributions of our community, who upload millions of interesting street-level images every day.

For more details, please read our paper “Mapillary Planet-Scale Depth Dataset” published at ECCV 2020 along with three other Mapillary papers. The dataset will be available for download soon.

/Manuel López Antequera, Computer Vision Engineer
