Training Machines to Attain a 3D Understanding of Objects from Single, 2D Images

We sit down with Peter Kontschieder, the Director of Research at Mapillary, to talk about “Disentangling Monocular 3D Object Detection”, the latest academic paper to be published by Mapillary’s Research team. Peter tells us how 3D object detections made in single 2D images can improve mapmaking and push down the cost of autonomous vehicles, and how the team unveiled a fundamental flaw in the metric used by the most dominant benchmarking dataset in this area.

We recently announced that Mapillary will publish four papers at CVPR in June this year, one of which will be presented at the conference. CVPR, or the Conference on Computer Vision and Pattern Recognition, is one of the world’s most prominent computer vision conferences, so this is great news not just for Mapillary and our Research team; it also shows the importance of the imagery dataset contributed by people all across the world.

Since then, Mapillary’s Research team has published another technical report. Today we sit down with the team’s Director of Research, Peter Kontschieder, to talk about how the paper outlines a way for machines to attain a 3D understanding of objects in single, 2D images, and how it unveils a major glitch in the metric of the dominant benchmarking dataset for 3D understanding in 2D images.

Disentangling monocular 3D object detection

Named “Disentangling Monocular 3D Object Detection”, the paper delves into the ill-posed problem of enabling a 3D understanding of objects in a 2D image.

“The paper is about monocular, RGB image-based 3D object detection. It’s quite a mouthful, but what it means is that you try to teach a machine a three-dimensional understanding of objects in a single image. For this paper, we looked at cars, pedestrians, and bicyclists. From a single 2D image, we want the machine to be able to produce a 3D bounding box around an object that captures its height, width, and depth, its position in the scene (including how far away it is), as well as its orientation. This is a very difficult problem to try and solve from just one image,” Peter says.
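To make the output concrete, here is a minimal, hedged sketch of how such a 3D box prediction could be represented; the field names and parameterization are our own for illustration and may differ from the paper’s actual representation.

```python
# An illustrative sketch of a monocular 3D detection result as described
# above; the field names and parameterization are invented for this example.
from dataclasses import dataclass

@dataclass
class Box3D:
    x: float           # lateral position of the box center (meters)
    y: float           # vertical position of the box center (meters)
    z: float           # depth: distance from the camera (meters)
    width: float       # box extent left-right (meters)
    height: float      # box extent up-down (meters)
    length: float      # box extent front-back (meters)
    yaw: float         # orientation: rotation around the vertical axis (radians)
    label: str         # e.g. "car", "pedestrian", "bicyclist"
    confidence: float  # how sure the detector is, where 1.0 means 100%

# From a single RGB image, a monocular 3D detector emits a list of these:
detections = [
    Box3D(x=1.2, y=0.1, z=14.8, width=1.8, height=1.5, length=4.2,
          yaw=0.05, label="car", confidence=0.93),
]
```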

The paper taps into a particularly hot topic as cameras are quickly rising as a potential alternative to LiDAR for autonomous vehicles to understand their surroundings.

Peter explains: “Cameras are super cheap. That’s why recording RGB image data is a viable alternative to LiDAR, which is very expensive, when it comes to autonomous vehicles. It’s also why we’re seeing an explosion of cameras being mounted and installed everywhere. Typically, though, cameras are passive devices with a lens and a sensor that allow you to just capture an image, unlike LiDAR, which can also actively sense the depth of a scene. Cameras don’t; instead, they mimic the way a single human eye works. But most humans have two eyes, and that’s what allows us to perceive depth. Since cameras only have one eye, 3D object detection becomes a challenging issue. That’s what we address in this paper. With just one eye, or one image, how can we still perceive depth?”

Mapillary’s approach to 3D bounding boxes: this animation shows how object detectors learn to generate a red 3D bounding box with respect to box extent, location, and scene depth in order to match the actual target box in green. Previous works (top) produce unrealistic deformations during the learning process, while Mapillary’s approach (bottom) converges faster and maintains plausible box configurations by first rotating and scaling the initial box before correcting the scene depth.

3D detection from a single image input has remained a largely unexplored topic, even though it has the potential to make a significant impact on everything from computer vision-powered mapmaking to autonomous vehicles. In the case of Mapillary, enabling 3D detections in the context of single images will improve the data that we provide to mapmakers, cities, and carmakers everywhere.

“One thing we always aim for at Mapillary is better accuracy of our detection algorithms. To do this, we try to distill as much complementary information as possible about the objects in the images. In other words, information like how far away the object is helps us to sharpen our detection abilities and improve our data.”

A small number of papers on 3D detections from single image inputs have been published in the past, but the results attained in Mapillary’s Disentangling Monocular 3D Object Detection (see table below) outperform the related and directly comparable works by far.

Mapillary’s detection results compared to others on the KITTI3D benchmark: Mapillary’s results are highlighted in green, in comparison to the results of related works shown in gray at the bottom. All the entries above the gray section are Mapillary’s, with the MonoDIS method producing the best results by far. The maximum value is 100%, and higher is better.

Virtually all papers that have explored this area in the past have used the dominant benchmarking dataset, KITTI3D, to evaluate and compare their results, as can be seen in the table above. KITTI3D has also been used to evaluate more than 200 other works on 2D object detection. As part of the research, Peter and his team discovered a fundamental flaw in how the benchmark evaluates research results for 3D detections.

Unveiling weaknesses in the benchmarking dataset metric

Peter recalls: “Some of these things are pretty complex, and bugs are part of the process. That’s why, when we started getting the first experimental results of our method, we thought that there might be a bug in our code. Turns out, it wasn’t actually a bug on our end. It wasn’t even a bug in the code on KITTI3D’s side.”

What Peter and his team discovered is that the underlying mechanism used to compute performance on the KITTI3D dataset suffers from a dramatic glitch. To determine performance, the evaluation server receives a list of detection results for each test image, together with a confidence value per prediction, where 1.0 means 100% confident. The image below shows an example of 3D box predictions and their confidence values produced by a detection algorithm. The predictions with confidence values >0.8 seem reasonable, and in this example, you can easily separate them from the obviously wrong ones by thresholding at a confidence value of, for instance, 0.5. So far, so good.
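To make the protocol concrete, here is a minimal sketch of that thresholding step; the box names and confidence values are invented for illustration and are not taken from the benchmark.

```python
# A minimal sketch of confidence thresholding: each prediction carries a
# confidence in [0, 1], and a cutoff separates plausible detections from
# obviously wrong ones. Names and values here are illustrative only.
predictions = [
    ("car_left", 0.93), ("car_middle", 0.86),
    ("car_right", 0.9993), ("spurious_box", 0.23),
]

threshold = 0.5
kept = [(name, conf) for name, conf in predictions if conf >= threshold]
print(kept)  # the three plausible cars survive; the spurious box is dropped
```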

The glitch appears when cranking up the threshold so that only the single best prediction survives. For example, in the image below, the threshold could be set to 0.99925, so that only the car with the light green box and confidence 0.9993 on the right remains. Assuming that no other test image contains a detection with equal or higher confidence, the final list provided to the evaluation server would comprise only this single green bounding box. When submitting this (very short) result list to the server, it would actually report an overall KITTI3D algorithm performance of 1/11 ≈ 9.09%! In other words, although the test dataset contains thousands of images with thousands of cars in them, the KITTI3D evaluation metric can be fooled by providing just one prediction from the image that the machine is most confident about.
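To see where the 1/11 comes from: the metric behind KITTI3D is an interpolated average precision computed over eleven equally spaced recall levels from 0.0 to 1.0, and the recall = 0 level is satisfied by any correct detection. Below is a minimal sketch demonstrating the effect; it is our own simplified reimplementation, not the official evaluation code.

```python
# Simplified 11-point interpolated average precision, showing how a single
# confident, correct detection already scores 1/11 on its own.
import numpy as np

def ap_11_point(recalls, precisions):
    """Average the best precision achieved at or beyond each of the
    eleven recall levels 0.0, 0.1, ..., 1.0."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):   # note: includes recall = 0.0
        reachable = recalls >= r
        p = precisions[reachable].max() if reachable.any() else 0.0
        ap += p / 11.0
    return ap

# Submit only the single most confident box, and suppose it is correct:
# precision is 1.0, but recall is ~0 (one box found out of thousands).
recalls = np.array([0.0003])
precisions = np.array([1.0])
print(ap_11_point(recalls, precisions))   # ~0.0909, i.e. 1/11 = 9.09%
```

Only the recall = 0 level is reachable, yet it alone contributes 1/11 to the final score, which is exactly the 9.09% floor described above.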

3D bounding boxes on a KITTI3D test image: an exemplary test image with superimposed 3D box detection results and their associated prediction confidence values.

As Peter puts it: “This is explosive stuff. Hundreds of academic papers have used the KITTI3D dataset and relied on the provided results, meaning that their research was affected, since the metric used for benchmarking on the dataset is skewed.”

How come this hasn’t been discovered before? The original definition of the metric has predominantly been used for measuring 2D detector performance. Since many 2D detection algorithms perform very well, the nature of the bug hasn’t had much of an impact on the evaluation of 2D detection methods.

Peter explains: “The bug hasn’t been discovered before since it only has a minor impact on algorithms that are performing in the 80-90% range. The field at large is really well developed when it comes to 2D detection algorithms, and most recent results on KITTI for 2D detections hover somewhere above 85%, so it’s no wonder that this hasn’t been discovered before.”

As an added bonus, Peter and his team propose a solution to the glitch as part of their research. The full paper is available on our website and on arXiv, where you’ll find more details about the glitch, the proposed solution, and how to best train machines to attain a 3D understanding of objects from single, 2D images.
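For the curious, here is a hedged sketch of the kind of fix the paper proposes, as we read it: keep the same interpolated average precision, but sample recall at 40 levels that exclude recall = 0, so a single confident detection no longer earns a free precision contribution. The paper itself is the authoritative reference for the exact definition.

```python
# Our reading of the proposed fix (see the paper for the exact definition):
# sample 40 recall levels starting at 0.025 instead of 11 starting at 0.0.
import numpy as np

def ap_40_point(recalls, precisions):
    ap = 0.0
    for r in np.linspace(1.0 / 40.0, 1.0, 40):   # 0.025, 0.05, ..., 1.0
        reachable = recalls >= r
        p = precisions[reachable].max() if reachable.any() else 0.0
        ap += p / 40.0
    return ap

# The degenerate single-box submission from the example above now scores 0:
print(ap_40_point(np.array([0.0003]), np.array([1.0])))   # 0.0
```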

The entire Mapillary Research team will be at CVPR in Long Beach this year. You can see the full Mapillary schedule below. The team would love to talk to you—find them at one of the talks or at booth #1104.

Mapillary schedule at CVPR 2019

/Sandy, Head of Communications
