Achieving New State-of-the-Art in Monocular 3D Object Detection Using Virtual Cameras

We are introducing MoVi-3D, a new single-stage architecture for 3D object detection from single 2D images. Starting from a single 2D image, it uses geometric information to create a set of virtual views of the scene, and detection is performed on those views using a lightweight architecture.

Autonomous driving depends entirely on technology that can detect objects in street scenes. We’ve seen a lot of improvement in 2D object detection over the past decade, but 3D detection, which is what autonomous vehicles need, is still a significant challenge. That’s why autonomous vehicles today rely on LiDAR, an expensive sensor that can accurately estimate the distance to objects in its surroundings.

LiDAR is not the way forward for rolling out autonomous vehicles broadly though, as it’s simply too expensive. The combination of cameras and computer vision has the potential to remove the reliance on LiDAR and bring down the costs substantially. For that to work and be cost-effective, we need robust computer vision technology that can accurately detect objects using only cheap cameras.

Today we present a new way to do 3D object detection in 2D images. We introduce a new type of training and inference scheme, termed virtual cameras, as well as a new lightweight and single-stage architecture which we’ve named MoVi-3D.

Existing methods usually perform training and inference using a single, fixed view of the scene, captured by a single, fixed camera. The drawback is that the same object appears at very different sizes in the image depending on how near or far it is, which makes the task more complex. Instead of relying entirely on this fixed view, we combine it with the available information about the camera to create a set of virtual views, generated by virtual cameras that we can place anywhere in the scene. Unlike in the fixed view, in our virtual views the dimensions of the objects remain constant regardless of their distance, which makes detection easier.
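To make the size argument concrete, here is a minimal sketch based on the standard pinhole camera model. It is an illustration only, not our implementation: it assumes that a virtual view can be modeled as a crop of the original image covering a window of fixed metric height placed at the object's depth, rescaled to a fixed pixel height. The function names, the 5 m window, the 100 px view height and the 700 px focal length are all hypothetical values chosen for the example.

```python
import math


def fixed_view_height_px(focal_px, object_height_m, depth_m):
    """Pinhole projection: pixel height of an object in the fixed camera image."""
    return focal_px * object_height_m / depth_m


def virtual_view_height_px(focal_px, object_height_m, depth_m,
                           viewport_height_m=5.0, view_height_px=100.0):
    """Pixel height of the same object in a (hypothetical) virtual view.

    Assumption: the virtual view is a crop that covers `viewport_height_m`
    metres at the object's depth, rescaled to `view_height_px` pixels.
    """
    # Height in pixels of the crop window in the original image.
    crop_height_px = focal_px * viewport_height_m / depth_m
    # Factor that rescales the crop to the fixed virtual-view resolution.
    scale = view_height_px / crop_height_px
    return fixed_view_height_px(focal_px, object_height_m, depth_m) * scale


# A 1.5 m tall object seen at three different depths (hypothetical numbers).
for depth in (10.0, 20.0, 40.0):
    fixed = fixed_view_height_px(700.0, 1.5, depth)
    virtual = virtual_view_height_px(700.0, 1.5, depth)
    print(f"depth {depth:5.1f} m: fixed view {fixed:6.2f} px, "
          f"virtual view {virtual:6.2f} px")
```

In the fixed view the pixel height shrinks in proportion to 1/depth, while in the virtual view the depth cancels out and the height reduces to view_height_px * object_height_m / viewport_height_m. Under these assumptions the detector only ever has to handle objects at one scale.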

Our virtual camera idea for single-image 3D object detection: The image on the right is taken from a fixed camera positioned at the black triangle shown on the bottom left. After positioning virtual cameras at the colored locations in front of the fixed camera, we obtain the corresponding virtual views on the right.

On top of this, we demonstrate that the use of virtual cameras enables us to reduce the complexity of the architecture. That’s why we introduce MoVi-3D, a lightweight architecture that can detect multiple categories of objects in a single stage.


Our MoVi-3D results on two KITTI3D video sequences. The model performs 3D object detection by relying only on the RGB image shown on the bottom right. To give a better understanding of the quality of our results, we also visualize the same scene from the Bird’s Eye View, shown on the left, and in the LiDAR point cloud, shown on the top right.

We’ve run MoVi-3D on the popular KITTI3D dataset and, despite its simplicity, it is currently the world’s best-performing method for monocular 3D object detection on several object classes. For cars, the most important and best-represented category in KITTI3D, we improve the 3D average precision by +12.3% (moderate difficulty) and +24.8% (hard difficulty). In the Bird’s Eye View average precision metric, we improve by +24.6% (moderate difficulty) and +33.5% (hard difficulty).

The approach is described in our paper, Single-Stage Monocular 3D Object Detection with Virtual Cameras. It was recently published on arXiv and you can read it here.

/Andrea, PhD Researcher
