Reconstructing a 3D world from 2D images is not as straightforward for a computer as it is for humans due to the fact that some objects are moving around in the real world. Understanding an image scene through semantic segmentation improves the 3D reconstruction, resulting in more accurate map data and better navigation in the image viewer.

For all images uploaded to Mapillary, we use a technology called Structure from Motion to reconstruct 3D structures and improve camera positions from 2D projections of those structures. The aim is to understand the appearance of important objects and structures like buildings, roads, and traffic signs in the 3D scenes, and precisely position them to generate map data.

While images uploaded to Mapillary generally depict street scenes with mostly static objects, many of them also contain moving objects or objects that are part of the scene only temporarily. Such objects include vehicles, bicyclists, and pedestrians, and also less obviously moving objects like clouds, camera mounts, dashboards, car hoods, or windshield reflections. In the case of 360° panoramas, car roofs, helmets, and even the mapper herself (if the camera is not held high enough above the head while walking) could take up a large part of the image.

A typical street scene where we can see quite a few moving or temporary objects.

The problem

Moving objects confuse 3D reconstruction algorithms. It is unclear whether it is the camera that is moving, the objects, or both. This sometimes leads to erroneous camera motions when the algorithms reconstruct the trajectory of the camera with respect to a moving object. What we want is the trajectory of the camera with respect to the ground.

The initial step in the SfM pipeline is to detect keypoints in the 2D images; then match the features of those keypoints with the keypoint features of neighboring images to obtain keypoint correspondences. If features are matched correctly, these correspondences can be triangulated in 3D space and the relative camera positions can be determined.

Keypoints can be detected everywhere in an image: on static objects like buildings but also on moving vehicles and clouds. If a moving object has many keypoint correspondences across multiple images, it can take over and become the significant factor leading to a faulty 3D reconstruction.

Keypoints Keypoint detections in an image from San Francisco. Keypoints are detected on buildings but also on cars along the side of the road‚ the bus in front of the camera, and in the sky.

An example

Let's take a look at an example. The sequence below shows nine images taken roughly ten meters apart along a straight road in San Francisco. The mapper was unfortunate to end up behind a bus that takes up a large part of the images.

Sequence A sequence of nine images from San Francisco. The sequence is ordered in rows and the first image is the top left and the last is the bottom right. The rear of a bus is visible in all of the images.

When reconstructing the 3D scene of this sequence, some problems occur. The rear of the bus gets many feature correspondences along the whole sequence and becomes significant in the reconstruction. For a human, it is easy to see that we are following a bus along a road but for the 3D reconstruction algorithm it is ambiguous. Are we moving back and forth in relation to a static bus, with moving objects on each side (the buildings)? In this case, the reconstruction of the bus prevails over the reconstruction of the buildings and it is perceived as a static object.

Reconstruction problems like this lead to faulty camera positions, wrongly placed map objects, and also to navigation arrows in the MapillaryJS viewer pointing in the wrong direction.

Reconstruction Reconstruction of the San Francisco bus sequence from above. The camera positions are not forming a straight line as expected (since the mapper moved forward along a straight road) but instead positioned in a somewhat random pattern. Some 3D points representing building walls on the left and right of the cameras can be seen, but the significant structure is the reconstruction of the rear of the bus at the far end of the scene. The backlights and the red border at the top of the bus are clearly visible.

Reconstruction side Reconstruction of the San Francisco bus sequence from the side. The reconstruction of the rear of the bus can be seen in the leftmost part of the scene. The camera positions have seemingly random altitudes.

The solution

With the help of sematic understanding of the captured scene, it is possible to resolve the problem of most moving and temporary objects. By making use of highly accurate machine-genereated semantic segmentations, we can determine which keypoints belong to potentially moving objects. We can then do the reconstruction using only keypoints on static object classes like buildings and roads. The result is a 3D reconstruction that depicts the important non-moving parts of the scene and the camera motion with respect to them.

Segmentation mask Creating a keypoint mask. Left: original sequence image. Middle: sematic segmentation for the image. Right: binary mask with white for included segmentation classes and black for excluded. Excluded classes are vehicle, pedestrian, and sky, among others.

Fixed reconstruction Reconstruction after masking feature points for excluded classes. The cameras are correctly placed along a straight line with roughly the same distance between them. The 3D points representing the bus are gone and the walls of the buildings are correctly reconstructed.

Summary

Mapillary data is rich and diverse, since anyone can upload any images taken in any conditions anywhere. This makes the task of understanding the underlying 3D world complex; for example, because of non-static scenes with moving objects. Combining semantic understanding with Structure from Motion can help leviate some of the difficulties and improve the 3D reconstruction results, leading to more accurate map data and better navigation in the image viewer.

/Oscar

Tags for this post: computervision
comments powered by Disqus