Extending Object Detections to Scene Classes

Today, we are announcing the extension of machine-generated detections to scene classes. The new scene classes cover transportation infrastructure such as gas stations, toll stations, and parking lots and will help cities and community groups speed up their mapping efforts.

Using computer vision and machine learning to scale map generation from street-level imagery is at the core of Mapillary’s platform and services. We first introduced automatic detection of objects within our imagery in 2015. Our object recognition algorithms provide detections for over 100 classes of objects and more than 1,500 types of traffic signs. By using detections in multiple images we also calculate the position of the detected objects and are able to provide map data as point features, traffic signs, and linear features.

The detections our algorithms have covered so far are mostly well-localized objects: things that you can place as a point or a line on a map. Think of fire hydrants, benches, and utility poles. However, to ultimately describe a street-level scene fully, and place anything visible in the scene on a map, we also need to cover larger objects (gas stations, toll stations, parking lots, etc.). We may also want to cover more abstract attributes such as the type of land cover or area (urban, rural, forest, etc.). In addition to not being localized as a point on a map, these things are often not easily outlined in an image either.

Introducing Scene Classes Beta

Today, we take a first step in this direction by introducing the beta version of an initial set of new detections for larger types of traffic infrastructure. The new detections include:

  • Gas stations
  • Toll stations
  • Parking lots
  • Roundabouts
  • Intersections
  • Tunnel entries
  • Tunnel exits
  • Train stations

These new detections can be accessed in the same way as existing detections, via the Map Data menu in the Mapillary web app.

Activate scene classes from the Map Data menu. This example shows gas stations.

A new intersection detected that's not yet on the map.

We also updated the UI of the map data screens in the Mapillary web app. A new menu item in the sidebar shows which scene classes were detected in a specific image. The same place now also lists the object detections, in addition to highlighting their pixel segmentation in the image, and a count badge shows the number of detections for any selected image.

The new sidebar menu lists the newly introduced scene classes, as well as the previously available object detections.

Scene Understanding as Image Classification

To enable the new detections described above, we had to tackle an image classification task that differs from the existing computer vision approaches on our platform. The crucial difference is that we make predictions about the whole image at once rather than subdividing it into different semantic parts. Instead of inferring a list of objects together with their outlines from an image, we pose the problem as an image classification task that makes a binary decision for each object or attribute: is it present in the image or not, regardless of where.
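To make this concrete, here is a minimal sketch of what such a multi-label classifier can look like. It assumes a PyTorch setup with a generic ResNet backbone, and the class list and threshold are illustrative only, not necessarily what we use in production.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative scene-class list; the real class set and names may differ.
SCENE_CLASSES = [
    "gas_station", "toll_station", "parking_lot", "roundabout",
    "intersection", "tunnel_entry", "tunnel_exit", "train_station",
]

class SceneClassifier(nn.Module):
    """Multi-label classifier: one independent binary decision per scene class."""

    def __init__(self, num_classes: int = len(SCENE_CLASSES)):
        super().__init__()
        # Any image backbone works; in practice a pretrained one would be used.
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.backbone(images)  # raw logits, one per class

model = SceneClassifier().eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))   # a single RGB image
    scores = torch.sigmoid(logits)[0]             # sigmoid, not softmax: classes are not exclusive
    present = [c for c, s in zip(SCENE_CLASSES, scores) if s > 0.5]
```

Because several scene classes can appear in the same image (e.g. an intersection with a gas station on the corner), each class gets its own independent yes/no score rather than competing in a single softmax.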

Image classification may seem like the easier task since we do not care about position. However, training machine learning models for classification brings its own challenges because there is less supervision in the training data. In particular, we never tell the model where to look for an object or for a sign that a certain attribute is present; we only tell it that something is present or not, and the model has to learn to distinguish those cases on its own. Ultimately, this leads to a higher demand on the number and variance of training images, to make sure the model does not get confused by frequently occurring but unrelated co-occurrences in the training data and that we have enough challenging negative signals in the dataset. This is especially true as we always aim for models that work all across the world.
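In training terms, the only supervision is a multi-hot label vector per image. A hedged sketch of such a training step, assuming PyTorch and a standard per-class binary cross-entropy loss (other losses are of course possible):

```python
import torch.nn as nn

# Supervision is image-level only: a multi-hot vector per image that says which
# scene classes are present (1) or absent (0), never where they appear.
criterion = nn.BCEWithLogitsLoss()  # independent binary cross-entropy per class

def training_step(model, optimizer, images, labels):
    """images: (B, 3, H, W) tensor; labels: (B, num_classes) multi-hot tensor."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```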

Training a robust model that works globally is tricky. One of the difficulties lies in the requirement of a diverse training dataset. This is not because of the image annotation itself, which is mostly a matter of assigning tags to images. The trickiest part is to curate the annotation and training cycles so that hard examples are obtained efficiently and progressively. We soon realised that a proper selection of training images, covering negative samples, is a key ingredient. Therefore, we use an iterative approach to improve our training data (a code sketch of the selection in step 4 follows the list):

  1. Bootstrap dataset with a few positive and negative samples
  2. Train a classification model
  3. Record predictions on a huge set of unlabeled images across the world
  4. Select new images to annotate with decisions based on the predictions
    1. Cover all ranges of predicted scores
    2. Focus on samples with high entropy in the prediction
    3. Focus on negative samples with high scores (hard negatives)
    4. Ensure geographical diversity
  5. Continue with Step 2.
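The following is a simplified sketch of how the selection in step 4 could be implemented for a single scene class; the function name, budget, thresholds, and bucketing are illustrative, not our production values.

```python
import numpy as np

def select_images_to_annotate(scores, known_negative_mask, budget=500):
    """Illustrative selection heuristics for one scene class (step 4 above).

    scores: (N,) predicted probabilities on a large set of unlabeled images
    known_negative_mask: (N,) bool, True where the class is known to be absent
    Returns indices of images to send to annotation.
    """
    scores = np.asarray(scores, dtype=float)

    # (1) Cover all ranges of predicted scores: stratified sample per score bucket.
    buckets = np.clip((scores * 10).astype(int), 0, 9)
    stratified = [np.where(buckets == b)[0][: budget // 10] for b in range(10)]

    # (2) Focus on samples with high entropy (most uncertain predictions).
    eps = 1e-7
    entropy = -(scores * np.log(scores + eps) + (1 - scores) * np.log(1 - scores + eps))
    uncertain = np.argsort(-entropy)[:budget]

    # (3) Hard negatives: images known to lack the class but scored high anyway.
    hard_negatives = np.where(np.asarray(known_negative_mask) & (scores > 0.5))[0]

    # (4) Geographical diversity would be enforced on top of this; omitted here.
    return np.unique(np.concatenate(stratified + [uncertain, hard_negatives]))
```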


Following this approach, we are constantly improving how well our models detect these new types of objects and attributes. In addition, we are working on extending the list of objects and attributes. In the future, we will also look into spatial aggregations of certain attributes (e.g., identifying urban areas) and use inferred attributes to derive a measure for the quality of images.

This type of scene understanding required a whole new set of algorithms in our computer vision system, closer to scene classification than to object detection. While we work on expanding the set of capabilities, we look forward to your feedback on our beta release of the classes introduced above. We would like to thank our community again for contributing a globally diverse set of street-level images that enables this computer vision capability in our system.

/Christian, Computer Vision Engineer
