We are happy to announce one of the first new contributions driven by our AI Lab in Graz, Austria: adding a feature that provides semantic segmentations for a select number of object categories. In this blog post we will discuss what this is all about and why we think that this development will be useful for our members and customers.
With over 20M Mapillary photos processed so far, we see this advancement as the beginning of our quest to apply broader object recognition globally to all Mapillary photos.
Semantic segmentation is a long-studied problem in computer vision and one of the grand challenges to be solved when it comes to automatically understanding the world we live in. Essentially, semantic segmentation allows us to assign a categorical tag (also called a label) to each pixel in an image. For humans this is relatively easy, as we can typically understand the content of an image in terms of the objects it contains. For instance, most of us can identify and locate cars, pedestrians, houses, etc. in images without having to think twice about it.
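To make the idea of "one label per pixel" concrete, here is a minimal Python sketch. It assumes a model has already produced one score per category for every pixel. The category names, the tiny 2x3 "image", and the scores are all made up for illustration and are not taken from our system.

```python
# Hypothetical category list; the names are illustrative only.
CATEGORIES = ["road", "car", "sky"]


def label_map(scores):
    """Turn per-pixel category scores into one label per pixel.

    scores[y][x] is a list holding one score per entry in CATEGORIES.
    """
    labels = []
    for row in scores:
        label_row = []
        for pixel_scores in row:
            # Pick the index of the highest-scoring category for this pixel.
            best = max(range(len(pixel_scores)), key=lambda i: pixel_scores[i])
            label_row.append(CATEGORIES[best])
        labels.append(label_row)
    return labels


# Made-up scores for an image that is 2 pixels high and 3 pixels wide.
scores = [
    [[0.1, 0.2, 0.7], [0.2, 0.1, 0.7], [0.1, 0.6, 0.3]],
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.3, 0.6, 0.1]],
]

segmentation = label_map(scores)
for row in segmentation:
    print(row)
```

The result has the same height and width as the input, which is what distinguishes segmentation from assigning a single tag to the whole image.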
Teaching machines to see (and, furthermore, understand) what is happening in an image makes for a completely different story, however. These days, we take it for granted that an image will have a resolution of, say, 10 megapixels, which to a machine means that there are essentially 10 million points arranged on a 2-dimensional grid to sort through. But how can we help a computer orient itself in such a vast pool of data and actually provide us with a semantic annotation for each pixel in an image?
The last five years have given rise to a technology called “deep learning,” which uses so-called “convolutional neural networks” under the hood. For those not familiar with the field, here is a quick intro (see also this article on deep learning in a nutshell). Deep learning is a promising concept in machine learning that essentially aims to teach a computer to solve a problem it has not seen before by letting it become familiar with solutions to similar problems. Traditionally, a human programmer had to tell the computer what kind of features to look for in order to solve the problem. Deep learning, by contrast, enables you to skip that step: by “studying” similar problems and their solutions, the models learn to focus on key features by themselves, similar to how scientists think a human brain does. Hence the name “neural network.”
While this sounds super-fancy at first glance, and for some might conjure images of machines taking over the world, it is essentially a revival of a technology that has been known since the ’80s (back when K.I.T.T. from Knight Rider would have been the one taking over the world). However, with some important twists in the algorithms, tons of data, and modern computers (specifically, graphics cards) being orders of magnitude more powerful than K.I.T.T. could have dreamed of, modern deep learning now beats earlier approaches by a wide margin on many computer vision tasks.
Diving into a bit more detail, a convolutional neural network comprises a number of layers, each of which serves a particular function that can be expressed in a mathematical way. A layer may or may not hold parameters (a type of memory into which the information from the training images gets absorbed during the learning process), and the associated function has to be differentiable, so that we know which “direction” the optimization algorithm should take.
Taking a look at the illustration below, we see a typical network layout used for image classification, where the ultimate goal is to obtain an educated guess from the network about the most dominant object category in the image. As can be seen, the original input image enters on the left side, and the data is transformed by gradually passing through several layers (convolutional (CONV), rectified linear (RELU), pooling (POOL), and fully connected (FC)). We can think of these layers as ways to manipulate their respective inputs until the final layer provides a probability distribution for what type of object the input image shows, based on the set of object categories the network has been trained on.
An example of a convolutional network comprising several layers (image credit Andrej Karpathy, Stanford class cs231n)
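The layer types named above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the network we run in production: the 5x5 input, the hand-picked edge kernel, and the fully connected weights are all invented, and a real network would learn these values and stack many more layers.

```python
import math


def conv2d(image, kernel):
    """CONV: slide a small kernel over the image (valid region, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(image) - kh + 1):
        row = []
        for x in range(len(image[0]) - kw + 1):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


def relu(fmap):
    """RELU: replace negative responses with zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """POOL: keep the largest value in each size x size block."""
    return [[max(fmap[y + i][x + j] for i in range(size) for j in range(size))
             for x in range(0, len(fmap[0]) - size + 1, size)]
            for y in range(0, len(fmap) - size + 1, size)]


def fully_connected(fmap, weights, biases):
    """FC: flatten the feature map and compute one score per category."""
    flat = [v for row in fmap for v in row]
    return [sum(w * v for w, v in zip(ws, flat)) + b
            for ws, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up 5x5 grayscale input with a vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 1], [-1, 1]]
# Invented FC parameters for two hypothetical categories.
fc_weights = [[0.5, 0.0, 0.5, 0.0], [-0.5, 0.0, -0.5, 0.0]]
fc_biases = [0.0, 0.0]

features = max_pool(relu(conv2d(image, kernel)))
probabilities = softmax(fully_connected(features, fc_weights, fc_biases))
print(probabilities)
```

Each function only transforms its input and hands the result to the next one, which mirrors the left-to-right flow in the illustration.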
One of the most fascinating things about deep networks is how they actually learn, i.e., how they start making sense of what the pixels in an image are all about and how they memorize this information. The training process is an iterative procedure, where training images together with their ground truth tags (i.e., a given signal that says “this particular image contains a car”) are presented to the network many times and ideally in many different versions. In the example image below, this could have been one of thousands of images for the object category “car” that the network saw during training. At some point, it will figure out what makes a car a car, and ideally learn to recognize this in previously unseen images as well. For more details on the learning part of the process, we refer the interested reader to lectures on optimization with stochastic gradient descent and on backpropagation (which is based on the chain rule).
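As a rough sketch of this iterative procedure, the Python snippet below trains a single-layer model with stochastic gradient descent. It is deliberately much simpler than a deep network: each "image" is reduced to two made-up feature values, the labels are invented, and the hyperparameters are arbitrary choices for the example. The update rule itself is the standard one for a sigmoid output with cross-entropy loss.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability that the sample belongs to the category 'car'."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples, epochs=200, learning_rate=0.5, seed=0):
    """Stochastic gradient descent: one parameter update per training sample."""
    rng = random.Random(seed)
    weights, bias = [0.0, 0.0], 0.0
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)  # present the samples in a new order every epoch
        for features, target in samples:
            prediction = predict(weights, bias, features)
            # For a sigmoid output with cross-entropy loss, the chain rule
            # gives d(loss)/d(z) = prediction - target.
            error = prediction - target
            # Step against the gradient for every parameter.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Made-up training set: (features, ground truth tag), 1 = car, 0 = no car.
samples = [
    ([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.7, 0.95], 1),
    ([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.15, 0.3], 0),
]

weights, bias = train(samples)
for features, target in samples:
    print(target, round(predict(weights, bias, features), 3))
```

A deep network follows the same loop; backpropagation is what extends the gradient computation through all of its layers.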
For our new semantic segmentation feature in Mapillary we are using a slightly different architecture and some additional layers, but in the end the basic idea remains the same as the example discussed above.
Enough with the techy part—how can we see the results on Mapillary? When browsing our viewer, simply click on the filter and enable the segmentation option “Show identified photo features”. You will see the list of object categories on the left. Hovering over a specific category, you can see all the corresponding segments highlighted in the image. Currently we support 12 object categories that are most commonly seen in a road scene.
The photo features filter in the Mapillary viewer
We believe that having semantic segmentation on street-level photos will open up many exciting avenues moving forward. One immediate application for Mapillary is to use semantic segmentation to improve other computer vision tasks. In 3D reconstruction, for instance, the presence of moving objects can decrease the quality of the reconstruction. By ignoring matches between moving objects (e.g., clouds and vehicles), we have seen significant improvement in our 3D reconstruction pipeline. Another interesting application is to enrich mapping or navigation data by investigating the spatial layout and presence of different categories. For instance, we can get a rough estimate of vegetation density or the availability of sidewalks in certain parts of cities.
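Both applications can be illustrated with a small Python sketch. It assumes the segmentation of each photo is available as a grid of category names and that feature matches are given as pairs of pixel coordinates. The function names, the category names, and the tiny 3x3 label grids are hypothetical, and this is not our actual reconstruction pipeline.

```python
# Hypothetical set of categories treated as moving (clouds and vehicles).
MOVING_CATEGORIES = {"sky", "car"}


def keep_static_matches(matches, labels_a, labels_b):
    """Drop matches whose endpoint in either image lies on a moving category.

    matches: list of ((xa, ya), (xb, yb)) pixel coordinates in images A and B.
    labels_a, labels_b: per-pixel category names, indexed as labels[y][x].
    """
    kept = []
    for (xa, ya), (xb, yb) in matches:
        if labels_a[ya][xa] in MOVING_CATEGORIES:
            continue
        if labels_b[yb][xb] in MOVING_CATEGORIES:
            continue
        kept.append(((xa, ya), (xb, yb)))
    return kept


def category_fraction(labels, category):
    """Share of pixels with the given category, e.g. a rough vegetation estimate."""
    total = sum(len(row) for row in labels)
    hits = sum(row.count(category) for row in labels)
    return hits / total if total else 0.0


# Made-up 3x3 segmentations of two overlapping photos.
labels_a = [
    ["sky", "sky", "sky"],
    ["building", "vegetation", "car"],
    ["road", "road", "car"],
]
labels_b = [
    ["sky", "sky", "sky"],
    ["building", "vegetation", "vegetation"],
    ["road", "car", "road"],
]
# Made-up feature matches between the two photos.
matches = [
    ((0, 1), (0, 1)),  # building to building
    ((1, 1), (1, 1)),  # vegetation to vegetation
    ((2, 1), (2, 1)),  # lands on a car in photo A
    ((1, 2), (1, 2)),  # lands on a car in photo B
    ((0, 0), (0, 0)),  # sky to sky
]

static_matches = keep_static_matches(matches, labels_a, labels_b)
print(static_matches)
print(category_fraction(labels_a, "vegetation"))
```

Only the matches on the building and the vegetation survive the filter, and the same label grids give a per-photo vegetation share without any extra processing.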
We are very excited to launch this first version of semantic segmentation so that we can listen to your feedback and understand what other object categories you are interested in detecting and segmenting. This is only the first step toward our bigger goal: to represent and understand our world and make it more accessible to a broad community.