Mapillary Research has developed a novel approach to training recognition models that lets them handle up to 50% more training data than before in every single learning iteration. With this technology, we improve over the winning semantic segmentation method of this year’s Large-Scale Scene Understanding Workshop on the challenging Mapillary Vistas Dataset, setting a new state of the art.

Deep learning is after us. Everywhere. It’s a rapidly developing technology, crucial for yielding state-of-the-art performance in audio, image, and video recognition. At Mapillary, we use computer vision for extracting map data from street-level images, so we’re also heavy users of deep learning. Since our platform is device-agnostic, we deal with data from virtually any imaging sensor. What we observe is that the provided data can get humongously large not only in terms of the number of images but also in image resolution.

Semantic segmentation is one of the core recognition problems for Mapillary. It helps us understand images on a pixel level, forming the basis of true scene understanding. As an example, you can browse the beautiful city of Graz on Mapillary and see segmentation results obtained from the current production pipeline.

Semantic segmentation on Mapillary

When doing semantic segmentation, we face two major challenges. First, we need to train recognition models that can absorb all the relevant information from our training data. Second, once we have these models, we apply them to new and previously unseen images so they can recognize all the objects we are interested in.

To address the first challenge, we have developed a novel memory-saving approach to training recognition models. Applying this to our semantic segmentation models allows us to handle more and much larger training samples. To give an idea in absolute terms: our previous models were trained using Caffe, and we could only use a single crop of 480x480 pixels per GPU when training on Mapillary Vistas. We have now migrated to PyTorch, which together with our memory-saving idea drastically increases data throughput: we can handle 3 crops per GPU, each of size 776x776.

This means that we can pack about eight times more data on our GPUs during training than we could before. As a result, we improve over the winning approach of this year’s Large-Scale Scene Understanding Workshop and obtain the highest score reported so far on the Mapillary Vistas validation set in a lean, single-model, single-scale test setting. All details can be found in our arXiv paper, and you can also take a look at the implementation of our new in-place activated batch norm layer. In this post, we provide a short overview of how our new approach yields this result.
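As a quick sanity check, the "about eight times" figure can be reproduced from the crop settings quoted above. This is simple arithmetic on pixel counts per GPU; the variable names are ours.

```python
# Pixels processed per GPU in one training iteration,
# using the crop settings quoted in the text.
old_pixels = 1 * 480 * 480   # Caffe setup: one 480x480 crop
new_pixels = 3 * 776 * 776   # PyTorch + memory saving: three 776x776 crops

ratio = new_pixels / old_pixels
print(f"{old_pixels} -> {new_pixels} pixels per GPU ({ratio:.2f}x)")
```

The ratio comes out slightly below eight, which matches the rounded figure in the text.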

Semantic segmentation from applying the new training approach

A simplified view on our idea

Several modern deep learning network architectures are essentially all assembled from the following layers: batch normalization, non-linearity, and convolution. For the sake of completeness, here is a brief (reduced) description of them. Batch normalization (BN) normalizes the data using the mean and variance of the current batch (a simplified form of whitening), followed by scaling and shifting with parameters learned during training. This gives a normalized view of the data in which value ranges are comparable, allowing for larger learning rates during training.
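To make the BN description concrete, here is a minimal sketch for a single feature observed across a small batch. It is a simplified illustration in plain Python, not the layer used in our models; the function name, the example values, and the `eps` constant are assumptions for this example.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then scale by gamma and shift by beta."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m   # biased batch variance
    std = math.sqrt(var + eps)                   # eps avoids division by zero
    return [gamma * (x - mean) / std + beta for x in xs]

# One feature across a batch of four samples, with a wide value range.
batch = [10.0, 200.0, 30.0, 160.0]
ys = batch_norm(batch)

# With gamma=1 and beta=0, the output has approximately zero mean
# and approximately unit variance, whatever the input range was.
out_mean = sum(ys) / len(ys)
out_var = sum((y - out_mean) ** 2 for y in ys) / len(ys)
print(out_mean, out_var)
```

In a real network, `gamma` and `beta` are learned per channel, and the statistics are computed over the batch and spatial dimensions.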

Non-linearities (𝜙) apply a non-linear transformation to the data. A very simple but highly effective example is the Rectified Linear Unit (ReLU), which suppresses all negative inputs while passing positive ones through unchanged.
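As a small illustration, ReLU can be written in one line. The input values below are arbitrary.

```python
def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

outputs = [relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]]
print(outputs)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```

Note that every negative input maps to the same output, zero, so the original negative value cannot be recovered from the output alone.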

Finally, convolution layers (Conv) typically comprise many small filters (e.g. of size 5x5 or 3x3 pixels) that are learned during training to respond to the input data in particular ways. For example, filters could detect corners or oriented lines early in the neural network, while layers further up in the network could assemble these low-level primitives into shapelets, fragments of objects, and ultimately, objects as a whole. The number of filters steers the number of intermediate features produced, and with it the amount of memory used. Convolution layers are where the actual brain (or artificial intelligence) analogy comes from.
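The following sketch shows how a single 3x3 filter responds to an oriented line. The filter weights here are set by hand for illustration; in a real network they are learned. The function name and the tiny test image are assumptions for this example.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of one image with one filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 4x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(4)]

# Hand-set filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

response = conv2d(image, kernel)
print(response)  # [[3, 3, 0], [3, 3, 0]]
```

The response is strong where the filter window covers the edge and zero over the uniform bright region. Each additional filter produces one more such output map, which is why the filter count drives memory use.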

In Figure 1 we show how the above-mentioned layers are typically connected and how information flows during the forward (green) and backward (blue) passes in modern deep network building blocks. Since in this post we promise to significantly reduce memory consumption during training, we need to understand what is stored in memory and why. Taking a look at Figure 1, we can see how in the forward pass the input data x goes through BN and becomes y, which then passes through the non-linearity 𝜙 and becomes z, before it ultimately goes through Conv and yields u.

Standard approach

Figure 1. Standard forward and backward implementations of batch normalization, non-linearity and convolution layers, requiring two storage buffers x and z

In standard implementations, the dashed blocks show intermediate results which have to be stored because they are needed in the backward pass. Storage buffers for x and z are rather costly in terms of memory consumption (for simplicity, we ignore the batch statistics 𝜇B and 𝜎B here), so we decided to get rid of one of them. Since the building block from Figure 1 can be repeated well over 100 times in modern networks, saving one buffer per building block results in a massive overall memory reduction.
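A back-of-the-envelope calculation shows why one buffer per block matters. The feature map shape and the block count below are hypothetical values chosen for illustration, not measurements from our networks, and the calculation only counts the x and z buffers discussed above.

```python
def stored_bytes(num_blocks, buffers_per_block, elements_per_buffer,
                 bytes_per_element=4):
    """Memory held for the backward pass across all building blocks."""
    return (num_blocks * buffers_per_block
            * elements_per_buffer * bytes_per_element)

# Hypothetical feature map: 3 crops x 256 channels x 97 x 97, float32.
elements = 3 * 256 * 97 * 97
num_blocks = 100  # hypothetical number of BN + non-linearity + Conv blocks

standard = stored_bytes(num_blocks, 2, elements)   # keeps x and z
in_place = stored_bytes(num_blocks, 1, elements)   # keeps only z
saving = 1 - in_place / standard

print(f"standard: {standard / 2**30:.2f} GiB, "
      f"single buffer: {in_place / 2**30:.2f} GiB, "
      f"saving: {saving:.0%}")
```

Real networks have feature maps of varying size and other memory consumers such as weights and convolution workspaces, so the overall saving in practice depends on the architecture.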

Our proposed solution allows us to recover the necessary quantities by re-computing them from a saved intermediate result in a computationally very efficient way. In essence, we can save ~50% of GPU memory in exchange for a minor computational overhead of only 0.8–2.0%. In Figure 2, you can see how storing only a single buffer z enables us to recover the needed quantities: we invert the non-linearity 𝜙 to obtain y (which requires 𝜙 to be invertible), and we compute the partial derivatives needed during the backward pass through BN𝛾,𝛽 directly as a function of y.
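The idea can be sketched in a few lines of plain Python for a single feature across a batch. This is a simplified illustration under stated assumptions, not our actual implementation: it assumes a leaky ReLU with slope 0.01 as the invertible non-linearity (plain ReLU maps all negative inputs to zero and cannot be inverted), it assumes a non-zero 𝛾, and all function names are ours. The backward step uses the standard batch normalization gradient, rewritten so that it only needs y.

```python
import math

SLOPE = 0.01  # assumed leaky ReLU slope; must be non-zero so phi is invertible

def leaky_relu(y):
    return y if y >= 0 else SLOPE * y

def leaky_relu_inv(z):
    return z if z >= 0 else z / SLOPE

def forward(xs, gamma, beta, eps=1e-5):
    """BN followed by leaky ReLU. Only z and the batch std are kept."""
    m = len(xs)
    mean = sum(xs) / m
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / m + eps)
    zs = [leaky_relu(gamma * (x - mean) / std + beta) for x in xs]
    return zs, std

def backward(zs, dzs, std, gamma, beta):
    """Gradient w.r.t. x computed from the stored z only (gamma != 0)."""
    m = len(zs)
    ys = [leaky_relu_inv(z) for z in zs]            # recover y by inverting phi
    dys = [dz * (1.0 if y >= 0 else SLOPE) for dz, y in zip(dzs, ys)]
    x_hats = [(y - beta) / gamma for y in ys]       # normalized input, from y
    dgamma = sum(dy * xh for dy, xh in zip(dys, x_hats))
    dbeta = sum(dys)
    dxs = [(dy - dgamma * xh / m - dbeta / m) * gamma / std
           for dy, xh in zip(dys, x_hats)]
    return dxs, dgamma, dbeta

xs = [0.5, -1.2, 2.0, 0.3]
gamma, beta = 1.5, 0.2
dzs = [1.0, -2.0, 0.5, 3.0]   # gradient arriving from the next layer

zs, std = forward(xs, gamma, beta)
dxs, dgamma, dbeta = backward(zs, dzs, std, gamma, beta)
print(dxs)
```

Note that `backward` never touches `xs`: everything it needs is derived from the single stored buffer `zs`, the batch standard deviation, and the learned parameters.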

New approach

Figure 2. Our proposed in-place activated batch normalization approach, requiring only a single storage buffer z

The exact details can be found in the technical report, and you can see more examples of the positive effects of our approach in the segmented videos below. We’re keen to hear what you think, so leave us a comment or get in touch via email.

/Peter & Team Research
