Learning with Verification: Improving Object Recognition with the Community’s Input
For Mapillary to automatically detect objects in images, we train our deep neural networks on diverse, manually annotated datasets. Collecting a training dataset for detecting street-level objects globally is very challenging, because it is difficult to cover the variation in visual appearance and the combinations of objects of interest. Due to their cost, manually annotated datasets are usually not complete enough to address real-world problems.
Annotating images manually is a very time-consuming task. For example, annotating our Mapillary Vistas Dataset took more than 10 person-years of work by professionally trained annotators. That’s why we wanted to investigate a way to significantly reduce the manual annotation effort by verifying machine-generated detections instead of annotating images completely from scratch.
Large-scale verifications
To get human-verified detections, we set up challenges for our community to find errors in our machine-generated detections. In particular, we asked them to check whether the machine-generated detections were correct. Our community could simply answer with “Yes”, “No”, or “I don't know”, and collect points for each verification.
From this challenge, we received 590K answers from 381 individual contributors in our community: 297K of them were approvals of detections, and 293K were rejections. The detections originated from 195K images altogether. The number of verifications per class is shown in the following figure.
To obtain consistent annotations, we required each detection to be either approved or rejected by two independent contributors. This left us with a total of 114K positive (approved) and 137K negative (rejected) training samples for improving our algorithms.
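As a rough illustration, aggregating the raw answers into consistent labels could look like the Python sketch below (the function name, data layout, and tie-breaking rules are illustrative and not our actual pipeline):

```python
from collections import Counter

def aggregate_verifications(answers_by_detection, required_votes=2):
    """Keep only detections whose verdict is agreed on by at least
    `required_votes` independent contributors (simplified sketch)."""
    positives, negatives = [], []
    for detection_id, answers in answers_by_detection.items():
        votes = Counter(a for a in answers if a in ("yes", "no"))  # "I don't know" is ignored
        if votes["yes"] >= required_votes and votes["no"] == 0:
            positives.append(detection_id)  # consistent approval
        elif votes["no"] >= required_votes and votes["yes"] == 0:
            negatives.append(detection_id)  # consistent rejection
    return positives, negatives
```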
Since most objects in an image are usually detected correctly, we don't verify all detections but rather pick random detections in images from all over the world. This leaves us with partly annotated images in which we know for certain, for a few detections, whether or not they belong to a certain object class.
The method
To improve our deep neural networks, we developed a method that incorporates both fully and partly annotated images. This is different from fully supervised machine learning methods because we have to deal with image areas for which we have no annotations, and with verified detections for which we only know whether they belong to a certain class or not (instead of which class they belong to).
We started with the approach of treating rejections as background examples and approvals as foreground examples. However, this approach is not ideal when the predicted class is wrong but the object actually belongs to one of the other classes we are looking for.
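To make this first scheme concrete, a classification loss over verified detections could be written roughly as follows (a minimal PyTorch-style sketch with hypothetical tensor names, assuming class index 0 is background; this is not our production code):

```python
import torch
import torch.nn.functional as F

def simple_verification_loss(class_logits, verified_class, approved):
    """Approved detections become positive examples of their predicted class;
    rejected detections are treated as background (class 0).
    class_logits: (N, C) scores, verified_class: (N,) long, approved: (N,) bool."""
    background = torch.zeros_like(verified_class)
    targets = torch.where(approved, verified_class, background)
    return F.cross_entropy(class_logits, targets)
```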
Because of that, we also developed a loss function that targets the potential confusion between classes. Experimentally, however, this loss function was not able to outperform the simple approach of treating rejected bounding boxes as background.
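One plausible formulation of such a confusion-targeted loss is sketched below: a rejection only tells us that the detection is not the predicted class, so instead of forcing it towards background we only push down the probability of the rejected class. Again, this is a hypothetical sketch, not the exact loss we experimented with.

```python
import torch
import torch.nn.functional as F

def confusion_aware_loss(class_logits, verified_class, approved, eps=1e-6):
    """Approvals maximize the probability of the verified class; rejections
    only minimize it, leaving all other classes (including background) open."""
    probs = F.softmax(class_logits, dim=1)
    p_verified = probs.gather(1, verified_class.unsqueeze(1)).squeeze(1)
    pos_loss = -torch.log(p_verified + eps)        # approved: "it is this class"
    neg_loss = -torch.log(1.0 - p_verified + eps)  # rejected: "it is not this class"
    return torch.where(approved, pos_loss, neg_loss).mean()
```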
After inspecting the verifications visually, we noticed that most rejections concern detections where the object actually doesn't belong to any of the classes we're looking for. The similar performance of the two methods is likely explained by the fact that most rejections are valid background examples, but future research is needed to verify this.
To evaluate the effectiveness of using verifications, we chose the recognition tasks of object detection and panoptic segmentation. As a baseline implementation, we used our in-house version of Seamless Scene Segmentation for prototyping, trained on the Mapillary Vistas Dataset (MVD). For the baseline experiment, we chose ResNet-18 as the backbone and kept the other parameters at their defaults.
To train with verifications, we follow an alternating training scheme. First, we compute gradients using a batch of fully annotated images only. Second, we compute gradients using a batch of partly annotated images. Third, we update the weights using the summed gradients. Training is crop-based, and the batches are distributed over 8 GPUs.
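In a PyTorch-style training loop, one update of this alternating scheme could look like the sketch below (the loss functions and batch objects are placeholders, and gradient synchronization across the 8 GPUs is omitted):

```python
def alternating_step(model, optimizer, full_batch, partial_batch,
                     full_loss_fn, verification_loss_fn):
    """One weight update from one fully annotated and one partly annotated batch."""
    optimizer.zero_grad()

    # 1) gradients from the fully annotated batch (standard supervised loss)
    full_loss_fn(model, full_batch).backward()

    # 2) gradients from the partly annotated batch (verification-based loss),
    #    accumulated on top of the gradients from step 1
    verification_loss_fn(model, partial_batch).backward()

    # 3) a single weight update using the summed gradients
    optimizer.step()
```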
The results
We split the verification dataset into 179K training and 15K test images. The test images are selected to represent each class equally whenever sufficient examples are available. In the following charts, we show the estimated precision, recall, and specificity when training with MVD only, and with MVD plus the additional verifications. The quantitative results show that we improve on all three metrics, not only for object detection but also for panoptic segmentation, on the verification test set.
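For reference, the three metrics can be computed from a verification test set roughly as follows, treating each verified detection as an independent yes/no sample (a simplified sketch; the matching of model output to verified detections is not shown):

```python
def precision_recall_specificity(verdicts, predictions):
    """verdicts: ground-truth yes/no answers (True = the detection is correct).
    predictions: whether the model reports the corresponding detection."""
    tp = sum(v and p for v, p in zip(verdicts, predictions))
    fp = sum((not v) and p for v, p in zip(verdicts, predictions))
    fn = sum(v and (not p) for v, p in zip(verdicts, predictions))
    tn = sum((not v) and (not p) for v, p in zip(verdicts, predictions))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, specificity
```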
Below you can see some visual examples of where our recognition algorithms improve by making use of simple yes/no annotations. Left: without verifications; right: after re-training with verifications.
Next steps
For further improvement, we would like to focus on efficient ways to refine the human-verified detections. Specifically, we want to resolve ambiguous detections where a human also has difficulty identifying the object, as well as assign the correct class to rejected detections in a semi-supervised manner.
All these improvements to object detection and panoptic segmentation at the image level let us increase the quality of our machine-generated map features for the OpenStreetMap community and everyone else. We’re planning to deploy the deep neural networks re-trained with verifications to the Mapillary platform in the future.
We’ll keep collecting verification data for further improvements, so we encourage you to set up your own verification projects or join and help out in those listed on the Mapillary Marketplace. Recently, we also added a verification API that lets you approve or reject any detection, even outside verification projects.
A big thank you to our community who contributed with more than half a million verifications to help improve our recognition algorithms, leading to better map data for everyone and more accurate maps everywhere!
/Gerhard, Recognition Team Lead