Welcome back to deep learning. So today we want to discuss the single shot detectors
and how we can actually approach real-time object detection.
Okay, so this is the fourth part of segmentation and object detection: the single shot detectors. So can't
we just use the region proposal network as a detector in a you-only-look-once fashion?
And this is the idea of YOLO, which is a single shot detector. You only look once: you combine
the bounding box prediction and the classification into a single network. And this is done by
subdividing the image essentially into S times S cells. And for every cell you do in
parallel the class probability map computation and you produce bounding boxes and confidence.
And this then gives you, for each cell, B bounding boxes, each with a confidence score, and the class
probabilities, all produced by a CNN. So the CNN predicts S × S × (5B + C) values, where the 5 counts
the four box coordinates plus one confidence per box and C is the number of classes. In the end, to produce the final object detection,
you compute the overlap of the bounding box with the respective class probability map
and this then allows you to compute the average within this bounding box to produce the final
class of that respective object. And this way you are able to solve complex scenes like
this one and this is really real time. So there is YOLO 9000 which is an improved version
of YOLO which is advertised as better, faster and stronger. So it's better because batch
normalization is used and they also do high-resolution classification to improve the mean average
precision by up to 6%. The anchor boxes that are found by clustering over the training
data improve the recall by 7%, and training over multiple scales allows YOLO 9000 to detect
objects at different resolutions more easily. It's faster because it's using a different
CNN architecture that speeds up the forward pass, and it's stronger because it has this
hierarchical detection on a tree that allows combining different object detection data
sets. All of this allows YOLO 9000 to detect up to 9000 classes in real time or faster.
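To make the YOLO output layout described above a bit more concrete, here is a minimal sketch in Python with NumPy. It only illustrates how a prediction with S times S cells, B boxes of five values each, and C class values per cell could be split up. The function name `split_yolo_output`, the per-cell ordering of the values, and the example sizes S = 7, B = 2, C = 20 are assumptions for illustration. Multiplying the box confidence with the class probabilities is one common way to get a class score per box and is not necessarily the exact procedure from the slides.

```python
import numpy as np


def split_yolo_output(pred, S, B, C):
    """Split a flat YOLO-style prediction into its parts.

    Assumed (hypothetical) layout per cell:
    B blocks of [x, y, w, h, confidence], followed by C class probabilities.
    """
    pred = np.asarray(pred, dtype=float).reshape(S, S, 5 * B + C)
    boxes = pred[..., :5 * B].reshape(S, S, B, 5)
    coords = boxes[..., :4]            # box geometry, shape (S, S, B, 4)
    conf = boxes[..., 4]               # box confidence, shape (S, S, B)
    class_probs = pred[..., 5 * B:]    # per-cell class probabilities, (S, S, C)
    # One score per box and class: confidence times class probability.
    scores = conf[..., None] * class_probs[:, :, None, :]  # (S, S, B, C)
    return coords, conf, class_probs, scores


# Example sizes (assumed for illustration).
S, B, C = 7, 2, 20
n_values = S * S * (5 * B + C)

# Random numbers stand in for the CNN output.
rng = np.random.default_rng(0)
pred = rng.random(n_values)

coords, conf, class_probs, scores = split_yolo_output(pred, S, B, C)
print(n_values, coords.shape, conf.shape, class_probs.shape, scores.shape)
```

With these example sizes the network would have to predict 7 * 7 * 30 = 1470 values per image in a single forward pass, which is the reason why this kind of detector can be so fast.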
There's also the single shot MultiBox detector (SSD) in reference 24. It's a popular alternative
to YOLO and also a single shot detector: like YOLO, it needs only one forward pass through the CNN.
It's called MultiBox because this is the name of the bounding box regression technique
in reference 15, and it's obviously an object detector. It differs from YOLO in several
aspects but shares the same core idea. Now you have this problem with multiple resolutions
and in particular if you think about tasks like histological images that have a very
high resolution, then you can also work with detectors like RetinaNet. RetinaNet
is essentially using a ResNet CNN encoder-decoder, so very similar to what we've already seen
in image segmentation, and then it's using a feature pyramid net that allows you to couple
the different feature maps that are computed from the original input image with the ones that are generated
in the decoder. So you could say it's very similar to a U-Net, but in contrast to U-Net
it does a class and box prediction using a subnet on each of the scales of the feature
pyramid net. So you could say it's a single shot detector that uses a U-Net-like structure
for the class and box prediction. Also it uses the focal loss that we will talk about in
a couple of slides. Let's look a bit at the trade-off between speed and accuracy. You can see that
generally networks that are very accurate are not so fast. So here you see on the x-axis
the GPU time and on the y-axis the overall mean average precision, and you can see that
you can combine architectures like single shot detectors, region-based fully convolutional
networks (R-FCN), or ideas like Faster R-CNN with different feature extractors like
Inception-ResNet, Inception and so on. And this allows us to produce many different combinations
and you can see that if you spend more time on the computation then you typically can
also increase the accuracy and this is reflected in this graph. The class imbalance is key
to tackle the speed-accuracy trade-off. All of those single shot detectors evaluate many
hypothesis locations, and most of them are really easy negatives. This imbalance is not addressed
by the current training. In classical methods we typically dealt with this using hard negative
mining, and now the question is: can we change the loss function to pay less attention to
easy examples. And this idea exactly brings us to the focal loss and here we can essentially
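
Since the idea of paying less attention to easy examples can be hard to picture, here is a minimal sketch of a binary focal loss in Python with NumPy. It assumes the commonly used form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the probability assigned to the true class. The function names and the default values gamma = 2 and alpha = 0.25 are assumptions for illustration, not something stated in this video.

```python
import numpy as np


def cross_entropy(p, y, eps=1e-12):
    """Plain binary cross-entropy per example (p = P(class 1), y in {0, 1})."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)  # probability of the true class
    return -np.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss per example (assumed form, see the text above)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


# Two background (negative) locations: one easy, one hard.
p = np.array([0.01, 0.9])  # predicted object probability
y = np.array([0, 0])       # both are actually background

ce = cross_entropy(p, y)
fl = focal_loss(p, y, gamma=2.0, alpha=0.5)
ratio = fl / ce            # down-weighting factor relative to cross-entropy
print(ratio)
```

In this sketch the easy negative is scaled by 0.5 * 0.01^2, while the hard negative is scaled by 0.5 * 0.9^2, so the many easy background locations contribute far less to the total loss than the few hard ones.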
Access: Open access
Duration: 00:08:11 min
Recording date: 2020-10-12
Uploaded on: 2020-10-12 22:06:31
Language: en-US
Deep Learning - Segmentation and Object Detection Part 4
In this video, we look at some ideas on how to perform object detection really quickly. This leads to single shot detectors of which YOLO is one of the most popular ones. If you are in need of multi-scale object detection, Retina-Net is a popular choice.
Additional References
nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation
X-ray-transform Invariant Anatomical Landmark Detection for Pelvic Trauma Surgery
Retina-net Figure by Marc Aubreville
Further Reading:
A gentle Introduction to Deep Learning