27 - Deep Learning - Architectures Part 2

Welcome back to deep learning and today I want to talk about part two of the architectures.

I think we can start.

In this second part, we now want to go even a bit deeper and look at some deeper models.

So we see that going deeper really was very beneficial for the error rates.

If we look at the ImageNet results from 2011, obtained with a shallow support vector machine approach, we see that the error rates were really high, around 25 percent.

AlexNet already almost cut that in half in 2012. Then ZFNet by Zeiler and Fergus, the next winner in 2013, again with eight layers.

Then VGG in 2014 with 19 layers did even better, and GoogLeNet, also in 2014, with 22 layers achieved almost the same performance.

So you can see that the more we increase the depth, the better the performance seemingly gets.

And we can see that there is only a little bit of margin left in order to beat human performance.

Depth seems to play a key role in building good networks.

Well, why could that be the case?

One reason why those deeper networks may be very efficient is something that we call exponential feature reuse.

So here you can see that if we only have two features and we stack layers of neurons on top, the number of possible paths increases exponentially.

So with two neurons, I have two paths. With another layer of neurons, I have two to the power of two paths.

With the next layer, two to the power of three paths, then two to the power of four paths, and so on.
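As a small illustration of this counting argument, here is a toy sketch (the layer width of two and the depths are just illustrative) that counts the input-to-output paths in a stack of fully connected layers:

```python
# Toy illustration of exponential feature reuse: with `width` neurons per
# layer, every neuron can feed every neuron of the next layer, so the number
# of distinct input-to-output paths multiplies by `width` with each layer.
def count_paths(width: int, depth: int) -> int:
    return width ** depth

for depth in range(1, 5):
    print(depth, count_paths(2, depth))  # depth 1 -> 2 paths, 2 -> 4, 3 -> 8, 4 -> 16
```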

So deeper networks somehow seem to be able to reuse information from the previous layers.

And if we look into what these networks are doing, the visualization results show that they build increasingly abstract representations.

So we somehow see a kind of modularization happening, and we think that deep learning works because we are able to compute different parts of the function at different positions.

So we are disentangling somehow the processing into simpler steps.

And then we essentially train a program with multiple steps that is able to describe more and more abstract representations.

So here we see that the first layers detect maybe edges and blobs; layer number three, textures; layer number five, object parts; and layer number eight, already object classes.

These images here are created from visualizations from AlexNet.

So you can see that this really seems to be happening in the network. And this is probably also a key reason why deep learning works so well: we are able to disentangle the function and to compute different things at different positions.

I think of deep learning as basically learning programs that have more than one step.

Well, we want to go deeper, and one technique that has been used there is again the inception module. The improved inception modules essentially replace the five by five and three by three convolutions that we have seen with multiple smaller convolutions.

So the first idea was that, instead of doing a five by five convolution in the inception module, you do two three by three convolutions in a row.

That already saves quite a few computations, and you can replace five by five filters by stacking smaller filters on top of each other. And we can see that this actually works for a broad variety of filters: you can separate them into several steps one after another and cascade them.

Filter cascading is something that you would also discuss in the typical computer vision class.
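As a rough sketch of this factorization idea (not the original GoogLeNet code; PyTorch and the channel sizes are my assumptions), the following compares a single five by five convolution with two stacked three by three convolutions that cover the same receptive field, using roughly 18/25 of the weights:

```python
import torch
import torch.nn as nn

in_ch, out_ch = 64, 64

# Single 5x5 convolution (padding keeps the spatial size).
conv5x5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

# Two stacked 3x3 convolutions: same 5x5 receptive field, fewer weights,
# plus an extra non-linearity in between.
conv3x3_twice = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, in_ch, 32, 32)
assert conv5x5(x).shape == conv3x3_twice(x).shape     # identical output size
print(num_params(conv5x5), num_params(conv3x3_twice))  # ~25*C^2 vs. ~18*C^2 weights
```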

Inception V2 then already had 42 layers. It starts with essentially three by three convolutions and three modified inception modules like the one that we just looked at.

Then an efficient grid-size reduction is introduced that uses strided convolutions: you have one by one convolutions for channel compression, then a three by three convolution with stride one, followed by a three by three convolution with stride two.

This effectively replaces the different pooling operations.
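A minimal sketch of this grid-size reduction, assuming PyTorch and illustrative channel numbers (in the Inception V3 paper, this strided convolution branch runs in parallel with a pooling branch and the outputs are concatenated):

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Halve the spatial resolution without an information bottleneck."""
    def __init__(self, in_ch=288, mid_ch=64, out_ch=320):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),              # 1x1 channel compression
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),  # 3x3, stride 1
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2),   # 3x3, stride 2 (downsampling)
        )
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2)  # parallel pooling branch

    def forward(self, x):
        # Concatenate both downsampled branches along the channel axis.
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

x = torch.randn(1, 288, 35, 35)
print(GridReduction()(x).shape)  # torch.Size([1, 608, 17, 17])
```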

The next idea was then to introduce, five times, modules of flattened convolutions. Here the idea is to no longer express the filters as 2D convolutions, but instead to separate them into convolutions along the x and y directions and to apply those two convolutions alternatingly.

So you can see here that we start with one by one convolutions in the left branch; then we do a one by N convolution, followed by an N by one convolution, followed by a one by N convolution, and so on.

This essentially allows us to break the kernels down into one direction and then the other. Because you alternatingly change the orientation of the convolution, you essentially break the 2D convolution up into 1D convolutions, enforcing a separable computation.

We can also see that this separation of convolution filters works for a broad variety of filters.
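As a hedged sketch of this factorization (again assuming PyTorch; N = 7 and the channel count are illustrative, not the exact values from the architecture), a one by N convolution followed by an N by one convolution covers the same N by N receptive field with far fewer weights:

```python
import torch
import torch.nn as nn

n, ch = 7, 128

# Full NxN convolution for comparison.
conv_nxn = nn.Conv2d(ch, ch, kernel_size=n, padding=n // 2)

# Flattened / factorized variant: 1xN along x, then Nx1 along y.
conv_factorized = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=(1, n), padding=(0, n // 2)),  # convolve along x
    nn.Conv2d(ch, ch, kernel_size=(n, 1), padding=(n // 2, 0)),  # convolve along y
)

x = torch.randn(1, ch, 17, 17)
assert conv_nxn(x).shape == conv_factorized(x).shape  # same output size

num_params = lambda m: sum(p.numel() for p in m.parameters())
print(num_params(conv_nxn), num_params(conv_factorized))  # ~N^2 vs. ~2N weights per filter
```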

Of course, this is a restriction: it does not allow all possible computations. But remember that the earlier layers use full three by three convolutions, so they can already learn to adapt their outputs such that the later layers can process them with the separable convolutions. So this is very efficient.

This then leads to Inception V3. The third version of Inception essentially takes Inception V2 and introduces RMSProp for the training procedure, batch normalization also in the fully connected layers of the auxiliary classifiers, and label smoothing regularization.

Label smoothing regularization is a really cool trick, so let's spend a couple more minutes looking into that idea.

Now, think about what our label vectors typically look like.

We have one-hot encoded vectors, which means that our label is essentially a Dirac distribution. This says that one element is correct and all others are wrong.

And we typically use a softmax, so to reproduce such a hard target exactly, the activations would have to go towards infinity.

And this is not so great because we continue to learn larger and larger weights, making them more and more extreme.

We can counteract this with weight decay, which prevents our weights from growing dramatically.
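For reference, a minimal sketch of how weight decay is typically switched on in practice (PyTorch here; the model and the decay factor are illustrative assumptions, not values from the lecture):

```python
import torch

# Weight decay is usually enabled as an optimizer option; 1e-4 is a common
# illustrative choice, not a value prescribed by the lecture.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
```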

In addition, we can also use label smoothing. The idea of label smoothing is that instead of using only the Dirac pulse, we smear some of the probability mass onto the other classes.

This is particularly helpful in settings like ImageNet, where you have rather noisy labels; remember the cases that were not entirely clear.

And in these noisy label cases, you can see that this label smoothing can really help.

The idea is that you multiply your Dirac distribution with one minus some small number epsilon, and you then distribute the epsilon that you deducted from the correct class uniformly over the other classes.
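A minimal sketch of that smoothing step, assuming PyTorch; epsilon = 0.1 is a commonly used value, and the remaining numbers are purely illustrative:

```python
import torch

# Label smoothing as described above: the one-hot (Dirac) target keeps
# probability (1 - eps) on the correct class, and the deducted eps is spread
# uniformly over the remaining K - 1 classes.
# (A closely related variant spreads eps / K over all K classes instead.)
def smooth_labels(labels: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + (eps / (num_classes - 1)) * (1.0 - one_hot)

labels = torch.tensor([2, 0])
print(smooth_labels(labels, num_classes=5))
# The correct class gets 0.9, every other class gets 0.025.
```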
