Welcome back to Deep Learning. As promised in the last video, we want to go ahead and talk a bit about more sophisticated architectures than the residual networks that we saw previously.
I don't think that simply adding more depth, in the sense of having 10,000 layers instead of 100, is going to solve our problem.
Okay, what do I have for you?
Well, of course we can use this recipe of residual connections also with our Inception network, and this leads to Inception-ResNet.
And you see that the idea of residual connections is so simple that you can very easily incorporate it into many other architectures.
This is also why we present these couple of architectures here: they are important building blocks towards building really deep networks.
The stuff that works best is really simple.
You can see here that the Inception and ResNet architectures really help you build very powerful networks.
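To make that recipe a bit more concrete, here is a minimal PyTorch sketch of the Inception-ResNet idea: a simplified two-branch, Inception-style block wrapped with a residual connection. The class name, the two branches, and the channel sizes are illustrative assumptions, not the actual blocks from the Inception-ResNet paper.

```python
# Minimal sketch: an Inception-style multi-branch block with a residual connection.
# Layer sizes are hypothetical; assumes an even channel count.
import torch
import torch.nn as nn

class TinyInceptionResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Two simplified parallel branches instead of a full Inception block.
        self.branch1 = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1),
            nn.Conv2d(channels // 2, channels // 2, kernel_size=3, padding=1),
        )
        # 1x1 convolution to bring the concatenated branches back to `channels`.
        self.project = nn.Conv2d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        # Residual connection: add the block input to the projected output.
        return self.relu(x + self.project(out))
```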
And I really like this plot because you can learn a lot from it.
So you see on the y-axis the performance in terms of top-1 accuracy, and on the x-axis the number of operations.
So this is measured in gigaflops.
Also, you see the number of parameters of the models indicated by the diameter of the circle.
Here you can see that VGG16 and VGG19 are at the very far right, so they are computationally very expensive, and their performance is decent, but not as good as that of other models we've seen in this class.
You also see that AlexNet is on the bottom left, so it doesn't require many computations.
In terms of parameters, however, it is quite large, and the performance is not too great.
And now you see that with batch normalization and Network-in-Network, you get better results.
Then there are GoogLeNet and ResNet-18, which have an increased top-1 accuracy.
And we see that we can now go ahead and build deeper models without adding too many new parameters, which helps us build more effective and better-performing networks.
Of course, after some time, we also start increasing the parameter space, and you can see that some of the best performances here are obtained with the Inception-v3 and Inception-v4 networks, or also ResNet-101.
Okay, well, what are other recipes that can help you building better models?
One thing that has worked quite well is increasing the width of residual networks.
So there are wide residual networks: they decrease the depth but increase the width of the residual blocks.
You then also use dropout inside these residual blocks, and it can be shown that a 16-layer-deep network with a similar number of parameters can outperform a 1000-layer-deep network.
So here the power comes not from depth, but from the residual connections and the increased width.
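As a rough illustration, here is a minimal PyTorch sketch of a wide residual block in that spirit: fewer blocks, but each one widened, with dropout between the two 3x3 convolutions. The widening factor, dropout rate, and pre-activation layout are illustrative assumptions, not the exact block from the Wide ResNet paper.

```python
# Minimal sketch of a widened residual block with dropout (hypothetical sizes).
import torch.nn as nn

class WideResidualBlock(nn.Module):
    def __init__(self, in_channels, width_factor=4, dropout=0.3):
        super().__init__()
        wide = in_channels * width_factor  # widen instead of deepen
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, wide, kernel_size=3, padding=1),
            nn.Dropout(p=dropout),          # dropout inside the residual block
            nn.BatchNorm2d(wide), nn.ReLU(inplace=True),
            nn.Conv2d(wide, in_channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection around the wide body
```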
And there is very little theory behind the best solutions that we have at the moment.
There are also architectures like ResNeXt, where all of the previous recipes are combined.
It uses aggregated residual transformations, and you can show that this is actually equivalent to an early concatenation.
So we can replace it with early concatenation, and then the general idea is that you do group convolutions.
So the input and output channels are divided into groups, and the convolutions are performed separately within every group.
This has similar FLOPs and a similar number of parameters as a ResNet bottleneck block, but it is wider and a sparsely connected module.
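As a small illustration of such a grouped convolution, here is a PyTorch sketch; the channel counts and the cardinality of 32 are just example values, not the configuration from the ResNeXt paper.

```python
# Minimal sketch: a grouped convolution, the core operation of a ResNeXt-style block.
# Input and output channels are split into `groups`; each group is convolved independently.
import torch
import torch.nn as nn

grouped_conv = nn.Conv2d(
    in_channels=128,
    out_channels=128,
    kernel_size=3,
    padding=1,
    groups=32,   # 32 independent convolutions over 4 channels each
)

x = torch.randn(1, 128, 56, 56)   # dummy feature maps
print(grouped_conv(x).shape)      # torch.Size([1, 128, 56, 56])
# Compared to a dense 3x3 convolution, the weights shrink from 128*128*9 to 128*4*9.
```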
So this is quite popular. Then of course you can combine this. So you have a ResNet of ResNets.
You can even build in more such connections with DenseNets.
Here you try to connect almost everything with everything: densely connected convolutional neural networks.
They enable feature propagation and feature reuse, and they very much alleviate the vanishing gradient problem.
And with up to 264 layers, you actually need one third fewer parameters for the same performance as ResNet, thanks to the transition layers using 1×1 convolutions.
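To illustrate the dense connectivity and the 1×1 transition layer, here is a minimal PyTorch sketch with made-up channel sizes, growth rate, and layer count; it is only a simplified version of an actual DenseNet block.

```python
# Minimal sketch: a tiny dense block where each layer's output is concatenated with all
# previous feature maps, followed by a 1x1 transition layer that compresses the channels.
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))
            channels += growth_rate  # every layer sees all previous feature maps
        # Transition layer: 1x1 convolution halves the accumulated channels.
        self.transition = nn.Conv2d(channels, channels // 2, kernel_size=1)

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # dense connectivity via concatenation
        return self.transition(x)
```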
Another very nice and interesting idea that I would like to show you is the squeeze-and-excitation network.
This was the ImageNet challenge winner in 2017, with a 2.3% top-5 error.
The idea of the model is that you want to explicitly model the channel interdependencies,
which essentially means that some channels are more relevant than others depending on the content.
For example, if you have dark features, they will not be very interesting when you are trying to look at cars.
So how is this implemented? Well, we add a trainable module that allows rescaling of the channels depending on the input.
So we have the feature maps shown here, and then we have a side branch.
The side branch maps each channel down to a single value, and these values are then multiplied with the corresponding feature maps,
allowing some feature maps to be suppressed and others to be scaled up, depending on the input.
Then we essentially squeeze: we compress each channel into one value by global average pooling.
This is how we construct the feature importance. Then we excite, where we use fully connected layers and a sigmoid function in order to excite only the important channels.
And by the way, this is very similar to the gating we would do in long short-term memory cells, which we'll probably talk about in one of the videos next week.
And these squeeze-and-excitation modules you can essentially write down in five lines of pseudocode.
Then we scale: we scale the input maps with this output. And we can combine this, of course, with most other architectures: with Inception, with ResNet, with ResNeXt.
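Indeed, such a squeeze-and-excitation module is only a few lines of code. The PyTorch sketch below assumes an illustrative reduction ratio of 16 and is not the exact implementation from the paper.

```python
# Minimal sketch: squeeze by global average pooling, excite with two fully connected
# layers and a sigmoid, then rescale the input feature maps channel by channel.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))   # squeeze: one value per channel
        return x * weights.view(b, c, 1, 1)     # excite and scale the feature maps
```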
So there are plenty of different options that we could go for. And to be honest, I don't want to show you yet another architecture here.