Welcome back to deep learning. So today I want to talk to you about the actual
pooling implementations. Pooling layers are an essential step in many
deep networks. The main idea behind them is that you want to reduce the
dimensionality across the spatial domain. So here we see this small example
where we summarize the information in the green rectangles, the blue rectangles,
the yellow and the red ones to only one value. So we have a two-by-two input that
has to be mapped onto a single value. This of course reduces the number of
required parameters, introduces a hierarchy that allows you to work with
spatial abstraction, and it reduces computational cost and overfitting. But
we need some basic assumptions here. One of them is that the features are
hierarchically structured: by pooling we reduce the output size and introduce
a hierarchy that should be intrinsically present in the signal. We talked
about this earlier: eyes are composed of edges and lines, and faces are
composed of eyes, a nose, and so on. This hierarchy has to be present in
order to make pooling a sensible operation to be
encoded into your network. So here you see a pooling of a three-by-three layer
and here we choose max pooling. So in max pooling only the highest number in the
receptive field will actually be propagated into the output. Obviously we
can also work with striding and the stride typically equals the neighborhood
size such that we get a reduced output dimension. One problem here is of course
that propagating the maximum adds an additional non-linearity, and therefore
we also have to think about how to handle this step in the gradient
computation. What we do is essentially introduce again a subgradient concept,
where we simply route the gradient into the cell that produced the maximum
output. So you could say the winner takes it all.
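To make this concrete, here is a minimal NumPy sketch of two-by-two max pooling with the stride equal to the neighborhood size; the function names and the small example array are illustrations, not part of the lecture material. The backward pass routes the entire incoming gradient to the cell that won the forward pass.

import numpy as np

def max_pool_forward(x, k=2):
    # k-by-k max pooling with stride k: keep only the largest value per window
    H, W = x.shape
    out = np.zeros((H // k, W // k))
    winners = np.zeros((H // k, W // k, 2), dtype=int)  # remember where each maximum came from
    for i in range(H // k):
        for j in range(W // k):
            window = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            out[i, j] = window[r, c]
            winners[i, j] = (i*k + r, j*k + c)
    return out, winners

def max_pool_backward(grad_out, winners, input_shape):
    # subgradient: the winner takes it all, every other cell receives zero
    grad_in = np.zeros(input_shape)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            r, c = winners[i, j]
            grad_in[r, c] += grad_out[i, j]
    return grad_in

x = np.array([[1., 3., 2., 1.],
              [4., 2., 0., 1.],
              [1., 0., 5., 6.],
              [2., 2., 7., 8.]])
y, winners = max_pool_forward(x)                           # y = [[4., 2.], [2., 8.]]
dx = max_pool_backward(np.ones_like(y), winners, x.shape)  # only the four winning cells receive gradient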
Now an alternative to this is average pooling, where we simply compute the
average over the neighborhood. It does not consistently perform better than
max pooling, and in the backpropagation path the error is simply shared in
equal parts and backpropagated to the respective units.
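Continuing the NumPy sketch from above (again just an illustration, not the lecture's code), the backward pass of average pooling distributes the incoming gradient in equal shares of 1/(k*k) over every cell of the neighborhood:

import numpy as np

def avg_pool_backward(grad_out, input_shape, k=2):
    # every cell of a k-by-k window receives an equal 1/(k*k) share of the gradient
    grad_in = np.zeros(input_shape)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            grad_in[i*k:(i+1)*k, j*k:(j+1)*k] += grad_out[i, j] / (k * k)
    return grad_in

dx = avg_pool_backward(np.ones((2, 2)), (4, 4))  # every input cell receives 0.25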
There are many more pooling strategies, like fractional max pooling, Lp
pooling, stochastic pooling, spatial pyramid pooling, and generalized pooling;
there is a whole set of different strategies along these lines. One
alternative that we already talked about is strided convolution.
This became really popular because then you don't have to encode the max pooling
as an additional step, which saves one building block and reduces the number of
computations. Typically people now use strided convolutions with a stride s
greater than 1 in order to implement convolution and pooling in a single step.
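As a rough sketch of this idea, written here with PyTorch purely for illustration (the channel counts are assumptions), a convolution with stride 2 produces the same output resolution as a convolution followed by two-by-two max pooling, but in a single step:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)                       # batch, channels, height, width

# Convolution followed by an explicit max pooling step ...
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
y1 = pool(conv(x))                                   # shape: (1, 32, 16, 16)

# ... versus one strided convolution (s = 2) doing convolution and downsampling at once.
strided = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)
y2 = strided(x)                                      # shape: (1, 32, 16, 16)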
So let's recap what our convolutional networks are doing. We talked about the
convolutions producing feature maps, the pooling reducing the size of the
respective feature maps, then again convolutions, again pooling, until we end
up at an abstract representation, and then we had these fully connected layers
in order to do the classification.
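A minimal PyTorch sketch of this classic pipeline, with the channel counts, the 32-by-32 input size, and the ten-class output chosen only for illustration:

import torch
import torch.nn as nn

classic = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # convolution -> feature maps
    nn.MaxPool2d(2),                                         # pooling -> smaller feature maps
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # again convolution
    nn.MaxPool2d(2),                                         # again pooling
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                               # fully connected classification head
)
logits = classic(torch.randn(1, 3, 32, 32))                  # shape: (1, 10)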
Well, actually we can kick out this last block: we have seen that we can
reformat the final feature maps into the channel direction and then replace
the fully connected layers with one-by-one convolutions that produce our final
classification. So we reduce the number of building blocks further, and we
don't even need the fully connected layers anymore. Everything then becomes
fully convolutional, and we can express essentially the entire chain of
operations by convolutions and pooling steps. The nice thing about using the
one-by-one convolutions is that if you combine them with something that is
called global average pooling, then you can essentially also process input
images of arbitrary size.
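Here is a minimal sketch of such a fully convolutional head, again in PyTorch and with 128 feature channels and ten classes assumed only for illustration: the one-by-one convolution produces class scores at every spatial position, and global average pooling collapses whatever spatial extent is left, so the input size no longer matters.

import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Conv2d(128, 10, kernel_size=1),   # one-by-one convolution replaces the fully connected layer
    nn.AdaptiveAvgPool2d(1),             # global average pooling over the remaining spatial grid
    nn.Flatten(),                        # (N, 10, 1, 1) -> (N, 10)
)

# The same head handles feature maps of different spatial sizes.
print(head(torch.randn(1, 128, 7, 7)).shape)    # torch.Size([1, 10])
print(head(torch.randn(1, 128, 13, 9)).shape)   # torch.Size([1, 10])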
Deep Learning - Activations, Convolutions, and Pooling Part 4
This video presents max and average pooling, introduces the concept of fully convolutional networks, and hints on how this is used to build deep networks.