Welcome everybody to our next lecture on deep learning. Today we want to talk about optimization, and in particular we want to look at the gradient descent methods in a bit more detail. We have seen that gradient descent essentially minimizes the empirical risk: in the figure you see that we take one step after another towards the local minimum, with a predefined learning rate η. The gradient is computed with respect to every sample, and the procedure is then guaranteed to converge to a local minimum. Of course this means that in every iteration, i.e. for every update, we have to look at all m samples, which is called batch gradient descent. These m samples may be really large, in particular if you look at big-data computer vision problems and so on. Batch gradient descent is the preferred option for convex optimization problems, because there we have the guarantee that we find the global minimum and that every update decreases the error. For non-convex problems we lose this guarantee anyway, and we may also run into serious memory limitations.
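As a minimal sketch of this update rule (assuming a hypothetical helper grad_loss(w, X, y) that returns the gradient of the empirical risk over the given samples), one run of batch gradient descent could look like this:

```python
def batch_gradient_descent(w, X, y, grad_loss, eta=0.01, n_iters=100):
    """Batch gradient descent: every update looks at all m samples."""
    for _ in range(n_iters):
        g = grad_loss(w, X, y)  # gradient of the empirical risk over all m samples
        w = w - eta * g         # step of size eta in the negative gradient direction
    return w
```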
This is why people often prefer alternatives such as stochastic (online) gradient descent, SGD. Here we use just a single sample and then immediately update. This no longer necessarily decreases the empirical risk in every iteration, and it can also be quite inefficient, because transferring individual samples to the graphics processing unit introduces a lot of latency. On the other hand, if every update only needs a single sample, many updates can be carried out in parallel, so the scheme is highly parallelizable.
A compromise between the two is mini-batch stochastic gradient descent. Here you draw B samples at random from the training set, where B is typically much smaller than the entire training data set, and you evaluate the gradient only on this subset, the so-called mini-batch. A mini-batch can be evaluated really quickly, and you can also think about parallelization approaches and so on, because several mini-batches can be processed in parallel and their gradients combined in a weighted sum for the update. Small batches are also useful because they offer a kind of regularization effect, and smaller values of η are then typically sufficient. At the same time you regain efficiency, which is why mini-batch stochastic gradient descent is the standard case in deep learning and what most people work with.
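A possible sketch of the mini-batch update loop, again assuming the same hypothetical grad_loss helper, could look like this; setting batch_size=1 recovers plain SGD, and batch_size=m recovers batch gradient descent:

```python
import numpy as np

def minibatch_sgd(w, X, y, grad_loss, eta=0.001, batch_size=32, n_epochs=10):
    """Mini-batch SGD: each update uses B randomly chosen samples."""
    m = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                   # reshuffle the training set once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]  # indices of the current mini-batch B
            g = grad_loss(w, X[batch], y[batch])   # gradient evaluated on the subset only
            w = w - eta * g
    return w
```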
Now the question is how this can even work: our optimization problems are non-convex, so there is an exponential number of local minima. There are interesting papers from 2015 and 2014 [2, 3] showing that the networks we typically work with are high-dimensional functions with many local minima, but the interesting thing is that those local minima are close in value to the global minimum, and many of them are actually equivalent. What is probably more of a problem are saddle points. Another point is that a local minimum might even be better than the global minimum: the global minimum is attained on your training set, but in the end you want to apply your network to a test data set that may be different, and a global minimum on the training data may correspond to an overfit. This may even be worse for the generalization of the trained network.
There is one more possible answer, given in a paper from 2016 [10], which suggests over-provisioning: there are many different ways in which a network can approximate the desired relationship, and you only need to find one of them; you do not need to find all of them, a single one is sufficient. This has also been investigated experimentally with random labels. The idea is that you randomize the labels, i.e. you do not use the original ones but randomly assign arbitrary classes, and if the network you are experimenting with can still fit the problem, then it is simply creating an overfit.
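A rough sketch of such a random-label experiment (the train and accuracy helpers are placeholders for whatever model and training loop you actually use) might look like this:

```python
import numpy as np

def random_label_experiment(X, y, num_classes, train, accuracy):
    """Replace the true labels with random ones and check whether the
    network can still drive the training error to zero (pure memorization)."""
    rng = np.random.default_rng(0)
    y_random = rng.integers(0, num_classes, size=y.shape)  # discard the true labels
    model = train(X, y_random)                             # same architecture, same training procedure
    return accuracy(model, X, y_random)                    # close to 1.0 => the network is overfitting
```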
Now let's have a look at the choice of η. We have already seen that with a learning rate that is too small we may stop even before we reach convergence, and with a learning rate that is too large we may end up jumping back and forth without ever finding the local minimum; only with an appropriate learning rate will you be able to find your minimum. Ideally, when you are far away from the minimum you want to take big steps, and the closer you get to the minimum, the smaller the steps should be. So in practice you work with a decay of the learning rate: you adapt η gradually, for example you start with 0.01 and then divide by 10 every X epochs, and this helps to bring you really close to the minimum.
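As a small sketch, such a step-decay schedule (the starting value of 0.01 and the decay interval of 10 epochs are just example choices) could be written as:

```python
def step_decay(epoch, eta0=0.01, drop_every=10, factor=10.0):
    """Divide the learning rate by `factor` every `drop_every` epochs."""
    return eta0 / (factor ** (epoch // drop_every))

# Example: eta = 0.01 for epochs 0-9, 0.001 for epochs 10-19, 0.0001 afterwards.
```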
Deep Learning - Loss and Optimization Part 3
This video discusses details on optimization and different options in the gradient descent procedure, such as momentum and ADAM.
References
[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[2] Anna Choromanska, Mikael Henaff, Michael Mathieu, et al. “The Loss Surfaces of Multilayer Networks.” In: AISTATS. 2015.
[3] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, et al. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”. In: Advances in neural information processing systems. 2014, pp. 2933–2941.
[4] Yichuan Tang. “Deep learning using linear support vector machines”. In: arXiv preprint arXiv:1306.0239 (2013).
[5] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. “On the Convergence of Adam and Beyond”. In: International Conference on Learning Representations. 2018.
[6] Katarzyna Janocha and Wojciech Marian Czarnecki. “On Loss Functions for Deep Neural Networks in Classification”. In: arXiv preprint arXiv:1702.05659 (2017).
[7] Jeffrey Dean, Greg Corrado, Rajat Monga, et al. “Large scale distributed deep networks”. In: Advances in neural information processing systems. 2012, pp. 1223–1231.
[8] Maren Mahsereci and Philipp Hennig. “Probabilistic line searches for stochastic optimization”. In: Advances In Neural Information Processing Systems. 2015, pp. 181–189.
[9] Jason Weston, Chris Watkins, et al. “Support vector machines for multi-class pattern recognition.” In: ESANN. Vol. 99. 1999, pp. 219–224.
[10] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).
Further Reading:
A gentle Introduction to Deep Learning