So welcome everybody to our next lecture on deep learning. Today we want to talk about optimization. So far, our search technique is just local search: gradient descent. Now let's have a look at the gradient descent methods in a little bit more detail. We've seen that gradient descent essentially minimizes the empirical risk. In this figure you see that we take one step after the other towards the local minimum, using a predefined learning rate eta. The gradient is of course computed with respect to every sample, and this is then guaranteed to converge to a local minimum.
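To make the update concrete (the notation here is my own shorthand, not taken from the slides), one gradient descent step with learning rate eta on the empirical risk over all M training samples reads:

$$
w^{(k+1)} = w^{(k)} - \eta \, \nabla_w \hat{L}\big(w^{(k)}\big),
\qquad
\hat{L}(w) = \frac{1}{M} \sum_{m=1}^{M} \ell\big(f(x_m; w),\, y_m\big)
$$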
The stuff that works best is really simple. Now of course, computing the gradient over every sample means that in each iteration we have to use all M samples, and this is called batch gradient descent. So for every update you have to look at all M samples, and M may be really large, in particular in big-data computer vision problems, say image recognition or something like that. Batch gradient descent is of course the preferred option for convex optimization problems, because there we have the guarantee that we find the global minimum, and every update is guaranteed to decrease the error. For non-convex problems we have a problem anyway, and we may run into really big memory limitations.
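As a minimal sketch of batch gradient descent (a toy linear least-squares problem of my own, not code from the lecture; the data, model, and learning rate are illustrative assumptions):

```python
import numpy as np

# Toy regression problem: M samples, d features (illustrative assumption)
rng = np.random.default_rng(0)
M, d = 1000, 5
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

w = np.zeros(d)   # model parameters
eta = 0.1         # learning rate

for iteration in range(100):
    # Batch gradient: every update touches ALL M samples
    grad = X.T @ (X @ w - y) / M
    w -= eta * grad
```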
This is why people tend to prefer other approaches, like stochastic online gradient descent, SGD. Here you use just a single sample and then immediately update. This no longer necessarily decreases the empirical risk in every iteration, and it may also be very inefficient because of the latency of the transfer to the graphics processing unit; however, if you use just one sample, you can do many things in parallel, so it is highly parallelizable.
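A corresponding sketch of stochastic gradient descent on the same kind of toy problem (again my own illustration, not the lecture's code): one randomly drawn sample per update.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 1000, 5
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

w = np.zeros(d)
eta = 0.01   # smaller step, since single-sample gradients are noisy

for step in range(10000):
    m = rng.integers(M)                 # pick ONE sample at random
    grad = (X[m] @ w - y[m]) * X[m]     # gradient of the per-sample squared loss
    w -= eta * grad                     # immediate update
```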
A compromise between the two is mini-batch stochastic gradient descent. Here you use B samples, where B may be a number much smaller than the size of the entire training data set; you choose them randomly from the training data and then evaluate the gradient on this subset, which is called a mini-batch. A mini-batch can be evaluated really quickly, and you can also think about parallelization approaches: you can process several mini-batches in parallel and then just take the weighted sum of their gradients for the update. Small batches are also useful because they offer a kind of regularization effect; this typically goes along with smaller learning rates, so with mini-batch stochastic gradient descent smaller values of eta are usually sufficient, and you also regain efficiency. This is the standard case in deep learning, so a lot of people work with mini-batch stochastic gradient descent.
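A sketch of the mini-batch variant under the same toy assumptions (mine, for illustration): draw B samples at random, evaluate the gradient on that subset only, and update.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, B = 1000, 5, 32    # mini-batch size B is much smaller than M
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

w = np.zeros(d)
eta = 0.05

for step in range(2000):
    idx = rng.choice(M, size=B, replace=False)   # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / B              # gradient on the subset only
    w -= eta * grad
```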
The question is: how can this even work, given that our optimization problem is non-convex and there is an exponential number of local minima? There are interesting papers from 2015 and 2014 [2, 3] which show that the networks we typically work with are high-dimensional functions with many local minima, but the interesting observation is that those local minima lie close to the global minimum, and many of them are actually equivalent. What is probably more of a problem are saddle points. Another point is that a local minimum might even be better than the global minimum: the global minimum is attained on the training set, but in the end you want to apply your network to a test data set that may be different, and a global minimum on your training data set may correspond to an overfit. So maybe this is even worse for the generalization of the trained network. And that was my 1987 diploma thesis, which was all about that.
There is one more possible answer, and this is a paper from 2016 [10] suggesting over-provisioning: there are essentially many different ways in which a network can approximate the desired relationship, and you essentially only need to find one of them.
Deep Learning - Loss and Optimization Part 3
This video discusses details on optimization and different options in gradient descent procedure such as momentum and ADAM.
Video References:
Lex Fridman's Channel
References
[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[2] Anna Choromanska, Mikael Henaff, Michael Mathieu, et al. “The Loss Surfaces of Multilayer Networks.” In: AISTATS. 2015.
[3] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, et al. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”. In: Advances in neural information processing systems. 2014, pp. 2933–2941.
[4] Yichuan Tang. “Deep learning using linear support vector machines”. In: arXiv preprint arXiv:1306.0239 (2013).
[5] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. “On the Convergence of Adam and Beyond”. In: International Conference on Learning Representations. 2018.
[6] Katarzyna Janocha and Wojciech Marian Czarnecki. “On Loss Functions for Deep Neural Networks in Classification”. In: arXiv preprint arXiv:1702.05659 (2017).
[7] Jeffrey Dean, Greg Corrado, Rajat Monga, et al. “Large scale distributed deep networks”. In: Advances in neural information processing systems. 2012, pp. 1223–1231.
[8] Maren Mahsereci and Philipp Hennig. “Probabilistic line searches for stochastic optimization”. In: Advances in neural information processing systems. 2015, pp. 181–189.
[9] Jason Weston, Chris Watkins, et al. “Support vector machines for multi-class pattern recognition.” In: ESANN. Vol. 99. 1999, pp. 219–224.
[10] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).
Further Reading:
A gentle Introduction to Deep Learning