Welcome everybody to our next lecture on deep learning. Today we want to talk about optimization, and in particular we want to look at the gradient descent methods in a bit more detail. We have seen that gradient descent essentially minimizes the empirical risk: in the figure you see that we take one step after another towards the local minimum, with a predefined learning rate η. The gradient is computed with respect to every sample, and the procedure is then guaranteed to converge to a local minimum. Of course this means that in every iteration, i.e. for every update, we have to look at all m samples, which is called batch gradient descent. These m samples may be really large, in particular if you look at big-data computer vision problems and so on. Batch gradient descent is the preferred option for convex optimization problems, because there we have the guarantee that we find the global minimum and that every update decreases the error. For non-convex problems we lose this guarantee anyway, and we may also run into serious memory limitations.
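As a minimal sketch of this update rule (assuming a hypothetical helper grad_loss(w, X, y) that returns the gradient of the empirical risk over the given samples), one run of batch gradient descent could look like this:

```python
def batch_gradient_descent(w, X, y, grad_loss, eta=0.01, n_iters=100):
    """Batch gradient descent: every update looks at all m samples."""
    for _ in range(n_iters):
        g = grad_loss(w, X, y)  # gradient of the empirical risk over all m samples
        w = w - eta * g         # step of size eta in the negative gradient direction
    return w
```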
This is why people often prefer alternatives such as stochastic (online) gradient descent, SGD. Here we use just a single sample and then immediately update. This no longer necessarily decreases the empirical risk in every iteration, and it can also be quite inefficient, because transferring individual samples to the graphics processing unit introduces a lot of latency. On the other hand, if every update only needs a single sample, many updates can be carried out in parallel, so the scheme is highly parallelizable.
A compromise between the two is mini-batch stochastic gradient descent. Here you draw B samples at random from the training set, where B is typically much smaller than the entire training data set, and you evaluate the gradient only on this subset, the so-called mini-batch. A mini-batch can be evaluated really quickly, and you can also think about parallelization approaches and so on, because several mini-batches can be processed in parallel and their gradients combined in a weighted sum for the update. Small batches are also useful because they offer a kind of regularization effect, and smaller values of η are then typically sufficient. At the same time you regain efficiency, which is why mini-batch stochastic gradient descent is the standard case in deep learning and what most people work with.
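A possible sketch of the mini-batch update loop, again assuming the same hypothetical grad_loss helper, could look like this; setting batch_size=1 recovers plain SGD, and batch_size=m recovers batch gradient descent:

```python
import numpy as np

def minibatch_sgd(w, X, y, grad_loss, eta=0.001, batch_size=32, n_epochs=10):
    """Mini-batch SGD: each update uses B randomly chosen samples."""
    m = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                   # reshuffle the training set once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]  # indices of the current mini-batch B
            g = grad_loss(w, X[batch], y[batch])   # gradient evaluated on the subset only
            w = w - eta * g
    return w
```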
Now the question is how this can even work: our optimization problems are non-convex, so there is an exponential number of local minima. There are interesting papers from 2015 and 2014 [2, 3] showing that the networks we typically work with are high-dimensional functions with many local minima, but the interesting thing is that those local minima are close in value to the global minimum, and many of them are actually equivalent. What is probably more of a problem are saddle points. Another point is that a local minimum might even be better than the global minimum: the global minimum is attained on your training set, but in the end you want to apply your network to a test data set that may be different, and a global minimum on the training data may correspond to an overfit. This may even be worse for the generalization of the trained network.
There is one more possible answer, given in a paper from 2016 [10], which suggests over-provisioning: there are many different ways in which a network can approximate the desired relationship, and you only need to find one of them; you do not need to find all of them, a single one is sufficient. This has also been investigated experimentally with random labels. The idea is that you randomize the labels, i.e. you do not use the original ones but randomly assign arbitrary classes, and if the network you are experimenting with can still fit the problem, then it is simply creating an overfit.
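A rough sketch of such a random-label experiment (the train and accuracy helpers are placeholders for whatever model and training loop you actually use) might look like this:

```python
import numpy as np

def random_label_experiment(X, y, num_classes, train, accuracy):
    """Replace the true labels with random ones and check whether the
    network can still drive the training error to zero (pure memorization)."""
    rng = np.random.default_rng(0)
    y_random = rng.integers(0, num_classes, size=y.shape)  # discard the true labels
    model = train(X, y_random)                             # same architecture, same training procedure
    return accuracy(model, X, y_random)                    # close to 1.0 => the network is overfitting
```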
Now let's have a look at the choice of η. We have already seen that with a learning rate that is too small we may stop even before we reach convergence, and with a learning rate that is too large we may end up jumping back and forth without ever finding the local minimum; only with an appropriate learning rate will you be able to find your minimum. Ideally, when you are far away from the minimum you want to take big steps, and the closer you get to the minimum, the smaller the steps should be. So in practice you work with a decay of the learning rate: you adapt η gradually, for example you start with 0.01 and then divide by 10 every X epochs, and this helps to bring you really close to the minimum.
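As a small sketch, such a step-decay schedule (the starting value of 0.01 and the decay interval of 10 epochs are just example choices) could be written as:

```python
def step_decay(epoch, eta0=0.01, drop_every=10, factor=10.0):
    """Divide the learning rate by `factor` every `drop_every` epochs."""
    return eta0 / (factor ** (epoch // drop_every))

# Example: eta = 0.01 for epochs 0-9, 0.001 for epochs 10-19, 0.0001 afterwards.
```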
Deep Learning - Loss and Optimization Part 3
This video discusses details on optimization and different options in the gradient descent procedure, such as momentum and ADAM.
References
[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[2] Anna Choromanska, Mikael Henaff, Michael Mathieu, et al. “The Loss Surfaces of Multilayer Networks.” In: AISTATS. 2015.
[3] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, et al. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”. In: Advances in neural information processing systems. 2014, pp. 2933–2941.
[4] Yichuan Tang. “Deep learning using linear support vector machines”. In: arXiv preprint arXiv:1306.0239 (2013).
[5] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. “On the Convergence of Adam and Beyond”. In: International Conference on Learning Representations. 2018.
[6] Katarzyna Janocha and Wojciech Marian Czarnecki. “On Loss Functions for Deep Neural Networks in Classification”. In: arXiv preprint arXiv:1702.05659 (2017).
[7] Jeffrey Dean, Greg Corrado, Rajat Monga, et al. “Large scale distributed deep networks”. In: Advances in neural information processing systems. 2012, pp. 1223–1231.
[8] Maren Mahsereci and Philipp Hennig. “Probabilistic line searches for stochastic optimization”. In: Advances In Neural Information Processing Systems. 2015, pp. 181–189.
[9] Jason Weston, Chris Watkins, et al. “Support vector machines for multi-class pattern recognition.” In: ESANN. Vol. 99. 1999, pp. 219–224.
[10] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).
Further Reading:
A gentle Introduction to Deep Learning