12 - Deep Learning - Plain Version 2020 [ID:21059]

Welcome everybody to deep learning. Today we want to look into optimization: we want to check out different versions of gradient descent procedures and how to deal with the different sets of parameters. We have seen that gradient descent essentially optimizes the empirical risk. In this figure you see that we take one step at a time towards the local minimum.

We have a predefined learning rate eta, the gradient is of course computed over every sample, and the procedure is then guaranteed to converge to a local minimum. Of course this means that in every iteration we have to use all samples, which is called batch gradient descent. So in every iteration, for every single update, you have to look at all samples, and that may be really many samples, in particular if you look at big-data computer vision problems. This is of course the preferred option for convex optimization problems, because there we have the guarantee that we find the global minimum: every update is guaranteed to decrease the error.
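
To make the batch update concrete, here is a minimal NumPy sketch. It uses a plain linear least-squares risk as a stand-in for the empirical risk of a network; the function name, the loss, and the default values are illustrative assumptions, not the lecture's own implementation.

import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=100):
    # Batch gradient descent: every single update uses the gradient of the
    # empirical risk computed over ALL n_samples training samples.
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        residuals = X @ w - y                # predictions minus targets, all samples
        grad = X.T @ residuals / n_samples   # gradient of the mean squared error
        w -= eta * grad                      # fixed step of size eta
    return w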

Of course, for non-convex problems this guarantee is lost anyway. Also, we may run into memory limitations.

This is why people prefer other approaches such as stochastic gradient descent. Here, you use just one sample and then immediately update. This no longer necessarily decreases the empirical risk in every iteration, and it may also be very inefficient because of the latency of transfers to the graphics processing unit. However, if you just use one sample, many things can be done in parallel, so it is highly parallelizable. A compromise between the two is the so-called mini-batch stochastic gradient descent. Here you use B random samples, where B may be a number much smaller than the size of the entire training data set, and you choose them randomly from the entire training data set. Then you evaluate the gradient on this subset B, which is then called a mini-batch.

Now this mini-batch can be evaluated really quickly, and you may also use parallelization approaches because you can do several mini-batch steps in parallel; then you just compute the weighted sum and update. Small batches are also useful because they offer a kind of regularization effect. This then typically results in a smaller eta: if you use mini-batch gradient descent, typically smaller values of eta are sufficient, and it also regains efficiency. This is the standard case in deep learning.
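
A hedged sketch of the mini-batch variant, again on the illustrative least-squares risk: each update only touches a random subset of size B. With batch_size=1 this reduces to plain stochastic gradient descent, and with batch_size equal to the data set size it becomes batch gradient descent.

import numpy as np

def minibatch_sgd(X, y, eta=0.01, batch_size=32, n_epochs=10, seed=0):
    # Mini-batch stochastic gradient descent: every update uses the gradient
    # evaluated on a random subset (mini-batch) of the training data only.
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)           # new random sample order each epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]    # the mini-batch B
            residuals = X[idx] @ w - y[idx]
            grad = X[idx].T @ residuals / len(idx)   # gradient on the subset only
            w -= eta * grad
    return w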

So a lot of people work with this, which means that gradient descent is effective. But the question is: how can this even work? Our optimization problem is non-convex and there is an exponential number of local minima. There is an interesting paper from 2015 [2] which showed that the loss functions we are typically working with are high-dimensional functions with many local minima. The interesting thing is that those local minima are close to the global minimum, and many of them are actually equivalent. What is probably more of a problem are saddle points. Also, the local minima might even be better than the global minimum, because the global minimum is attained on your training set, but in the end you want to apply your network to a test data set that may be different. A global minimum on your training data set may actually be related to an overfit, and may even be worse for the generalization of the trained network.

One more possible answer to this comes from a paper from 2016. The authors suggest over-provisioning: there are many different ways in which a network can approximate the desired relationship, and you essentially just need to find one of them. You don't need to find all of them; a single one is sufficient. Zhang et al. [10] verified this experimentally with random-label experiments. Here the idea is that you randomize the labels and do not use the original ones: you just randomly assign arbitrary classes. If your network is then still able to fit the training data, it is producing an overfit, because the labels don't contain any information at all.
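
The randomization step itself is simple; here is a minimal sketch of the idea (the class count, sample count and the training step referred to in the comments are made-up placeholders, not the experiment's actual setup):

import numpy as np

rng = np.random.default_rng(0)
num_classes, num_samples = 10, 1000      # hypothetical 10-class problem

# Replace every label by a class drawn uniformly at random, so the labels
# carry no information about the inputs whatsoever.
y_random = rng.integers(0, num_classes, size=num_samples)

# If training the network on (X, y_random) still reaches close to 100%
# training accuracy, the model has simply memorized the data, i.e. overfit.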

Let's have a look at the choice of eta. We have already seen that with a too small learning rate, we may stop even before we reach convergence. With a too large learning rate, we may end up jumping back and forth and not even find the local minimum. Only with an appropriate learning rate will you be able to find the minimum. Actually, when you are far away from the minimum you want to be able to make big steps, and the closer you get to the minimum, the smaller the steps you want to make. To do so in practice, you work with a decay of the learning rate: you adapt your eta gradually, so you start with a larger learning rate and then decrease it over the course of the training.
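
Two common decay schedules are sketched below; the concrete constants are placeholder assumptions and not values prescribed by the lecture.

def inverse_time_decay(eta0, t, k=0.01):
    # Inverse-time decay: eta0 / (1 + k * t) yields large steps early in
    # training and increasingly small steps as the iteration count t grows.
    return eta0 / (1.0 + k * t)

def step_decay(eta0, epoch, drop=0.5, every=10):
    # Step decay: multiply the learning rate by `drop` every `every` epochs.
    return eta0 * drop ** (epoch // every)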

Part of a video series:
Accessible via: Open Access
Duration: 00:22:33 min
Recording date: 2020-10-11
Uploaded on: 2020-10-11 16:56:20
Language: en-US

Deep Learning - Loss and Optimization Part 3

This video discusses details of optimization and different options in the gradient descent procedure, such as momentum and ADAM.

For reminders to watch the new videos, follow on Twitter or LinkedIn.

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[2] Anna Choromanska, Mikael Henaff, Michael Mathieu, et al. “The Loss Surfaces of Multilayer Networks.” In: AISTATS. 2015.
[3] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, et al. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”. In: Advances in neural information processing systems. 2014, pp. 2933–2941.
[4] Yichuan Tang. “Deep learning using linear support vector machines”. In: arXiv preprint arXiv:1306.0239 (2013).
[5] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. “On the Convergence of Adam and Beyond”. In: International Conference on Learning Representations. 2018.
[6] Katarzyna Janocha and Wojciech Marian Czarnecki. “On Loss Functions for Deep Neural Networks in Classification”. In: arXiv preprint arXiv:1702.05659 (2017).
[7] Jeffrey Dean, Greg Corrado, Rajat Monga, et al. “Large scale distributed deep networks”. In: Advances in neural information processing systems. 2012, pp. 1223–1231.
[8] Maren Mahsereci and Philipp Hennig. “Probabilistic line searches for stochastic optimization”. In: Advances In Neural Information Processing Systems. 2015, pp. 181–189.
[9] Jason Weston, Chris Watkins, et al. “Support vector machines for multi-class pattern recognition.” In: ESANN. Vol. 99. 1999, pp. 219–224.
[10] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).

Further Reading:
A gentle Introduction to Deep Learning
