12 - Deep Learning - Loss and Optimization Part 3 [ID:14187]

So welcome everybody to our next lecture on deep learning. Today we want to talk about optimization. So far we have one search technique, which is just local search: gradient descent. Now let's have a look at the gradient descent methods in a bit more detail. We have seen that gradient descent essentially minimizes the empirical risk. In this figure you see that in each iteration we take one step towards the local minimum, scaled by the predefined learning rate eta. The gradient is of course computed with respect to every sample, and the procedure is then guaranteed to converge to a local minimum.
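Written out, the update that this figure illustrates is just the plain gradient descent step on the empirical risk; the notation below (weights w, learning rate eta, loss L, network f, M training pairs) is the usual one and is not copied verbatim from the slides:

$$
\hat{R}(\mathbf{w}) \;=\; \frac{1}{M}\sum_{i=1}^{M} L\big(f(\mathbf{x}_i;\mathbf{w}),\,y_i\big),
\qquad
\mathbf{w}^{(k+1)} \;=\; \mathbf{w}^{(k)} \;-\; \eta\,\nabla_{\mathbf{w}}\hat{R}\big(\mathbf{w}^{(k)}\big)
$$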

The stuff that works best is really simple. Now of course this means that for every iteration we have to use all M samples; this is called batch gradient descent. So for every update, in every iteration, you have to look at all M samples, and M may be really large, in particular if you look at big-data computer vision problems such as image recognition.
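As a minimal sketch of what "all M samples per update" means in code, here is batch gradient descent for a simple linear least-squares model in NumPy; the model, data, and function name are illustrative assumptions and not taken from the lecture:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=100):
    """Batch gradient descent: every single update uses all M samples."""
    M, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # gradient of the empirical risk (mean squared error) over ALL M samples
        grad = 2.0 / M * X.T @ (X @ w - y)
        w = w - eta * grad  # one step with the predefined learning rate eta
    return w

# toy usage: M = 1000 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=1000)
w_hat = batch_gradient_descent(X, y)
print(w_hat)  # should end up close to [1.0, -2.0, 0.5]
```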

This is of course the preferred option for convex optimization problems, because there we have a guarantee that we find the global minimum and every update is guaranteed to decrease the error. For non-convex problems we lose that guarantee anyway, and we may also run into serious memory limitations. This is why people often prefer alternatives such as stochastic (online) gradient descent, SGD: here you use just one sample and then immediately update. This no longer necessarily decreases the empirical risk in every iteration, and it may also be quite inefficient because of the transfer latency to the graphics processing unit. However, since each update needs only a single sample, many such updates can be carried out in parallel, so the approach is highly parallelizable.
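A minimal sketch of the single-sample variant, reusing the same illustrative linear least-squares setup as above (again an assumption, not code from the lecture): every visited sample immediately triggers an update, which is why the empirical risk is not guaranteed to decrease at every step.

```python
import numpy as np

def sgd_single_sample(X, y, eta=0.01, n_epochs=10, seed=0):
    """Stochastic (online) gradient descent: update immediately after every sample."""
    rng = np.random.default_rng(seed)
    M, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(M):      # visit the samples in random order
            residual = X[i] @ w - y[i]
            grad = 2.0 * residual * X[i]  # gradient of the loss of this ONE sample
            w = w - eta * grad            # immediate update, no averaging
    return w
```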

A compromise between the two is mini-batch stochastic gradient descent. Here you use B samples, where B may be much smaller than the size of the entire training data set; you choose them randomly from the training data and evaluate the gradient on this subset, which is then called a mini-batch. Such a mini-batch can be evaluated really quickly, and you may also think about parallelization approaches, because you can process several mini-batches in parallel and then just combine their gradients in a weighted sum for the update.
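A minimal sketch of mini-batch SGD in the same illustrative setup (function and parameter names are assumptions, not code from the lecture): a random subset of size B is drawn, the gradient is evaluated on that subset only, and the combination of several independently computed mini-batch gradients is indicated in the comment.

```python
import numpy as np

def minibatch_sgd(X, y, B=32, eta=0.05, n_iters=500, seed=0):
    """Mini-batch SGD: each update uses B randomly chosen samples, with B << M."""
    rng = np.random.default_rng(seed)
    M, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        idx = rng.choice(M, size=B, replace=False)  # random mini-batch of size B
        Xb, yb = X[idx], y[idx]
        grad = 2.0 / B * Xb.T @ (Xb @ w - yb)       # gradient on the mini-batch only
        # several such mini-batch gradients could be computed in parallel (e.g. on
        # different devices) and combined in a weighted sum before this update
        w = w - eta * grad
    return w
```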

Small batches are also useful because they offer a kind of regularization effect, which typically means that smaller values of eta are sufficient. Mini-batch stochastic gradient descent also regains efficiency, and it is the standard case in deep learning: a lot of people work with mini-batch stochastic gradient descent.

The question is how this can even work at all, since our optimization problem is non-convex and there is an exponential number of local minima. Interesting papers from 2015 and 2014 [2, 3] show that the networks we typically work with are high-dimensional functions with many local minima, but the interesting thing is that those local minima lie close to the global minimum, and many of them are actually equivalent. What is probably more of a problem are saddle points. Another point is that a local minimum might even be better than the global minimum: the global minimum is attained on the training set, but in the end you want to apply your network to a test data set that may be different, and a global minimum on your training data may correspond to an overfit. This may even be worse for the generalization of the trained network. And that was my 1987 diploma thesis, which was all about that. There is one more possible answer, a paper from 2016 [10], which suggests over-provisioning: there are essentially many different ways in which a network can approximate the desired relationship, and you essentially

Part of a video series
Access: Open Access
Duration: 00:23:03 min
Recording date: 2020-04-26
Uploaded on: 2020-04-26 20:06:08
Language: en-US

Deep Learning - Loss and Optimization Part 3

This video discusses details on optimization and different options in the gradient descent procedure, such as momentum and ADAM.

Video References:
Lex Fridman's Channel

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[2] Anna Choromanska, Mikael Henaff, Michael Mathieu, et al. “The Loss Surfaces of Multilayer Networks.” In: AISTATS. 2015.
[3] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, et al. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”. In: Advances in neural information processing systems. 2014, pp. 2933–2941.
[4] Yichuan Tang. “Deep learning using linear support vector machines”. In: arXiv preprint arXiv:1306.0239 (2013).
[5] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. “On the Convergence of Adam and Beyond”. In: International Conference on Learning Representations. 2018.
[6] Katarzyna Janocha and Wojciech Marian Czarnecki. “On Loss Functions for Deep Neural Networks in Classification”. In: arXiv preprint arXiv:1702.05659 (2017).
[7] Jeffrey Dean, Greg Corrado, Rajat Monga, et al. “Large scale distributed deep networks”. In: Advances in neural information processing systems. 2012, pp. 1223–1231.
[8] Maren Mahsereci and Philipp Hennig. “Probabilistic line searches for stochastic optimization”. In: Advances In Neural Information Processing Systems. 2015, pp. 181–189.
[9] Jason Weston, Chris Watkins, et al. “Support vector machines for multi-class pattern recognition.” In: ESANN. Vol. 99. 1999, pp. 219–224.
[10] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).

Further Reading:
A gentle Introduction to Deep Learning

Tags

Optimization, loss, artificial intelligence, deep learning, machine learning, pattern recognition, Feedforward Networks, Gradient descent