Welcome everybody to deep learning. Today we want to look into further common practices, and in particular, in this video, we want to discuss architecture selection and hyperparameter optimization. That is really the pinnacle: you not only learn how to improve on the problem, but you also improve the way the machine improves, and you improve the way it improves itself. And that was my 1987 diploma thesis, which was all about that. Remember, the test data is still in the vault; we are not touching it. However, we need to set our hyperparameters somehow, and you have already seen that there is an enormous number of hyperparameters.
You have the architecture: the number of layers, the number of nodes per layer, and the activation functions. Then you have all the parameters of the optimization: the initialization, the loss function, the optimizer (for example stochastic gradient descent, momentum, or Adam), the learning rate, its decay, and the batch size. And in regularization you have the different regularizers: L2 and L1 penalties, batch normalization, dropout, and so on. You want to somehow figure out all the parameters for those different kinds of procedures.
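To make this concrete, such choices could be collected in a single configuration. This is just a minimal sketch with hypothetical names and values, not a prescribed setup:

```python
# Hypothetical hyperparameter configuration; names and values are
# illustrative only, not recommendations.
config = {
    # architecture
    "num_layers": 4,
    "units_per_layer": 256,
    "activation": "relu",
    # optimization
    "initialization": "he_normal",
    "loss": "cross_entropy",
    "optimizer": "adam",        # e.g. SGD with momentum, or Adam
    "learning_rate": 1e-3,
    "lr_decay": 0.95,
    "batch_size": 64,
    # regularization
    "l2_weight": 1e-4,
    "dropout": 0.5,
    "batch_norm": True,
}
```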
Now let's choose the architecture and loss function. The first step is to think about the problem and the data. What could the features look like? What kind of spatial correlation do you expect? What data augmentation makes sense? How will the classes be distributed? What is important regarding the target application?
Then you start with simple architectures and loss functions, and of course you do your research. Try well-known models first and foremost; they are published, there are so many papers out there, and there is no need to do everything by yourself. One day in the library can save you weeks and months of experimentation. Do the research; it will really save you time. And often they don't just publish a single paper: in the very good papers it is not only the scientific results, they also share source code and sometimes even data. Try to find those papers. This can help you a lot with your own experimentation.
Also because I am lazy, you know. Then you may want to change or adapt the architecture you found in the literature. And if you change something, find good reasons why it is an appropriate change.
There are quite a few papers out there that seem to introduce random changes into the architecture. Later it turns out that the observations they made were essentially random: they were just lucky, or they experimented enough on their own data to get the improvements, and so far those improvements have not held up on other data sets. Typically, there is also a reasonable argument for why a specific change should give an improvement in performance. What is clear to me is that engineers, companies, labs, and grad students will continue to tune architectures and explore all kinds of tweaks to make the current state of the art ever so slightly better.
Next, you want to do your hyperparameter search. Remember the learning rate, its decay, the regularization, dropout, and so on. These have to be tuned, but the networks can take days or weeks to train. So you have to search for these hyperparameters, and we recommend using a log scale: for example, for the learning rate η you would try 0.1, 0.01, and 0.001. You may want to consider grid search or random search.
In grid search you would use equally spaced steps. If you look at reference 2, they have shown that random search really has advantages over grid search: first, it is easier to implement, and second, it explores the parameters that have a strong influence on the result much better. So you may want to look into that and adjust your strategy accordingly.
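As a rough sketch of what this looks like in code, assuming a hypothetical train_and_evaluate helper that trains a model for a few epochs and returns a validation score, log-scale random search could be set up like this:

```python
import math
import random

def sample_log_uniform(low, high):
    """Draw a value uniformly on a log scale between low and high."""
    return 10 ** random.uniform(math.log10(low), math.log10(high))

# Grid search: a fixed set of log-spaced combinations (9 trials).
grid = [(lr, wd) for lr in (1e-1, 1e-2, 1e-3)
                 for wd in (1e-3, 1e-4, 1e-5)]

# Random search with the same budget of 9 trials: every trial draws a
# fresh value of each hyperparameter, so the one that matters most
# (often the learning rate) is probed at 9 distinct values instead of 3.
random_trials = [(sample_log_uniform(1e-4, 1e-1),   # learning rate
                  sample_log_uniform(1e-6, 1e-2))   # weight decay
                 for _ in range(9)]

best = None
for lr, wd in random_trials:
    # train_and_evaluate is a hypothetical helper: trains briefly and
    # returns a validation score.
    score = train_and_evaluate(lr=lr, weight_decay=wd)
    if best is None or score > best[0]:
        best = (score, lr, wd)
```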
Hyperparameters are highly interdependent, so you may want to use a coarse-to-fine search: you optimize on a very coarse scale in the beginning and then make it finer and finer. You may only train the network for a few epochs at first to bring all the hyperparameters into sensible ranges, and then you can refine using random or grid search.
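A coarse-to-fine schedule could look roughly like the following sketch, again assuming the hypothetical train_and_evaluate helper from above and, for brevity, searching only over the learning rate:

```python
import math
import random

def sample_log_uniform(low, high):
    return 10 ** random.uniform(math.log10(low), math.log10(high))

# Coarse stage: wide range, only a few epochs per trial, just to find
# the right order of magnitude for the learning rate.
coarse = [(train_and_evaluate(lr=lr, epochs=3), lr)   # hypothetical helper
          for lr in (sample_log_uniform(1e-6, 1.0) for _ in range(20))]
_, best_lr = max(coarse)

# Fine stage: narrow the range to one order of magnitude around the
# coarse optimum and train longer before committing to a value.
fine = [(train_and_evaluate(lr=lr, epochs=20), lr)
        for lr in (sample_log_uniform(best_lr / 10, best_lr * 10) for _ in range(20))]
_, best_lr = max(fine)
```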
A very common technique that can give you a little boost in performance is ensembling. This is something that can really help you get that additional little bit of performance you still need. So far we have only considered a single classifier, but the idea of ensembling is to use many such classifiers.
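A minimal sketch of such an ensemble, assuming a list of already trained models that each expose a hypothetical predict_proba method returning class probabilities, is simply to average their predictions:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the predicted class distributions of several classifiers
    and decide on the class with the highest averaged probability.
    `models` and their predict_proba(x) method are assumed here;
    each call returns an array of shape (num_samples, num_classes)."""
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return probs.argmax(axis=1)
```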
If we assume n independent classifiers, each of them makes a correct prediction with probability 1 minus p, i.e. it errs with probability p. The probability of seeing exactly k errors is then n choose k times p to the power of k times 1 minus p to the power of n minus k; this is a binomial distribution. So the probability that a majority, meaning more than n over 2 of the classifiers, is wrong is the sum over all k greater than n over 2 of n choose k times p to the power of k times 1 minus p to the power of n minus k.
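You can evaluate this sum directly. The following small sketch assumes a per-classifier error rate of p = 0.3 and shows how the majority-vote error shrinks as the number of classifiers n grows:

```python
from math import comb

def p_majority_wrong(n, p):
    """Probability that more than half of n independent classifiers,
    each wrong with probability p, err at the same time (k > n/2)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Assumed error rate p = 0.3 per classifier:
for n in (1, 5, 11):
    print(n, round(p_majority_wrong(n, 0.3), 3))
# roughly: 1 -> 0.300, 5 -> 0.163, 11 -> 0.078
# the majority of the ensemble errs much less often than a single model.
```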