Welcome everybody to deep learning. Today we want to look into further common practices, and in particular, in this video, we want to discuss architecture selection and hyperparameter optimization. That is really the pinnacle: you not only learn how to improve on the problem, but you also improve the way the machine improves, and you improve the way it improves itself. And that was my 1987 diploma thesis, which was all about that. Remember, the test data is still in the vault; we are not touching it. However, we need to set our hyperparameters somehow, and you have already seen that there is an enormous number of hyperparameters.
You have the architecture: the number of layers, the number of nodes per layer, and the activation functions. Then you have all the parameters of the optimization: the initialization, the loss function, the optimizer (for example stochastic gradient descent, momentum, or Adam), the learning rate, its decay, and the batch size. And in regularization you have the different regularizers: L2 and L1 penalties, batch normalization, dropout, and so on. You want to somehow figure out all the parameters for those different kinds of procedures.
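To make this concrete, such choices could be collected in a single configuration. This is just a minimal sketch with hypothetical names and values, not a prescribed setup:

```python
# Hypothetical hyperparameter configuration; names and values are
# illustrative only, not recommendations.
config = {
    # architecture
    "num_layers": 4,
    "units_per_layer": 256,
    "activation": "relu",
    # optimization
    "initialization": "he_normal",
    "loss": "cross_entropy",
    "optimizer": "adam",        # e.g. SGD with momentum, or Adam
    "learning_rate": 1e-3,
    "lr_decay": 0.95,
    "batch_size": 64,
    # regularization
    "l2_weight": 1e-4,
    "dropout": 0.5,
    "batch_norm": True,
}
```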
Now let's choose the architecture and loss function. The first step is to think about the problem and the data. What could the features look like? What kind of spatial correlation do you expect? What data augmentation makes sense? How will the classes be distributed? What is important regarding the target application?
Then you start with simple architectures and loss functions, and of course you do your research. Try well-known models first and foremost; they are published, there are so many papers out there, and there is no need to do everything by yourself. One day in the library can save you weeks and months of experimentation. Do the research; it will really save you time. And often they don't just publish a single paper: in the very good papers it is not only the scientific results, they also share source code and sometimes even data. Try to find those papers. This can help you a lot with your own experimentation.
Also because I am lazy, you know. Then you may want to change or adapt the architecture you found in the literature. And if you change something, find good reasons why it is an appropriate change.
There are quite a few papers out there that seem to introduce random changes into the architecture. Later it turns out that the observations they made were essentially random: they were just lucky, or they experimented enough on their own data to get the improvements, and so far those improvements have not held up on other data sets. Typically, there is also a reasonable argument for why a specific change should give an improvement in performance. What is clear to me is that engineers, companies, labs, and grad students will continue to tune architectures and explore all kinds of tweaks to make the current state of the art ever so slightly better.
Next, you want to do your hyperparameter search. Remember the learning rate, its decay, the regularization, dropout, and so on. These have to be tuned, but the networks can take days or weeks to train. So you have to search for these hyperparameters, and we recommend using a log scale: for example, for the learning rate η you would try 0.1, 0.01, and 0.001. You may want to consider grid search or random search.
In grid search you would use equally spaced steps. If you look at reference 2, they have shown that random search really has advantages over grid search: first, it is easier to implement, and second, it explores the parameters that have a strong influence on the result much better. So you may want to look into that and adjust your strategy accordingly.
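As a rough sketch of what this looks like in code, assuming a hypothetical train_and_evaluate helper that trains a model for a few epochs and returns a validation score, log-scale random search could be set up like this:

```python
import math
import random

def sample_log_uniform(low, high):
    """Draw a value uniformly on a log scale between low and high."""
    return 10 ** random.uniform(math.log10(low), math.log10(high))

# Grid search: a fixed set of log-spaced combinations (9 trials).
grid = [(lr, wd) for lr in (1e-1, 1e-2, 1e-3)
                 for wd in (1e-3, 1e-4, 1e-5)]

# Random search with the same budget of 9 trials: every trial draws a
# fresh value of each hyperparameter, so the one that matters most
# (often the learning rate) is probed at 9 distinct values instead of 3.
random_trials = [(sample_log_uniform(1e-4, 1e-1),   # learning rate
                  sample_log_uniform(1e-6, 1e-2))   # weight decay
                 for _ in range(9)]

best = None
for lr, wd in random_trials:
    # train_and_evaluate is a hypothetical helper: trains briefly and
    # returns a validation score.
    score = train_and_evaluate(lr=lr, weight_decay=wd)
    if best is None or score > best[0]:
        best = (score, lr, wd)
```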
Hyperparameters are highly interdependent, so you may want to use a coarse-to-fine search: you optimize on a very coarse scale in the beginning and then make it finer and finer. You may only train the network for a few epochs at first to bring all the hyperparameters into sensible ranges, and then you can refine using random or grid search.
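A coarse-to-fine schedule could look roughly like the following sketch, again assuming the hypothetical train_and_evaluate helper from above and, for brevity, searching only over the learning rate:

```python
import math
import random

def sample_log_uniform(low, high):
    return 10 ** random.uniform(math.log10(low), math.log10(high))

# Coarse stage: wide range, only a few epochs per trial, just to find
# the right order of magnitude for the learning rate.
coarse = [(train_and_evaluate(lr=lr, epochs=3), lr)   # hypothetical helper
          for lr in (sample_log_uniform(1e-6, 1.0) for _ in range(20))]
_, best_lr = max(coarse)

# Fine stage: narrow the range to one order of magnitude around the
# coarse optimum and train longer before committing to a value.
fine = [(train_and_evaluate(lr=lr, epochs=20), lr)
        for lr in (sample_log_uniform(best_lr / 10, best_lr * 10) for _ in range(20))]
_, best_lr = max(fine)
```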
A very common technique that can give you a little boost in performance is ensembling. This is something that can really help you get that additional little bit of performance you still need. So far we have only considered a single classifier, but the idea of ensembling is to use many such classifiers.
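A minimal sketch of such an ensemble, assuming a list of already trained models that each expose a hypothetical predict_proba method returning class probabilities, is simply to average their predictions:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the predicted class distributions of several classifiers
    and decide on the class with the highest averaged probability.
    `models` and their predict_proba(x) method are assumed here;
    each call returns an array of shape (num_samples, num_classes)."""
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return probs.argmax(axis=1)
```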
If we assume n independent classifiers, each of them makes a correct prediction with probability 1 minus p, i.e. it errs with probability p. The probability of seeing exactly k errors is then n choose k times p to the power of k times 1 minus p to the power of n minus k; this is a binomial distribution. So the probability that a majority, meaning more than n over 2 of the classifiers, is wrong is the sum over all k greater than n over 2 of n choose k times p to the power of k times 1 minus p to the power of n minus k.
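You can evaluate this sum directly. The following small sketch assumes a per-classifier error rate of p = 0.3 and shows how the majority-vote error shrinks as the number of classifiers n grows:

```python
from math import comb

def p_majority_wrong(n, p):
    """Probability that more than half of n independent classifiers,
    each wrong with probability p, err at the same time (k > n/2)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Assumed error rate p = 0.3 per classifier:
for n in (1, 5, 11):
    print(n, round(p_majority_wrong(n, 0.3), 3))
# roughly: 1 -> 0.300, 5 -> 0.163, 11 -> 0.078
# the majority of the ensemble errs much less often than a single model.
```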