11 - Deep Learning - Loss and Optimization Part 2 [ID:16880]

Welcome back to deep learning. So let's continue with our lecture and we want to talk now about loss and optimization.

Today we want to talk a bit about loss functions and optimization, and I want to look into a couple more optimization problems.

And one of those optimization problems we've actually already seen in the perceptron case.

You remember that there we were minimizing a sum over all of the misclassified samples

and we chose this formulation because this way we could get rid of the sign function

and only look into the samples that were relevant for misclassification

and also note that here we don't use 0/1 labels but -1/+1 labels.

Because this allows us to multiply with the class label, this will then always result in a negative number for all misclassified samples,

and then we put a negative sign in front such that we always end up with a positive value,

and the smaller this positive value is, the smaller our loss will be, so we seek to minimize this function.
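Written out (in my notation, assuming the linear output $\hat{y}_m = \mathbf{w}^\top \mathbf{x}_m$ from the perceptron lecture), this criterion reads

$$ L_P(\mathbf{w}) = -\sum_{m \in \mathcal{M}} y_m\,\hat{y}_m = -\sum_{m \in \mathcal{M}} y_m\,\mathbf{w}^\top \mathbf{x}_m, $$

where $\mathcal{M}$ is the set of misclassified samples and $y_m \in \{-1, +1\}$, so every summand is positive.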

We don't have the sign function in this criterion because we found an elegant way to formulate this loss function without the sign function.

Now if the sign function were still in there, we would run into problems, because then we would only count the number of misclassifications

and we would not differentiate whether a sample is far away from the decision boundary or close to it.

We would simply add up a count, and if you look at the gradient, it would essentially vanish everywhere.

So it's not an easy optimization problem: we don't know in which direction to go, so we can't find a good solution.
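For reference, one way to write this counting loss (my notation, with $M$ the total number of samples and labels in $\{-1, +1\}$) is

$$ L_{0/1} = \sum_{m=1}^{M} \tfrac{1}{2}\,\bigl|\, y_m - \operatorname{sign}(\hat{y}_m) \,\bigr|, $$

which contributes 1 for every misclassified sample and 0 otherwise, and whose gradient with respect to the parameters is zero almost everywhere.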

What did we do about this last time? Well, we somehow relaxed this, and there are several ways to relax it,

and one way to go ahead is to include the so-called hinge loss.

Now with the hinge loss we can relax this 0/1 function into something that behaves linearly over a large domain

and the idea is that we essentially use a line, a line that hits the x-axis at 1 and the y-axis also at 1

and if we do it this way then we can simply rewrite this using the max function

so the hinge loss is then a sum over all samples, where a sample essentially contributes 0 if its value is larger than 1.

So we have to rewrite the right-hand part and reformulate it a little.

We take 1 minus y_m times y hat, and here you can see that we have the same constraint.

If a sample lies on the wrong side of the boundary, this product will be negative, and by design it will of course be flipped by the minus sign,

so that we end up with large loss values for the misclassified samples.

We got rid of the problem of having to find the set M of misclassified samples.

Now we can take the full set of samples by using this max function because everything that will fulfill this constraint

will automatically be clamped to 0 so it will not influence this loss function.
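Putting this together (my notation again), the hinge loss reads

$$ L_{\text{hinge}} = \sum_{m=1}^{M} \max\bigl(0,\; 1 - y_m\,\hat{y}_m\bigr), $$

where the sum now runs over all M samples: every sample with $y_m\,\hat{y}_m \geq 1$ is clamped to zero by the max and does not contribute.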

So that's a very interesting way of formulating the same problem.

We implicitly get the situation that we only consider the misclassified samples in this loss function,

and you could say, or rather it can be shown, that the hinge loss is a convex approximation of the misclassification loss that we considered earlier.

One big issue with this kind of optimization problem is of course the gradient,

and this loss function here has a kink: the derivative is not continuous at the point 1.

So it's unclear what the derivative at the point 1 is, and now you could say: okay, I can't compute the derivative of this function, so I'm doomed.

Luckily subgradients save the day.

So let's introduce this concept and in order to do so we have a look at convex differentiable functions

because for those we can say that at any point x0 we can essentially find a lower bound of f of x

that is given by f of x0 plus the gradient at x0 multiplied with the difference between x and x0.
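In formulas (my notation), this lower-bound property of a convex differentiable function f is

$$ f(\mathbf{x}) \;\geq\; f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) \quad \text{for all } \mathbf{x}. $$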

So let's look at a graph to show this concept.

If you look at this function here you can see that I can take any point x0 and compute the gradient

or in this case it's simply the tangent that is constructed by doing so

and you will see that at any point the tangent will be a lower bound to the entire function.

It doesn't matter where I take this point: if I follow the tangent direction, I'm always constructing a lower bound.

Now this kind of definition is much better suited to our purposes.

So let's expand now on the gradient and go into the direction of subgradients.

With subgradients, we now define something that keeps this property but is not necessarily a gradient.

So a vector g is a subgradient of a convex function f at some point x0 if we have the same property.

So if we add the subgradient multiplied with the difference between x and x0 to f of x0, we again always have a lower bound.
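Written out (same notation as above), g is a subgradient of the convex function f at x0 if

$$ f(\mathbf{x}) \;\geq\; f(\mathbf{x}_0) + \mathbf{g}^\top (\mathbf{x} - \mathbf{x}_0) \quad \text{for all } \mathbf{x}. $$

As a small example of my own: for the hinge term $f(z) = \max(0, 1 - z)$, every $g \in [-1, 0]$ is a valid subgradient at the kink $z = 1$.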

And the nice thing about this is that we can essentially relax the requirement of being able to compute a gradient.

There could be multiple of those g's that fulfill this property.
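To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture) that computes the hinge loss of a linear model and one valid subgradient, and then takes a single subgradient-descent step:

```python
import numpy as np

def hinge_loss_and_subgradient(w, X, y):
    # Illustrative sketch: linear model y_hat = X @ w, labels y in {-1, +1}.
    scores = X @ w                      # y_hat_m for every sample
    margins = 1.0 - y * scores          # 1 - y_m * y_hat_m
    loss = np.maximum(0.0, margins).sum()

    # One valid subgradient: -y_m * x_m where the margin is violated (margin > 0),
    # zero elsewhere; at the kink (margin == 0) we simply pick zero,
    # which is one admissible choice from the subdifferential.
    active = (margins > 0).astype(float)
    subgrad = -(active * y) @ X
    return loss, subgrad

# Toy data and one subgradient-descent step with a fixed step size.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
w = np.zeros(3)

loss, g = hinge_loss_and_subgradient(w, X, y)
w = w - 0.1 * g
```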
