Welcome back to deep learning. So let's continue with our lecture and we want to talk now about loss and optimization.
There's an old saying, and I don't know who said it first, which goes: there is nothing more practical than a good theory.
Today we want to talk a bit about loss functions and optimization, and I want to look into a couple more optimization problems.
And one of those optimization problems we've actually already seen in the perceptron case.
You remember that we were minimizing a sum over all of the misclassified samples. We chose this formulation because it let us get rid of the sign function
and only look at the samples that were relevant for misclassification. Also note that here we don't use a 0/1 label encoding but −1 and +1,
because this allows us to multiply with the class label, which will then always result in a negative number for all misclassified samples.
We then add a negative sign at the very beginning so that we always end up with a positive value.
And the smaller this positive value is, the smaller our loss will be. So we seek to minimize this function.
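Written out, this perceptron criterion looks as follows (a sketch in the notation I assume from the earlier lecture, with M denoting the set of misclassified samples):

```latex
% Perceptron criterion: sum over the set M of misclassified samples.
% With labels y_m in {-1, +1}, every misclassified sample satisfies
% y_m * y_hat(x_m) < 0, so the leading minus sign makes the loss positive.
L(\mathbf{w}) = - \sum_{m \in M} y_m \, \hat{y}(\mathbf{x}_m),
\qquad \hat{y}(\mathbf{x}_m) = \mathbf{w}^\top \mathbf{x}_m
```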
We don't have the sign function in this criterion because we found an elegant way to formulate this loss function without it.
Now, if the sign function were still in, we would run into problems, because the loss would only count the number of misclassifications
and would not differentiate whether a sample is far away from the decision boundary or close to it.
We would simply be adding up a count, and if you look at the gradient, the gradient would essentially vanish everywhere.
So it's not an easy optimization problem: we don't know in which direction to go, so we can't optimize properly.
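To make this concrete, here is one common way to write the counting loss that would result if the sign function stayed in (the indicator notation is my choice, not taken from the lecture):

```latex
% 0/1 counting loss: the indicator contributes 1 for each misclassified
% sample and 0 otherwise; its gradient with respect to w is zero almost
% everywhere, which is exactly the problem described above.
L_{0/1}(\mathbf{w}) = \sum_{m} \mathbb{1}\left[\, y_m \, \hat{y}(\mathbf{x}_m) < 0 \,\right]
```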
So what did we do about this last time? Well, we relaxed the problem, and there are several ways to relax it.
And one way to go ahead is to include the so-called hinge loss.
Now, with the hinge loss, we can relax this 0/1 function into something that behaves linearly on a large domain.
And the idea is that we essentially use a line that hits the x-axis at one and the y-axis also at one.
And if we do it this way, then we can simply rewrite this using the max function.
So the hinge loss is then a sum over all the samples, where each sample contributes zero if its value is larger than one.
To get there, we rewrite the right-hand part a little:
we take one minus y_m times ŷ(x_m). And here you can see that we get the same behavior as before:
if a sample lies on the wrong side of the boundary, the product y_m · ŷ(x_m) will be negative, and by design it is flipped,
so that we end up with large loss values for misclassified samples.
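Putting these pieces together, the hinge loss can be written as follows (again in the notation I assume from the lecture):

```latex
% Hinge loss: the max clamps every sample with margin y_m * y_hat(x_m) >= 1
% to zero, so only the remaining samples contribute to the sum.
L(\mathbf{w}) = \sum_{m} \max\left(0,\; 1 - y_m \, \hat{y}(\mathbf{x}_m)\right)
```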
So we got rid of the problem of having to find the set M.
Now we can take the full set of samples, because everything that fulfills this constraint will automatically be clamped to zero by the max function.
So it will not influence this loss function. So that's a very interesting way of formulating the same problem.
We implicitly get the situation that we only consider the misclassified samples in this loss function.
And it can be shown that the hinge loss is a convex approximation of the misclassification loss that we considered earlier.
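As a minimal sketch of how this could look in code (a NumPy implementation written for illustration; the function and variable names are my own, not from the lecture):

```python
import numpy as np

def hinge_loss(w, X, y):
    """Hinge loss of a linear classifier y_hat(x) = w^T x.

    w: (n_features,) weight vector
    X: (n_samples, n_features) data matrix
    y: (n_samples,) labels in {-1, +1}
    """
    margins = y * (X @ w)          # y_m * y_hat(x_m) for every sample
    # max(0, 1 - margin) clamps all samples with margin >= 1 to zero,
    # so the misclassified samples are selected implicitly;
    # no explicit set M is needed.
    return np.maximum(0.0, 1.0 - margins).sum()

# Example: one sample classified with margin > 1, one misclassified.
w = np.array([1.0, 0.0])
X = np.array([[2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0])
print(hinge_loss(w, X, y))  # max(0, 1-2) + max(0, 1+0.5) = 0 + 1.5 = 1.5
```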
One big issue with this kind of optimization problem is, of course, the gradient.
This loss function here has a kink: it is not differentiable at the point one.
So it's unclear what the derivative at this point is.
And now you could say, OK, I can't compute the derivative of this function, so I'm doomed.
Luckily, subgradients save the day.
So let's introduce this concept. And in order to do so, we have a look at convex differentiable functions.
Because for those we can say that, at any point x, f(x) is bounded from below by f(x₀) plus the gradient of f at x₀ multiplied with the difference between x and x₀.
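In formulas, this first-order lower bound reads:

```latex
% Tangent (first-order) lower bound of a convex, differentiable f:
f(\mathbf{x}) \;\geq\; f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0)
\quad \text{for all } \mathbf{x}
```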
So let's look at a graph to show this concept.
If you look at this function here, you can see that I can take any point x₀ and compute the gradient, or, in this case, simply the tangent that is constructed by doing so.
And you will see that at any point, the tangent will be a lower bound to the entire function.
It doesn't matter where I take this point: if I follow the tangential direction, I'm always constructing a lower bound.
Now, this kind of definition is much better suited to our purposes.
So let's expand on the gradient and go in the direction of subgradients.
With subgradients, we now define something that keeps this property but is not necessarily a gradient.
So a vector g is a subgradient of a convex function f at some point x₀ if it gives us the same property:
if we take f(x₀) plus g multiplied with the difference between x and x₀, then we always get a lower bound on f(x).
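Formally, the definition reads:

```latex
% g is a subgradient of f at x_0 if the same lower-bound property holds
% with g in place of the gradient:
f(\mathbf{x}) \;\geq\; f(\mathbf{x}_0) + \mathbf{g}^\top (\mathbf{x} - \mathbf{x}_0)
\quad \text{for all } \mathbf{x}
```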
And the nice thing with this is that we essentially can relax the requirement of being able to compute a gradient.
There could be multiple of those g's that fulfill this property.
So this is not unique, and the set of all of these subgradients is then called the subdifferential.
So the subdifferential is the set of subgradients that all fulfill the property above.
And if f is differentiable at x₀, the subdifferential is simply the set containing only the gradient.
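As a concrete example (my own, not from the lecture), consider the scalar hinge function f(z) = max(0, 1 − z). Its subdifferential is:

```latex
% Subdifferential of f(z) = max(0, 1 - z):
\partial f(z) =
\begin{cases}
  \{-1\}  & \text{if } z < 1 \quad \text{(differentiable, slope } -1\text{)} \\
  [-1, 0] & \text{if } z = 1 \quad \text{(the kink: any slope in between)} \\
  \{0\}   & \text{if } z > 1 \quad \text{(differentiable, slope } 0\text{)}
\end{cases}
```

So at the kink, any value in [−1, 0] is a valid subgradient; in practice one often simply picks 0.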