11 - Deep Learning - Plain Version 2020 [ID:21055]

Welcome back to deep learning. So let's continue with our lecture. Today we want to talk a bit

about loss functions and optimization. I want to look into a couple of more optimization problems

and one of the optimization problems that we've already seen is the perceptron case.

You remember that we were minimizing a sum over all of the misclassified samples. Here we were

choosing this because we could somehow get rid of the sign function and only look into the samples

that were relevant for misclassification. Also note that here we don't use 0/1 labels but minus 1/plus 1 labels, because this allows us to multiply with the class label. This will then always result in a negative number for misclassified samples. Then we add the negative sign in the very beginning

such that we always end up with a positive value. The smaller this positive value the smaller our

loss will be. So we seek to minimize this function.
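Written out as a formula (in my own notation, with M the set of misclassified samples, y_m the label from {-1, +1}, and ŷ_m the real-valued model output), the perceptron criterion described here reads:

$$\hat{L}(M) = -\sum_{m \in M} y_m \, \hat{y}_m$$

For a misclassified sample, y_m and ŷ_m have opposite signs, so every summand is negative and the leading minus sign turns the total into a positive value.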

We don't have the sign function in this criterion because we found an elegant way to formulate this loss function without it. Now if it were still in, we would run into problems because this would count only the number of misclassifications and we would

not differentiate whether it's far away from the decision boundary or close to the decision boundary.

We could simply end up with a count. If we look at the gradient it would essentially vanish everywhere.

So it's not an easy optimization problem. We don't know in which direction to go, so we can't find a good descent direction. What did we do about this last time? Well, we somehow need to relax this, and there

are also ways to fix this. One way to go ahead is to include the so-called hinge loss. Now with

the hinge loss we can relax this zero one function into something that behaves linearly on a large

domain. The idea is that we essentially use a line that hits the x-axis at one and the y-axis also

at one. If we do it this way we can simply rewrite this using the max function. So the hinge loss is

then a sum over all the samples, where each term is zero if our value is larger than one. So we have to rewrite the right-hand part to reformulate this a little: here we take one minus y subscript m times y hat.
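As a formula (again in my notation, with N the total number of training samples), the hinge loss just described reads:

$$L = \sum_{m=1}^{N} \max\left(0,\; 1 - y_m \, \hat{y}_m\right)$$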

Here you can see that we will have the same constraint. If label and prediction lie on opposite sides of the boundary, the product y subscript m times y hat will be negative, and by design it will of course be flipped such that we end up with large values for misclassified samples. We got rid of the problem of

having to find the set of misclassifications capital M. Now we can take the full set of samples by using

this max function because everything that will fulfill this constraint will automatically be

clamped to zero. So it will not influence this loss function. Thus it's a very interesting way of

formulating the same problem. Implicitly, we only consider the misclassified samples in this loss function.
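Purely as an illustration (this is my own sketch, not code from the lecture; the function name and array shapes are assumptions), a minimal NumPy version of this formulation could look like this:

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Sum of max(0, 1 - y_m * y_hat_m) over all samples.

    y     : labels in {-1, +1}, shape (N,)
    y_hat : real-valued model outputs, shape (N,)
    """
    margins = 1.0 - y * y_hat
    # Correctly classified samples beyond the margin yield negative margins
    # and are clamped to zero, so they do not influence the loss.
    return np.sum(np.maximum(0.0, margins))

# Small example: the second sample is misclassified and dominates the loss.
y = np.array([1.0, -1.0, 1.0])
y_hat = np.array([2.0, 0.5, 0.9])
print(hinge_loss(y, y_hat))  # 0.0 + 1.5 + 0.1 = 1.6
```

Note that no explicit set of misclassified samples is needed; the max does the selection implicitly.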

It can be shown that the hinge loss is a convex approximation of the misclassification loss that we considered earlier. One big thing about this kind of optimization problem is of course the gradient. The loss function here has a kink, so the derivative is not continuous at the point x equals one. It is unclear what the derivative is there, and now you could say: okay, I can't compute the derivative of this function, so I'm doomed. Luckily, subgradients save the day.

So let's introduce this concept and in order to do so we have a look at convex differentiable functions.

On those we can say that, at any point x zero, we can find a lower bound of f of x that is given by f of x zero plus the gradient of f at x zero, multiplied with the difference between x and x zero.
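In my notation, this is the standard first-order lower bound for a convex, differentiable function f:

$$f(\mathbf{x}) \;\geq\; f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) \quad \text{for all } \mathbf{x}$$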

So let's look at a graph to show this concept. If you look at this function here, you can see that I can take any point x zero and compute the gradient, or in this case

it's simply the tangent that is constructed. By doing so you will see that any point of the

tangent will be a lower bound to the entire function. It doesn't matter where I take this point: if I follow the tangent in its direction, I'm always constructing a lower bound. Now this kind

of definition is much more suited for us. So let's expand now on the gradient and go into

the direction of subgradients. With subgradients, we define something which keeps this property but is not necessarily a gradient. So a vector g is a subgradient of a convex function f at x zero if we have the same property: if we follow the subgradient direction and multiply it with the difference between x and x zero, then we always have a lower bound. The nice thing with this is

that we essentially can relax the requirement of being able to compute a gradient. There could be

multiple of those g's that fulfill this property so g is not required to be unique. The set of all

these subgradients is then called the subdifferential. So the subdifferential is then a set of subgradients
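As a small illustration of how this resolves the kink problem (my own sketch, not code from the lecture): away from the kink, the subgradient of max(0, 1 - x) is simply the ordinary derivative, and at x = 1 any value in the interval [-1, 0] is a valid subgradient, so we may just pick one of them.

```python
def hinge_subgradient(x):
    """Return one subgradient of f(x) = max(0, 1 - x).

    For x < 1 the function is 1 - x, so the derivative is -1.
    For x > 1 the function is 0, so the derivative is 0.
    At the kink x = 1 the subdifferential is the whole interval [-1, 0];
    any value from it satisfies the lower-bound property, here we pick 0.
    """
    return -1.0 if x < 1.0 else 0.0
```

Choosing any element of the subdifferential at the kink is enough to perform gradient-style updates on the hinge loss.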

Part of a video series:
Accessible via: Open Access
Duration: 00:18:52 min
Recording date: 2020-10-10
Uploaded on: 2020-10-10 14:06:19
Language: en-US

Deep Learning - Loss and Optimization Part 2

This video explains the hinge loss and its relation to support vector machines. The hinge loss makes it possible to embed optimization constraints into loss functions.

For reminders to watch new videos, follow on Twitter or LinkedIn.

Further Reading:
A gentle Introduction to Deep Learning
User Loss
