Only, plausible as it is, this is not really what we want, because minimizing the error rate is not really what we want.
Not every error is equal.
Think about spam classification.
It's much worse if you lose that important business email that invites you to some great opportunity, and you never even know.
That might mean lost opportunities, a really hard hit.
Whereas if you have to look at one more spam message, okay, it's not nice, but we can survive.
So different errors may be coupled to different changes in utility of the whole situation.
If you remember, we want to behave rationally.
So we want to really maximize expected utility.
So every error actually has to be weighed with respect to utility.
And machine learning, of course, knows that.
And so there's a way of dealing with these, and that's really via what we call loss functions.
A loss function really measures: if you make an error, how much utility do you lose?
In its most general form, we have a function L(x, y, ŷ) for the situation where the real value at point x is y, and we make an error by predicting ŷ instead.
So L(x, y, ŷ) measures the utility lost by predicting ŷ at point x instead of y.
If you have a utility function, that's easy, right? Because if you know the utilities of (x, y) and (x, ŷ), then you can just compute this.
Think of Y and Y hat being states.
Very often it's the case that this loss function is independent of the particular example,
and then we just leave out the first argument and have a loss function L(y, ŷ).
And that's usually what we do in models at least.
For instance, with the spam example: if we have a spam message and classify it as ham, the user has to read one more email
and find out it's not relevant; that's a minor annoyance, say a loss of one.
And if we do it the other way around and drop a message that is actually legitimate,
then we assign a much higher value, say 10, for some suitable value.
It really depends a lot on the email.
This model assumption of leaving out the x here is wrong.
But a model that has to look into the actual text of the message is difficult to build, kind of AI-complete.
And so this is a reasonable model.
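As a sketch, this spam example can be written as a small loss matrix. The values 0, 1, and 10 come from the discussion above; the table layout and names are just one way to model it:

```python
# Loss matrix for the spam example: keys are (true class y, predicted class y_hat).
# Zero on the diagonal, a small penalty for showing one more spam message,
# a much larger one for dropping a legitimate email.
LOSS = {
    ("spam", "spam"): 0,
    ("spam", "ham"): 1,    # spam shown to the user: minor annoyance
    ("ham", "ham"): 0,
    ("ham", "spam"): 10,   # legitimate mail dropped: much worse
}

def loss(y, y_hat):
    """Utility loss for predicting y_hat when the true class is y."""
    return LOSS[(y, y_hat)]
```

Note how the matrix is asymmetric: swapping the two error directions changes the loss by a factor of ten, which is exactly the point of moving beyond plain error rates.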
So where does that lead us?
Well, first a couple of things.
First, this function obviously has zeros on the diagonal:
if your prediction is exactly correct, there is no utility loss.
And then you can go and then look at various ways of measuring loss.
One is to take the absolute value of the difference, which gives you a kind of Manhattan distance, if you think about it.
You can take the square of the difference; in multiple dimensions, that kind of gives you the Euclidean distance.
And then, of course, there is the zero-one loss, where you basically say a loss is a loss:
there is an error, and we don't distinguish errors in terms of utility.
And that basically gives you back the error rate.
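These three standard losses can be sketched directly for numeric y and ŷ (a minimal illustration, not tied to any particular library):

```python
def absolute_loss(y, y_hat):
    # L1 loss: absolute difference; summed over dimensions this
    # is the Manhattan distance.
    return abs(y - y_hat)

def squared_loss(y, y_hat):
    # L2 loss: squared difference; summed over dimensions this
    # is the squared Euclidean distance.
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    # 0/1 loss: every error counts the same. Averaged over a data
    # set, this gives back the error rate.
    return 0 if y == y_hat else 1
```

All three are zero on the diagonal (y == y_hat); they differ only in how strongly they penalize being off by a lot versus a little.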
So if you think about it: if you want to maximize expected utility,
you want to minimize the expected utility loss over all input/output pairs.
So what do we do? Very simple.
We think of the pairs as random variables, because we haven't seen them yet.
So we basically have a probability distribution over the input/output values.
And then we can compute the expected loss of examples we haven't seen yet, as long as we know their prior.
We can compute that just by summing up over all pairs.
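This sum over all pairs can be sketched as follows. The distribution, the probabilities, and the names `expected_loss` and `always_ham` are invented here for illustration; only the loss values 1 and 10 come from the spam example:

```python
def expected_loss(dist, loss, predictor):
    """Expected utility loss: the sum over all (x, y) pairs of
    P(x, y) * L(y, predictor(x))."""
    return sum(p * loss(y, predictor(x)) for (x, y), p in dist.items())

# Toy distribution over input/output pairs (probabilities invented):
dist = {("mail1", "spam"): 0.6, ("mail2", "ham"): 0.4}

# Spam loss values from the example above:
spam_loss = {("spam", "spam"): 0, ("spam", "ham"): 1,
             ("ham", "ham"): 0, ("ham", "spam"): 10}

def always_ham(x):
    # A (bad) classifier that never flags anything as spam.
    return "ham"

risk = expected_loss(dist, lambda y, y_hat: spam_loss[(y, y_hat)], always_ham)
# 0.6 * 1 + 0.4 * 0 = 0.6
```

Minimizing this quantity over predictors is exactly the "minimize expected utility loss" objective; with the zero-one loss plugged in, it reduces to minimizing the expected error rate.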
Definition of different loss functions and the concept of regularization. Also, the Minimal Description Length and the scale of Machine Learning are discussed.