25.4. Using Information Theory (Part 1)

Now, how do we actually operationalize that?

What's the math behind this?

And the answer comes from what we call information theory.

Or in a way, the theory of surprisingness.

So the intuition that we have is that information answers questions.

So that's one side of it.

The other side is... well, let me put it the other way around.

If I already know what the answer to a question is when I ask it, I'm not actually getting a lot of information.

So in the questionnaire, I'm not actually getting any information.

Because I already know the information beforehand, there's nothing new for me.

I gain nothing.

You may gain something, which is why I'm doing it, but I gain nothing.

So the idea is that information gain has something to do with what you know before and what you know afterwards.

And information is when you actually gain knowledge.

So if we want to measure this, then we restrict ourselves initially to Boolean questions.

Is it raining or is the sun shining?

I flip a coin, heads or tails.

And the thing you want to realize is that the amount of information you're gaining really depends on the priors.

So flipping an unloaded coin, a normal coin, which has a prior of one half, one half, you're actually gaining one bit.

Because beforehand, you were clueless, you knew nothing, and you know zero or one afterwards.

If we toss a loaded coin, which lands on heads, what did I say, 99% of the time,

then you're getting much less information out of this coin toss, because you knew much, much more beforehand.

You knew it's going to be heads 99% of the time.

And if it turns out to be heads, you're getting roughly one hundredth of a bit.

And if it's a fully loaded coin that always lands on heads, you're actually not getting any information.
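To put numbers to this (on the standard convention that an outcome with prior probability p carries -log2(p) bits of surprise, which is the measure being alluded to here):

\[ -\log_2(0.5) = 1\ \text{bit}, \qquad -\log_2(0.99) \approx 0.0145 \approx \tfrac{1}{100}\ \text{bit}, \qquad -\log_2(1) = 0\ \text{bits}. \]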

So we need to have a function for measuring bits based on the priors that tells me how much information I'm getting out of the answer to this question.

And the conventional answer here is to have an information function over an n-fold distribution.

It's just the weighted sum over the binary logarithms of the probabilities,

where you weight them with the negative probabilities themselves.

And this is constructed such that you get the number one for an unloaded coin and the number zero for a completely determined situation.
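Written out, the function being described is the entropy of a finite distribution (writing it as I here):

\[ I(\langle P_1,\dots,P_n\rangle) \;=\; \sum_{i=1}^{n} -P_i \cdot \log_2 P_i, \qquad I(\langle \tfrac12,\tfrac12\rangle) = 1, \quad I(\langle 1,0\rangle) = 0. \]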

Okay, let's check that. So if we have a fair coin toss, then we have the distribution one half, one half.

Then the entropy of one half, one half is the sum of minus one half times the binary log of one half, taken twice, and the binary log of one half comes out to minus one.

So each summand is minus one half times minus one, which is plus one half,

and one half plus one half gives us one bit.

Okay, and for the loaded coin, we have much less than one bit.
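As a quick sanity check of these numbers, here is a minimal Python sketch of that entropy function (the 99%/1% prior for the loaded coin is the one from the example above):

```python
import math

def entropy(distribution):
    """Entropy in bits of a finite probability distribution (list of priors)."""
    return sum(-p * math.log2(p) for p in distribution if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([0.99, 0.01]))  # loaded coin: ~0.08 bits, much less than one
print(entropy([1.0, 0.0]))    # fully loaded coin: 0.0 bits, no information
```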

Okay, so if you apply that here, then what you want is to prefer the attributes that give you a lot of information.

Okay, and we can now measure that.

We have a three-tuple, two over twelve, which is one over six, four over twelve, which is one over three, and a half.

That doesn't add up to one. Where's the problem?

No, it actually does add up to one. Yes, I'm sorry.
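Indeed:

\[ \tfrac{2}{12} + \tfrac{4}{12} + \tfrac{6}{12} \;=\; \tfrac{1}{6} + \tfrac{1}{3} + \tfrac{1}{2} \;=\; 1. \]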

Privately, I don't really believe in numbers greater than three, so twelve is out of my territory.

So this one has much better entropy than that one.

And that is something we can measure. We can actually go back to our examples and look at the finite probability distributions of those.

And for each of these attributes, we can measure the entropy and use the entropy as the notion of "best" here.

So choose attribute means from my remaining example set, compute the entropy of every attribute, and then choose the best one.

The most discriminating one, the one that looks like this.

And that actually gives us a full test.
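To make that choose-attribute step concrete, here is a minimal sketch in Python, reusing the entropy function from the snippet above. It is only one plausible reading of "compute the entropy of every attribute": each attribute is scored by the expected entropy of the positive/negative labels left over after splitting on it, and the attribute with the lowest score wins. Examples are assumed, for illustration, to be pairs of an attribute-value dictionary and a Boolean label.

```python
def class_entropy(examples):
    """Entropy (in bits) of the positive/negative label distribution."""
    if not examples:
        return 0.0
    pos = sum(1 for _, label in examples if label)
    return entropy([pos / len(examples), (len(examples) - pos) / len(examples)])

def expected_entropy(attribute, examples):
    """Expected label entropy remaining after splitting the examples on the attribute."""
    remainder = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [(attrs, label) for attrs, label in examples if attrs[attribute] == value]
        remainder += len(subset) / len(examples) * class_entropy(subset)
    return remainder

def choose_attribute(attributes, examples):
    """Choose the attribute whose split leaves the least remaining entropy."""
    return min(attributes, key=lambda a: expected_entropy(a, examples))
```

The attribute returned is the one whose split separates positives from negatives best, which is the "most discriminating" one mentioned above.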

So let's look at the details.

So say we have p positive and n negative examples at the root; then the distribution is, of course, p over (p plus n) and n over (p plus n).
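In the notation of the entropy formula from before, the information still needed at the root is then:

\[ I\left(\left\langle \frac{p}{p+n}, \frac{n}{p+n}\right\rangle\right) \;=\; -\frac{p}{p+n}\log_2\frac{p}{p+n} \;-\; \frac{n}{p+n}\log_2\frac{n}{p+n}. \]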

