13 - Pattern Recognition [PR] - PR 10 [ID:22124]

Welcome back to Pattern Recognition. So today we want to look into a simpler classification method that is called the Naive Bayes Classifier.

The Naive Bayes Classifier makes quite a few simplifying assumptions. Still, it is widely and successfully used, and it often even outperforms much more advanced classifiers. It can be appropriate in the presence of high-dimensional features: if your features live in a really high-dimensional space, then because of the curse of dimensionality and the sparsity of training observations it may make sense to simplify your model with Naive Bayes. Sometimes it is even called Idiot's Bayes.

So let's look at the problem that it tries to tackle. Typically, we can factorize the class-dependent probability density function as follows. We have our observations x, conditioned on the class, and we can rewrite the vector x in its components, so we have the observations x1 to xd, all given the class y. Now I can factorize this further, which means that I can compute the class-conditional probability for x1 and then multiply with the respective remaining probabilities, and you can see that I can apply the same trick again. Then x2 is conditioned on y and x1, and we can expand this into the following product. So you see how we build this up step by step, and you see that we get all the different interdependencies. This is essentially nothing else than constructing a full covariance matrix if you consider, for example, the Gaussian case.
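Written out in the notation used here (feature vector x with components x1 to xd, class y), this chain-rule factorization reads:

\[
p(\boldsymbol{x} \mid y) = p(x_1, \ldots, x_d \mid y)
= p(x_1 \mid y)\, p(x_2 \mid x_1, y) \cdots p(x_d \mid x_1, \ldots, x_{d-1}, y)
= \prod_{i=1}^{d} p(x_i \mid x_1, \ldots, x_{i-1}, y).
\]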

So what do we do in Naive Bayes? Well, Naive Bayes makes a very strong assumption: it naively assumes independence of the dimensions. All components of the feature vector x are assumed to be mutually independent, which means that we can rewrite the class-conditional probability of x simply as the product over the individual dimensions of x. If we now apply this in Bayes' rule, you see that we still want to maximize the posterior probability with respect to y. We have seen that we can drop the prior of x, because the maximization over y does not depend on x, so this part is not considered here. We can then break everything down to the prior of y times the component-wise class-conditional probabilities.
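In formulas, the naive independence assumption and the resulting decision rule are:

\[
p(\boldsymbol{x} \mid y) = \prod_{i=1}^{d} p(x_i \mid y)
\qquad\Longrightarrow\qquad
\hat{y} = \operatorname*{argmax}_{y}\, p(y \mid \boldsymbol{x})
= \operatorname*{argmax}_{y}\, p(y) \prod_{i=1}^{d} p(x_i \mid y),
\]

where the evidence p(x) has been dropped because it does not depend on y.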

This is a fairly simple assumption, so why would we want to do that? Well, let's go back to our Gaussian. If we describe a 100-dimensional feature vector x that lives in a 100-dimensional space, and for class y it is normally distributed with all components mutually dependent, then we need a mean vector with 100 entries and a covariance matrix of dimension 100 times 100. This is fairly big. You can simplify this a little further, because the covariance matrix does not have full degrees of freedom: it is symmetric, so some of its entries appear twice and effectively only a triangular part has to be estimated. This means we have 100 unknowns in the mean vector plus 100 times 101 over 2 unknowns in the covariance matrix, which gives a total of 5150 unknowns. Now let's assume instead that the components are mutually independent. Then we still need a mean vector with 100 components, but the covariance matrix breaks down, and we only have to estimate a single variance for every component of the vector. This is a much simpler model, and it brings us down to only 100 plus 100 unknowns that need to be estimated, which is quite a reduction in the number of parameters.
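In numbers, for d = 100 dimensions and one Gaussian per class:

\[
\underbrace{d}_{\text{mean}} + \underbrace{\tfrac{d(d+1)}{2}}_{\text{covariance}} = 100 + 5050 = 5150
\qquad \text{vs.} \qquad
\underbrace{d}_{\text{mean}} + \underbrace{d}_{\text{variances}} = 200.
\]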

So in this plot we show the number of parameters on the y-axis and the dimension of the feature vector on the x-axis, and you can see that with Naive Bayes this is of course a linear relationship, while for a Gaussian with a full covariance matrix the number of parameters grows at a quadratic rate.
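A small Python sketch (using NumPy and Matplotlib; the counts are the ones derived above, not the original figure) that produces this kind of comparison:

```python
import numpy as np
import matplotlib.pyplot as plt

d = np.arange(1, 101)            # dimension of the feature vector
naive = 2 * d                    # d means + d variances per class
full = d + d * (d + 1) // 2      # d means + symmetric covariance matrix per class

plt.plot(d, naive, label="Naive Bayes (diagonal covariance)")
plt.plot(d, full, label="Gaussian with full covariance")
plt.xlabel("dimension of the feature vector")
plt.ylabel("number of parameters per class")
plt.legend()
plt.show()
```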

Let's look at an example and the effect of this modeling. Here you see an example with two Gaussian distributions, both using a full covariance matrix. If I break this down, you can see the decision boundary in black. Now, if we use Naive Bayes instead, we end up with the following decision boundary: you can see it is coarser and not such a great fit, but it still does the trick. You can also see that the estimated covariance parameters are much simpler, because there are only two parameters per distribution.
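The exact distributions from the slide are not reproduced here, but a minimal scikit-learn sketch with made-up means and covariances, using QuadraticDiscriminantAnalysis as a stand-in for the full-covariance Gaussian classifier, illustrates the difference in the estimated parameters:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two 2-D Gaussian classes with full (non-diagonal) covariance matrices;
# the means and covariances below are made up for illustration.
X0 = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.5], [1.5, 2.0]], size=500)
X1 = rng.multivariate_normal([3.0, 1.0], [[1.0, -0.6], [-0.6, 2.0]], size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

full = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)  # full covariance per class
naive = GaussianNB().fit(X, y)                                         # one variance per dimension

print("full covariance estimate, class 0:\n", full.covariance_[0])
print("Naive Bayes variances, class 0:   ", naive.var_[0])
print("training accuracy:", full.score(X, y), "vs", naive.score(X, y))
```

In two dimensions the naive model estimates only two variances per class, which matches the "two parameters per distribution" mentioned above.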

We can also consider the logit transform. Remember, if we want to look at the decision boundary, we take the posterior probabilities and divide one by the other.
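For two classes y1 and y2, this logit (log-odds) of the posteriors is

\[
F(\boldsymbol{x}) = \log \frac{p(y_1 \mid \boldsymbol{x})}{p(y_2 \mid \boldsymbol{x})},
\]

and the decision boundary is the set of points where F(x) = 0.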

Part of a video series
Accessible via: Open Access
Duration: 00:13:44 min
Recording date: 2020-10-28
Uploaded on: 2020-10-28 22:07:03
Language: en-US

In this video, we have a look at the naive Bayes classifier.

This video is released under CC BY 4.0. Please feel free to share and reuse.

For reminders to watch new videos, follow us on Twitter or LinkedIn. Also, join our network for information about talks, videos, and job offers in our Facebook and LinkedIn groups.

Music Reference: Damiano Baldoni - Thinking of You
