Welcome back to Pattern Recognition. So today we want to look into a simpler classification method that is called the Naive Bayes Classifier.
The Naive Bayes Classifier makes quite a few simplifying assumptions. Still, it is widely and successfully used, and it often outperforms much more advanced classifiers.
It can be appropriate in the presence of high-dimensional features: if you have really high dimensions, then because of the curse of dimensionality and the sparsity of training observations it may make sense to simplify your model with Naive Bayes. Sometimes it is even called Idiot's Bayes.
So let's look into the problem that it tries to tackle. Typically, we can factorize the class-dependent probability density function as follows. We have our observations x, which are class dependent, and we can rewrite the vector x in its components: the observations x1 to xd, all given the class y. Now I can factorize this further, which means that I can compute the class-conditional probability for x1 and then still have to multiply with the respective other probabilities, and you can see that I can apply the same trick again. Then x2 depends on y and x1, and we can write the whole thing as the following product. You see how we build this up step by step with all the different interdependencies, and this is essentially nothing else than constructing a full covariance matrix if you consider, for example, the Gaussian case.
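Written out, the chain-rule factorization described here reads (with d the dimension of the feature vector):

\[
p(x \mid y) = p(x_1, \dots, x_d \mid y)
            = p(x_1 \mid y)\, p(x_2 \mid y, x_1) \cdots p(x_d \mid y, x_1, \dots, x_{d-1})
\]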
So what do we do in Naive Bayes? Well, Naive Bayes makes a very strong assumption: it naively assumes independence of the dimensions. All components of the feature vector x are assumed to be mutually independent, which means that we can rewrite the class-conditional probability of x simply as the product over the individual dimensions of x. If we now apply this in Bayes' rule, you see that we still want to maximize our posterior probability with respect to y. We can drop the prior of x, because the maximization over y does not depend on x, so this term is not considered here. We can then break everything down to the prior of y times the component-wise class-conditional probabilities. This is a fairly strong assumption, so why would we want to do that?
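In formulas, the independence assumption and the resulting maximization are (with \hat{y} denoting the predicted class):

\[
p(x \mid y) = \prod_{i=1}^{d} p(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y}\, p(y \mid x) = \arg\max_{y}\, p(y) \prod_{i=1}^{d} p(x_i \mid y)
\]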
Well, let's go back to our Gaussian. If we describe a 100-dimensional feature vector x that lives in a 100-dimensional space, and given class y it is normally distributed with all components mutually dependent, then we need a mean vector of dimension 100 and a covariance matrix of dimension 100 times 100. This is fairly big. You can simplify it a little further, because the covariance matrix does not have full degrees of freedom: it is symmetric, so only a triangular part has to be estimated. This means we have 100 unknowns in the mean vector and 100 times (100 + 1) over 2 = 5050 unknowns in the covariance matrix, which gives a total number of 5150 unknowns.
Now let's assume instead that the components are mutually independent. This means that we still need a mean vector with 100 components, but we can break down the covariance matrix and only have to estimate a single variance for every component of the vector. This much simpler version brings us down to only 100 plus 100 unknowns that need to be estimated, which is quite a reduction in the number of parameters.
In this plot we show the number of parameters on the y-axis and the dimension of the feature vector on the x-axis. You can see that with Naive Bayes this is, of course, a linear relationship, while for a Gaussian with full covariance the number of parameters grows at a quadratic rate.
Let's look into an example and the effect of this modeling choice. Here you see an example with two Gaussian distributions, both using a full covariance matrix, and our decision boundary is shown in black. If we now use Naive Bayes, we obtain the following decision boundary instead: it is coarser and not such a great fit, but it still does the trick. You can also see that the estimated covariance parameters are much simpler, because there are only two variance parameters per distribution.
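As an illustration of the Gaussian case discussed here, a minimal Gaussian Naive Bayes sketch in Python could look as follows. The class and variable names are purely illustrative (not from the lecture), and the toy data only mimics the two-Gaussian example above:

```python
import numpy as np

class GaussianNaiveBayes:
    """Gaussian Naive Bayes: one mean and one variance per class and per dimension."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        # Per-class prior, per-dimension mean and variance (diagonal covariance).
        self.priors = np.array([np.mean(y == c) for c in self.classes])
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.vars = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        # log p(y) + sum_i log N(x_i; mu_{y,i}, sigma^2_{y,i}) for every class y.
        log_post = []
        for prior, mu, var in zip(self.priors, self.means, self.vars):
            log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
            log_post.append(np.log(prior) + log_lik)
        return self.classes[np.argmax(np.array(log_post), axis=0)]


# Tiny usage example with two 2-D Gaussian classes.
rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], [1.0, 0.5], size=(100, 2))
X1 = rng.normal([2, 2], [0.5, 1.0], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)
model = GaussianNaiveBayes().fit(X, y)
print((model.predict(X) == y).mean())  # training accuracy
```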
We can also consider the logit transform. Remember, if we want to look at the decision boundary, we divide the two posterior probabilities by each other and take the logarithm; the decision boundary lies where this log ratio is zero.
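Assuming a two-class problem with labels y ∈ {0, 1} (an assumption for concreteness, not stated explicitly above), the logit under the Naive Bayes assumption becomes

\[
F(x) = \log \frac{p(y=1 \mid x)}{p(y=0 \mid x)}
     = \log \frac{p(y=1)}{p(y=0)} + \sum_{i=1}^{d} \log \frac{p(x_i \mid y=1)}{p(x_i \mid y=0)},
\]

and the decision boundary is given by F(x) = 0.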