6 - Pattern Recognition [PR] - PR 4 [ID:21824]

Welcome back everybody to Pattern Recognition. Today we want to continue talking about the Bayesian classifier and introduce the optimality of the Bayes classifier.

So the Bayesian classifier can now be summarized and constructed via the Bayesian decision rule. We essentially want to decide for the optimal class, which is denoted here by y*, and y* is determined by the decision rule: we take the class that maximizes the probability of the class given our observation x. With Bayes' rule we can express this as the prior of the class y times the probability of the observation given class y, divided by the probability of observing the actual evidence. Since this is a maximization over y, we can get rid of the fraction: p(x) does not change the position of the maximum, so we can simply neglect it for the maximization. This can then also be reformulated in terms of the so-called log-likelihood, where we use the trick of applying the logarithm to the product, which allows us to decompose the multiplication into a sum of two logarithms.
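Written out, this chain of reformulations reads as follows (using P(y) for the class prior, p(x|y) for the class-conditional density, and p(x) for the evidence):

\[
y^{*} \;=\; \operatorname*{argmax}_{y} P(y \mid x)
\;=\; \operatorname*{argmax}_{y} \frac{p(x \mid y)\,P(y)}{p(x)}
\;=\; \operatorname*{argmax}_{y} p(x \mid y)\,P(y)
\;=\; \operatorname*{argmax}_{y} \big[\log p(x \mid y) + \log P(y)\big]
\]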

You will see that we will use this trick quite frequently in this class, so this is an important observation that you should definitely be aware of, and it will be relevant for probably any kind of exam that you are facing. So you should memorize this slide very well.

Now this gives us the optimal decision according to the Bayes rule, and there are some hints: typically, the key aspect for getting a good classifier is finding a good model for the posterior probability P(y|x). Often you have a fixed dimension of x, which gives rise to simple classification schemes, but x is not necessarily an element of the high-dimensional space R^D; it may also consist of features of varying dimension, for example sequences or sets of features.
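One common way to write down this distinction (the notation here is just an illustration and is not taken from the slides):

\[
x \in \mathbb{R}^{d} \;\; \text{(fixed dimension)}
\qquad \text{vs.} \qquad
x = (x_{1}, \dots, x_{T}), \;\; x_{t} \in \mathbb{R}^{d}, \;\; T \text{ varying per observation.}
\]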

For example, if you have a speech signal, you do not have a fixed dimension d: the same thing can be said faster or slower, which means that the number of observations changes, and this implies essentially a change in the dimensionality of your problem. These are sequence problems, where you then have to pick a kind of measure that enables you to compare things of different dimensionality.

We typically analyze our problems either with generative modeling or with discriminative modeling. In generative modeling you have a prior probability of the class y and the distribution of your feature vectors x given the respective class, so you have a class-conditional model that is able to describe your entire feature space. In discriminative modeling, in comparison, you directly model the probability of the class given the observations, which enables us to find the decision very quickly; essentially, we are modeling the decision boundary in this case.
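To make the distinction concrete, here is a minimal sketch in Python. It is not from the lecture: the synthetic data, the Gaussian class-conditional assumption for the generative model, and the logistic model for the discriminative one are all choices made only for this illustration.

```python
import numpy as np

# Toy one-dimensional, two-class data (synthetic, just for illustration).
rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 1.0, size=200)               # samples of class 0
x1 = rng.normal(+1.5, 1.0, size=100)               # samples of class 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(100, dtype=int)])

# --- Generative modeling: estimate the prior P(y) and the class-conditional p(x|y).
priors = np.array([np.mean(y == k) for k in (0, 1)])
means = np.array([x[y == k].mean() for k in (0, 1)])
stds = np.array([x[y == k].std() for k in (0, 1)])

def log_gaussian(xq, mu, sigma):
    """Log of a univariate Gaussian density."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - (xq - mu) ** 2 / (2.0 * sigma ** 2)

def decide_generative(xq):
    """Bayes decision: argmax_y [log p(x|y) + log P(y)]; the evidence p(x) is dropped."""
    scores = [log_gaussian(xq, means[k], stds[k]) + np.log(priors[k]) for k in (0, 1)]
    return int(np.argmax(scores))

# --- Discriminative modeling: model P(y|x) directly, here with a logistic model
#     P(y=1|x) = sigmoid(w*x + b), fitted by plain gradient ascent on the log-likelihood.
w, b = 0.0, 0.0
for _ in range(2000):
    p1 = 1.0 / (1.0 + np.exp(-(w * x + b)))        # current posterior estimate
    w += 0.1 * np.mean((y - p1) * x)               # gradient step for w
    b += 0.1 * np.mean(y - p1)                     # gradient step for b

def decide_discriminative(xq):
    """Decide for class 1 if the modeled posterior P(y=1|x) exceeds 0.5."""
    return int(1.0 / (1.0 + np.exp(-(w * xq + b))) > 0.5)

print(decide_generative(0.3), decide_discriminative(0.3))
```

The generative decision uses exactly the log-domain Bayes rule from above, while the discriminative model never represents p(x|y) at all and only learns where the decision boundary lies.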

Let's look a bit into the optimality of the Bayesian classifier. If you attended Introduction to Pattern Recognition, you have already seen the formal proof that the Bayesian classifier is the optimal classifier if you use a zero-one loss function, i.e., if every false decision incurs the same cost.

Here we only recall that you can associate any decision with a kind of risk or loss, and this loss function tells you what kind of damage is done if you decide for the wrong class. The most frequently used example here is the zero-one loss function, which essentially says that you have a loss of one if you make a misclassification; this means that all misclassifications are treated equally and have the same cost, while correct classifications have a cost of zero.
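Written as a formula, the zero-one loss for deciding class ŷ when the true class is y is:

\[
l(y, \hat{y}) \;=\;
\begin{cases}
0 & \text{if } \hat{y} = y, \\
1 & \text{otherwise.}
\end{cases}
\]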

This is a very typical kind of loss function that is used quite frequently, in particular when people do not know exactly what kind of cost is incurred by a misclassification; in such cases you can choose this kind of loss function. Now let's see what happens if we use this loss function.

You can now argue via the minimization of the average loss, and this is essentially what we want to do with the Bayesian classifier. We can write up the average loss as the expected value over the classes, i.e., the loss times the probability of the respective class given our observation, summed over the classes. This is essentially the average loss for a given observation x. Now we want to decide for this observation, which means that we want to minimize the average loss with respect to the classes. So we can now plug in the original definition of our average loss.
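In symbols, writing AL(x, ŷ) for the average loss just described (the symbol is chosen here for illustration; the slides may use a different one), plugging in the zero-one loss shows why the Bayesian decision rule minimizes it:

\[
AL(x, \hat{y}) \;=\; \sum_{y} l(y, \hat{y})\, P(y \mid x)
\;\stackrel{\text{0-1 loss}}{=}\; \sum_{y \neq \hat{y}} P(y \mid x)
\;=\; 1 - P(\hat{y} \mid x),
\]

so minimizing the average loss over ŷ is the same as maximizing the posterior P(ŷ | x), which is exactly the decision rule from the beginning of this section.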

Part of a video series:

Accessible via: Open access

Duration: 00:10:45 min

Recording date: 2020-10-26

Uploaded on: 2020-10-26 08:26:55

Language: en-US

In this video, we look into the optimality of the Bayes Classifier.

This video is released under CC BY 4.0. Please feel free to share and reuse.

For reminders to watch the new video, follow on Twitter or LinkedIn. Also, join our network for information about talks, videos, and job offers in our Facebook and LinkedIn Groups.

Music Reference: Damiano Baldoni - Thinking of You
