The following content has been provided by the University of Erlangen-Nürnberg.
So welcome everybody to Pattern Recognition. Welcome back in 2013. Maybe you were hoping
that Professor Hornegger would show up again. I have to disappoint you; I think for the
next two weeks he will not show up. So I will give the lecture today and tomorrow, and I
hope that Andreas Maier will give the lecture next week. So today we are talking about one
of the most exciting algorithms in pattern recognition. It's about parameter estimation
and the algorithm is called EM algorithm, expectation maximization. We will first talk
about parameter estimation in general. There are two well-known techniques in statistics,
maximum likelihood estimation and maximum a posteriori (MAP) estimation. And then we
will apply it to Gaussian mixture models. First we will apply it to a simple Gaussian
distribution and then to Gaussian mixture models. Then we will focus on the missing
information principle, derive the EM algorithm, show the advantages and drawbacks of this
algorithm, and apply it to constrained optimization. So this algorithm was developed with one goal,
or rather with two goals. The first goal is that we want to deal with a high-dimensional
parameter space, so many parameters that we want to estimate, and the second is that we want to deal with
latent or hidden data, sometimes also called incomplete data. So that is data that
you cannot observe; it is hidden. For example, if we have a Gaussian mixture model, we have
a mixture of several Gaussian distributions and we have one probability density function
and according to this PDF we generate data. So we see the data; the data is observable.
What we don't see is which component actually generated this data, which of the mixture components
generated it. So from statistics there are two well-known techniques, maximum likelihood
estimation, ML estimation and the other one is maximum a posteriori estimation. We will
focus on the first one, maximum likelihood estimation. We set up a likelihood function
and we maximize this likelihood function with respect to the parameters. The observations
are kept fixed and we assume that all our observations are mutually statistically independent.
What does this mean? So let's say we have the probability p of all our samples. Our
samples are called x1, x2, and so on; let's say we have m different samples, and we want to compute
the probability that this set of samples is generated. The probability density function
depends on certain parameters, and I call this set of parameters theta. Okay? And that's
what we want to maximize and we choose these parameters such that the output probability
of all our samples is maximized. So we are looking for those parameters that give the
highest probability. So our estimate for the parameters that we are looking for is the
argmax of this probability. And now we say our samples are statistically independent. That
means the probability of one of the samples does not depend on all the other samples.
So I can write this probability in another way. This is again arc max and that's the
probability of a single sample xi given the parameters and if I want to get the probability
of all samples I compute the product over all samples. That means statistically independent.
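
Written out (an editorial reconstruction of the spoken formulas; p(x_i; theta) denotes the density of a single sample x_i under the parameter set theta):

```latex
% Maximum likelihood estimate: the parameters that maximize the joint density
% of all observed samples x_1, ..., x_m
\hat{\theta} = \operatorname*{argmax}_{\theta} \; p(x_1, x_2, \ldots, x_m ; \theta)
% Under the assumption that the samples are mutually statistically independent,
% the joint density factorizes into a product over single-sample densities:
\hat{\theta} = \operatorname*{argmax}_{\theta} \; \prod_{i=1}^{m} p(x_i ; \theta)
```
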
And if you compute this probability, or you want to maximize it and find the
position of the maximum, it does not matter whether you apply a strictly monotonically increasing function or
not. So you can take the logarithm of it. You change the value of the maximum, but you
don't change the position of the maximum. So you just use the logarithm of this product
and you still get the same estimate for the parameters. And the benefit of taking the
logarithm is that you can get the same result as the other samples. So you can get the same
result as the other samples. So you can get the same result as the other samples.
And the benefit of taking the logarithm is that you can now put the logarithm inside
the product and instead of having a product you have a sum. And handling sums is in general
much easier than handling products. Okay. And this function here is called the log-likelihood function.
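
To make this concrete, here is a minimal sketch (not from the lecture; it assumes NumPy and a toy data set) of maximum likelihood estimation for the simple Gaussian case that the lecture turns to next. For a single 1-D Gaussian, the log-likelihood can be maximized in closed form, which gives the sample mean and the sample variance:

```python
import numpy as np

def log_likelihood(samples, mu, sigma2):
    """Sum of log p(x_i; mu, sigma^2) over all (mutually independent) samples."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma2)
                  - (samples - mu) ** 2 / (2.0 * sigma2))

# Toy observable data x_1, ..., x_m (assumed for illustration)
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=1000)

# For a single Gaussian the maximizers of the log-likelihood are known in closed form:
mu_ml = samples.mean()                       # ML estimate of the mean
sigma2_ml = np.mean((samples - mu_ml) ** 2)  # ML estimate of the variance

print(mu_ml, sigma2_ml, log_likelihood(samples, mu_ml, sigma2_ml))
```

The same log-likelihood could also be handed to a numerical optimizer; that becomes necessary once we move to Gaussian mixture models, where no closed-form maximizer exists and the EM algorithm comes into play.
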
The other technique, maximum a posteriori estimation,
assumes that we have a probability density function of our parameters. So we know what
values for our parameters are more likely than other values. And we can treat the parameters as random variables themselves.
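
For reference, the standard form of the MAP estimate that this paragraph introduces (an editorial reconstruction; p(theta) denotes the assumed prior density of the parameters): since the posterior is proportional to the likelihood times the prior, we get

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \operatorname*{argmax}_{\theta} \; p(\theta \mid x_1, \ldots, x_m)
  = \operatorname*{argmax}_{\theta} \; p(\theta) \prod_{i=1}^{m} p(x_i \mid \theta)
```
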