Welcome back everybody to pattern recognition. So today we want to look into a probabilistic
estimation technique that allows us to estimate hidden information, the so-called expectation
maximization algorithm.
So let's look into our slides. The topic today is the expectation maximization algorithm, and we want to use it
for parameter estimation. So the goal is the derivation of a parameter estimation technique
that is able to deal with high-dimensional parameter spaces, latent or hidden variables,
as well as incomplete data. So far we've seen the following techniques.
We have seen maximum likelihood estimation, which essentially assumes that all observations
are mutually statistically independent; the observations are kept fixed, and the log-likelihood
function is then optimized with respect to the parameters. The other technique that we looked
into was maximum a posteriori estimation, where we essentially assume that there is a
probability density function over our parameters, and in order to use it we of course have to know
what this function looks like in order to determine its parameters. So generally we can now formulate
this with x as an observed random variable and some parameter set theta. The estimates of theta are
denoted by theta hat, and x is an event that is assigned to the random variable capital X.
This gives us the maximum likelihood estimation, where we maximize the likelihood
function to determine the optimal parameters theta hat, and typically we can write this up as a
joint probability over the observations. The other one is the maximum a posteriori estimation, where we
determine our parameters theta hat by maximizing the probability of theta given some x. Using
Bayes' rule, and noting that the evidence obtained by marginalization does not depend on theta, we finally end up maximizing
the logarithm of the prior of theta plus the logarithm of the probability of x given theta. So we essentially
have a generative model where we know the distribution of x given theta and we are only
interested in determining the optimal parameters. Theta is considered a random variable
and its probability density function is known.
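As a compact summary of what we just said (standard notation, not quoted from the slides), the two estimators can be written as:

```latex
% Maximum likelihood: maximize the (log-)likelihood of the observed data
\[
  \hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \, p(x \mid \theta)
                             = \arg\max_{\theta} \, \log p(x \mid \theta)
\]

% Maximum a posteriori: maximize the posterior; Bayes' rule and dropping the
% evidence p(x), which does not depend on theta, give the sum of two logarithms
\[
  \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, p(\theta \mid x)
                              = \arg\max_{\theta} \, \bigl[ \log p(\theta) + \log p(x \mid \theta) \bigr]
\]
```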
Let's look into an example using maximum likelihood estimation. Here we assume a Gaussian, and you remember
that the Gaussian or normal distribution is of course also very relevant for the exam. You see that we have
this classical formulation as it popped up at various instances in our class, and we observe random vectors x1 to xm that
form our training data. Based on these training data we now have to estimate the mean vector mu
and the covariance matrix sigma. How can we do that?
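The "classical formulation" referred to here is the multivariate normal density; written out (with d denoting the dimension of x, a symbol not used in the spoken text), it reads:

```latex
% Multivariate Gaussian density for a d-dimensional observation x
\[
  p(x \mid \mu, \Sigma) =
    \frac{1}{(2\pi)^{d/2} \, |\Sigma|^{1/2}}
    \exp\!\Bigl( -\tfrac{1}{2} (x - \mu)^{\mathsf{T}} \Sigma^{-1} (x - \mu) \Bigr)
\]
```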
Well, we use the idea of maximum likelihood estimation: we assume that our samples are mutually independent, which essentially allows
us to write everything up as a large product, and this large product is then maximized with respect
to mu and sigma. We can use the log-likelihood function instead; if we do so, we can
pull the logarithm into the product, the product turns into a sum, and this allows us to
write this up generally as the maximization of the log-likelihood function, where we have the
observations as input and we seek to maximize it with respect to our mu and sigma.
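In symbols, the step from the product over independent samples to a sum of logarithms looks as follows (a standard rewriting; the shorthand L(mu, Sigma) for the log-likelihood is introduced here for later reference):

```latex
% Independence turns the joint likelihood into a product;
% the logarithm turns the product into a sum
\[
  (\hat{\mu}, \hat{\Sigma})
    = \arg\max_{\mu, \Sigma} \prod_{i=1}^{m} p(x_i \mid \mu, \Sigma)
    = \arg\max_{\mu, \Sigma} \underbrace{\sum_{i=1}^{m} \log p(x_i \mid \mu, \Sigma)}_{=:\, \mathcal{L}(\mu, \Sigma)}
\]
```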
And here is our definition of the log-likelihood function: it is simply the sum
over the respective logarithmized probabilities. To find the point of optimality, we have to compute the partial
derivatives, and the necessary conditions for these parameters are of course that the partial
derivative with respect to mu equals zero and the partial derivative with respect to sigma equals zero.
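Using the shorthand L(mu, Sigma) from the sketch above, these necessary conditions read:

```latex
% Necessary conditions for a maximum of the log-likelihood
\[
  \nabla_{\mu}\, \mathcal{L}(\mu, \Sigma) = 0
  \qquad \text{and} \qquad
  \frac{\partial \mathcal{L}(\mu, \Sigma)}{\partial \Sigma} = 0
\]
```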
So we can actually do that: we apply the definition of our Gaussian, and then we see that the term in front of the exponential
can essentially be dropped because it is not relevant for the optimization with respect to mu; we essentially just remain with the
exponent, which we can then set to zero after differentiating. If we solve the above equation,
we get the sum over all observations divided by the number of observations
as the mean vector.
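Carried out, this derivation gives the familiar sample mean (a standard result, stated here for completeness):

```latex
% Gradient with respect to mu: the prefactor drops, only the exponent contributes
\[
  \nabla_{\mu}\, \mathcal{L}(\mu, \Sigma) = \sum_{i=1}^{m} \Sigma^{-1} (x_i - \mu) = 0
\]
% Since \Sigma^{-1} is invertible, solving for mu yields the sample mean
\[
  \hat{\mu} = \frac{1}{m} \sum_{i=1}^{m} x_i
\]
```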
Now we can also estimate the covariance matrix, and we do the same trick: we compute the partial derivative of the log-likelihood
function with respect to sigma and then determine its zero crossings. We can see that the covariance matrix can then be estimated
as one over m, where m is the number of training samples, times a sum of outer products; each term is built from
twice the same vector, x i minus mu hat times x i minus mu hat transpose.
These are both column vectors, so we essentially have rank-one matrices, and then a sum over all training samples.
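Written out, the estimate described here is:

```latex
% ML estimate of the covariance matrix:
% a sum of rank-one outer products over all training samples
\[
  \hat{\Sigma} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \hat{\mu}) (x_i - \hat{\mu})^{\mathsf{T}}
\]
```

As a small numerical sketch of both estimators (my own illustration, not code from the lecture; the function name ml_gaussian and the test data are made up for this example):

```python
import numpy as np

def ml_gaussian(X):
    """Maximum likelihood estimates of mean and covariance for a Gaussian.

    X is an (m, d) array: m training samples, each a d-dimensional row vector.
    """
    m = X.shape[0]
    # Sample mean: sum over all observations divided by the number of observations
    mu_hat = X.sum(axis=0) / m
    # Covariance: average of the rank-one outer products (x_i - mu_hat)(x_i - mu_hat)^T
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / m
    return mu_hat, sigma_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -2.0],
                                cov=[[2.0, 0.5], [0.5, 1.0]],
                                size=1000)
    mu_hat, sigma_hat = ml_gaussian(X)
    print(mu_hat)     # close to [1, -2]
    print(sigma_hat)  # close to [[2, 0.5], [0.5, 1]]
    # Cross-check: numpy's biased covariance uses the same 1/m normalization
    print(np.cov(X, rowvar=False, bias=True))
```

This closed-form solution is only available because every sample is fully observed; the expectation maximization algorithm introduced in this lecture targets the case where part of this information is latent or hidden.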