Welcome back to pattern recognition. So today we want to talk a bit about regression and
different variants of regularizing regression.
So you see here that this is the introduction to regression analysis. The idea is that we want to construct a linear decision boundary; in our two-class setting this is sufficient to tell whether we are on the one side or the other side of that boundary. To do so, we compute the signed distance alpha transpose x plus alpha zero to the hyperplane, and the sign function then tells us on which side of the decision boundary we are, as we've seen in the previous Adidas example.
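As a small illustration that is not part of the original slides, such a sign-based decision rule could be sketched in NumPy as follows; the names classify, alpha, and alpha0 are placeholders:

```python
import numpy as np

def classify(x, alpha, alpha0):
    """Return +1 or -1 depending on which side of the hyperplane x lies on."""
    signed_distance = alpha @ x + alpha0   # alpha^T x + alpha_0
    return np.sign(signed_distance)

# Example with a hypothetical boundary alpha^T x + alpha_0 = 0:
alpha = np.array([1.0, -1.0])
alpha0 = 0.5
print(classify(np.array([2.0, 0.0]), alpha, alpha0))   # prints 1.0
```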
Now the idea is that we can compute this decision boundary simply by least-squares estimation. To do so, we convert our entire training set into a matrix and a vector: the matrix is built from the individual feature vectors transposed, row by row, and we append a column of ones at the very end. This gives us the matrix X, a matrix of size m times (d+1), where d is the dimensionality of our feature vectors and m is the number of training observations. This matrix is related to a vector y, which holds the class memberships of the respective feature vectors. We relate the two with a parameter vector theta, which is composed of the normal vector alpha and our bias alpha zero. Now, how can we do the estimation?
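To make the construction of X and y concrete, here is a minimal sketch with a hypothetical toy training set; the names features and labels are illustrative and not from the lecture:

```python
import numpy as np

# Hypothetical training set: m = 3 observations with d = 2 features each,
# and class memberships in {-1, +1}.
features = np.array([[1.0, 2.0],
                     [2.0, 0.5],
                     [-1.0, -1.5]])
labels = np.array([1.0, 1.0, -1.0])

# Transposed feature vectors stacked row-wise with a column of ones appended,
# so that X @ theta computes alpha^T x_i + alpha_0 for every observation at once.
X = np.hstack([features, np.ones((features.shape[0], 1))])   # shape: m x (d + 1)
y = labels                                                    # class memberships
```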
Well, we can simply estimate theta by solving a linear regression problem: we take the matrix X, multiply it with theta, subtract y, take the two-norm to the power of two, and minimize this norm with respect to theta. We can then break up this norm to build the least-squares estimator: we can either write the square as a sum over the individual components, or expand the squared L2 norm as the product of the term with its transpose, so essentially as an inner product; it doesn't matter which formulation you take. If you now take the partial derivative with respect to theta and do the math, you will find the solution in terms of the so-called pseudo-inverse. This is the typical Moore-Penrose pseudo-inverse, given as X transpose X, this inverted, times X transpose times y, i.e. theta = (X^T X)^{-1} X^T y, which enables us to compute the parameter vector theta from our observations X and y.
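A minimal sketch of this estimate, reusing the toy X and y from above (the variable names are again illustrative):

```python
import numpy as np

# X and y as constructed in the sketch above (m x (d+1) data matrix, class labels).
X = np.array([[1.0, 2.0, 1.0], [2.0, 0.5, 1.0], [-1.0, -1.5, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# Least-squares estimate via the Moore-Penrose pseudo-inverse:
# theta = (X^T X)^{-1} X^T y, valid when X^T X is invertible.
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# A numerically more stable alternative is a dedicated least-squares solver:
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

alpha, alpha0 = theta[:-1], theta[-1]   # normal vector and bias
```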
One thing that you have to keep in mind here is that this only works if the column vectors of X are linearly independent. If there are dependencies among the columns, X does not have full rank and the inversion runs into problems. One way to solve this is to use the singular value decomposition, which still gives you a solution; another way is, of course, to apply regularization, and we will talk about the ideas of regularization in the next couple of slides.
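To see the rank-deficient case concretely, here is a hedged sketch with a hypothetical data matrix whose second column is a multiple of the first; the SVD-based pseudo-inverse still returns the minimum-norm least-squares solution where the plain normal-equation inverse would fail:

```python
import numpy as np

# Linearly dependent columns: the second column is twice the first.
X_dep = np.array([[1.0, 2.0, 1.0], [2.0, 4.0, 1.0], [-1.0, -2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# X_dep.T @ X_dep is singular, so np.linalg.inv would break down;
# the SVD-based pseudo-inverse still yields a solution.
theta = np.linalg.pinv(X_dep) @ y

# The same thing by hand via the SVD X = U S V^T, inverting only nonzero singular values:
U, s, Vt = np.linalg.svd(X_dep, full_matrices=False)
s_inv = np.where(s > 1e-10, 1.0 / s, 0.0)
theta_manual = Vt.T @ (s_inv * (U.T @ y))
```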
So the questions that we want to ask are: Why should we actually prefer the Euclidean norm, the L2 norm? Will different norms lead to different results? Which norm, and which decision boundary, is actually the best one? And can we somehow incorporate prior knowledge into this linear regression? This leads to the concept of ridge regression, or regularized regression, where we extend the objective function with an additional term; for example, we can constrain the Euclidean length of our parameter vector theta. You could then say that you have a linear regression with the log-likelihood penalized by minus lambda theta transpose theta, the squared L2 norm of theta, where lambda is greater than zero. Alternatively, in what we will see is an equivalent formulation, you could extend the estimation with a prior distribution on the parameter vector theta and assume theta to be Gaussian distributed with zero mean and a diagonal covariance matrix with some factor tau squared as the variance.
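As a short sketch of why the penalty and the prior coincide, consider the maximum a posteriori estimate; this sketch assumes Gaussian observation noise with variance sigma squared, which is an assumption not spelled out above:

```latex
\begin{align*}
\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}
  &= \arg\max_{\boldsymbol{\theta}} \;
     \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) \\
  &= \arg\min_{\boldsymbol{\theta}} \;
     \frac{1}{2\sigma^{2}} \lVert \mathbf{X}\boldsymbol{\theta} - \mathbf{y} \rVert_{2}^{2}
     + \frac{1}{2\tau^{2}} \, \boldsymbol{\theta}^{\top}\boldsymbol{\theta}
\end{align*}
```

So the Gaussian prior with covariance tau squared times the identity plays exactly the role of the penalty term, with lambda corresponding to sigma squared over tau squared.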
Okay, so this is a bit much; let's look into the details of how we can actually see that this is true. We start with the regularized regression and use the Lagrangian formulation: the constraint that our norm should be small is simply added to the objective with a Lagrange multiplier. We can then rearrange this further, again expressing the norms as inner products using the transpose notation, and this allows us to compute the partial derivative with respect to theta. Again, I can recommend doing the math on your own; you can have a look at
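As a hedged sketch of where this derivative leads, the resulting ridge normal equations can be solved directly; the value of lam below is purely illustrative:

```python
import numpy as np

# X and y as before; lam is an illustrative regularization strength.
X = np.array([[1.0, 2.0, 1.0], [2.0, 4.0, 1.0], [-1.0, -2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
lam = 0.1

# Setting the derivative of ||X theta - y||^2 + lam * theta^T theta to zero
# gives (X^T X + lam * I) theta = X^T y, which is solvable for any lam > 0,
# even when X^T X itself is singular.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```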