20 - Pattern Recognition [PR] - PR 16 [ID:23033]

Welcome back to Pattern Recognition. Today we want to talk a bit about regression and different variants of regularizing regression.

So you see here, this is the introduction to regression analysis. The idea is that we want to construct a linear decision boundary, and in our two-class setting this is then sufficient to differentiate whether we are on the one side or the other side of the decision boundary. Therefore, we compute the signed distance alpha transpose x plus alpha zero to that particular hyperplane, and the sign function then tells us whether we are on the one side or the other side of this decision boundary, as we've seen in the previous Adidas example.
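To make this decision rule concrete, here is a minimal Python sketch; the feature vector, the normal vector alpha, and the bias alpha_0 are made-up values for illustration, not data from the lecture.

```python
import numpy as np

# Hypothetical 2-D example: alpha and alpha_0 define the hyperplane alpha^T x + alpha_0 = 0.
alpha = np.array([0.8, -0.6])   # normal vector of the decision boundary (assumed values)
alpha_0 = 0.5                   # bias / offset (assumed value)

x = np.array([1.0, 2.0])        # some feature vector to classify

# Signed distance to the hyperplane (up to the scaling by the length of alpha).
signed_dist = alpha @ x + alpha_0

# The sign tells us on which side of the decision boundary we are.
predicted_class = np.sign(signed_dist)   # +1 or -1
print(signed_dist, predicted_class)
```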

estimation and the idea here that we want to follow is that we essentially convert our entire

training set into a matrix and a vector and here the matrix is given by the individual feature

vectors transposed and we add a column of one in the very end. So this gives us the matrix x and

this is then a matrix in an m times d plus one dimensional space. So d is the dimensionality of

our feature vector and m is the number of training observations and this is then related to some

vector y and in this vector y we essentially have the class memberships of the respective

feature vector. Now we want to relate those two with a parameter vector theta and theta now is

composed of the normal vector alpha and our bias alpha zero. Now how can we do the estimation?
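As a small sketch of this setup, one could assemble the design matrix and the label vector like this; the toy training set is made up for illustration and not taken from the lecture.

```python
import numpy as np

# Hypothetical toy training set: m = 4 observations, d = 2 features.
features = np.array([[ 1.0,  2.0],
                     [ 2.0,  0.5],
                     [-1.0, -1.5],
                     [-2.0,  0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])   # class memberships (+1 / -1)

m, d = features.shape

# Stack the transposed feature vectors row by row and append a column of ones,
# so that X has shape (m, d + 1) and theta = (alpha, alpha_0) has d + 1 entries.
X = np.hstack([features, np.ones((m, 1))])
print(X.shape)   # (4, 3)
```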

Well, we can simply estimate theta by solving the linear regression problem: we essentially take the matrix X, multiply it with theta, subtract that from y, take the two-norm to the power of two, and compute the minimum of this norm with respect to theta. You can then see that we can break up this norm and build the least-squares estimator: we can write this square either as a sum over the individual components, or we can expand the squared L2 norm as the product of the residual with its own transpose, so this is essentially an inner product. It doesn't matter which formulation you take.
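Written out explicitly, the two equivalent formulations mentioned here are (with x_i denoting the i-th row of X, i.e. the i-th feature vector with the appended one):

```latex
\hat{\boldsymbol{\theta}}
  = \operatorname*{argmin}_{\boldsymbol{\theta}}
    \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\theta} \rVert_2^2
  = \operatorname*{argmin}_{\boldsymbol{\theta}}
    \sum_{i=1}^{m} \bigl( y_i - \mathbf{x}_i^{\mathsf T}\boldsymbol{\theta} \bigr)^2
  = \operatorname*{argmin}_{\boldsymbol{\theta}}
    (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^{\mathsf T}
    (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})
```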

If you now take the partial derivative with respect to theta and do the math, you will find the solution via the so-called pseudo-inverse. This is the very typical Moore-Penrose pseudo-inverse, which is given as X transpose X, this inverted, times X transposed, times y. So this essentially gives us a kind of inverse that enables us to compute the parameter vector theta from our observations X and y.
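A minimal numerical sketch of this closed-form solution, continuing the hypothetical X and y from the sketch above; the explicit normal-equations form is shown only for illustration, since np.linalg.lstsq or the SVD-based pseudo-inverse is numerically preferable.

```python
import numpy as np

# theta_hat = (X^T X)^{-1} X^T y  -- least-squares solution via the normal equations.
theta_normal_eq = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent but numerically more robust alternatives:
theta_pinv = np.linalg.pinv(X) @ y                     # Moore-Penrose pseudo-inverse (SVD-based)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares solver

alpha, alpha_0 = theta_pinv[:-1], theta_pinv[-1]       # split into normal vector and bias
```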

One thing that you have to keep in mind here is that this will only work if the column vectors of X are linearly independent. If you have dependencies among the column vectors, you will not have full rank, and then you will run into problems. One way to solve this is to take the singular value decomposition, then you can still do it; another way is of course to apply regularization, and we will talk about the ideas of regularization in the next couple of slides.
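As a hedged illustration of this rank-deficiency issue, consider a made-up design matrix whose second column duplicates the first: inverting X transpose X fails, while the SVD-based pseudo-inverse still returns the minimum-norm least-squares solution.

```python
import numpy as np

# Hypothetical rank-deficient design matrix: the second column duplicates the first.
X_bad = np.array([[1.0, 1.0, 1.0],
                  [2.0, 2.0, 1.0],
                  [3.0, 3.0, 1.0],
                  [4.0, 4.0, 1.0]])
y_bad = np.array([1.0, 1.0, -1.0, -1.0])

print(np.linalg.matrix_rank(X_bad))   # 2 < 3, so X^T X is singular

# np.linalg.inv(X_bad.T @ X_bad) would raise a LinAlgError here.
# The SVD-based pseudo-inverse still yields the minimum-norm solution:
theta = np.linalg.pinv(X_bad) @ y_bad
print(theta)
```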

So the questions that we want to ask are: why should we actually prefer the Euclidean norm, the L2 norm, why is that actually the case? Will different norms then lead to different results? Which norm and which decision boundary is actually the best one? And can we somehow incorporate prior knowledge into this linear regression? This then leads to the concept of ridge regression, or regularized regression, where we extend the objective function with an additional term; for example, you can constrain the Euclidean length of our parameter vector theta. So you could then say that you have a linear regression with the log-likelihood penalized by minus lambda times theta transpose theta, so the squared L2 norm of theta, and you set lambda to be greater than zero. Or, alternatively, in what we will see is actually an equivalent formulation, you could say we extend our estimation with a prior distribution on the parameter vector theta, and we assume theta to be distributed in a Gaussian way with zero mean and a diagonal covariance matrix with some factor tau squared as the variance.
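Written out, the penalized formulation and the prior-based formulation look as follows; note that the noise variance sigma squared of an assumed i.i.d. Gaussian observation model is not stated in the lecture and is added here only to make the relation between lambda, sigma squared, and tau squared explicit.

```latex
% Penalized least-squares (ridge) formulation:
\hat{\boldsymbol{\theta}}_{\text{ridge}}
  = \operatorname*{argmin}_{\boldsymbol{\theta}}
    \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\theta} \rVert_2^2
    + \lambda\,\boldsymbol{\theta}^{\mathsf T}\boldsymbol{\theta},
  \qquad \lambda > 0

% Equivalent MAP formulation with the Gaussian prior
% p(\boldsymbol{\theta}) = \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})
% and (assumed) i.i.d. Gaussian noise of variance \sigma^2:
\hat{\boldsymbol{\theta}}_{\text{MAP}}
  = \operatorname*{argmax}_{\boldsymbol{\theta}}
    \Bigl[ -\tfrac{1}{2\sigma^2}
            \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\theta} \rVert_2^2
           -\tfrac{1}{2\tau^2}
            \boldsymbol{\theta}^{\mathsf T}\boldsymbol{\theta} \Bigr]

% Both yield the same maximizer when \lambda = \sigma^2 / \tau^2.
```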

Okay, so this is a bit much; let's look into the details of how we can actually see that this is true. We start with the regularized regression and we use the Lagrangian formulation here, so you see that our constraint that the norm should be small is simply added with the Lagrange multiplier. Then we can further rearrange this, so we can again express the norms as inner products using this transpose notation, and this then allows us to compute the partial derivative with respect to theta. Again, I can recommend doing the math on your own; you can have a look at
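For reference, carrying out this partial derivative and setting it to zero gives the well-known closed-form ridge solution; this is the standard result of the step the lecture leaves as an exercise.

```latex
\frac{\partial}{\partial \boldsymbol{\theta}}
  \Bigl[ (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^{\mathsf T}
         (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})
         + \lambda\,\boldsymbol{\theta}^{\mathsf T}\boldsymbol{\theta} \Bigr]
  = -2\,\mathbf{X}^{\mathsf T}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})
    + 2\lambda\,\boldsymbol{\theta}
  \overset{!}{=} \mathbf{0}
  \;\;\Longrightarrow\;\;
  \hat{\boldsymbol{\theta}}_{\text{ridge}}
  = \bigl( \mathbf{X}^{\mathsf T}\mathbf{X} + \lambda \mathbf{I} \bigr)^{-1}
    \mathbf{X}^{\mathsf T}\mathbf{y}
```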

Part of a video series:
Accessible via: Open access
Duration: 00:18:04 min
Recording date: 2020-11-06
Uploaded on: 2020-11-06 23:47:25
Language: en-US

In this video, we introduce basic concepts on how to regularize regression using L2 and L1 norms.

This video is released under CC BY 4.0. Please feel free to share and reuse.

For reminders to watch the new video, follow us on Twitter or LinkedIn. Also, join our network for information about talks, videos, and job offers in our Facebook and LinkedIn Groups.

Music Reference: Damiano Baldoni - Thinking of You
