23 - Pattern Recognition [PR] - PR 19 [ID:23065]

Welcome everybody to Pattern Recognition. Today we want to look a bit into the early beginnings of neural networks, and in particular into the Rosenblatt perceptron, its optimization, and the actual convergence proof and convergence behavior.

So let's start with the Rosenblatt perceptron. It was developed as early as 1957, and the main idea behind the perceptron is that we want to compute a linear decision boundary: we assume that the classes are linearly separable, and then we are able to compute a linear separating hyperplane that minimizes the distance of misclassified feature vectors to the decision boundary. This is the main idea behind the Rosenblatt perceptron.

In the following we assume that the class numbers are plus one and minus one and that the decision boundary is a linear function, given as y star equals the sign of alpha transpose x plus alpha zero. You could say that alpha is the normal vector of the hyperplane and alpha zero is the offset that moves the plane away from the origin. So we compute the signed distance of the vector x to this hyperplane and then map it to the sign of the signed distance, which is either minus one or plus one. We are essentially only interested in whether we are on one side or the other side of the plane, and this is exactly what the sign function captures.
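As a quick illustration (this snippet is mine, not from the lecture), the decision rule y* = sign(alpha^T x + alpha_0) could be written in NumPy as follows; assigning +1 to points exactly on the hyperplane is an assumed convention here:

```python
import numpy as np

def predict(alpha, alpha0, x):
    """Perceptron decision rule: y* = sign(alpha^T x + alpha0)."""
    signed_distance = np.dot(alpha, x) + alpha0
    # Map the signed distance to a class label; points exactly on the
    # hyperplane are assigned +1 by convention in this sketch.
    return 1 if signed_distance >= 0 else -1
```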

Now, if we want to optimize this problem and determine optimal parameters for alpha and alpha zero, we can find them with the following minimization problem. We define a function D of alpha zero and alpha, given as minus the sum over the set M, where M is the set of misclassified feature vectors, of y i, the ground-truth label, times alpha transpose x i plus alpha zero. If you look at this equation carefully, you see that y i is the ground-truth label, so it is either plus one or minus one, and alpha transpose x i plus alpha zero is the signed distance. For a misclassified sample the two have exactly opposite signs: either the term in brackets is negative and y i is positive, or, with both signs flipped, the term is positive and y i is negative. Only when both have the same sign do we have a correct classification. This means that every element of the sum is negative, and this is why we multiply by minus one in front of the sum, because only then do we get positive values. So if we minimize this, we are essentially minimizing the loss that is caused by all of these misclassifications.
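Written out, the spoken formula reads as follows, with M denoting the set of misclassified samples, i.e. those where the label and the signed distance have opposite signs:

```latex
D(\alpha_0, \alpha) \;=\; -\sum_{i \in M} y_i \left( \alpha^{T} x_i + \alpha_0 \right),
\qquad
M = \left\{\, i \;\middle|\; y_i \left( \alpha^{T} x_i + \alpha_0 \right) < 0 \,\right\}
```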

We also see that the elements of the sum depend on the set of misclassified feature vectors, and this set can change in every iteration: every time I change the decision boundary, I change the set of misclassified samples. This is a real problem, because the set will probably change in every iteration. The cardinality of M is a discrete variable, so we have competing quantities: the continuous parameters of the linear decision boundary and the discrete cardinality of M.

So let's look again at the objective function that we seek to minimize. We already explained it in detail, and if we now want to minimize it, we of course need to compute the gradient of this objective function. If we compute the partial derivative with respect to alpha zero, it is simply minus the sum over all y i in the misclassified set, and if we compute the partial derivative with respect to alpha, it is minus the sum over y i times x i.
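Written as formulas, the two partial derivatives just described are:

```latex
\frac{\partial D}{\partial \alpha_0} \;=\; -\sum_{i \in M} y_i,
\qquad
\frac{\partial D}{\partial \alpha} \;=\; -\sum_{i \in M} y_i \, x_i
```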

Now we can look into the update rule, and let's consider the special case where we update after each visited misclassification. Then we get a new estimate of alpha zero and alpha for every observed misclassification, and we update immediately. So we update from step k to k plus one: in brackets we write the new iteration index, we choose a vector notation, and the new estimate is the previous iterate from step k plus lambda, the step size of our optimizer, times y i in the first entry and y i times x i in the other entry of this vector. Often lambda is simply chosen as one, which simplifies the update step.
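In the vector notation just described, the single-sample update for a misclassified sample x i reads:

```latex
\begin{pmatrix} \alpha_0 \\ \alpha \end{pmatrix}^{(k+1)}
\;=\;
\begin{pmatrix} \alpha_0 \\ \alpha \end{pmatrix}^{(k)}
\;+\;
\lambda \begin{pmatrix} y_i \\ y_i \, x_i \end{pmatrix}
```

With lambda equal to one, this reduces to simply adding y i to alpha zero and y i times x i to alpha.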

The procedure then looks as follows: the input is the set S of training samples with the feature vectors x i and labels y i, and we initialize, for example, alpha zero in iteration zero with zero and the vector alpha in iteration zero with the zero vector.
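A minimal sketch of this procedure, assuming single-sample updates with zero initialization as just described; the function and variable names, the stopping criterion max_epochs, and the choice lambda = 1 are my own illustration, not code from the lecture:

```python
import numpy as np

def train_perceptron(X, y, lam=1.0, max_epochs=100):
    """Rosenblatt perceptron with an update after each visited misclassification.

    X: (n_samples, n_features) feature vectors x_i
    y: (n_samples,) ground-truth labels in {-1, +1}
    """
    alpha = np.zeros(X.shape[1])   # normal vector, initialized with zeros
    alpha0 = 0.0                   # offset, initialized with zero

    for _ in range(max_epochs):
        misclassified = 0
        for xi, yi in zip(X, y):
            # y_i (alpha^T x_i + alpha0) <= 0 means x_i is misclassified
            if yi * (np.dot(alpha, xi) + alpha0) <= 0:
                # immediate update with step size lambda
                alpha += lam * yi * xi
                alpha0 += lam * yi
                misclassified += 1
        if misclassified == 0:     # full pass without errors: converged
            break
    return alpha0, alpha
```

For linearly separable classes, the inner loop eventually completes a full pass without misclassifications and the procedure stops, which is exactly the behavior the convergence analysis in this video addresses.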

Part of a video series:

Accessible via: Open Access

Duration: 00:25:23 min

Recording date: 2020-11-08

Uploaded on: 2020-11-08 13:18:07

Language: en-US

In this video, we introduce the Rosenblatt Perceptron and perform a convergence analysis.

This video is released under CC BY 4.0. Please feel free to share and reuse.

For reminders to watch the new video follow on Twitter or LinkedIn. Also, join our network for information about talks, videos, and job offers in our Facebook and LinkedIn Groups.

Music Reference: Damiano Baldoni - Thinking of You
