Welcome everybody to Pattern Recognition. So today we want to look a bit into the early beginnings of neural networks, and in particular we want to look into the Rosenblatt perceptron, its optimization, and the actual convergence proofs and convergence behavior.
So let's start looking into the Rosenblatt perceptron. So this was already developed in 1957
and the main idea behind the perceptron is that we want to compute a linear decision boundary and
we assume that the classes are linearly separable and then we are able to compute a linear separating
hyperplane that minimizes the distance of misclassified feature vectors to the decision
boundary. So this is the main idea behind the Rosenblatt perceptron. In the following we assume that the class labels are plus one and minus one, and that the decision boundary is a linear function given as y* equals the sign of alpha transpose x plus alpha zero. You could say that alpha is the normal vector of the hyperplane and alpha zero is the offset that moves the plane away from the origin. So we essentially compute the signed distance of the vector x to this hyperplane and then map it with the sign function, which yields either minus one or plus one. We're essentially only interested in whether we are on the one side of the plane or the other, and this is exactly what the sign function captures.
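As a quick illustration, here is a minimal sketch of this decision rule in Python; the names `decide`, `alpha`, and `alpha0` are chosen for illustration and are not part of the lecture.

```python
import numpy as np

def decide(x, alpha, alpha0):
    """Perceptron decision rule: the sign of the signed distance to the hyperplane."""
    # alpha^T x + alpha0 is proportional to the signed distance of x to the hyperplane
    signed_distance = np.dot(alpha, x) + alpha0
    # map to the class labels +1 / -1 (here we assign +1 to points exactly on the boundary)
    return 1 if signed_distance >= 0 else -1
```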
Now, if we want to optimize this problem and determine optimal parameters for alpha and alpha zero, we can do so via the following minimization problem. We define a function D of alpha zero and alpha, given as minus the sum over the set M, where M is the set of misclassified vectors, of yi times alpha transpose xi plus alpha zero. If you look at this equation carefully: yi is the ground truth label, so it is either plus one or minus one, and alpha transpose xi plus alpha zero is the signed distance of xi to the hyperplane. For a misclassified sample the two have opposite signs: if the bracket is negative while yi is plus one, the sample is misclassified, and the same holds with both signs flipped. Only when both have the same sign do we have a correct classification. This means that every element of the sum is negative, and this is why we multiply with minus one in front of the sum, because only then do we get positive values. So now you see that if we minimize this, we're essentially minimizing the
loss that is caused by all of these misclassifications. Note, however, that the terms of the sum depend on the set of misclassified feature vectors, and this set can change in every iteration: every time I change the decision boundary, I change the set of misclassified samples. This is a real difficulty, because the set will probably change in every iteration. The cardinality of M is a discrete quantity, so we have competing variables: the continuous parameters of the linear decision boundary and the discrete cardinality of M.
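To make this dependence explicit, here is a small sketch (again with hypothetical names) that recomputes the misclassified set M and the criterion D for the current parameters; M has to be recomputed whenever alpha and alpha zero change.

```python
import numpy as np

def perceptron_criterion(X, y, alpha, alpha0):
    """Compute the set M of misclassified samples and the criterion D(alpha0, alpha).

    X: (n, d) array of feature vectors, y: (n,) array of labels in {+1, -1}.
    """
    # signed distances (up to scaling) of all samples to the current hyperplane
    scores = X @ alpha + alpha0
    # a sample is misclassified if label and score have opposite signs
    # (points exactly on the boundary are counted as misclassified here)
    M = np.where(y * scores <= 0)[0]
    # D is minus the sum of y_i * (alpha^T x_i + alpha0) over the misclassified set,
    # which makes every contribution positive
    D = -np.sum(y[M] * scores[M])
    return M, D
```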
So let's look again at the objective function that we seek to minimize; we already explained it in detail. If we now want to minimize it, we of course need to compute the gradient of this objective function. If I compute the partial derivative with respect to alpha zero, it is simply minus the sum over all the yi in the misclassified set, so the minus sign is still in front, and if I compute the partial derivative with respect to alpha, it is minus the sum over yi times xi.
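Restating in formulas what was just described in words, the criterion and its partial derivatives are:

```latex
D(\alpha_0, \alpha) = -\sum_{i \in M} y_i \,(\alpha^\top x_i + \alpha_0),
\qquad
\frac{\partial D}{\partial \alpha_0} = -\sum_{i \in M} y_i,
\qquad
\frac{\partial D}{\partial \alpha} = -\sum_{i \in M} y_i \, x_i .
```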
Now we can look into the update rule, and let's consider the special case where we update after each visited misclassification. Then we get a new estimate of our alpha zero and alpha for every observed misclassification, and we immediately apply the update. So we update from step k to step k plus one, where the iteration step is written in brackets, and in this vector notation the new parameters are the previous parameters from step k plus lambda, which is essentially the step size of our optimizer, times yi in the one entry and yi times xi in the other entry of the vector. Generally, lambda can also simply be chosen as one, which simplifies the update step. The procedure then has as input the training samples, the different xi and yi in the set S, and we initialize, for example, alpha zero in iteration zero with zero and the vector alpha in iteration zero with zeros as well.
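Putting the pieces together, here is a minimal sketch of the resulting training procedure with per-sample updates; the stopping rule (a fixed maximum number of epochs) and the zero initialization are assumptions made for the sketch, not prescriptions from the lecture.

```python
import numpy as np

def train_perceptron(X, y, lam=1.0, max_epochs=100):
    """Train a Rosenblatt perceptron with an update after each visited misclassification.

    X: (n, d) feature vectors, y: (n,) labels in {+1, -1}, lam: step size lambda.
    """
    n, d = X.shape
    alpha = np.zeros(d)   # alpha^(0): normal vector, initialized with zeros here
    alpha0 = 0.0          # alpha_0^(0): offset, initialized with zero

    for _ in range(max_epochs):
        misclassified = 0
        for xi, yi in zip(X, y):
            # sample is misclassified if y_i * (alpha^T x_i + alpha_0) <= 0
            if yi * (np.dot(alpha, xi) + alpha0) <= 0:
                # update rule: [alpha_0; alpha] += lambda * [y_i; y_i * x_i]
                alpha0 += lam * yi
                alpha += lam * yi * xi
                misclassified += 1
        if misclassified == 0:   # linearly separable data: no misclassifications left
            break
    return alpha0, alpha
```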