Okay, good morning everybody. Let's start with the usual recap. We talked about regression after talking about classification for quite a while, starting with linear regression. The goal here is that we have a cloud of points in a plane and we are trying to find the linear function that minimizes the distance between the graph of that function and our data points. As usual we assume we have a set of examples, in this case consisting of pairs of real numbers, and the goal is to find a linear function, i.e. one that satisfies the linearity constraint, that fits them. Then we do a bit of math: we try to minimize the squared error loss, plug in our hypothesis, look for the minimum, and it turns out that conveniently we find a closed-form solution, namely two equations which give us the parameters w1, the linear factor, and w0, the constant summand of our hypothesis. That is nice, and with univariate linear regression we do have such a closed form.
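For reference, this is the standard closed form for the least-squares line through N examples (x_j, y_j), presumably the same two equations as on the slide:

$$w_1 = \frac{N \sum_j x_j y_j - \big(\sum_j x_j\big)\big(\sum_j y_j\big)}{N \sum_j x_j^2 - \big(\sum_j x_j\big)^2}, \qquad w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}$$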
When we don't have such a closed form, we have to do something like gradient descent. With linear regression it turns out that the loss as a function of the weights happens to be convex, which makes the problem rather easy: if we do any kind of greedy search that follows the derivative of the loss function downhill, we are guaranteed to end up in the global minimum. By the way, I double-checked: convex really does mean that if I connect any two points on the graph, the connecting line lies above the graph of the function. If it consistently lies below the graph, the function is called concave. Unfortunately it also turns out that some people use "convex" for either convex or concave, so things get a bit murky; at some point later in the slides there's going to be a maximization problem and it still says the function is convex. Russell and Norvig actually use the convention that convex just means one of the two, either convex or concave. Both variants exist, annoyingly. In any case, our loss is convex in the strict sense, which means any kind of gradient descent is going to end up at the global minimum anyway.
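For the record, the strict definition: a function f is convex if the line segment between any two points on its graph never lies below the graph, i.e.

$$f\big(\lambda x + (1-\lambda)\, y\big) \;\le\; \lambda f(x) + (1-\lambda) f(y) \qquad \text{for all } x, y \text{ and } \lambda \in [0, 1].$$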
Again, as mentioned, with univariate linear regression it doesn't make much sense to use gradient descent since we have the closed form, but of course we're going to generalize at some point, so looking at gradient descent is still a good idea. Here's the general algorithm. We have some function f, in our case the loss function, a vector of weights w, and some learning rate alpha. Until our weights have converged, we iterate over all of the weights and adjust each one by taking the partial derivative of the loss function with respect to that particular weight, multiplying it by alpha, and subtracting the result from the current value of the weight. If we iterate like that, then in the case of linear regression we are guaranteed to converge at some point.
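As a minimal sketch of that loop, assuming a function grad that returns the partial derivatives of the loss at the current weights (the names and default values here are mine, not from the slides):

```python
def gradient_descent(grad, w, alpha=0.01, tol=1e-8, max_steps=100_000):
    """Generic gradient descent: step against the gradient until the
    weights stop moving (or we give up after max_steps iterations)."""
    for _ in range(max_steps):
        g = grad(w)                                    # partial derivatives at w
        new_w = [wi - alpha * gi for wi, gi in zip(w, g)]
        if max(abs(nw - wi) for nw, wi in zip(new_w, w)) < tol:
            return new_w                               # converged
        w = new_w
    return w
```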
Alpha we call the learning rate. We can either fix it once and for all, or we can do what is called learning rate decay, which is to say we start with a rather high learning rate and then, as we iterate over our data samples, gradually reduce it, which helps the whole thing converge; after all, if alpha eventually reaches zero, the weights definitely stop changing. If we do this for the quadratic loss in the case of linear regression, we end up with two particular gradient descent update rules.
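For a single example (x, y) and the hypothesis h_w(x) = w1 x + w0 with quadratic loss, these work out to (with the constant factor 2 from the derivative absorbed into alpha):

$$w_0 \leftarrow w_0 + \alpha \big(y - h_{\mathbf{w}}(x)\big), \qquad w_1 \leftarrow w_1 + \alpha \big(y - h_{\mathbf{w}}(x)\big)\, x$$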
We can also do that for multiple examples at once: we compute the loss for a whole set of points and then do one update of our weights based on the total combined loss. That is called batch gradient descent, and the data we use for one such update step is the corresponding batch. If we do that for random subsets of our examples with some fixed batch size, we call it stochastic gradient descent, which is basically the default, most popular way to do gradient descent on, for example, neural networks. Typically, if you have a couple of thousand data examples, you batch them into groups of something like 20, compute the loss for all of those 20, and then do one update step after every 20 forward passes.
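A rough sketch of that for our univariate case, again with illustrative names and parameter values of my own choosing:

```python
import random

def sgd_linear(data, alpha=0.01, batch_size=20, epochs=100):
    """Minibatch SGD for h(x) = w1*x + w0 with quadratic loss.
    `data` is a list of (x, y) pairs; one weight update per random batch."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)                      # fresh random subsets each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradient of the mean squared error over the batch
            # (constant factor 2 absorbed into alpha, as before).
            g0 = sum(w0 + w1 * x - y for x, y in batch) / len(batch)
            g1 = sum((w0 + w1 * x - y) * x for x, y in batch) / len(batch)
            w0 -= alpha * g0
            w1 -= alpha * g1
    return w0, w1
```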
Okay, in the multivariate case things get a bit trickier, and the most important point here is one trick which we're going to use again and again: any time we have a constant summand as a parameter of our model plus a bunch of linear factors on the inputs, we invent a new dummy coordinate for our inputs, which we set uniformly to one, for the simple reason that this allows us to write the entire expression as a single scalar product of two vectors. So it's really just a notational trick.
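Concretely, with the dummy coordinate x_0 = 1 the hypothesis collapses into one dot product:

$$h_{\mathbf{w}}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = \mathbf{w} \cdot \mathbf{x}, \qquad \mathbf{x} = (1, x_1, \dots, x_n).$$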