Okay, good morning everybody. Let's start with the usual recap. We talked about regression after talking about classification for quite a while, starting with linear regression. The goal here is that we have a cloud of points in a plane and we are trying to find the linear function that minimizes the distance between the graph of that function and our data points. As usual we assume we have a set of examples, in this case consisting of pairs of real numbers, and the goal is to find a linear function, i.e. one that satisfies the linearity constraint, that fits them. Then we do a bit of math: we try to minimize the squared error loss, plug in our hypothesis, look for the minimum, and it turns out that conveniently we find a closed-form solution, namely two equations which give us the parameters w1, the linear factor, and w0, the constant summand of our hypothesis. That is nice, and with univariate linear regression we do have such a closed form.
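For reference, this is the standard closed form for the least-squares line through N examples (x_j, y_j), presumably the same two equations as on the slide:

$$w_1 = \frac{N \sum_j x_j y_j - \big(\sum_j x_j\big)\big(\sum_j y_j\big)}{N \sum_j x_j^2 - \big(\sum_j x_j\big)^2}, \qquad w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}$$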
When we don't have such a closed form, we have to do something like gradient descent. With linear regression it turns out that the loss as a function of the weights happens to be convex, which makes the problem rather easy: if we do any kind of greedy search that follows the derivative of the loss function downhill, we are guaranteed to end up in the global minimum. By the way, I double-checked: convex really does mean that if I connect any two points on the graph, the connecting line lies above the graph of the function. If it consistently lies below the graph, the function is called concave. Unfortunately it also turns out that some people use "convex" for either convex or concave, so things get a bit murky; at some point later in the slides there's going to be a maximization problem and it still says the function is convex. Russell and Norvig actually use the convention that convex just means one of the two, either convex or concave. Both variants exist, annoyingly. In any case, our loss is convex in the strict sense, which means any kind of gradient descent is going to end up at the global minimum anyway.
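For the record, the strict definition: a function f is convex if the line segment between any two points on its graph never lies below the graph, i.e.

$$f\big(\lambda x + (1-\lambda)\, y\big) \;\le\; \lambda f(x) + (1-\lambda) f(y) \qquad \text{for all } x, y \text{ and } \lambda \in [0, 1].$$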
Again, as mentioned, with univariate linear regression it doesn't make much sense to use gradient descent since we have the closed form, but of course we're going to generalize at some point, so looking at gradient descent is still a good idea. Here's the general algorithm. We have some function f, in our case the loss function, a vector of weights w, and some learning rate alpha. Until our weights have converged, we iterate over all of the weights and adjust each one by taking the partial derivative of the loss function with respect to that particular weight, multiplying it by alpha, and subtracting the result from the current value of the weight. If we iterate like that, then in the case of linear regression we are guaranteed to converge at some point.
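As a minimal sketch of that loop, assuming a function grad that returns the partial derivatives of the loss at the current weights (the names and default values here are mine, not from the slides):

```python
def gradient_descent(grad, w, alpha=0.01, tol=1e-8, max_steps=100_000):
    """Generic gradient descent: step against the gradient until the
    weights stop moving (or we give up after max_steps iterations)."""
    for _ in range(max_steps):
        g = grad(w)                                    # partial derivatives at w
        new_w = [wi - alpha * gi for wi, gi in zip(w, g)]
        if max(abs(nw - wi) for nw, wi in zip(new_w, w)) < tol:
            return new_w                               # converged
        w = new_w
    return w
```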
Alpha we call the learning rate. We can either fix it once and for all, or we can do what is called learning rate decay, which is to say we start with a rather high learning rate and then, as we iterate over our data samples, gradually reduce it, which helps the whole thing converge; after all, if alpha eventually reaches zero, the weights definitely stop changing. If we do this for the quadratic loss in the case of linear regression, we end up with two particular gradient descent update rules.
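For a single example (x, y) and the hypothesis h_w(x) = w1 x + w0 with quadratic loss, these work out to (with the constant factor 2 from the derivative absorbed into alpha):

$$w_0 \leftarrow w_0 + \alpha \big(y - h_{\mathbf{w}}(x)\big), \qquad w_1 \leftarrow w_1 + \alpha \big(y - h_{\mathbf{w}}(x)\big)\, x$$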
We can also do that for multiple examples at once: we compute the loss for a whole set of points and then do one update of our weights based on the total combined loss. That is called batch gradient descent, and the data we use for one such update step is the corresponding batch. If we do that for random subsets of our examples with some fixed batch size, we call it stochastic gradient descent, which is basically the default, most popular way to do gradient descent on, for example, neural networks. Typically, if you have a couple of thousand data examples, you batch them into groups of something like 20, compute the loss for all of those 20, and then do one update step after every 20 forward passes.
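A rough sketch of that for our univariate case, again with illustrative names and parameter values of my own choosing:

```python
import random

def sgd_linear(data, alpha=0.01, batch_size=20, epochs=100):
    """Minibatch SGD for h(x) = w1*x + w0 with quadratic loss.
    `data` is a list of (x, y) pairs; one weight update per random batch."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)                      # fresh random subsets each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradient of the mean squared error over the batch
            # (constant factor 2 absorbed into alpha, as before).
            g0 = sum(w0 + w1 * x - y for x, y in batch) / len(batch)
            g1 = sum((w0 + w1 * x - y) * x for x, y in batch) / len(batch)
            w0 -= alpha * g0
            w1 -= alpha * g1
    return w0, w1
```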
Okay, in the multivariate case things get a bit trickier, and the most important point here is one trick which we're going to use again and again: any time we have a constant summand as a parameter of our model plus a bunch of linear factors on the inputs, we invent a new dummy coordinate for our inputs, which we set uniformly to one, for the simple reason that this allows us to write the entire expression as a single scalar product of two vectors. So it's really just a notational trick.
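Concretely, with the dummy coordinate x_0 = 1 the hypothesis collapses into one dot product:

$$h_{\mathbf{w}}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i = \sum_{i=0}^{n} w_i x_i = \mathbf{w} \cdot \mathbf{x}, \qquad \mathbf{x} = (1, x_1, \dots, x_n).$$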