5 - Analysis of gradient descent on wide two-layer ReLU neural networks (L. Chizat, EPFL Lausanne) [ID:36962]

Thank you for the introduction. What I am going to present today is joint work with Francis Bach, and it is based on several papers that we have written together, which will be listed at the end of this talk. It deals with two-layer neural networks, so let me first describe this class of models and where they are used. We consider a standard supervised learning setting where we are given a pair of random variables x and y: x is the input, living in R^d, where d is typically large in the applications we have in mind, and y is the output, typically a scalar, or just plus one or minus one in classification tasks. The goal, given n i.i.d. samples from this pair of random variables, which we call the training set, is to build a function h, which I will call a predictor, that tries to predict the correct output y on unobserved inputs x.

In this talk we will consider a specific class of predictors, and this class is really huge. These predictors are parameterized by a set of weight vectors, called w_j, and they have the following form: they are the sum of m simple functions phi(w_j, x). Here m will be called the width of the two-layer neural network, and the function phi has the following form: it is the composition of an affine transform of the input, parameterized by a weight a (a is typically called the input weight of the neural network; in this representation it corresponds to these parameters), then a non-linearity, which is just the non-negative part, and then the output weight. (Can you see the bar at the bottom of my screen? It's a bit annoying. -- No, no; the last line I see is "phi is 2-homogeneous". -- Oh, perfect, then that's fine.) The output weight here is just a scalar. So these simple functions have the form of ridge functions, and we sum m such functions with a scaling factor of one over m, because we will be interested in taking the limit where m goes to infinity.

So w is the parameter of one simple function; it lives in R^(d+2). Here, to fix ideas, I have written the model with the rectified linear unit, but we could cover other types of functions phi with the tools I will present. What is key is that this function phi is 2-homogeneous in the parameters w: this means that when I multiply the parameters by some positive scalar r, the output is multiplied by r squared, because both the input weights and the output weight are multiplied by r. What is important in this talk is this 2-homogeneity property, which is specific to two-layer neural networks, and also the separable structure of the predictor h, which is also specific to two-layer neural networks.
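To make this parameterization concrete, here is a minimal NumPy sketch of the predictor just described, together with a numerical check of the 2-homogeneity property. It is only an illustration of the formulas above; all names (phi, predictor, W) are my own and not taken from the talk or the papers.

```python
import numpy as np

def phi(w, x):
    """One unit: phi(w, x) = b * max(a . (x, 1), 0), with w = (a, b).

    a in R^{d+1} are the input weights (the appended 1 carries the bias),
    b in R is the output weight, so w lives in R^{d+2}.
    """
    a, b = w[:-1], w[-1]
    x_tilde = np.append(x, 1.0)            # append a 1 to handle the bias
    return b * max(np.dot(a, x_tilde), 0.0)

def predictor(W, x):
    """h(x) = (1/m) * sum_j phi(w_j, x), where W stacks the m weight vectors."""
    m = W.shape[0]
    return sum(phi(w, x) for w in W) / m

# 2-homogeneity check: scaling one weight vector w by r > 0 scales phi(w, x)
# by r**2, because both the input weights a and the output weight b pick up r.
rng = np.random.default_rng(0)
d, m = 5, 100
x = rng.normal(size=d)
W = rng.normal(size=(m, d + 2))
r = 3.0
assert np.isclose(phi(r * W[0], x), r**2 * phi(W[0], x))
```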

All right, so now that we have defined the class of predictors that we consider, the goal is to select the correct weights w_j to do the learning task, that is, the prediction. To do that we choose a convex loss, so you can think of the logistic loss for classification tasks or the square loss for regression tasks, and then we define the empirical risk, potentially with a weight-decay regularization. It is a function of all the parameters of the neural network, which I call F_m, where m is the width of the neural network, and it is the sum of the empirical risk, that is the sum of the losses over the training set, plus lambda, a regularization factor, times the sum of the squares of all the parameters. This is usually called weight decay in machine learning. I see that someone has raised a hand.

Yes, Lénaïc, the fact that you assume this degree-two homogeneity excludes that in your neural network you allow yourself the traditional, say, drift or translation coefficient?

Ah, okay. So here, for the input layer, we can have the translation coefficient: it is handled by appending a one as the last coordinate of the input vector, which corresponds to adding a so-called bias in machine learning. But in the output layer I do not put a bias, otherwise this property would be lost. We do not gain anything in representation power by adding a bias in the output layer, so essentially it is not a restrictive assumption for two-layer neural networks.

But there is a bias in the ReLU, in the sense that you can translate the ReLU, right?

Yes, it is possible, because here I have appended this one as the last coordinate.

Okay, thank you.
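Picking up on the objective just described, here is a hedged sketch of the regularized empirical risk, reusing numpy and the predictor function from the earlier snippet. The choice of the logistic loss and the 1/m scaling of the weight-decay term are illustrative assumptions; the talk only specifies a convex loss and a penalty proportional to the sum of squared parameters.

```python
def empirical_risk(W, X, Y, lam=0.0):
    """F_m(W): empirical risk of the two-layer network plus weight decay.

    Uses the logistic loss for labels y in {-1, +1}; the square loss would
    work the same way for regression. The penalty (lam/m) * sum_j ||w_j||^2
    mirrors the 1/m scaling of the predictor; this normalization is an
    assumption, the talk only says "lambda times the sum of squared parameters".
    """
    m = W.shape[0]
    preds = np.array([predictor(W, x) for x in X])     # h(x_i) for each sample
    data_term = np.mean(np.log1p(np.exp(-Y * preds)))  # logistic loss, averaged over n
    weight_decay = lam * np.sum(W ** 2) / m            # (lam/m) * sum_j ||w_j||^2
    return data_term + weight_decay

# Example usage on a tiny synthetic training set (reusing rng, d, W from above).
X = rng.normal(size=(20, d))
Y = np.sign(rng.normal(size=20))
print(empirical_risk(W, X, Y, lam=0.01))
```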

So we have this empirical risk, potentially with regularization; I will consider both the regularized and the unregularized case. It is notoriously difficult to minimize this function because it is

Part of a video series:
Accessible via: Open access
Duration: 00:43:27 min
Recording date: 2021-10-22
Uploaded on: 2021-10-22 12:46:16
Language: en-US
