Thank you for the introduction. What I'm going to present today is joint work with Francis Bach, and it is based on several papers that we have written together, which will be listed at the end of this talk. It deals with two-layer neural networks.
So let me first describe this class of models and where they are used. We consider a standard supervised learning setting where we are given a pair of random variables x and y, where x is the input, living in R^d, with d typically large in the applications we have in mind, and y is an output, typically a scalar, or just plus one or minus one in classification tasks. The goal, given n i.i.d. samples from this pair of random variables, which we call the training set, is to build a function h, which I will call a predictor, that tries to predict the correct output y on unobserved inputs x.
In this talk we will consider a specific class of predictors, which can be really large. These are predictors parameterized by a set of weight parameters called w_j, and they are of the following form: they are the sum of m simple functions phi(w_j, x), where m will be called the width of the two-layer neural network. The function phi is of the following form: it is the composition of an affine transform of the input, parameterized by a weight a, where a is typically called the input weight of the neural network; in this representation it corresponds to these parameters. Then we have a non-linearity, which is just the positive part, and then we have the output weights. Can you see the bar at the bottom of my screen? It's a bit annoying. No? So the last line you see is "phi is 2-homogeneous"? Perfect, then that's fine. The output weight here is just a scalar. So these simple functions have the form of ridge functions, and we sum m such functions with a scaling factor of one over m, because we will be interested in taking the limit where m goes to infinity. So w is the parameter of one simple function, and it lives in R^(d+2). Here, to fix ideas, I've written the model with the rectified linear unit (ReLU), but we could cover other types of functions phi with the tools I will present. What is key is that this function phi is 2-homogeneous in the parameters w: this means that when I multiply the parameters by some positive scalar r, the output is multiplied by r squared, because both the input weights and the output weight are multiplied by r. What's important in this talk is this 2-homogeneity property, which is specific to two-layer neural networks, and also the separable structure of the predictor h, which is again specific to two-layer neural networks.
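To make this concrete, here is a minimal NumPy sketch of the model as described above; the function names (phi, predictor) and the array shapes are my own choices for illustration, not notation from the talk or the papers.

```python
import numpy as np

def phi(w, x):
    """One simple function: a ReLU ridge unit b * max(a . (x, 1), 0).

    w = (a, b), with a in R^(d+1) the input weights (the appended 1 on x
    gives the bias) and b a scalar output weight, so w lives in R^(d+2).
    """
    a, b = w[:-1], w[-1]
    return b * max(np.dot(a, np.append(x, 1.0)), 0.0)

def predictor(W, x):
    """Two-layer network h(x) = (1/m) * sum_j phi(w_j, x), with W of shape (m, d+2)."""
    m = W.shape[0]
    return sum(phi(w, x) for w in W) / m

# 2-homogeneity: scaling one unit's parameters by r > 0 scales its output by r^2,
# since both the input weights a and the output weight b get multiplied by r.
rng = np.random.default_rng(0)
d, m = 5, 10
W = rng.standard_normal((m, d + 2))
x = rng.standard_normal(d)
r = 3.0
assert np.isclose(phi(r * W[0], x), r**2 * phi(W[0], x))
```

The separable structure mentioned above is visible here: h is just an average of per-unit contributions phi(w_j, x), with no interaction between units inside phi.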
All right, so now that we have defined the class of predictors that we consider, the goal is to select the correct weights w_j to do the learning task, to do the prediction. To do that, we choose a convex loss; you can think of the logistic loss for classification tasks or the squared loss for regression tasks. We then define the empirical risk, potentially with a weight decay regularization, as a function of all the parameters of the neural network, which I call F_m, where m is the width of the neural network. It is the empirical risk, that is, the sum of the losses over the training set, plus lambda, a regularization factor, times the sum of the squares of all the parameters; this is usually called weight decay in machine learning.
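As a rough sketch of this objective, reusing the predictor from the snippet above; the 1/n averaging and the exact placement of lambda are my reading of the spoken description, not a verbatim formula from the slides.

```python
import numpy as np

def squared_loss(pred, target):
    """Convex loss for regression."""
    return 0.5 * (pred - target) ** 2

def logistic_loss(pred, target):
    """Convex loss for classification, with target in {-1, +1}."""
    return np.log1p(np.exp(-target * pred))

def objective(W, X, Y, lam, loss=squared_loss):
    """F_m(W): empirical risk plus weight decay.

    F_m(W) = (1/n) * sum_i loss(h(x_i), y_i) + lam * sum_j ||w_j||^2,
    where h is the two-layer predictor defined earlier.
    (Normalizations, e.g. an extra 1/m on the regularizer, vary across papers.)
    """
    n = X.shape[0]
    data_term = sum(loss(predictor(W, X[i]), Y[i]) for i in range(n)) / n
    weight_decay = lam * np.sum(W ** 2)   # sum of squares of all parameters
    return data_term + weight_decay
```

So I see that someone has raised a hand.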
Yes, Lénaïc, the fact that you assume this degree-two homogeneity, does it exclude that in your neural network you allow yourself the traditional, say, drift or translation coefficient?
Ah, okay. So here, for the input layer, we can have the translation coefficient: it is obtained by adding a one as the last coordinate of the input vector, so this corresponds to adding a so-called bias in machine learning. But in the output layer I don't put a bias, otherwise this property would be lost. We don't gain anything in representation power by adding a bias in the output layer, so essentially it is not a restrictive assumption for two-layer neural networks. But there is a bias in the ReLU, in the sense that you can translate the ReLU, right? Yes, it's possible, because here I've added this one as the last coordinate. Okay, thank you.
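In case it helps, here is the augmentation trick from this exchange spelled out as a small self-contained check, with made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.standard_normal(d)
x_aug = np.append(x, 1.0)            # x in R^d  ->  (x, 1) in R^(d+1)

v = rng.standard_normal(d)           # slope part of the input weights
c, b = 0.7, 1.3                      # c: bias folded into a; b: output weight
a = np.append(v, c)                  # a = (v, c) in R^(d+1)

# The unit computes b * max(a . (x, 1), 0) = b * max(v . x + c, 0):
# a translated (biased) ReLU, while phi stays 2-homogeneous in w = (a, b).
assert np.isclose(b * max(np.dot(a, x_aug), 0.0),
                  b * max(np.dot(v, x) + c, 0.0))
```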
So we have this empirical risk, potentially with regularization; I will consider both the regularized and the unregularized case. It is notoriously difficult to minimize this function because it is non-convex in the parameters.