6 - Francis Bach: On the convergence of gradient descent for wide two-layer neural networks [ID:16039]

Okay, so first I would like to thank the organizers for inviting me. It's a pleasure to give a talk from my bedroom. Before I start, I have given instructions not to be disturbed, but as all attendees with children will know, sometimes that does not work. This being said, I hope to be left alone for one hour. Before I start, I would also like to acknowledge the amazing work of my colleague Lénaïc Chizat, who is behind most of the ideas I am going to present today.

Today I am going to consider supervised learning in a parametric framework, where I am given some observations (x_i, y_i): x_i can be an image, y_i a label for that image, and the goal is to predict y given x for a new data point. This is typically done through a prediction function h, which has two inputs, the data x and some parameter theta, which will live in R^d. Within machine learning, over the last 20 years or so, people have considered two main types of prediction functions. The first one is linear in the parameter theta, but it can of course be non-linear in the input x through the use of a feature vector Phi(x). A classical example is advertising, where Phi(x) is a pretty large vector of zeros and ones encoding, say, your navigation history, but there are many applications of linear models in speech, biometrics, and so on. Such linearity will make our life easier.
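As a minimal illustration of such a model (not code from the talk; the feature map used here is a made-up example), a prediction function that is linear in theta could look like this:

import numpy as np

def phi(x):
    # Hypothetical feature map: the raw inputs, their squares, and a constant.
    return np.concatenate([x, x ** 2, [1.0]])

def h_linear(x, theta):
    # h(x, theta) = theta^T phi(x): linear in theta, non-linear in x through phi.
    return np.dot(theta, phi(x))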

But in many problems, people want to go beyond linear models, and this is where neural networks come in. A classical application is computer vision, where the input x is an image and y is the output label. For those problems, linear models are not state of the art anymore, and people now use a sequence of linear maps plus non-linear activation functions: these are neural networks, where you start from a linear model theta_1 transpose x, so you combine your inputs linearly, apply some non-linear activation function sigma, and then go on for a while until you get r layers. This is a so-called deep neural network when r is big. Here you see an example with only two hidden layers. Keep in mind that if you consider only the last layer, theta_r, this is a linear model, but it becomes a non-linear model as soon as you start to include the earlier layers in the learning process.

Learning is then done by formulating an optimization problem, where you want to minimize a regularized version of the empirical risk. You have a data-fitting term, a loss function between y_i, the label or output, and your prediction; you can think of the loss as being the logistic loss, as we will see today, or the square loss, so essentially a convex loss in h. This is typically seen as a weak assumption, since most losses used for learning are convex, and you add some regularizer to avoid overfitting.
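In formulas, the objective described in this passage can be written as (with lambda the regularization strength and Omega the regularizer; the notation is mine, not the speaker's slide):

\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, h(x_i, \theta)\big) \; + \; \lambda \, \Omega(\theta)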

This is how machine learning is classically cast as an optimization problem, but of course we will come back to that later in the talk; it is just a means to an end. The real goal is to do well on unseen data. I will often call this test error the expected risk; it is unobserved, because you do not observe the distribution of test points, while the quantity we actually minimize will be called the empirical risk. So this is the classical issue: we want to minimize the expected risk, while we only have access to the empirical risk.
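Written out (again in my own notation), the two quantities being contrasted are

\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, h(x_i, \theta)\big)
\qquad \text{versus} \qquad
R(\theta) = \mathbb{E}_{(x, y)} \, \ell\big(y, h(x, \theta)\big),

the empirical risk computed on the training data and the expected risk under the (unknown) test distribution.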

So when is this optimization problem convex? Sufficient conditions for convexity are to have a convex loss, which again is a weak assumption, and, typically, to have predictions that are linear in the parameters. This will not work for neural networks. What are the positive consequences of convexity? There are two of them. First, as soon as the problem is convex, you get access to a long line of efficient algorithms, which are typically gradient-based. Of course, these can also be defined for non-convex functions, and in fact we will use gradient descent in the non-convex framework. But once you are convex, these algorithms come with quantitative analyses, both in terms of running time, the time it takes to optimize, and in terms of prediction performance on unseen data.
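For reference, the basic gradient descent iteration on an objective F (the regularized empirical risk here) with step size gamma reads

\theta_{t+1} = \theta_t - \gamma \, \nabla F(\theta_t).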

There is a long line of such works: you can do acceleration, you can do sampling, and so on. This is one part of what I call the golden years of convexity in machine learning, which I start in 1995 with the SVM, and which I end this year, because I had to choose a year. There are many examples from those golden years: they started with SVMs and kernel methods, but all the work on sparsity and low rank was clearly an important aspect of convexity, and optimal transport is a fashionable one at the moment.

Access: Open Access

Duration: 00:49:31 min

Recording date: 2020-05-18

Uploaded on: 2020-05-19 12:16:24

Language: en-US

Many supervised learning methods are naturally cast as optimization problems. For prediction models which are linear in their parameters, this often leads to convex problems for which many guarantees exist. Models which are non-linear in their parameters such as neural networks lead to non-convex optimization problems for which guarantees are harder to obtain. In this talk, I will consider two-layer neural networks with homogeneous activation functions where the number of hidden neurons tends to infinity, and show how qualitative convergence guarantees may be derived. I will also highlight open problems related to the quantitative behavior of gradient descent for such models.
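As a hedged sketch of the setting in the abstract (my own toy example, not the speaker's code): a two-layer network with the positively homogeneous ReLU activation and a large number m of hidden neurons, trained by plain full-batch gradient descent on the square loss.

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 5, 1000            # samples, input dimension, hidden neurons (taken large)
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])               # toy regression target

a = rng.standard_normal(m)        # output weights
B = rng.standard_normal((m, d))   # input weights, one row per hidden neuron
lr = 0.1

for step in range(500):
    pre = X @ B.T                          # (n, m) pre-activations
    act = np.maximum(pre, 0.0)             # ReLU, positively homogeneous
    pred = act @ a / m                     # average over the hidden neurons
    resid = pred - y
    # Gradients of 0.5 * mean squared error with respect to both layers.
    grad_a = act.T @ resid / (n * m)
    grad_B = ((resid[:, None] * (pre > 0.0)) * a[None, :]).T @ X / (n * m)
    a -= lr * grad_a
    B -= lr * grad_B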
