Perfect, fine, okay. Well, thank you. Thank you, Leon and Danielle for inviting me to give this talk in your research group.
So I will be talking briefly about some recent work, but also more generally about the interactions between the fields of deep learning and control theory; the title is somewhat vague, but it is just to set the idea.
So I will mainly talk about supervised learning. This is a subset of machine learning, and it consists essentially in approximating a function which you don't know, but of which you have input-output samples.
So the inputs x_i live in d-dimensional space, and the outputs y_i, which are called labels, can live in different kinds of spaces. Generally we distinguish two types of labels, and hence two types of supervised learning tasks, depending on these outputs:
classification tasks, where the labels take discrete values, say M discrete values which we can identify with the set {1, ..., M}; and regression tasks, perhaps more standard in statistics, where we aim to approximate a function from R^d to R^m which takes continuous values.
In terms of applications: for classification, imagine you only have two classes and you want to say yes or no, positive or negative, as in spam filters or cancer tests. I give my machine an image in, say, R^9216, a picture with 96 times 96 pixels, and I want the machine to predict whether the picture represents breast cancer (output plus one) or not (output minus one). This is a typical classification task.
In regression tasks, on the other hand, we are interested in approximating a function which takes continuous values. For me, the most natural application of this is data-driven PDE analysis, and data-driven analysis of different kinds of models where we don't know the solution of the system but we know some snapshots of it: we know, for example, some points in space and some time points, and the values of the solution at these points, and we want to approximate the entire solution from them. This is what you also see in the graph. So we distinguish two tasks; in reality, when you set up the problem, the two tasks are not so different, but in the modeling part you will see that there are some differences between them.
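To fix notation, here is a schematic summary of the two tasks as just described (my own notation, not taken from the slides):

```latex
% Supervised learning: approximate an unknown f from N input-output samples
\[ \{(x_i, y_i)\}_{i=1}^{N}, \qquad x_i \in \mathbb{R}^{d}, \qquad y_i \approx f(x_i). \]

% Classification: labels in a finite set of M classes
\[ y_i \in \{1, \dots, M\} \qquad (M = 2: \text{ spam / not spam, cancer / no cancer}). \]

% Regression: labels take continuous values
\[ y_i \in \mathbb{R}^{m} \qquad (\text{e.g. snapshots } y_i = u(t_i, z_i) \text{ of a PDE solution } u). \]
```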
So this is supervised learning in general. And, well, how do we solve these tasks?
Well, it really depends on the data in question. If we have some very simple data, let us say we have a classification task to solve:
we have, say, white and black points in two dimensions, and what we want to do is classify them. Given these two clouds of points, we want to construct an approximation of the unknown function which generated them, so that given a new point in the plane we are able to say whether it is black or white. For binary classification tasks such as this, where the data is separated by a line, by a hyperplane,
we can imagine a very simple strategy: we propose a model which just takes some parameter w, which we are going to optimize, and outputs the sign of w times the input. Because the outputs are going to be plus or minus one, I just take the sign function, and we try to fit this parameter so that the model reproduces the known outputs and predicts new ones.
I can do something similar for regression tasks, where instead of having y_i equal to minus one or one I can have all kinds of continuous values, and instead of the sign I just take the identity function. This is the simplest model in statistics, linear regression with least-squares fitting, so this is really the basic statistical model.
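As a rough illustration of these two baseline models, here is a minimal sketch (the toy data, variable names, and the way the parameters are fitted are my own choices, not something from the talk):

```python
import numpy as np

# Toy data: inputs x_i in R^2, with both kinds of labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y_class = np.sign(X[:, 0] + 0.5 * X[:, 1])                       # labels in {-1, +1}
y_reg = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=100)   # continuous labels

# Linear regression / least squares: find w minimizing ||Xw - y||^2,
# and use the identity as the output function.
w_reg, *_ = np.linalg.lstsq(X, y_reg, rcond=None)
predict_reg = lambda x: x @ w_reg

# Binary linear classifier: same linear map, but the output is its sign.
# (Here w is also fitted by least squares, just as one simple choice.)
w_cls, *_ = np.linalg.lstsq(X, y_class, rcond=None)
predict_cls = lambda x: np.sign(x @ w_cls)

print(predict_cls(np.array([1.0, 0.0])), predict_reg(np.array([1.0, 0.0])))
```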
However, the issue with these two proposals is that, of course, the data is not going to be as simple as this in practice, so we have to envisage and use more complicated models which cover this case but also generalize to more complex data sets.
So I immediately skip to neural networks. Of course, there are many other models used in statistics, and I will not be able to name all of them, but in this talk I am going to concentrate on neural networks.
There are many equivalent ways you can define them. In the computer science literature there is a tendency to define them via a graphical representation; I much prefer writing them down as a discrete dynamical system, where we put the data as the initial value. So we are going to solve the same dynamical system for every data point: we look for a single set of parameters W_k and b_k shared by all the data.
This is the important point: we plug in every single data point from our data set, and we transform it by means of a sequence of parameterized affine maps, which are then composed with fixed, simple nonlinearities. The nonlinearities, we will assume throughout this work, are globally Lipschitz continuous and equal to zero at the origin.
The two typical examples are the hyperbolic tangent and the ReLU, but there are many others. So this is basically what a neural network is. It has these parameters W_k, b_k, which we are going to optimize, but it also has a bunch of hyperparameters which are fixed a priori, such as the number of layers, which is called the depth, but also the widths: we can see that in this formulation, at each new layer, the dimension of the unknown state can change, so it varies, and this is also a hyperparameter which you are supposed to estimate a priori. This basic definition of a neural network is, in machine learning jargon, commonly called a fully connected feedforward neural network or multilayer perceptron; I think multilayer perceptron is perhaps more popular and more in line with the historical background.
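Written out, the discrete dynamical system I am referring to looks roughly as follows (my own notation; the formula on the slide may differ in details):

```latex
% Forward pass of the network, viewed as a discrete dynamical system:
\[ x_i^{k+1} = \sigma\!\left( W_k\, x_i^{k} + b_k \right), \qquad k = 0, \dots, L-1, \qquad x_i^{0} = x_i \in \mathbb{R}^{d}, \]

% with layer-dependent widths d_k (hyperparameters fixed a priori):
\[ W_k \in \mathbb{R}^{d_{k+1} \times d_k}, \qquad b_k \in \mathbb{R}^{d_{k+1}}, \qquad d_0 = d, \]

% and the prediction read off from the final state x_i^{L}.
```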
And so, why use neural networks? Is it reasonable, does it make sense, why would you use this?
So a first, say, mathematical justification of why neural networks are perhaps a reasonable choice for solving supervised learning tasks is the fact that they are universal approximators, in the sense that one can prove that the sets generated by different neural network architectures are dense subsets of certain spaces of functions. The most canonical result in this direction is the Cybenko universal approximation theorem; I believe this is the first result in this direction. It considers the simplest possible neural network you can have: only one single hidden layer, with a variable width.
And you consider a sigmoid function which is not polynomial; well, in his paper he considers a very specific class, the so-called sigmoidal functions, but what actually matters for his proof is a different property.
Basically, what he proves is that the set generated by these kinds of functions is dense in the space of continuous functions on any compact set. Similarly to the previous slide, we only have one layer, so each term is just an affine map, a translation and a rotation, composed with this squashing function sigma, a nonlinear composition. I put [-1, 1]^d here, but this is true for any compact set.
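Written out, the statement being described is roughly the following (in my own notation):

```latex
% Cybenko's theorem: single-hidden-layer networks with variable width
\[ \mathcal{G} = \Big\{ x \mapsto \sum_{j=1}^{N} c_j\, \sigma\!\big( w_j^{\top} x + b_j \big) \;:\; N \in \mathbb{N},\ c_j, b_j \in \mathbb{R},\ w_j \in \mathbb{R}^{d} \Big\} \]

% is dense in C(K) for every compact K \subset \mathbb{R}^{d} (e.g. K = [-1,1]^{d}),
% whenever \sigma is sigmoidal (Cybenko, 1989) or, more generally, discriminatory.
```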
And how did Cybenko prove this? It is essentially a Hahn-Banach argument by contradiction: if this set were not dense, Hahn-Banach would give a nonzero continuous functional vanishing on its closure, and then, via the Riesz representation theorem, you find a sufficient condition on the nonlinearity sigma, the discriminatory property, which ensures that this density result is true. The paper itself is very short, and I recommend anyone interested in these kinds of directions to read it.
And of course this result is really the most basic one, and it has since been extended to multilayer networks: I can fix the number of layers but still vary the width with epsilon, so if I want to be epsilon-close then I need something like one over epsilon neurons in some layer. People have also considered the dual direction, which is to instead fix the width of the neural network and increase the depth with epsilon.
And then, just last week, I found out about a very surprising result which still shocks me, and whose proof I have not completely understood: the paper by Maiorov and Pinkus from 1999. So, the first two directions I was talking about say that, to approximate a continuous function to within epsilon, you have to be able to increase something in the neural network: in Cybenko's case you increase the width, so you add more neurons in the single layer, whereas in the papers by Terry Lyons and the papers he cites, they increase the depth, so you fix the width, say equal to d, and then increase the depth, and you can do this for a wide variety of architectures. But there is this very counterintuitive paper by Maiorov and Pinkus where, basically, you fix both the depth and the width: you fix two hidden layers, so not one like Cybenko but two, and they say that you can take a specific number of neurons, a specific width in each of these two layers, depending only on the dimension, which allows you to approximate any continuous function on a compact set to within epsilon. Of course, you are still paying something here, because even though the architecture is fixed, you pay instead through the norm of the parameters.
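Schematically, and with the caveat that the exact widths and the precise conditions should be checked against the Maiorov-Pinkus paper, the statement has the following form (my notation):

```latex
% Maiorov-Pinkus (1999), schematic form: two hidden layers of fixed widths
% n_1, n_2 depending only on d, with a special smooth, strictly increasing,
% sigmoidal activation \sigma. For every f in C([-1,1]^d) and every eps > 0:
\[ \Big| f(x) - \sum_{i=1}^{n_2} c_i\, \sigma\!\Big( \sum_{j=1}^{n_1} a_{ij}\, \sigma\!\big( w_{ij}^{\top} x + b_{ij} \big) + \beta_i \Big) \Big| < \varepsilon \qquad \text{for all } x \in [-1,1]^{d}, \]

% for suitable parameters (which, unlike the architecture, do depend on f and \varepsilon).
```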
Borjan Geshkovski (UAM) on "The interplay of deep learning and control theory"