Right, so you were talking with Dennis about neural networks.
All right, and we're talking only about the simplest of networks, namely feed-forward
networks. They are essentially directed acyclic graphs. Compared to human brains,
that's not an accurate model, but the math is kind of simple. Networks with cycles
we call recurrent networks, and as you probably know from a basic class on computer
architecture, you can only have state in a network of gates if you have cycles;
that gives you flip-flops and memory and all of those kinds of things, and that's not
what we're looking at. The surprising thing is that even in this very, very simple feed-forward
network, we can do stuff. The simplest thing you looked at was perceptron networks.
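A single perceptron unit can be sketched in a couple of lines. This is a minimal illustration; the particular weights, and the AND example they encode, are my own choices, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: softens the hard threshold into a smooth "cliff".
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    # One neuron: linear classifier w.x + b, smoothed over by the sigmoid.
    return sigmoid(np.dot(w, x) + b)

# Illustrative weights: this unit approximates Boolean AND on {0, 1} inputs.
w = np.array([10.0, 10.0])
b = -15.0
print(perceptron(np.array([1, 1]), w, b))  # close to 1
print(perceptron(np.array([0, 1]), w, b))  # close to 0
```

Feeding an input forward through a whole layer is just this computation repeated once per output neuron, which is why the weight matrix determines the function.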
Networks that are just one layer. If you think about it, every little neuron is actually
a binary classifier, or has in it this binary classifier, possibly softened by a logistic
function. If we have a perceptron network, that's essentially how it looks: an input layer,
an output layer, and a couple of weights. If we have a fully connected network, then between
any input and output neuron we have a weight, and the weight matrix
really tells us what the function is. In general, the function is something like
a multi-dimensional threshold function, a cliff. That's what a perceptron can be: essentially
a multi-dimensional linear classifier smoothed over by a sigmoid. You look at that, and the
math of this is very easy. If you have a bigger network, with hidden layers in between,
things get more interesting. We can compute more functions, as you've seen; we get essentially
bigger polynomials, which means more interesting functions. The idea in neural
networks is basically that we can adjust the functions they correspond to. We have some input activation
which gets pushed, or fed forward, through the network. The weights change what
happens in the network, so you get an output pattern. You can imagine that the more
weights we have, the more interesting the functions become. The converse of this is that
single-layer perceptrons are relatively boring things. There is only a relatively small set
of functions you can build with single-layer perceptrons. In particular, there are Boolean
functions like the XOR function, where you can already see that the data is
not linearly separable, so a single-layer perceptron cannot learn that function. If the Boolean
function you want to represent is XOR, then we can find a decision tree for XOR, but a single-layer
perceptron can't do the trick. Which is why you almost never hear
about them anymore. We have networks that look more like this, that have multiple layers.
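To make the XOR point concrete, here is a hand-wired two-layer network that computes XOR. The weights are hand-picked for illustration, not learned; one hidden layer of two threshold units is already enough:

```python
import numpy as np

def step(z):
    # Hard threshold, the un-smoothed perceptron activation.
    return (z > 0).astype(float)

def two_layer_xor(x):
    # Hidden layer: first unit fires for OR, second unit fires for AND.
    W1 = np.array([[1.0, 1.0],    # OR-like unit
                   [1.0, 1.0]])   # AND-like unit
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output unit: OR and not AND, which is exactly XOR.
    w2 = np.array([1.0, -2.0])
    b2 = -0.5
    return step(w2 @ h + b2)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, two_layer_xor(np.array(x, dtype=float)))
```

No single linear threshold can separate the XOR points, but the hidden layer re-maps the inputs so that the output unit's job becomes linearly separable.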
Somebody, and I must say I envy this person for the good idea, had the wonderful idea to
say: these things here are much better than those kinds of things. We call those deep learning.
Isn't that a wonderful word? It makes you instantly famous, because you're doing deep stuff.
It means nothing except that we have more than two layers, more than just the input and the output
layers. There's nothing more to it. But it sounds great, doesn't it? Okay, so we have
to do deep learning. It just means we have to understand this math, rather than just
that math. Okay, so as a kind of exercise you looked at perceptron learning. The
idea is that since we're doing supervised learning, we actually have examples, and each example
says an input pattern corresponds to an output pattern. So what could be easier than to say:
we feed the input pattern forward, we get an output, and then we just measure
how wrong that output is. We call that the loss, and then we just minimize that loss. Very
simple. There are a couple of questions left over. How do we measure the loss? Empirically,
the squared error is a good measure, mostly because it's convex. The square is a convex
function, which means it has a unique global minimum. Wonderful. Okay, that's essentially
the reason why we like the squared-error measure. And so we basically do optimization
by gradient descent. Since the math is so linear, essentially, we can
do the calculus very easily; we have differentiable functions. All of that is very nice, so we
can just do what you did in high school, namely realize that at the minimum, the
partial derivatives have to be zero. That's what we're looking for with gradient descent.
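That recipe can be sketched on a one-weight toy problem (the data and learning rate are my own illustrative choices, not the lecture's notation): a convex squared-error loss, its partial derivative, and repeated downhill steps.

```python
import numpy as np

# Toy supervised examples: outputs are roughly 3 * input.
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([0.1, 2.9, 6.1, 8.9])

def loss(w):
    # Squared error: convex in w, so the local minimum is the global one.
    return np.sum((w * X - Y) ** 2)

def grad(w):
    # Partial derivative d(loss)/dw; it is zero exactly at the minimum.
    return np.sum(2 * (w * X - Y) * X)

w = 0.0    # start somewhere
lr = 0.01  # learning rate
for _ in range(200):
    w -= lr * grad(w)  # step downhill, against the gradient

print(w)  # close to 3
```

With many weights nothing conceptual changes: the gradient becomes a vector of partial derivatives, and each step moves all weights downhill at once.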
Presenters
Accessible via
Open access
Duration
01:22:36 min
Recording date
2023-06-13
Uploaded on
2023-06-14 16:29:05
Language
en-US