Yesterday, we completed the section of the course on learning from examples via linear regression, which is essentially weight fitting in a hypothesis space given by a very simple model.
We had talked about plain linear regression and classification, and about how to use those in bio-inspired computation in neural networks, which basically means having networks of linear classifiers, again fitted by adjusting weights.
The last thing we looked at was support vector machines. Essentially, we are doing linear regression or classification again, only this time we add two more tricks, which makes this extremely useful; once it is well implemented, you can basically use support vector machines out of the box. There are lots of packages around that give you this.
What are the basic ideas? The first idea is that instead of just any linear classifier, we want a linear classifier that keeps maximum distance to all the examples, and the hope is that this generalizes better than an arbitrarily chosen one. With plain linear regression or classification, we do not control which of the many possible separators we actually get. So instead of doing straight error minimization over the hypothesis space, we minimize the error and at the same time maximize the margin, the distance to the nearest examples. In a way, instead of having a thin separator, we are also optimizing the thickness of a separator that already works. That is the first idea: get better generalization properties by keeping our distance. And the way this works out is that, almost miraculously, the method only takes the support vectors, the examples that are closest to the separator, into account for classification and for the weight fitting. That is the one idea.
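As a concrete illustration of this idea, here is a minimal sketch (not from the clip) using scikit-learn's SVC with a linear kernel on a made-up toy set; after fitting, only the closest examples show up as support vectors:

```python
# Minimal sketch: a maximum-margin linear classifier on a toy 2D set.
# The toy data is made up for illustration; a large C approximates the
# hard-margin (maximum-distance) separator.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class -1
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# Only the examples closest to the separating line determine the result.
print("support vectors:\n", clf.support_vectors_)
print("weights w:", clf.coef_, "  bias b:", clf.intercept_)
```

Everything except the support vectors could be removed from the training set without changing the learned separator.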
You can do this with the old minimization trick by adding a breadth (margin) parameter, which you also subject to optimization. Or you can do something else, which is what is actually done: you can use quadratic programming methods, which are more efficient in practice.
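For reference, the textbook formulation behind this, which the clip does not spell out, is the following quadratic program for the maximum-margin separator:

```latex
% Hard-margin SVM as a quadratic program (standard textbook form):
% minimize the squared weight norm, i.e. maximize the margin 1/||w||,
% subject to every example being on the correct side with margin at least 1.
\begin{align*}
  \min_{\mathbf{w},\,b}\quad & \tfrac{1}{2}\,\lVert \mathbf{w}\rVert^{2}\\
  \text{s.t.}\quad & y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \;\ge\; 1
  \quad\text{for all training examples } (\mathbf{x}_i, y_i).
\end{align*}
```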
They also allow, and that is the important part, the so-called kernel trick, which is not as easily done in the gradient-descent minimization method. The idea here is that if you have sets that are not linearly separable, you can sometimes transform them into a higher-dimensional space to make them linearly separable there. The example everybody uses, because it goes from two to three dimensions and not from five to 2,000 dimensions, which is what actually happens in practice, is the one where you have a circle-shaped separator, which is of course not linear, but you can transform the data into a cone-shaped distribution. As you know, if you have a cone, a circle is just one way of cutting it: if the normal vector of the cutting plane is collinear with the axis of the cone, you get a circle, and that cutting plane is a linear separator in the higher-dimensional space.
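Here is a minimal sketch of that two-to-three-dimensional embedding (my own illustration, not the slide from the clip): the map below turns the circle x1² + x2² = r² into a plane, and its dot products can be computed in the original space, which is the kernel trick in miniature.

```python
# Minimal sketch: the classic 2D -> 3D embedding behind the circle example.
import numpy as np

def phi(x):
    """Map (x1, x2) to (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

# In the new coordinates the circle x1^2 + x2^2 = r^2 becomes the plane
# z1 + z2 = r^2, i.e. a linear separator.

# Kernel trick in miniature: the dot product of embedded points equals the
# squared dot product of the original points, so phi never has to be computed.
a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(a), phi(b)))   # 1.0
print(np.dot(a, b) ** 2)        # 1.0  (same value)
```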
Now, in principle, that is something you can always do, but the necessary transformations, like the one in this example, can play badly with your computation. The advantage of the quadratic programming approach is that the data, the x part, only enters in the form of dot products, which is exactly what you feed into the kernel function, and very often you can compute the kernel of the dot product without ever really computing the high-dimensional embedding itself. The embedding disappears into the woodwork, and of course that makes this approach very attractive and computationally efficient.
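To make the dot-product observation concrete, here is the standard dual quadratic program (again textbook form, not quoted from the clip); the inputs x_i appear only inside dot products, which is exactly where a kernel can be substituted:

```latex
% Dual form of the hard-margin SVM: a quadratic program in the alpha_i.
\begin{align*}
  \max_{\boldsymbol{\alpha}}\quad
    & \sum_i \alpha_i
      \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i\alpha_j\, y_i y_j\,
        (\mathbf{x}_i \cdot \mathbf{x}_j)\\
  \text{s.t.}\quad & \alpha_i \ge 0, \qquad \sum_i \alpha_i y_i = 0.
\end{align*}
% Kernelized version: replace  x_i . x_j  by  k(x_i, x_j) = phi(x_i) . phi(x_j),
% without ever computing the embedding phi explicitly.
```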
And that is what SVM packages use. They give you a standard set of kernels, and then you can project up, which often gives you a separable feature space. You can imagine that with these cone-shaped embeddings alone you get all kinds of circles, ellipses, parabolas, and so on as dividing lines, that is, conic sections anywhere in the plane. So if you adjust the weights, instead of just linear models you essentially get conic-section models by this little trick, and many point sets become separable or almost separable if you allow yourself all kinds of conic sections. And if you go up in dimensionality, you get basically polynomial-shaped separating surfaces.
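As a quick out-of-the-box example of what such a package does (again my own sketch, with made-up data), a degree-2 polynomial kernel learns exactly such a conic-section boundary for two concentric rings:

```python
# Minimal sketch: an off-the-shelf SVM with a standard polynomial kernel
# learns a conic-section boundary without ever computing the embedding.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the plane.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# Degree-2 polynomial kernel, k(a, b) = (gamma * a.b + coef0)^2.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0)
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```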