6 - Machine Learning for Physicists [ID:52680]

Florian Marquardt told me that you already started last time with t-SNE, this topic of visualizing high-dimensional data, but maybe I will just quickly go through it again.

But maybe it is also a chance for you to ask questions if you have any.

And then I will show you some nice visualizations using t-SNE.

Okay, so again, as I mentioned already, the basic idea is to visualize high-dimensional data, ideally in two dimensions because our screens are two-dimensional.

As was already mentioned, there is a naive way to do it, which is to use principal component analysis (PCA). The idea is that you have this high-dimensional data and you project it onto the two dimensions where it has the largest variance.

And this is a linear type of mapping, and it works quite well if you apply it not for visualization but just to compress the data, keeping many dimensions; using only two dimensions is often very rough.

And so it's very difficult to, for instance, identify clusters or data that are similar in some semantic way, for instance pictures that show cats and pictures that show dogs.

I think in this example, this is a projection of the MNIST data set, and the colors are the different digits.

And you see they are all scattered.

If you didn't have the colors, you wouldn't be able to tell which digit is which.
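
(A minimal sketch of how such a linear projection can be produced, assuming scikit-learn and its small bundled digits dataset as a stand-in for MNIST; the library calls are an illustration, not part of the lecture.)

```python
# Minimal PCA sketch: project the digits dataset onto the two directions
# of largest variance and scatter-plot the result, colored by digit label.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)        # X has shape (1797, 64)

# Linear map onto the two principal components (largest variance)
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.title("PCA projection of the digits dataset")
plt.show()
```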

Whereas if you use some type of nonlinear mapping, you can get very nice clusters.

And so even without having this information about the semantic meaning in advance, you can already guess that, for example, this island here that is yellow has some special meaning.
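
(For comparison with the PCA sketch above, such a nonlinear embedding can be obtained with an off-the-shelf implementation; this sketch assumes scikit-learn's TSNE on the same digits dataset.)

```python
# Nonlinear counterpart of the PCA sketch above: embed the same digits
# with t-SNE; semantically similar digits form much tighter clusters.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```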

Okay, and then there are in principle many ways.

Okay, I mean the general problem that you want to solve is how to project nonlinearly from a large space to a small space.

And as you know from maps, it's difficult even if you just want to go from 2D on a sphere to 2D on a plane.

So it's not possible to conserve all the distances, and that's why the pole gets completely blown up: it's a single point, but it's basically projected onto a line.

So there is this problem, so you have to make some kind of compromise, and that's where the idea of using machine learning to do it comes in.

And the most general idea would be: okay, I just want to define a cost function, which is a function of the distance between two points in the high-dimensional space and the distance of the same two points in the low-dimensional space.

And then I want to minimize this cost function, and I can interpret the gradient of this cost function as a force.

And I want a force such that if two points are more distant in the projected space than in the real space, it is an attractive force, and vice versa.
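
(A toy sketch of this force picture, assuming a simple quadratic cost on the mismatch of pairwise distances; the cost function and step size are illustrative choices, not the method described next in the lecture.)

```python
# Toy "forces" embedding: minimize a cost comparing pairwise distances in
# the original space (D_high) with those in the 2D layout (D_low).
# The gradient acts like a spring: attractive where the embedded distance
# is too large, repulsive where it is too small.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))           # 100 points in 50 dimensions
Y = 1e-2 * rng.normal(size=(100, 2))     # random initial 2D positions

def pairwise_distances(A):
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)

D_high = pairwise_distances(X)

learning_rate = 0.1
for step in range(500):
    diff = Y[:, None, :] - Y[None, :, :]             # shape (N, N, 2)
    D_low = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)
    # Force coefficient per pair: positive (attractive) if D_low > D_high
    coeff = (D_low - D_high) / D_low
    np.fill_diagonal(coeff, 0.0)
    grad = (coeff[:, :, None] * diff).sum(axis=1)    # gradient-like force on each point
    Y -= learning_rate * grad / len(X)
```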

And okay, this would be a naive approach, but there is a very smart approach, the so-called t-distributed stochastic neighbor embedding (t-SNE), which basically works in the following way.

Basically, the distances between the points are encoded in some probabilities, and I have two different probabilities, one for the points in the high-dimensional space and one for the points in the low-dimensional space.

And once I have this encoding, my cost function just requires that these two probability functions are matched as much as possible.

And then I will use the standard cost function that I always use when I want two probability functions to match, which is the so-called Kullback-Leibler divergence, which is somehow related to the entropy.

So it's like the cross-entropy minus the entropy of the probability distribution, if you wish.

And it's called a divergence because it's not a distance in a metric space, but it still has some properties: it's always non-negative, and it's 0 if the two probability distributions match.
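
(A small numeric sketch of the Kullback-Leibler divergence and the properties just mentioned; the example distributions are made up.)

```python
# KL(p || q) = sum_i p_i * log(p_i / q_i)
# Non-negative, and exactly 0 when the two distributions match.
# Equivalently: cross-entropy(p, q) minus entropy(p).
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = [0.7, 0.2, 0.1]
print(kl_divergence(p, p))                 # 0.0 for identical distributions
print(kl_divergence(p, [0.1, 0.2, 0.7]))   # > 0 for mismatched distributions
print(kl_divergence([0.1, 0.2, 0.7], p))   # different value: not symmetric, so not a metric
```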

And so this leaves open only the question: how do I encode the distances into probabilities?

And the solution from Hinton and van der Maaten is the following.

So basically for the high-dimensional space I use this kind of exponential function, which penalizes distant points very strongly, so it goes very quickly to 0 for points that are far away.

Whereas for the low-dimensional space I use a so-called Cauchy distribution, also called a Student t-distribution, and this is where the t in t-SNE comes from.

And here the idea is that the probability decays more slowly.
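
(A rough sketch of these two encodings, omitting the per-point bandwidths chosen via the perplexity and the exact normalization used in the actual algorithm.)

```python
# Turning pairwise distances into probabilities, as in t-SNE (simplified):
# high-dimensional space -> Gaussian kernel, which decays very fast;
# low-dimensional space  -> Student-t / Cauchy kernel, with heavy tails.
import numpy as np

def p_high(D, sigma=1.0):
    """Gaussian affinities from a matrix of pairwise distances D."""
    P = np.exp(-D ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)              # a point is not its own neighbor
    return P / P.sum()

def q_low(D):
    """Student-t (Cauchy) affinities from a matrix of pairwise distances D."""
    Q = 1.0 / (1.0 + D ** 2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()
```

In practice one would usually not implement this by hand but call an existing implementation such as sklearn.manifold.TSNE, which also handles the perplexity-based bandwidth choice and the gradient descent.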

Part of a video series
Access: Open Access
Duration: 01:00:01 min
Recording date: 2024-06-13
Uploaded: 2024-06-14 08:19:03
Language: en-US
