1 - Clustering with K-means and Gaussian Mixtures [ID:30720]

Welcome to this first lecture in the course Mathematical Data Science. This lecture is

going to be about clustering methods. In particular, I will speak about

the k-means algorithm and Gaussian mixture models.

Okay, so first let us speak a little bit about what clustering is in the first place. And so,

for example, in this situation here, I show you a picture of apples and oranges. The question is,

is there any structure in the data? And of course, for the human eye, it is kind of obvious that here

you see apples and oranges. You might say this data set can in fact be split into two different

clusters. One containing all the apples and the second one containing the oranges.

All right. So you would like to basically subdivide the data set into these two sub-data sets. And

this is basically what clustering does. A cluster is a group of objects such that within the cluster,

all objects resemble each other, but objects from two different clusters, they should look

considerably different. In data science, clustering is very important because it allows us to

reduce a huge data set into a very small number of entities, namely the clusters, which are still

representative for the whole data set. Okay. So of course, in data science we're not always

dealing with images. Most of the time we're in fact dealing with a huge collection

of numbers, which represent our data. And in this fruit example, these numbers could be measurements

of the height and the width of different fruits. And then if you put this information into a diagram,

you get something like this point cloud. And visually it's already clear that there are

multiple categories of data, which you have measured here. It might be one group up here,

which is very large. It has a large height and a large width. And there's maybe a big group here.

And then there are three other groups like this. So it could be that in fact, this data set

can be clustered into four sub-clusters. And this is basically the question of all clustering

algorithms. How can we sort the given data into categories or clusters? And this is what's called

the clustering problem. And one should remark that something which always helps is getting

more information about the data. So here, for instance, I plotted an additional measurement

of the fruits, namely their mass, which is shown on the vertical axis here. And this

makes the clustering a lot easier since you might be very sure that these three data points up here

should belong to their own cluster because they have a very high mass compared to all the other

data points down there. So for instance, these three data points might correspond to melons,

let's say, whereas all the other data points correspond to apples or limes, which have a much

lower mass. So introducing more information makes clustering a lot easier. But this is, of course,

somehow to be expected. And a final remark here is that clustering the raw data by hand

is very cumbersome. So all the information about the different fruits is basically stored in such

a table where you have 13 different fruits. Each one comes with a mass, a width, a height, and a color.

And extracting the clustering from such a collection of numbers is almost impossible for

human beings. So you definitely need an algorithm which does this in an automated fashion.
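To make this concrete, here is a minimal sketch of such an automated clustering in Python: a hand-rolled version of Lloyd's algorithm for k-means, run on a small made-up table of fruit measurements. Both the numbers and the helper function `kmeans` are illustrative assumptions, not the lecture's actual data or code.

```python
import numpy as np

# Hypothetical fruit measurements (mass in g, width and height in cm);
# the values are invented for illustration, not the lecture's table.
X = np.array([
    [150.0, 7.0, 7.5],     # apple-like
    [160.0, 7.2, 7.8],
    [145.0, 6.9, 7.3],
    [1200.0, 20.0, 18.0],  # melon-like
    [1300.0, 21.0, 19.0],
    [60.0, 5.0, 5.5],      # lime-like
    [55.0, 4.8, 5.2],
])
N, M = X.shape   # N data points, M measured variables
K = 3            # number of assumed clusters, chosen by the user

def kmeans(X, K, n_iter=100, seed=0):
    """A minimal sketch of Lloyd's algorithm for k-means."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with K distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster
        # (empty clusters keep their old centroid).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(X, K)
print(labels)  # one cluster index per fruit
```

Note that because all M variables enter the same Euclidean distance, a variable with large values (like mass in grams) dominates the assignment, which is exactly why adding the mass measurement separates the heavy fruits so easily.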

And let me now formulate a wish list for a good clustering algorithm. And first, let's say what we

are given typically. First, we are given a number of data points, capital N, which can be quite large.

So it represents the size of your data set. And in this fruit example, you might, for instance,

have measurements of thousands of fruits, which you would then like to cluster into, let's say,

four different clusters. And for this, you use a number of variables which you measure. And this

number of variables is denoted by capital M. And it can be the width and the height, as we've seen.

It can be the mass, but also other variables like price or color. And then you're given a data set,

capital X, which consists of capital N data points, x1 to xN, where each data point comes with capital

M different variables. And in addition, you also have to specify a number, capital K, of assumed

clusters. Of course, one could argue that a good clustering algorithm should be able to estimate

the number of clusters from the data. And this is not impossible: there are algorithms which do that.

However, in this introductory lecture, we will always assume K to be fixed and selected by the

user. And this actually makes a lot of sense, because if you don't specify the number of clusters,

then there's basically always a trivial clustering algorithm which puts every single data point into

its own cluster.

Part of a video series

Presenter: Leon Bungert

Access: Open Access

Duration: 00:39:04 min

Recording date: 2021-04-10

Uploaded: 2021-04-10 19:16:21

Language: en-US
