Welcome to this first lecture in the course Mathematical Data Science. This lecture is
going to be about clustering methods. In particular, I will speak about
the k-means algorithm and Gaussian mixture models.
Okay, so first let us speak a little bit about what is clustering in the first place. And so,
for example, in this situation here, I show you a picture of apples and oranges. The question is,
is there any structure in the data? And of course, for the human eye, it is kind of obvious that here
you see apples and oranges. You might say this data set can in fact be split into two different
clusters. One containing all the apples and the second one containing the oranges.
All right. So you would like to basically subdivide the data set into these two sub-data sets. And
this is basically what clustering does. A cluster is a group of objects such that within the cluster,
all objects resemble each other, but objects from two different clusters, they should look
considerably different. In data science, clustering is very important because it allows us to
reduce a huge data set into a very small number of entities, namely the clusters, which are still
representative for the whole data set. Okay. So of course, in data science we're not always
dealing with images. Most of the time we're in fact dealing with a huge collection
of numbers, which represent our data. And in this fruit example, these numbers could be measurements
of the height and the width of different fruits. And then if you put this information into a diagram,
you get something like this point cloud. And visually it's already clear that there are
multiple categories of data, which you have measured here. There might be one group up here,
which is very large. It has a large height and a large width. And there's maybe a big group here.
And then there are three other groups like this. So it could be that in fact, this data set
can be clustered into four sub-clusters. And this is basically the question of all clustering
algorithms. How can we sort the given data into categories or clusters? And this is what's called
the clustering problem. And one should remark that something which always helps is getting
more information about the data. So here, for instance, I plotted an additional measurement
of the fruits, namely their mass, which is plotted in this vertical axis here. And this
makes the clustering a lot easier since you might be very sure that these three data points up here
should belong to their own cluster because they have a very high mass compared to all the other
data points down there. So for instance, these three data points might correspond to melons,
let's say, whereas all the other data points correspond to apples or limes, which have a much
lower mass. So introducing more information makes clustering a lot easier. But this is, of course,
somehow to be expected. And a final remark here is that clustering the raw data by hand
is very cumbersome. So all the information about the different fruits is basically stored in such
a table where you have 13 different fruits. Each one comes with a mass, a width, a height, and a color.
And extracting the clustering from such a collection of numbers is almost impossible for
human beings. So you definitely need an algorithm which does this in an automated fashion.
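Before turning to full algorithms, the remark about extra variables can be made concrete with a tiny sketch. All numbers below are made up for illustration: two fruits that look almost identical in width and height are clearly separated once their mass is included as a third variable.

```python
import math

def dist(p, q):
    # Euclidean distance between two equal-length feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# hypothetical fruits as (width cm, height cm, mass g) -- made-up values
a = (7.0, 7.0, 150.0)   # apple-like fruit
b = (7.2, 7.1, 160.0)   # another apple-like fruit
c = (7.1, 6.9, 900.0)   # similar size, but far heavier

# Using only width and height, c looks even closer to a than b does ...
assert dist(a[:2], c[:2]) < dist(a[:2], b[:2])
# ... but the additional mass variable clearly separates c from a.
assert dist(a, c) > dist(a, b)
```

This is exactly the effect seen in the three-dimensional fruit plot: the heavy data points only separate out once mass is measured.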
And let me now formulate a wish list for a good clustering algorithm. And first, let's say what we
are given typically. First, we are given a number of data points, capital N, which can be quite large.
So it represents the size of your data set. And in this fruit example, you might, for instance,
have measurements of thousands of fruits, which you would then like to cluster into, let's say,
four different clusters. And for this, you use a number of variables which you measure. And this
number of variables is denoted by capital M. And it can be the width and the height, as we've seen.
It can be the mass, but also other variables like price or color. And then you're given a data set,
capital X, which consists of capital N data points, x1 to xN, where each data point comes with capital
M different variables. And in addition, you also have to specify a number capital K of assumed
clusters. Of course, one could argue that a good clustering algorithm should be able to estimate
the number of clusters from the data. And this is not impossible: there are algorithms which do that.
However, in this introductory lecture, we will always assume K to be fixed and selected by the
user. And this actually makes a lot of sense, because if you don't specify the number of clusters,
then there's basically always a trivial clustering algorithm which puts every single data point into
its own cluster.
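Given the data set X of N points in M variables and a fixed number K of clusters, the k-means algorithm announced at the beginning can be sketched as follows. This is a minimal Lloyd-iteration sketch in plain Python on made-up fruit measurements; the deterministic first-K-points initialization is a simplification, and practical implementations use random restarts or k-means++ instead.

```python
def kmeans(X, K, iters=100):
    """Minimal Lloyd iteration: alternate between assigning each point
    to its nearest center and recomputing centers as cluster means."""
    # simple deterministic initialization from the first K points;
    # real implementations use random restarts or k-means++
    centers = [list(X[i]) for i in range(K)]
    labels = None
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        new_labels = [min(range(K),
                          key=lambda k: sum((a - b) ** 2
                                            for a, b in zip(p, centers[k])))
                      for p in X]
        if new_labels == labels:
            break  # assignments unchanged -> converged
        labels = new_labels
        # update step: move each center to the mean of its cluster
        for k in range(K):
            members = [p for p, lab in zip(X, labels) if lab == k]
            if members:  # keep old center if a cluster went empty
                centers[k] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels, centers

# hypothetical (width, height) measurements of 6 fruits -- made-up values,
# chosen so that two groups are clearly visible
X = [(6.1, 6.0), (6.3, 6.2), (6.0, 5.9),
     (9.5, 9.8), (9.7, 10.1), (9.4, 9.9)]
labels, centers = kmeans(X, K=2)
```

On this toy data, the iteration recovers the two visible groups: the three small fruits share one label and the three large fruits the other.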
Presenter: Leon Bungert
Accessible via: Open Access
Duration: 00:39:04 min
Recording date: 2021-04-10
Uploaded on: 2021-04-10 19:16:21
Language: en-US