Okay.
Welcome to KDD once again.
I'm sorry for last week.
I was too sick to give the lecture.
Sorry for that.
I hope everybody read my mail before coming here and not after coming here.
So everybody was at home at the time and not in this lecture hall.
Regarding the lost time, we decided to not reduce the content of the lecture but add
an additional date for an extra lecture.
However, this date is not set yet, and we will make sure that lecture is recorded,
so everybody who isn't able to be here in person for that lecture
will have a recording to look at.
It will probably be in three or four weeks' time, not this week, not next week, but in
some weeks' time.
A positive note is that I changed my laser pointer so now everybody should be able to
see my pointer.
I think that's good as well.
We will restart at the point where we stopped two weeks ago.
So we will start at the slide "Measuring Data Similarity and Dissimilarity" of the data lecture.
Today we will talk about the last slides of the data lecture and then we'll go on with
the pre-processing lecture.
Okay, are there any questions unrelated to today's slides?
Okay, then let's start.
Okay, we already talked about what data is in general.
Now we have to talk about how the data objects in our data sets compare to each other.
And why do we need that?
Because we need similarity and dissimilarity for specific applications, for specific methods like
classification, clustering, and outlier analysis.
We will need a specific measure to decide whether two data sets or two data tuples are
similar to each other or not similar to each other.
And ideally we do not want a binary scale, so identical or not identical,
but we want a continuous measure of the similarity of two data objects.
We will talk about each of these methods in later lectures, so classification will be
part of lecture seven, clustering will be part of lecture eight, and outlier analysis will
be part of lecture nine.
Because we will come back to those things later in the semester, I will skip the definition
of clusters here; we will talk about that again in lecture eight.
Now to similarity and dissimilarity.
Of course everybody has an opinion on what similarity and dissimilarity are.
For example, are you similar to your seat neighbor?
Hard to measure.
We need a specific definition of similarity to measure it.
Similarity is defined as how alike two data objects are.
In most cases we will choose a value between zero and one, including zero and one.
But in some cases you will have other intervals as well.
The higher the value is, the more alike, the more similar two objects are, typically.
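To make the zero-to-one scale concrete, here is a minimal sketch of one possible similarity measure. The lecture does not name a specific measure at this point, so the choice (the fraction of attributes on which two tuples agree), the function name, and the example tuples are all assumptions made for illustration.

```python
def matching_similarity(a, b):
    """Fraction of attribute positions where two equal-length tuples agree.

    Illustrative measure only (an assumption, not taken from the lecture):
    returns 1.0 for identical tuples and 0.0 if no attribute matches.
    """
    if len(a) != len(b):
        raise ValueError("tuples must have the same number of attributes")
    if not a:
        return 1.0  # convention: two empty tuples count as identical
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Made-up data tuples: (field of study, degree, campus)
student_a = ("computer science", "master", "north")
student_b = ("computer science", "bachelor", "north")

print(matching_similarity(student_a, student_a))  # 1.0 (identical)
print(matching_similarity(student_a, student_b))  # two of three attributes match
```

The result is a graded value rather than a yes/no answer, and a higher value means the two tuples are more alike.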
Dissimilarity, on the other hand, is often a synonym of distance.
How far apart are two data objects from each other?
So for example, if we have two points in a coordinate system, we have two dimensions.
One dimension is age and one dimension is salary.
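The two-point example can be sketched in a few lines, reading the two dimensions as age and salary. The lecture has not yet said which distance is meant, so Euclidean distance is an assumption here, and the function name and the numbers are made up for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical data objects as (age, salary) points
person_1 = (30, 50000)
person_2 = (33, 50004)

print(euclidean_distance(person_1, person_2))  # 5.0
print(euclidean_distance(person_1, person_1))  # 0.0 (an object has distance 0 to itself)
```

Unlike the similarity value, this runs the other way: zero means identical, and larger values mean the two objects are further apart.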
Access: Open access
Duration: 01:32:06
Recording date: 2024-05-06
Uploaded on: 2024-05-07 11:46:05
Language: en-US