Okay, because we have some things left in the preprocessing lecture, and if I remember correctly (correct me if I'm wrong), we stopped at the chapter on data reduction last time.
So we discussed correlation last time and said we will go on there.
Is that correct?
Can someone give me a signal whether I'm correct?
Okay, seems like that.
Okay, one question: has the recording started again?
Okay, perfect.
Then we can go on.
It's quite a switch from the guest lecture topics to this topic.
Of course, the guest lecture topic is more advanced than anything we will do in this lecture,
but it's a sneak peek into real-world industry needs, which are of course much more advanced
than anything a basic lecture can give you.
However, today we will talk about something they also did in the guest lecture: data reduction.
Data reduction is part of preprocessing because there is a lot of data.
In the guest lecture, there were many different kinds of data; in this topic, we will discuss uniform data.
Okay, now it's working.
First, what is data reduction?
Let's say we have a lot of data, but we can only handle a certain amount of it.
Then, of course, we have to reduce the data.
And in the real world, as it is with Linda and as it is with Predato, we have terabytes
of data; an air separation plant, for example, collects a lot of data over time.
And of course, other use cases have even more data than that.
In that case, we can of course just throw out some tuples, throw out columns we
don't need, throw out things we know we do not need.
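As a quick illustration of this naive kind of reduction, here is a minimal sketch using pandas; the file and column names are hypothetical placeholders, not anything from the lecture.

```python
# Minimal sketch of naive data reduction with pandas;
# the file and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("measurements.csv")  # hypothetical input file

# Throw out columns we know we do not need.
df = df.drop(columns=["internal_id", "debug_flag"])

# Throw out tuples: keep a 10% random sample of the rows.
df = df.sample(frac=0.1, random_state=42)
```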
But what do we do if we think we might still need some of that data?
Then we have to do data reduction with some more advanced methods.
Today we will discuss wavelet transforms, principal component analysis, attribute subset
selection, and attribute creation. The last two are basically what I've said already,
but we will come to them as well.
Let's first start again with some basic information.
With too much data, we have a problem.
If there are too many tuples, we have a problem.
But the most common problem is not that we have too many tuples, too many entries in the
database, but that we have too many columns, too many dimensions.
That's known as the curse of dimensionality.
This is handled not by simply throwing out attributes, but by reducing correlation,
that is, by combining correlated columns into one.
We already know what correlation is, and now we want to reduce redundant information
within multiple attributes, because if we have a column giving you age and a column
giving you a birthday, that's somehow redundant.
Why do we need both of them?
Of course, that's a very simple example, but with all of the methods we will discuss,
we can automatically find such columns and reduce them to one or two dimensions
instead of many.
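To make this concrete, here is a minimal sketch of that idea using scikit-learn's PCA on synthetic data; the columns (birth year, age, income) are just an illustration of the redundant-column example above, not data from the lecture.

```python
# Minimal PCA sketch on synthetic data with two redundant columns;
# the data and column choices are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 1000
birth_year = rng.integers(1950, 2005, size=n).astype(float)
age = 2024.0 - birth_year               # perfectly correlated with birth_year
income = rng.normal(50_000, 15_000, n)  # roughly independent column

X = np.column_stack([birth_year, age, income])

# Standardize so no column dominates just because of its scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

# The first component captures the shared birth_year/age information,
# so the two redundant columns collapse into a single dimension.
print(pca.explained_variance_ratio_)
```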
The problem with high dimensionality is also that some methods perform very poorly
if you have a lot of dimensions.
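How badly distances behave in high dimensions can be seen in a small experiment (my own illustration, not from the lecture): as the number of dimensions grows, the nearest and the farthest random point end up almost equally far away, which hurts any distance-based method.

```python
# Small experiment illustrating the curse of dimensionality:
# the relative contrast between nearest and farthest neighbor
# shrinks as the number of dimensions grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                      # 500 random points in the unit cube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from point 0
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative distance contrast = {contrast:.3f}")
```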
We want a small, fixed number of dimensions, both to be easier to work with and also,
for example, to have easier ways to visualize the data.