Okay, because we have some things left in the preprocessing lecture, and if I remember correctly (correct me if I'm wrong), we stopped at the chapter on data reduction last time.
So we discussed correlation last time and said we will go on there.
Is that correct?
Can someone give me a signal whether I'm correct?
Okay, seems like that.
Okay, one question: has the recording started again?
Okay, perfect.
Then we can go on.
It's quite a switch from the guest lecture topics to this topic.
Of course, the guest lecture topic is more advanced than anything we will do in this lecture,
but it's a sneak peek into real-world industry needs, which are of course much more advanced
than anything a basic lecture can give you.
However, today we will talk about something they also did in the guest lecture: data reduction.
Data reduction is part of preprocessing because there is a lot of data.
In the guest lecture, there were many different kinds of data; in this topic, we will discuss uniform data.
Okay, now it's working.
First, what is data reduction?
Let's say we have a lot of data, but we can only handle a certain amount of it.
Then, of course, we have to reduce the data.
And in the real world, as it is with Linda and as it is with Predato, we have terabytes
of data; an air separation plant, for example, collects a lot of data over time.
And of course, other use cases have even more data than that.
In that case, we can of course just throw out some tuples, throw out columns we
don't need, throw out things we know we do not need.
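As a quick illustration of this naive kind of reduction, here is a minimal sketch using pandas; the file and column names are hypothetical placeholders, not anything from the lecture.

```python
# Minimal sketch of naive data reduction with pandas;
# the file and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("measurements.csv")  # hypothetical input file

# Throw out columns we know we do not need.
df = df.drop(columns=["internal_id", "debug_flag"])

# Throw out tuples: keep a 10% random sample of the rows.
df = df.sample(frac=0.1, random_state=42)
```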
But what do we do if we think we might still need some of that data?
Then we have to do data reduction with some more advanced methods.
Today we will discuss wavelet transforms, principal component analysis, attribute subset
selection, and attribute creation. The last two are basically what I've said already,
but we will come to them as well.
Let's first start again with some basic information.
With too much data, we have a problem.
If there are too many tuples, we have a problem.
But the most common problem is not that we have too many tuples, too many entries in the
database, but that we have too many columns, too many dimensions.
That's known as the curse of dimensionality.
This is handled not by simply throwing out attributes, but by reducing correlation,
that is, by combining correlated columns into one.
We already know what correlation is, and now we want to reduce redundant information
within multiple attributes, because if we have a column giving you age and a column
giving you a birthday, that's somehow redundant.
Why do we need both of them?
Of course, that's a very simple example, but with all of the methods we will discuss,
we can automatically find such columns and reduce them to one or two dimensions
instead of many.
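To make this concrete, here is a minimal sketch of that idea using scikit-learn's PCA on synthetic data; the columns (birth year, age, income) are just an illustration of the redundant-column example above, not data from the lecture.

```python
# Minimal PCA sketch on synthetic data with two redundant columns;
# the data and column choices are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 1000
birth_year = rng.integers(1950, 2005, size=n).astype(float)
age = 2024.0 - birth_year               # perfectly correlated with birth_year
income = rng.normal(50_000, 15_000, n)  # roughly independent column

X = np.column_stack([birth_year, age, income])

# Standardize so no column dominates just because of its scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

# The first component captures the shared birth_year/age information,
# so the two redundant columns collapse into a single dimension.
print(pca.explained_variance_ratio_)
```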
The problem with high dimensionality is also that some methods perform very poorly
if you have a lot of dimensions.
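How badly distances behave in high dimensions can be seen in a small experiment (my own illustration, not from the lecture): as the number of dimensions grows, the nearest and the farthest random point end up almost equally far away, which hurts any distance-based method.

```python
# Small experiment illustrating the curse of dimensionality:
# the relative contrast between nearest and farthest neighbor
# shrinks as the number of dimensions grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                      # 500 random points in the unit cube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from point 0
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative distance contrast = {contrast:.3f}")
```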
We want a small, fixed number of dimensions, both to be easier to work with and also,
for example, to have easier ways to visualize the data.