Let's start with today's lecture.
We stopped in the last lecture somewhere around the last slide regarding Apriori.
However, I've read in the forum that there have been some questions regarding every slide starting with slide 16.
So I will go back to that.
I will not go into detail on every slide back here, but I will try to sum it up again.
Last time we discussed how Apriori works in general.
Apriori has some disadvantages because we have to recount.
Every time we generate candidates we have to count them, and that can be a huge number of candidates.
We can address that problem, so that we count fewer candidates or need fewer counting steps, by building a hash tree.
That's a pretty simple solution where we put the different candidate item sets into a tree.
Items 1, 4, and 5 form a candidate item set of length 3, which might become a frequent item set if it satisfies the min support threshold.
We don't know the min support threshold here, but we know there is some min support threshold.
And now we go through that whole tree with a specific subset function to find where in our tree this specific candidate item set can be.
And we only have to count along that path.
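To make this concrete, here is a minimal Python sketch of such a hash tree for candidate 3-item sets; the branching factor, the leaf capacity, the toy hash function, and the item names are my own illustrative assumptions, not values from the slide.

```python
BRANCH = 3      # hash buckets per interior node (illustrative choice)
LEAF_CAP = 2    # candidates a leaf may hold before it splits (illustrative choice)

class Node:
    def __init__(self, depth=0):
        self.depth = depth
        self.children = {}    # bucket index -> child Node (interior node)
        self.candidates = {}  # candidate tuple -> support count (leaf node)

    def is_leaf(self):
        return not self.children

def bucket(item):
    # A stable toy hash on the item name.
    return sum(map(ord, item)) % BRANCH

def insert(node, cand):
    # Insert a sorted candidate tuple into the hash tree.
    if node.is_leaf():
        node.candidates[cand] = 0
        if len(node.candidates) > LEAF_CAP and node.depth < len(cand):
            # The leaf overflows: push its candidates one level down,
            # hashing on the item at the current depth.
            old, node.candidates = node.candidates, {}
            for c in old:
                child = node.children.setdefault(bucket(c[node.depth]), Node(node.depth + 1))
                insert(child, c)
    else:
        child = node.children.setdefault(bucket(cand[node.depth]), Node(node.depth + 1))
        insert(child, cand)

def count(node, transaction):
    # Walk only the branches the transaction's items hash to and increase
    # the count of every candidate that is a subset of the transaction.
    # (For simplicity we hash every item at every level, which may visit
    # a few more branches than strictly necessary but misses nothing.)
    items = set(transaction)
    if node.is_leaf():
        for cand in node.candidates:
            if set(cand) <= items:
                node.candidates[cand] += 1
        return
    for b in {bucket(it) for it in items}:
        if b in node.children:
            count(node.children[b], items)

def supports(node):
    # Collect candidate -> count pairs from all leaves.
    if node.is_leaf():
        return dict(node.candidates)
    result = {}
    for child in node.children.values():
        result.update(supports(child))
    return result

root = Node()
for cand in [("I1", "I4", "I5"), ("I1", "I2", "I4"), ("I2", "I3", "I5"), ("I1", "I3", "I5")]:
    insert(root, cand)

count(root, {"I1", "I4", "I5"})
print(supports(root))  # only ('I1', 'I4', 'I5') has been counted for this transaction
```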
The next slide, which was specifically mentioned in the forum, is the SQL statement regarding candidate generation with SQL.
I mentioned during the last lecture that this slide is not relevant for the exam.
This slide will not be part of the exam.
This is just to show you we are able to do SQL statements.
We are able to implement Apriori with SQL.
However, I wouldn't expect you to understand all the details of that SQL statement, and we just do not have time to go through it in detail.
So I will skip it here again.
But it's not relevant for the exam to understand this statement or to be able to write it yourself.
It's just for you to look it up if you ever want to implement Apriori with SQL.
Then we have some further improvements.
I think I've talked about this hash table in some detail in the last lecture as well.
However, I've seen some misconceptions in the questions after the lecture, where people didn't understand how this hash table works.
Basically, let's say we have an original data set, a transactional database with transactions 1 to 3.
Simple as that.
And we have the items A, B, C in the first transaction.
Items A and B in the second transaction and C and B in the third transaction.
We determined min support is two.
So every one item set will become frequent, of course, because A is present in two transactions, B is present in three transactions, and C is also present in two transactions.
And now we have the candidate item sets {A, B}, {A, C}, and {B, C}.
What we can do, of course, is to count each of them on its own.
Which, with a small example like that, is of course possible to count.
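To spell that baseline out, here is a small Python sketch that counts every candidate pair on its own against the three transactions from the example; the variable names are of course my own.

```python
# Counting each candidate pair on its own, using the three transactions
# and the min support of 2 from the example.

transactions = [
    {"A", "B", "C"},  # transaction 1
    {"A", "B"},       # transaction 2
    {"C", "B"},       # transaction 3
]
min_support = 2

candidates = [frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "C"})]

# Support = number of transactions that contain the whole candidate.
support = {c: sum(1 for t in transactions if c <= t) for c in candidates}
frequent = [c for c, s in support.items() if s >= min_support]

print(support)   # {A, B}: 2, {A, C}: 1, {B, C}: 2
print(frequent)  # {A, B} and {B, C} reach the min support of 2
```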
However, with hashing, we do something like putting many candidate item sets into one bucket.
Let's say we put them all into one bucket.
Of course, that wouldn't make much sense in the real world.
But let's say there's an item D as well.
Let's do it this way.
Maybe it's easier to have an example like that.
So {A, B}, {A, C}, and {B, C} are in this first bucket, and the combinations with D are in the other bucket.
Now we do not count each candidate on its own, but we count whether any of these item sets is present within a transaction.
So within transaction one, is {A, B}, {A, C}, or {B, C} present?
Yes, at least one.
In fact, all of them are present within it.
So we simply increase the count of that bucket by one.
Now we check, for the same transaction one, whether at least one item set of the other bucket, the combinations with D, is present.
No, it isn't.
So we do not increase the count of that bucket.
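Here is a minimal Python sketch of this bucket counting as just described; the exact D combinations in the second bucket are an illustrative assumption, since the example only says they involve D.

```python
# Bucket counting as described above: a bucket's count is increased once per
# transaction if at least one of its candidate item sets is contained in it.

transactions = [
    {"A", "B", "C"},  # transaction 1
    {"A", "B"},       # transaction 2
    {"C", "B"},       # transaction 3
]

buckets = [
    [frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "C"})],  # first bucket
    [frozenset({"A", "D"}), frozenset({"B", "D"})],  # second bucket (combinations with D, assumed)
]

bucket_counts = [0] * len(buckets)

for t in transactions:
    for i, candidates in enumerate(buckets):
        if any(c <= t for c in candidates):
            bucket_counts[i] += 1

print(bucket_counts)  # [3, 0]: the first bucket is counted for every transaction, the D bucket never
```

Since a candidate can never appear in more transactions than its bucket is counted for, a bucket whose count stays below the min support tells us that none of its candidates can be frequent, which is how this saves counting steps.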