Okay, welcome to the next video nugget.
This one is about information retrieval, a relatively simple application of language
technologies and natural language processing.
So what is information retrieval? It is the organization, representation, and storage of information objects in general, so that you give the user easy access to the relevant information and satisfy the user's various information needs.
That is a relatively general definition, and you usually come into contact with information retrieval in the form of web search, think Google, Bing, or DuckDuckGo.
And there the information that is organized, represented, and stored is the information of the World Wide Web, or at least the accessible portion of it.
And the information need is more interesting, right?
You come to a web search engine with lots of information needs.
So for instance, I want to find out what the weather is tomorrow, or I want to find out whether it is really true that Obama died yesterday, those kinds of things.
This information need is quite different from what a web search engine usually offers you, namely a little box into which you can type words.
So we think about information retrieval as something that retrieves information driven by an information need.
So web search is really, as you know, a fully automatic process that responds to a user query by returning a sorted document list, the famous ten blue links, that is relevant to the user requirements expressed in the query.
You can think of the query as an expression of the information need, but it is not the same thing.
So typically what we do, and I am going to show you now what is called the vector space model for information retrieval, is that we think of documents and queries as essentially bags of words.
A bag is a multiset, that is, a set in which an element can occur more than once.
If you have a fixed vocabulary, then you can represent those bags as what we call word frequency vectors over the real numbers.
The length of the vector is the size of the vocabulary.
Strictly speaking, raw counts only need natural numbers, but sometimes you want to normalize the vectors, so it is convenient to think of them as real-valued from the start.
As an example, take the two documents "have a good day" and "have a great day". The joint vocabulary is {have, a, good, great, day}.
Then "have a good day" has the word frequency vector (1, 1, 1, 0, 1): it contains every vocabulary word exactly once except for great, so there is a zero in the great position.
Likewise, "have a great day" gets the vector (1, 1, 0, 1, 1), with a one in the great position and a zero in the good position.
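As a small sketch of how you might compute such word frequency vectors, here is some Python; it is not from the lecture, and the function name word_frequency_vector is my own.

```python
from collections import Counter

def word_frequency_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

doc1 = "have a good day"
doc2 = "have a great day"

# Joint vocabulary of both documents, in a fixed order.
vocabulary = ["have", "a", "good", "great", "day"]

print(word_frequency_vector(doc1, vocabulary))  # [1, 1, 1, 0, 1]
print(word_frequency_vector(doc2, vocabulary))  # [1, 1, 0, 1, 1]
```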
The idea that these vector space models pursue is that for web search, you represent the documents as word frequency vectors, and then you return those documents whose word vectors are similar to the word vector of the query.
Okay.
In a picture, it looks like this: we represent the documents and the query as vectors, and we call two vectors similar if they point in essentially the same direction.
So you can measure similarity by the angle between those directions.
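To make "pointing in the same direction" concrete: the standard choice in the vector space model is cosine similarity, the cosine of the angle between the two vectors. Here is a minimal sketch that continues the example above; the query "good day" and the helper function are my own illustration, not from the lecture.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors over the vocabulary (have, a, good, great, day).
doc1 = [1, 1, 1, 0, 1]   # "have a good day"
doc2 = [1, 1, 0, 1, 1]   # "have a great day"
query = [0, 0, 1, 0, 1]  # hypothetical query "good day"

# The query points more nearly in the direction of doc1,
# because the two share the word "good".
print(cosine_similarity(query, doc1))  # ~0.71
print(cosine_similarity(query, doc2))  # ~0.35
```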