4 - 29.4 Language Models [ID:35200]

Hello everybody and welcome to the video nugget on language models.

So remember, a formal language is a set of strings, and for formal languages like Java or C++ it is easy to tell what a valid Java or C++ string is: we just apply the language's grammar, and if it says yes, then the string is in the language, and if the grammar says no, then it isn't. So membership is a decidable problem. For natural languages like English, German, Spanish, or Chinese, this is not the case. So let's look at a concrete example in English: something like "not to be invited is sad" is definitely English, whereas "to not be invited is sad" is slightly controversial. So the idea is that instead of having a formal language, where we have a clear zero/one criterion, if you will, for whether something is a string of the language, in natural languages it's actually better to use a probability distribution, which would assign something near one to the first example and significantly less than one to the second. Such a probability distribution we'll call a language model. And there's a choice to make here: we can have language models of characters or language models of words, and we're going to see both of those in this video nugget.
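To make this notion concrete, here is a minimal sketch in Python of a language model as a mapping from strings to probabilities. The two scores are invented to mirror the lecture's example; a real model would be estimated from a corpus.

```python
# A language model is just a probability distribution over strings.
# These scores are made up for illustration, not trained from data.
toy_model = {
    "not to be invited is sad": 0.9,   # clearly acceptable English
    "to not be invited is sad": 0.3,   # split infinitive: controversial
}

def score(sentence):
    """Probability the toy model assigns to a sentence (0.0 if unknown)."""
    return toy_model.get(sentence, 0.0)
```

Unlike a grammar's yes/no answer, such a model grades sentences on a continuous scale.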

So the idea is that we try to derive the language model from a text corpus, which is essentially a large and structured set of texts. There's a whole area called corpus linguistics, which is basically about deriving things like language models, but also other things, from corpora. Corpora are used for statistical analysis, as we're going to see, for hypothesis testing, and for things like validating linguistic rules. So we're going to presuppose that we have a text corpus; getting and curating those is a non-trivial task, of course. There are large text corpora: for instance, a large corpus of English newswire text is used in the Penn Treebank, which you may have heard of. In the German language, there's a corpus by the Bertelsmann company: they've basically collected all the newspapers they've published over the last 30 years, and so on. That's, for instance, being used to curate the words of the Brockhaus dictionary, which you may have heard about. So I'm going to first look into n-gram character models.
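As a quick aside, the simplest kind of corpus-derived statistic, plain word counts, can be sketched as follows. The one-line corpus here is made up; a real corpus like the Penn Treebank would be far larger.

```python
from collections import Counter

# Stand-in corpus: one short line instead of millions of words.
corpus = "the cat sat on the mat and the dog sat on the log"

# Unigram word counts, the basis for estimating P(w) = count(w) / total.
counts = Counter(corpus.split())
total = sum(counts.values())

print(counts["the"], total)  # frequency of "the" and the corpus size
```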

So remember, written texts are composed of characters, which are letters, digits, punctuation, spacing, and so on. So we can study language models for sequences of characters. And we're going to use something we've studied before: namely, we're going to look at this as a Markov-like process, and we're going to take over the notation we developed for Markov processes, basically writing the sequence c_1, ..., c_N as c_{1:N}. We call a character sequence of length n an n-gram; for n = 1, 2, and 3 we use the traditional names unigram, bigram, and trigram. And we're just going to think of an n-gram model as a Markov chain of order n-1. So a trigram model gives us the task of predicting the next character: the probability of c_i, given the observation of the previous characters, is in a trigram model just P(c_i | c_{i-2}, c_{i-1}). So if we factor the probability of a particular length-N sequence with the chain rule and then use the Markov property, we get basically this big product P(c_{1:N}) = prod_i P(c_i | c_{i-2}, c_{i-1}). And the important thing to see here is that the only thing we need is this conditional probability table. To store it, we need (number of characters)^3 entries. So for a trigram model of a language with 100 characters, which is about what you need if you count in digits, punctuation, spaces, and so on, that's 100^3, a million entries for that model. And of course, you need a big corpus to estimate that, say a corpus with 10^7 characters. Okay, that's doable: 10^7 characters is something like 10^4, so 10,000, pages, and that's easy to collect. So character models are something we can relatively easily do. So the question might be: what can we do with those? One of

we can relatively easily do. So the question might be, what can we do with those? One of

the things is language identification. You may want to know what natural language a text

is written in. Typically, you would like to know which lexicon to use, which language
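The character-model machinery just described can be sketched end to end: train one character trigram model per language, score a string with the chain-rule product of trigram probabilities, and identify the language as the model that assigns the highest probability. Everything below (the tiny stand-in corpora, the alphabet, the add-one smoothing choice) is an illustrative assumption, not the lecture's exact setup.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyzäöüß "

def train_trigram(text):
    """Estimate P(c_i | c_{i-2}, c_{i-1}) from trigram counts,
    with add-one (Laplace) smoothing over the alphabet."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(2, len(text)):
        counts[text[i-2:i]][text[i]] += 1

    def logprob(s):
        # Chain rule with the order-2 Markov property:
        # log P(s) ~ sum_i log P(s_i | s_{i-2}, s_{i-1})
        lp = 0.0
        for i in range(2, len(s)):
            ctx = counts[s[i-2:i]]
            lp += math.log((ctx[s[i]] + 1) / (sum(ctx.values()) + len(ALPHABET)))
        return lp

    return logprob

# Tiny stand-in corpora; a real system would train on large corpora.
models = {
    "en": train_trigram("the quick brown fox jumps over the lazy dog"),
    "de": train_trigram("der schnelle braune fuchs springt über den faulen hund"),
}

def identify(s):
    """Language identification: pick the language whose model likes s best."""
    return max(models, key=lambda lang: models[lang](s))
```

With 100 characters in the alphabet, the full conditional table would have 100^3 = 10^6 entries, which is why the lecture asks for a corpus of roughly 10^7 characters to estimate it.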

Part of chapter:
Chapter 29. Natural Language Processing

Access

Open access

Duration

00:25:30 min

Recording date

2021-07-01

Uploaded on

2021-07-01 11:17:06

Language

en-US
