2 - SIGMathLing Seminar -- Deyan Ginev: Scientific Statement Classification over arXiv.org [ID:30853]

See what we can do. Okay, I believe it is recording. Hello. Thanks, everybody, for attending.

Frederick asked me to give this impromptu talk, as we discussed briefly, as a way to fill the time before the next external talk.

And I'm happy to present a very nice, fun experiment we got to do over arXiv; it was our first successful experiment over the arXiv data

with the large-scale methods that are available to the community at large and that have been so successful in mainstream NLP.

And the topic there is scientific statement classification, so I'll discuss briefly what the scientific statements in arXiv are, what the classification setup we came up with is, and I'll try to keep as much time as possible for a live showcase.

But since we have 40 minutes, maybe I should just skip over some of the slides.

So, these first three slides you have actually seen in the first talk I gave for SIGMathLing in the seminar here,

late last year, at the end of 2020. The point there was that for about 15 years we've been working on converting arXiv to HTML5. arXiv has made a lot of progress in maintaining its own content, and we ourselves have made progress in getting a larger and larger percentage of the data into HTML5, which is a lot more machine-readable than the source PDFs, in particular for math formulas, but not only for formulas, also for scientific statements.

So, what I'm going to talk about today.

And I skipped over the slides that I showed in the last talk.

And this is this nice little paper we made for the LREC conference, where we try to present a new language resource, which is all the statements from arXiv that we found to be presentable enough for a linguistic resource,

and also attach a few baselines of what existing modeling toolkits can do on top of that resource, just by virtue of employing them at that scale. So, okay, into the subject matter of arXiv.

What are the arXiv statements? So arXiv, I guess, can go without introduction; it was in the first two slides. It's this large corpus of documents in the basement of Cornell.

And I think pretty much everybody is aware of the preprint resource where the community disseminates their work before review and then shares links to those papers with others. But yeah, by virtue of arXMLiv we have this in HTML5, and that's about one and a half million documents at this point in HTML5.

And by the time we were writing this paper it was closer to 1.1 million, but that's still a very substantial number of documents. So, because they're written in LaTeX in the vast majority, especially the ones that we're converting,

we can look for LaTeX's purpose markup, the declarative markup, so things like the AMS \newenvironment.

This is actually... this should be \newtheorem, I'm sorry.

There are examples of the theorem environments down there, things like \newtheorem declarations.

Of course, there are the main kinds, the base theorem, definition, and proof, going from the standard ones to the custom authored ones, where the authors themselves define their own environments. So in arXiv, it turns out:
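As a hedged illustration of the markup being mined here (this is not a snippet from the talk), an arXiv source with standard AMS environments plus a custom, author-defined one might look like the following; the name `mainthm` is an invented example:

```latex
% Preamble: standard AMS theorem declarations plus one custom,
% author-defined environment -- the kind of markup the extraction targets.
\documentclass{article}
\usepackage{amsthm}

\newtheorem{theorem}{Theorem}
\newtheorem{definition}{Definition}
% Many arXiv authors declare their own environments, e.g.:
\newtheorem{mainthm}{Main Theorem}

\begin{document}
\begin{definition}
  A group $G$ is \emph{abelian} if $ab = ba$ for all $a, b \in G$.
\end{definition}

\begin{mainthm}
  Every finite abelian group is a direct product of cyclic groups.
\end{mainthm}
\end{document}
```

The `\begin{theorem}`, `\begin{definition}`, and custom `\begin{mainthm}` blocks are exactly the spans that survive into the HTML5 conversion and can be harvested at scale.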

There are 20,000 custom AMS environments. And I had the privilege of manually inspecting the highest-frequency ones; I think I went through about 500 of them. The ones with obvious meanings, things like main theorem and X theorem, I grouped together. And what the conversion to HTML does is preserve this markup in the HTML documents. So via LaTeXML and llamapun we could extract all the paragraphs that were in these environments.

And by paragraphs I mean the logical paragraphs in arXiv: if you think of HTML, it's not the <p> elements that contain the text; it is essentially the <div> elements that contain text, tables, inline equations, display equations, and a variety of other content. So they're paragraphs in terms of the paper's narrative rather than HTML-specific paragraphs.
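A minimal sketch of that paragraph-extraction idea, using only the Python standard library rather than the actual LaTeXML/llamapun pipeline. The `ltx_para` class name follows LaTeXML's HTML output convention, but treat the details as illustrative assumptions:

```python
# Sketch: pull "logical paragraphs" out of LaTeXML-style HTML5.
# These are <div class="ltx_para"> elements, which may contain text,
# inline math, display equations, and tables -- unlike plain <p> elements.
from html.parser import HTMLParser

class ParaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_para = False    # currently inside an ltx_para div?
        self.div_depth = 0      # nested <div>s opened since the para started
        self.buf = []           # text fragments of the current paragraph
        self.paras = []         # completed paragraph texts

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.in_para:
            self.div_depth += 1
        elif "ltx_para" in (dict(attrs).get("class") or "").split():
            self.in_para = True
            self.div_depth = 0

    def handle_endtag(self, tag):
        if tag != "div" or not self.in_para:
            return
        if self.div_depth:
            self.div_depth -= 1
        else:
            # Paragraph div closed: emit whitespace-normalized text.
            self.in_para = False
            self.paras.append(" ".join("".join(self.buf).split()))
            self.buf = []

    def handle_data(self, data):
        if self.in_para:
            self.buf.append(data)

html = """
<div class="ltx_theorem">
  <div class="ltx_para"><p>Let G be a group. Then ...</p></div>
</div>
"""
p = ParaExtractor()
p.feed(html)
print(p.paras)  # one logical paragraph, whitespace-normalized
```

In the real pipeline the enclosing environment (here `ltx_theorem`) would also be recorded, since the environment name is the classification label.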

Yeah, so that is essentially the premise of the experiments. We were looking for low-hanging fruit, because we had all this data, and to use the deep methods we wanted a big dataset. And the question was, well, what is the obvious low-hanging fruit? The environments we had for statements were very numerous and also very well marked up, courtesy of LaTeX and LaTeXML, so we had a very strong and clear handle on them.

So, a little bit about that data. If you look at the leading paragraph of such environments, the very first text that follows after \begin{theorem} or \begin{definition}, 96% of those actually fit the standard token window for language modeling tools, especially back then. This was around 2018, and the standard things were bidirectional LSTMs, GloVe word embeddings, and word2vec word embeddings. Something around 500 tokens was pretty standard; I think I saw 480 in at least a couple of papers I was using as baselines, so I just went with 480. I think the hierarchical attention networks were using 480 at the time.

So yeah, 96% of the leading paragraphs fit in that window. That means if we restrict the data and just look at the leading paragraphs, we actually retain the vast majority of it. That's 10 million paragraphs, to be specific.
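A hedged sketch of that leading-paragraph windowing: the 480-token window is from the talk, while the whitespace tokenizer and the function names here are illustrative assumptions, not the paper's code.

```python
# Compute what fraction of leading paragraphs fit a fixed token window,
# and truncate the rest -- the restriction described in the talk.
WINDOW = 480

def tokenize(text):
    """Naive whitespace tokenizer (a stand-in for the real one)."""
    return text.split()

def fits_window(paragraph, window=WINDOW):
    return len(tokenize(paragraph)) <= window

def window_coverage(paragraphs, window=WINDOW):
    """Fraction of paragraphs whose text fits the window."""
    fits = sum(1 for p in paragraphs if fits_window(p, window))
    return fits / len(paragraphs)

def truncate(paragraph, window=WINDOW):
    """Keep at most `window` tokens, as when forcing text into the window."""
    return " ".join(tokenize(paragraph)[:window])

# Toy data: two short "leading paragraphs" and one overly long one.
paras = ["Let G be a finite group.",
         "We define a ring R as follows.",
         " ".join(["token"] * 600)]
print(window_coverage(paras))  # 2 of the 3 toy paragraphs fit
```

Run over the real corpus, this kind of coverage check is what produced the 96% figure that justified keeping only leading paragraphs.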

And that decision for that experiment was forced, in a way. I actually tried having multiple paragraphs, just to fill this window up. So if you have, let's say, a long proof that has multiple steps, even with the full window there's no guarantee that the whole proof will fit, but you could at least try to use the full window. And when I did that and I trained...

Part of a video series
Accessible via: Open Access
Duration: 01:11:12 min
Recording date: 2021-04-12
Uploaded on: 2021-04-12 22:27:44
Language: en-US
