1
Corpora for the coming decade
Adam Kilgarriff
Lexical Computing Ltd
2
Aston May 2009 Kilgarriff: Corpora for the coming decade
How to build a corpus
- Like this: (demo) http://beta.sketchengine.co.uk/auth/wbc
3
How should they be different?
- Bigger
- Better
4
Bigger
- Motivation
  - Ample data for rare phenomena
  - Big subcorpora
  - For language modelling
- More like Google-scale but without Google's disadvantages
  - See "Googleology is Bad Science", CL 2007
5
Better
- Less noise
- Fewer duplicates
- Richer markup
  - At word, sentence level
  - At document level (text type, subcorpora)
6
Divide and rule
- Bigger (+ cleaning + deduplication)
  - Big Web Corpus (BiWeC)
  - Currently 5.5b fully processed
  - Target 20b words
  - Jan Pomikalek, Pavel Rychly
- Better
  - New Model Corpus
7
New Model Corpus
- "model"
  1. small version: model train
  2. design: data model
- New Model Corpus
  - 1:100 scale model
  - To replace BNC as design model
8
BNC design model
- Most often used
  - E.g. for other languages
- Pre-web
  - f(blog) = 0
- Corpora now bigger, far quicker, far cheaper, different issues
- "BNC design model past its sell-by"
  - Kilgarriff, Atkins, Rundell, Corpus Lg 2007
9
New model
- Data
- Markup
10
Data
- From the web
- 100m words
- Small sample size
- Copyright
  - ?? Creative Commons Licence
11
Composition
- General crawl: 50
- Targeted
  - Fiction: 7
  - Blog: 7
  - Newspaper (RSS feed): 7
- Speech: 10
  - Film transcripts, chatshow
- Domain-specific: 19
  - Business, medical, law
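As a rough illustration of what this composition means in words: the sketch below assumes the bare numbers on the slide are percentages (they sum to 100) and applies them to the 100m-word total from the Data slide. The names `composition` and `word_targets` are made up for this example and are not part of any Sketch Engine tool.

```python
# Assumption: the slide's numbers are percentages of the 100m-word total.
TOTAL_WORDS = 100_000_000

composition = {
    "General crawl": 50,
    "Fiction": 7,
    "Blog": 7,
    "Newspaper (RSS feed)": 7,
    "Speech": 10,
    "Domain-specific": 19,
}


def word_targets(shares, total_words):
    """Convert percentage shares into absolute word-count targets."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {name: total_words * pct // 100 for name, pct in shares.items()}


targets = word_targets(composition, TOTAL_WORDS)
for name, words in targets.items():
    print(f"{name:22s} {words:>12,d}")
```

On this reading, half the corpus (50m words) would come from the general crawl and 7m words from each of fiction, blogs and newspapers.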
12
Markup
- Collaborative
  - We distribute data
  - Anyone applies their tools
    - POS-tagger, parser, co-ref resolution, domain classifier, WSD, semantic classifier, time phrases, named entities...
  - We integrate, display in Sketch Engine
- Research potential from multiple markup
13
Recombine the two strands
- Apply methods with good accuracy (and fast) to BiWeC
- Result will be
  - Bigger
  - Better
14
The Sketch Engine
- Full-functionality corpus system
  - Fast
  - Web-based
- In daily use for lexicography at
  - OUP, Collins, CUP, Macmillan, …
  - Le Robert, Cornelsen, Patakis, INL, …
- Many universities, language teaching
- Free trial
- Demo: http://sketchengine.co.uk
15
What can computers count up to?
- By default, 2 billion
  - 32 bits, one for the sign: 2^31 ≈ 2 billion
- Re-engineering required to go beyond
- Most corpus systems: tough limit
- Sketch Engine
  - Recently re-engineered for 64-bit integers
  - No longer limited
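The limit can be checked with a few lines of arithmetic. This is a minimal sketch, not Sketch Engine code: it assumes a corpus system stores word positions as signed integers, and takes the corpus sizes from the earlier slides (BiWeC at 5.5b and 20b words, New Model Corpus at 100m). The helper `fits` is hypothetical.

```python
import ctypes

# Largest value a signed integer of each width can hold.
INT32_MAX = 2**31 - 1  # 2,147,483,647, roughly 2 billion
INT64_MAX = 2**63 - 1


def fits(n_words, max_value):
    """Return True if a corpus of n_words can be counted within max_value."""
    return 0 <= n_words <= max_value


# Sizes as quoted on the slides.
corpus_sizes = {
    "New Model Corpus": 100_000_000,
    "BiWeC, processed so far": 5_500_000_000,
    "BiWeC, target": 20_000_000_000,
}

for name, size in corpus_sizes.items():
    print(f"{name}: 32-bit ok={fits(size, INT32_MAX)}, "
          f"64-bit ok={fits(size, INT64_MAX)}")

# ctypes does no overflow checking, so one step past the 32-bit limit
# wraps around to a negative number instead of raising an error.
wrapped = ctypes.c_int32(INT32_MAX + 1).value
print(wrapped)
```

Only the 100m-word corpus fits in 32 bits. Both BiWeC figures need the 64-bit counters the slide describes.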
16
NLP by web services?
- Big corpora: big to hold, hard to access fast
- Sketch Engine: corpus specialist
- Web API
  - FrameNet
  - TEDDCLOG: Taiwan English Data Driven Cloze (test sentence) Generation
- All welcome
17
Practicalities
- Free trial accounts
- Collaborators, innovative users
  - Free longer-term accounts
  - Wikinomics, Tapscott and Williams
- API
  - Details under 'help' on SkE home page
- New Model Corpus
  - Available by end 2009: watch Corpora
18
Thank you
http://www.sketchengine.co.uk
Enjoy!