Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd
Madrid 2010Kilgarriff: Web Corpora Corpora show us the facts of the language
Madrid 2010Kilgarriff: Web Corpora3 What is a corpus? a corpus is a collection of texts –when viewed as an object of language research
Madrid 2010Kilgarriff: Web Corpora Which texts? Written Spoken
Madrid 2010Kilgarriff: Web Corpora Written Books –Fiction –Non-fiction –Textbooks Newspapers Letters, unpublished Web pages Academic journals Student essays …
Madrid 2010Kilgarriff: Web Corpora Spoken Must be transcribed, for text corpora Conversation –Who? Region, class, age-group, situation… Lectures TV and Radio Film transcripts Meetings, seminars …
Madrid 2010Kilgarriff: Web Corpora Which texts? Different purposes, different text types Making dictionaries: –Cover the whole language –Some of everything
Madrid 2010Kilgarriff: Web Corpora How much? Most words are rare Zipf’s Law To get enough data for most words, we need very big corpora
Madrid 2010Kilgarriff: Web Corpora Zipf’s Law Word (pos) r f r x f the (det) to (prep) as (adv) playing (vb) paint (vb) amateur (adj) 10,
Madrid 2010Kilgarriff: Web Corpora Zipf’s Law the: 6% 100 most frequent: 45% 7500 most frequent: 90% all others: rare
Madrid 2010Kilgarriff: Web Corpora Zipf’s Law
Madrid 2010Kilgarriff: Web Corpora Leading English Corpora: Size Size of Corpora (in words) 1960s 1970s 1980s 1990s 2000s Brown/LOB COBUILD BNC OEC
Madrid 2010Kilgarriff: Web Corpora Good news The web
Madrid 2010Kilgarriff: Web Corpora14 You can’t help noticing Replaceable or replacable? –
Madrid 2010Kilgarriff: Web Corpora15 Very very large –2006 estimates for duplicate free, linguistic, Google- indexed web German: 44 billion words Italian: 25 billion words English: 1,000 billion -10,000 billion words Most languages Most language types Up-to-date Free Instant access
Madrid 2010Kilgarriff: Web Corpora16 What is a corpus? a corpus is a collection of texts –when viewed as an object of language research
Madrid 2010Kilgarriff: Web Corpora17 Is the web a corpus? Yes
Madrid 2010Kilgarriff: Web Corpora18 but it’s not representative
Madrid 2010Kilgarriff: Web Corpora19 sublanguage Language = core + sublanguages Options for corpus construction –none –some –all None –impoverished view of language Some: BNC –cake recipes and gastro-uterine disease –not car repair manuals or astronomy or … All: until recently, not viable
Madrid 2010Kilgarriff: Web Corpora20 Representativeness The web is not representative but nor is anything else Text type variation –under-researched, lacking in theory Atkins Clear Ostler 1993 on design brief for BNC; Biber 1988, Kilgarriff 2001 Text type is an issue across NLP –Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there
Madrid 2010Kilgarriff: Web Corpora21 What is out there? What text types are there on the web? –some are new: chatroom –proportions is it overwhelmed by porn? How much? Hard question
Madrid 2010Kilgarriff: Web Corpora22 Comparing frequency lists Web1T –Present from google –All 1-, 2-, 3-, 4, 5-grams with f>40 in one trillion (10 12) words of English that’s 1,000,000,000,000 Compare with BNC –100 words with highest Web1T:BNC ratio –100 words with lowest ratio
Madrid 2010Kilgarriff: Web Corpora23 Web-high (155 terms) 61 web and computing –config browser spyware url www forum 38 porn 22 US English (incl Spanish influence –los) 18 business/products common on web –poker viagra lingerie ringtone dvd casino rental collectible tiffany –NB: BNC is old 4 legal –trademarks pursuant accordance herein
Madrid 2010Kilgarriff: Web Corpora24 Web-low Exclude British English, transcription/tokenisation anomalies –herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Madrid 2010Kilgarriff: Web Corpora25 Observations Pronouns and past tense verbs –Fiction Masc vs fem Yesterday –Probably daily newspapers Constancy of ratios: –He/him/himself –She/her/herself
Madrid 2010Kilgarriff: Web Corpora26 The web –a social, cultural, political phenomenon –new, little understood –a legitimate object of science –mostly language we are well placed –a lot of people will be interested Let’s –study the web –source of language data –apply our tools for web use (dictionaries, MT) –use the web as infrastructure
Madrid 2010Kilgarriff: Web Corpora27 Using Search Engines No setup costs Start querying today Methods Hit counts ‘snippets’ –Metasearch engines, WebCorp Find pages and download
Madrid 2010Kilgarriff: Web Corpora28 Googleology Google hit counts for language modelling –Example: (Keller & Lapata 2003) –36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista Very interesting work Great interest in query syntax
Madrid 2010Kilgarriff: Web Corpora29 The Trouble with Google not enough instances –max 1000 not enough queries –max 1000 per day with API not enough context –10-word snippet around search term sort order –search term in titles and headings untrustworthy hit counts limited search options linguistically dumb, eg not lemmatised aime/aimer/aimes/aimons/aimez/aiment …
Madrid 2010Kilgarriff: Web Corpora30 Appeal –Zero-cost entry, just start googling Reality –High-quality work: high-cost methodology
Madrid 2010Kilgarriff: Web Corpora31 Also: No replicability Methods, stats not published At mercy of commercial corporation
Madrid 2010Kilgarriff: Web Corpora32 Also: No replicability Methods, stats not published At mercy of commercial corporation Googleology is bad science So…
Madrid 2010Kilgarriff: Web Corpora33 Basic steps Gather pages –Google hits –Select and gather whole sites –General crawl Filter De-duplicate Linguistic processing Load into corpus tool
Madrid 2010Kilgarriff: Web Corpora34 Oxford English Corpus Whole domains chosen and harvested –control over text type 2 billion words (Mar 08)
Madrid 2010Kilgarriff: Web Corpora35 Oxford English Corpus
Madrid 2010Kilgarriff: Web Corpora36 DeWaC, ItWaC, UKWaC 1.5 B words each Marco Baroni, Adriano Ferraresi Seeds: –mid-frequency words from ‘core vocab’ lists and corpora Google on seed words, then crawl
Madrid 2010Kilgarriff: Web Corpora37 Filtering Non-text (sound, image etc) files Boilerplate (within file) –Copyright notices, navigation bars –“high markup” heuristic Not “text in sentences” –Look for function words –Lists?? Sports results?? Crossword puzzles?? Spam, pornography –Tough De-duplication (also tough)
Madrid 2010Kilgarriff: Web Corpora38 Small, specialised corpora Terminologists Translators needing target-language domain-specific vocab Specialist dictionaries –Don’t exist –Expensive/inaccessible –Out of date
Madrid 2010Kilgarriff: Web Corpora39 BootCat ( Bootstrapping Corpora and Terms) Put in seed terms Google/Yahoo search Retrieve Google/Yahoo hits –Remove duplicates, boilerplate Small instant corpora Baroni and Bernardini, LREC 2004 Web version –WebBootCaT –At Sketch Engine site
Madrid 2010Kilgarriff: Web Corpora Task Choose area of specialist interest –English or Spanish Select at least 5 seed terms –Specialist: good Build corpus –At least 100,000 words –Iterate if necessary Find at least six words/phrases/meanings you did not know before Write up