1
Do we still need corpora (now that we have the Web)? Silvia Bernardini University of Bologna, Italy silvia.bernardini@unibo.it Postgraduate Conference in Corpus linguistics 22 May 2008
2
The corpus
A collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis. (Francis 1992 [1982]:17)
A collection of naturally-occurring language text, chosen to characterize a state or variety of a language. (Sinclair 1991:171)
A closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria. (Engwall 1992:167)
Finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration. (McEnery and Wilson 1996:23)
A collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety. (McEnery et al. 2006:5)
3
The Web
A mine of language data of unprecedented richness (Lüdeling et al. 2007)
A fabulous linguists’ playground (Kilgarriff and Grefenstette 2003)
[a] cheerful anarchy (Sinclair 2004)
A helluva lot of text, stored on computers… (Leech 1992:106)
4
Is the Web a corpus? Yes! The definition of corpus should be broad. We define a corpus simply as “a collection of texts”. If that seems too broad, the one qualification we allow relates to the domains and contexts in which the word is used […]: A corpus is a collection of texts when considered as an object of language or literary study. The answer to the question “Is the web a corpus?” is yes. Kilgarriff and Grefenstette (2003:334)
5
Is the Web a corpus? No! The cheerful anarchy of the Web thus places a burden of care on a user, and slows down the process of corpus building. The organisation and discipline has to be put in by the corpus builder. […] users of a corpus assume that there is a consistency of selection, processing and management of the texts in the corpus. Corpora should be designed and constructed exclusively on external criteria. (Sinclair 2005)
6
This talk
The Web and the corpus
–Disambiguating the WaC acronym
–Where the Web wins out
–Where the corpus holds its ground
Web as Corpus initiatives @ Forlì
–The BootCaT way
–The WaCky! way
Open issues and ways forward
7
Web as Corpus? (The Web corpus “proper”) The Web as a corpus surrogate The Web as a corpus supermarket The mega-corpus (or mini-Web)
8
The Web as a corpus surrogate
Googleology…
e.g.: Keller and Lapata (2003)
–Predicate-argument bigrams
–adj-noun, noun-noun, verb-noun
–not attested in the BNC
“Web counts correlate reliably with [human plausibility] judgments, for all three types of predicate-argument bigrams tested, both seen and unseen. For the seen bigrams, […] the Web frequencies correlate better with judged plausibility than corpus frequencies” (ibid: 481).
… is bad science
“Working with commercial search engines makes us develop workarounds. We become experts in the syntax and constraints of Google, Yahoo!, Altavista, and so on. We become ‘googleologists’” (Kilgarriff 2007:147)
9
Google…
Unreplicable
–Véronis (2005): 5 billion "the" have disappeared overnight
–Kilgarriff (2007:148): “queries are sent to different computers, at different points in the update cycle, and with different data in their caches”
Uncontrollable
–Asterisk treated as placeholder for 1 word or more than 1 word
–Punctuation and capitalisation disregarded (even in phrases)
–Search hits are per page
–Ranking criteria and result sorting (popularity, geographic relevance, …)
Linguistically naïve
–No morphosyntactic annotation
  36 queries to extract fulfill + obligation (Keller and Lapata 2003)
  Impossible to extract fulfill + NOUN
–Unsophisticated query language
  No sub-string matching
  No span options
10
SE post-processors? e.g. WebCorp, KWiCFinder – Wildcards and tamecards – Concordance output – Collocation Not a solution, really – Slow – Same limits as SE
11
The Web as a corpus supermarket
Selecting and downloading texts
–General or specialized
–Can be automatised (infra)
e.g. (general):
–Leeds Internet corpora (Sharoff 2006)
  English, Chinese, Finnish, French, German, Italian, Japanese
  Lemmatised and POS-tagged
  Indexed with the CWB and searchable online (CQP)
–Fletcher’s WaC (Fletcher 2007)
  ~500M words of English (AU, CA, GB, IE, NZ, US)
  will be POS-tagged
12
Pros “Traditional” corpus => – Replicable results – Control over corpus contents In principle – Control over search methods – Linguistically sophisticated searches supported
13
BUT… Compromise between Web and corpus => –Relying on SE (Google, LiveSearch) –Size –Up-to-dateness –Understanding of corpus contents/structure –Variety of corpus contents –Noise
14
The mega-corpus/miniweb Baroni (2007): Effort spent by NLP community in developing Google-skills would be better spent building our own Google-sized corpora None available so far, but: –WebCorp (Renouf et al. 2007) –The WaCky! effort (infra) Ultimate objective: build a linguist’s search engine for the Web
15
Where the Web wins out Up-to-dateness Size Convenience –Cost –Ease of collection –Under-resourced languages Web-specific genres Reference purposes
16
Where the corpus holds its ground Selection on external criteria – Cf.: a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005) Register/genre control Representativeness and documentation Pre- or non-Web genres
17
e.g.: McEnery et al. 2007
Collocation information for learners’ dictionaries
“Help”: Full or bare infinitive?
–Varieties of English, language change, syntactic environment
Acquisition of grammatical morphemes
–Learner language
Swearing in modern British English
–writing vs. speaking
–sociolinguistic variables
Conversation vs. formal speech in AmEng
Aspect marking in English-Chinese translation
–Parallel corpora
–Cf. Resnik and Smith (2003)
18
Two approaches to the Web as corpus
The BootCaT way
1. Select initial seeds (terms)
2. Query SE for random seed combinations
3. Retrieve pages and format as text (corpus)
4. Extract new seeds via corpus comparison
5. Iterate
Designed for translation students
Also used for reference corpus building
–Leeds Internet Corpora
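Step 4 (extracting new seeds via corpus comparison) can be sketched roughly as follows. This is an illustrative sketch only, not the BootCaT implementation: the function names `tokenize` and `extract_seeds`, the crude tokenizer, and the add-one smoothed relative-frequency ratio are all assumptions standing in for whatever comparison measure the actual scripts use.

```python
from collections import Counter
import re


def tokenize(text):
    # Crude lowercase word tokenizer (assumption); a real pipeline would do better.
    return re.findall(r"[a-z]+", text.lower())


def extract_seeds(target_text, reference_text, top_n=5, min_freq=2):
    """Rank words that are relatively more frequent in the target corpus
    than in the reference corpus, and return the top_n as new seeds."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    t_total = sum(target.values())
    r_total = sum(reference.values())
    scores = {}
    for word, freq in target.items():
        if freq < min_freq:
            continue  # skip rare words, which give unstable ratios
        # Add-one smoothing so words unseen in the reference do not divide by zero.
        t_rel = (freq + 1) / (t_total + 1)
        r_rel = (reference[word] + 1) / (r_total + 1)
        scores[word] = t_rel / r_rel
    # Highest ratio first; ties broken alphabetically for reproducibility.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

The returned words would then be fed back into step 2 as seeds for the next iteration.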
19
BootCaT pros…
Implemented in perl as a set of simple command-line scripts
Freely available (http://sslmit.unibo.it/~baroni/bootcat.html), documented
Integrated into the Sketch Engine pipeline
Community effort
–WebBootCaT
–JBootCaT
20
An example: wine tasting
Automatic query generation
acetic acid acidity aftertaste aged alcohol appley aroma ascescence astringent …
wine rich unfiltered attractive
wine stylish "malolactic fermentation" sour
wine meager harsh spritzy
wine dumb tobacco direct
wine watery grapey tears
wine hazy breed nouveau
wine spicy flat body
wine vinous spritzy unfined
wine fleshy cigarbox easy
wine puckery sharp nutty
…
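The queries on this slide each pair "wine" with three further seed terms, with multi-word terms in double quotes. A minimal sketch of how such queries could be generated is given below; the function name `make_queries`, its parameters, and the fixed random seed are illustrative assumptions and do not reproduce the BootCaT scripts.

```python
import random


def make_queries(seeds, n_queries=10, tuple_size=3, fixed_term=None, rng_seed=0):
    """Build distinct queries from random combinations of seed terms.

    seeds must be a list; fixed_term (e.g. "wine") is prepended to every
    query if given. All names and defaults here are hypothetical.
    """
    rng = random.Random(rng_seed)  # fixed seed so that runs are repeatable
    combos = set()
    attempts = 0
    # Cap the number of attempts in case too few distinct combinations exist.
    while len(combos) < n_queries and attempts < n_queries * 20:
        attempts += 1
        combos.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    queries = []
    for combo in sorted(combos):
        # Quote multi-word terms so a search engine treats them as phrases.
        terms = [f'"{t}"' if " " in t else t for t in combo]
        if fixed_term is not None:
            terms.insert(0, fixed_term)
        queries.append(" ".join(terms))
    return queries


if __name__ == "__main__":
    seed_terms = ["rich", "unfiltered", "attractive", "stylish",
                  "malolactic fermentation", "sour", "meager", "harsh"]
    for query in make_queries(seed_terms, n_queries=5, fixed_term="wine"):
        print(query)
```

Sorting each combination makes duplicates easy to detect, at the cost of losing the random term order within a query.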
22
“vanilla” collocates (span=1R): BootCaT wine tasting corpus (English, 1.5M words) vs. BNC
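A span of 1R means counting only the word immediately to the right of the node word. The sketch below shows one way to compute such counts over raw text; the function name `right_collocates` and the simple tokenizer are assumptions, and sentence boundaries are ignored for brevity.

```python
from collections import Counter
import re


def right_collocates(text, node, span=1):
    """Count the words found within `span` positions to the right of `node`."""
    # Crude tokenizer (assumption): lowercase alphabetic strings and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Slice is safely truncated at the end of the token list.
            counts.update(tokens[i + 1 : i + 1 + span])
    return counts


if __name__ == "__main__":
    # Invented toy text, not taken from either corpus.
    sample = "Vanilla notes and vanilla oak; a hint of vanilla notes."
    print(right_collocates(sample, "vanilla").most_common())
```

Running the same function over a specialised corpus and a reference corpus gives the two columns that a comparison like this one sets side by side.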
23
…and BootCaT cons
Relies on SE => same limits (cf. supra)
–…and Google no longer gives out API keys
Not really an option for very large corpus building projects
24
A more ambitious alternative
The WaCky! way
Aim: produce very large (~2bn words) web-derived corpora for several languages
Collaborative effort, using existing open tools, making developed tools publicly available
http://wacky.sslmit.unibo.it/
WaCky corpora currently available:
–deWaC, itWaC, ukWaC, frWaC
25
The WaCky pipeline
Submit random word combinations to Google and obtain list of URLs (seeding)
Crawling (Heritrix)
Code removal and boilerplate stripping
Language filtering
Near-duplicate detection
Tokenization, POS-tagging and lemmatisation
Indexing and querying
26
An example: constructing ukWaC
Seeding: mid-frequency content words (BNC); words from spoken text (BNC); vocabulary list for foreign learners
Crawl limited to UK domain and HTML
Processing
–Only files between 5 and 200 KB kept
–Perfect duplicates discarded
–Code, boilerplate, files with unconnected text and pornographic pages removed
–Near-duplicates removed
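Three of the processing steps above (size filter, perfect-duplicate removal, near-duplicate removal) can be sketched as follows. This is a simplified illustration under stated assumptions, not the WaCky code: the names `shingles`, `jaccard` and `filter_documents`, the word 5-gram shingles, the 0.5 Jaccard threshold, and the choice to keep the first copy of a duplicate are all hypothetical, and 1 KB is taken to be 1,000 bytes.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of word n-grams in a document."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_documents(docs, min_bytes=5_000, max_bytes=200_000, n=5, threshold=0.5):
    kept, kept_shingles, seen_hashes = [], [], set()
    for doc in docs:
        data = doc.encode("utf-8")
        # 1. Size filter: keep only documents within the byte limits.
        if not (min_bytes <= len(data) <= max_bytes):
            continue
        # 2. Perfect duplicates: identical content has an identical hash.
        digest = hashlib.sha1(data).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # 3. Near-duplicates: high n-gram overlap with a document already kept.
        doc_shingles = shingles(doc, n)
        if any(jaccard(doc_shingles, other) >= threshold for other in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept
```

The pairwise comparison in step 3 is quadratic in the number of documents, so at web scale it would have to be replaced by something cheaper, such as comparing a small sample of shingles per document.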
27
ukWaC: Details and size
2,000 seed word pairs
6,528 seed URLs
351 GB raw crawl size
19 GB after document filtering
5.69 M documents after filtering
12 GB after near-duplicate cleaning
2.69 M documents after near-duplicate cleaning
30 GB size with annotation
1,914,150,197 tokens
3,798,106 types
Further info and availability: http://wacky.sslmit.unibo.it/
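A few ratios derived from these figures show how much of the crawl survives cleaning: roughly 5.4% of the raw crawl size remains after document filtering and roughly 3.4% after near-duplicate cleaning, and a little under half of the filtered documents survive near-duplicate cleaning. The sketch below only recomputes these ratios from the numbers on the slide; the variable names are illustrative, and the tokens-per-document figure assumes that the token count refers to the 2.69 M documents in the final corpus.

```python
# Figures copied from the slide.
raw_gb = 351
filtered_gb = 19
dedup_gb = 12
docs_filtered = 5.69e6
docs_dedup = 2.69e6
tokens = 1_914_150_197
types = 3_798_106

# Share of the raw crawl size left after each cleaning stage.
share_after_filtering = filtered_gb / raw_gb
share_after_dedup = dedup_gb / raw_gb

# Share of filtered documents that survive near-duplicate cleaning.
docs_surviving_dedup = docs_dedup / docs_filtered

# Assumes the token count covers the 2.69 M documents in the final corpus.
tokens_per_doc = tokens / docs_dedup
type_token_ratio = types / tokens

print(f"after filtering:      {share_after_filtering:.1%} of raw size")
print(f"after near-dup clean: {share_after_dedup:.1%} of raw size")
print(f"documents surviving:  {docs_surviving_dedup:.1%}")
print(f"tokens per document:  {tokens_per_doc:.0f}")
print(f"type/token ratio:     {type_token_ratio:.5f}")
```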
28
A wacky example
Results for wacky+NOUN (>2), Baroni et al. submitted
BNC: 3 ideas, 2 roles, 2 photo, 2 items, 2 humour, 2 characters
ukWaC: 71 world, 44 ideas, 43 wigglers, 42 wiggler, 28 characters, 27 sense, 22 comedy, 21 stuff, 21 races, 20 things, 19 idea, 15 humour, 13 games, 12 race, 11 backy, 10 baccy, 10 fun, 10 game, 10 inventions, 10 names, 10 uses
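A query such as wacky+NOUN needs part-of-speech annotation, which is what a tagged corpus offers and a search engine does not. A minimal sketch over a list of (word, tag) pairs is shown below; the function name `adj_noun_counts`, the `min_freq` cut-off and the assumption that noun tags start with "NN" (Penn-style) are illustrative and are not taken from the ukWaC tools.

```python
from collections import Counter


def adj_noun_counts(tagged_tokens, adjective, min_freq=2, noun_prefix="NN"):
    """Count nouns that immediately follow `adjective` in a POS-tagged corpus.

    tagged_tokens is a list of (word, tag) pairs; tags beginning with
    noun_prefix are treated as nouns (assumption: Penn-style tagset).
    """
    counts = Counter()
    # Walk over adjacent token pairs.
    for (word, _tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if word.lower() == adjective and next_tag.startswith(noun_prefix):
            counts[next_word.lower()] += 1
    # Keep only nouns that reach the frequency cut-off.
    return {noun: freq for noun, freq in counts.items() if freq >= min_freq}


if __name__ == "__main__":
    # Invented toy data, not drawn from the BNC or ukWaC.
    toy = [("wacky", "JJ"), ("ideas", "NNS"), ("and", "CC"),
           ("wacky", "JJ"), ("ideas", "NNS"), ("wacky", "JJ"),
           ("races", "NNS"), ("wacky", "JJ"), ("enough", "RB")]
    print(adj_noun_counts(toy, "wacky"))
```

In practice a query like this would be run through an indexed query system such as CQP rather than by scanning the corpus in Python.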
29
WaC: What the future holds
Has WaC replaced “traditional” corpora?
–Not really…
Challenges
–Cleaning techniques
–Web-tuned annotation tools
–Indexing and querying systems
–(Automatic) text classification
30
Approaches to Web text classification
Biber and Kurjian (2007)
–Search engine categories not well defined for purposes of linguistic analysis
  Google directory
–Multidimensional analysis
  text type approach
–Register approach
  future work
31
Approaches to Web text classification
Sharoff forthcoming
–Genre typology based on EAGLES recommendations
  “Communicative intentions”
  Discussion, information, instruction, propaganda, recreation, regulations, reporting
–SVMs to automatically categorise texts in Web corpus
–Classifiers trained on manually-classified texts
  BNC + subset of Web corpus
32
WaC challenges Representativeness Without representativeness, whatever is found to be true of a corpus, is simply true of that corpus – and cannot be extended to anything else (Leech 2007:135)
33
WaC challenges
Documentation
Compilers make the best corpus they can in the circumstances, and their proper stance is to be detailed and honest about the contents. From their description of the corpus, the research community can judge how far to trust their results, and future users of the same corpus can estimate its reliability for their purposes. (Sinclair 2005)
34
Thank you