Constructing and Evaluating Web Corpora: ukWaC Adriano Ferraresi University of Bologna Aston University Postgraduate Conference 22 May 2008
OUTLINE >Introduction >Building a very large Web-derived corpus >Evaluating ukWaC vs. the BNC >Issues and challenges in WaC
“Web-as-corpus”? >The Web: an immense, free, easily available source of textual materials >Traditional corpus resources: >Recent or very uncommon linguistic phenomena? >Specialized linguistic sub-domains? >Minority languages? >Use of the Web for linguistic purposes
The WaCky project >Exploiting the Web to build very large (~2 billion tokens) general-purpose corpora for various languages >A largely language-independent pipeline: itWaC, deWaC >The latest addition: ukWaC
OUTLINE >Introduction >Building a very large Web-derived corpus >Seed selection >Crawling >Post-crawl cleaning and annotation >Evaluating ukWaC vs. the BNC >Issues and challenges in WaC
Seed selection >Aim: greatest possible variety of text contents and genres >Ueyama (2006): effects of seed selection >Sampling from traditional written sources => “Public-sphere” documents >Sampling from basic vocabulary lists => Blogs, discussion forums, etc. >ukWaC: >Mid-frequency content words from the BNC >Vocabulary list for foreign learners >Spoken English (BNC)
Crawling >Using the Heritrix crawler >Excluding non-HTML data >A simple heuristic: limiting the crawl to the .uk Internet domain
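The .uk heuristic above can be illustrated with a minimal sketch. This is not the Heritrix configuration used for ukWaC; the function name `in_uk_domain` is hypothetical, and the check simply tests whether a URL's host name ends in `.uk`.

```python
from urllib.parse import urlparse


def in_uk_domain(url: str) -> bool:
    """Return True if the URL's host is under the .uk top-level domain.

    Hypothetical helper for illustration only; the actual crawl scope
    was set in the Heritrix crawler, not with this function.
    """
    # hostname is None for strings that do not parse as a URL with a host
    host = urlparse(url).hostname or ""
    # Drop a trailing dot (fully qualified form) before testing the suffix
    return host.rstrip(".").endswith(".uk")


if __name__ == "__main__":
    for candidate in ("http://www.bbc.co.uk/news", "http://example.com/page.uk"):
        print(candidate, in_uk_domain(candidate))
```

Only the host name is inspected, so a path that happens to end in ".uk" on a non-.uk site is rejected.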
Post-crawl cleaning and annotation >To reduce noise in the data (from 351 GB… to 12 GB!) >Filtering: >Only documents between 5 KB and 200 KB >Code and boilerplate (Fletcher, 2004) removed >Language and pornography filtering >Near-duplicate detection and removal >Annotation: the TreeTagger
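Two of the filtering steps above, the size window and near-duplicate removal, can be sketched as follows. This is an illustrative approximation, not the ukWaC implementation: the slide does not say how near-duplicates were detected, so the word 5-gram shingles, the Jaccard similarity, the 0.8 threshold, and the choice to keep the first copy are all assumptions, as is reading "KB" as 1024 bytes.

```python
MIN_BYTES = 5 * 1024    # assumption: 1 KB = 1024 bytes
MAX_BYTES = 200 * 1024


def size_ok(raw: bytes) -> bool:
    """Keep only documents whose size falls within the 5 KB to 200 KB window."""
    return MIN_BYTES <= len(raw) <= MAX_BYTES


def shingles(text: str, n: int = 5) -> set:
    """Return the set of word n-grams (shingles) in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_near_duplicates(docs, threshold: float = 0.8, n: int = 5):
    """Keep the first document of each group of near-duplicates.

    Quadratic in the number of documents, so suitable only as a toy
    illustration; the threshold and shingle size are assumed values.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc, n)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # too similar to a document already kept
        kept.append(doc)
        kept_shingles.append(current)
    return kept
```

A corpus-scale pipeline would need a scalable variant (for example, comparing a small sample of hashed shingles per document) rather than this all-pairs comparison.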
OUTLINE >Introduction >Building a very large Web-derived corpus >Evaluating ukWaC vs. the BNC >Methodology >Nouns most typical of ukWaC >Nouns most typical of the BNC >A note of caution >Issues and challenges in WaC
ukWaC vs. the BNC: A vocabulary-based comparison >Along the lines of Sharoff (2006): comparing noun wordlists across a traditional corpus (the BNC) and a Web corpus (ukWaC) >Log-likelihood association measure: the nouns “most typical” of either corpus >50 nouns with the highest log-likelihood score: >250 randomly selected concordances >Associated URL
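The log-likelihood comparison above can be sketched as follows. The slide does not give the formula, so this assumes the two-term formulation commonly used for corpus keyness, where each observed frequency is compared with the frequency expected if the word were equally common in both corpora. The function names and the toy counts in the usage note are invented for illustration and are not ukWaC or BNC figures.

```python
import math


def log_likelihood(freq_a: int, freq_b: int, size_a: int, size_b: int) -> float:
    """Log-likelihood score for one word across two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in each corpus.
    Assumed formula: 2 * sum(observed * ln(observed / expected)).
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    # A zero observed frequency contributes nothing (limit of x * ln x is 0)
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def more_typical_of(freq_a: int, freq_b: int, size_a: int, size_b: int) -> str:
    """Say which corpus has the higher relative frequency for the word."""
    return "A" if freq_a / size_a > freq_b / size_b else "B"
```

With toy counts, `more_typical_of(150, 100, 20000, 1000)` returns `"B"`: the word occurs more often in the larger corpus A in raw terms, but its relative frequency is higher in B. The score measures how far the distribution departs from equal relative frequency, so it has to be read together with the direction of the difference.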
The nouns most typical of ukWaC >Three main categories: >Web- and computer-related texts >“Public-sphere” documents: > Universities > The government and NGOs >Some examples:
The nouns most typical of the BNC >Three main categories: >Imaginative texts: e.g. eyes appears 74% of the time in ‘fiction/prose’ texts >Spoken language >Politics and economy >Some examples:
A note of caution >The methodology highlights several lexical differences between ukWaC and the BNC >However: a high log-likelihood score does not indicate absolute typicality >E.g. eyes, the 4th “most typical” noun of the BNC, is 15 times more frequent in ukWaC >What features make ukWaC and the BNC similar, instead of different?
OUTLINE >Introduction >Building a very large Web-derived corpus >Evaluating ukWaC vs. the BNC >Issues and challenges in WaC
Future work on ukWaC >ukWaC is being actively used in simulations of human learning, lexical semantics, language teaching… >However, we would like to improve on it: >Better data cleaning techniques (maybe adopting CLEANEVAL methodologies; Fairon et al., 2007) >Automatic classification into domains and genres (Santini et al., 2006) >… and to extend the analysis: >Usage-oriented task: discovery of collocational patterns for lexicography
THANK YOU!
REFERENCES
Fairon, C., Naets, H., Kilgarriff, A. and de Schryver, G.-M. (eds.) (2007) Building and exploring Web corpora. Proceedings of the WAC3 Conference. Louvain: Presses Universitaires de Louvain.
Fletcher, W.H. (2004) Making the Web more useful as a source for linguistic corpora. In Connor, U. and Upton, T. (eds.) Corpus Linguistics in North America.
Santini, M., Power, R. and Evans, R. (2006) Implementing a characterization of genre for automatic genre identification of Web pages. In Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL 2006).
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (eds.) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT Edizioni.
Ueyama, M. (2006) Evaluation of Web-based Japanese reference corpora: effects of seed selection and time interval. In Baroni, M. and Bernardini, S. (eds.) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT Edizioni.