1
Constructing and Evaluating Web Corpora: ukWaC
Adriano Ferraresi
University of Bologna
adriano@sslmit.unibo.it
Aston University Postgraduate Conference
22 May 2008
2
OUTLINE
>Introduction
>Building a very large Web-derived corpus
>Evaluating ukWaC vs. the BNC
>Issues and challenges in WaC
3
“Web-as-corpus”?
>The Web: an immense, free, easily available source of textual materials
>Traditional corpus resources:
  >Recent or very uncommon linguistic phenomena?
  >Specialized linguistic sub-domains?
  >Minority languages?
>Use of the Web for linguistic purposes
4
The WaCky project
>Exploiting the Web to build very large (~2 billion tokens) general-purpose corpora for various languages
  >http://wacky.sslmit.unibo.it/
>A largely language-independent pipeline: itWaC, deWaC
>The latest addition: ukWaC
5
OUTLINE
>Introduction
>Building a very large Web-derived corpus
  >Seed selection
  >Crawling
  >Post-crawl cleaning and annotation
>Evaluating ukWaC vs. the BNC
>Issues and challenges in WaC
6
Seed selection
>Aim: greatest possible variety of text contents and genres
>Ueyama (2006): effects of seed selection
  >Sampling from traditional written sources => “Public-sphere” documents
  >Sampling from basic vocabulary lists => Blogs, discussion forums, etc.
>ukWaC:
  >Mid-frequency content words from the BNC
  >Vocabulary list for foreign learners
  >Spoken English (BNC)
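A minimal sketch of the "mid-frequency words" idea, to make the seed step concrete. The rank band, the toy frequency list, and the step that combines seeds into random word pairs are all assumptions made for illustration; the slide does not state the actual thresholds or how seeds were turned into queries. Part-of-speech filtering for content words is omitted here.

```python
import random
from collections import Counter


def mid_frequency_words(freqs, lo_rank, hi_rank):
    """Return the words whose frequency rank lies in [lo_rank, hi_rank).

    Rank 0 is the most frequent word; ties are broken alphabetically.
    """
    ranked = [w for w, _ in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))]
    return ranked[lo_rank:hi_rank]


def random_pairs(words, n_pairs, seed=0):
    """Draw n_pairs random pairs of distinct seed words (hypothetical query format)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(words, 2)) for _ in range(n_pairs)]


# Toy frequency list (invented counts, not BNC data).
toy_freqs = Counter({
    "the": 1000, "of": 900,
    "house": 120, "river": 110, "garden": 100, "window": 95,
    "zygote": 2, "quark": 1,
})

# Skip the two most frequent words and the rare tail: an assumed rank band.
seeds = mid_frequency_words(toy_freqs, 2, 6)
queries = random_pairs(seeds, 3)
```

Cutting by rank rather than by raw count keeps the band comparable across source lists of different sizes, which matters when seeds come from several sources, as on this slide.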
7
Crawling
>Using the Heritrix crawler
>Excluding non-HTML data
>A simple heuristic: limiting the crawl to the .uk Internet domain
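The two crawl restrictions above can be illustrated with a small URL filter. This is only a Python sketch of the rules, not Heritrix configuration; the function name `in_crawl_scope` and the suffix list are hypothetical, and checking file extensions is just one assumed way to exclude non-HTML data.

```python
from urllib.parse import urlparse

# Assumed list of extensions treated as non-HTML; a real crawl would
# more likely also check the Content-Type of the response.
NON_HTML_SUFFIXES = (".pdf", ".jpg", ".jpeg", ".png", ".gif", ".zip", ".mp3", ".doc")


def in_crawl_scope(url):
    """Return True if the URL is an http(s) page on a .uk host and does not
    look like a non-HTML file."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if not host.endswith(".uk"):
        return False
    return not parsed.path.lower().endswith(NON_HTML_SUFFIXES)
```

For example, `in_crawl_scope("http://www.example.co.uk/index.html")` is accepted, while a `.com` host or a `.pdf` path is rejected.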
8
Post-crawl cleaning and annotation
>To reduce noise in the data (from 351 GB… to 12 GB!)
>Filtering:
  >Only documents between 5 KB and 200 KB
  >Code and boilerplate (Fletcher, 2004) removed
  >Language and pornography filtering
  >Near-duplicate detection and removal
>Annotation: the TreeTagger
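Two of the filtering steps above lend themselves to a short sketch: the size filter and near-duplicate removal. The slide does not say how near-duplicates were detected, so the word n-gram (shingle) overlap with a Jaccard threshold used here is a swapped-in, commonly used technique, and the threshold, the shingle length, and the reading of 1 KB as 1024 bytes are assumptions.

```python
MIN_BYTES = 5 * 1024     # 5 KB lower bound from the slide (1 KB = 1024 bytes assumed)
MAX_BYTES = 200 * 1024   # 200 KB upper bound from the slide


def size_ok(raw_html: bytes) -> bool:
    """Keep only documents whose raw size is within the 5 KB to 200 KB range."""
    return MIN_BYTES <= len(raw_html) <= MAX_BYTES


def shingles(text, n=5):
    """Return the set of word n-grams of a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, n=5):
    """Keep a document only if its shingle overlap with every previously
    kept document is below the (assumed) threshold."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

The pairwise comparison is quadratic in the number of documents, so it only illustrates the principle; at the scale of a multi-gigabyte crawl some form of hashing or sampling of shingles would be needed.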
9
OUTLINE
>Introduction
>Building a very large Web-derived corpus
>Evaluating ukWaC vs. the BNC
  >Methodology
  >Nouns most typical of ukWaC
  >Nouns most typical of the BNC
  >A note of caution
>Issues and challenges in WaC
10
ukWaC vs. the BNC: A vocabulary-based comparison
>Along the lines of Sharoff (2006): comparing noun wordlists across a traditional corpus (the BNC) and a Web corpus (ukWaC)
>Log-likelihood association measure: the nouns “most typical” of either corpus
>50 nouns with the highest log-likelihood score:
  >250 randomly selected concordances
  >Associated URL
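The comparison described above can be sketched as follows. The slides do not say which log-likelihood variant was used; the code assumes the two-term form often used for keyword comparison, where a word's observed counts in the two corpora are compared with the counts expected if it were equally frequent in both. The function names and the toy counts in the usage note are hypothetical.

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-term log-likelihood score for one word across two corpora.

    count_*: occurrences of the word; size_*: corpus size in tokens.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    if count_a > 0:
        ll += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        ll += count_b * math.log(count_b / expected_b)
    return 2 * ll


def rank_typical_nouns(counts_a, size_a, counts_b, size_b, top_n=50):
    """Return up to top_n nouns that are relatively more frequent in corpus A
    than in corpus B, ordered by decreasing log-likelihood score."""
    scored = []
    for noun in set(counts_a) | set(counts_b):
        a = counts_a.get(noun, 0)
        b = counts_b.get(noun, 0)
        if a / size_a > b / size_b:
            scored.append((log_likelihood(a, size_a, b, size_b), noun))
    scored.sort(reverse=True)
    return [noun for _, noun in scored[:top_n]]
```

With invented counts such as `{"website": 50, "eyes": 5}` for one corpus and `{"website": 2, "eyes": 40}` for another of the same size, "website" is ranked as typical of the first and "eyes" of the second. The score itself is symmetric, which is why the direction check on relative frequency is needed to assign a noun to one corpus or the other.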
11
The nouns most typical of ukWaC
>Three main categories:
  >Web- and computer-related texts
  >“Public-sphere” documents:
    >Universities
    >The government and NGOs
>Some examples:
12
The nouns most typical of the BNC
>Three main categories:
  >Imaginative texts: e.g. eyes appears 74% of the time in ‘fiction/prose’ texts
  >Spoken language
  >Politics and economy
>Some examples:
13
A note of caution
>The methodology highlights several lexical differences between ukWaC and the BNC
>However: a high log-likelihood score does not indicate absolute typicality
  >E.g. eyes, the 4th “most typical” noun of the BNC, is 15 times more frequent in ukWaC
>What features make ukWaC and the BNC similar, instead of different?
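A small worked example may help show how a noun can be "typical" of the BNC and still occur far more often in ukWaC. It assumes that "15 times more frequent" refers to absolute counts, takes ukWaC at roughly 2 billion tokens (the figure given for WaCky corpora earlier in the deck) and the BNC at roughly 100 million tokens, and uses an invented count for eyes; none of these numbers are the study's actual data.

```python
def per_million(count, corpus_size):
    """Relative frequency expressed as occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


BNC_SIZE = 100_000_000        # approximate size, assumed for illustration
UKWAC_SIZE = 2_000_000_000    # "~2 billion tokens", approximate

eyes_bnc = 40_000             # hypothetical count, not a real BNC figure
eyes_ukwac = 15 * eyes_bnc    # 15 times as many occurrences in absolute terms

pm_bnc = per_million(eyes_bnc, BNC_SIZE)
pm_ukwac = per_million(eyes_ukwac, UKWAC_SIZE)

# Absolute count is higher in ukWaC, relative frequency is higher in the BNC.
more_in_ukwac_absolute = eyes_ukwac > eyes_bnc
more_in_bnc_relative = pm_bnc > pm_ukwac
```

Under these assumptions the larger corpus contains more occurrences simply because it is about twenty times bigger, while the per-million rate stays higher in the BNC, which is what the log-likelihood ranking reflects.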
14
OUTLINE
>Introduction
>Building a very large Web-derived corpus
>Evaluating ukWaC vs. the BNC
>Issues and challenges in WaC
15
Future work on ukWaC
>ukWaC is being actively used in simulations of human learning, lexical semantics, language teaching…
>However, we would like to improve on it:
  >Better data cleaning techniques (maybe adopting CLEANEVAL methodologies; Fairon et al., 2007)
  >Automatic classification into domains and genres (Santini et al., 2006)
>… and to extend the analysis:
  >Usage-oriented task: discovery of collocational patterns for lexicography
16
THANK YOU!
17
REFERENCES
Fairon, C., Naets, H., Kilgarriff, A. and de Schryver, G.-M. (eds.) (2007) Building and exploring Web corpora. Proceedings of the WAC3 Conference. Louvain: Presses Universitaires de Louvain.
Fletcher, W.H. (2004) Making the Web more useful as a source for linguistic corpora. In Connor, U. and Upton, T. (eds.) Corpus Linguistics in North America 2002.
Santini, M., Power, R. and Evans, R. (2006) Implementing a characterization of genre for automatic genre identification of Web pages. In Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL 2006).
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (eds.) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT. 63-98.
Ueyama, M. (2006) Evaluation of Web-based Japanese reference corpora: effects of seed selection and time interval. In Baroni, M. and Bernardini, S. (eds.) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT Edizioni. 99-126.