Constructing and Evaluating Web Corpora: ukWaC Adriano Ferraresi University of Bologna Aston University Postgraduate Conference.

Constructing and Evaluating Web Corpora: ukWaC Adriano Ferraresi University of Bologna adriano@sslmit.unibo.it Aston University Postgraduate Conference 22 May 2008

OUTLINE >Introduction >Building a very large Web-derived corpus >Evaluating ukWaC vs. the BNC >Issues and challenges in WaC

“Web-as-corpus”? >The Web: an immense, free, easily available source of textual materials >Traditional corpus resources: >Recent or very uncommon linguistic phenomena? >Specialized linguistic sub-domains? >Minority languages? >Use of the Web for linguistic purposes

The WaCky project >Exploiting the Web to build very large (~2 billion tokens) general-purpose corpora for various languages > http://wacky.sslmit.unibo.it/ >A largely language-independent pipeline: itWaC, deWaC >The last born: ukWaC

OUTLINE >Introduction >Building a very large Web-derived corpus >Seed selection >Crawling >Post-crawl cleaning and annotation >Evaluating ukWaC vs. the BNC >Issues and challenges in WaC

Seed selection >Aim: greatest possible variety of text contents and genres >Ueyama (2006): effects of seed selection >Sampling from traditional written sources => “Public-sphere” documents >Sampling from basic vocabulary lists => Blogs, forums of discussion, etc. >ukWaC: >Mid-frequency content words from the BNC >Vocabulary list for foreign learners >Spoken English (BNC)
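As a rough illustration of the "mid-frequency content words" idea, the sketch below ranks the words of a reference corpus by frequency, keeps a band from the middle of the ranking, and combines the survivors into random pairs for use as seeds. The rank cut-offs, the pairing step and the function names are assumptions made for illustration. The slides do not say how the ukWaC seeds were actually assembled, and no content-word filtering is attempted here.

```python
import random
from collections import Counter


def mid_frequency_words(tokens, low_rank, high_rank):
    """Return the words whose frequency rank lies in [low_rank, high_rank).

    Rank 0 is the most frequent word. The cut-offs are hypothetical
    parameters, not values taken from the ukWaC project.
    """
    ranked = [word for word, _ in Counter(tokens).most_common()]
    return ranked[low_rank:high_rank]


def seed_pairs(words, n_pairs, rng):
    """Draw n_pairs random two-word combinations (an assumed seed shape)."""
    return [tuple(rng.sample(words, 2)) for _ in range(n_pairs)]


if __name__ == "__main__":
    # Toy token list standing in for a reference corpus.
    toy_tokens = ["the"] * 5 + ["cat"] * 3 + ["sat"] * 2 + ["mat"]
    band = mid_frequency_words(toy_tokens, low_rank=1, high_rank=4)
    print(seed_pairs(band, n_pairs=3, rng=random.Random(0)))
```

Skipping the top of the ranking avoids function words that match almost any page, while skipping the tail avoids words too rare to return many documents.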

Crawling >Using the Heritrix crawler >Excluding non-HTML data >A simple heuristic: limiting the crawl to the .uk Internet domain
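The domain heuristic can be sketched as a check on the host name of each URL. This is a stand-alone Python illustration, not Heritrix configuration; the function name `in_uk_domain` is hypothetical.

```python
from urllib.parse import urlsplit


def in_uk_domain(url):
    """Return True if the URL's host name ends in the .uk top-level domain."""
    host = urlsplit(url).hostname  # lower-cased by urlsplit, None if absent
    return host is not None and (host == "uk" or host.endswith(".uk"))


if __name__ == "__main__":
    for candidate in ("http://www.example.co.uk/page", "http://example.com/uk"):
        print(candidate, in_uk_domain(candidate))
```

Testing the host name rather than the whole URL avoids accepting pages that merely contain "uk" somewhere in their path.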

Post-crawl cleaning and annotation >To reduce noise in the data (from 351 GB… to 12 GB!) >Filtering: >Only documents between 5 KB and 200 KB >Code and boilerplate (Fletcher, 2004) removed >Language and pornography filtering >Near-duplicate detection and removal >Annotation: the TreeTagger
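Two of the filtering steps above lend themselves to a short sketch: the size filter and near-duplicate removal. The slides do not name the near-duplicate algorithm, so the example swaps in a simple technique, Jaccard overlap between sets of word 5-grams ("shingles"). The threshold, the shingle length, the reading of KB as 1024 bytes and all function names are assumptions.

```python
MIN_BYTES = 5 * 1024    # assumed reading of "5 KB"
MAX_BYTES = 200 * 1024  # assumed reading of "200 KB"


def size_ok(raw):
    """Keep only documents whose raw size lies within the two bounds."""
    return MIN_BYTES <= len(raw) <= MAX_BYTES


def shingles(text, n=5):
    """Return the set of word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.5, n=5):
    """Keep a document only if it is dissimilar to every document kept so far."""
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc, n)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept
```

The pairwise comparison is quadratic in the number of documents, which is fine for a demonstration but would not scale to a crawl of this size without sampling or hashing.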

OUTLINE >Introduction >Building a very large Web-derived corpus >Evaluating ukWaC vs. the BNC >Methodology >Nouns most typical of ukWaC >Nouns most typical of the BNC >A note of caution >Issues and challenges in WaC

ukWaC vs. the BNC: A vocabulary-based comparison >Along the lines of Sharoff (2006): comparing noun wordlists across a traditional corpus (the BNC) and a Web corpus (ukWaC) >Log-likelihood association measure: the nouns “most typical” of either corpus >50 nouns with the highest log-likelihood score: >250 randomly selected concordances >Associated URL
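A minimal sketch of a log-likelihood comparison for a single word across two corpora follows. It uses the common two-term form of the statistic, in which each observed frequency is compared with the frequency expected if the word were spread evenly over both corpora. Whether the ukWaC evaluation used exactly this variant is an assumption, and the counts in the usage lines are invented.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-corpus log-likelihood score for one word.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total token counts of the two corpora.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


if __name__ == "__main__":
    # Invented counts: same corpus size, different word frequency.
    print(log_likelihood(30, 10, 1000, 1000))
    # Identical relative frequency gives a score of zero.
    print(log_likelihood(10, 10, 1000, 1000))
```

The score measures how far the two frequencies depart from an even distribution, not which corpus the word favours. To rank nouns as "most typical" of one corpus, the sign of the difference in relative frequency has to be checked separately.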

The nouns most typical of ukWaC >Three main categories: >Web- and computer-related texts >“Public-sphere” documents: > Universities > The government and NGOs >Some examples:

The nouns most typical of the BNC >Three main categories: >Imaginative texts: e.g. eyes appears 74% of the time in ‘fiction/prose’ texts >Spoken language >Politics and economy >Some examples:

A note of caution >The methodology highlights several lexical differences between ukWaC and the BNC >However: a high log-likelihood score does not indicate absolute typicality >E.g. eyes, the 4th “most typical” noun of the BNC, is 15 times more frequent in ukWaC >What features make ukWaC and the BNC similar, instead of different?

OUTLINE >Introduction >Building a very large Web-derived corpus >Evaluating ukWaC vs. the BNC >Issues and challenges in WaC

Future work on ukWaC >ukWaC is being actively used in simulations of human learning, lexical semantics, language teaching… >However, we would like to improve on it: >Better data cleaning techniques (maybe adopting CLEANEVAL methodologies; Fairon et al., 2007) >Automatic classification into domains and genres (Santini et al., 2006) >… and to extend the analysis: >Usage-oriented task: discovery of collocational patterns for lexicography

THANK YOU! Adriano Ferraresi University of Bologna adriano@sslmit.unibo.it Aston University Postgraduate Conference 22 May 2008

REFERENCES
Fairon, C., Naets, H., Kilgarriff, A. and de Schryver, G.-M. (eds.) (2007) Building and exploring Web corpora. Proceedings of the WAC3 Conference. Louvain: Presses Universitaires de Louvain.
Fletcher, W.H. (2004) Making the Web more useful as a source for linguistic corpora. In Connor, U. and Upton, T. (eds.) Corpus Linguistics in North America 2002.
Santini, M., Power, R. and Evans, R. (2006) Implementing a characterization of genre for automatic genre identification of Web pages. In Proceedings of Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL 2006).
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (eds.) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT. 63-98.
Ueyama, M. (2006) Evaluation of Web-based Japanese reference corpora: effects of seed selection and time interval. In Baroni, M. and Bernardini, S. (eds.) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT. 99-126.
