Constructing and Evaluating Web Corpora: ukWaC Adriano Ferraresi University of Bologna Aston University Postgraduate Conference.


Constructing and Evaluating Web Corpora: ukWaC Adriano Ferraresi University of Bologna Aston University Postgraduate Conference 22 May 2008

OUTLINE >Introduction >Building a very large Web-derived corpus >Evaluating ukWaC vs. the BNC >Issues and challenges in WaC

“Web-as-corpus”? >The Web: an immense, free, easily available source of textual materials >Traditional corpus resources: >Recent or very uncommon linguistic phenomena? >Specialized linguistic sub-domains? >Minority languages? >Use of the Web for linguistic purposes

The WaCky project >Exploiting the Web to build very large (~2 billion tokens) general-purpose corpora for various languages >A largely language-independent pipeline: itWaC, deWaC >The most recent addition: ukWaC

OUTLINE >Introduction >Building a very large Web-derived corpus >Seed selection >Crawling >Post-crawl cleaning and annotation >Evaluating ukWaC vs. the BNC >Issues and challenges in WaC

Seed selection >Aim: greatest possible variety of text contents and genres >Ueyama (2006): effects of seed selection >Sampling from traditional written sources => “Public-sphere” documents >Sampling from basic vocabulary lists => Blogs, discussion forums, etc. >ukWaC: >Mid-frequency content words from the BNC >Vocabulary list for foreign learners >Spoken English (BNC)
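The idea of drawing seeds from a mid-frequency band can be sketched as follows. This is only an illustration: the function name `mid_frequency_seeds`, the rank cut-offs and the sample size are hypothetical, and the slides do not say how the band was actually delimited or how the seeds were turned into queries.

```python
import random

def mid_frequency_seeds(freq_list, skip_top=1000, band_size=5000,
                        sample_size=10, seed=0):
    """Sample seed words from a mid-frequency band of a frequency list.

    freq_list maps word -> corpus frequency. The cut-offs are assumptions
    made for illustration, not values reported in the slides.
    """
    # Rank words from most to least frequent.
    ranked = sorted(freq_list.items(), key=lambda kv: kv[1], reverse=True)
    # Skip the most frequent words and keep the next band of ranks.
    band = [word for word, _ in ranked[skip_top:skip_top + band_size]]
    # Draw a reproducible random sample from that band.
    rng = random.Random(seed)
    return rng.sample(band, min(sample_size, len(band)))
```

Skipping the top ranks avoids function words and very general vocabulary, while capping the band avoids rare words that would retrieve few pages.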

Crawling >Using the Heritrix crawler >Excluding non-HTML data >A simple heuristic: limiting the crawl to the .uk Internet domain
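The two crawl restrictions above amount to a scope test on each candidate URL. The sketch below shows that logic in plain Python; it is not Heritrix configuration, and the function name `in_crawl_scope` and the extension list are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical list of extensions treated as non-HTML; not taken from the slides.
NON_HTML_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".png", ".gif",
                       ".zip", ".mp3", ".doc")

def in_crawl_scope(url: str) -> bool:
    """Return True if the URL is a web page candidate inside the .uk domain."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Domain heuristic: keep only hosts under the .uk top-level domain.
    if not host.endswith(".uk"):
        return False
    # Exclude resources whose path ends in a known non-HTML extension.
    return not parsed.path.lower().endswith(NON_HTML_EXTENSIONS)
```

A file extension is only a rough proxy for content type; a real crawler would normally also check the `Content-Type` header of the response.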

Post-crawl cleaning and annotation >To reduce noise in the data (from 351 GB… to 12 GB!) >Filtering: >Only documents between 5 KB and 200 KB >Code and boilerplate (Fletcher, 2004) removed >Language and pornography filtering >Near-duplicate detection and removal >Annotation: the TreeTagger
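Two of the filtering steps above, the size limits and near-duplicate removal, can be sketched in a few lines. The slides do not describe the near-duplicate algorithm, so the word-shingle and Jaccard-overlap approach below is an assumption, as are the shingle length, the threshold, the function names and the choice to keep the first of two near-duplicates.

```python
MIN_BYTES = 5 * 1024     # lower size limit from the slides (5 KB)
MAX_BYTES = 200 * 1024   # upper size limit from the slides (200 KB)

def size_ok(raw_html: bytes) -> bool:
    """Keep only documents whose raw size falls within the limits."""
    return MIN_BYTES <= len(raw_html) <= MAX_BYTES

def shingles(text: str, n: int = 5) -> set:
    """Return the set of word n-grams (shingles) of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(docs, threshold=0.8, n=5):
    """Keep a document only if it is not too similar to one already kept.

    threshold and n are hypothetical values chosen for illustration.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

The pairwise comparison is quadratic in the number of documents, so at Web scale it would have to be replaced by a fingerprinting or hashing scheme; the sketch only shows the principle.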

OUTLINE >Introduction >Building a very large Web-derived corpus >Evaluating ukWaC vs. the BNC >Methodology >Nouns most typical of ukWaC >Nouns most typical of the BNC >A note of caution >Issues and challenges in WaC

ukWaC vs. the BNC: A vocabulary-based comparison >Along the lines of Sharoff (2006): comparing noun wordlists across a traditional corpus (the BNC) and a Web corpus (ukWaC) >Log-likelihood association measure: the nouns “most typical” of either corpus >For the 50 nouns with the highest log-likelihood score: >250 randomly selected concordances >Associated URLs
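The slides do not say which formulation of log-likelihood was used. The sketch below assumes the two-corpus form commonly used for keyword comparison, in which observed frequencies are compared with the frequencies expected if the word were spread evenly across both corpora; the function name and arguments are illustrative.

```python
import math

def log_likelihood(freq_a: int, freq_b: int, size_a: int, size_b: int) -> float:
    """Log-likelihood score for one word across two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in each corpus.
    """
    total = freq_a + freq_b
    # Expected frequencies if the word were equally typical of both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll
```

The score says how strongly the distribution departs from the even-spread expectation, not in which direction; to decide which corpus a noun is "most typical" of, the relative frequencies `freq_a / size_a` and `freq_b / size_b` have to be compared as well.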

The nouns most typical of ukWaC >Three main categories: >Web- and computer-related texts >“Public-sphere” documents: > Universities > The government and NGOs >Some examples:

The nouns most typical of the BNC >Three main categories: >Imaginative texts: e.g. eyes appears 74% of the time in ‘fiction/prose’ texts >Spoken language >Politics and economy >Some examples:

A note of caution >The methodology highlights several lexical differences between ukWaC and the BNC >However: a high log-likelihood score does not indicate absolute typicality >E.g. eyes, the 4th “most typical” noun of the BNC, is 15 times more frequent in ukWaC >What features make ukWaC and the BNC similar, instead of different?
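One way such a situation can arise, assuming the "15 times" figure refers to raw counts, is the difference in corpus size. The counts below are invented purely for illustration; only the approximate corpus sizes (about 100 million tokens for the BNC, about 2 billion for ukWaC) reflect known figures.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Relative frequency of a word, in occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

BNC_SIZE = 100_000_000       # approximate size of the BNC in tokens
UKWAC_SIZE = 2_000_000_000   # approximate size of ukWaC (~2 billion tokens)

# Invented counts for a hypothetical word; not data from the study.
count_bnc = 40_000
count_ukwac = 600_000

raw_ratio = count_ukwac / count_bnc               # raw counts: ukWaC has 15x more
rel_bnc = per_million(count_bnc, BNC_SIZE)        # occurrences per million in the BNC
rel_ukwac = per_million(count_ukwac, UKWAC_SIZE)  # occurrences per million in ukWaC
```

With these numbers the word has 15 times more occurrences in the larger corpus, yet its relative frequency is higher in the smaller one (400 vs. 300 per million), which is the kind of difference a size-sensitive measure picks up.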

OUTLINE >Introduction >Building a very large Web-derived corpus >Evaluating ukWaC vs. the BNC >Issues and challenges in WaC

Future work on ukWaC >ukWaC is being actively used in simulations of human learning, lexical semantics, language teaching… >However, we would like to improve on it: >Better data cleaning techniques (maybe adopting CLEANEVAL methodologies; Fairon et al., 2007) >Automatic classification into domains and genres (Santini et al., 2006) >… and to extend the analysis: >Usage-oriented task: discovery of collocational patterns for lexicography

THANK YOU! Adriano Ferraresi University of Bologna Aston University Postgraduate Conference 22 May 2008

REFERENCES
Fairon, C., Naets, H., Kilgarriff, A. and de Schryver, G.-M. (eds.) (2007) Building and exploring Web corpora. Proceedings of the WAC3 Conference. Louvain: Presses Universitaires de Louvain.
Fletcher, W.H. (2004) Making the Web more useful as a source for linguistic corpora. In Connor, U. and Upton, T. (eds.) Corpus Linguistics in North America.
Santini, M., Power, R. and Evans, R. (2006) Implementing a characterization of genre for automatic genre identification of Web pages. In Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL 2006).
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (eds.) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT Edizioni.
Ueyama, M. (2006) Evaluation of Web-based Japanese reference corpora: effects of seed selection and time interval. In Baroni, M. and Bernardini, S. (eds.) WaCky! Working Papers on the Web as Corpus. Bologna: GEDIT Edizioni.