Using Corpora for Language Research


1 Using Corpora for Language Research
COGS 523, Lecture 6: Web as Corpus. Bilge Say.

2 Related Readings
Bernardini, S., M. Baroni and S. Evert (2006). A WaCky Introduction. In Working Papers on the Web as Corpus.
Baroni, M., S. Bernardini, A. Ferraresi and E. Zanchetta (to appear). The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation Journal.
Sharoff, S. (2006). Open-source Corpora. International Journal of Corpus Linguistics, 11(4).
Kilgarriff, A. and G. Grefenstette (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3).
Kilgarriff, A. (2007). Googleology is Bad Science. Computational Linguistics, 33(1).
Web as Corpus (WAC) Workshops; see the proceedings of the 2008 workshop in the link below under Conferences.

3 Web as Corpus
Attractiveness: free, immense and easily available
Issues:
Representativeness and balance
Legal issues of web corpora
Suitability of linguistic queries with search engines

4 Web as Corpus: Many Senses?
The Web as a corpus surrogate
The Web as a corpus shop: create your own corpus from the web
The Web as corpus proper: the web as representative of web English
The mega-corpus/mini-web: a new object suitable for linguistic study, combining the three approaches above

5 Advantages
Size: estimates of the amount of text
For Google: 3-4 terawords (1 teraword = 10^12 words), as of 2003
The 100-million-word BNC is good for some 10,000 types of core English, but what about the rest of the types, which occur 50 times or less?

6 Language Distribution
Function words, whose frequencies are stable across many types of text, can be used as predictors of corpus size (estimate in the next slide; a toy sketch follows below)
Xu (2000): 71% of web text in English; 7% Japanese, 5% German, 2% French, Chinese, ...
The proportion of non-English to English text is growing
Errors are more frequent than in traditionally published text, but significantly less frequent than in other sources
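Below is a minimal sketch, not from the slides, of the function-word estimate just mentioned: each stable function word's search-engine hit count is divided by its per-million-word rate in a reference corpus, and the per-word estimates are averaged. All frequencies and hit counts in the sketch are illustrative placeholders, not real BNC or search-engine figures.

```python
# Function-word method for estimating corpus size (sketch).
# Rates and hit counts below are placeholders.

REFERENCE_FREQ_PER_MILLION = {  # hypothetical reference-corpus rates
    "the": 61000,
    "of": 29000,
    "and": 27000,
}

web_hits = {  # hypothetical search-engine hit counts
    "the": 3_500_000_000,
    "of": 1_700_000_000,
    "and": 1_500_000_000,
}

def estimate_corpus_size(hits, ref_freq_per_million):
    """Average the per-word size estimates (in words)."""
    estimates = [
        count / ref_freq_per_million[word] * 1_000_000
        for word, count in hits.items()
    ]
    return sum(estimates) / len(estimates)

print(f"Estimated web size: {estimate_corpus_size(web_hits, REFERENCE_FREQ_PER_MILLION):,.0f} words")
```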

7 Estimate of Web size in words, as indexed by AltaVista, for various languages (Table 3 of Kilgarriff & Grefenstette, 2003)

8 Frequencies of English phrases in the BNC and on AltaVista in 1998 and 2001, and on AlltheWeb. The counts for the BNC and AltaVista are individual occurrences of the phrase; the counts for AlltheWeb are page counts (the phrase may appear more than once on a page). (Table 1 of Kilgarriff & Grefenstette, 2003)

9 AltaVista frequencies for candidate translations of groupe de travail (Table 4 of Kilgarriff & Grefenstette, 2003)

10 Natural Language Processing View
Probabilistic models of language estimated from very large quantities of data (even if noisy) outperform models estimated from small, clean data sets with sophisticated smoothing techniques (Kilgarriff and Grefenstette)
NLP applications using the web as corpus:
Word sense disambiguation
Ontology population
Statistical machine translation (a toy example follows below)
More pragmatic corpus definitions: is corpus x good for task y?
Less emphasis on construction and design principles
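As a hedged illustration of the last point, the sketch below picks among candidate translations of groupe de travail (cf. the table on slide 9) simply by taking the phrase attested most often on the web. The counts are placeholders rather than the published AltaVista figures.

```python
# Choosing a translation by web frequency (sketch); counts are placeholders.

candidate_counts = {
    "work group": 66_000,
    "working group": 594_000,
    "labor group": 2_000,
    "labor collective": 400,
}

def best_translation(counts):
    """Return the candidate phrase attested most often on the web."""
    return max(counts, key=counts.get)

print(best_translation(candidate_counts))  # -> working group
```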

11 Legal Issues of Web Corpora
Are they really different from non-web corpora?
You can develop a web corpus without copying it
GNU Free Documentation Licence (for distribution)
Caches and indices kept by search engines are formidable anyhow

12 Issues: Representativeness
Try to understand the balance of the web rather than aim for representativeness
Automatic characterization of text types from the web

13 Querying for linguistic analysis with Search Engines
Not enough context for each instance
Not enough instances
Unreliable frequency statistics (e.g. hit counts per page instead of token statistics; titles or headings promote ranking; a toy illustration follows below)
Automated querying is limited
Limited search syntax and annotation (no lemmas or part-of-speech tags)
Some exceptions exist: search engines that treat the web as a corpus environment
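The toy snippet below, with made-up pages, shows why page hit counts and token counts diverge: a phrase occurring twice on one page is still reported only once by a search engine.

```python
# Page counts vs. token counts (toy illustration with invented pages).

pages = [
    "the working group met; the working group will meet again",
    "minutes of the working group",
    "nothing relevant here",
]

phrase = "working group"
page_hits = sum(1 for page in pages if phrase in page)   # what an engine reports
token_hits = sum(page.count(phrase) for page in pages)   # what a corpus query needs

print(page_hits, token_hits)  # 2 vs. 3
```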

14 Building Your Own Corpora from the Web
Interoperable tools to build your own corpus from the web and use it with a query engine (a sketch of the bootstrapping idea follows below):
BootCaT, WebBootCaT and the Sketch Engine (mostly for lexicographic purposes)
Free only for trial; individual academic licences are 50 euros per year
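The following is a minimal sketch of the BootCaT-style bootstrapping idea, not the actual tool: seed terms are combined into random tuples, each tuple becomes a search query, and the returned URLs are collected for downloading and cleaning. The search_urls function is a hypothetical stand-in for whichever search API is available.

```python
# BootCaT-style seed-to-URL bootstrapping (sketch).

import itertools
import random

def make_query_tuples(seeds, tuple_size=3, n_queries=10):
    """Draw random combinations of seed terms to use as search queries."""
    combos = list(itertools.combinations(seeds, tuple_size))
    random.shuffle(combos)
    return combos[:n_queries]

def search_urls(query):
    """Hypothetical search-engine call; a real one would return result URLs."""
    return []

def collect_candidate_urls(seeds):
    urls = set()
    for combo in make_query_tuples(seeds):
        urls.update(search_urls(" ".join(combo)))
    return urls

seeds = ["corpus", "annotation", "lemma", "concordance", "tokenisation"]
print(f"{len(collect_candidate_urls(seeds))} candidate URLs collected")
```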

15 Creating a Corpus from the Web
Crawling: selecting "seed" URLs
Harder for the "general" corpus case: representative of what? (What if the sampled web profile for a language is 90% pornography and dating sites, 9% Linux how-tos and 1% other?)
Retrieve pages by crawling
Issues: efficiency, duplicates, politeness, traps, file handling (a toy crawler follows below)
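A toy, standard-library-only crawler is sketched below to make the listed issues concrete: a visited set avoids duplicate fetches, a fixed delay keeps the crawl polite, and a page cap is a crude guard against traps. A production crawler would also need robots.txt handling, better error recovery and file storage.

```python
# Minimal polite crawler (sketch, standard library only).

import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50, delay=1.0):
    visited, queue, pages = set(), deque(seed_urls), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue                      # duplicate URL: skip
        visited.add(url)
        time.sleep(delay)                 # politeness: throttle requests
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                      # unreadable page: move on
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        queue.extend(urljoin(url, link) for link in extractor.links)
    return pages
```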

16 Cleaning Up
Removing HTML tags
Boilerplate stripping: you do not want "Click here" to be the most frequent phrase in your corpus
Language/encoding detection
Near-duplicate discovery: the same tutorial with different headers (a sketch of tag stripping and near-duplicate detection follows below)
Specialized community effort: CLEANEVAL
SIGWAC: the ACL Special Interest Group on the Web as Corpus
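Two of these clean-up steps are sketched below under simplifying assumptions: regular-expression tag stripping and near-duplicate detection via word-shingle overlap. Real CLEANEVAL-style pipelines are considerably more sophisticated.

```python
# Tag stripping and near-duplicate detection (sketch).

import re

def strip_tags(html):
    """Drop script/style blocks and tags, leaving rough running text."""
    html = re.sub(r"(?s)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def shingles(text, n=5):
    """Overlapping word n-grams used as a cheap document fingerprint."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_duplicates(doc_a, doc_b, threshold=0.8):
    """Jaccard similarity of shingle sets above the threshold flags a near-duplicate."""
    a, b = shingles(doc_a), shingles(doc_b)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold
```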

17 Annotation
Header information (text types): newer semi-automatic classification schemes are being developed (Mehler and Gleim)
Tokenization, POS annotation, lemmatisation (a sketch follows below)
Peculiarities of web language: neologisms, acronyms, smileys, non-standard spelling
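One possible annotation pass is sketched below with NLTK; this is an assumption about tooling, not what the slides prescribe, and any tokenizer, tagger and lemmatizer would do. The relevant NLTK data packages (punkt, averaged_perceptron_tagger, wordnet) must be downloaded beforehand, e.g. via nltk.download().

```python
# Tokenization, POS tagging and lemmatisation with NLTK (sketch).

import nltk
from nltk.stem import WordNetLemmatizer

sentence = "The corpora were crawled and lemmatised automatically :-)"

tokens = nltk.word_tokenize(sentence)   # tokenisation (note: the smiley gets split,
                                        # one of the web-language peculiarities above)
tagged = nltk.pos_tag(tokens)           # part-of-speech annotation
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(tok.lower()) for tok in tokens]

for (token, pos), lemma in zip(tagged, lemmas):
    print(f"{token}\t{pos}\t{lemma}")
```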

18 Query
Indexing and searching:
Expressiveness
Ease of use
Performance
Scalability

19 Sharoff’s Internet Corpora
Affordable alternatives to BNC-like efforts?
English, Chinese, Romanian, Russian, Ukrainian; Turkish (under way)
Composition assessment: by comparing text typology against resources such as the BNC or the Russian Reference Corpus, or by comparing frequency lists (a sketch follows below)
Seed generation: most frequent types from reference corpora
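The frequency-list comparison can be sketched, under the assumption that a log-likelihood keyness score is an acceptable stand-in for Sharoff's actual procedure. The counts and corpus sizes below are illustrative placeholders.

```python
# Comparing frequency lists with a log-likelihood keyness score (sketch).

import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood score for one word across two corpora."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# A word that is much more frequent in the web corpus than in the
# reference corpus receives a high keyness score.
print(log_likelihood(freq_a=1200, size_a=100_000_000,
                     freq_b=150, size_b=100_000_000))
```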

20 Triangulation for Internet Corpora (Fig. 1 of Sharoff, 2006)

21 The balance of text types in various corpora (Table 1 of Sharoff, 2006)

22 Words less/more frequent in news corpora (part of Table 2 of Sharoff, 2006)

23 Words less/more frequent in internet corpora (part of Table 3 of Sharoff, 2006)

24 The size of Internet corpora (Table 4 of Sharoff, 2006)

25 Applications from the 4th Web as Corpus (WAC) Workshop (2008)
GReG: reranks the snippets returned by Google's search engine within the top 10 links by introducing linguistic information (tagging, syntactic constituency, partial logical form)
GLB (Google for the Linguist on a Budget): an open-source, free system for robust web crawling, querying along multiple dimensions, and load balancing across many CPUs, especially for testing NLP language models on web-based corpora

26 Applications from the 4th Web as Corpus (WAC) Workshop (2008)
Victor: a web-page cleaning tool introduced at CLEANEVAL 2007, with a specifically linguistic aim, using machine learning together with its own annotation toolset and evaluation metrics
GlossaNet2: a free online concordancer service that lets users search dynamic web corpora via RSS feeds; uses features of the Unitex corpus tool

27 WaCky Corpora
ukWaC, deWaC, itWaC (details in Baroni et al.)
ukWaC: a large British English web-derived corpus of 2 billion tokens; freely available; part-of-speech tagged and lemmatized; wide range of genres; 30 GB with annotation; a vocabulary-wise comparison with the BNC is available

28 Lectures 7 and 8
Lecture 7 (April 14th): your tool evaluation presentations and reports!
Lecture 8: Statistics. Readings: McEnery and Wilson (2001), Ch. 3; McEnery et al. (2006), Unit A6; Biber et al., Methodology Boxes.

