What's on the Web? The Web as a Linguistic Corpus
Adam Kilgarriff, Lexical Computing Ltd, University of Leeds

Presentation transcript:


BL, Jan 2011, Kilgarriff: Web as Corpus

You can’t help noticing
Replaceable or replacable? –

What is a corpus?
A collection of texts
Call it a corpus when
  – Used for literary or linguistic research

History

Corpora since the 1960s
[Chart: corpus size (in words) by decade, 1960s to 2000s; corpora shown: Brown/LOB, COBUILD, BNC, OEC]

Pioneers
Dictionary publishers
  – Most words rare: must be vast
Other interested parties
  – Mostly for word frequency lists: educationalists, psychologists
Since 1990s
  – Language technology

Corpus types
Monolingual
Parallel
  – Bi-texts: a text and its translation
  – Statistical machine translation, e.g. Google Translate
Comparable
  – More than one language, same kind of text for each

Parameters
Language
Size
  – A thousand to a trillion words: 1,000 to 1,000,000,000,000
  – Measured in words, sentences, GB, hours
Text type
  – Writing, speech
  – Newspaper, blog, chat, academic, …, mixed
  – Sport, hairdressing, DNA of the nematode worm

The Web
Very very large
  – 2006 estimates for the duplicate-free, linguistic, Google-indexed web
      German: 44 billion words
      Italian: 25 billion words
      English: trillion words
Most languages
Most language types
Up-to-date
Free
Instant access

What is out there?
What text types are there on the web?
  – some are new: chatroom
  – proportions
      is it overwhelmed by porn? How much?
Hard question

Comparing frequency lists
Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4- and 5-grams with f>40 in one trillion words of English
Compare with British National Corpus
  – 100m words
  – Early 1990s: pre-web
Keywords of each vs. other
  – Highest contrast of frequency
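The "keywords of each vs. other" comparison can be sketched as follows. The slide only says "highest contrast of frequency"; the scoring used here (a ratio of per-million frequencies with a smoothing constant added to both sides) is an assumption chosen for illustration, and the toy word lists stand in for the real Web1T and BNC counts.

```python
from collections import Counter

def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}

def keywords(focus_counts, reference_counts, smoothing=1.0, top=10):
    """Rank words by how much more frequent they are in the focus list.

    Assumed scoring (not taken from the slides): smoothed ratio of
    per-million frequencies. Ties are broken alphabetically.
    """
    focus = per_million(focus_counts)
    ref = per_million(reference_counts)
    vocab = set(focus) | set(ref)
    scores = {
        w: (focus.get(w, 0.0) + smoothing) / (ref.get(w, 0.0) + smoothing)
        for w in vocab
    }
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]

# Toy data only: six tokens per "corpus".
web = Counter("browser url forum the the she".split())
bnc = Counter("she herself stood the the seemed".split())

print(keywords(web, bnc, top=3))  # words typical of the toy web list
print(keywords(bnc, web, top=3))  # words typical of the toy BNC list
```

Running the comparison in both directions gives the two lists the next slides call "Web-high" and "BNC-high".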

Web-high (155 terms)
61 web and computing
  – config browser spyware url www forum
38 porn
22 US English (incl. Spanish influence: los)
18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
4 legal
  – trademarks pursuant accordance herein

BNC-high
Exclude British English, transcription/tokenisation anomalies
  – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations
Pronouns and past tense verbs
  – Fiction
Masc vs fem
Yesterday
  – Probably daily newspapers
Constancy of ratios:
  – He/him/himself
  – She/her/herself

Corpus Factory
Most languages: no large corpora
Goal
  – 100 biggest languages, 100m-word corpora
BootCaT method
  – Repeat 50,000 times:
      Take seed words
      Send to a search engine
        – In random pairs, threes or fours
      Collect the pages the search engine finds
  – Seed words from Wikipedia
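The query-building step of the method above can be sketched like this. It is a minimal illustration, not the BootCaT code itself: the function name `seed_queries` and the example seed words are hypothetical, and the search-engine call and page download are deliberately left out.

```python
import random

def seed_queries(seed_words, n_queries, sizes=(2, 3, 4), rng=None):
    """Build search queries from random pairs, threes or fours of seed words.

    Hypothetical helper: each query is a space-separated tuple of distinct
    seed words, ready to be sent to whatever search engine is in use.
    """
    rng = rng or random.Random()
    queries = []
    for _ in range(n_queries):
        k = rng.choice(sizes)                      # pair, three or four
        queries.append(" ".join(rng.sample(seed_words, k)))
    return queries

# Illustrative seeds only; the slides take seed words from Wikipedia.
seeds = ["language", "history", "city", "river", "music", "school"]
for query in seed_queries(seeds, 5, rng=random.Random(0)):
    print(query)
```

In a full pipeline the loop would run tens of thousands of times, and the pages returned for each query would be collected for cleaning.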

42 Languages
Arabic, Bengali, Bulgarian, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Indonesian, Irish, Italian, Japanese, Korean, Malay, Malayalam, Maltese, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovene, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Vietnamese, Welsh

Corpus quality
Character encoding
‘Boilerplate’
  – Navigation bars, adverts, legal disclaimers, …
Duplicates
Language
  – Contamination by English
Concerns shared by Google, Microsoft, IBM etc.
LCL use (and develop) leading methods
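As a small illustration of the "duplicates" concern, the sketch below drops pages whose text is identical after case and whitespace normalisation. This is an assumption-laden simplification: it catches only exact duplicates, whereas real web corpora also need near-duplicate detection, and it is not the method LCL uses. The function names are hypothetical.

```python
import hashlib

def normalise(text):
    """Lower-case and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def drop_exact_duplicates(pages):
    """Keep the first copy of each page, comparing normalised text by hash."""
    seen = set()
    kept = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(page)
    return kept

pages = ["Hello  world", "hello world", "Something else"]
print(drop_exact_duplicates(pages))
```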

Levels of processing
Lemmas and word forms
  – Invade vs invade invaded invades invaded
Part-of-speech tagging
  – Also word-class tagging
      brush (verb) (“she brushed him aside”) vs. brush (noun) (“Give me the brush.”)
      can (verb) (“he can do it”) vs. can (noun) (“the beer can”)
Some languages, not others
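The difference between counting word forms and counting lemmas can be shown with a toy lookup table. The table `LEMMAS` is a hand-written, hypothetical stand-in for a real lemmatiser, which would be language-specific and normally rely on part-of-speech tags.

```python
from collections import Counter

# Hypothetical hand-written table mapping word forms to a lemma.
LEMMAS = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_counts(tokens, lemma_table=LEMMAS):
    """Count lemmas; forms missing from the table count as themselves."""
    return Counter(lemma_table.get(t.lower(), t.lower()) for t in tokens)

tokens = "They invaded and he invades".split()
print(Counter(t.lower() for t in tokens))  # word-form counts
print(lemma_counts(tokens))                # lemma counts
```

With word forms, "invaded" and "invades" are separate entries; with lemmas they are pooled under "invade".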

Demo