1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds.

Presentation transcript:

1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds

2 Web as language resource  Replaceable or replacable?  check check

3  Very very large  Most languages  Most language types  Up-to-date  Free  Instant access

4 How to use the web?  Google or other commercial search engines (CSEs)  not

5 Using CSEs No setup costs Start querying today Methods  Hit counts  ‘snippets’ Metasearch engines, WebCorp  Find pages and download

6 Googleology  CSE hit counts for language modelling 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista (Keller & Lapata 2003) finding noun-noun relations “we issue exact phrase Google queries of type noun2 THAT * noun1” Nakov and Hearst 2006  Small community of researchers Corpora mailing list  Very interesting work  Intense interest in query syntax Creativity and person-years

7 The Trouble with Google  not enough instances max 1000  not enough queries max 1000 per day with API  not enough context 10-word snippet around search term  ridiculous sort order search term in titles and headings  untrustworthy hit counts  limited search syntax No regular expressions  linguistically dumb not lemmatised  aime/aimer/aimes/aimons/aimez/aiment … not POS-tagged not parsed

8  Appeal Zero-cost entry, just start googling  Reality High-quality work: high-cost methodology

9 Also:  No replicability  Methods, stats not published  At mercy of commercial corporation

10 Also:  No replicability  Methods, stats not published  At mercy of commercial corporation  Bad science

11 The 5-grams  A present from Google  All 1-, 2-, 3-, 4-, 5-grams with fr>=40 in a terabyte of English  A large dataset
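
To make concrete what a dataset of this kind contains, here is a minimal sketch that counts all 1- to 5-grams in a token list and keeps those at or above a frequency threshold. It assumes simple whitespace tokenisation; the function name `ngram_counts` and its parameters are illustrative and say nothing about how Google built its dataset.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5, min_freq=40):
    """Count all 1- to max_n-grams and keep those seen at least min_freq times."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Toy usage: a tiny text, bigrams at most, and a threshold of 2 instead of 40
tokens = "the cat sat on the mat the cat sat".split()
frequent = ngram_counts(tokens, max_n=2, min_freq=2)
```

On a terabyte of text the counting would have to be distributed, but the thresholding step is the same idea.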

12 Prognosis  Next 3 years Exciting new ideas Dazzlingly clever uses Drives progress in NLP

13 Prognosis  Next 3 years Exciting new ideas Dazzlingly clever uses  After 5+ years A chain round our necks  Cf Penn Treebank (others? Brickbats?)  Resource-led vs. ideas-led research

14 How to use the web?  Google or other commercial search engines (CSEs)  not

15 Language and the web  Web is mostly linguistic  Text on web << whole web (in GB) Not many TB of text Special hardware not needed  We are the experts

16 Community-building  ACL SIGWAC  WAC Kool Ynitiative (WaCKY) Mailing list Open source  WAC workshops WAC1, Birmingham 2005 WAC2, Trento (EACL), April 2006 WAC3, Louvain, Sept

17 Proof of concept: DeWaC, ItWaC  1.5 B words each, German and Italian  Marco Baroni, Bologna (+ AK)

18 What is out there?  What text types? some are new: chatroom proportions  is it overwhelmed by porn? How much?  Hard question

19 What is out there  The web a social, cultural, political phenomenon new, little understood a legitimate object of science mostly language  we are well placed a lot of people will be interested  Let's study the web source of language data apply our tools for web use (dictionaries, MT) use the web as infrastructure

20 How to do it: Components 1.web crawler 2.filters and classifiers  de-duplication 3.linguistic processing Lemmatise, pos-tag, parse 4.Database Indexing user interface

21 1.Crawling  How big is your hard disk?  When will your sysadmin ban you? DeWaC/ItWaC  Open source crawler: heritrix

22 Seeding the crawl  Mid-frequency words  Spread of text types Formal and informal, not just newspaper DeWaC  Words from newspaper corpus  Words from list with “kitchen” vocab  Use Google to get seeds for crawls
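
One way to turn a frequency list into crawl seeds, in the spirit of this slide, is to pick mid-frequency words and combine them into random tuples that are sent to a search engine; the returned URLs then seed the crawler. The sketch below covers only the query-building step. The rank band, the tuple size, and the function names are assumptions, not the settings used for DeWaC or ItWaC.

```python
import random


def mid_frequency_words(freq, lo_rank=1000, hi_rank=6000):
    """Return words whose frequency rank falls in [lo_rank, hi_rank)."""
    ranked = sorted(freq, key=freq.get, reverse=True)
    return ranked[lo_rank:hi_rank]


def seed_queries(words, n_queries=10, tuple_size=2, seed=0):
    """Build random word tuples to use as search-engine queries."""
    rng = random.Random(seed)  # fixed seed so the query list is reproducible
    return [" ".join(rng.sample(words, tuple_size)) for _ in range(n_queries)]


# Toy usage with an invented frequency list: w0 is most frequent, w99 least
freq = {f"w{i}": 100 - i for i in range(100)}
words = mid_frequency_words(freq, lo_rank=20, hi_rank=60)
queries = seed_queries(words, n_queries=5)
```

Drawing the word list from more than one source (newspaper text plus everyday vocabulary) is what spreads the resulting pages across text types.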

23 2. Filtering  non ‘running-text’ stripping  Function word filtering  Porn filtering  De-duplication
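
The slide does not say how de-duplication is done. A common technique for near-duplicate detection is to compare documents by their overlapping word n-grams ("shingles") using Jaccard similarity; the sketch below uses that technique as an assumption, with an arbitrary shingle size and threshold.

```python
def shingles(text, n=5):
    """Return the set of word n-grams in a text (the whole text if it is shorter than n)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(docs, n=5, threshold=0.5):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = ["a b c d e f g h", "a b c d e f g h", "x y z q r s t u"]
unique = deduplicate(docs)
```

This pairwise comparison is quadratic in the number of documents, so a corpus of web scale would need a hashing scheme on top of it; the sketch only shows the similarity test itself.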

24 Filtering: Sentences  What is the text that we want? Lists? Links? Catalogues? …  For linguistics, NLP in sentences  Use function words
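
The idea behind "use function words" is that running text in sentences contains a high proportion of words like "the", "of" and "and", while menus, link lists and catalogues do not. A minimal sketch for English follows; the word list and the 0.25 threshold are assumptions chosen for illustration and would need tuning per language.

```python
# Small illustrative list; a real filter would use a fuller, language-specific one
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that",
                  "it", "for", "was", "on", "with", "as"}


def function_word_ratio(text):
    """Share of tokens that are function words, after stripping punctuation."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)


def looks_like_running_text(text, min_ratio=0.25):
    """Heuristic: enough function words means the text is probably sentences."""
    return function_word_ratio(text) >= min_ratio


sentence = "The cat sat on the mat and it was happy with that."
menu = "Home Products Contact Login Sitemap"
```

Here `looks_like_running_text(sentence)` is true and `looks_like_running_text(menu)` is false.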

25 Filtering: CLEANEVAL  “Text cleaning” Lots to be done, not glamorous Many kinds of dirt needing many kinds of filter  Open Competition/shared task Who can produce the cleanest text?! Input: arbitrary web pages “gold standard”  paragraph-marked plain text  Prepared by people  Workshop Sept do join us! 

26 3. Linguistic processing  Lemmatise, POS-tag, parse Find leading NLP group for each language Be nice to them Use their tools

27 Database, interface  Solved problem (at least for 1.5 BW)  Sketch Engine

28 “Despite all the disadvantages, it’s still so much bigger”

29 How much bigger?  Method Sample words  30  Mid-to-high freq  Not common words in other major lgs  Min 5 chars Compare freqs, Google vs ItWaC/DeWaC

30 Google results (Italian)  Arbitrariness Repeat identical searches 9/30: > 10% difference 6/30: > 100% difference  API: typically 1/18th of ‘manual’ figure  Language filter mista bomba clima  mostly non-Italian pages  use MAX and MIN of 6 lg-filtered results
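
The arbitrariness check can be expressed as a small calculation: run the same query twice, compare the two hit counts, and report what share of words exceed a given difference. The slide does not define how the difference is measured; the sketch assumes it is taken relative to the smaller count (the only reading under which "> 100%" is possible). The hit counts below are invented for illustration.

```python
def relative_difference(a, b):
    """Difference between two counts as a fraction of the smaller one."""
    lo, hi = sorted((a, b))
    if lo == 0:
        return float("inf") if hi else 0.0
    return (hi - lo) / lo


def unstable_share(pairs, threshold):
    """Share of (first, second) count pairs differing by more than threshold."""
    return sum(relative_difference(a, b) > threshold for a, b in pairs) / len(pairs)


# Invented counts for three words, each queried twice
pairs = [(100, 105), (100, 120), (100, 250)]
over_10_percent = unstable_share(pairs, 0.10)
over_100_percent = unstable_share(pairs, 1.00)
```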

31  Clima=  Computational logic in multi-agent systems  Centre for Legumes in Mediterranean Agriculture (5-char limit too short)

32 Ratios, Google:DeWaC  Columns: WORD, MAX, MIN, RAW, CLEAN  Words: besuchte, stirn, gerufen, verringert, bislang, brach  MAX/MIN: max/min of 6 Google values (millions)  RAW: DeWaC document frequency before filters, dedupe  CLEAN: DeWaC document frequency after filters, dedupe

33 ItWaC:Google ratio, best estimate  For each of 30 words Calculate ratio, max:raw Calculate ratio, min:raw  Take mid-point and average: 1:33 or 3% Calculate raw:vert  Average = 4.4  half (for conservativeness/uncertainty) = 2.2 3% x 2.2 = 6.6%  ItWaC:Google = 6.6%
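
The estimate on this slide can be followed step by step. The sketch below first shows the per-word step (the mid-point of max:raw and min:raw, averaged over words) on invented counts, then reproduces the final arithmetic from the slide's own figures. The function name and the sample rows are assumptions; only 1:33, 4.4, 2.2 and 6.6% come from the slide.

```python
from statistics import mean


def google_to_corpus_ratio(rows):
    """rows: (google_max, google_min, raw_doc_freq) per word.

    Returns the average over words of the mid-point of max:raw and min:raw.
    """
    return mean(((gmax / raw) + (gmin / raw)) / 2 for gmax, gmin, raw in rows)


# Invented counts for two words, chosen so the result matches the slide's 33
rows = [(40.0, 20.0, 1.0), (36.0, 36.0, 1.0)]
ratio = google_to_corpus_ratio(rows)

# Final arithmetic using the slide's figures
corpus_share_raw = round(1 / 33, 2)       # "1:33 or 3%"
correction = 4.4 / 2                      # average raw:vert, halved to be conservative
share = corpus_share_raw * correction     # 0.03 * 2.2
```

The result, 0.066, is the 6.6% given on the slide for ItWaC as a share of what Google indexes.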

34 Italian web size  ItWaC = 1.67b words  Google indexes 1.67/.066 = 25 bn words sentential non-dupe Italian

35 German web size  Analysis as for Italian  DeWaC: 3% Google  DeWaC = 1.41b words  Google indexes 1.41/.03 = 44 bn words sentential non-dupe German
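
The web-size figures on slides 34 and 35 come from dividing the corpus size by the corpus's estimated share of Google's index. A short check of that arithmetic follows; the function name is illustrative. With the rounded 3% the German division gives about 47 bn rather than 44 bn. A share of roughly 3.2% would yield 44 bn, so the slide's figure may rest on an unrounded ratio, though the slides do not say so.

```python
def indexed_words_bn(corpus_words_bn, corpus_share):
    """Estimated words indexed (in billions), given corpus size and its share."""
    return corpus_words_bn / corpus_share


italian = indexed_words_bn(1.67, 0.066)   # slide 34: about 25 bn
german = indexed_words_bn(1.41, 0.03)     # rounded 3% gives about 47 bn
implied_share = 1.41 / 44                 # share that would give the slide's 44 bn
```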

36 Effort  ItWaC, DeWaC Less than 6 person months Developing the method  (EnWaC: in progress)

37 Plan ACL adopts it (like ACL Anthology) (LDC?) Say: 3 core staff, 3 years Goals could be:  English: 2% G-scale (still biggest part)  6 other major languages: 30% G-scale  30 other languages: 10% G-scale Online for  Searching as in SkE  Specifying, downloading subcorpora for intensive NLP “corpora on demand” Don’t quote me

38 Logjams  Cleaning See CLEANEVAL  Text type “what kind of page is it?” Critical but under-researched WebDoc proposal  (with Serge Sharoff, Tony Hartley) (a different talk)

39 Moral  Google, CSEs are wonderful Start today but bad science  Not Good science, reliable counts We (the NLP community) have the skills With collective effort, mid-sized project Google-scale is achievable

40 Thank you 

41 Scale and speed, LSE  Commercial search engines banks of computers highly optimised code but this is for performance no downtime instant responses to millions of queries  This proposal crawling: once a year downtime: acceptable not so many users

42 …but it’s not representative  The web is not representative  but nor is anything else  Text type variation under-researched, lacking in theory  Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001  Text type is an issue across NLP Web: issue is acute because, as against BNC or WSJ, we simply don't know what is there

43 Oxford English Corpus  Method as above  Whole domains chosen and harvested control over text type  1 billion words  Public launch April 2006  Loaded into Sketch Engine

44 Oxford English Corpus

45 Oxford English Corpus

46 Examples  DeWaC, ItWaC Baroni and Kilgarriff, EACL 2006  Serge Sharoff, Leeds Univ UK English Chinese Russian English French Spanish, all searchable online  Oxford English corpus

47 Options for academics  Give up Niche markets, obscure languages Leave the mainstream to the big guys  Work out how to work on that scale Web is free, data availability not a problem