Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds.

Similar presentations


Presentation on theme: "1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds."— Presentation transcript:

1 1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds

2 2 Web as language resource  Replaceable or replacable?  check check

3 3  Very very large  Most languages  Most language types  Up-to-date  Free  Instant access

4 4 How to use the web?  Google or other commercial search engines (CSEs)  not

5 5 Using CSEs No setup costs Start querying today Methods  Hit counts  ‘snippets’ Metasearch engines, WebCorp  Find pages and download

6 6 Googleology  CSE hit counts for language modelling 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista (Keller & Lapata 2003) finding noun-noun relations “we issue exact phrase Google queries of type noun2 THAT * noun1” Nakov and Hearst 2006  Small community of researchers Corpora mailing list  Very interesting work  Intense interest in query syntax Creativity and person-years

7 7 The Trouble with Google  not enough instances max 1000  not enough queries max 1000 per day with API  not enough context 10-word snippet around search term  ridiculous sort order search term in titles and headings  untrustworthy hit counts  limited search syntax No regular expressions  linguistically dumb lemmatised  aime/aimer/aimes/aimons/aimez/aiment … not POS-tagged not parsed not

8 8  Appeal Zero-cost entry, just start googling  Reality High-quality work: high-cost methodology

9 9 Also:  No replicability  Methods, stats not published  At mercy of commercial corporation

10 10 Also:  No replicability  Methods, stats not published  At mercy of commercial corporation  Bad science

11 11 The 5-grams  A present from Google  All 1-, 2-, 3-, 4-, 5-grams with fr>=40 in a terabyte of English  A large dataset

12 12 Prognosis  Next 3 years Exciting new ideas Dazzlingly clever uses Drives progress in NLP

13 13 Prognosis  Next 3 years Exciting new ideas Dazzlingly clever uses  After 5+ years A chain round our necks  Cf Penn Treebank (others? Brickbats?)  Resource-led vs. ideas-led research

14 14 How to use the web?  Google or other commercial search engines (CSEs)  not

15 15 Language and the web  Web is mostly linguistic  Text on web << whole web (in GB) Not many TB of text Special hardware not needed  We are the experts

16 16 Community-building  ACL SIGWAC  WAC Kool Ynitiative (WaCKY) Mailing list Open source  WAC workshops WAC1, Birmingham 2005 WAC2, Trento (EACL), April 2006 WAC3, Louvain, Sept 15-16 2007

17 17 Proof of concept: DeWaC, ItWaC  1.5 B words each, German and Italian  Marco Baroni, Bologna (+ AK)

18 18 What is out there?  What text types? some are new: chatroom proportions  is it overwhelmed by porn? How much?  Hard question

19 19 What is out there  The web a social, cultural, political phenomenon new, little understood a legitimate object of science mostly language  we are well placed a lot of people will be interested  Let ’ s study the web source of language data apply our tools for web use (dictionaries, MT) use the web as infrastructure

20 20 How to do it: Components 1.web crawler 2.filters and classifiers  de-duplication 3.linguistic processing Lemmatise, pos-tag, parse 4.Database Indexing user interface

21 21 1.Crawling  How big is your hard disk?  When will your sysadmin ban you? DeWaC/ItWaC  Open source crawler: heritrix

22 22 1.1 Seeding the crawl  Mid-frequency words  Spread of text types Formal and informal, not just newspaper DeWaC  Words from newspaper corpus  Words from list with “kitchen” vocab  Use Google to get seeds for crawls

23 23 2. Filtering  non ‘running-text’ stripping  Function word filtering  Porn filtering  De-duplication

24 24 2.1 Filtering: Sentences  What is the text that we want? Lists? Links? Catalogues? …  For linguistics, NLP in sentences  Use function words

25 25 2.2 Filtering: CLEANEVAL  “Text cleaning” Lots to be done, not glamorous Many kinds of dirt needing many kinds of filter  Open Competition/shared task Who can produce the cleanest text?! Input: arbitrary web pages “gold standard”  paragraph-marked plain text  Prepared by people  Workshop Sept 2007. do join us!  http://cleaneval.sigwac.org.uk http://cleaneval.sigwac.org.uk

26 26 3. Linguistic processing  Lemmatise, POS-tag, parse Find leading NLP group for each language Be nice to them Use their tools

27 27 Database, interface  Solved problem (at least for 1.5 BW)  Sketch Engine Sketch Engine

28 28 “Despite all the disadvantages, it’s still so much bigger”

29 29 How much bigger?  Method Sample words  30  Mid-to-high freq  Not common words in other major lgs  Min 5 chars Compare freqs, Google vs ItWaC/DeWaC

30 30 Google results (Italian)  Arbitrariness Repeat identical searches 9/30: > 10% difference 6/30: > 100% difference  API: typically 1/18 th ‘manual’ figure  Language filter mista bomba clima  mostly non-Italian pages  use MAX and MIN of 6 lg-filtered results

31 31  Clima=  Computational logic in multi-agent systems  Centre for Legumes in Mediterranean Agriculture (5-char limit too short)

32 32 Ratios, Google:DeWaC WORDMAX MIN RAW CLEAN -------------------------------------------------------------- besuchte10.53.8 8184018228 stirn3.380.62 3232011137 gerufen7.143.72 6672027187 verringert6.863.46 5216015987 bislang24.411.623900090098 brach4.362.26 4452019824 -------------------------------------------------------------- MAX/MIN: max/min of 6 Google values (millions) RAW: DeWaC document frequency before filters, dedupe CLEAN: DeWaC document frequency after filters, dedupe

33 33 ItWaC:Google ratio, best estimate  For each of 30 words Calculate ratio, max:raw Calculate ratio, min:raw  Take mid-point and average: 1:33 or 3% Calculate raw:vert  Average = 4.4  half (for conservativeness/uncertainty) = 2.2 3% x 2.2 = 6.6%  ItWaC:Google = 6.6%

34 34 Italian web size  ItWaC = 1.67b words  Google indexes 1.67/.066 = 25 bn words sentential non-dupe Italian

35 35 German web size  Analysis as for Italian  DeWaC: 3% Google  DeWaC = 1.41b words  Google indexes 1.41/.03 = 44 bn words sentential non-dupe German

36 36 Effort  ItWac, DeWac Less than 6 person months Developing the method  (EnWaC: in progress)

37 37 Plan ACL adopts it (like ACL Anthology) (LDC?) Say: 3 core staff, 3 years Goals could be:  English: 2% G-scale (still biggest part)  6 other major languages: 30% G-scale  30 other languages: 10% G-scale Online for  Searching as in SkE  Specifying, downloading subcorpora for intensive NLP “corpora on demand” Don’t quote me

38 38 Logjams  Cleaning See CLEANEVAL  Text type “what kind of page is it?” Critical but under-researched WebDoc proposal  (with Serge Sharoff, Tony Hartley) (a different talk)

39 39 Moral  Google, CSEs are wonderful Start today but bad science  Not Good science, reliable counts We (the NLP community) have the skills With collective effort, mid-sized project Google-scale is achievable

40 40 Thank you  http://www.sketchengine.co.uk http://www.sketchengine.co.uk

41 41 Scale and speed, LSE  Commercial search engines banks of computers highly optimised code but this is for performance no downtime instant responses to millions of queries  This proposal crawling: once a year downtime: acceptable not so many users

42 42 …but it’s not representative  The web is not representative  but nor is anything else  Text type variation under-researched, lacking in theory  Atkins Clear Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001  Text type is an issue across NLP Web: issue is acute because, as against BNC or WSJ, we simply don ’ t know what is there

43 43 Oxford English Corpus  Method as above  Whole domains chosen and harvested control over text type  1 billion words  Public launch April 2006  Loaded into Sketch Engine

44 44 Oxford English Corpus

45 45 Oxford English Corpus

46 46 Examples  DeWaC, ItWaC Baroni and Kilgarriff, EACL 2006  Serge Sharoff, Leeds Univ UK English Chinese Russian English French Spanish, all searchable online  Oxford English corpus

47 47 Options for academics  Give up Niche markets, obscure languages Leave the mainstream to the big guys  Work out how to work on that scale Web is free, data availability not a problem


Download ppt "1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds."

Similar presentations


Ads by Google