Published by Lucy Johnston. Modified over 9 years ago.
1
1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds
2
2 Web as language resource Replaceable or replacable? check check
3
3 Very very large Most languages Most language types Up-to-date Free Instant access
4
4 How to use the web? Google or other commercial search engines (CSEs), or not
5
5 Using CSEs No setup costs Start querying today Methods Hit counts ‘snippets’ Metasearch engines, WebCorp Find pages and download
6
6 Googleology CSE hit counts for language modelling: 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista (Keller & Lapata 2003). Finding noun-noun relations: “we issue exact phrase Google queries of type noun2 THAT * noun1” (Nakov and Hearst 2006). Small community of researchers; Corpora mailing list. Very interesting work. Intense interest in query syntax. Creativity and person-years.
7
7 The Trouble with Google Not enough instances: max 1000. Not enough queries: max 1000 per day with API. Not enough context: 10-word snippet around search term. Ridiculous sort order: search term in titles and headings. Untrustworthy hit counts. Limited search syntax: no regular expressions. Linguistically dumb: not lemmatised (aime/aimer/aimes/aimons/aimez/aiment …), not POS-tagged, not parsed.
8
8 Appeal Zero-cost entry, just start googling Reality High-quality work: high-cost methodology
9
9 Also: No replicability Methods, stats not published At mercy of commercial corporation
10
10 Also: No replicability Methods, stats not published At mercy of commercial corporation Bad science
11
11 The 5-grams A present from Google All 1-, 2-, 3-, 4-, 5-grams with fr>=40 in a terabyte of English A large dataset
12
12 Prognosis Next 3 years Exciting new ideas Dazzlingly clever uses Drives progress in NLP
13
13 Prognosis Next 3 years Exciting new ideas Dazzlingly clever uses After 5+ years A chain round our necks Cf Penn Treebank (others? Brickbats?) Resource-led vs. ideas-led research
14
14 How to use the web? Google or other commercial search engines (CSEs), or not
15
15 Language and the web Web is mostly linguistic Text on web << whole web (in GB) Not many TB of text Special hardware not needed We are the experts
16
16 Community-building ACL SIGWAC WAC Kool Ynitiative (WaCKY) Mailing list Open source WAC workshops WAC1, Birmingham 2005 WAC2, Trento (EACL), April 2006 WAC3, Louvain, Sept 15-16 2007
17
17 Proof of concept: DeWaC, ItWaC 1.5 B words each, German and Italian Marco Baroni, Bologna (+ AK)
18
18 What is out there? What text types? some are new: chatroom proportions is it overwhelmed by porn? How much? Hard question
19
19 What is out there The web: a social, cultural, political phenomenon; new, little understood; a legitimate object of science; mostly language; we are well placed; a lot of people will be interested. Let’s study the web: source of language data; apply our tools for web use (dictionaries, MT); use the web as infrastructure
20
20 How to do it: Components 1. web crawler 2. filters and classifiers, de-duplication 3. linguistic processing: lemmatise, POS-tag, parse 4. database: indexing, user interface
21
21 1. Crawling How big is your hard disk? When will your sysadmin ban you? DeWaC/ItWaC: open source crawler, Heritrix
22
22 1.1 Seeding the crawl Mid-frequency words Spread of text types Formal and informal, not just newspaper DeWaC Words from newspaper corpus Words from list with “kitchen” vocab Use Google to get seeds for crawls
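A minimal sketch of how seed queries might be assembled from a frequency list before they are sent to a search engine. The rank band, the use of word pairs, the toy frequency list, and the function names are all illustrative assumptions, not the DeWaC settings; the search-engine call itself is left out.

```python
import random
from typing import Dict, List, Tuple


def mid_frequency_words(freqs: Dict[str, int], lo_rank: int, hi_rank: int) -> List[str]:
    """Return words whose frequency rank falls in [lo_rank, hi_rank)."""
    # Rank by descending frequency; ties are broken alphabetically.
    ranked = sorted(freqs, key=lambda w: (-freqs[w], w))
    return ranked[lo_rank:hi_rank]


def seed_queries(words: List[str], n_queries: int, size: int = 2,
                 seed: int = 0) -> List[Tuple[str, ...]]:
    """Draw up to n_queries distinct random word tuples to use as queries."""
    if size > len(words):
        raise ValueError("need at least `size` words")
    rng = random.Random(seed)
    queries = set()
    attempts = 0
    # Cap the attempts so the loop ends even if few combinations exist.
    while len(queries) < n_queries and attempts < 100 * n_queries:
        queries.add(tuple(sorted(rng.sample(words, size))))
        attempts += 1
    return sorted(queries)


# Hypothetical toy frequency list (not real corpus counts).
toy_freqs = {"der": 1000, "und": 900, "gericht": 60, "pfanne": 50,
             "vertrag": 45, "topf": 40, "zwiebel": 30, "xylophon": 1}
mid = mid_frequency_words(toy_freqs, lo_rank=2, hi_rank=7)
queries = seed_queries(mid, n_queries=4)
```

Mixing a newspaper-derived list with a "kitchen" vocabulary list, as on the slide, would simply mean calling `seed_queries` once per list and pooling the results.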
23
23 2. Filtering non ‘running-text’ stripping Function word filtering Porn filtering De-duplication
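The slides do not say how de-duplication was done. One common approach, sketched here purely as an assumption, compares documents by the overlap of their word n-grams ("shingles") and drops any document that is too similar to one already kept. The shingle size and threshold are illustrative values.

```python
from typing import List, Set, Tuple


def shingles(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: List[str], threshold: float = 0.5, n: int = 5) -> List[str]:
    """Keep a document only if it is dissimilar to every document kept so far."""
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river"
doc_c = "completely different text about corpus linguistics and web crawling methods today"
unique_docs = deduplicate([doc_a, doc_b, doc_c], threshold=0.5, n=3)
```

The pairwise comparison is quadratic in the number of documents, so a crawl of any size would need a hashing scheme on top of this; the sketch only shows the similarity test itself.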
24
24 2.1 Filtering: Sentences What is the text that we want? Lists? Links? Catalogues? … For linguistics, NLP in sentences Use function words
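A rough sketch of the "use function words" idea: running text contains a high proportion of function words, whereas lists, link bars and catalogues contain few. The word list, the thresholds, and the function names below are illustrative assumptions for English, not the values used for DeWaC or ItWaC.

```python
import re

# A small, hypothetical English function-word list; a real filter would use
# a fuller list for the language being crawled.
FUNCTION_WORDS = frozenset({
    "the", "of", "and", "to", "a", "in", "is", "that", "it",
    "for", "with", "as", "on", "was", "be",
})


def function_word_ratio(text: str) -> float:
    """Proportion of tokens that are function words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)


def looks_like_running_text(text: str, min_ratio: float = 0.25,
                            min_tokens: int = 10) -> bool:
    """Accept a page only if it is long enough and rich in function words."""
    n_tokens = len(re.findall(r"[a-z']+", text.lower()))
    return n_tokens >= min_tokens and function_word_ratio(text) >= min_ratio


prose = ("The web is a large source of text, and it is easy to collect "
         "with a crawler.")
menu = ("Home Products Contact Login Register Cart Checkout Shipping "
        "Returns Help Sitemap Privacy")
prose_ok = looks_like_running_text(prose)
menu_ok = looks_like_running_text(menu)
```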
25
25 2.2 Filtering: CLEANEVAL “Text cleaning” Lots to be done, not glamorous Many kinds of dirt needing many kinds of filter Open Competition/shared task Who can produce the cleanest text?! Input: arbitrary web pages “gold standard” paragraph-marked plain text Prepared by people Workshop Sept 2007. Do join us! http://cleaneval.sigwac.org.uk
26
26 3. Linguistic processing Lemmatise, POS-tag, parse Find leading NLP group for each language Be nice to them Use their tools
27
27 Database, interface Solved problem (at least for 1.5 billion words) Sketch Engine
28
28 “Despite all the disadvantages, it’s still so much bigger”
29
29 How much bigger? Method: sample 30 words (mid-to-high frequency; not common words in other major languages; min 5 chars). Compare freqs, Google vs ItWaC/DeWaC
30
30 Google results (Italian) Arbitrariness: repeat identical searches. 9/30: > 10% difference. 6/30: > 100% difference. API: typically 1/18th of the ‘manual’ figure. Language filter: mista, bomba, clima give mostly non-Italian pages; use MAX and MIN of 6 language-filtered results
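A small sketch of how the arbitrariness check could be tallied: for each word, compare the hit counts returned by repeated identical searches and count how many words exceed each threshold. The slide does not define "difference"; measuring it relative to the smaller count is an assumption, and the counts below are invented placeholders, not the figures from the talk.

```python
from typing import Dict, List, Sequence


def relative_spread(counts: Sequence[int]) -> float:
    """Difference between largest and smallest count, relative to the smallest."""
    lo, hi = min(counts), max(counts)
    if lo == 0:
        return float("inf") if hi > 0 else 0.0
    return (hi - lo) / lo


def tally(repeated: Dict[str, List[int]],
          thresholds: Sequence[float] = (0.10, 1.00)) -> Dict[float, int]:
    """For each threshold, count the words whose spread exceeds it."""
    return {
        t: sum(relative_spread(c) > t for c in repeated.values())
        for t in thresholds
    }


# Hypothetical repeated hit counts for three placeholder words.
repeated_counts = {
    "word_a": [1000, 1050],   # 5% apart
    "word_b": [1000, 1200],   # 20% apart
    "word_c": [1000, 2500],   # 150% apart
}
summary = tally(repeated_counts)
```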
31
31 Clima = Computational Logic in Multi-Agent Systems; Centre for Legumes in Mediterranean Agriculture (5-char limit too short)
32
32 Ratios, Google:DeWaC
WORD          MAX     MIN      RAW    CLEAN
-------------------------------------------
besuchte     10.5     3.8    81840    18228
stirn        3.38    0.62    32320    11137
gerufen      7.14    3.72    66720    27187
verringert   6.86    3.46    52160    15987
bislang      24.4    11.6   239000    90098
brach        4.36    2.26    44520    19824
-------------------------------------------
MAX/MIN: max/min of 6 Google values (millions)
RAW: DeWaC document frequency before filters, dedupe
CLEAN: DeWaC document frequency after filters, dedupe
33
33 ItWaC:Google ratio, best estimate For each of 30 words: calculate ratio, max:raw; calculate ratio, min:raw; take mid-point and average: 1:33 or 3%. Calculate raw:vert: average = 4.4; halve (for conservativeness/uncertainty) = 2.2. 3% x 2.2 = 6.6%. ItWaC:Google = 6.6%
34
34 Italian web size ItWaC = 1.67b words Google indexes 1.67/.066 = 25 bn words sentential non-dupe Italian
35
35 German web size Analysis as for Italian DeWaC: 3% Google DeWaC = 1.41b words Google indexes 1.41/.03 = 44 bn words sentential non-dupe German
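The arithmetic on the three slides above can be replayed in a few lines. This is only a sketch using the rounded figures printed on the slides; the function and variable names are made up for illustration. With the rounded 3%, the German figure comes out nearer 47 than 44 bn, so the slide presumably used an unrounded ratio; that reading is an inference, not something the slides state.

```python
def scaled_ratio(raw_ratio: float, raw_to_clean: float, damping: float = 0.5) -> float:
    """Corpus:Google ratio after scaling by a halved raw:clean factor."""
    return raw_ratio * raw_to_clean * damping


def indexed_words_bn(corpus_bn: float, corpus_to_google: float) -> float:
    """Estimated words (billions) indexed by Google, given corpus size and ratio."""
    return corpus_bn / corpus_to_google


# Italian: 3% raw ratio, raw:vert average 4.4, halved for caution.
italian_ratio = scaled_ratio(0.03, 4.4)                    # about 0.066
italian_web_bn = indexed_words_bn(1.67, italian_ratio)     # about 25.3

# German: the slide gives the final ratio directly as 3%.
german_web_bn = indexed_words_bn(1.41, 0.03)               # about 47.0 with rounded inputs
```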
36
36 Effort ItWaC, DeWaC: less than 6 person months, developing the method (EnWaC: in progress)
37
37 Plan ACL adopts it (like ACL Anthology) (LDC?) Say: 3 core staff, 3 years Goals could be: English: 2% G-scale (still biggest part) 6 other major languages: 30% G-scale 30 other languages: 10% G-scale Online for Searching as in SkE Specifying, downloading subcorpora for intensive NLP “corpora on demand” Don’t quote me
38
38 Logjams Cleaning See CLEANEVAL Text type “what kind of page is it?” Critical but under-researched WebDoc proposal (with Serge Sharoff, Tony Hartley) (a different talk)
39
39 Moral Google, CSEs are wonderful Start today but bad science Not Good science, reliable counts We (the NLP community) have the skills With collective effort, mid-sized project Google-scale is achievable
40
40 Thank you http://www.sketchengine.co.uk
41
41 Scale and speed, LSE Commercial search engines banks of computers highly optimised code but this is for performance no downtime instant responses to millions of queries This proposal crawling: once a year downtime: acceptable not so many users
42
42 …but it’s not representative The web is not representative, but nor is anything else. Text type variation: under-researched, lacking in theory (Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001). Text type is an issue across NLP. Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there
43
43 Oxford English Corpus Method as above Whole domains chosen and harvested control over text type 1 billion words Public launch April 2006 Loaded into Sketch Engine
44
44 Oxford English Corpus
45
45 Oxford English Corpus
46
46 Examples DeWaC, ItWaC Baroni and Kilgarriff, EACL 2006 Serge Sharoff, Leeds Univ UK English Chinese Russian English French Spanish, all searchable online Oxford English corpus
47
47 Options for academics Give up Niche markets, obscure languages Leave the mainstream to the big guys Work out how to work on that scale Web is free, data availability not a problem