Published by Lorraine Griffin. Modified over 8 years ago.
1
CL 2005, Birmingham
Web as Corpus Workshop
Intro: Adam Kilgarriff

Web as Corpus Workshop
Co-chairs: Marco Baroni, Adam Kilgarriff, Sebastian Hoffman
2
"When you have tons of data and tons of computation you can make things work that don’t work on smaller systems"
- Google's VP-engineering, Urs Hölzle
3
History within CL
1989: corpora arrive on scene
1989-1993: “too dirty”: battles
1993: CL Special Issue: consummation
…
1999: web arrives on scene
1999-2003: “too dirty”
2003: CL Special Issue
4
History within CL
1989: corpora arrive on scene
1989-1993: “too dirty”: battles
1993: CL Special Issue: consummation
1993: WVLC workshop series starts
…
1999: web arrives on scene
1999-2003: “too dirty”
2003: CL Special Issue
2005: WAC workshop series starts
5
History
[Chart: corpus size (in words, axis marked 10^6 to 10^9) by decade, 1960s to 2010. Labelled points: Brown/LOB, COBUILD, BNC, Gigaword, ?]
6
Approaches
Use Google hit counts
Use snippets
Use Google, then download pages
Spider from relevant starting sites
(Marco Baroni’s analysis)
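The last approach, spidering from relevant starting sites, can be sketched as a breadth-first traversal with a depth limit. This is only a minimal illustration, not the method any of the workshop speakers used: the link graph `LINKS`, the page names, and the `spider` function are all hypothetical, and the graph is held in memory so that nothing is fetched from the network.

```python
from collections import deque

# Hypothetical link graph (page -> pages it links to); stands in for real fetching.
LINKS = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": ["d"],
    "d": [],
}


def spider(seeds, links, max_depth):
    """Breadth-first crawl from the seed pages, following links up to max_depth hops."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep their order
    seen = set(seeds)
    order = []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


if __name__ == "__main__":
    print(spider(["seed"], LINKS, max_depth=2))
```

The `seen` set keeps the crawl from revisiting pages that link back to each other, and the depth limit is one simple way to stay close to the starting sites.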
7
The Trouble with Google
not enough instances (max 1000)
not enough context
– ca 10-word snippet around search term
ridiculous sort order
– search term in titles and headings
linguistically dumb
– not lemmatised
   think/thinks/thinking/thought: four searches
– not POS-tagged
   mixes up beat (n) and beat (v)
– and why not parsed
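The "not lemmatised" point can be made concrete with a small sketch: matching on exact word forms needs one lookup per form, whereas a lemma-aware count covers all forms at once. The toy text and the hand-written `LEMMA_OF` table below are assumptions for illustration only, not real search results or a real lemmatiser.

```python
import re
from collections import Counter

# Hypothetical toy text standing in for retrieved web pages.
TOY_TEXT = (
    "I think she thinks too much. He thought about it while thinking aloud. "
    "They beat the drum to the beat."
)

# Hand-written form-to-lemma table (illustrative assumption, not a real lemmatiser).
LEMMA_OF = {
    "think": "think",
    "thinks": "think",
    "thinking": "think",
    "thought": "think",
}


def form_counts(text):
    """Count exact lower-cased word forms, as a form-based search would match them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def lemma_count(text, lemma):
    """Sum the counts of every word form that the table maps to the given lemma."""
    counts = form_counts(text)
    return sum(n for form, n in counts.items() if LEMMA_OF.get(form) == lemma)


if __name__ == "__main__":
    print(form_counts(TOY_TEXT)["think"])   # exact form only
    print(lemma_count(TOY_TEXT, "think"))   # all four forms together
    print(form_counts(TOY_TEXT)["beat"])    # noun and verb uses are not distinguished
```

The count for "beat" also shows the POS problem: without tagging, the verb and the noun fall into the same bucket.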
8
DIY
do it ourselves
– this community
Wacky
9
Components
1. web crawler
2. filters/classifiers
   - language id, non-text, boilerplate, genre
3. linguistic processor (optional)
4. database/indexing
5. statistical summariser (optional)
6. user interface
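The components listed above can be chained into a toy pipeline. This is a rough sketch under heavy assumptions: the pages are a hypothetical in-memory dictionary rather than crawled data, boilerplate removal is a crude line-length heuristic, language id is a stopword ratio, and the optional linguistic processor (component 3) is left out. All function and variable names are invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical in-memory "web" (URL -> raw text); a real crawler would fetch these.
PAGES = {
    "http://example.org/a": "Home | About | Contact\nThe web is a very large corpus of the language.\n",
    "http://example.org/b": "Login\nCorpus linguistics needs a lot of data and the web has it.\n",
}

ENGLISH_STOPWORDS = {"the", "a", "of", "and", "is", "it", "has"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def crawl(pages):
    """1. Web crawler (stub): yield (url, raw_text) pairs."""
    yield from pages.items()


def strip_boilerplate(raw, min_tokens=5):
    """2a. Filter: keep only lines with at least min_tokens tokens (crude heuristic)."""
    return "\n".join(line for line in raw.splitlines() if len(tokenize(line)) >= min_tokens)


def looks_english(text, threshold=0.2):
    """2b. Classifier: crude language id based on the share of English stopwords."""
    tokens = tokenize(text)
    if not tokens:
        return False
    return sum(t in ENGLISH_STOPWORDS for t in tokens) / len(tokens) >= threshold


def build_corpus(pages):
    """Run crawl and filters; return url -> cleaned text."""
    docs = {}
    for url, raw in crawl(pages):
        text = strip_boilerplate(raw)
        if looks_english(text):
            docs[url] = text
    return docs


def build_index(docs):
    """4. Indexing: inverted index mapping each token to the URLs containing it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def frequencies(docs):
    """5. Statistical summariser: corpus-wide token frequencies."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts


def search(index, word):
    """6. User interface (minimal): sorted URLs of documents containing the word."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    docs = build_corpus(PAGES)
    index = build_index(docs)
    print(search(index, "corpus"))
    print(frequencies(docs).most_common(3))
```

Each stage only depends on the output of the one before it, so any of them could be swapped for a more serious implementation without touching the rest.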
10
Programme
9.30 Welcome, goals (Adam Kilgarriff)
10.00 Crawling (Marco Baroni)
10.30 coffee
11.00 Creating specialized and general corpora using automated search engine querying (Marco Baroni and Serge Sharoff)
12.00 Small groups: what we have all been doing
1.00 lunch
2.30 Processing web-derived text (Sebastian Hoffman)
3.15 Indexing and interfaces (Stefan Evert and Adam Kilgarriff)
4.00 coffee
4.30 Representing genre-specific websites (Alexander Mehler and Rüdiger Gleim)
5.00 Small groups: “what are critical next steps for WaC activity?”
5.30 Plenary: where next?
6.10 end
11
Small groups (proposal)
Around topics:
wac for theoretical linguistics
wac for applied linguistics
– language teaching, translation, terminology
wac for nlp
wac for lexicography
wac for ontology engineering
Around problems:
large crawls
text processing, boilerplate removal, etc.
indexing and interfaces