Web Corpora. Adam Kilgarriff. Kivik 2013.

Presentation transcript:

Web Corpora
Adam Kilgarriff

You can’t help noticing
Replaceable or replacable?

Very very large
  –2006 estimates for the duplicate-free, linguistic, Google-indexed web
    German: 44 billion words
    Italian: 25 billion words
    English: 1,000 billion to 10,000 billion words
Most languages
Most language types
Up-to-date
Free
Instant access

Overview
Is the web a corpus?
Representativeness
What is out there?
  –Web1T
Googleology
Web corpus types
  –Targeted sites: Oxford English Corpus
  –General: WaC family
  –WebBootCaT

Is the web a corpus?
Sinclair
  –in “Developing linguistic corpora, a guide to good practice. Corpus and Text – Basic Principles”
    “…not a corpus because dimensions unknown, constantly changing, not designed from a linguistic perspective”
But
  –We can find out dimensions
  –Many corpora are not designed
    “as much chatroom dialogue as I can get”
Def: a corpus is a collection of texts
  –when viewed as an object of language research

Is the web a corpus?
Yes

but it’s not representative

Theory
A random sample of a population is representative of it.
Observations on sample support inferences about population (within confidence bounds)

Theory
A random sample of a population is …
What is the population?
  –production and reception
  –speech and text
  –copying

Theory
Population not defined
Representative sample not possible

sublanguage
Language = core + sublanguages
Options for corpus construction
  –none
  –some
  –all
None
  –impoverished view of language
Some: BNC
  –cake recipes and gastro-uterine disease
  –not car repair manuals or astronomy or …
All: until recently, not viable

Representativeness
The web is not representative
but nor is anything else
Text type variation
  –under-researched, lacking in theory
    Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988; Kilgarriff 2001
Text type is an issue across NLP
  –Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there

What is out there?
What text types are there on the web?
  –some are new: chatroom
  –proportions
    is it overwhelmed by porn? How much?
Hard question

The web
  –a social, cultural, political phenomenon
  –new, little understood
  –a legitimate object of science
  –mostly language
    we are well placed
  –a lot of people will be interested
Let’s
  –study the web
  –source of language data
  –apply our tools for web use (dictionaries, MT)
  –use the web as infrastructure

Using Search Engines
No setup costs
Start querying today
Methods
Hit counts
‘snippets’
  –Metasearch engines, WebCorp
Find pages and download

Googleology
Google hit counts for language modelling
  –Example: (Keller & Lapata 2003)
  –36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista
Very interesting work
Great interest in query syntax

The Trouble with Google
not enough instances
  –max 1000
not enough queries
  –max 1000 per day with API
not enough context
  –10-word snippet around search term
sort order
  –search term in titles and headings
untrustworthy hit counts
limited search options
linguistically dumb, e.g. not lemmatised
  aime/aimer/aimes/aimons/aimez/aiment …

Appeal
  –Zero-cost entry, just start googling
Reality
  –High-quality work: high-cost methodology

Also:
No replicability
Methods, stats not published
At mercy of commercial corporation
Googleology is bad science

Better: web-sourced corpora
Gather pages
  –Google hits
  –Select and gather whole sites
  –General crawl
Filter
De-duplicate
Linguistic processing
Load into corpus tool

Oxford English Corpus
Whole domains chosen and harvested
  –control over text type
2.3 billion words

Oxford English Corpus

WaC family
1.5 B words each
Baroni and colleagues
Seeds:
  –mid-frequency words from ‘core vocab’ lists and corpora
Google on seed words, then crawl

TenTen Family
Processing chain
  –Spiderling, a linguistic crawler
    A billion words a day
  –jusText for “cleaning”: removing non-text
  –Onion: remove duplicates (paragraph level)
All major world languages
2-20 billion words
Lexical Computing
All available in Sketch Engine
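As a rough illustration of what paragraph-level de-duplication means, the sketch below drops any paragraph whose whitespace- and case-normalised text has already been seen anywhere in the collection. This is a toy under stated assumptions: the slide does not describe how Onion works internally, exact matching like this misses near-duplicates, and the function name `dedupe_paragraphs` and the blank-line paragraph convention are hypothetical.

```python
import hashlib


def dedupe_paragraphs(documents):
    """Return the documents with repeated paragraphs removed.

    A paragraph counts as a repeat if its normalised text (lowercased,
    whitespace collapsed) has already appeared in this or an earlier document.
    Paragraphs are assumed to be separated by blank lines.
    """
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            normalised = " ".join(para.split()).lower()
            if not normalised:
                continue  # skip empty paragraphs
            digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact repeat, e.g. a site-wide notice
            seen.add(digest)
            kept.append(para.strip())
        cleaned.append("\n\n".join(kept))
    return cleaned


docs = [
    "Cookie notice.\n\nUnique text A.",
    "cookie  notice.\n\nUnique text B.",
]
print(dedupe_paragraphs(docs))
```

Hashing the normalised text keeps memory use modest on large collections, since only the digests are stored rather than the paragraphs themselves.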

Small, specialised corpora
Terminologists
Translators needing target-language domain-specific vocab
Specialist dictionaries
  –Don’t exist
  –Expensive/inaccessible
  –Out of date

BootCaT (Bootstrapping Corpora and Terms)
Put in seed terms
Google/Yahoo search
Retrieve Google/Yahoo hits
  –Remove duplicates, boilerplate
Small instant corpora
Baroni and Bernardini, LREC 2004
Web version
  –WebBootCaT
  –At Sketch Engine site
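The slide does not say how seed terms are turned into search queries. One simple approach, assumed here purely for illustration, is to combine the seeds into small random tuples and send each tuple as one query. The function `build_queries`, its parameters, and the example seeds are all hypothetical, and no search engine is actually called.

```python
import random


def build_queries(seed_terms, tuple_size=3, n_queries=5, rng_seed=0):
    """Combine seed terms into random tuples, one query string per tuple."""
    if tuple_size > len(seed_terms):
        raise ValueError("tuple_size cannot exceed the number of seed terms")
    rng = random.Random(rng_seed)  # fixed seed so runs are repeatable
    queries = []
    for _ in range(n_queries):
        terms = rng.sample(seed_terms, tuple_size)  # distinct terms per query
        queries.append(" ".join(terms))
    return queries


seeds = ["glacier", "moraine", "crevasse", "firn", "ablation", "icefall"]
for query in build_queries(seeds, tuple_size=3, n_queries=4):
    print(query)
```

Requiring several seeds to co-occur in each query is one way to bias the retrieved pages towards the target domain rather than towards pages that merely mention one of the terms.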

But did I make a good corpus?

Bad Science
Ben Goldacre

Bad Science
Ben Goldacre
Biases in samples
  –A quarter of the people who tested positive had just been on holiday in Mexico
  –But the research team didn’t notice

Bad linguistics
Our corpus study shows X
  –But what was in the corpus?

Bad linguistics
Our corpus study shows X
  –But what was in the corpus?
  –Moral: Get to know your corpus

How?
Read it?
Too big to read
Not designed to be read

How?
Compare it with other(s)
Keyword lists

UKWaC vs. enTenTen12

enTenTen vs. UKWaC
enTenTen keywords: accord actually amendment among bad because behavior believe bill blog ca center citizen color defense determine do dollar earth effort election even evil fact faculty favor favorite federal foreign forth guess guy he her him himself his honor human kid kill kind know labor law let liberal like man maybe me military movie my nation never nor not nothing official oh organization percent political post president pretty professor program realize recognize say shall she sin soul speak state suppose tell terrorist that thing think thou thy toward true truth unto upon violation vote voter war what while why woman yes
UKWaC keywords: accommodation achieve advice aim area assessment available band behaviour building centre charity click client club colour consultation contact council delivery detail develop development disabled enable enquiry ensure event excellent facility favourite full further garden guidance guide holiday improve information insurance join link local main manage management match mm nd offer opportunity organisation organise page partnership please pm poker pp programme project pub pupil quality range rd realise recognise road route scheme sector service shop site skill specialist st staff stage suitable telephone th top tour training transport uk undertake venue village visit visitor website welcome whilst wide workshop www

enTenTen vs. UKWaC
Core verbs
  –be determine do guess know let say shall suppose tell think
Pronouns
  –he her him his me my she
Biber: more informal

Judgements
Not all or nothing
  –Both have (lots of) AmE and BrE
  –Observing patterns
Not right or wrong
Where does ‘believe’ belong?
  –Bible or core verbs?
  –No right answer, could be both
The better you know the data, the better you understand why words are there

The maths
“this word is twice as common here as there”
Simplest approach
  –Normalise frequencies
    Per thousand, or per million
  –Take ratio
For examples
  –Assume two 1m-word corpora
    Normalisation not needed
  –fc = focus corpus
  –rc = reference corpus
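The simplest approach (normalise, then take the ratio) can be sketched as follows. The function names and the example counts are made up for illustration; only the per-million normalisation and the ratio come from the slide.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def frequency_ratio(fc_count, fc_size, rc_count, rc_size):
    """Ratio of normalised frequency in the focus corpus to the reference corpus."""
    return per_million(fc_count, fc_size) / per_million(rc_count, rc_size)


# Hypothetical word: 50 hits in a 2m-word focus corpus,
# 100 hits in an 8m-word reference corpus.
print(per_million(50, 2_000_000))                        # 25.0 per million
print(per_million(100, 8_000_000))                       # 12.5 per million
print(frequency_ratio(50, 2_000_000, 100, 8_000_000))    # 2.0
```

Here the word is "twice as common" in the focus corpus even though its raw count there is lower, which is why normalisation comes first when the corpora differ in size. Note that this version fails with a division by zero when the word is absent from the reference corpus.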

Problem 1: You can’t divide by zero
word       fc    rc  ratio
buggle     10     0  ?
stort     100     0  ?
nammikin 1000     0  ?
Standard solution: add one
word       fc    rc  ratio
buggle     11     1     11
stort     101     1    101
nammikin 1001     1   1001
Problem solved

Problem 2: High ratios more common, less interesting for rarer words
word  fc  rc  ratio  interesting?
spug  10   1  …      no
grod  …    …  …      yes
ratio is not enough: frequency matters too
Also
  some researchers: grammar, grammar words
  some researchers: lexis, content words
No right answer
Slider?

Solution
Don’t just add 1, add n
[Tables for n=1 and n=100. Columns: word, fc, rc, fc+n, rc+n, ratio, rank. Rows: obscurish, middling, common.]

[Table for n=1000. Columns: word, fc, rc, fc+n, rc+n, ratio, rank. Rows: obscurish, middling, common.]

Summary
word       fc  rc  n=1  n=100  n=1000
obscurish  10   0  1st  2nd    3rd
middling   …    …  2nd  1st    2nd
common     …    …  3rd  …      1st
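A minimal sketch of the "add n" score, showing how the choice of n moves the top of the keyword list from rare words towards common ones. The three words and their per-million frequencies below are invented for illustration and are not the slide's table; `simple_maths` and `rank_keywords` are hypothetical names.

```python
def simple_maths(fc, rc, n):
    """Keyword score: (fc + n) / (rc + n), with fc and rc as per-million frequencies."""
    return (fc + n) / (rc + n)


def rank_keywords(freqs, n):
    """Return the words sorted from highest to lowest score for a given n.

    freqs maps each word to a (fc, rc) pair.
    """
    return sorted(freqs, key=lambda w: simple_maths(*freqs[w], n), reverse=True)


# Invented frequencies: (focus corpus, reference corpus), per million words.
toy = {
    "rare_word": (10, 0),
    "mid_word": (200, 100),
    "common_word": (12000, 10000),
}

for n in (1, 100, 1000):
    print(n, rank_keywords(toy, n))
```

With these toy numbers the rare word comes first at n=1, the mid-frequency word at n=100, and the common word at n=1000, so n acts as a dial between rarer and more common keywords.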

But what about
Mutual information
Log-likelihood
Chi-square
Fisher’s test
…
Don’t they use cleverer maths?

Yes but
Clever maths is for hypothesis testing
  –Can you defeat null hypothesis?
Language is not random, so … you always can
Null hypothesis never true
Hypothesis-testing not informative
Clever maths irrelevant
  –Kilgarriff 2006, CLLT
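One way to see why significance tests become uninformative on large corpora: hold a small difference in relative frequency fixed and let the corpora grow. The sketch below uses the standard Pearson chi-square statistic for a 2x2 table (no continuity correction); the rates of 105 versus 100 per million are invented, and the function name is hypothetical.

```python
def chi_square_2x2(word_a, size_a, word_b, size_b):
    """Pearson chi-square for one word's counts in two corpora.

    word_a, word_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total tokens in each corpus.
    """
    a, b = word_a, size_a - word_a  # corpus A: the word, all other tokens
    c, d = word_b, size_b - word_b  # corpus B: the word, all other tokens
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


CRITICAL_05 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05

# Same relative frequencies throughout: 105 vs 100 per million.
for millions in (1, 10, 100, 1000):
    stat = chi_square_2x2(105 * millions, 1_000_000 * millions,
                          100 * millions, 1_000_000 * millions)
    print(f"{millions:>5}m words each: chi2 = {stat:.2f}, significant: {stat > CRITICAL_05}")
```

The statistic grows in direct proportion to corpus size, so a difference that is nowhere near significant at a million words each is far past the threshold at a hundred million, although the size of the difference itself has not changed.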

Varying the parameter
BAWE
  –British Academic Written English
    Nesi and Thompson 2008
  –Student essays
    Arts/Humanities, Social Sciences, Life Sciences, Physical Sciences
  –fc: ArtsHum, rc: SocSci
  –With n=10 and n=1000

Parameters for keyword lists
Lemmas
  –Could be word forms, word classes
Simplemaths
  –(default: 100, for mix of lexical and grammar words)
Only all-lowercase-letters
  –Could allow uppercase, or any at all
Minimum 2/3/4 characters
  –Helps get words, not abbreviations etc.
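The character-level parameters above amount to a filter applied to each candidate before scoring. The sketch below is one hypothetical way to express them; `keep_candidate` and its argument names are not Sketch Engine settings, and treating "not lowercase-only" as "allow anything" is an assumption.

```python
def keep_candidate(word, min_length=3, lowercase_only=True):
    """Decide whether a word may appear in the keyword list.

    min_length: shortest word allowed (helps exclude abbreviations).
    lowercase_only: if True, keep only words made entirely of lowercase letters;
                    if False, any characters are allowed (an assumption).
    """
    if len(word) < min_length:
        return False
    if lowercase_only and not (word.isalpha() and word.islower()):
        return False
    return True


candidates = ["holiday", "Obama", "uk", "www", "co2", "whilst"]
print([w for w in candidates if keep_candidate(w)])
print([w for w in candidates if keep_candidate(w, min_length=5, lowercase_only=False)])
```

Relaxing the lowercase restriction lets proper names through, while raising the minimum length removes short acronyms that would otherwise come with them.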

enTenTen vs. UKWaC
With parameters:
  Simplemaths: 10
  Uppercase and lowercase
  Minimum length = 5 (to exclude acronyms)
enTenTen: Obama Clinton Hillary McCain
UKWaC: Centre Leeds Manchester Edinburgh

Two interlocking questions
How do two corpora differ
How do two text types differ

Two interlocking questions
How do two corpora differ
  –enTenTen vs. UKWaC
  –Interpret as:
    Differences of corpus compilation procedures
    - and/or -
    Differences of proportions of text types
How do two text types differ
  –BAWE example
    Arts/humanities essays vs. Social Sciences essays
  –Any other corpus differences
    Unwanted biases
    But we need to know about them

Don’t do bad science
Get to know your corpus
  –Compare with others
    Qualitatively: keyword lists
    (Quantitatively: distances)
No excuses
  –The Sketch Engine does all the technical work for you
The joy of research