1 Linguistic evidence within and across languages, word frequency lists and language learning Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass.

Presentation transcript:

1 Linguistic evidence within and across languages, word frequency lists and language learning
Adam Kilgarriff
Lexical Computing Ltd; Lexicography MasterClass; Universities of Leeds and Sussex

2 Linguistic evidence within and across languages, word frequency lists and language learning
Or: Word lists are useful, but are they (could they be) scientific?

Leeds April 2010 Kilgarriff: KELLY
3 KELLY
• EU lifelong learning project
• Goal: wordcards
  – Word in one language on one side, its translation in another language on the other
  – Language learning
• 9 languages, 36 pairs
  – Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
• Partners (incl. Leeds) in 6 countries
  – Leeds does Arabic, Chinese, Russian
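
The figure of 36 pairs follows from choosing 2 of the 9 project languages without regard to direction (9 × 8 / 2 = 36), and each list has 8 translation targets. A minimal check, assuming unordered pairs are what the slide counts:

```python
from itertools import combinations

# The nine project languages named on the slide.
LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Unordered language pairs: one set of wordcards per pair.
pairs = list(combinations(LANGUAGES, 2))

# Each monolingual list is translated into every other language.
targets_per_language = len(LANGUAGES) - 1

print(len(pairs), targets_per_language)  # 36 8
```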

4 Method
• Prepare monolingual lists
• Translate
  – Each into 8 target languages
  – Professional translation services
• Integrate, finalise
• Produce cards
• Goal for each set: 9000 pairs at 6 levels

5 (Monolingual) Word Lists
• Define a syllabus
• Which words get used in
  – Learning-to-read books (NS children)
  – NNS language learner textbooks
  – Dictionaries
  – Language testing
• NS: educational psychologists
• NNS: proficiency levels

6 Should be corpus-based
• Most aren't
  – Corpora are quite new
• Easy to do better
• People will use them
  – Maybe also governments

7 How
• Take your corpus
• Count
• Voilà
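
The "take your corpus, count" recipe can be sketched in a few lines. This is only an illustration: the function name `frequency_list`, the lower-casing, and the letters-only tokenisation are assumptions, and the sample sentence is made up.

```python
import re
from collections import Counter

def frequency_list(text, top_n=None):
    """Return (word, count) pairs, most frequent first."""
    # Naive tokenisation: runs of letters, lower-cased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens).most_common(top_n)

sample = "The cat sat on the mat. The dog sat too."
freq = frequency_list(sample)
print(freq[:2])  # [('the', 3), ('sat', 2)]
```

Even this toy version already takes positions on what counts as a word, which is where the complications begin.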

8 Complications
• What is a word
• Words and lemmas
• Grammatical classes
• Numbers, names...
• Multiwords
• Homonymy
All are slightly different issues for each language

9 What is a word; delimiters
• Found between spaces
  – Not for Chinese: segmentation
• English
  – co-operate, widely-held, farmer's, can't
• Norwegian, Swedish
  – Compounding, separable verbs
• Arabic, Italian
  – Clitics, al, ...
• ...
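
The English examples show why "found between spaces" is not enough. The sketch below compares three tokenisation policies on one made-up sentence; the function names and regular expressions are illustrative assumptions, not the project's tokeniser.

```python
import re

sentence = "The farmer's widely-held view: we can't co-operate."

def whitespace_tokens(text):
    # Split on spaces only: punctuation stays glued to words.
    return text.split()

def letters_only_tokens(text):
    # Every non-letter is a delimiter: hyphens and apostrophes split words.
    return re.findall(r"[A-Za-z]+", text)

def keep_internal_marks_tokens(text):
    # Letters, optionally joined by a hyphen or apostrophe inside the word.
    return re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)

print(whitespace_tokens(sentence))
print(letters_only_tokens(sentence))
print(keep_internal_marks_tokens(sentence))
```

Each policy yields a different word count for the same sentence, so the choice has to be fixed before any frequencies are compared.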

10 Words and lemmas
• Word form (in text)
  – invading
• Lemma (dictionary headword)
  – invade, for the forms invade, invades, invaded, invading
• Lemmatisation
  – Chinese: none; English: simple
  – Middling: Swedish, Norwegian, Italian, Greek
  – Tough: Russian, Polish, Arabic
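
Counting lemmas rather than word forms means mapping each form to its headword first. A minimal sketch, assuming a hand-made lookup table (`LEMMA_OF` is hypothetical; real lemmatisers are language-specific tools):

```python
from collections import Counter

# Hypothetical toy lookup from word form to lemma.
LEMMA_OF = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_frequencies(tokens, lemma_of=LEMMA_OF):
    # Unknown forms fall back to the lower-cased form itself.
    return Counter(lemma_of.get(t.lower(), t.lower()) for t in tokens)

tokens = "They invaded once and are invading again ; nobody invades twice".split()
counts = lemma_frequencies(tokens)
print(counts["invade"])  # 3
```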

11 Grammatical classes
• brush (verb) and brush (noun)
  – Same item or different?
  – Proposal: lempos (lemma plus part of speech)
• Recommendation: different
  – With trepidation
• Chinese: weak sense of noun, verb
• Required: (short) list of word classes for each language
• Same for all unless good reason
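
Treating brush (verb) and brush (noun) as different items amounts to counting lemma and word class together. A small sketch, assuming tokens already arrive as (lemma, tag) pairs; the `lemma-tag` string format and the one-letter tags are illustrative choices:

```python
from collections import Counter

# Made-up, already lemmatised and tagged tokens: (lemma, word-class tag).
tagged = [("brush", "n"), ("brush", "v"), ("brush", "v"), ("paint", "n")]

# Counting by lemma alone merges the noun and the verb.
by_lemma = Counter(lemma for lemma, _ in tagged)

# Counting by lemma plus word class keeps them apart.
by_lempos = Counter(f"{lemma}-{tag}" for lemma, tag in tagged)

print(by_lemma["brush"])                          # 3
print(by_lempos["brush-n"], by_lempos["brush-v"])  # 1 2
```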

12 Marginal cases
Numbers
• twelve, seventeenth, fifties
Closed sets
• Days of week, months
Countries
• Capitals, nationalities, currencies, adjectives, languages
  regional/dialects, political groups, religions
• easter, christmas, islam, republican
• Consistency before frequency: policies needed

13 Multiwords
• according to
  – Linguistically a word, but
• Multiword frequency list: top item "of the"
  – Can't use frequencies (alone) to select multiwords
• Base list:
  – Recommendation: no multiwords
  – But see below
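
A raw two-word frequency count shows why frequency alone cannot pick out multiwords. In the made-up text below, the pair "of the" outranks the genuine multiword "according to":

```python
from collections import Counter

def bigram_frequencies(tokens):
    # Count every pair of adjacent tokens.
    return Counter(zip(tokens, tokens[1:]))

text = ("according to the report most of the members of the board "
        "acted according to the rules of the club")
tokens = text.split()

bigrams = bigram_frequencies(tokens)
print(bigrams.most_common(1))          # [(('of', 'the'), 3)]
print(bigrams[("according", "to")])    # 2
```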

14 Homonymy
• bank (river) and bank (money)
• Word sense disambiguation
  – We can't do it (with decent accuracy)
  – We can't give frequencies for senses
• Lists of words, not meanings
  – Sometimes disconcerting
  – See also below

15 Corpora
• A fairly arbitrary sample of a language
• To limit arbitrariness of the word list
  – Make it big and diverse
• WACKY corpora
  – From the web
  – Can do for any language
  – Web language: less formal
• not mainly 'reporting' or fiction, cf. news, BNC
• Good for language learners

16 Comparing corpora
• Corpora: new
• We are all beginners
• Best way to get a sense of a corpus
  – Compare with another
  – Keywords of each vs. other
• Case studies
• Sketch Engine functions

17 Comparing frequency lists
Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
  that's 1,000,000,000,000
Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top 50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
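
Ranking words by the ratio of their frequencies in two corpora can be sketched as follows. The counts are made up, and both the per-million normalisation and the `smoothing` constant (added so that a word missing from one list does not cause division by zero) are assumptions of this sketch, not details given on the slide.

```python
def per_million(count, corpus_size):
    return count * 1_000_000 / corpus_size

def frequency_ratios(counts_a, size_a, counts_b, size_b, smoothing=1.0):
    """Rank words by normalised frequency in corpus A relative to corpus B."""
    ratios = {}
    for word in set(counts_a) | set(counts_b):
        fa = per_million(counts_a.get(word, 0), size_a)
        fb = per_million(counts_b.get(word, 0), size_b)
        ratios[word] = (fa + smoothing) / (fb + smoothing)
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Made-up toy counts for two one-million-word corpora.
web = {"browser": 900, "she": 1000, "the": 60000}
ref = {"browser": 10, "she": 4000, "the": 61000}

ranked = frequency_ratios(web, 1_000_000, ref, 1_000_000)
print([word for word, _ in ranked])  # ['browser', 'the', 'she']
```

Words at the top of the ranking are candidates for the "high" list, and words at the bottom for the "low" list.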

18 Web-high (155 terms)
61 web and computing
– config browser spyware url www forum
38 porn
22 US English (incl. Spanish influence: los)
18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
4 legal
– trademarks pursuant accordance herein

19 Web-low
Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

20 Observations
Pronouns and past tense verbs
– Fiction
Masc vs fem
Yesterday
– Probably daily newspapers
Constancy of ratios:
– He/him/himself
– She/her/herself

21 Corpus Factory
Many languages
General corpus, 100m+ words
• Fast
• High quality
• Comparable across languages

22 Gather seed words
Wikipedia (Wiki) corpora
• many domains
• free
• 265 languages covered, more to come
Extract text from Wiki
• Wikipedia 2 Text
Tokenise the text
• Morphology of the language is important
• Can use existing word tokeniser tools

23 Web Corpus Statistics

24 Evaluation
For each of the languages, two corpora available:
• Web and Wiki
• Dutch: also a carefully-designed lexicographic corpus
Hypothesis: Wiki corpora are 'informational'
• Informational --> typical written
• Interactional --> typical spoken

25 Evaluation
1st, 2nd person pronouns
• strong indicators of interactional language
• English: I me my mine you your yours we us our
For each language
• Ratio: web:wiki
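
The web:wiki ratio for first- and second-person pronouns can be sketched like this. The pronoun set is the English list from the slide; the function names, the per-million normalisation, and the two tiny token lists are illustrative assumptions.

```python
PRONOUNS = {"i", "me", "my", "mine", "you", "your", "yours", "we", "us", "our"}

def pronoun_rate(tokens):
    """First- and second-person pronouns per million tokens."""
    hits = sum(1 for t in tokens if t.lower() in PRONOUNS)
    return hits * 1_000_000 / len(tokens)

def web_wiki_ratio(web_tokens, wiki_tokens):
    wiki_rate = pronoun_rate(wiki_tokens)
    if wiki_rate == 0:
        raise ValueError("no pronouns in the Wiki sample; ratio undefined")
    return pronoun_rate(web_tokens) / wiki_rate

# Made-up miniature samples standing in for the two corpora.
web_tokens = "I think you will love our new shop".split()
wiki_tokens = "the city was founded in 1200 and we know little".split()

ratio = web_wiki_ratio(web_tokens, wiki_tokens)
print(ratio)
```

A ratio above 1 would be consistent with the hypothesis that the web corpus is more interactional than the Wiki corpus.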

26 Results


28 Stages
• Sort out corpora, tagging
• Automatically generate M1 lists
  – names, numbers, countries...
  – keywords vis-a-vis other corpora
• Review, prepare M2 lists
• Translate

29 Review: how?
• points system
  – 2 points for each of 6 levels
  – 12 points for most frequent words
• deduct points for words in over-represented areas
• add in words from other corpora

30 Translation database
• On the web
• All translations entered into it
• Queries like
  – All Swedish words used as translations more than six times
  – All 1:1:1:1... 'simple cases'
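
The two example queries can be sketched over a flat table of translation records. Everything here is hypothetical: the row layout, the function names and the toy data are assumptions, and reading "1:1:1:1..." as "exactly one translation in every target language" is only one possible interpretation.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (source_lang, source_word, target_lang, target_word).
rows = [
    ("English", "big", "Swedish", "stor"),
    ("English", "large", "Swedish", "stor"),
    ("Italian", "grande", "Swedish", "stor"),
    ("English", "bank", "Swedish", "bank"),
    ("English", "bank", "Swedish", "strand"),
    ("English", "big", "Italian", "grande"),
]

def overused_translations(rows, target_lang, min_times=6):
    """Target-language words used as a translation more than min_times times."""
    counts = Counter(tw for _, _, tl, tw in rows if tl == target_lang)
    return {word: n for word, n in counts.items() if n > min_times}

def one_to_one_sources(rows, source_lang):
    """Source words with exactly one translation in each target language."""
    per_word = defaultdict(lambda: defaultdict(set))
    for sl, sw, tl, tw in rows:
        if sl == source_lang:
            per_word[sw][tl].add(tw)
    return sorted(word for word, by_lang in per_word.items()
                  if all(len(targets) == 1 for targets in by_lang.values()))

print(overused_translations(rows, "Swedish", min_times=2))  # {'stor': 3}
print(one_to_one_sources(rows, "English"))                  # ['big', 'large']
```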

31 Translations
Usually, of texts
– Words in context
Kelly: no context
• Usual principles don't apply
• Instructions to translators

32 Using the database
• Find words not in M2 lists that need adding
  – Multiwords
    English "look for": probably the translation of a high-frequency word in several of the 8 other languages
    So: add it to the English list
  – Homonyms: could be similar

33 Monolingual master lists (M3)
• Based on a WAC corpus
• Input from other same-language corpora
• And from translations from 8 languages
  – Useful words which might not be high-frequency
• Added words/multiwords must be above a lower frequency threshold
• Target 9000
• Important contribution

34 Numbers
• Target: 9000 per list
• M2 lists
  – Estimate: needed
  – We add multiwords and other 'back-translations'

35 From M3 lists to T2 lists

36 Current status
• M1 lists prepared
• Lists checked, compared with other lists
  – Corpus-based and other
• M2 lists prepared
• Translation underway

37 Big problems
• Multiwords (as anticipated)
• Homonymy (as anticipated)
• orange banana alphabet elbow, Hello
  – Worse than anticipated
  – Lists from spoken corpora, learner corpora, needed
Relation between
• Competence for communicating
• The corpora at our disposal

38 Word lists are useful, but
...are they scientific?
  – A tiny bit, occasionally
...could they be scientific?
  – Yes
• article of faith
By the end of KELLY, we'll have a clearer idea how
