Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project. Adam Kilgarriff, Lexical Computing Ltd, Lexicography MasterClass.


1 Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project. Adam Kilgarriff, Lexical Computing Ltd; Lexicography MasterClass; Universities of Leeds and Sussex.

2 Malta, May 2010. Kilgarriff: BUCC. Two corpora are comparable iff they have roughly the same text types, subject matter and proportions.

3 Two corpora are comparable iff they have roughly the same text types, subject matter and proportions. Applicable where: different languages; same language. Comparable = similar. Any corpus is entirely similar to itself.

4 Comparing corpora. Input: word frequency list for c1; word frequency list for c2. For the top 500 words, compute the sum of (observed - expected)^2 / expected. Chi-square-based. Discriminates well: better than Spearman rank or cross-entropy.
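
The chi-square-based measure on this slide can be sketched in Python. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `chi_square_distance` is hypothetical, the "top 500" words are taken from the two corpora combined, and each expected count is the joint count shared out in proportion to corpus size. The slide does not spell out either choice.

```python
from collections import Counter


def chi_square_distance(freq1, freq2, top_n=500):
    """Sum (observed - expected)^2 / expected over the top_n joint words.

    freq1 and freq2 map each word to its raw count in corpus 1 and corpus 2.
    Lower values mean the two corpora are more similar.
    """
    n1 = sum(freq1.values())
    n2 = sum(freq2.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both frequency lists must be non-empty")

    # Assumption: rank words by their combined frequency in both corpora.
    joint = Counter(freq1) + Counter(freq2)

    total = 0.0
    for word, joint_count in joint.most_common(top_n):
        for freq, size in ((freq1, n1), (freq2, n2)):
            observed = freq.get(word, 0)
            # Expected count if the word were equally common in both corpora.
            expected = joint_count * size / (n1 + n2)
            total += (observed - expected) ** 2 / expected
    return total
```

Under this reading a corpus compared with itself scores 0, which matches the earlier remark that any corpus is entirely similar to itself.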

5 1990s work. Then: very few corpora; purely theoretical interest. Now: the web; lots of corpora, created to spec. Compare… the first question to ask about a new corpus.

6 (Monolingual) word lists. Define a syllabus. Which words get used in: learning-to-read books (native-speaker, NS, children); non-native-speaker (NNS) language-learner textbooks; dictionaries; language testing. NS: educational psychologists. NNS: proficiency levels.

7 Should be corpus-based. Most aren't: corpora are quite new. Easy to do better. People will use them; maybe also governments.

8 How: take your corpus; count; voilà.
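
The "take your corpus, count" recipe can be sketched in a few lines of Python. The function name `word_frequency_list` and the regular expression are illustrative assumptions; the tokeniser is deliberately naive (lower-cased runs of letters, optionally joined by apostrophes or hyphens) and would not suit languages written without spaces.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally joined by ' or -.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def word_frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return Counter(tokens).most_common()


if __name__ == "__main__":
    sample = "The cat sat; the cat can't co-operate"
    for word, count in word_frequency_list(sample):
        print(word, count)
```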

9 Complications: what is a word; words and lemmas; grammatical classes; numbers, names...; multiwords; homonymy. All are slightly different issues for each language.

10 What is a word; delimiters. Found between spaces. Not for Chinese: segmentation. English: co-operate, widely-held, farmer's, can't. Norwegian, Swedish: compounding, separable verbs. Arabic, Italian: clitics, al, ...

11 Words and lemmas. Word form (in text): invading. Lemma (dictionary headword): invade, for the forms invade, invades, invaded, invading. Lemmatisation: Chinese, none; English, simple; middling: Swedish, Norwegian, Italian, Greek; tough: Russian, Polish, Arabic.
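
To illustrate the word form versus lemma distinction, here is a small Python sketch that folds word-form counts into lemma counts. The lookup table `FORM_TO_LEMMA` and the function `lemma_frequencies` are hypothetical; in practice a lemmatiser for the language in question would supply the mapping.

```python
from collections import Counter

# Toy lookup table for illustration only (assumption, not a real lemmatiser).
FORM_TO_LEMMA = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}


def lemma_frequencies(form_freq, form_to_lemma):
    """Sum word-form counts under their lemma; unknown forms stay as they are."""
    lemma_freq = Counter()
    for form, count in form_freq.items():
        lemma_freq[form_to_lemma.get(form, form)] += count
    return lemma_freq
```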

12 Word families. Derivational morphology: efficient/efficiently; access/accessible/accessibility; available/availability/unavailable. 'Word families' tradition, e.g. Coxhead, Academic Word List. Pedagogy: one item to learn. But: where do families end? Different meanings.

13 Grammatical classes. brush (verb) and brush (noun): same item or different? (Both are in the same word family.) Required: a (short) list of word classes; a POS-tagger, which will make mistakes.

14 Marginal cases. Numbers: twelve, seventeenth, fifties. Closed sets: days of the week, months. Countries: capitals, nationalities, currencies, adjectives, languages. Regional/dialects, political groups, religions: easter, christmas, islam, republican. Policies always needed.

15 Multiwords. 'according to': linguistically a word, but... Multiword frequency list: the top item is 'of the'. Can't use frequencies (alone) to select multiwords.

16 Homonymy. bank (river) and bank (money). Word sense disambiguation: we can't do it (with decent accuracy), so we can't give frequencies for senses. Lists are of words, not meanings: sometimes disconcerting.

17 Corpora. A fairly arbitrary sample of a language. To limit the arbitrariness of the wordlist: make it big and diverse. WACKY corpora: from the web; can do for any language. ??? Comparable ??? Web language: less formal.

18 Comparing corpora. Corpora: new. We are all beginners. Best way to get a sense of a corpus: compare it with another; keywords of each vs. the other. Case studies. Sketch Engine functions.

19 Comparing frequency lists. Web1T: a present from Google; all 1-, 2-, 3-, 4- and 5-grams with f>40 in one trillion (10^12) words of English, that's 1,000,000,000,000. Compare with the BNC: take the top 50,000 items of each; 105 Web1T words not in the BNC top 50k; 50 words with the highest Web1T:BNC ratio; 50 words with the lowest ratio.
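
The comparison on this slide can be sketched as follows. This is an illustrative sketch rather than the procedure actually used: the function name `compare_frequency_lists` is hypothetical, and normalising each count by the total size of its own list before taking the ratio is an assumption the slide does not state.

```python
def compare_frequency_lists(freq_a, freq_b, top_n=50000, k=50):
    """Compare two word -> count mappings (e.g. A = Web1T, B = BNC).

    Returns three things:
      - words in A's top_n that are missing from B's top_n,
      - the k shared words with the highest A:B ratio,
      - the k shared words with the lowest A:B ratio.
    """
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())

    top_a = set(sorted(freq_a, key=freq_a.get, reverse=True)[:top_n])
    top_b = set(sorted(freq_b, key=freq_b.get, reverse=True)[:top_n])

    only_in_a = top_a - top_b
    shared = top_a & top_b

    # Assumption: compare relative frequencies, so corpus size cancels out.
    ratio = {w: (freq_a[w] / size_a) / (freq_b[w] / size_b) for w in shared}
    ranked = sorted(shared, key=ratio.get, reverse=True)

    return only_in_a, ranked[:k], ranked[-k:]
```

With fewer than 2k shared words the highest and lowest lists overlap, which does not arise with 50,000-item lists.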

20 Web-high (155 terms). 61 web and computing: config, browser, spyware, url, www, forum. 38 porn. 22 US English (incl. Spanish influence: los). 18 business/products common on the web: poker, viagra, lingerie, ringtone, dvd, casino, rental, collectible, tiffany (NB: the BNC is old). 4 legal: trademarks, pursuant, accordance, herein.

21 Web-low. Excluding British English and transcription/tokenisation anomalies: herself, stood, seemed, she, looked, yesterday, sat, considerable, had, council, felt, perhaps, walked, round, her, towards, claimed, knew, obviously, remained, himself, he, him.

22 Observations. Pronouns and past tense verbs: fiction. Masculine vs. feminine. 'yesterday': probably daily newspapers. Constancy of ratios: he/him/himself; she/her/herself.

23 Corpus Factory. Many languages. General corpus, 100m+ words. Fast. High quality. Comparable across languages.

24 Gather seed words. Wikipedia (Wiki) corpora: many domains; free; 265 languages covered, more to come. Extract text from Wiki: Wikipedia 2 Text. Tokenise the text: morphology of the language is important; can use existing word tokeniser tools.

25 Web Corpus Statistics

26 Evaluation. For each of the languages, two corpora are available: Web and Wiki. Dutch: also a carefully-designed lexicographic corpus. Hypothesis: Wiki corpora are 'informational'. Informational --> typical written. Interactional --> typical spoken.

27 Evaluation. 1st and 2nd person pronouns: strong indicators of interactional language. English: I, me, my, mine, you, your, yours, we, us, our. For each language, take the ten commonest 1st and 2nd person pronouns; for each, calculate the ratio web:wiki.
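
The pronoun test can be sketched in Python as below. The function names `pronoun_ratios` and `summarise` are hypothetical, and two details are assumptions: counts are turned into relative frequencies (count divided by corpus size) before the ratio is taken, and pronouns that never occur in the Wiki corpus are skipped to avoid dividing by zero.

```python
def pronoun_ratios(web_freq, wiki_freq, web_size, wiki_size, pronouns):
    """Return {pronoun: web:wiki ratio of relative frequencies}."""
    ratios = {}
    for pronoun in pronouns:
        web_rel = web_freq.get(pronoun, 0) / web_size
        wiki_rel = wiki_freq.get(pronoun, 0) / wiki_size
        if wiki_rel > 0:  # assumption: skip pronouns absent from Wiki
            ratios[pronoun] = web_rel / wiki_rel
    return ratios


def summarise(ratios):
    """Return (average, minimum, maximum) of the ratios."""
    values = list(ratios.values())
    return sum(values) / len(values), min(values), max(values)
```

A ratio above 1 means the pronoun is relatively more frequent in the Web corpus, which is what the hypothesis that Wiki corpora are 'informational' would predict.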

28 Results: ratios, web:wiki
Language      Average   Min    Max
Dutch         2.98      2.03   10.03
Hindi         5.36      1.85   11.50
Telugu        4.96      0.54   7.34
Thai          2.40      0.63   7.87
Vietnamese    3.82      1.81   19.41

29 KELLY. EU lifelong learning project. Goal: wordcards for language learning, with the word in one language on one side and in the other language on the other. 9 languages, 36 pairs: Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish. Partners in 6 countries.

30 Method. Prepare monolingual lists. Translate: each into 8 target languages; professional translation services. Integrate, finalise. Produce cards. Goal for each set: 9000 pairs at 6 levels.

31 Stages. Sort out corpora, tagging. Automatically generate M1 lists: names, numbers, countries...; keywords vis-à-vis other corpora. Review, compare, prepare M2 lists. Translate. Use translations: M3 lists. Finalise.

32 Review: how? Points system: 2 points for each of 6 levels; 12 points for the most frequent words. Deduct points for words in over-represented areas. Add in words from other corpora.

33 Translation database. On the web. All translations entered into it. Queries like: all Swedish words used as translations more than six times; all 1:1:1:1... 'simple cases'.
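
The first kind of query can be sketched in Python. The record layout (source language, source word, target language, target word) and the function name `frequent_translations` are assumptions made for illustration; the slides do not describe the database schema.

```python
from collections import Counter


def frequent_translations(records, target_lang, min_uses=6):
    """Target-language words used as a translation more than min_uses times.

    records: iterable of (source_lang, source_word, target_lang, target_word)
    tuples. Duplicate records are counted once.
    """
    counts = Counter(
        target_word
        for (_src_lang, _src_word, tgt_lang, target_word) in set(records)
        if tgt_lang == target_lang
    )
    return {word: n for word, n in counts.items() if n > min_uses}
```

For example, `frequent_translations(records, "Swedish")` would list every Swedish word chosen as a translation more than six times, given records in the assumed layout.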

34 Using the translations database. Find words not in the M2 lists that need adding. Multiwords: English 'look for' is probably the translation of a high-frequency word in several of the 8 other languages, so add it to the English list. Homonyms: could be similar.

35 Monolingual master lists (M3). Based on a WAC corpus. Input from other same-language corpora, and from translations from the 8 other languages: useful words which might not be high-frequency. Added words/multiwords must be above a lower frequency threshold. Target: 9000.

36 Numbers. Target: 9000 per list. M2 lists: estimate 5000-6000 needed. We add 3000-4000 multiwords and other 'back-translations'.

37 Current status. M1 lists prepared. Lists checked and compared with other lists, corpus-based and other. M2 lists prepared. Translation underway.

38 Big problems. Multiwords (as anticipated). Homonymy (as anticipated). orange, banana, alphabet, elbow, Hello: worse than anticipated; lists from spoken corpora and learner corpora are needed. Relation between competence for communicating and the corpora at our disposal.

39 Word lists are useful, but... are they scientific? A tiny bit, occasionally. ...could they be scientific? Yes: article of faith. By the end of KELLY, we'll have a clearer idea how.

40 And now for something completely different: DANTE. Lexical database for English: detailed, accurate, extensive, of English; highly corpus-driven. 3-year project; 18 expert lexicographers; led by Sue Atkins. BNC, FrameNet, Euralex, COBUILD... English side of the New English-Irish Dictionary. Available for NLP research imminently.

