Published by Everett Hutchinson. Modified over 9 years ago.
1
Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass
Universities of Leeds and Sussex
2
Malta, May 2010. Kilgarriff: BUCC
Two corpora are comparable iff they have roughly the same text types, subject matter, proportions
3
Two corpora are comparable iff they have roughly the same text types, subject matter, proportions
Applicable where
– Different languages
– Same language: comparable = similar
Any corpus is entirely similar to itself
4
Comparing Corpora
Input
– Word freq list for c1
– Word freq list for c2
For top 500 words, compute sum of (observed - expected)² / expected
– Chi-square-based
Discriminates well
– Better than Spearman rank, cross-entropy
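The measure on this slide can be sketched in a few lines. This is a minimal illustration, not the author's implementation. It assumes that "top 500 words" means the most frequent words of the two corpora taken together, that corpus size is the sum of the counts in each list, and that expected counts are proportional to corpus size. The function name `chi_square_statistic` is hypothetical.

```python
from collections import Counter


def chi_square_statistic(freq1, freq2, top_n=500):
    """Sum (observed - expected)^2 / expected over the top_n joint words."""
    size1 = sum(freq1.values())
    size2 = sum(freq2.values())
    if size1 == 0 or size2 == 0:
        raise ValueError("both frequency lists must be non-empty")

    # Assumption: the "top" words are the most frequent words of the
    # two corpora taken together.
    joint = Counter(freq1) + Counter(freq2)

    total = 0.0
    for word, joint_count in joint.most_common(top_n):
        for freq, size in ((freq1, size1), (freq2, size2)):
            # Expected count if the word were spread across both
            # corpora in proportion to their sizes.
            expected = joint_count * size / (size1 + size2)
            observed = freq.get(word, 0)
            total += (observed - expected) ** 2 / expected
    return total


c1 = {"the": 10, "cat": 2}
c2 = {"the": 10, "dog": 2}
print(chi_square_statistic(c1, c1))  # a corpus compared with itself
print(chi_square_statistic(c1, c2))  # two different corpora
```

A corpus compared with itself scores zero, which matches the earlier remark that any corpus is entirely similar to itself. Larger values mean less similar corpora.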
5
1990s work
Then
– Very few corpora
– Purely theoretical interest
Now
– Web: lots of corpora, created to spec
– Compare… first question to ask about a new corpus
6
(Monolingual) Word Lists
Define a syllabus
Which words get used in
– Learning-to-read books (NS children)
– NNS language learner textbooks
– Dictionaries
– Language testing
  – NS: educational psychologists
  – NNS: proficiency levels
7
Should be corpus-based
Most aren't
– Corpora are quite new
Easy to do better
People will use them
– Maybe also governments
8
How
– Take your corpus
– Count
– Voilà
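The naive recipe really is this short. The sketch below assumes, for illustration only, that a "word" is a lower-cased run of letters, digits, or underscores. The function name `word_frequency_list` is hypothetical.

```python
import re
from collections import Counter


def word_frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumption: a "word" is a lower-cased run of letters, digits,
    # or underscores. Real corpora need a language-specific tokeniser.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()


sample = "The cat saw the dog and the dog saw the cat"
for word, count in word_frequency_list(sample)[:3]:
    print(word, count)
```

Everything that makes a real word list hard is hidden inside the tokenising line.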
9
Complications
– What is a word
– Words and lemmas
– Grammatical classes
– Numbers, names...
– Multiwords
– Homonymy
All are slightly different issues for each lg
10
What is a word; delimiters
Found between spaces
– Not for Chinese: segmentation
– English: co-operate, widely-held, farmer's, can't
– Norwegian, Swedish: compounding, separable verbs
– Arabic, Italian: clitics, al, ...
– ...
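The English examples show how much depends on the delimiter rule. The sketch below runs two simple rules over one invented sentence. Both rules are illustrative assumptions, and neither is a recommended tokeniser.

```python
import re

text = "The farmer's widely-held views can't co-operate."

# Rule 1: a word is whatever sits between spaces.
space_tokens = text.split()

# Rule 2: a word is an unbroken run of ASCII letters.
letter_tokens = re.findall(r"[A-Za-z]+", text)

print(space_tokens)   # keeps "farmer's" and "co-operate." whole
print(letter_tokens)  # splits them into "farmer", "s", "co", "operate"
```

Rule 1 yields six tokens and glues the full stop onto the last word. Rule 2 yields ten and breaks every hyphenated or apostrophised form apart, so the two rules produce different frequency lists from the same text.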
11
Words and lemmas
Word form (in text)
– invading
Lemma (dictionary headword)
– invade, for forms invade, invades, invaded, invading
Lemmatisation
– Chinese: none; English: simple
– Middling: Swe, Nor, It, Gr
– Tough: Rus, Pol, Ara
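Counting lemmas rather than word forms means folding each form onto its headword before counting. The sketch below uses a hand-written lookup table for the one lemma on the slide. `LEMMA_OF` and `lemma_frequencies` are hypothetical names, and a real system would use a lemmatiser for the language instead of a table.

```python
from collections import Counter

# Hypothetical toy lookup covering only the slide's example.
LEMMA_OF = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}


def lemma_frequencies(tokens, lemma_of=LEMMA_OF):
    """Count lemmas; forms missing from the table count as themselves."""
    return Counter(lemma_of.get(t.lower(), t.lower()) for t in tokens)


tokens = ["invading", "Invaded", "invade", "army"]
print(lemma_frequencies(tokens))
```

Here three different word forms contribute to a single count for invade.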
12
Word Families
Derivational morphology
– efficient/efficiently
– access/accessible/accessibility
– available/availability/unavailable
‘Word families’ tradition
– e.g. Coxhead, Academic Word List
– Pedagogy: one item to learn
But
– Where do families end?
– Different meanings
13
Grammatical classes
brush (verb) and brush (noun)
– Same item or different? (both in same word family)
Required
– (Short) list of word classes
– POS-tagger
  – Will make mistakes
14
Marginal cases
– Numbers: twelve, seventeenth, fifties
– Closed sets: days of week, months
– Countries: capitals, nationalities, currencies, adjectives, languages
– Regional/dialects, political groups, religions: easter, christmas, islam, republican
Policies always needed
15
Multiwords
"according to"
– Linguistically a word, but
– Multiword frequency list: top item is "of the"
– Can't use freqs (alone) to select multiwords
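A raw bigram count shows why frequency alone fails. The sketch below runs on an invented sentence, and `bigram_frequencies` is a hypothetical helper name. On this toy input the most frequent two-word sequence is "of the", while "according to" occurs only once.

```python
from collections import Counter


def bigram_frequencies(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


tokens = "according to the head of the board of the bank".split()
counts = bigram_frequencies(tokens)
top_bigram, top_count = counts.most_common(1)[0]
print(top_bigram, top_count)
print(counts[("according", "to")])
```

Picking out true multiwords therefore needs something beyond raw counts, such as an association score or manual review.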
16
Homonymy
bank (river) and bank (money)
Word sense disambiguation
– We can't do it (with decent accuracy)
– We can't give freqs for senses
Lists of words, not meanings
– Sometimes disconcerting
17
Corpora
A fairly arbitrary sample of a lg
To limit arbitrariness of wordlist
– Make it big and diverse
WACKY corpora
– From web
– Can do for any language
– ??? Comparable ???
– Web language: less formal
18
Comparing corpora
Corpora: new
– We are all beginners
Best way to get a sense of a corpus
– Compare with another
– Keywords of each vs. other
Case studies
Sketch Engine functions
19
Comparing frequency lists
Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
  – that’s 1,000,000,000,000
Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top 50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
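The comparison procedure can be sketched as below. This is an illustration under stated assumptions, not the procedure actually used. It assumes ratios are taken between relative frequencies, that corpus size is approximated by the sum of counts in each list, and that ratios are computed only for words in both top lists. The function name `compare_frequency_lists` is hypothetical.

```python
def compare_frequency_lists(freq_a, freq_b, top_n=50000, k=50):
    """Return (words only in A's top list, k highest A:B ratios, k lowest)."""
    # Assumption: corpus size is approximated by the sum of the counts.
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())

    top_a = set(sorted(freq_a, key=freq_a.get, reverse=True)[:top_n])
    top_b = set(sorted(freq_b, key=freq_b.get, reverse=True)[:top_n])

    only_a = top_a - top_b
    # Ratio of relative frequencies, so corpus size does not dominate.
    ratio = {
        w: (freq_a[w] / size_a) / (freq_b[w] / size_b)
        for w in top_a & top_b
    }
    ranked = sorted(ratio, key=ratio.get, reverse=True)
    return only_a, ranked[:k], ranked[-k:]


# Invented toy counts, for illustration only.
web = {"the": 100, "url": 50, "she": 5, "www": 45}
bnc = {"the": 100, "she": 50, "url": 5, "council": 45}
only_web, web_high, web_low = compare_frequency_lists(web, bnc, top_n=4, k=1)
print(only_web, web_high, web_low)
```

The first return value corresponds to the words missing from the other top list, and the other two to the highest-ratio and lowest-ratio words.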
20
Web-high (155 terms)
61 web and computing
– config browser spyware url www forum
38 porn
22 US English (incl Spanish influence: los)
18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
4 legal
– trademarks pursuant accordance herein
21
Web-low
Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
22
Observations
Pronouns and past tense verbs
– Fiction
Masc vs fem
Yesterday
– Probably daily newspapers
Constancy of ratios:
– He/him/himself
– She/her/herself
23
Corpus Factory
– Many languages
– General corpus, 100m+ words
– Fast
– High quality
– Comparable across languages
24
Gather seed words
Wikipedia (Wiki) corpora
– many domains
– free
– 265 languages covered, more to come
Extract text from Wiki
– Wikipedia2Text
Tokenise the text
– Morphology of the language is important
– Can use existing word tokeniser tools
25
Web Corpus Statistics
26
Evaluation
For each of the languages, two corpora available: Web and Wiki
– Dutch: also a carefully designed lexicographic corpus
Hypothesis: Wiki corpora are ‘informational’
– Informational --> typical written
– Interactional --> typical spoken
27
Evaluation
1st, 2nd person pronouns: strong indicators of interactional language
– English: I me my mine you your yours we us our
For each language
– Take ten commonest 1st and 2nd person pronouns
– For each, calculate ratio: web:wiki
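The per-pronoun ratio can be sketched as follows. This is an illustration that assumes the ratio is taken between relative frequencies, so that corpora of different sizes can be compared. The function names and the toy counts are invented.

```python
def web_wiki_ratios(words, web_freq, web_size, wiki_freq, wiki_size):
    """Ratio of relative frequency in the web corpus to that in the wiki corpus."""
    ratios = {}
    for word in words:
        web_count = web_freq.get(word, 0)
        wiki_count = wiki_freq.get(word, 0)
        if wiki_count == 0:
            continue  # ratio undefined; skipped in this sketch
        # Same value as (web_count / web_size) / (wiki_count / wiki_size).
        ratios[word] = (web_count * wiki_size) / (wiki_count * web_size)
    return ratios


def summarise(ratios):
    """Return (average, minimum, maximum) of the ratios."""
    values = list(ratios.values())
    return sum(values) / len(values), min(values), max(values)


# Invented toy counts, for illustration only.
pronouns = ["i", "you"]
ratios = web_wiki_ratios(
    pronouns,
    web_freq={"i": 30, "you": 20}, web_size=1000,
    wiki_freq={"i": 5, "you": 10}, wiki_size=500,
)
print(ratios)
print(summarise(ratios))
```

A ratio above 1 means the pronoun is relatively more frequent in the web corpus, which is what the hypothesis predicts if Wiki text is more informational.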
28
Results: ratios, web:wiki
Language     Average   Min    Max
Dutch        2.98      2.03   10.03
Hindi        5.36      1.85   11.50
Telugu       4.96      0.54   7.34
Thai         2.40      0.63   7.87
Vietnamese   3.82      1.81   19.41
29
KELLY
EU lifelong learning project
Goal: wordcards
– Word in one lg on one side, other on other
– Language learning
9 languages, 36 pairs
– Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
Partners in 6 countries
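The figure of 36 pairs follows from choosing 2 languages out of 9 without regard to order. A quick check is below, which also counts the ordered source-to-target directions as an extra illustration.

```python
from itertools import combinations, permutations

LANGUAGES = [
    "Arabic", "Chinese", "English", "Greek", "Italian",
    "Norwegian", "Polish", "Russian", "Swedish",
]

# Unordered pairs: one card set per pair of languages.
pairs = list(combinations(LANGUAGES, 2))

# Ordered pairs: one per (source, target) translation direction.
directions = list(permutations(LANGUAGES, 2))

print(len(pairs))       # 9 * 8 / 2 = 36
print(len(directions))  # 9 * 8 = 72
```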
30
Method
Prepare monolingual lists
Translate
– Each into 8 target languages
– Professional translation services
Integrate, finalise
Produce cards
Goal for each set
– 9000 pairs at 6 levels
31
Stages
Sort out corpora, tagging
Automatically generate M1 lists
– names, numbers, countries...
– keywords vis-a-vis other corpora
Review, compare, prepare M2 lists
Translate
Use translations: M3 lists
Finalise
32
Review - how?
Points system
– 2 points for each of 6 levels
– 12 points for most freq words
– deduct points for words in over-represented areas
– add in words from other corpora
33
Translation database
On the web
All translations entered into it
Queries like
– All Swedish words used as translations more than six times
– All 1:1:1:1... 'simple cases'
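The first query could look like the sketch below. The record layout (source language, source word, target language, target word), the language codes, and the function name are all assumptions made for illustration. The slides do not describe the actual database schema.

```python
from collections import Counter

# Hypothetical records: (source_lang, source_word, target_lang, target_word).
RECORDS = [
    ("en", "bank", "sv", "bank"),
    ("it", "banca", "sv", "bank"),
    ("en", "shore", "sv", "strand"),
]


def frequent_translations(records, target_lang, min_count):
    """Words in target_lang used as a translation more than min_count times."""
    counts = Counter(
        target_word
        for _src_lang, _src_word, tgt_lang, target_word in records
        if tgt_lang == target_lang
    )
    return {word for word, count in counts.items() if count > min_count}


# The slide's threshold is six; this toy data uses one.
print(frequent_translations(RECORDS, "sv", 1))
```

A word that turns up often as a translation target is a candidate for the target language's own list.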
34
Using the translations database
Find words not in M2 lists that need adding
Multiwords
– English: look for
– Probably the translation of a high-freq word in several of the 8 other lgs
– So: add it to English list
Homonyms: could be similar
35
Monolingual master lists (M3)
Based on a WAC corpus
Input from other same-lg corpora
And from translations from 8 lgs
– Useful words which might not be hi-freq
– Added words/multiwords must be above a lower freq threshold
Target 9000
36
Numbers
Target: 9000 per list
M2 lists
– Estimate: 5000-6000 needed
– We add 3000-4000 multiwords and other 'back-translations'
37
Current status
M1 lists prepared
Lists checked, compared with other lists
– Corpus-based and other
M2 lists prepared
Translation underway
38
Big problems
Multiwords (as anticipated)
Homonymy (as anticipated)
orange banana alphabet elbow, Hello
– Worse than anticipated
– Lists from spoken corpora, learner corpora, needed
Relation between
– Competence for communicating
– The corpora at our disposal
39
Word lists are useful, but
...are they scientific?
– A tiny bit, occasionally
...could they be scientific?
– Yes
  – article of faith
By the end of KELLY, we'll have a clearer idea how
40
And now for something completely different: DANTE
Lexical database for English
– Detailed Accurate Extensive of English
– Highly corpus-driven
3 yr project
18 expert lexicographers
– Led by Sue Atkins
– BNC, FrameNet, Euralex, COBUILD...
English side, New English-Irish dictionary
Available for NLP research imminently