Linguistic evidence within and across languages, word frequency lists and language learning
  Adam Kilgarriff
  Lexical Computing Ltd
  Lexicography MasterClass
  Universities of Leeds and Sussex
Or: Word lists are useful, but are they (could they be) scientific?
Leeds April 2010 Kilgarriff: KELLY
KELLY
  EU lifelong learning project
  Goal: wordcards
    – Word in one language on one side, other on other
  Language learning
  9 languages, 36 pairs
    – Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
  Partners (incl Leeds) in 6 countries
    – Leeds does Arabic, Chinese, Russian
Method
  Prepare monolingual lists
  Translate
    – Each into 8 target languages
    – Professional translation services
  Integrate, finalise
  Produce cards
  Goal for each set: 9000 pairs at 6 levels
(Monolingual) Word Lists
  Define a syllabus
  Which words get used in
    – Learning-to-read books (native-speaker, NS, children)
    – Non-native-speaker (NNS) language learner textbooks
    – Dictionaries
  Language testing
    – NS: educational psychologists
    – NNS: proficiency levels
Should be corpus-based
  Most aren't
    – Corpora are quite new
  Easy to do better
  People will use them
    – Maybe also governments
How
  Take your corpus
  Count
  Voilà
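
The "take your corpus, count" step can be sketched in a few lines. This is a minimal illustration, not the KELLY tooling: the function name `frequency_list` and the letters-only tokenisation are assumptions made for the example.

```python
from collections import Counter
import re

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Naive tokenisation (an assumption): lowercase, then take runs of a-z.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

sample = "The cat sat on the mat. The dog sat too."
freq = frequency_list(sample)
print(freq[:2])
```

Every choice hidden in the tokenising line (what counts as a word, whether case matters) is one of the complications the following slides take up.
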
Complications
  What is a word
  Words and lemmas
  Grammatical classes
  Numbers, names...
  Multiwords
  Homonymy
  All are slightly different issues for each language
What is a word; delimiters
  Found between spaces
  Not for Chinese: segmentation
  English
    – co-operate, widely-held, farmer's, can't
  Norwegian, Swedish
    – Compounding, separable verbs
  Arabic, Italian
    – Clitics, al, ...
  ...
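
For the English examples, the difference between "found between spaces" and a tokenising policy can be shown with a small sketch. The regular expression below is an illustrative assumption (it keeps word-internal hyphens and apostrophes inside one token); whether farmer's should be one item or two is exactly the kind of decision a project has to make for itself.

```python
import re

# Assumed policy: letters, optionally joined by internal hyphens or apostrophes.
TOKEN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

def split_on_spaces(text):
    """'Found between spaces': punctuation stays glued to the word."""
    return text.split()

def tokenise(text):
    """Regex tokens: punctuation dropped, co-operate and can't kept whole."""
    return TOKEN.findall(text)

sentence = "The farmer's widely-held view: we can't co-operate."
space_tokens = split_on_spaces(sentence)
regex_tokens = tokenise(sentence)
print(space_tokens)
print(regex_tokens)
```

Neither function helps with Chinese segmentation, compounding or clitics, which need language-specific tools.
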
Words and lemmas
  Word form (in text)
    – invading
  Lemma (dictionary headword)
    – invade, for the forms invade, invades, invaded, invading
  Lemmatisation
    – Chinese: none; English: simple
    – Middling: Swedish, Norwegian, Italian, Greek
    – Tough: Russian, Polish, Arabic
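
Moving from word-form counts to lemma counts amounts to summing the forms under their headword. A minimal sketch, assuming a hand-made lookup table (`LEMMA_OF` is hypothetical; a real project would use a lemmatiser for the language) and invented counts:

```python
from collections import Counter

# Hypothetical form-to-lemma table, standing in for a real lemmatiser.
LEMMA_OF = {
    "invade": "invade", "invades": "invade",
    "invaded": "invade", "invading": "invade",
}

def lemma_frequencies(form_counts, lemma_of):
    """Sum word-form counts under their lemma; unknown forms map to themselves."""
    totals = Counter()
    for form, count in form_counts.items():
        totals[lemma_of.get(form, form)] += count
    return totals

# Invented counts, for illustration only.
form_counts = {"invade": 4, "invades": 1, "invaded": 7, "invading": 3, "army": 5}
lemma_counts = lemma_frequencies(form_counts, LEMMA_OF)
print(lemma_counts.most_common())
```
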
Grammatical classes
  brush (verb) and brush (noun)
    – Same item or different?
    – Proposal: lempos
  Recommendation: different
    – With trepidation
    – Chinese: weak sense of noun, verb
  Required: (short) list of word classes for each language
    – Same for all unless good reason
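
Counting by lempos (lemma plus part of speech) rather than by lemma keeps brush the noun and brush the verb apart. The sketch below assumes tagged input as (lemma, word class) pairs and a key format like `brush-n`; both the data and the key format are illustrative assumptions.

```python
from collections import Counter

# Hypothetical tagger output as (lemma, word class) pairs.
tagged = [
    ("brush", "verb"), ("brush", "noun"), ("brush", "noun"),
    ("paint", "verb"), ("brush", "verb"), ("brush", "noun"),
]

def lempos(lemma, word_class):
    """Build a key such as 'brush-n' from a lemma and its word class."""
    return f"{lemma}-{word_class[0]}"

# One entry per lemma versus one entry per lemma-and-class combination.
by_lemma = Counter(lemma for lemma, _ in tagged)
by_lempos = Counter(lempos(lemma, wc) for lemma, wc in tagged)
print(by_lemma)
print(by_lempos)
```

Using the first letter of the class only works while the class list is short and the initials are distinct, which fits the requirement for a short, shared list of word classes.
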
Marginal cases
  Numbers
    – twelve, seventeenth, fifties
  Closed sets
    – Days of week, months
  Countries
    – Capitals, nationalities, currencies, adjectives, languages
  Regional/dialects, political groups, religions
    – easter, christmas, islam, republican
  Consistency before freq: policies needed
Multiwords
  According to
    – Linguistically a word but
  Multiword frequency list: top item is 'of the'
    – Can't use freqs (alone) to select multiwords
  Base list
    – Recommendation: no multiwords
    – But see below
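
Why raw frequency cannot select multiwords is easy to see on a toy text: a plain two-word count puts a grammatical sequence like "of the" above a genuine unit like "according to". The text below is invented for the illustration.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Invented toy text, already lowercased and free of punctuation.
text = ("according to the head of the school most of the work of the year "
        "is done by the end of the term according to plan")
bigrams = bigram_counts(text.split())
print(bigrams.most_common(2))
```

Selecting multiwords therefore needs evidence beyond frequency, for example the translation evidence discussed later in the talk.
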
Homonymy
  bank (river) and bank (money)
  Word sense disambiguation
    – We can't do (with decent accuracy)
    – We can't give freqs for senses
  Lists of words, not meanings
    – Sometimes disconcerting
  See also below
Corpora
  A fairly arbitrary sample of a language
  To limit arbitrariness of the word list
    – Make it big and diverse
  WACKY corpora
    – From web
    – Can do for any language
  Web language: less formal
    – not mainly 'reporting' or fiction, cf news, BNC
    – Good for language learners
Comparing corpora
  Corpora: new
    – We are all beginners
  Best way to get sense of a corpus
    – Compare with another
    – Keywords of each vs. other
  Case studies
  Sketch Engine functions
Comparing frequency lists
  Web1T
    – Present from Google
    – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    – that's 1,000,000,000,000
  Compare with BNC
    – Take top 50,000 items of each
    – 105 Web1T words not in BNC top 50k
    – 50 words with highest Web1T:BNC ratio
    – 50 words with lowest ratio
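
The comparison can be sketched as follows: take the top n items of each list, report the items in one top list but not the other, and rank the shared items by the ratio of their per-million frequencies. The function names, the restriction of ratios to shared items, and the tiny counts are assumptions for illustration; they are not the Web1T or BNC figures.

```python
def per_million(counts, corpus_size):
    """Normalise raw counts so corpora of different sizes are comparable."""
    return {w: c * 1_000_000 / corpus_size for w, c in counts.items()}

def top_items(counts, n):
    """The n most frequent words, ties broken alphabetically."""
    return sorted(counts, key=lambda w: (-counts[w], w))[:n]

def compare(counts_a, size_a, counts_b, size_b, n):
    top_a, top_b = top_items(counts_a, n), top_items(counts_b, n)
    in_b = set(top_b)
    only_a = [w for w in top_a if w not in in_b]
    shared = [w for w in top_a if w in in_b]
    fpm_a, fpm_b = per_million(counts_a, size_a), per_million(counts_b, size_b)
    ratios = {w: fpm_a[w] / fpm_b[w] for w in shared}
    # Highest A:B ratio first; the tail of this list holds the lowest ratios.
    by_ratio = sorted(shared, key=lambda w: (-ratios[w], w))
    return only_a, by_ratio, ratios

# Invented toy counts standing in for a web corpus and a reference corpus.
web = {"the": 500, "browser": 40, "url": 30, "she": 10}
ref = {"the": 600, "she": 60, "council": 20, "browser": 1}
only_web, by_ratio, ratios = compare(web, 20_000, ref, 10_000, n=4)
print(only_web, by_ratio)
```
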
Web-high (155 terms)
  61 web and computing
    – config browser spyware url www forum
  38 porn
  22 US English (incl Spanish influence: los)
  18 business/products common on web
    – poker viagra lingerie ringtone dvd casino rental collectible tiffany
    – NB: BNC is old
  4 legal
    – trademarks pursuant accordance herein
Web-low
  Exclude British English, transcription/tokenisation anomalies
    – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Observations
  Pronouns and past tense verbs
    – Fiction
  Masc vs fem
  Yesterday
    – Probably daily newspapers
  Constancy of ratios:
    – He/him/himself
    – She/her/herself
Corpus Factory
  Many languages
  General corpus, 100m+ words
  Fast
  High quality
  Comparable across languages
Gather seed words
  Wikipedia (Wiki) corpora
    – many domains
    – free
    – 265 languages covered, more to come
  Extract text from Wiki
    – Wikipedia 2 Text
  Tokenise the text
    – Morphology of the language is important
    – Can use existing word tokeniser tools
Web Corpus Statistics
Evaluation
  For each of the languages, two corpora available: Web and Wiki
  Dutch: also a carefully designed lexicographic corpus
  Hypothesis: Wiki corpora are ‘informational’
    – Informational --> typical written
    – Interactional --> typical spoken
Evaluation
  1st, 2nd person pronouns: strong indicators of interactional language
    – English: I me my mine you your yours we us our
  For each language
    – Ratio: web:wiki
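
The web:wiki ratio for English can be sketched as the per-million rate of the listed pronouns in one corpus divided by the rate in the other. The function names and the two toy "corpora" are assumptions; only the pronoun list comes from the slide.

```python
# Pronoun list from the slide, lowercased.
PRONOUNS = {"i", "me", "my", "mine", "you", "your", "yours", "we", "us", "our"}

def pronoun_rate(tokens):
    """First and second person pronouns per million tokens."""
    hits = sum(1 for t in tokens if t.lower() in PRONOUNS)
    return hits * 1_000_000 / len(tokens)

def web_wiki_ratio(web_tokens, wiki_tokens):
    """Ratio above 1 suggests the web corpus is the more interactional one."""
    wiki_rate = pronoun_rate(wiki_tokens)
    if wiki_rate == 0:
        raise ValueError("no pronouns in the wiki sample; ratio undefined")
    return pronoun_rate(web_tokens) / wiki_rate

# Invented toy samples, 11 tokens each.
web_tokens = "I think you will love our new forum so tell us".split()
wiki_tokens = "The city is the capital and we know its history well".split()
ratio = web_wiki_ratio(web_tokens, wiki_tokens)
print(round(ratio, 2))
```

For other languages the pronoun set would need to be replaced, and for languages that mark person on the verb a pronoun count alone may understate the difference.
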
Results
Stages
  Sort out corpora, tagging
  Automatically generate M1 lists
  names, numbers, countries...
  keywords vis-a-vis other corpora
  Review, prepare M2 lists
  Translate
Review: how?
  Points system
    – 2 points for each of 6 levels
    – 12 points for most freq words
    – deduct points for words in over-represented areas
    – add in words from other corpora
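
One possible reading of the points system is sketched below. Only the "2 points per level, 12 for the most frequent" scale comes from the slide; the even band size of 1500 (9000 words over 6 levels), the size of the deduction, and the treatment of other corpora as a bonus are all assumptions made for the illustration, as are the function and parameter names.

```python
LEVELS = 6
POINTS_PER_LEVEL = 2
BAND_SIZE = 1500  # assumed: 9000 target words split evenly over 6 levels

def base_points(rank):
    """12 points for the top frequency band down to 2 for the sixth; 0 beyond.

    rank is the 1-based position in the corpus frequency list.
    """
    band = (rank - 1) // BAND_SIZE
    if band >= LEVELS:
        return 0
    return POINTS_PER_LEVEL * (LEVELS - band)

def review_points(rank, overrepresented=False, other_corpora_hits=0,
                  penalty=2, bonus=1):
    """Adjust base points; penalty and bonus sizes are hypothetical."""
    points = base_points(rank)
    if overrepresented:
        points -= penalty          # word comes from an over-represented area
    points += bonus * other_corpora_hits  # support from other corpora
    return max(points, 0)

print(base_points(1), base_points(1501), base_points(9000))
print(review_points(1501, overrepresented=True))
```
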
Translation database
  On the web
  All translations entered into it
  Queries like
    – All Swedish words used as translations more than six times
    – All 1:1:1:1... 'simple cases'
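
The two example queries can be sketched over an in-memory list of translation records. The record layout (source language, source word, target language, target word), the function names, and the handful of toy records are assumptions; the real database schema is not described in the slides. The 1:1 check is shown for a single language pair; the 1:1:1:1... case would repeat it across all pairs.

```python
from collections import Counter, defaultdict

def frequent_targets(records, target_lang, min_uses):
    """Words of target_lang used as a translation more than min_uses times."""
    uses = Counter(tw for _, _, tl, tw in records if tl == target_lang)
    return sorted(w for w, n in uses.items() if n > min_uses)

def one_to_one_pairs(records, lang_a, lang_b):
    """Pairs (a, b) where a is only ever linked to b and b only to a."""
    partners_a, partners_b = defaultdict(set), defaultdict(set)
    for src_lang, src_word, tgt_lang, tgt_word in records:
        if (src_lang, tgt_lang) == (lang_a, lang_b):
            a, b = src_word, tgt_word
        elif (src_lang, tgt_lang) == (lang_b, lang_a):
            a, b = tgt_word, src_word
        else:
            continue
        partners_a[a].add(b)
        partners_b[b].add(a)
    pairs = []
    for a, bs in partners_a.items():
        if len(bs) == 1:
            (b,) = bs
            if len(partners_b[b]) == 1:
                pairs.append((a, b))
    return sorted(pairs)

# Toy records, invented for illustration.
records = [
    ("en", "dog", "sv", "hund"), ("sv", "hund", "en", "dog"),
    ("en", "get", "sv", "få"), ("en", "receive", "sv", "få"),
    ("en", "obtain", "sv", "få"), ("sv", "få", "en", "get"),
]
busy_swedish = frequent_targets(records, "sv", min_uses=2)  # slide uses six
simple_cases = one_to_one_pairs(records, "en", "sv")
print(busy_swedish, simple_cases)
```
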
Translations
  Usually, of texts
    – Words in context
  Kelly: no context
    – Usual principles don't apply
  Instructions to translators
Using the database
  Find words not in M2 lists that need adding
  Multiwords
    – English 'look for'
    – Probably the translation of a high-freq word in several of the 8 other languages
    – So: add it to English list
  Homonyms: could be similar
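
Finding candidates such as 'look for' can be sketched as: collect every English translation that is not already on the English list, and keep those that several source languages arrive at independently. The record layout, the function name, the threshold and the toy records are assumptions for the illustration.

```python
from collections import defaultdict

def back_translation_candidates(records, target_lang, current_list, min_langs):
    """Target-language items missing from current_list that are translations
    of words from at least min_langs different source languages."""
    sources = defaultdict(set)
    for src_lang, _, tgt_lang, tgt_word in records:
        if tgt_lang == target_lang and tgt_word not in current_list:
            sources[tgt_word].add(src_lang)
    return sorted(w for w, langs in sources.items() if len(langs) >= min_langs)

# Toy records (source language, source word, target language, target word).
records = [
    ("it", "cercare", "en", "look for"),
    ("sv", "söka", "en", "look for"),
    ("pl", "szukać", "en", "look for"),
    ("it", "cane", "en", "dog"),
    ("sv", "hund", "en", "dog"),
    ("pl", "pies", "en", "dog"),
]
english_list = {"look", "for", "dog"}
candidates = back_translation_candidates(records, "en", english_list, min_langs=3)
print(candidates)
```

The same query shape could surface homonyms: a single word on the list that receives clearly unrelated source words from several languages.
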
Monolingual master lists (M3)
  Based on a WAC corpus
  Input from other same-language corpora
  And from translations from 8 languages
    – Useful words which might not be high-freq
    – Added words/multiwords must be above a lower freq threshold
  Target: 9000
  Important contribution
Numbers
  Target: 9000 per list
  M2 lists
  Estimate: needed
  We add multiwords and other 'back-translations'
From M3 lists to T2 lists
Current status
  M1 lists prepared
  Lists checked, compared with other lists
    – Corpus-based and other
  M2 lists prepared
  Translation underway
Big problems
  Multiwords (as anticipated)
  Homonymy (as anticipated)
  orange banana alphabet elbow, Hello
    – Worse than anticipated
    – Lists from spoken corpora, learner corpora, needed
  Relation between
    – Competence for communicating
    – The corpora at our disposal
Word lists are useful, but
  ...are they scientific?
    – A tiny bit, occasionally
  ...could they be scientific?
    – Yes
    – article of faith
  By the end of KELLY, we'll have a clearer idea how