Linguistic evidence within and across languages, word frequency lists and language learning. Adam Kilgarriff, Lexical Computing Ltd, Lexicography MasterClass.


1 Linguistic evidence within and across languages, word frequency lists and language learning
Adam Kilgarriff
Lexical Computing Ltd, Lexicography MasterClass, Universities of Leeds and Sussex

2 Linguistic evidence within and across languages, word frequency lists and language learning
Or: word lists are useful, but are they (could they be) scientific?

3 KELLY
- EU lifelong learning project
- Goal: wordcards
  - Word in one language on one side, its translation on the other
  - Language learning
- 9 languages, 36 pairs
  - Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
- Partners (including Leeds) in 6 countries
  - Leeds does Arabic, Chinese, Russian
(Leeds, April 2010. Kilgarriff: KELLY)
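
The figure of 36 pairs follows from choosing 2 languages out of 9. A minimal sketch of that arithmetic, assuming pairs are unordered; the count of 72 directed translation jobs (each list into 8 target languages) is derived from the Method slide, and the variable names are illustrative only:

```python
from itertools import combinations, permutations

LANGUAGES = [
    "Arabic", "Chinese", "English", "Greek", "Italian",
    "Norwegian", "Polish", "Russian", "Swedish",
]

# Unordered language pairs: one wordcard set per pair.
pairs = list(combinations(LANGUAGES, 2))

# Ordered (source, target) pairs: each monolingual list goes into 8 targets.
directed = list(permutations(LANGUAGES, 2))

print(len(pairs), len(directed))
```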

4 Method
- Prepare monolingual lists
- Translate
  - Each into 8 target languages
  - Professional translation services
- Integrate, finalise
- Produce cards
- Goal for each set: 9000 pairs at 6 levels

5 (Monolingual) word lists
- Define a syllabus
- Which words get used in
  - Learning-to-read books (native-speaker, NS, children)
  - Non-native-speaker (NNS) language learner textbooks
  - Dictionaries
  - Language testing
- NS: educational psychologists
- NNS: proficiency levels

6 Should be corpus-based
- Most aren't
  - Corpora are quite new
- Easy to do better
- People will use them
  - Maybe also governments

7 How
- Take your corpus
- Count
- Voilà
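
The three steps on this slide can be sketched in a few lines. This is only an illustration, not the KELLY pipeline: the regular-expression tokeniser and the sample sentence are assumptions, and the following slides explain why real corpora need more care.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumed tokeniser: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

sample = "The cat sat on the mat. The dog sat too."
freq = frequency_list(sample)
print(freq[:2])
```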

8 Complications
- What is a word
- Words and lemmas
- Grammatical classes
- Numbers, names...
- Multiwords
- Homonymy
All are slightly different issues for each language

9 What is a word; delimiters
- Found between spaces
  - Not for Chinese: segmentation
- English
  - co-operate, widely-held, farmer's, can't
- Norwegian, Swedish
  - Compounding, separable verbs
- Arabic, Italian
  - Clitics, al, ...
- ...

10 Words and lemmas
- Word form (in text)
  - invading
- Lemma (dictionary headword)
  - invade, for the forms invade, invades, invaded, invading
- Lemmatisation
  - Chinese: none; English: simple
  - Middling: Swedish, Norwegian, Italian, Greek
  - Tough: Russian, Polish, Arabic
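
To make the word form versus lemma distinction concrete, here is a small sketch that folds word forms into lemma counts. The lookup table is a hand-made assumption covering only the slide's example; a real project would use a lemmatiser for each language.

```python
from collections import Counter

# Hypothetical form-to-lemma table, limited to the slide's example.
LEMMA_OF = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_counts(tokens):
    """Count lemmas; forms missing from the table count as their own lemma."""
    return Counter(LEMMA_OF.get(t.lower(), t.lower()) for t in tokens)

tokens = "They invaded once and are invading again".split()
counts = lemma_counts(tokens)
print(counts["invade"])
```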

11 Grammatical classes
- brush (verb) and brush (noun)
  - Same item or different?
  - Proposal: lempos
- Recommendation: different
  - With trepidation
- Chinese: weak sense of noun, verb
- Required: (short) list of word classes for each language
- Same for all unless good reason
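
Counting by lempos means the counting key is the lemma together with its word class, so brush (noun) and brush (verb) get separate entries. A minimal sketch, assuming the input is already lemmatised and tagged; the `lemma-pos` key format and the one-letter class codes are illustrative choices, not something the slides specify.

```python
from collections import Counter

def lempos_counts(tagged_tokens):
    """Count (lemma, pos) pairs, keyed as 'lemma-pos'."""
    return Counter(f"{lemma}-{pos}" for lemma, pos in tagged_tokens)

# Assumed tagger output: (lemma, word class) per token.
tagged = [("brush", "n"), ("brush", "v"), ("brush", "v"), ("tooth", "n")]
counts = lempos_counts(tagged)
print(counts.most_common())
```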

12 Marginal cases
- Numbers
  - twelve, seventeenth, fifties
- Closed sets
  - Days of week, months
- Countries
  - Capitals, nationalities, currencies, adjectives, languages
- Regional/dialects, political groups, religions
  - easter, christmas, islam, republican
- Consistency before frequency: policies needed

13 Multiwords
- "According to"
  - Linguistically a word, but
- Multiword frequency list: top item is "of the"
  - Can't use frequencies (alone) to select multiwords
- Base list
  - Recommendation: no multiwords
  - But see below
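
The point about "of the" is easy to reproduce: a raw two-word frequency list is dominated by sequences of function words rather than by lexical units such as "according to". A small sketch on an invented sentence (the sample text is an assumption, chosen only to show the effect):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "the end of the road according to the map of the town".split()
counts = bigram_counts(tokens)
top = counts.most_common(1)[0]
print(top, counts[("according", "to")])
```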

14 Homonymy
- bank (river) and bank (money)
- Word sense disambiguation
  - We can't do it (with decent accuracy)
  - We can't give frequencies for senses
- Lists of words, not meanings
  - Sometimes disconcerting
  - See also below

15 Corpora
- A fairly arbitrary sample of a language
- To limit arbitrariness of the word list
  - Make it big and diverse
- WACKY corpora
  - From the web
  - Can do for any language
  - Web language: less formal
- Not mainly 'reporting' or fiction, cf. news, BNC
- Good for language learners

16 Comparing corpora
- Corpora: new
- We are all beginners
- Best way to get a sense of a corpus
  - Compare with another
  - Keywords of each vs. other
- Case studies
- Sketch Engine functions

17 Comparing frequency lists
- Web1T
  - Present from Google
  - All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
  - That's 1,000,000,000,000
- Compare with BNC
  - Take top 50,000 items of each
  - 105 Web1T words not in BNC top 50k
  - 50 words with highest Web1T:BNC ratio
  - 50 words with lowest ratio
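
The comparison on this slide can be sketched as follows. This is a toy illustration under stated assumptions: the counts are invented, the ratio is taken over per-million relative frequencies (the slide does not say how the ratio was normalised), and `compare` and `per_million` are hypothetical helper names.

```python
def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}

def compare(counts_a, counts_b, top_n):
    """Return (words only in A's top_n, shared words by A:B ratio, highest first)."""
    top_a = set(sorted(counts_a, key=counts_a.get, reverse=True)[:top_n])
    top_b = set(sorted(counts_b, key=counts_b.get, reverse=True)[:top_n])
    rel_a, rel_b = per_million(counts_a), per_million(counts_b)
    only_a = sorted(top_a - top_b)
    shared = sorted(top_a & top_b, key=lambda w: rel_a[w] / rel_b[w], reverse=True)
    return only_a, shared

# Invented counts standing in for the Web1T and BNC lists.
web = {"the": 500, "browser": 300, "she": 50, "url": 150}
bnc = {"the": 600, "she": 300, "browser": 10, "council": 90}
only_web, by_ratio = compare(web, bnc, top_n=4)
print(only_web, by_ratio)
```

With real lists, `top_n` would be 50,000, and the head and tail of `by_ratio` would give the web-high and web-low candidates.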

18 Web-high (155 terms)
- 61 web and computing
  - config browser spyware url www forum
- 38 porn
- 22 US English (incl. Spanish influence: los)
- 18 business/products common on web
  - poker viagra lingerie ringtone dvd casino rental collectible tiffany
  - NB: BNC is old
- 4 legal
  - trademarks pursuant accordance herein

19 Web-low
- Exclude British English, transcription/tokenisation anomalies
  - herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

20 Observations
- Pronouns and past tense verbs
  - Fiction
- Masc vs fem
- Yesterday
  - Probably daily newspapers
- Constancy of ratios:
  - he/him/himself
  - she/her/herself

21 Corpus Factory
- Many languages
- General corpus, 100m+ words
  - Fast
  - High quality
  - Comparable across languages

22 Gather seed words
- Wikipedia (Wiki) corpora
  - many domains
  - free
  - 265 languages covered, more to come
- Extract text from Wiki
  - Wikipedia 2 Text
- Tokenise the text
  - Morphology of the language is important
  - Can use existing word tokeniser tools

23 Web Corpus Statistics

24 Evaluation
- For each of the languages, two corpora available:
  - Web and Wiki
  - Dutch: also a carefully designed lexicographic corpus
- Hypothesis: Wiki corpora are 'informational'
  - Informational --> typical written
  - Interactional --> typical spoken

25 Evaluation
- 1st and 2nd person pronouns
  - Strong indicators of interactional language
  - English: I me my mine you your yours we us our
- For each language
  - Ratio: web:wiki
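
The web:wiki ratio can be sketched for English using the pronoun list from the slide. The two token lists are invented, and normalising by corpus size before dividing is an assumption about how the ratio was computed. The function would fail with a division by zero if the wiki sample contained none of the pronouns.

```python
# First and second person pronouns, as listed on the slide.
PRONOUNS = {"i", "me", "my", "mine", "you", "your", "yours", "we", "us", "our"}

def pronoun_rate(tokens):
    """Share of tokens that are 1st or 2nd person pronouns."""
    hits = sum(1 for t in tokens if t.lower() in PRONOUNS)
    return hits / len(tokens)

def web_wiki_ratio(web_tokens, wiki_tokens):
    """Ratio of pronoun rates; above 1 means the web corpus is more interactional."""
    return pronoun_rate(web_tokens) / pronoun_rate(wiki_tokens)

# Invented samples: 8 web tokens with 3 pronouns, 10 wiki tokens with 1.
web = "I think you will love our new site".split()
wiki = "The city lies where we expect the old river crossing".split()
ratio = web_wiki_ratio(web, wiki)
print(round(ratio, 2))
```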

26 Results


28 Stages
- Sort out corpora, tagging
- Automatically generate M1 lists
  - names, numbers, countries...
  - keywords vis-a-vis other corpora
- Review, prepare M2 lists
- Translate

29 Review - how?
- Points system
  - 2 points for each of 6 levels
  - 12 points for most frequent words
- Deduct points for words in over-represented areas
- Add in words from other corpora
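
One possible reading of the points system is that words in the most frequent of the six levels start at 12 points and each later level starts 2 points lower, with a deduction for words from over-represented areas. The sketch below follows that reading only; the slides do not spell out the scheme, and the size of the deduction (`penalty`) is a hypothetical value.

```python
def level_points(level, n_levels=6, step=2):
    """Level 1 (most frequent) scores n_levels * step; each later level scores step fewer."""
    if not 1 <= level <= n_levels:
        raise ValueError("level must be between 1 and n_levels")
    return (n_levels - level + 1) * step

def review_score(level, overrepresented=False, penalty=4):
    """Points for a word after an optional deduction (penalty value is assumed)."""
    return level_points(level) - (penalty if overrepresented else 0)

print([level_points(k) for k in range(1, 7)])
print(review_score(1, overrepresented=True))
```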

30 Translation database
- On the web
- All translations entered into it
- Queries like
  - All Swedish words used as translations more than six times
  - All 1:1:1:1... 'simple cases'
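
The first example query can be sketched over an in-memory list of translation records. The record layout, the function name, and the sample rows are all assumptions for illustration; the actual database schema is not described in the slides.

```python
from collections import Counter

def overused_targets(rows, target_lang, min_uses):
    """Target-language words used as a translation more than min_uses times."""
    uses = Counter(tgt for _, _, lang, tgt in rows if lang == target_lang)
    return sorted(word for word, n in uses.items() if n > min_uses)

# Assumed record layout: (source language, source word, target language, target word).
rows = [
    ("en", "get", "sv", "få"),
    ("en", "receive", "sv", "få"),
    ("it", "ricevere", "sv", "få"),
    ("en", "dog", "sv", "hund"),
]
# The slide's query would use min_uses=6; 2 is used here to suit the tiny sample.
result = overused_targets(rows, "sv", min_uses=2)
print(result)
```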

31 Translations
- Usually, of texts
  - Words in context
- Kelly: no context
  - Usual principles don't apply
  - Instructions to translators

32 Using the database
- Find words not in M2 lists that need adding
- Multiwords
  - English "look for"
  - Probably the translation of a high-frequency word in several of the 8 other languages
  - So: add it to the English list
- Homonyms: could be similar

33 Monolingual master lists (M3)
- Based on a WAC corpus
- Input from other same-language corpora
- And from translations from 8 languages
  - Useful words which might not be high-frequency
  - Added words/multiwords must be above a lower frequency threshold
- Target: 9000
- Important contribution

34 Numbers
- Target: 9000 per list
- M2 lists
  - Estimate: 5000-6000 needed
  - We add 3000-4000 multiwords and other 'back-translations'

35 From M3 lists to T2 lists

36 Current status
- M1 lists prepared
- Lists checked, compared with other lists
  - Corpus-based and other
- M2 lists prepared
- Translation underway

37 Big problems
- Multiwords (as anticipated)
- Homonymy (as anticipated)
- orange, banana, alphabet, elbow, Hello
  - Worse than anticipated
  - Lists from spoken corpora, learner corpora, needed
- Relation between
  - Competence for communicating
  - The corpora at our disposal

38 Word lists are useful, but
- ...are they scientific?
  - A tiny bit, occasionally
- ...could they be scientific?
  - Yes
  - Article of faith
- By the end of KELLY, we'll have a clearer idea how

39 http://forbetterenglish.com

