Terminology, translation, and PRESEMT; word frequency lists and KELLY. Adam Kilgarriff, Lexical Computing Ltd. SKEW-2, March 2011.


1 Terminology, translation, and PRESEMT; word frequency lists and KELLY
Adam Kilgarriff, Lexical Computing Ltd
SKEW-2, March 2011 Kilgarriff: PRESEMT and KELLY

2 PRESEMT
EU FP7 project FP7-ICT-4-248307, 2010-2012
Pattern Recognition based Statistically Enhanced MT
Six partners, five countries
Languages: Czech, English, German, Greek, Italian
Comparable Corpora BootCat (CCBC)
– Demo by Jan Pomikalek
http://www.presemt.eu

3 KELLY
“Keywords for Language Learning for Young and adults alike”
EU lifelong learning project:
– Goal: wordcards
  Word in one language on one side, the other language on the other
  Language learning
– 9 languages, 36 pairs
  Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
– Partners in 6 countries
http://su.avedas.com/converis/contract/321

5 Method
Prepare monolingual lists
Translate
– Each into 8 target languages
– Professional translation services
Integrate, finalise
Produce cards
Goal for each set
– 9000 pairs at 6 levels

6 Stages
Sort out corpora, tagging
Automatically generate M1 lists
– names, numbers, countries...
– keywords vis-a-vis other corpora
Review, compare, prepare M2 lists
Translate
Use translations: M3 lists
Finalise
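The slide does not say how "keywords vis-a-vis other corpora" were scored. A minimal sketch of one common choice, a ratio of per-million frequencies with an additive smoothing constant, is given here as an assumption rather than as the project's actual method. The function name and the default constant are illustrative.

```python
def keyness(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Score how typical a word is of the focus corpus relative to a reference corpus.

    Assumed scoring (not stated on the slide): ratio of frequencies per
    million tokens, with a constant n added to both sides so that words
    absent from the reference corpus do not cause division by zero.
    """
    fpm_focus = focus_count * 1_000_000 / focus_size
    fpm_ref = ref_count * 1_000_000 / ref_size
    return (fpm_focus + n) / (fpm_ref + n)


# Invented counts: 50 hits per million in the focus corpus, 5 in the reference.
print(keyness(50, 1_000_000, 5, 1_000_000))  # 8.5
```

A score well above 1 marks a word as over-represented in the focus corpus; raising n shifts attention towards higher-frequency words.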

7 Review: how?
Points system
– 2 points for each of 6 levels
– 12 points for most frequent words
Deduct points for words in over-represented areas
Add in words from other corpora

8 Translation database
On the web
All translations entered into it
Queries like
– All Swedish words used as translations more than six times
– All 1:1:1:1... 'simple cases'
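The first query above can be sketched in a few lines. The row layout (source language, source word, target language, target word), the function name and the toy rows are all assumptions for illustration; the slide does not describe the real database schema.

```python
from collections import Counter


def frequent_targets(rows, target_lang, min_uses):
    """Return target-language words used as a translation more than min_uses times.

    rows: iterable of (source_lang, source_word, target_lang, target_word)
    tuples. This layout is hypothetical.
    """
    counts = Counter(tw for (_sl, _sw, tl, tw) in rows if tl == target_lang)
    return sorted(word for word, uses in counts.items() if uses > min_uses)


# Invented toy data; the slide's threshold is six, lowered to two here.
rows = [
    ("en", "good", "sv", "bra"),
    ("en", "fine", "sv", "bra"),
    ("en", "well", "sv", "bra"),
    ("en", "music", "sv", "musik"),
]
print(frequent_targets(rows, "sv", 2))  # ['bra']
```

A target word that absorbs many source words is a likely candidate for review, since it may be covering several distinct meanings.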

9 Using the translations database
Find words not in M2 lists that need adding
– Multiwords
– English "look for"
– Probably the translation of a high-frequency word in several of the 8 other languages
– So: add it to the English list
– Homonyms: could be similar

10 Monolingual master lists (M3)
Based on a WAC corpus
Input from other same-language corpora
And from translations from 8 languages
– Useful words which might not be high-frequency
Added words/multiwords must be above a lower frequency threshold
Target: 9000

11 Matches across 9 languages
Set of symmetrical relations across all 36 pairs
– music
– library
– sun
– hospital
– theory
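The figure of 36 is the number of unordered pairs among 9 languages, and a "symmetrical" match can be read as each word being the sole translation of the other. A minimal sketch follows; the dictionary layout and the function name are assumptions, and the toy entries are invented.

```python
from itertools import combinations

# The nine KELLY languages, as listed on slide 3.
LANGS = ["Arabic", "Chinese", "English", "Greek", "Italian",
         "Norwegian", "Polish", "Russian", "Swedish"]
PAIRS = list(combinations(LANGS, 2))
print(len(PAIRS))  # 36


def is_symmetric(a_to_b, b_to_a, word_a, word_b):
    """True if word_a translates only to word_b and word_b only to word_a.

    a_to_b and b_to_a map a word to the set of its translations
    (a hypothetical layout, not the project's schema).
    """
    return a_to_b.get(word_a) == {word_b} and b_to_a.get(word_b) == {word_a}


# Toy English/Swedish entries for one of the words on the slide.
print(is_symmetric({"music": {"musik"}}, {"musik": {"music"}}, "music", "musik"))  # True
```

A word counts as a match across all 9 languages only if this check holds for every one of the 36 pairs.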

12 Big problems
Multiwords (as anticipated)
Homonymy (as anticipated)
orange, banana, alphabet, elbow, Hello
– Worse than anticipated
– Lists from spoken corpora, learner corpora, needed
– Relation between competence for communicating and the corpora at our disposal

13 (Monolingual) Word Lists
Define a syllabus
Which words get used in
– Learning-to-read books (NS children)
– NNS language learner textbooks
– Dictionaries
– Language testing
NS: educational psychologists
NNS: proficiency levels

14 Should be corpus-based
Most aren't
– Corpora are quite new
Easy to do better
People will use them
– Maybe also governments

15 How
Take your corpus
Count
Voilà
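The "take your corpus, count" recipe can be sketched as follows. The tokenisation rule (lowercased runs of ASCII letters and apostrophes) is a deliberately naive assumption, and the function name is illustrative; the complications on the following slides are exactly what such a sketch ignores.

```python
import re
from collections import Counter


def frequency_list(text, top_n=10):
    """Return the top_n most frequent word forms in text with their counts."""
    # Naive tokenisation (an assumption): lowercase, then take runs of
    # ASCII letters and apostrophes as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)


sample = "the cat sat on the mat and the dog sat too"
print(frequency_list(sample, 2))  # [('the', 3), ('sat', 2)]
```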

16 Complications
What is a word
Words and lemmas
Grammatical classes
Numbers, names...
Multiwords
Homonymy
All are slightly different issues for each language

17 What is a word; delimiters
Found between spaces
– Not for Chinese: segmentation
English
– co-operate, widely-held, farmer's, can't
Norwegian, Swedish
– Compounding, separable verbs
Arabic, Italian
– Clitics, al, ...
...
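A short illustration of why "found between spaces" is not enough, assuming plain whitespace splitting as the baseline. The function name and example sentences are invented for illustration.

```python
def whitespace_tokens(text):
    """Split on whitespace only: the 'found between spaces' definition."""
    return text.split()


# English: punctuation stays attached, and hyphenated or apostrophe
# forms come out as single tokens whether or not that is wanted.
print(whitespace_tokens("The farmer's widely-held view: they can't co-operate."))
# ['The', "farmer's", 'widely-held', 'view:', 'they', "can't", 'co-operate.']

# Chinese: no spaces between words, so the whole sentence is one "word".
print(len(whitespace_tokens("我喜欢音乐")))  # 1
```

Each language therefore needs its own policy, and Chinese needs a separate segmentation step before any counting can start.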

18 Words and lemmas
Word form (in text)
– invading
Lemma (dictionary headword)
– invade, for forms invade, invades, invaded, invading
Lemmatisation
– Chinese: none; English: simple
– Middling: Swedish, Norwegian, Italian, Greek
– Tough: Russian, Polish, Arabic
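Counting lemmas rather than word forms amounts to mapping each form to its headword before counting. The sketch below uses a tiny hand-written lookup table as a stand-in for a real lemmatiser; the table and function name are assumptions for illustration only.

```python
from collections import Counter

# Hand-made form-to-lemma table standing in for a real lemmatiser.
LEMMA_OF = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}


def lemma_counts(tokens):
    """Count lemmas; forms missing from the table count as their own lemma."""
    return Counter(LEMMA_OF.get(t.lower(), t.lower()) for t in tokens)


counts = lemma_counts("They invaded once and keep invading".split())
print(counts["invade"])  # 2
```

For the languages listed as tough, the lookup step is where most of the effort and most of the errors would be expected.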

19 Word Families
Derivational morphology
– efficient/efficiently
– access/accessible/accessibility
– available/availability/unavailable
‘Word families’ tradition
– e.g. Coxhead, Academic Word List
– Pedagogy: one item to learn
But
– Where do families end?
– Different meanings

20 Grammatical classes
brush (verb) and brush (noun)
– Same item or different?
– (both in same word family)
Required: (short) list of word classes
– POS-tagger
Will make mistakes

21 Marginal cases
Numbers
– twelve, seventeenth, fifties
Closed sets
– Days of week, months
Countries
– Capitals, nationalities, currencies, adjectives, languages
Regional/dialects, political groups, religions
– easter, christmas, islam, republican
– Policies always needed

22 Multiwords
"according to"
– Linguistically a word, but
Multiword frequency list: top item is "of the"
– Can't use frequencies (alone) to select multiwords

23 Homonymy
bank (river) and bank (money)
Word sense disambiguation
– We can't do it (with decent accuracy)
– We can't give frequencies for senses
Lists of words, not meanings
– Sometimes disconcerting

24 Corpora
A fairly arbitrary sample of a language
To limit arbitrariness of wordlist
– Make it big and diverse
WACKY corpora
– From web
– Can do for any language
??? Comparable ???
– Web language: less formal

25 Word lists are useful, but
...are they scientific?
– A tiny bit, occasionally
...could they be scientific?
– Yes
– article of faith
By the end of KELLY, we'll have a clearer idea how

