1
Terminology, translation, and PRESEMT; word frequency lists and KELLY
Adam Kilgarriff
Lexical Computing Ltd
SKEW-2, March 2011
Kilgarriff: PRESEMT and KELLY
2
PRESEMT
EU FP7 project FP7-ICT-4-248307, 2010-2012
Pattern Recognition based Statistically Enhanced MT
Six partners, five countries
Languages: Czech, English, German, Greek, Italian
Comparable Corpora BootCat (CCBC)
Demo by Jan Pomikalek
http://www.presemt.eu
3
KELLY
“Keywords for Language Learning for Young and adults alike”
EU lifelong learning project:
– Goal: wordcards
  Word in one language on one side, other on other
  Language learning
– 9 languages, 36 pairs
  Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
– Partners in 6 countries
http://su.avedas.com/converis/contract/321
5
Method
Prepare monolingual lists
Translate
– Each into 8 target languages
– Professional translation services
Integrate, finalise
Produce cards
Goal for each set
– 9000 pairs at 6 levels
6
Stages
Sort out corpora, tagging
Automatically generate M1 lists
– names, numbers, countries...
– keywords vis-a-vis other corpora
Review, compare, prepare M2 lists
Translate
Use translations: M3 lists
Finalise
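The "keywords vis-a-vis other corpora" step can be illustrated with a simple ratio of smoothed per-million frequencies. The slides do not say which statistic the project used, so this is only a sketch under that assumption; the function names, the smoothing constant, and the toy corpora are all invented for illustration.

```python
from collections import Counter

def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size

def keyness(focus, reference, smoothing=1.0):
    """Score each focus-corpus word by (f + n) / (r + n), where f and r are
    per-million frequencies and n is a smoothing constant (an assumption)."""
    focus_size = sum(focus.values())
    ref_size = sum(reference.values())
    scores = {}
    for word, count in focus.items():
        f = per_million(count, focus_size)
        r = per_million(reference.get(word, 0), ref_size)
        scores[word] = (f + smoothing) / (r + smoothing)
    return scores

# Toy corpora, invented for this example.
focus = Counter("the cat sat on the mat the cat".split())
reference = Counter("the dog ran to the park and the dog sat".split())

scores = keyness(focus, reference)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

Words that are common in both corpora (here "the") score close to 1, while words typical of the focus corpus rise to the top.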
7
Review: how?
Points system
– 2 points for each of 6 levels
– 12 points for most frequent words
Deduct points for words in over-represented areas
Add in words from other corpora
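One possible reading of the points system is sketched below. The slide only states 2 points per level and 12 points for the most frequent words; the equal-sized frequency bands, the zero-based rank, and the size of the deduction are assumptions made for illustration, not the project's actual rules.

```python
def band_points(rank, list_size=9000, levels=6, points_per_level=2):
    """Map a zero-based frequency rank to points: the top band earns
    levels * points_per_level (12), the lowest band earns 2.
    Equal-sized bands are an assumption."""
    band_size = list_size // levels
    band = min(rank // band_size, levels - 1)
    return (levels - band) * points_per_level

def review_score(rank, over_represented=False, penalty=2):
    """Apply a hypothetical deduction for words from over-represented areas."""
    score = band_points(rank)
    if over_represented:
        score -= penalty
    return score

print(review_score(0))                         # most frequent band
print(review_score(8999))                      # least frequent band
print(review_score(0, over_represented=True))  # with deduction
```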
8
Translation database
On the web
All translations entered into it
Queries like
– All Swedish words used as translations more than six times
– All 1:1:1:1... 'simple cases'
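The first query above could look something like the following. The real database schema is not described in the slides, so the table layout, column names, and rows here are hypothetical, and an in-memory SQLite database stands in for the web database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (source word, target word) translation.
conn.execute(
    "CREATE TABLE translations ("
    "source_lang TEXT, source_word TEXT, target_lang TEXT, target_word TEXT)"
)

# Invented placeholder rows: seven source words share one Swedish target.
rows = [("en", f"source_{i}", "sv", "få") for i in range(7)]
rows.append(("en", "sun", "sv", "sol"))
conn.executemany("INSERT INTO translations VALUES (?, ?, ?, ?)", rows)

def frequent_targets(conn, target_lang, min_uses):
    """Return target-language words used as translations more than min_uses times."""
    query = (
        "SELECT target_word, COUNT(*) AS uses FROM translations "
        "WHERE target_lang = ? GROUP BY target_word "
        "HAVING COUNT(*) > ? ORDER BY uses DESC, target_word"
    )
    return conn.execute(query, (target_lang, min_uses)).fetchall()

print(frequent_targets(conn, "sv", 6))
```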
9
Using the translations database
Find words not in M2 lists, that need adding
– Multiwords
– English 'look for'
– Probably the translation of a high-frequency word in several of the 8 other languages
– So: add it to English list
– Homonyms: could be similar
10
Monolingual master lists (M3)
Based on a WAC corpus
Input from other same-language corpora
And from translations from 8 languages
– Useful words which might not be high-frequency
Added words/multiwords must be above a lower frequency threshold
Target 9000
11
Matches across 9 languages
Set of symmetrical relations across all 36 pairs
– music
– library
– sun
– hospital
– theory
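A check for such a fully symmetrical match might be sketched as follows: for every unordered pair of languages, each word must translate to the other in both directions. The data structure and the three-language toy data are assumptions for illustration; with nine languages the same loop would cover 36 pairs.

```python
from itertools import combinations

def is_symmetric(words, translations):
    """words: {lang: word}. translations: {(src, tgt): {src_word: tgt_word}}.
    True only if every language pair translates both ways."""
    for a, b in combinations(sorted(words), 2):
        forward = translations.get((a, b), {}).get(words[a])
        backward = translations.get((b, a), {}).get(words[b])
        if forward != words[b] or backward != words[a]:
            return False
    return True

# Toy data for three of the nine languages.
words = {"en": "music", "it": "musica", "sv": "musik"}
translations = {
    ("en", "it"): {"music": "musica"}, ("it", "en"): {"musica": "music"},
    ("en", "sv"): {"music": "musik"},  ("sv", "en"): {"musik": "music"},
    ("it", "sv"): {"musica": "musik"}, ("sv", "it"): {"musik": "musica"},
}

print(is_symmetric(words, translations))

# Nine languages give 9 * 8 / 2 unordered pairs.
pair_count = len(list(combinations(range(9), 2)))
print(pair_count)
```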
12
Big problems
Multiwords (as anticipated)
Homonymy (as anticipated)
orange, banana, alphabet, elbow, Hello
– Worse than anticipated
– Lists from spoken corpora, learner corpora, needed
– Relation between
  Competence for communicating
  The corpora at our disposal
13
(Monolingual) Word Lists
Define a syllabus
Which words get used in
  Learning-to-read books (NS children)
  NNS language learner textbooks
  Dictionaries
  Language testing
NS: educational psychologists
NNS: proficiency levels
14
Should be corpus-based
Most aren't
Corpora are quite new
Easy to do better
People will use them
Maybe also Governments
15
How
Take your corpus
Count
Voila
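In its most naive form, "take your corpus, count" amounts to a few lines of code. The sketch below assumes lower-cased alphabetic tokens and a toy text; a real list would need the tokenisation, lemmatisation, and tagging decisions that complicate the picture.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first.
    Tokens are runs of a-z after lower-casing (a simplifying assumption)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common()

# Toy corpus, invented for this example.
corpus = "The cat sat on the mat. The mat was flat."
freqs = frequency_list(corpus)
print(freqs[:2])
```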
16
Complications
What is a word
Words and lemmas
Grammatical classes
Numbers, names...
Multiwords
Homonymy
All are slightly different issues for each language
17
What is a word; delimiters
Found between spaces
  Not for Chinese: segmentation
English
  co-operate, widely-held, farmer's, can't
Norwegian, Swedish
  Compounding, separable verbs
Arabic, Italian
  Clitics, al,...
...
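The English cases show how much the word count depends on the delimiter policy. This small sketch compares two policies on an invented sentence; neither is claimed to be the tokeniser used in the project.

```python
import re

sentence = "The farmer's co-operative can't wait"

# Policy 1: a word is whatever sits between spaces.
by_spaces = sentence.split()

# Policy 2: a word is any run of letters, so hyphens and apostrophes split.
by_letters = re.findall(r"[A-Za-z]+", sentence)

print(by_spaces)
print(by_letters)
```

The same sentence yields different numbers of "words" under each policy, so any frequency list built on it depends on which one was chosen.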
18
Words and lemmas
Word form (in text)
  invading
Lemma (dictionary headword)
  invade, for forms invade, invades, invaded, invading
Lemmatisation
  Chinese: none; English: simple
  Middling: Swedish, Norwegian, Italian, Greek
  Tough: Russian, Polish, Arabic
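Counting lemmas rather than word forms can be sketched with a lookup table that maps each form to its headword. The tiny hand-written table below is a stand-in for a real lemmatiser, which the slides do not name.

```python
from collections import Counter

# Hand-written form-to-lemma table (illustrative only).
LEMMAS = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_counts(tokens, lemma_table):
    """Count lemmas; forms missing from the table count as themselves."""
    return Counter(lemma_table.get(token, token) for token in tokens)

tokens = "they invaded and were invading while others invade".split()
counts = lemma_counts(tokens, LEMMAS)
print(counts["invade"])
```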
19
Word Families
Derivational morphology
  efficient/efficiently
  access/accessible/accessibility
  available/availability/unavailable
‘Word families’ tradition
  e.g. Coxhead, Academic Word List
  Pedagogy: one item to learn
But
  Where do families end?
  Different meanings
20
Grammatical classes
brush (verb) and brush (noun)
  Same item or different?
  (both in same word family)
Required
  (short) list of word classes
  POS-tagger
  Will make mistakes
21
Marginal cases
Numbers
  twelve, seventeenth, fifties
Closed sets
  Days of week, months
Countries
  Capitals, nationalities, currencies, adjectives, languages
Regional/dialects, political groups, religions
  easter, christmas, islam, republican
Policies always needed
22
Multiwords
'according to'
  Linguistically a word, but
Multiword frequency list: top item 'of the'
Can't use frequencies (alone) to select multiwords
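The point about raw frequency is easy to reproduce: counting adjacent word pairs in even a tiny invented text puts a grammatical sequence like 'of the' above a genuine multiword like 'according to'. This is a sketch with toy data, not the project's procedure.

```python
from collections import Counter

# Toy text, invented for this example.
text = ("the top of the list and the end of the day "
        "according to the author of the book")
tokens = text.split()

# Count every pair of adjacent tokens.
bigrams = Counter(zip(tokens, tokens[1:]))

top_pair, top_count = bigrams.most_common(1)[0]
print(top_pair, top_count)
print(bigrams[("according", "to")])
```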
23
Homonymy
bank (river) and bank (money)
Word sense disambiguation
  We can't do (with decent accuracy)
We can't give frequencies for senses
Lists of words, not meanings
  Sometimes disconcerting
24
Corpora
A fairly arbitrary sample of a language
To limit arbitrariness of wordlist
  Make it big and diverse
WACKY corpora
  From web
  Can do for any language
  ??? Comparable ???
  Web language: less formal
25
Word lists are useful, but
...are they scientific?
  A tiny bit, occasionally
...could they be scientific?
  Yes
  article of faith
By the end of KELLY, we'll have a clearer idea how