Terminology, translation, and PRESEMT; word frequency lists and KELLY 1 Adam Kilgarriff Lexical Computing Ltd SKEW-2, March 2011Kilgarriff: PRESEMT and.

Slides:



Advertisements
Similar presentations
Terminology-finding in the Sketch Engine Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vit Suchomel Lexical Computing Ltd., Brighton,
Advertisements

The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
The Cambridge Learner Corpus, English Profile, the Sketch Engine and the Kelly Project Adam Kilgarriff Lexical Computing Ltd
Corpus Processing and NLP
WebBootCaT usage Adam Kilgarriff Lexical Computing Ltd.
1 Corpora for all Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Measuring Distance between Language Varieties Adam Kilgarriff, Jan Pomikalek, Pavel Rychly, Vit Suchomel Supported by EU Project PRESEMT.
A Syntactic Translation Memory Vincent Vandeghinste Centre for Computational Linguistics K.U.Leuven
Academic style.
Linking Dictionary and Corpus Adam Kilgarriff Lexicography MasterClass Ltd Lexical Computing Ltd University of Sussex UK.
Police-Rescue Learner’s Dictionary Epp Leibur, Külli Saluste.
J. Kunzmann, K. Choukri, E. Janke, A. Kießling, K. Knill, L. Lamel, T. Schultz, and S. Yamamoto Automatic Speech Recognition and Understanding ASRU, December.
Using Corpora for Teaching Chinese Dr. Adam Kilgarriff Lexical Computing Ltd Leeds University UK.
What is a corpus?* A corpus is defined in terms of  form  purpose The word corpus is used to describe a collection of examples of language collected.
CALL 2008 Antwerp Choosing words and their order for vocabulary CALL Cornelia Tschichold Swansea.
Making useful wordlists for ELT Topical vocabulary from the WWW Simon Smith & Scott Sommers Ming Chuan University, Taipei Adam Kilgarriff, Lexical Computing.
Measuring Monolinguality Chris Biemann NLP Department, University of Leipzig LREC-06 Workshop on Quality Assurance and Quality Measurement for Language.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
The origins of language curriculum development
Data-Driven South Asian Language Learning SALRC Pedagogy Workshop June 8, 2005 J. Scott Payne Penn State University
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
Corpora and Language Teaching
Corpus Linguistics Lexicography. Questions for lexicography in corpus linguistics How common are different words? How common are the different senese.
What's on the Web? The Web as a Linguistic Corpus Adam Kilgarriff Lexical Computing Ltd University of Leeds.
14: THE TEACHING OF GRAMMAR  Should grammar be taught?  When? How? Why?  Grammar teaching: Any strategies conducted in order to help learners understand,
Memory Strategy – Using Mental Images
Simple Maths for Keywords Adam Kilgarriff Lexical Computing Ltd.
Labels: automation Adam Kilgarriff. Auckland 2012Kilgarriff / Labels: automation2 Which words are:  Most distinctive of business English?  Most often.
1 Evaluating word sketches Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.
Using Corpora for Teaching Chinese Dr. Adam Kilgarriff Lexical Computing Ltd Leeds University UK.
Comparable Corpora BootCaT (CCBC) Adam Kilgarriff, Avinesh PVS, Jan Pomikalek Lexical Computing Ltd.
1 Linguistic evidence within and across languages, word frequency lists and language learning Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass.
Text Analysis Everything Data CompSci Spring 2014.
eTwinning project 2012/13 Future Generation Photography as a Pedagogical Tool.
Using corpora for bespoke language teaching
Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.
First International Sketch Grammar Workshop Ljubljana 3-4 February 2010.
1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.
CIG Conference Norwich September 2006 AUTINDEX 1 AUTINDEX: Automatic Indexing and Classification of Texts Catherine Pease & Paul Schmidt IAI, Saarbrücken.
GDEX: Automatically finding good dictionary examples in a corpus Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý Lexical Computing.
WordNet ® and its Java API ♦ Introduction to WordNet ♦ WordNet API for Java Name: Hao Li Uni: hl2489.
1 Corpora, Dictionaries, and points in between in the age of the web Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of.
Why We Need Corpora and the Sketch Engine Adam Kilgarriff Lexical Computing Ltd, UK Universities of Leeds and Sussex.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
1 Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass.
Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.
Comparable Corpora BootCaT (CCBC) (or: In Praise of BootCaT) Adam Kilgarriff, Jan Pomikalek, Avinesh PVS Lexical Computing Ltd. Work Supported by EU FP7.
1 Using Corpora in Language Research -also Introduction to the Sketch Engine (WS15) part 1 Adam Kilgarriff Lexical Computing Ltd Universities of Leeds.
1 Evaluating word sketches and corpora Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Corpus Evaluation Adam Kilgarriff Lexical Computing Ltd Corpus evaluationPortsmouth Nov
Using Corpora in Language Research Adam Kilgarriff Lexical Computing Ltd Universities of Leeds January 2013Adam Kilgarriff.
Malta, May 2010Kilgarriff: Corpora by Web Services1 Corpora by Web Services Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities.
Terminology-finding in the Sketch Engine Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vit Suchomel Lexical Computing Ltd., Brighton,
Subcorpus configuration Adam Kilgarriff. Feb 2010Kilgarriff: IWSG: Subcorpora2 “you can’t get away from genre” Bonnie Weber, Keynote Lecture ICON (Indian.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
1 Ch 1. VOCABULARY SIZE, TEXT COVERAGE & WORD LISTS Nation& Waring.
Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd., UK Supported by EU Project PRESEMT.
Gardner, D. (2007). Validating the construct of word in applied corpus-based vocabulary research: A critical survey. Applied Linguistics, 28(2), 241–265.
GDEX: Automatically finding good dictionary examples in a corpus Kivik 2013Kilgarriff: GDEX1.
 Written English may be formal and informal  Academic writing is formal in an impersonal or objective style; cautious language is frequently used; vocabulary.
Changes in English 1 In this presentation we are going to look at the way other languages have influenced English and at the similarities and differences.
This morning I’m going to tell you about Language Futures
Measuring Monolinguality
Making useful wordlists for ELT

Evaluating word sketches and corpora
Corpus Linguistics I ENG 617
Tomaž Erjavec1, Adam Kilgarriff2, Irena Srdanović Erjavec3
Applied Linguistics Chapter Four: Corpus Linguistics
Presentation transcript:

Terminology, translation, and PRESEMT; word frequency lists and KELLY 1 Adam Kilgarriff Lexical Computing Ltd SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY

PRESEMT EU FP7 project FP7-ICT Pattern Recognition based Statistically Enhanced MT Six partners, five countries Languages: Czech English German Greek Italian Comparable Corpora BootCat (CCBC) Demo by Jan Pomikalek SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY2

KELLY “Keywords for Language Learning for Young and adults alike” EU lifelong learning project: – Goal: wordcards Word in one lg on one side, other on other Language learning – 9 languages, 36 pairs Arabic Chinese English Greek Italian Norwegian Polish Russian Sweden – Partners in 6 countries SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY3

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY4

Method Prepare monolingual lists Translate – Each into 8 target languages – Professional translation services Integrate, finalise Produce cards Goal for each set – 9000 pairs at 6 levels SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY5

Stages Sort out corpora, tagging Automatically generate M1 lists – names, numbers, countries... – keywords vis-a-vis other corpora Review, compare, prepare M2 lists Translate Use translations: M3 lists Finalise SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY6

review - how? points system – 2 points for each of 6 levels – 12 points for most freq words deduct points for words in over-represented areas add in words from other corpora SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY7

Translation database On the web All translations entered into it Queries like – All Swedish words used as translations more than six times – All 1:1:1:1... 'simple cases' SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY8

Using the translations database Find words not in M2 lists, that need adding – Multiwords – English look for – Probably, the translation of a high-freq word in several of the 8 other lgs – So: add it to English list – Homonyms: could be similar SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY9

Monolingual master lists (M3)‏ Based on a WAC corpus Input from other same-lg corpora And from translations from 8 lgs – Useful words which might not be hi-freq added words/multiwords must be above a lower freq threshold Target 9000 SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY10

Matches across 9 languages Set of symmetrical relations across all 36 pairs – music – library – sun – hospital – theory SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY11

Big problems Multiwords (as anticipated)‏ Homonymy (as anticipated)‏ orange banana alphabet elbow, Hello – Worse than anticipated – Lists from spoken corpora, learner corpora, needed – Relation between Competence for communicating The corpora at our disposal SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY12

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 13 (Monolingual) Word Lists  Define a syllabus  Which words get used in Learning-to-read books (NS children) ‏ NNS language learner textbooks Dictionaries Language testing  NS: educational psychologists  NNS: proficiency levels

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 14 Should be corpus-based  Most aren't Corpora are quite new  Easy to do better  People will use them Maybe also Governments

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 15 How  Take your corpus  Count  Voila

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 16 Complications  What is a word  Words and lemmas  Grammatical classes  Numbers, names...  Multiwords  Homonymy All are slightly different issues for each lg

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 17 What is a word; delimiters  Found between spaces Not for Chinese: segmentation  English co-operate, widely-held, farmer's, can't  Norwegian, Swedish Compounding, separable verbs  Arabic, Italian Clitics, al,... ...

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 18 Words and lemmas  Word form (in text) ‏ invading  Lemma (dictionary headword) ‏  Invade for forms invade invades invaded invading  Lemmatisation Chinese, none; English, simple Middling: Swe Nor It Gr Tough: Rus, Pol, Ara

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 19 Word Families  Derivational morphology efficient/efficiently access/accessible/accessibility available/availability/unavailable  ‘Word families’ tradition  eg: Coxhead, Academic word list Pedagogy: one item to learn But  Where do families end? Different meanings

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 20 Grammatical classes  brush (verb) and brush (noun) ‏ Same item or different? (both in same word family)  Required (short) list of word classes POS-tagger  Will make mistakes

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 21 Marginal cases Numbers  twelve, seventeenth, fifties Closed sets  Days of week, months Countries  Capitals, nationalities, currencies, adjectives, languages regional/dialects, political groups, religions  easter, christmas, islam, republican  policies always needed

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 22 Multiwords  According to Linguistically a word but  Multiword frequency list: top item of the Can't use freqs (alone) to select multiwords

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 23 Homonymy  bank (river) and bank (money) ‏  Word sense disambiguation We can't do (with decent accuracy) ‏ We can't give freqs for senses  Lists of words not meanings Sometimes disconcerting

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 24 Corpora  A fairly arbitrary sample of a lg  To limit arbitrariness of wordlist Make it big and diverse  WACKY corpora From web Can do for any language  ??? Comparable ??? Web language: less formal

SKEW-2, March 2011Kilgarriff: PRESEMT and KELLY 25 Word lists are useful, but ...are they scientific? A tiny bit, occasionally ...could they be scientific? Yes  article of faith By the end of KELLY, we'll have a clearer idea how