Czech-English Word Alignment Ondřej Bojar Magdalena Prokopová

Presentation transcript:

Czech-English Word Alignment
Ondřej Bojar, Magdalena Prokopová
Institute of Formal and Applied Linguistics, ÚFAL MFF, Charles University in Prague

Full text, acknowledgement and the list of references in the proceedings of LREC.

Motivation

Steps in statistical machine translation:
1. Sentence-parallel corpus
2. Automatic word alignment, producing word-to-word alignments
3. Phrase extraction, producing a phrase table (~ translation dictionary of multi-word expressions)

Motivation to manually annotate word alignment:
• to create evaluation data for automatic alignment methods
• to learn more about inter-annotator agreement and the limits of the task

Manual Annotation

Two annotators independently annotated 515 sentences using 3 main connection types:
• null: the word has no counterpart
• possible: the words can possibly be linked
• sure: the words are translations of each other

Additionally, some segments could be marked as phrasal translations:
• phrasal: whole phrases correspond, but not the individual words

Types of connections used to compare annotations:

                              Possible, Sure, Phrasal    Connection of Any Type
Annotator   A1                15,476                     15,399
            A2                16,631                     16,246
Mismatch    A1 but not A2     2,343                      1,146
            A2 but not A1     3,498                      1,714
Relative mismatch             18.2 %                     9.0 %

The mismatch rate is relatively high, but it drops to about half if the differences in connection type are disregarded.

Most Frequent Problematic Cases

• Verbs and the words that belong to them, including the negative particle.
• English articles in cases where the rule “connect to the Czech governing noun” cannot be clearly applied.
• Punctuation: commas are used more frequently in Czech; the dollar symbol ($) is almost always translated and thus rarely repeated in Czech.

Automatic Word Alignment

GIZA++ (Och and Ney, 2003) automatically creates asymmetric alignments (1 source word connected to n target words).
• It is used twice (Cs->En and En->Cs).
• The two guessed alignments can be merged using union, intersection or possibly other techniques.

Preprocessing of the input text such as lemmatization significantly reduces data sparseness (see the table Details about the PCEDT below) and helps to achieve better alignments:

Baseline (raw input text)                  Zisk     se    vyšvihl     na    117  milionů  dolarů
Lemmas                                     zisk     se-1  vyšvihnout  na-1  117  milion   dolar
Lemmas + Numbers                           zisk     se-1  vyšvihnout  na-1  NUM  milion   dolar
Lemmas + Singletons backed off with POS    zisk     se-1  VERB        na-1  117  milion   dolar
Gloss                                      Revenue  refl  soared      to    117  million  dollar

The test set for GIZA++ was created by merging the two human annotations:
• both annotators mark a sure connection -> required connection
• one of the annotators chooses a sure connection and the other any other connection type -> required connection
• at least one of the annotators chooses any connection type -> allowed connection
• otherwise -> connection not allowed

Evaluation metrics: Precision penalizes superfluous connections (connections generated automatically but not even allowed); recall penalizes forgotten required connections. Alignment error rate (AER) is a combination of precision and recall.

Results of automatic word alignment:
Columns: Intersection (1-1): Prec, Rec, AER; Union (n-n): Prec, Rec, AER.
Rows: Baseline; Lemmas; Lemmas + Numbers; Lemmas + Singletons backed off with POS.

Where GIZA++ Fails, Humans Were Often in Trouble, Too

The following table displays the percentage of tokens where there was a match (OK) or mismatch (Problems) in the respective languages:
Rows: Humans (Problems / OK) by GIZA++ (Problems / OK).
Columns: Baseline (en, cs); Improved (en, cs).

• Humans: two human annotations compared against each other.
• GIZA++: compared against golden alignments (i.e. merged human annotations).
• Out of all the positions where GIZA++ failed, 38% were problematic for humans.
• The improvement thanks to lemmatization is not observed on words that are difficult for humans anyway.

Top Ten Problematic Words and POSes

Problematic Words              Problematic Parts of Speech
English       Czech            English       Czech
361  to       319  ,           679  IN       1348  N
259  the      271  se          519  DT       1283  V
159  of       146  v           510  NN        661  R
143  a        112  na          386  PRP       505  P
124  ,         74  o           361  TO        448  Z
107  be        61  že          327  VB        398  A
 99  it        55  .           310  JJ        280  D
 95  that      47  a           245  RB        192  J
 84  in        41  bude        216  NNP        59  C
 80  by        37  k           199  VBN        22  T
…             …                …             …

English Penn Treebank Tag-Set: IN - Preposition or subordinating conjunction, DT - Determiner, NN - Noun, common, singular or mass, PRP - Pronoun, personal, TO - to, VB - Verb, base form, JJ - Adjective, NNP - Noun, proper, singular, VBN - Verb, past participle.

Czech Tag-Set: N - Noun, V - Verb, R - Preposition, P - Pronoun, Z - Punctuation, sentence border, A - Adjective, D - Adverb, J - Conjunction, C - Number, T - Particle.

Details about the Prague Czech-English Dependency Treebank

Source: Wall Street Journal section of the Penn Treebank, translated sentence-by-sentence to Czech.

                                             Czech      English
Sentences                                         21,141
Running Words                                475,719    494,349
Running Words without Punctuation            404,523    439,304
Baseline               Vocabulary            57,085     30,770
                       Singletons            31,458     14,637
Lemmas                 Vocabulary            28,007     25,000
                       Singletons            13,009     11,873
Lemmas + Singletons    Vocabulary            15,041     13,150
                       Singletons            122
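
The rules for merging the two human annotations and the evaluation metrics can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' code. The names `merge_annotations` and `evaluate` are hypothetical, a word pair with no link from an annotator is treated as "no connection", and AER is computed with the standard definition from Och and Ney (2003), which the poster is assumed to follow.

```python
from typing import Dict, Optional, Set, Tuple

Link = Tuple[int, int]        # (Czech token index, English token index)
Annotation = Dict[Link, str]  # link -> "sure" | "possible" | "phrasal"


def merge_annotations(a1: Annotation, a2: Annotation) -> Tuple[Set[Link], Set[Link]]:
    """Merge two annotations into (required, allowed) connection sets.

    Assumption: a link missing from an annotation means that annotator
    marked no connection for that word pair.
    """
    required: Set[Link] = set()
    allowed: Set[Link] = set()
    for link in set(a1) | set(a2):
        t1: Optional[str] = a1.get(link)
        t2: Optional[str] = a2.get(link)
        # At least one annotator chose some connection type -> allowed.
        allowed.add(link)
        # Both annotators linked the pair and at least one said "sure" -> required.
        if t1 is not None and t2 is not None and "sure" in (t1, t2):
            required.add(link)
    return required, allowed


def evaluate(predicted: Set[Link], required: Set[Link], allowed: Set[Link]):
    """Return (precision, recall, AER) of an automatic alignment."""
    if not predicted or not required:
        raise ValueError("predicted and required must be non-empty")
    hit_allowed = len(predicted & allowed)    # not superfluous
    hit_required = len(predicted & required)  # not forgotten
    precision = hit_allowed / len(predicted)
    recall = hit_required / len(required)
    # Standard AER (Och and Ney, 2003); assumed, not stated on the poster.
    aer = 1.0 - (hit_required + hit_allowed) / (len(predicted) + len(required))
    return precision, recall, aer
```

Because every required connection is also allowed, an alignment that proposes exactly the required connections scores precision 1, recall 1 and AER 0.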
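
Merging the two directional GIZA++ runs by intersection or union can be illustrated with plain set operations. The sketch below assumes each run is already available as a set of index pairs. The function name `symmetrize` and the pair representation are illustrative choices, not part of GIZA++ itself.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def symmetrize(cs2en: Set[Pair], en2cs: Set[Pair]) -> Tuple[Set[Pair], Set[Pair]]:
    """Combine two asymmetric alignments.

    cs2en holds (cs_index, en_index) pairs from the Cs->En run;
    en2cs holds (en_index, cs_index) pairs from the En->Cs run.
    Returns (intersection, union), both as (cs_index, en_index) pairs.
    """
    # Flip the En->Cs pairs so both sets use the same orientation.
    flipped = {(cs, en) for (en, cs) in en2cs}
    intersection = cs2en & flipped  # links found in both directions
    union = cs2en | flipped         # links found in either direction
    return intersection, union
```

The intersection keeps only links confirmed in both directions, which is why the results table labels it 1-1, while the union admits every link proposed by either run, hence n-n.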
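
The relative mismatch figures in the annotator comparison are consistent with dividing the connections made by only one annotator by the total number of connections made by both. This reading is inferred from the published counts rather than stated on the poster, so the formula below is an assumption that happens to reproduce both percentages.

```python
def relative_mismatch(total_a1: int, total_a2: int,
                      only_a1: int, only_a2: int) -> float:
    """Share of connections on which the two annotators disagree.

    Assumed formula: (A1 but not A2 + A2 but not A1) / (|A1| + |A2|).
    """
    return (only_a1 + only_a2) / (total_a1 + total_a2)


# Counts taken from the annotator comparison table.
typed = relative_mismatch(15476, 16631, 2343, 3498)     # connection types distinguished
any_type = relative_mismatch(15399, 16246, 1146, 1714)  # connection of any type
print(f"{typed:.1%} {any_type:.1%}")
```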