Human Judgements in Parallel Treebank Alignment
Martin Volk, Torsten Marek, Yvonne Samuelsson
University of Zurich and Stockholm University


2 23 August 2008 English Syntax Tree

3 23 August 2008

4

5 DE – EN Alignment

6 23 August 2008 SMULTRON
Stockholm MULtilingual TReebank
1000 sentences in 3 languages (DE-EN-SV)
  - 500 from Jostein Gaarder’s Sophie’s World (~ tokens, 14 tokens/sentence)
  - 500 from Economy texts (~ tokens, 22 tokens/sentence)
      ABB Quarterly report
      Rainforest Alliance: Banana Certification Program
      SEB Annual report
  - Released: January

7 23 August 2008 German Annotation

8 23 August 2008 German sentence: flat annotation

9 23 August 2008 German sentence: deepened

10 23 August 2008 English Annotation

11 23 August 2008 English Syntax Tree

12 23 August 2008 English annotation
Follows the Penn Treebank guidelines
Slower annotation because of
  - insertion of traces
  - secondary edges
  - deeper trees

13 23 August 2008

14 23 August 2008 Tree Alignment

15 23 August 2008
Sentence alignment
Word alignment
  - input for Statistical MT
Phrase alignment
  - linguistically motivated phrases
  - input for Example-based MT

16 23 August 2008 Alignment Example

17 23 August 2008 Tools for Parallel Treebanks
creating and editing trees
  - from mono-lingual treebanks
  - PoS-taggers, chunkers, editor, ‘tree-enricher’
aligning phrases
  - use of word alignment tools
  - tree alignment editor
  - Stockholm TreeAligner
searching across languages
  - TIGER-Search for parallel treebanks
  - Stockholm TreeAligner

18 23 August 2008 Guidelines for Alignment
1. Align words and phrases that represent the same meaning and could serve as translation units in an MT system.
2. Align as many words and phrases as possible.
3. Distinguish between exact and approximate alignments.
4. 1:n word / phrase alignments are allowed, but not m:n word / phrase alignments.
5. m:n sentence alignments are allowed.
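Guideline 4 lends itself to a mechanical check. The following is a minimal sketch, not a feature of the Stockholm TreeAligner: it assumes that alignments are stored as (German node ID, English node ID, type) triples, which is a hypothetical representation, and it reads "m:n" as any link whose German node and English node both take part in further links. The function name and node IDs are made up for illustration.

```python
from collections import Counter

def find_many_to_many(links):
    """Return the links that violate the 'no m:n' guideline.

    links: iterable of unique (de_node, en_node, kind) triples, where kind
    is "exact" or "fuzzy" (hypothetical representation, see lead-in).
    A link is flagged when BOTH of its endpoints occur in more than one
    link; a pure 1:n fan-out is left alone.
    """
    links = list(links)
    de_degree = Counter(de for de, _, _ in links)
    en_degree = Counter(en for _, en, _ in links)
    return [
        (de, en, kind)
        for de, en, kind in links
        if de_degree[de] > 1 and en_degree[en] > 1
    ]

# 1:n is allowed: German node d1 is aligned with two English nodes.
allowed = [("d1", "e1", "exact"), ("d1", "e2", "exact"), ("d2", "e3", "fuzzy")]

# Adding d3-e2 makes e2 shared as well, so the d1-e2 link is part of an m:n group.
not_allowed = allowed + [("d3", "e2", "exact")]

print(find_many_to_many(allowed))
print(find_many_to_many(not_allowed))
```

Flagging only the link that sits at the crossing point keeps the report short; an annotator would still have to decide which of the competing links to drop.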

19 23 August 2008 Examples
Do not align:
  - die Verwunderung über das Leben
  - their astonishment at the world
Do align:
  - was für eine seltsame Welt
  - what an extraordinary world

20 23 August 2008 Specific rules
a pronoun in one language shall never be aligned with a full noun in the other
names are aligned regardless of spelling, unless the name is changed (fiction)
ignore number/case but not voice

21 23 August 2008 Exact vs approximate alignment
best vs. “second-best” translation
an acronym in one language shall be aligned as approximate (fuzzy) with a spelled-out term in the other
  - PT – Power Technologies
difficult distinctions
  - einer der ersten Tage im Mai – early May

22 23 August 2008 Related Research
Blinker project (Melamed)
Prague Czech-English Treebank
Example-based MT in Dublin
Linköping English-Swedish Treebank

23 23 August 2008 Experiment
12 students were asked to align 20 tree pairs DE-EN
  - 10 tree pairs from Sophie’s World
  - 10 tree pairs from Economy text
advanced CL students received
  - short introduction
  - the written guidelines

24 23 August 2008 Gold Standard Alignment (DE-EN)
                    word - word          phrase - phrase
                    exact    approx.     exact    approx.
10 sent. Sophie
sent. Econ

25 23 August 2008 Experiment: Results
The number of alignments varied widely between students
Sophie part: from 47 to 125 (ø = 94.3)
Econ part: from 62 to 259 (ø = 186.9)
  - the 3 students with the lowest numbers were non-native speakers of German
  - 1 student had misunderstood the task

26 23 August 2008 Experiment: Results
The remaining 8 students had a high overlap with the gold standard
Recall
  - Sophie part: from 48% to 81% (ø = 68.7%)
  - Econ part: from 66% to 89% (ø = 75.5%)
Precision
  - Sophie part: from 81% to 97% (ø = 89.1%)
  - Econ part: from 78% to 94% (ø = 88.2%)
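The slides do not say how recall and precision were computed. A minimal sketch of one common way to do it, assuming each alignment is a (German node ID, English node ID) pair and that the exact/approximate label is ignored when comparing against the gold standard (both are assumptions, as are the function name and the toy IDs):

```python
def precision_recall(student_links, gold_links):
    """Compare one annotator's alignment links with the gold standard.

    Both arguments are iterables of (de_node, en_node) pairs.
    Precision: share of the student's links that are also in the gold standard.
    Recall:    share of the gold-standard links that the student found.
    """
    student = set(student_links)
    gold = set(gold_links)
    correct = len(student & gold)
    precision = correct / len(student) if student else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Toy data: 4 gold links; the student found 2 of them and added 1 wrong link.
gold = [("d1", "e1"), ("d2", "e2"), ("d3", "e3"), ("d4", "e4")]
student = [("d1", "e1"), ("d2", "e2"), ("d9", "e9")]

precision, recall = precision_recall(student, gold)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Under this reading, a student who aligns cautiously gets high precision and lower recall, which matches the pattern reported on the slide.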

27 23 August 2008 Discrepancies
students sometimes aligned a word (or some words) with a node
  - e.g. the word natürlich to the phrase of course
students sometimes aligned a German verb group with a single verb form in English
  - e.g. ist zurückzuführen vs. reflecting

28 23 August 2008 Discrepancies
based on different grammatical forms:
a definite singular NP in German with an indefinite plural NP in English
  - der Umsatz vs. revenues
a German genitive NP with a PP in English
  - der beiden Divisionen vs. of the two divisions

29 23 August 2008 Missed by all students
alignment of German word to empty token in English
  - wenn sie die Hand ausstreckte vs. herself shaking hands

30 23 August 2008

31 23 August 2008 Conclusions
1. Our alignment guidelines are sufficient for a core of clear alignment decisions.
2. Needed:
   1. Better alignment rules with concrete examples.
   2. Better support tools (consistency checking).
3. The distinction between exact alignment and approximate alignment is very tricky.

32 23 August 2008 Thank You for Your Attention! Questions???

33 23 August 2008 Applications of Parallel Treebanks
For the Translator
1. corpus for translation studies
  - search tools needed
For the Computational Linguist
1. input for Example-based Machine Translation
2. evaluation corpus for word, phrase or clause alignment
3. training corpus for transfer rules

34 23 August 2008 Alignment Example

35 23 August 2008 Parallel Treebanking
[Figure: workflow diagram]
DE: DE sentence -> ANNOTATE (PoS tagger (STTS), Chunker (TIGER)) -> flat DE tree -> Deepening -> DE tree
SV: SV sentence -> PoS tagger (SUC) -> STTS conversion -> ANNOTATE (Chunker (SWE-TIGER)) -> flat SV tree -> Deepening + Back conv. -> SV tree
DE tree + SV tree -> phrase alignment