Deep Linguistic Information in Hybrid Machine Translation

Jan Hajič
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Czech Republic

Outline: From Data to an MT System
- “DeepBank”: The Prague Czech-English Dependency Treebank (2.0)
  - Texts, annotation style(s), alignment, tools
- The platform: Treex
- TectoMT: hybrid MT English → Czech
  - The (old) idea
  - Overall design
  - Core modules
- (A Speculation on) The Future
Dec. 8, 2012 Hybrid MT Workshop - Coling 2012

The Prague Czech-English Dependency Treebank (PCEDT) 2.0
- Parallel treebank
  - Aligned trees
  - Aligned nodes

The Prague Czech-English Dependency Treebank (PCEDT) 2.0
- Parallel treebank
- Dependency style (“Prague”)
  - (surface) syntax
  - syntax & semantics (and more) = “tectogrammatics”

The Prague Czech-English Dependency Treebank (PCEDT) 2.0
- Parallel treebank
- Dependency style (“Prague”)
  - (surface) syntax
  - syntax & semantics (“tectogrammatics”)
- Penn Treebank translation into Czech
  - Example: Názory na její tříměsíční perspektivu se různí. (roughly: “Opinions on its three-month outlook differ.”)

The Prague Czech-English Dependency Treebank (PCEDT) 2.0
- Parallel treebank
- Dependency style (“Prague”)
  - (surface) syntax
  - syntax & semantics (“tectogrammatics”)
- Penn Treebank translation into Czech
  - 1 million words
- Published at LDC, June 2012 (LDC2012T08)
- Also available through LINDAT-Clarin and META-SHARE

PCEDT 2.0: The Alignment(s)
- Czech-English alignments
  - Sentence level (manual, natural due to translation)
    - At both syntactic levels
  - Word (node) level: automatic, test section manually corrected (in part)

PCEDT 2.0: The Alignment(s)
- Czech-English alignments
  - Sentence level (manual, natural due to translation)
    - At both syntactic levels, 1 → 1
  - Word (node) level: automatic, test section manually corrected (in part), m → n
- Between annotation levels
  - Tectogrammatics to surface syntax: m → n, incl. 1 → 0
  - Surface syntax to word level: 1 → 1
(Figure labels: tectogrammatics, PTB syntax, surface syntax)
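
The link types named on this slide (1 → 1, m → n, 1 → 0) can be illustrated with a small sketch. It is not the PCEDT file format: the function name `link_types`, the node names, and the "restored_arg" node (standing in for a t-node restored from ellipsis, with no surface counterpart) are all hypothetical.

```python
from collections import defaultdict

def link_types(src_nodes, links):
    """Label every source node by the kind of alignment it takes part in.

    src_nodes: iterable of source node ids (here: t-layer nodes)
    links:     set of (source_id, target_id) pairs (here: t-node -> a-node)
    """
    src_to_tgt = defaultdict(set)
    tgt_to_src = defaultdict(set)
    for s, t in links:
        src_to_tgt[s].add(t)
        tgt_to_src[t].add(s)

    result = {}
    for s in src_nodes:
        targets = src_to_tgt.get(s, set())
        if not targets:
            result[s] = "1->0"   # no counterpart on the other layer
        elif len(targets) == 1 and len(tgt_to_src[next(iter(targets))]) == 1:
            result[s] = "1->1"
        else:
            result[s] = "m->n"   # several nodes on at least one side
    return result

# Hypothetical t-layer to a-layer links for "Machine translation should be easy."
t_nodes = ["be", "translation", "easy", "machine", "restored_arg"]
links = {
    ("be", "should"), ("be", "be"),      # one t-node covers modal + verb
    ("translation", "translation"),
    ("easy", "easy"),
    ("machine", "machine"),
}
types = link_types(t_nodes, links)
```

Here `types["be"]` comes out as "m->n", the three content words as "1->1", and the unlinked node as "1->0".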

Surface syntax annotation
- English
  - Dependency (head rules + additions, manual corrections)
  - Function label (PDT-style) at all nodes (from PTB + rules)
  - Lemmatization + “pure” POS tags from PTB
  - Automatic (from PTB) + a few manual corrections
- Czech
  - PDT style, no change
  - Syntax: automatic (MST); 2000 sentences fully manual for testing
  - Lemmatization and tagging: automatic, 99%/96%, Spoustová et al., EACL 2009 (COMPOST tagger)
    - http://ufal.mff.cuni.cz/compost (Czech, English & other)
  - No p-level (of course)

Tectogrammatical annotation
- Manual (both languages)
- Major features
  - Nodes with “autosemantic” words only (no function words)
  - Ellipsis “restored” (new node for verbal arguments)
  - (Semantic) function (dependent → head relation)
    - Verb arguments + ca. 50 functions for other relations
  - Valency lexicons attached (English: links to PropBank)
  - “Formemes”: prep+case style label (useful in MT and search)
  - Co-reference integrated (English: BBN + more; Czech: manually)
- Alignment
  - To surface syntax & between Czech and English
- Example: This temblor-prone city dispatched inspectors, firefighters and other earthquake-trained personnel *-1 to aid San Francisco.

Accompanying Tools
- TrEd (http://ufal.mff.cuni.cz/tred)
  - Annotation, view/browse and search environment
  - Open source, Perl
- Search and visualization
  - Simple data browser (http://ufal.mff.cuni.cz/pcedt2.0)
  - PML-TQ: powerful query language for complex tree-based annotation
- Treex (http://ufal.mff.cuni.cz/treex)
  - Modular NLP processing environment
  - Easy handling of complex NLP-annotated data
  - Modules exist for Czech and English data processing, incl. 3rd-party tools integrated into Treex
  - CPAN-distributed

PCEDT and Tectogrammatics in (hybrid) MT
The famous (almost) “Vauquois” triangle: ANALYSIS, TRANSFER, SYNTHESIS
- t-layer (tectogrammatical layer): deep syntax & semantics
- a-layer (analytical layer): shallow syntax
- m-layer (morphological layer): POS & lemmatization
- w-layer
Source language: English. Target language: Czech.

Analysis-Transfer-Synthesis Hybrid System
Over 90 steps, both rule-based and statistical.
- ANALYSIS (source language: English), from the w-layer up through the m-layer and a-layer to the t-layer
  - Tokenization
  - Lemmatization
  - Tagging (Compost)
  - Parsing (MST)
  - Analytical dep. function
  - Convert to t-tree
  - Grammatemes, formemes
- TRANSFER (t-layer)
  - Structural transfer
  - Lexical transfer (dictionary) & lexical choice
- SYNTHESIS (target language: Czech), from the t-layer down through the a-layer and m-layer to the w-layer
  - Basic morph. categories
  - Agreement
  - Add function words
  - Generate forms
  - Concatenate
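
The idea of a long chain of small processing steps can be sketched as a list of functions applied in order to a shared document. This is only an illustration of the pipeline shape, not Treex or TectoMT code: `run_scenario`, `tokenize` and `lemmatize` are hypothetical names, and the two steps use toy rules.

```python
def tokenize(doc):
    """Toy w-layer step: split on whitespace and detach sentence-final periods."""
    tokens = doc["text"].replace(".", " .").split()
    return {**doc, "tokens": tokens}

def lemmatize(doc):
    """Toy m-layer step: lowercasing stands in for real lemmatization."""
    return {**doc, "lemmas": [t.lower() for t in doc["tokens"]]}

def run_scenario(doc, steps):
    """Apply each step in order; every step returns an enriched copy of the document."""
    for step in steps:
        doc = step(doc)
    return doc

result = run_scenario(
    {"text": "Machine translation should be easy."},
    [tokenize, lemmatize],
)
```

Rule-based and statistical steps can sit side by side in such a chain, since each one only has to read and extend the shared document.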

Example Translation
- Tokenized: Machine translation should be easy .
- Lemmatized & POS tagged: machine/NN translation/NN should/MD be/VB easy/JJ ./.
- a-layer (parse) + functions:
    should (Pred)
      translation (Sb)
        machine (Atr)
      be (Obj)
        easy (Pnom)
      . (AuxK)

Example Translation
Mark function nodes & edges to “collapse”:
    should (Pred)
      translation (Sb)
        machine (Atr)
      be (Obj)
        easy (Pnom)
      . (AuxK)

Example Translation
T-tree backbone + formemes:
    be (v:fin)
      translation (n:subj)
        machine (n:attr)
      easy (adj:compl)

Example Translation
T-tree backbone + formemes + grammatemes:
    be (v:fin; Modality=hort, Conditional=1, Tense=PresSim)
      translation (n:subj; Num=sg)
        machine (n:attr)
      easy (adj:compl; DoC=Positive)
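
The step from the a-tree to the t-tree in this example can be sketched as follows. This is a toy reconstruction for this one sentence, not the TectoMT rules: the node encoding, the `FORMEME` and `MODALITY` tables, and the rule that a verb governed by a modal absorbs that modal are assumptions made for illustration.

```python
# Toy lookup tables; hypothetical, sufficient for this one sentence only.
FORMEME = {("NN", "Sb"): "n:subj", ("NN", "Atr"): "n:attr", ("JJ", "Pnom"): "adj:compl"}
MODALITY = {"should": "hort"}

def is_function(node):
    """Modal verbs and Aux* nodes get no t-node of their own in this sketch."""
    return node["tag"] == "MD" or node["afun"].startswith("Aux")

def collapse(a_nodes):
    by_ord = {n["ord"]: n for n in a_nodes}
    # Map each modal to the verb it governs; that verb absorbs the modal.
    verb_of_modal = {
        n["parent"]: n["ord"]
        for n in a_nodes
        if n["tag"].startswith("VB") and by_ord.get(n["parent"], {}).get("tag") == "MD"
    }
    modal_of_verb = {v: m for m, v in verb_of_modal.items()}

    def t_parent(node):
        p = node["parent"]
        while p in by_ord and is_function(by_ord[p]):
            verb = verb_of_modal.get(p)
            if verb is not None and verb != node["ord"]:
                return verb          # other children of the modal re-attach to the verb
            p = by_ord[p]["parent"]  # otherwise skip over the function node
        return p

    t_nodes = []
    for n in a_nodes:
        if is_function(n):
            continue
        t = {"ord": n["ord"], "t_lemma": n["lemma"], "parent": t_parent(n)}
        if n["ord"] in modal_of_verb:
            modal = by_ord[modal_of_verb[n["ord"]]]
            t["formeme"] = "v:fin"   # toy rule: the verb inherits finiteness from the modal
            t["modality"] = MODALITY.get(modal["lemma"])
        else:
            t["formeme"] = FORMEME.get((n["tag"], n["afun"]), "???")
        t_nodes.append(t)
    return t_nodes

# a-tree of "Machine translation should be easy ." (parent 0 = technical root)
a_nodes = [
    {"ord": 1, "lemma": "machine", "tag": "NN", "afun": "Atr", "parent": 2},
    {"ord": 2, "lemma": "translation", "tag": "NN", "afun": "Sb", "parent": 3},
    {"ord": 3, "lemma": "should", "tag": "MD", "afun": "Pred", "parent": 0},
    {"ord": 4, "lemma": "be", "tag": "VB", "afun": "Obj", "parent": 3},
    {"ord": 5, "lemma": "easy", "tag": "JJ", "afun": "Pnom", "parent": 4},
    {"ord": 6, "lemma": ".", "tag": ".", "afun": "AuxK", "parent": 3},
]
t_nodes = collapse(a_nodes)
```

The result has four t-nodes: "be" becomes the root with formeme v:fin and modality hort, "translation" and "easy" hang under it, and "machine" stays under "translation".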

Example Translation
Transfer starts: clone the t-tree, then fill in target-language equivalents* (lemmas and formemes):
    be → lemmas: mít, být; formemes: v:fin, v:inf (Modality=hort, Conditional=1, Tense=PresSim)
      translation → lemmas: převod, překlad, posun; formemes: n:1 (Num=sg)
        machine → lemmas: počítač, strojový, stroj; formemes: n:2, adj:attr, n:attr
      easy → lemmas: snadný, jednoduchý; formemes: adj:compl, n:1, adv: (DoC=Positive)
* Dictionary translation: MaxEnt classifier, ~10^6 features

Example Translation
Select the best combination of lemmas & formemes (HMTM):
    be → lemmas: mít, být; formemes: v:fin, v:inf (Modality=hort, Conditional=1, Tense=PresSim)
      translation → lemmas: převod, překlad, posun; formemes: n:1 (Num=sg)
        machine → lemmas: počítač, strojový, stroj; formemes: n:2, adj:attr, n:attr
      easy → lemmas: snadný, jednoduchý; formemes: adj:compl, n:1, adv: (DoC=Positive)
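
Selecting the best combination over a tree can be sketched as Viterbi-style dynamic programming: each node's candidates carry a translation score, and each parent-child pair of target lemmas carries a compatibility score. The sketch below is an assumption-laden illustration, not the TectoMT models: all probabilities are invented, `DEFAULT` is a hypothetical back-off value, and only lemmas are chosen (formemes are left out for brevity).

```python
import math

DEFAULT = 0.01  # hypothetical back-off score for unseen parent-child pairs

def tree_viterbi(tree, root, cands, trans):
    """Pick one lemma per node maximizing emission * parent-child scores.

    tree:  dict node -> list of child nodes
    cands: dict node -> {lemma: emission probability}
    trans: dict (parent_lemma, child_lemma) -> probability
    """
    best = {}    # (node, lemma) -> best log-score of the subtree rooted at node
    choice = {}  # (node, lemma, child) -> child lemma used in that best subtree

    def up(node):
        for child in tree.get(node, []):
            up(child)
        for lemma, emission in cands[node].items():
            score = math.log(emission)
            for child in tree.get(node, []):
                c_lemma, c_score = max(
                    ((cl, math.log(trans.get((lemma, cl), DEFAULT)) + best[(child, cl)])
                     for cl in cands[child]),
                    key=lambda pair: pair[1],
                )
                choice[(node, lemma, child)] = c_lemma
                score += c_score
            best[(node, lemma)] = score

    def down(node, lemma, out):
        out[node] = lemma
        for child in tree.get(node, []):
            down(child, choice[(node, lemma, child)], out)
        return out

    up(root)
    root_lemma = max(cands[root], key=lambda l: best[(root, l)])
    return down(root, root_lemma, {})

# Candidate lemmas from the slide; every number below is invented for illustration.
tree = {"be": ["translation", "easy"], "translation": ["machine"]}
cands = {
    "be": {"mít": 0.4, "být": 0.6},
    "translation": {"převod": 0.4, "překlad": 0.35, "posun": 0.25},
    "easy": {"snadný": 0.55, "jednoduchý": 0.45},
    "machine": {"počítač": 0.3, "strojový": 0.3, "stroj": 0.4},
}
trans = {
    ("být", "překlad"): 0.2, ("být", "převod"): 0.1, ("mít", "překlad"): 0.1,
    ("být", "snadný"): 0.3, ("být", "jednoduchý"): 0.2,
    ("překlad", "strojový"): 0.5, ("překlad", "stroj"): 0.05, ("převod", "stroj"): 0.05,
}
selection = tree_viterbi(tree, "be", cands, trans)
```

With these invented scores, "převod" is the top dictionary candidate for "translation" in isolation, but the strong "překlad" + "strojový" pairing makes the decoder prefer "překlad", which is the point of scoring combinations rather than nodes one by one.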

Example Translation
Clone to a-tree; add core morphological & POS tags, agreement, and function words. Nodes:
- mít: Gen=MInanim, C=PastP, Num=sg
- překlad: Num=sg, Case=1
- .
- by
- být: C=inf
- snadný: Deg=pos, Case=1, Gen=MInanim
- strojový: Deg=pos, Case=1, Gen=MInanim

Example Translation
Rearrange clitics. Nodes:
- mít: Gen=MInanim, C=PastP, Num=sg
- překlad: Num=sg, Case=1
- .
- by
- být: C=inf
- snadný: Deg=pos, Case=1, Gen=MInanim
- strojový: Deg=pos, Case=1, Gen=MInanim

Example Translation
Synthesize word forms: měl, překlad, ., by, být, snadný, strojový
... and flatten the tree (capitalize, space):
Strojový překlad by měl být snadný.
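
The final flattening step can be sketched in a few lines. The sketch assumes that word forms are already generated and that each node carries its final word-order index after clitic rearrangement; the function name `flatten` and the (order, form) encoding are hypothetical.

```python
def flatten(nodes):
    """Join word forms in surface order, with no space before punctuation.

    nodes: list of (order, form) pairs from the synthesized a-tree, in any order.
    """
    forms = [form for _, form in sorted(nodes)]
    out = ""
    for form in forms:
        if not out or form in {".", ",", "!", "?"}:
            out += form          # sentence start or punctuation: no leading space
        else:
            out += " " + form
    return out[:1].upper() + out[1:]  # capitalize the first letter only

# Nodes as they might come out of a tree traversal (order index, word form).
nodes = [(4, "měl"), (2, "překlad"), (7, "."), (3, "by"),
         (5, "být"), (6, "snadný"), (1, "strojový")]
sentence = flatten(nodes)
```

For these nodes, `sentence` is "Strojový překlad by měl být snadný.", matching the output on the slide.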

Results
- WMT constrained task, en → cs: TectoMT, Moses (Prague), Moses (Edinburgh) tied 1st
- Unconstrained: (subj. eval.)
- BLEU: all < 0.17

The Future
- Non-isomorphic trees
  - Better breakdown to treelets and/or parameter training (than in STSG)
- Multiple paths / n-best lists
  - At least until statistical components
- Combine with Moses (using input lattices)
  - Two “languages”: original & Czech by TectoMT
- Moses with syntactic and semantic factors
- Still more generalized syntax and semantics (AMR/MRS and beyond?)

Acknowledgements
- “Information Society” Programme 1ET101120503
- Czech Science Foundation GA405/09/0729
- Charles Univ. student grants 116310, 158010, 3537/2011
- European projects (in part) 034434, 034291, 231720, 247762
- Charles University research funds (“PRVOUK”)
- Czech Science Foundation GPP406/10/P193
- Czech Science Foundation GAP406/10/0875
- Ministry of Education Czech Rep. LC536, MSM0021620838
- Ministry of Education Czech Rep. ME09008, 7Ennnn
- European projects (part) 249119, 257528

References
- Zdeněk Žabokrtský, Martin Popel: Hidden Markov Tree Model in Dependency-based Machine Translation. In ACL 2009, pp. 145-148.
- David Mareček, Martin Popel, Zdeněk Žabokrtský: Maximum Entropy Translation Model in Dependency-Based MT Framework. Joint 5th Workshop on Statistical Machine Translation and MetricsMATR, ACL 2010, Uppsala, Sweden, pp. 201-206.
- Ondřej Dušek, Zdeněk Žabokrtský, Martin Popel, Martin Majliš, Michal Novák and David Mareček: Formemes in English-Czech Deep Syntactic MT. In WMT’12, Montréal, Canada, pp. 267-274.
- Martin Popel, Zdeněk Žabokrtský: TectoMT: Modular NLP Framework. IceTAL 2010, 7th International Conference on Natural Language Processing, Reykjavík, Iceland, pp. 293-304.
- TectoMT at WMT 12: http://www.statmt.org/wmt12/pdf/WMT02.pdf

Thank you!