Learning Non-Isomorphic Tree Mappings for Machine Translation. Jason Eisner, Johns Hopkins Univ.

Presentation transcript:

Learning Non-Isomorphic Tree Mappings for Machine Translation
Jason Eisner, Johns Hopkins Univ.
[Title-slide figure: two non-isomorphic dependency trees, one headed by "wrongly report ... the events of a b ... to-John" and one by "misinform ... him ... of the events A B", annotated "2 words become 1", "reorder dependents", and "0 words become 1".]

Syntax-Based Machine Translation
Previous work assumes essentially isomorphic trees
– Wu 1995, Alshawi et al. 2000, Yamada & Knight 2000
But trees are not isomorphic!
– Discrepancies between the languages
– Free translation in the training data
[Figure: the non-isomorphic tree pair from the title slide (wrongly report ... to-John / misinform ... him ... of the events).]

Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
[Figure: French dependency tree for "beaucoup d'enfants donnent un baiser à Sam" paired with the English tree for "kids kiss Sam quite often".]

Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the same tree pair, decomposed into aligned little trees with NP and Adv frontier nodes; the English adverbs "quite" and "often" align to null (empty) trees on the French side.]

Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. A much worse alignment...
[Figure: the same tree pair under a much worse alignment.]

Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from little trees...
[Figure: the decomposition into aligned little trees, expanding from the Start symbol.]

Grammar = Set of Elementary Trees
[Figure: the elementary tree pairs extracted from the alignment. The pair donnent (give) / un (a) baiser (kiss) / à (to) ↔ kiss is an idiomatic translation.]

Grammar = Set of Elementary Trees
[Figure: the same grammar, now also extracting the one-word pairs enfants (kids) ↔ kids and Sam ↔ Sam.]

Grammar = Set of Elementary Trees
[Figure: the little tree beaucoup (lots) d' (of) is added to the grammar; beaucoup d' deletes inside the tree.]

Grammar = Set of Elementary Trees
[Figure: the pair for beaucoup (lots) d' (of); beaucoup d' matches nothing in English, i.e. it pairs with an empty tree.]

Grammar = Set of Elementary Trees
[Figure: the adverbial subtrees for "quite" and "often" pair with null; an adverbial subtree matches nothing in French.]

Probability model similar to PCFG
Probability of generating training trees T1, T2 with alignment A:
P(T1, T2, A) = ∏ p(t1, t2, a | n), the product of the probabilities of the little trees that are used.
Each p(t1, t2, a | n) is given by a maximum entropy model.
[Figure: an example little tree pair, a VP "wrongly report NP" paired with a VP "misinform NP".]
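In LaTeX, the factorization the slide intends (the transcript drops the product sign; the index-set notation is my reconstruction):

  P(T_1, T_2, A) \;=\; \prod_{(t_1,\, t_2,\, a) \,\in\, A} p(t_1, t_2, a \mid n)

where each factor is the probability of one little tree pair (t_1, t_2) with internal alignment a, conditioned on the nonterminal state n at which it is substituted.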

Form of Model of Big Tree Pairs
Joint model P(T1, T2). Wise to use the noisy-channel form P(T1 | T2) * P(T2): P(T2) could be trained on zillions of target-language trees, whereas P(T1 | T2) must train on paired trees (hard to get). But any joint model will do.
In synchronous TSG, an aligned big tree pair is generated by choosing a sequence of little tree pairs:
P(T1, T2, A) = ∏ p(t1, t2, a | n)
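Spelled out (my arrangement of the slide's pieces): the joint model over big trees marginalizes out the alignment, and may be factored noisy-channel style:

  P(T_1, T_2) \;=\; \sum_{A} P(T_1, T_2, A), \qquad P(T_1, T_2) \;=\; P(T_1 \mid T_2)\, P(T_2)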

Maxent Model of Little Tree Pairs
FEATURES:
– report + wrongly → misinform? (use dictionary)
– report → misinform? (at root)
– wrongly → misinform?
– verb incorporates adverb child?
– verb incorporates child 1 of 3?
– children 2, 3 switch positions?
– common tree sizes & shapes? ... etc.
[Figure: the features score the example little tree pair "wrongly report NP" (VP) ↔ "misinform NP" (VP).]
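A minimal sketch of what such a feature function might look like; the little-tree representation, the feature names, and the dictionary lookup are my own toy assumptions, not the paper's implementation:

def little_tree_features(src, tgt, dictionary):
    """Binary features for a little tree pair.
    Each little tree is a tuple (head_word, children); a child is either
    another such tuple or a frontier nonterminal label like "NP".
    dictionary is a set of (source, target) string pairs."""
    src_head, src_kids = src
    tgt_head, tgt_kids = tgt
    feats = set()
    feats.add("root:%s->%s" % (src_head, tgt_head))               # report -> misinform?
    if (src_head, tgt_head) in dictionary:                        # use dictionary
        feats.add("dict_root")
    for kid in src_kids:
        if isinstance(kid, tuple):                                # lexical child, e.g. "wrongly"
            feats.add("child:%s->%s" % (kid[0], tgt_head))        # wrongly -> misinform?
            if (kid[0] + "+" + src_head, tgt_head) in dictionary:
                feats.add("dict_incorporates_child")              # report+wrongly -> misinform?
    feats.add("shape:%d->%d" % (len(src_kids), len(tgt_kids)))    # common tree sizes & shapes
    return feats

# The maxent model then sets p(t1, t2, a | n) proportional to exp(weights . features),
# normalized over the competing little tree pairs.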

Inside Probabilities
The inside probability of an aligned node pair sums over the little tree pairs rooted there, e.g.:
β_VP(report, misinform) = ... + p(little tree pair rooted at report ↔ misinform | VP) * β(frontier pair 1) * β(frontier pair 2) + ...
[Figure: the example trees for "wrongly report the events ... to-John" and "misinform him of the events".]

Inside Probabilities
β_VP(report, misinform) = ... + p("wrongly report NP NP" ↔ "misinform NP NP" | VP) * β_NP(events, events) * β_NP(to-John, him) + ...
Only O(n²) aligned node pairs need inside probabilities.
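The recursion these two slides instantiate, in LaTeX (β and the state subscripts are my reconstruction of the slide's notation, not necessarily the paper's):

  \beta_q(c_1, c_2) \;=\; \sum_{(t_1, t_2, a)\ \text{rooted at}\ (c_1, c_2)\ \text{with state}\ q} p(t_1, t_2, a \mid q) \prod_{(d_1, d_2) \in a} \beta_{q'}(d_1, d_2)

Here (d_1, d_2) ranges over the frontier-node pairs matched by a, and q' is the state required at that pair. Since c_1 and c_2 each range over the O(n) nodes of their trees, only O(n²) β values are ever needed.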

Alignment: find A to max P(T1, T2, A)
Decoding: find T2, A to max P(T1, T2, A)
Training: find θ to max Σ_A P(T1, T2, A)
Do everything on little trees instead! We only need to train & decode a model of p(t1, t2, a). But we are not sure how to break up a big tree correctly,
– so try all possible little trees & all ways of combining them, by dynamic programming.
P(T1, T2, A) = ∏ p(t1, t2, a | n)
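The three problems in display form (the parameter vector θ is dropped in the transcript; I have restored it):

  \hat{A} = \arg\max_{A} P_{\theta}(T_1, T_2, A), \qquad
  (\hat{T}_2, \hat{A}) = \arg\max_{T_2,\, A} P_{\theta}(T_1, T_2, A), \qquad
  \hat{\theta} = \arg\max_{\theta} \sum_{A} P_{\theta}(T_1, T_2, A)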

Alignment Pseudocode
for each node c1 of T1 (bottom-up)
  for each possible little tree t1 rooted at c1
    for each node c2 of T2 (bottom-up)
      for each possible little tree t2 rooted at c2
        for each matching a between frontier nodes of t1 and t2
          p = p(t1, t2, a)
          for each pair (d1, d2) of frontier nodes matched by a
            p = p * β(d1, d2)          // inside probability of kids
          β(c1, c2) = β(c1, c2) + p    // our inside probability
Nonterminal states are used in practice but not shown here.
For EM training, also find outside probabilities.
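A runnable toy version of this inside pass, under simplifying assumptions that are mine, not the paper's: every little tree is a single node whose children are all frontier nodes, there are no nonterminal states, and frontier nodes are matched monotonically (in order). Trees are nested lists [word, child, child, ...], and p is any scoring function the caller supplies.

import itertools

def inside(T1, T2, p):
    """Compute beta[(id(c1), id(c2))] bottom-up for every node pair.
    p(w1, w2, k) scores pairing word w1 with word w2 when k child pairs match."""
    beta = {}

    def bottom_up(t):                     # yield children before parents
        for child in t[1:]:
            yield from bottom_up(child)
        yield t

    for c1 in bottom_up(T1):
        for c2 in bottom_up(T2):
            kids1, kids2 = c1[1:], c2[1:]
            total = 0.0
            # every way of choosing k children on each side and matching them in order;
            # unmatched children are simply dropped in this toy, standing in for the
            # paper's empty little trees
            for k in range(min(len(kids1), len(kids2)) + 1):
                for sub1 in itertools.combinations(kids1, k):
                    for sub2 in itertools.combinations(kids2, k):
                        prob = p(c1[0], c2[0], k)
                        for d1, d2 in zip(sub1, sub2):
                            prob *= beta[(id(d1), id(d2))]   # inside probability of kids
                        total += prob
            beta[(id(c1), id(c2))] = total                   # our inside probability
    return beta

# e.g. inside(["report", ["wrongly"], ["events"]], ["misinform", ["him"]],
#             lambda w1, w2, k: 0.1)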

An MT Architecture
[Architecture diagram: a dynamic programming engine connects a probability model p(t1, t2, a) of little trees to a Trainer and a Decoder.
– Trainer: scores all alignments of two big trees T1, T2; the model scores each possible little tree pair (t1, t2, a); inside-outside estimated counts update the parameters.
– Decoder: scores all alignments between a big tree T1 and a forest of big trees T2; for each possible t1 the model proposes translations t2 of little tree t1 and scores each proposed (t1, t2, a); the Viterbi alignment yields the output T2.]
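One way to read the diagram as an interface; this is my rendering of the shared model, not code from the workshop system:

class LittleTreeModel:
    """Probability model p(t1, t2, a) of little trees, shared by trainer and decoder."""

    def score(self, t1, t2, a):
        """Return p(t1, t2, a) for one little tree pair with internal alignment a."""
        raise NotImplementedError

    def propose(self, t1):
        """Yield candidate (t2, a) translations of little tree t1 (used by the decoder)."""
        raise NotImplementedError

    def update(self, expected_counts):
        """Re-estimate parameters from inside-outside expected counts (used by the trainer)."""
        raise NotImplementedError

The trainer drives the dynamic programming engine over two known big trees and feeds expected counts back through update; the decoder drives it between T1 and a forest of candidate outputs, then reads off the Viterbi alignment as T2.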

Related Work
Synchronous grammars (Shieber & Schabes 1990); statistical work has allowed only 1:1 (isomorphic trees)
– Stochastic inversion transduction grammars (Wu 1995)
– Head transducer grammars (Alshawi et al. 2000)
Statistical tree translation
– Noisy channel model (Yamada & Knight 2000): infers the tree, training on (string, tree) pairs rather than (tree, tree) pairs; but again allows only 1:1, plus 1:0 at leaves
– Data-oriented translation (Poutsma 2000): synchronous DOP model trained on already aligned trees
Statistical tree generation, similar to our decoding: construct a forest of appropriate trees, pick the one with highest probability
– Dynamic programming search in a packed forest (Langkilde 2000)
– Stack decoder (Ratnaparkhi 2000)

What Is New Here?
Learning full elementary tree pairs, not rule pairs or subcat pairs
– Previous statistical formalisms have basically assumed isomorphic trees
Maximum-entropy modeling of elementary tree pairs
New, flexible formalization of synchronous Tree Substitution Grammar
– Allows either dependency trees or phrase-structure trees
– Empty trees permit insertion and deletion during translation
– Concrete enough for implementation (cf. informal previous descriptions)
– TSG is more powerful than CFG for modeling trees, but faster than TAG
Observation that dynamic programming is surprisingly fast
– Finds all possible decompositions into aligned elementary tree pairs
– O(n²) if both input trees are fully known and elementary tree size is bounded

Status & Thanks
Developed and implemented during the JHU CLSP summer workshop 2002 (funded by NSF).
Other team members: Jan Hajič, Bonnie Dorr, Dan Gildea, Gerald Penn, Drago Radev, Owen Rambow, and students Martin Cmejrek, Yuan Ding, Terry Koo, Kristen Parton.
Also being used for other kinds of tree mappings:
– between deep structure and surface structure, or semantics and syntax
– between original text and a summarized/paraphrased/plagiarized version
Results forthcoming (that's why I didn't submit a full paper).

Summary
Most MT systems work on strings. We want to translate trees, so as to respect syntactic structure. But don't assume that translated trees are structurally isomorphic!
TSG formalism: translation locally replaces tree structure and content.
Parameters: probabilities of local substitutions (use a maxent model).
Algorithms: dynamic programming (local substitutions can't overlap).
EM training on tree pairs can be fast:
– Align O(n) tree nodes with O(n) tree nodes, respecting subconstituency
– Dynamic programming finds all alignments; retrain using EM
– Faster than aligning O(n) words with O(n) words
– If the correct training tree is unknown, a well-pruned parse forest still has O(n) nodes