Syntax-based Statistical Machine Translation Models Amr Ahmed March 26th 2008
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Why Syntax? Syntax-based SMT Models Summary
Statistical Machine Translation Problem
- Given a sentence (f) in one language, produce its equivalent in another language (e)
- [Figure: Arabic text -> Machine Translation System -> English?]
- "One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Arabic, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'" -- Warren Weaver, 1947
Statistical Machine Translation Problem
- Given a sentence (f) in one language, produce its equivalent in another language (e)
- Noisy Channel Model: e -> Noisy Channel -> f
- P(e) models good English; P(f|e) models good translation
- We know how to factor P(e)!
- Today: how to factor P(f|e)?
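Spelled out, the noisy-channel decision rule that the rest of the talk builds on is the standard Bayes decomposition:

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\,P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} \underbrace{P(e)}_{\text{language model}}\;\underbrace{P(f \mid e)}_{\text{translation model}}
```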
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Word-based Models Phrase-based Models Why Syntax? Syntax-based SMT Models Summary
Word-Translation Models
- Auf diese Frage habe ich leider keine Antwort bekommen
- NULL I did not unfortunately receive an answer to this question
- (The word links drawn between the two sentences are not observed in the data)
- What is the generative story? IBM Models 1-4
- Roughly equivalent to an FST (modulo reordering)
- Learning and decoding?
Slide credit: adapted from Smith et al.
Word-Based Translation Models
- In a nutshell: stochastic channel operations (fertility, translation, deletion, re-ordering), each associated with probabilities, estimated using EM (see the sketch below)
- Q: What are we learning? A: Word movement
- Linguistic hypothesis behind phrase-based models: (1) words move in blocks; (2) context is important
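As a concrete picture of what "estimated using EM" means here, below is a minimal sketch of IBM Model 1 training: only the lexical translation table t(f|e), with no fertility, insertion, or reordering; the toy corpus is invented for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (invented for illustration).
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

# Initialize t(f|e) uniformly over the foreign vocabulary.
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected counts c(f, e)
    total = defaultdict(float)           # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            # E-step: distribute each f word over all e words
            # in proportion to the current t(f|e).
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    # M-step: count and normalize.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(t[("haus", "house")])   # converges toward 1.0 on this toy data
```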
Phrase-Based Translation Models
- Stochastic channel operations: segment, translate, re-order; associated with probabilities; estimated using EM
- Q: What are we learning? A: Word movement in blocks
- Linguistic hypothesis: (1) words move in blocks; (2) context is important
- Re-ordering has a Markovian dependency between adjacent phrases
Phrase-Based Models: Example
- Auf diese Frage habe ich leider keine Antwort bekommen -> I did not unfortunately receive an answer to this question
- Not necessarily syntactic phrases; the division into phrases is hidden
- Score each phrase pair using several features
Slide credit: Smith et al.
Phrase Table Estimation
- Basically: count and normalize (a sketch follows)
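A minimal sketch of "count and normalize" for relative-frequency phrase translation probabilities, assuming phrase pairs have already been extracted from word-aligned data (the extraction step itself is omitted; the toy pairs are invented):

```python
from collections import Counter

# Extracted phrase pairs (f_phrase, e_phrase) -- invented toy data.
phrase_pairs = [
    ("keine Antwort", "no answer"),
    ("keine Antwort", "no answer"),
    ("keine Antwort", "not an answer"),
    ("diese Frage", "this question"),
]

pair_count = Counter(phrase_pairs)
f_count = Counter(f for f, _ in phrase_pairs)

# Relative frequency: phi(e|f) = count(f, e) / count(f)
phi = {(f, e): c / f_count[f] for (f, e), c in pair_count.items()}

print(phi[("keine Antwort", "no answer")])   # 2/3
```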
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Word-based Models Phrase-based Models Why Syntax? Syntax-based SMT Models Summary
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Why Syntax? Syntax-based SMT Models Summary
Why Syntax?
Reference: consequently proposals are submitted to parliament under the assent procedure, meaning that parliament can no longer table amendments, as directives in this area were adopted as single market legislation under the codecision procedure on the basis of art.100a tec.
Translation: consequently, the proposals parliament after the assent procedure, the tabled amendments for offers no possibility of community directives, because as part of the internal market legislation on the basis of article 100a of the treaty in the codecision procedure have been adopted.
Slide credit: example from Cowan et al.
Why Syntax? What Went Wrong?
- (Same reference and system translation as the previous slide)
- Phrase-based systems are very good at predicting content words, but are less accurate at producing function words, or at producing output that correctly encodes grammatical relations between content words
- Here syntax can help!
Slide credit: adapted from Cowan et al.
Does Adding More Structure Help?
- [Figure: a source sentence Se = x1 x2 x3 passes through the noisy channel and comes out as Sf = x2 x1 x3]
- Word-based -> phrase-based -> syntax-based: better performance?
Syntax and the Translation Pipeline
- Syntax can enter at three points: pre-reordering the input, inside the translation model, and in post-processing (re-ranking) of the output
- This talk: syntax in the translation model
Early Experiments (Koehn et al. 2003)
- Fix a phrase-based system and vary the way phrases are extracted: frequency-based, generative, constituent-only
- Adding syntax hurts performance: phrase pairs like "there is <-> es gibt" are not constituents (the constituent restriction eliminates about 80% of phrase pairs)
- Explanation: no hierarchical re-ordering, so syntax is not fully exploited here; parse trees also introduce errors
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Why Syntax? Syntax-based SMT Models Summary
The Big Picture: Translation Models
- [Vauquois-style diagram: analysis from the source string up through syntax toward an interlingua, and generation back down to the target string]
- Word-based and phrase-based models map string to string directly
- SCFG (Chiang 2005) and ITG (Wu 97) also work string-to-string, with non-linguistic trees as hidden structure
- Tree-string, string-tree, and tree-tree transducers work at the syntax level
Learning Synchronous Grammar
- String-level models with hidden syntax: SCFG (Chiang 2005), ITG (Wu 97)
- No linguistic annotation; model P(e,f) jointly; trees are hidden variables
- EM doesn't work well with this much missing information, so restrictions are imposed:
  - structural restriction: binary rules (ITG, Wu 97)
  - lexical restriction: Chiang 2005, an SCFG representing hierarchical phrases
- But first: what is a synchronous grammar?
Interlude: Synchronous Grammar
- Extension of a monolingual theory to bitext: CFG -> SCFG, TAG -> STAG, etc.
- Monolingual parsers are extended to bitext parsing
Synchronous Grammar: SCFG
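To make synchronized rewriting concrete, here is a tiny runnable sketch (the grammar and sentence pair are invented): every rule rewrites a linked nonterminal on both sides at once, so a single derivation yields a sentence pair, including any re-ordering.

```python
# A toy SCFG: each nonterminal rewrites to a (source, target) pair of
# sequences; digits link nonterminal occurrences across the two sides.
RULES = {
    "S":   (["NP0", "VP1"], ["NP0", "VP1"]),
    "NP":  (["ich"], ["i"]),
    "VP":  (["OBJ0", "bekam"], ["received", "OBJ0"]),  # verb-final vs. SVO swap
    "OBJ": (["keine", "Antwort"], ["no", "answer"]),
}

def derive(symbol):
    """Expand `symbol` on both sides simultaneously."""
    src_rhs, tgt_rhs = RULES[symbol]
    src, tgt, sub = [], [], {}
    for tok in src_rhs:
        name = tok.rstrip("0123456789")
        if name in RULES:           # linked nonterminal, e.g. "OBJ0"
            sub[tok] = derive(name)
            src += sub[tok][0]
        else:                       # terminal
            src.append(tok)
    for tok in tgt_rhs:             # target side reuses the same expansions
        tgt += sub[tok][1] if tok in sub else [tok]
    return src, tgt

src, tgt = derive("S")
print(" ".join(src))   # ich keine Antwort bekam
print(" ".join(tgt))   # i received no answer
```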
Hierarchical Phrase-based Models
- Phrase-based models map Se = x1 x2 x3 to Sf = x2 x1 x3 as a flat segmentation; the hierarchical model lets phrases nest inside phrases
- SCFG representation (reconstructed from the slide's derivation figure):
  S -> <X1, X1>
  S -> <S1 X2, S1 X2>
  X -> <e3 X2 e4, f3 f4 X2>
  X -> <e1, f1>
  X -> <e5 e6, f6 f5>
Example (Chiang 2005)
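The often-quoted motivating rules from Chiang (2005) pair a Mandarin construction with its re-ordered English counterpart, roughly: X -> <yu X1 you X2, have X2 with X1> and X -> <X1 de X2, the X2 that X1>. A single formally-syntactic rule thus captures both lexical translation and long-distance re-ordering.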
Hierarchical Phrase-based Models (cont.)
- Same SCFG representation as the previous slide
- Restrictions: at most two recursive phrases (nonterminals) per rule; restrictions on phrase length
- Question 1: how to train the model?
- Question 2: how to decode?
Training and Decoding
- Collect initial grammar rules (see the extraction sketch below)
- Tune rule weights: count and normalize!
- Decoding: CYK (remember, rules have at most two nonterminals); parse the f side only, reading the e side off the synchronous rules
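A rough sketch of the rule-collection idea (the helper and the two toy phrase pairs are invented): start from an aligned phrase pair and subtract a smaller contained pair, replacing it with a linked X on both sides.

```python
def subtract(big, small):
    """From aligned phrase pair `big`, cut out the contained pair `small`
    and replace it with a linked nonterminal X1 on both sides."""
    (f_big, e_big), (f_small, e_small) = big, small
    if f_small not in f_big or e_small not in e_big:
        return None
    f_rhs = f_big.replace(f_small, "X1", 1)
    e_rhs = e_big.replace(e_small, "X1", 1)
    return f"X -> < {f_rhs} , {e_rhs} >"

big = ("keine Antwort bekommen", "receive no answer")    # toy phrase pair
small = ("keine Antwort", "no answer")                   # contained pair
print(subtract(big, small))
# X -> < X1 bekommen , receive X1 >   (the rule has learned a re-ordering)
```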
Does it help? Experimental Details
- Mandarin-to-English (FBIS corpus), 7.2M + 9.2M words
- Devset: NIST 2002 MT evaluation; test set: NIST 2003 MT evaluation
- 7.5% relative improvement over phrase-based models in BLEU score (0.02 absolute over the baseline)
Does it help?
- 7.5% relative improvement over phrase-based models
- Learnt rules are formally SCFG but not linguistically interpretable
- The model learns re-ordering patterns guided by lexical function words
- Captures long-range movements via recursion
Follow-Up Study (Zollmann et al. 2006, 2007)
- Why not decorate the phrases with their grammatical constituents?
- Where possible, decorate each phrase with a constituent label
- Generalize phrases as in Chiang 2005; parse using chart parsing
- Moved from 31.85 to 32.15 BLEU over the CMU phrase-based system on a Spanish-English corpus
The Big Picture: Translation Models (roadmap revisited)
- Next stop: tree-string transducers, which bring linguistic syntax in on the English side
Tree-String Transducers
- Linguistic tools: English parse trees from a statistical parser; alignments from GIZA++
- Conditional model P(f | Te)
- Models differ on how to factor P(f | Te) and on the domain of locality: SCFG (Yamada & Knight 2001); STSG (Galley et al. 2004)
- Caveat: errors in the parses and alignments propagate into the model
Tree-String (Yamada & Knight)
- Back to the noisy channel model: transduce Te into f
- Stochastic channel operations (on trees):
  - reorder children
  - insert nodes
  - translate leaf words
Channel Operations
- Example probabilities (reconstructed from the slide figure): reorder P(VB TO -> TO VB); insert P(right | PRP) * P_ins(ha)
Learning
- Learn channel operation probability tables: reordering, insertion, translation
- Standard EM training: E-step computes expected rule counts (dynamic programming); M-step counts and normalizes
Decoding as Parsing
- In a nutshell, we learnt how to parse the foreign side
- Add the CFG rules from the English side, plus channel rules (schemas reconstructed from the slide):
  - reordering: if (VB2 -> VB TO) can be reordered as (TO VB), add the rule VB2 -> TO VB
  - insertion: V -> X V and V -> V X, with X -> fi
  - translation: ei -> fi
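A minimal sketch of how such an f-side grammar could be assembled from the channel tables (all table contents, weights, and words are invented; insertion rules are omitted for brevity):

```python
# Invented toy channel tables.
reorder = {("VB", "TO"): {("TO", "VB"): 0.7, ("VB", "TO"): 0.3}}
translate = {"go": {"iku": 0.9}, "to": {"ni": 0.8}}
english_rules = [("VB2", ("VB", "TO"))]     # CFG rules from the English side

f_grammar = []   # (lhs, rhs, prob) productions for parsing f

# Reordering: one f-side production per permutation of the children.
for lhs, rhs in english_rules:
    for perm, p in reorder.get(rhs, {rhs: 1.0}).items():
        f_grammar.append((lhs, perm, p))

# Translation: preterminal rules rewriting English words to foreign words.
for e_word, choices in translate.items():
    for f_word, p in choices.items():
        f_grammar.append((e_word, (f_word,), p))

for rule in f_grammar:
    print(rule)
# ('VB2', ('TO', 'VB'), 0.7), ('VB2', ('VB', 'TO'), 0.3), ...
```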
Decoding Example
Results and Expressiveness
- English-Chinese task, short sentences (< 20 words; 3M-word corpus); test set: 347 sentences with at most 14 words
- Better BLEU score (0.102) than IBM Model 4 (0.072)
- What can it represent? Depends on the syntactic divergence between the language pair
- The trees must be isomorphic up to child re-ordering; channel rules only permute the children of a single node
- Q: What can't it model? Anything beyond child re-ordering
Limitations Can’t model syntactic movements that cross brackets SVO to VSO Modal movement between English and French Not ne .. pas (from English to French) VP VP VP VP VP …. VB Aux Does Not go ne va pas The span of Not can’t intersect that of Go Can’t Interleave Green with the other two
Limitations: Possible Solutions
- Follow-up studies showed relative improvements
- Gildea 2003 added cloning operations; AER went from 0.42 to 0.30 on a Korean-English corpus
Tree-String Transducers (recap)
- Models differ on how to factor P(f | Te) and on the domain of locality
- Just covered: SCFG (Yamada & Knight 2001); next: STSG (Galley et al. 2004)
Learning Expressive Rules (Galley et al. 2004)
- Yamada & Knight: channel operation tables over one-level CFG rules (a VP and its immediate children)
- Galley et al. 2004: rule extraction yields TSG rules -- parsing rules for the f-side string f1, f2, ..., fn -- that condition on larger fragments of the trees
Rule Format and Decoding
- Rules pair an English tree fragment with a foreign-side string, e.g. VP(Aux(Does) RB(Not) x2:VB) -> ne x2 pas
- The tree is built bottom-up; the foreign string at each derivation step may still contain nonterminals (a state fi such as "... ne VB pas ..." becomes fi+1 after a rule rewrites the fragment)
- Rules are extracted from the training corpus: English-side trees, foreign-side strings, alignments from GIZA++
Rule Extraction
- Example: "he Does Not go" <-> "il ne va pas"
- Upward projection: annotate each tree node with the f-span of its yield: he{il}, PRP{il}, NP{il}, Does{ne pas}, Not{ne pas}, Aux{ne pas}, RB{ne pas}, go{va}, VB{va}, VP{ne va pas}, S{il ne va pas}
- Frontier nodes: nodes whose span is exclusive (disjoint from everything aligned outside the node): S, NP, PRP, he, VP, VB, go
- The frontier graph determines where rules are cut; extract a rule at each frontier node as before
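A compact sketch of the span/frontier computation behind this slide (the tree encoding, alignment dictionary, and function names are invented; the definition follows the slide: a node is frontier iff the closure of its f-span is disjoint from the f positions aligned outside its subtree):

```python
# Tree encoded as nested (label, children) pairs; leaves have no children.
TREE = ("S",
        [("NP", [("PRP", [("he", [])])]),
         ("VP", [("Aux", [("Does", [])]),
                 ("RB",  [("Not", [])]),
                 ("VB",  [("go", [])])])])

# Alignment: e-word -> f positions in "il ne va pas" (0-based).
ALIGN = {"he": {0}, "Does": {1, 3}, "Not": {1, 3}, "go": {2}}

def yield_words(node):
    label, children = node
    return [label] if not children else [w for c in children for w in yield_words(c)]

def f_span(node):
    return {i for w in yield_words(node) for i in ALIGN.get(w, set())}

def frontier_nodes(node, outside):
    """Collect labels of frontier nodes: f-span closure disjoint from `outside`."""
    label, children = node
    s = f_span(node)
    closure = set(range(min(s), max(s) + 1)) if s else set()
    result = [label] if s and not (closure & outside) else []
    for i, child in enumerate(children):
        sib_span = set()
        for j, sib in enumerate(children):
            if j != i:
                sib_span |= f_span(sib)
        result += frontier_nodes(child, outside | sib_span)
    return result

print(frontier_nodes(TREE, set()))
# ['S', 'NP', 'PRP', 'he', 'VP', 'VB', 'go']  -- matches the slide
```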
Illustrating Rule Extraction
- Rules read off the frontier graph (input tree fragment -> output string):
  S(x1:NP x2:VP) -> x1 x2
  VP(Aux(Does) RB(Not) x:VB) -> ne x pas
  VB(go) -> va
Minimality of Extracted Rules
- Other rules can be composed from these minimal rules, e.g.:
  VP(Aux(Does) RB(Not) x:VB) -> ne x pas  composed with  VB(go) -> va
  gives  VP(Aux(Does) RB(Not) VB(go)) -> ne va pas
Probability Estimation
- Just EM: a modified inside-outside algorithm for the E-step
- Decoding as parsing
- Training can be done using off-the-shelf tree-transducer toolkits (Knight et al. 2004)
Evaluation
- Coverage (how well the learnt rules explain the corpus): ~100% coverage on the French-English and Chinese-English corpora
- Translation results: the decoder was still work in progress
The Big Picture: Translation Models (roadmap revisited)
- Last stop: tree-tree transducers, with syntax on both sides
Tree-Tree Transducers
- Linguistic tools: English and foreign parse trees from statistical parsers; alignments from GIZA++
- Conditional model P(Tf | Te); models differ on how to factor P(Tf | Te)
- Really many, many, many variants: CFG vs. dependency trees; trained with EM (most of them) or discriminatively (Collins 2006), which directly models P(Te | Tf)
- Same caveat as before
Discriminative Tree-Tree Models
- Directly model P(Te | Tf); translate from German to English
- Extended projections (EPs): elementary trees from TAG with one verb, lexical function words, and NP/PP placeholders
- Learns to map between tree fragments: German clause -> EP
- Modeled as structured learning
How to Decode?
- Why no generative story? Because this is a direct model!
- Given a German string: parse it; break it into clauses; predict an EP for each clause (this is where the structured learning comes in); translate German NPs and PPs using Pharaoh; map the translated German NPs/PPs into the placeholders in the EP; stitch the clauses together to get the English translation
How to Train
- Training data: aligned clauses (e1, g1) ... (en, gn)
- Extraction procedure: parse the English and German sides; align NPs/PPs between them using GIZA++; break the parse trees into clauses; order clauses based on verb position; discard sentences with different numbers of clauses
How to train (2)
How to train (3)
- (X, Y) is a training pair -- just our good old perceptron friend!
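For completeness, a bare-bones structured perceptron of the kind the slide alludes to (the feature map `phi`, the candidate generator, and all names are placeholders invented for this sketch):

```python
import numpy as np

def structured_perceptron(data, candidates, phi, dim, epochs=5):
    """data: list of (x, y_gold) pairs; candidates(x): iterable of possible
    outputs for x; phi(x, y): feature vector (np.ndarray of length dim)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            # Predict the highest-scoring candidate under the current weights.
            y_hat = max(candidates(x), key=lambda y: w @ phi(x, y))
            if y_hat != y_gold:
                # Update toward the gold structure, away from the prediction.
                w += phi(x, y_gold) - phi(x, y_hat)
    return w
```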
Results
- German-English Europarl corpus: 750K training sentences, 441,000 training clauses; test on 2,000 sentences
- BLEU score: baseline 25.26, this system 23.96 -- largely because of the many restrictions imposed
- Human judgment: 62 judged equal, 16 better under this system, 22 better for the baseline
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Why Syntax? Syntax-based SMT Models Summary
Summary
- Syntax does help, but:
- What is the right representation? Is it language-pair specific?
- How to deal with parser errors? Modeling the uncertainty of the parsing process
- Large-scale syntax-based models: are they possible? What are the trade-offs?
- Better parameter estimation! Should we trust the GIZA++ alignment results? Block translation vs. word-to-word?
Thanks
Related Work
- Fast parsers for synchronous grammars; grammar binarization; fast k-best parsing
- Re-ranking
- Syntax-driven evaluation measures
- Impact of parsing quality on overall system performance
SAMT (Syntax-Augmented Machine Translation; the Zollmann et al. system discussed above)