Syntax-based Statistical Machine Translation Models Amr Ahmed March 26th 2008
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Why Syntax? Syntax-based SMT Models Summary
Statistical Machine Translation Problem
- Given a sentence (f) in one language, produce its equivalent in another language (e)
- [Figure: Arabic text -> Machine Translation System -> English?]
- "One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Arabic, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'" -- Warren Weaver, 1947
Statistical Machine Translation Problem
- Given a sentence (f) in one language, produce its equivalent in another language (e)
- Noisy Channel Model: e -> Noisy Channel -> f
- P(e) models good English; P(f|e) models good translation
- We know how to factor P(e)!
- Today: how to factor P(f|e)?
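Spelled out, the noisy-channel decision rule that the rest of the talk builds on is the standard Bayes decomposition:

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\,P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} \underbrace{P(e)}_{\text{language model}}\;\underbrace{P(f \mid e)}_{\text{translation model}}
```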
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Word-based Models Phrase-based Models Why Syntax? Syntax-based SMT Models Summary
Word-Translation Models
- Auf diese Frage habe ich leider keine Antwort bekommen
- NULL I did not unfortunately receive an answer to this question
- (The word links drawn between the two sentences are not observed in the data)
- What is the generative story? IBM Models 1-4
- Roughly equivalent to an FST (modulo reordering)
- Learning and decoding?
Slide credit: adapted from Smith et al.
Word-Based Translation Models
- In a nutshell: stochastic channel operations (fertility, translation, deletion, re-ordering), each associated with probabilities, estimated using EM (see the sketch below)
- Q: What are we learning? A: Word movement
- Linguistic hypothesis behind phrase-based models: (1) words move in blocks; (2) context is important
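As a concrete picture of what "estimated using EM" means here, below is a minimal sketch of IBM Model 1 training: only the lexical translation table t(f|e), with no fertility, insertion, or reordering; the toy corpus is invented for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (invented for illustration).
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

# Initialize t(f|e) uniformly over the foreign vocabulary.
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected counts c(f, e)
    total = defaultdict(float)           # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            # E-step: distribute each f word over all e words
            # in proportion to the current t(f|e).
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    # M-step: count and normalize.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(t[("haus", "house")])   # converges toward 1.0 on this toy data
```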
Phrase-Based Translation Models
- Stochastic channel operations: segment, translate, re-order; associated with probabilities; estimated using EM
- Q: What are we learning? A: Word movement in blocks
- Linguistic hypothesis: (1) words move in blocks; (2) context is important
- Re-ordering has a Markovian dependency between adjacent phrases
Phrase-Based Models: Example
- Auf diese Frage habe ich leider keine Antwort bekommen -> I did not unfortunately receive an answer to this question
- Not necessarily syntactic phrases; the division into phrases is hidden
- Score each phrase pair using several features
Slide credit: Smith et al.
Phrase Table Estimation
- Basically: count and normalize (a sketch follows)
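A minimal sketch of "count and normalize" for relative-frequency phrase translation probabilities, assuming phrase pairs have already been extracted from word-aligned data (the extraction step itself is omitted; the toy pairs are invented):

```python
from collections import Counter

# Extracted phrase pairs (f_phrase, e_phrase) -- invented toy data.
phrase_pairs = [
    ("keine Antwort", "no answer"),
    ("keine Antwort", "no answer"),
    ("keine Antwort", "not an answer"),
    ("diese Frage", "this question"),
]

pair_count = Counter(phrase_pairs)
f_count = Counter(f for f, _ in phrase_pairs)

# Relative frequency: phi(e|f) = count(f, e) / count(f)
phi = {(f, e): c / f_count[f] for (f, e), c in pair_count.items()}

print(phi[("keine Antwort", "no answer")])   # 2/3
```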
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Word-based Models Phrase-based Models Why Syntax? Syntax-based SMT Models Summary
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Why Syntax? Syntax-based SMT Models Summary
Why Syntax?
Reference: consequently proposals are submitted to parliament under the assent procedure, meaning that parliament can no longer table amendments, as directives in this area were adopted as single market legislation under the codecision procedure on the basis of art.100a tec.
Translation: consequently, the proposals parliament after the assent procedure, the tabled amendments for offers no possibility of community directives, because as part of the internal market legislation on the basis of article 100a of the treaty in the codecision procedure have been adopted.
Slide credit: example from Cowan et al.
Why Syntax? What Went Wrong?
- (Same reference and system translation as the previous slide)
- Phrase-based systems are very good at predicting content words, but are less accurate at producing function words, or at producing output that correctly encodes grammatical relations between content words
- Here syntax can help!
Slide credit: adapted from Cowan et al.
Does Adding More Structure Help?
- [Figure: a source sentence Se = x1 x2 x3 passes through the noisy channel and comes out as Sf = x2 x1 x3]
- Word-based -> phrase-based -> syntax-based: better performance?
Syntax and the Translation Pipeline
- Syntax can enter at three points: pre-reordering the input, inside the translation model, and in post-processing (re-ranking) of the output
- This talk: syntax in the translation model
Early Experiments (Koehn et al. 2003)
- Fix a phrase-based system and vary the way phrases are extracted: frequency-based, generative, constituent-only
- Adding syntax hurts performance: phrase pairs like "there is <-> es gibt" are not constituents (the constituent restriction eliminates about 80% of phrase pairs)
- Explanation: no hierarchical re-ordering, so syntax is not fully exploited here; parse trees also introduce errors
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Why Syntax? Syntax-based SMT Models Summary
The Big Picture: Translation Models
- [Vauquois-style diagram: analysis from the source string up through syntax toward an interlingua, and generation back down to the target string]
- Word-based and phrase-based models map string to string directly
- SCFG (Chiang 2005) and ITG (Wu 97) also work string-to-string, with non-linguistic trees as hidden structure
- Tree-string, string-tree, and tree-tree transducers work at the syntax level
Learning Synchronous Grammar
- String-level models with hidden syntax: SCFG (Chiang 2005), ITG (Wu 97)
- No linguistic annotation; model P(e,f) jointly; trees are hidden variables
- EM doesn't work well with this much missing information, so restrictions are imposed:
  - structural restriction: binary rules (ITG, Wu 97)
  - lexical restriction: Chiang 2005, an SCFG representing hierarchical phrases
- But first: what is a synchronous grammar?
Interlude: Synchronous Grammar
- Extension of a monolingual theory to bitext: CFG -> SCFG, TAG -> STAG, etc.
- Monolingual parsers are extended to bitext parsing
Synchronous Grammar: SCFG
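To make synchronized rewriting concrete, here is a tiny runnable sketch (the grammar and sentence pair are invented): every rule rewrites a linked nonterminal on both sides at once, so a single derivation yields a sentence pair, including any re-ordering.

```python
# A toy SCFG: each nonterminal rewrites to a (source, target) pair of
# sequences; digits link nonterminal occurrences across the two sides.
RULES = {
    "S":   (["NP0", "VP1"], ["NP0", "VP1"]),
    "NP":  (["ich"], ["i"]),
    "VP":  (["OBJ0", "bekam"], ["received", "OBJ0"]),  # verb-final vs. SVO swap
    "OBJ": (["keine", "Antwort"], ["no", "answer"]),
}

def derive(symbol):
    """Expand `symbol` on both sides simultaneously."""
    src_rhs, tgt_rhs = RULES[symbol]
    src, tgt, sub = [], [], {}
    for tok in src_rhs:
        name = tok.rstrip("0123456789")
        if name in RULES:           # linked nonterminal, e.g. "OBJ0"
            sub[tok] = derive(name)
            src += sub[tok][0]
        else:                       # terminal
            src.append(tok)
    for tok in tgt_rhs:             # target side reuses the same expansions
        tgt += sub[tok][1] if tok in sub else [tok]
    return src, tgt

src, tgt = derive("S")
print(" ".join(src))   # ich keine Antwort bekam
print(" ".join(tgt))   # i received no answer
```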
Hierarchical Phrase-based Models
- Phrase-based models map Se = x1 x2 x3 to Sf = x2 x1 x3 as a flat segmentation; the hierarchical model lets phrases nest inside phrases
- SCFG representation (reconstructed from the slide's derivation figure):
  S -> <X1, X1>
  S -> <S1 X2, S1 X2>
  X -> <e3 X2 e4, f3 f4 X2>
  X -> <e1, f1>
  X -> <e5 e6, f6 f5>
Example (Chiang 2005)
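The often-quoted motivating rules from Chiang (2005) pair a Mandarin construction with its re-ordered English counterpart, roughly: X -> <yu X1 you X2, have X2 with X1> and X -> <X1 de X2, the X2 that X1>. A single formally-syntactic rule thus captures both lexical translation and long-distance re-ordering.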
Hierarchical Phrase-based Models (cont.)
- Same SCFG representation as the previous slide
- Restrictions: at most two recursive phrases (nonterminals) per rule; restrictions on phrase length
- Question 1: how to train the model?
- Question 2: how to decode?
Training and Decoding
- Collect initial grammar rules (see the extraction sketch below)
- Tune rule weights: count and normalize!
- Decoding: CYK (remember, rules have at most two nonterminals); parse the f side only, reading the e side off the synchronous rules
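A rough sketch of the rule-collection idea (the helper and the two toy phrase pairs are invented): start from an aligned phrase pair and subtract a smaller contained pair, replacing it with a linked X on both sides.

```python
def subtract(big, small):
    """From aligned phrase pair `big`, cut out the contained pair `small`
    and replace it with a linked nonterminal X1 on both sides."""
    (f_big, e_big), (f_small, e_small) = big, small
    if f_small not in f_big or e_small not in e_big:
        return None
    f_rhs = f_big.replace(f_small, "X1", 1)
    e_rhs = e_big.replace(e_small, "X1", 1)
    return f"X -> < {f_rhs} , {e_rhs} >"

big = ("keine Antwort bekommen", "receive no answer")    # toy phrase pair
small = ("keine Antwort", "no answer")                   # contained pair
print(subtract(big, small))
# X -> < X1 bekommen , receive X1 >   (the rule has learned a re-ordering)
```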
Does it help? Experimental Details
- Mandarin-to-English (FBIS corpus), 7.2M + 9.2M words
- Devset: NIST 2002 MT evaluation; test set: NIST 2003 MT evaluation
- 7.5% relative improvement over phrase-based models in BLEU score (0.02 absolute over the baseline)
Does it help?
- 7.5% relative improvement over phrase-based models
- Learnt rules are formally SCFG but not linguistically interpretable
- The model learns re-ordering patterns guided by lexical function words
- Captures long-range movements via recursion
Follow-Up Study (Zollmann et al. 2006, 2007)
- Why not decorate the phrases with their grammatical constituents?
- Where possible, decorate each phrase with a constituent label
- Generalize phrases as in Chiang 2005; parse using chart parsing
- Moved from 31.85 to 32.15 BLEU over the CMU phrase-based system on a Spanish-English corpus
The Big Picture: Translation Models (roadmap revisited)
- Next stop: tree-string transducers, which bring linguistic syntax in on the English side
Tree-String Transducers
- Linguistic tools: English parse trees from a statistical parser; alignments from GIZA++
- Conditional model P(f | Te)
- Models differ on how to factor P(f | Te) and on the domain of locality: SCFG (Yamada & Knight 2001); STSG (Galley et al. 2004)
- Caveat: errors in the parses and alignments propagate into the model
Tree-String (Yamada & Knight)
- Back to the noisy channel model: transduce Te into f
- Stochastic channel operations (on trees):
  - reorder children
  - insert nodes
  - translate leaf words
Channel Operations
- Example probabilities (reconstructed from the slide figure): reorder P(VB TO -> TO VB); insert P(right | PRP) * P_ins(ha)
Learning
- Learn channel operation probability tables: reordering, insertion, translation
- Standard EM training: E-step computes expected rule counts (dynamic programming); M-step counts and normalizes
Decoding as Parsing
- In a nutshell, we learnt how to parse the foreign side
- Add the CFG rules from the English side, plus channel rules (schemas reconstructed from the slide):
  - reordering: if (VB2 -> VB TO) can be reordered as (TO VB), add the rule VB2 -> TO VB
  - insertion: V -> X V and V -> V X, with X -> fi
  - translation: ei -> fi
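A minimal sketch of how such an f-side grammar could be assembled from the channel tables (all table contents, weights, and words are invented; insertion rules are omitted for brevity):

```python
# Invented toy channel tables.
reorder = {("VB", "TO"): {("TO", "VB"): 0.7, ("VB", "TO"): 0.3}}
translate = {"go": {"iku": 0.9}, "to": {"ni": 0.8}}
english_rules = [("VB2", ("VB", "TO"))]     # CFG rules from the English side

f_grammar = []   # (lhs, rhs, prob) productions for parsing f

# Reordering: one f-side production per permutation of the children.
for lhs, rhs in english_rules:
    for perm, p in reorder.get(rhs, {rhs: 1.0}).items():
        f_grammar.append((lhs, perm, p))

# Translation: preterminal rules rewriting English words to foreign words.
for e_word, choices in translate.items():
    for f_word, p in choices.items():
        f_grammar.append((e_word, (f_word,), p))

for rule in f_grammar:
    print(rule)
# ('VB2', ('TO', 'VB'), 0.7), ('VB2', ('VB', 'TO'), 0.3), ...
```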
Decoding Example
Results and Expressiveness
- English-Chinese task, short sentences (< 20 words; 3M-word corpus); test set: 347 sentences with at most 14 words
- Better BLEU score (0.102) than IBM Model 4 (0.072)
- What can it represent? Depends on the syntactic divergence between the language pair
- The trees must be isomorphic up to child re-ordering; channel rules only permute the children of a single node
- Q: What can't it model? Anything beyond child re-ordering
Limitations Can’t model syntactic movements that cross brackets SVO to VSO Modal movement between English and French Not ne .. pas (from English to French) VP VP VP VP VP …. VB Aux Does Not go ne va pas The span of Not can’t intersect that of Go Can’t Interleave Green with the other two
Limitations: Possible Solutions
- Follow-up studies showed relative improvements
- Gildea 2003 added cloning operations; AER went from 0.42 to 0.30 on a Korean-English corpus
Tree-String Transducers (recap)
- Models differ on how to factor P(f | Te) and on the domain of locality
- Just covered: SCFG (Yamada & Knight 2001); next: STSG (Galley et al. 2004)
Learning Expressive Rules (Galley et al. 2004)
- Yamada & Knight: channel operation tables over one-level CFG rules (a VP and its immediate children)
- Galley et al. 2004: rule extraction yields TSG rules -- parsing rules for the f-side string f1, f2, ..., fn -- that condition on larger fragments of the trees
Rule Format and Decoding
- Rules pair an English tree fragment with a foreign-side string, e.g. VP(Aux(Does) RB(Not) x2:VB) -> ne x2 pas
- The tree is built bottom-up; the foreign string at each derivation step may still contain nonterminals (a state fi such as "... ne VB pas ..." becomes fi+1 after a rule rewrites the fragment)
- Rules are extracted from the training corpus: English-side trees, foreign-side strings, alignments from GIZA++
Rule Extraction
- Example: "he Does Not go" <-> "il ne va pas"
- Upward projection: annotate each tree node with the f-span of its yield: he{il}, PRP{il}, NP{il}, Does{ne pas}, Not{ne pas}, Aux{ne pas}, RB{ne pas}, go{va}, VB{va}, VP{ne va pas}, S{il ne va pas}
- Frontier nodes: nodes whose span is exclusive (disjoint from everything aligned outside the node): S, NP, PRP, he, VP, VB, go
- The frontier graph determines where rules are cut; extract a rule at each frontier node as before
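A compact sketch of the span/frontier computation behind this slide (the tree encoding, alignment dictionary, and function names are invented; the definition follows the slide: a node is frontier iff the closure of its f-span is disjoint from the f positions aligned outside its subtree):

```python
# Tree encoded as nested (label, children) pairs; leaves have no children.
TREE = ("S",
        [("NP", [("PRP", [("he", [])])]),
         ("VP", [("Aux", [("Does", [])]),
                 ("RB",  [("Not", [])]),
                 ("VB",  [("go", [])])])])

# Alignment: e-word -> f positions in "il ne va pas" (0-based).
ALIGN = {"he": {0}, "Does": {1, 3}, "Not": {1, 3}, "go": {2}}

def yield_words(node):
    label, children = node
    return [label] if not children else [w for c in children for w in yield_words(c)]

def f_span(node):
    return {i for w in yield_words(node) for i in ALIGN.get(w, set())}

def frontier_nodes(node, outside):
    """Collect labels of frontier nodes: f-span closure disjoint from `outside`."""
    label, children = node
    s = f_span(node)
    closure = set(range(min(s), max(s) + 1)) if s else set()
    result = [label] if s and not (closure & outside) else []
    for i, child in enumerate(children):
        sib_span = set()
        for j, sib in enumerate(children):
            if j != i:
                sib_span |= f_span(sib)
        result += frontier_nodes(child, outside | sib_span)
    return result

print(frontier_nodes(TREE, set()))
# ['S', 'NP', 'PRP', 'he', 'VP', 'VB', 'go']  -- matches the slide
```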
Illustrating Rule Extraction
- Rules read off the frontier graph (input tree fragment -> output string):
  S(x1:NP x2:VP) -> x1 x2
  VP(Aux(Does) RB(Not) x:VB) -> ne x pas
  VB(go) -> va
Minimality of Extracted Rules
- Other rules can be composed from these minimal rules, e.g.:
  VP(Aux(Does) RB(Not) x:VB) -> ne x pas  composed with  VB(go) -> va
  gives  VP(Aux(Does) RB(Not) VB(go)) -> ne va pas
Probability Estimation
- Just EM: a modified inside-outside algorithm for the E-step
- Decoding as parsing
- Training can be done using off-the-shelf tree-transducer toolkits (Knight et al. 2004)
Evaluation
- Coverage (how well the learnt rules explain the corpus): ~100% coverage on the French-English and Chinese-English corpora
- Translation results: the decoder was still work in progress
The Big Picture: Translation Models (roadmap revisited)
- Last stop: tree-tree transducers, with syntax on both sides
Tree-Tree Transducers
- Linguistic tools: English and foreign parse trees from statistical parsers; alignments from GIZA++
- Conditional model P(Tf | Te); models differ on how to factor P(Tf | Te)
- Really many, many, many variants: CFG vs. dependency trees; trained with EM (most of them) or discriminatively (Collins 2006), which directly models P(Te | Tf)
- Same caveat as before
Discriminative Tree-Tree Models
- Directly model P(Te | Tf); translate from German to English
- Extended projections (EPs): elementary trees from TAG with one verb, lexical function words, and NP/PP placeholders
- Learns to map between tree fragments: German clause -> EP
- Modeled as structured learning
How to Decode?
- Why no generative story? Because this is a direct model!
- Given a German string: parse it; break it into clauses; predict an EP for each clause (this is where the structured learning comes in); translate German NPs and PPs using Pharaoh; map the translated German NPs/PPs into the placeholders in the EP; stitch the clauses together to get the English translation
How to Train
- Training data: aligned clauses (e1, g1) ... (en, gn)
- Extraction procedure: parse the English and German sides; align NPs/PPs between them using GIZA++; break the parse trees into clauses; order clauses based on verb position; discard sentences with different numbers of clauses
How to train (2)
How to train (3)
- (X, Y) is a training pair -- just our good old perceptron friend!
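For completeness, a bare-bones structured perceptron of the kind the slide alludes to (the feature map `phi`, the candidate generator, and all names are placeholders invented for this sketch):

```python
import numpy as np

def structured_perceptron(data, candidates, phi, dim, epochs=5):
    """data: list of (x, y_gold) pairs; candidates(x): iterable of possible
    outputs for x; phi(x, y): feature vector (np.ndarray of length dim)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            # Predict the highest-scoring candidate under the current weights.
            y_hat = max(candidates(x), key=lambda y: w @ phi(x, y))
            if y_hat != y_gold:
                # Update toward the gold structure, away from the prediction.
                w += phi(x, y_gold) - phi(x, y_hat)
    return w
```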
Results
- German-English Europarl corpus: 750K training sentences, 441,000 training clauses; test on 2,000 sentences
- BLEU score: baseline 25.26, this system 23.96 -- largely because of the many restrictions imposed
- Human judgment: 62 judged equal, 16 better under this system, 22 better for the baseline
Outline The Translation Problem The Noisy Channel Model Syntax-light SMT Why Syntax? Syntax-based SMT Models Summary
Summary
- Syntax does help, but:
- What is the right representation? Is it language-pair specific?
- How to deal with parser errors? Modeling the uncertainty of the parsing process
- Large-scale syntax-based models: are they possible? What are the trade-offs?
- Better parameter estimation! Should we trust the GIZA++ alignment results? Block translation vs. word-to-word?
Thanks
Related Work
- Fast parsers for synchronous grammars; grammar binarization; fast k-best parsing
- Re-ranking
- Syntax-driven evaluation measures
- Impact of parsing quality on overall system performance
SAMT (Syntax-Augmented Machine Translation; the Zollmann et al. system discussed above)