Syntax-based Statistical Machine Translation Models


1 Syntax-based Statistical Machine Translation Models
Amr Ahmed, March 26, 2008

2 Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
- Why Syntax?
- Syntax-based SMT Models
- Summary

3 Statistical Machine Translation
Problem: Given a sentence (f) in one language, produce its equivalent in another language (e). [Diagram: Arabic text → Machine Translation System → English? "I know how to do this."] "One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Arabic, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'", Warren Weaver, 1947

4 Statistical Machine Translation
Problem: Given a sentence (f) in one language, produce its equivalent in another language (e). Noisy Channel Model: e → Noisy Channel → f. P(e) models good English (and we know how to factor P(e)!); P(f|e) models good translation. Today: how to factor P(f|e)?
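This argmax over P(e) · P(f|e) can be sketched in a few lines. The probability tables below are tiny hypothetical toy values, not real models:

```python
# Noisy-channel decoding sketch: e* = argmax_e P(e) * P(f|e).
# lm and tm are invented toy tables, for illustration only.

lm = {"he goes": 0.6, "him go": 0.1}        # P(e): language model, "good English"
tm = {("il va", "he goes"): 0.5,            # P(f|e): translation model
      ("il va", "him go"): 0.4}

def decode(f, candidates):
    """Return the English candidate maximizing P(e) * P(f|e)."""
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

print(decode("il va", ["he goes", "him go"]))  # he goes
```

The language model pulls toward fluent English even when the translation model slightly prefers another candidate.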

5 Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
  - Word-based Models
  - Phrase-based Models
- Why Syntax?
- Syntax-based SMT Models
- Summary

6 Word-Translation Models
Example: "Auf diese Frage habe ich leider keine Antwort bekommen" ↔ "NULL I did not unfortunately receive an answer to this question" (the word-alignment links, shown in blue on the slide, aren't observed in the data). What is the generative story? IBM Models 1-4, roughly equivalent to FSTs (modulo reordering). How to do learning and decoding? Slide credit: adapted from Smith et al.

7 Word-Based Translation Models
In a nutshell: stochastic operations (deletion, fertility, translation, re-ordering), each associated with probabilities, estimated using EM. Q: What are we learning? A: Word movement. Linguistic hypothesis behind phrase-based models: (1) words move in blocks; (2) context is important.
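The EM loop for the simplest of these models, IBM Model 1, can be sketched as follows (toy bitext; a minimal illustration, not a production trainer):

```python
from collections import defaultdict

def ibm1_em(bitext, iterations=10):
    """Minimal IBM Model 1: EM for word-translation probabilities t(f|e).
    bitext is a list of (foreign_words, english_words) sentence pairs."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(f, e)
        total = defaultdict(float)                # marginals c(e)
        for fs, es in bitext:
            for f in fs:
                z = sum(t[(f, e)] for e in es)    # normalizer for this f word
                for e in es:
                    p = t[(f, e)] / z             # E-step: alignment posterior
                    count[(f, e)] += p
                    total[e] += p
        for (f, e), c in count.items():           # M-step: count and normalize
            t[(f, e)] = c / total[e]
    return t

# Toy bitext: co-occurrence disambiguates "la" -> "the"
bitext = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = ibm1_em(bitext)
```

After a few iterations, t("la" | "the") dominates because "la" co-occurs with "the" in both sentence pairs while "maison" and "fleur" each occur only once.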

8 Phrase-Based Translation Models
In a nutshell: stochastic operations (segmentation, translation, re-ordering with a Markovian dependency), each associated with probabilities, estimated using EM. Q: What are we learning? A: Word movement. Linguistic hypothesis: (1) words move in blocks; (2) context is important.

9 Phrase-Based Translation Models (cont.)
Same model, with the hidden phrase alignments (a1, a2, a3) shown in the figure.

10 Phrase-Based Models: Example
These are not necessarily syntactic phrases, and the division into phrases is hidden. Example: "Auf diese Frage habe ich leider keine Antwort bekommen" ↔ "I did not unfortunately receive an answer to this question". Score each phrase pair using several features. Slide credit: Smith et al.

11 Phrase Table Estimation
Basically: count and normalize.
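A minimal sketch of count-and-normalize; `phrase_table` and the example phrase pairs are invented for illustration:

```python
from collections import Counter

def phrase_table(phrase_pairs):
    """'Count and normalize': relative-frequency estimate of P(f_phrase | e_phrase)."""
    pair_counts = Counter(phrase_pairs)                  # c(f, e)
    e_counts = Counter(e for _, e in phrase_pairs)       # c(e)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

# Hypothetical phrase pairs extracted from word-aligned data
pairs = [("keine Antwort", "no answer"),
         ("keine Antwort", "no answer"),
         ("nicht eine Antwort", "no answer")]
table = phrase_table(pairs)
# P("keine Antwort" | "no answer") = 2/3
```

Real systems score each pair with several such relative-frequency features (both directions, plus lexical weights); this shows only the basic one.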

12 Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
  - Word-based Models
  - Phrase-based Models
- Why Syntax?
- Syntax-based SMT Models
- Summary

13 Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
- Why Syntax?
- Syntax-based SMT Models
- Summary

14 Why Syntax?
Reference: consequently proposals are submitted to parliament under the assent procedure, meaning that parliament can no longer table amendments, as directives in this area were adopted as single market legislation under the codecision procedure on the basis of art.100a tec.
Translation: consequently, the proposals parliament after the assent procedure, the tabled amendments for offers no possibility of community directives, because as part of the internal market legislation on the basis of article 100a of the treaty in the codecision procedure have been adopted.
Slide credit: example from Cowan et al.

15 Why Syntax? (cont.)
What went wrong? Phrase-based systems are very good at predicting content words, but are less accurate at producing function words, or at producing output that correctly encodes the grammatical relations between content words. Here, syntax can help! Slide credit: adapted from Cowan et al.

16 Does Adding More Structure Help?
[Figure: a source sentence Se (blocks x1 x2 x3) passes through the noisy channel to Sf (order x2 x1 x3), modeled at increasing levels of structure: word-based → phrase-based → syntax-based. Better performance?]

17 Syntax and the Translation Pipeline
Syntax can enter the pipeline at three points: pre-reordering of the input, inside the translation model itself, or in post-processing of the output (re-ranking).

18 Early Exposition (Koehn et al. 2003)
Fix a phrase-based system and vary the way phrases are extracted: frequency-based, generative, or constituent-based. Adding syntax hurts performance: phrases like "there is ↔ es gibt" are not constituents (the constituent restriction eliminates 80% of phrase pairs). Explanation: there is no hierarchical re-ordering, syntax is not fully exploited here, and parse trees contain errors.

19 Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
- Why Syntax?
- Syntax-based SMT Models
- Summary

20 The Big Picture: Translation Models
[Figure: the translation triangle, from strings up through syntax to an interlingua, on both the source and target sides. String-to-string: word-based and phrase-based models, SCFG (Chiang 2005), ITG (Wu 97). Using syntax on one or both sides: tree-string, string-tree, and tree-tree transducers.]

21 Learning Synchronous Grammar
String-to-string models with hidden structure: SCFG (Chiang 2005), ITG (Wu 97). No linguistic annotation; model P(e,f) jointly; the trees are hidden variables. EM doesn't work well with this much missing information, so restrictions are needed: structural restrictions (binary rules: ITG, Wu 97) or lexical restrictions (Chiang 2005: an SCFG over hierarchical phrases). But first: what is a synchronous grammar?

22 Interlude: Synchronous Grammar
Extension of monolingual grammar theory to bitext: CFG → SCFG, TAG → STAG, etc. Monolingual parsers are extended for bitext parsing.

23 Synchronous Grammar: SCFG
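To make the paired-rewriting idea concrete, here is a toy SCFG sketch: each rule rewrites a nonterminal on the English and foreign sides simultaneously, with co-indexed gaps, so reordering lives inside the rule itself. The grammar and the `derive` helper are invented for illustration:

```python
# Toy synchronous CFG: each rule has an (english_rhs, foreign_rhs) pair,
# and nonterminal slots (X0, X1) are co-indexed across the two sides.
rules = {
    "S":  [(["X0", "X1"], ["X1", "X0"])],   # the two phrases swap across languages
    "X0": [(["he"], ["il"])],
    "X1": [(["goes"], ["va"])],
}

def derive(symbol):
    """Expand a nonterminal top-down; return (english_words, foreign_words)."""
    e_rhs, f_rhs = rules[symbol][0]
    e_out, f_out = [], []
    for sym in e_rhs:                        # build the English side
        if sym in rules:
            e_out.extend(derive(sym)[0])
        else:
            e_out.append(sym)
    for sym in f_rhs:                        # build the foreign side, same rules
        if sym in rules:
            f_out.extend(derive(sym)[1])
        else:
            f_out.append(sym)
    return e_out, f_out

e, f = derive("S")
print(" ".join(e), "||", " ".join(f))  # he goes || va il
```

The single S rule produces "he goes" on one side and the reordered "va il" on the other: one derivation, two strings.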

24 Learning Synchronous Grammar (recap)
As above: the trees are hidden variables, so EM needs structural restrictions (binary rules: ITG, Wu 97) or lexical restrictions (Chiang 2005: an SCFG over hierarchical phrases). How?

25 Hierarchical Phrase-based Model
SCFG representation of hierarchical phrase-based models, with rules such as S → (x1, x1), S → (S1 x2, S1 x2), X → (e3 x2 e4, f3 f4 x2), X → (e1, f1), X → (e5 e6, f6 f5). [Figure contrasts a flat phrase-based alignment (Se: x1 x2 x3 ↔ Sf: x2 x1 x3) with a hierarchical derivation.]

26 Example (Chiang 2005)

27 Hierarchical Phrase-based Model
SCFG representation: S → (x1, x1), S → (S1 x2, S1 x2), X → (e3 x2 e4, f3 f4 x2), X → (e1, f1), X → (e5 e6, f6 f5). Restrictions: at most two recursive phrases per rule, plus restrictions on length. Question 1: how to train the model? Question 2: how to decode?

28 Training and Decoding Collect initial grammar rules

29 Training and Decoding
Collect initial grammar rules; tune rule weights: count and normalize! Decoding: CYK (remember, rules have at most two non-terminals), parsing the f side only.
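A minimal CYK recognizer over the foreign string, assuming a binarized grammar; the toy lexicon and rules are hypothetical:

```python
from itertools import product

def cyk_parse(words, lexical, binary, start="S"):
    """Minimal CYK recognizer over the foreign string (f side only).
    lexical: {terminal: {nonterminals}}; binary: {(B, C): {A}} for A -> B C."""
    n = len(words)
    # chart[i][j] = set of nonterminals deriving words[i..j] inclusive
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i] |= lexical.get(w, set())
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                        # split point
                for B, C in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= binary.get((B, C), set())
    return start in chart[0][n - 1]

# Toy grammar and lexicon, for illustration only
lexical = {"il": {"NP"}, "va": {"V"}}
binary = {("NP", "V"): {"S"}}
print(cyk_parse(["il", "va"], lexical, binary))  # True
```

Decoding additionally keeps, in each chart cell, the best-scoring derivation (with the synchronous English side attached to each rule) rather than a bare recognizer bit.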

30 Does it help?
Experimental details: Mandarin-to-English (FBIS corpus), 7.2M + 9.2M words; dev set: NIST 2002 MT evaluation; test set: NIST 2003 MT evaluation. Result: a 7.5% relative improvement over phrase-based models in BLEU (0.02 absolute over the baseline).

31 Does it help?
A 7.5% relative improvement over phrase-based models. The learnt rules are formally an SCFG but are not linguistically interpretable; the model learns re-ordering patterns guided by lexical function words, and captures long-range movement via recursion.

32 Follow-Up Study
Why not decorate the phrases with their grammatical constituents? Zollmann et al. 2006, 2007: where possible, decorate each phrase with a constituent label; generalize phrases as in Chiang 2005; parse using chart parsing. Improved over the CMU phrase-based system (from 32.15) on a Spanish-English corpus.

33 The Big Picture: Translation Models
[The roadmap slide again: next, tree-string transducers.]

34 Tree-String Transducers
Linguistic tools: English parse trees from a statistical parser, alignments from GIZA++. Conditional model P(f|Te). Models differ in how they factor P(f|Te) and in their domain of locality: SCFG (Yamada & Knight 2001) vs. STSG (Galley et al. 2004). Caveat: Te comes from an automatic parser, so parse errors propagate.

35 Tree-String (Yamada & Knight)
Back to the noisy channel model: the channel transduces Te into f via stochastic operations on trees: reorder children, insert a node, and translate the leaf words ("lexical transplantation").

36 Channel Operations
Each operation has a probability, e.g. a reordering probability P(VB T0 → T0 VB) and an insertion probability P(right | PRP) · Pi(ha).

37 Learning
Learn the channel operation probabilities (reordering, insertion, translation) by standard EM training. E-step: compute expected rule counts (by dynamic programming); M-step: count and normalize.

38 Decoding As Parsing
In a nutshell, we have learnt how to parse the foreign side. Add CFG rules from the English side, plus channel rules. Reordering: if (VB2 → VB T0) is reordered as (VB2 → T0 VB), add the rule VB2' → T0 VB. Insertion: Vpl → X V and Vpr → V X, and X → fi. Translation: ei → fi.

39 Decoding Example

40 Results and Expressiveness
English-Chinese task, short sentences (< 20 words; 3M-word corpus); test set of 347 sentences with at most 14 words. Better BLEU score (0.102) than IBM Model 4 (0.072). What can it represent? That depends on the syntactic divergence between the language pair: the trees must be isomorphic up to child re-ordering. Q: What can't it model? Anything beyond child re-ordering.

41 Limitations
Can't model syntactic movements that cross brackets: SVO to VSO; modal movement between English and French; "not" → "ne .. pas" (from English to French). [Figure: in the tree for "Does Not go" / "ne va pas", the span of "Not" can't intersect that of "go", so the words can't interleave.]

42 Limitations: Possible Solutions
Follow-up studies showed relative improvements: Gildea 2003 added cloning operations, and AER went from 0.42 to 0.30 on a Korean-English corpus.

43 Tree-String Transducers
[The tree-string transducer slide again: next, STSG rules (Galley et al. 2004) with a larger domain of locality.]

44 Learning Expressive Rules (Galley 2004)
Yamada & Knight use channel operation tables over single CFG rules; Galley et al. 2004 instead extract TSG rules (parsing rules for the foreign side) that condition on larger fragments of the trees.

45 Rule Format and Decoding
The tree is built bottom-up, and the foreign string at each derivation step may still contain non-terminals. Each rule pairs a CFG tree fragment on the English side with a foreign-side string; for example, a VP fragment over "Does Not" with a VB gap can rewrite a foreign non-terminal to "ne VB pas". Rules are extracted from a training corpus of English-side trees, foreign-side strings, and GIZA++ alignments.

46 Rule Extraction
Project the aligned foreign spans upward through the English tree for "he Does Not go" / "il ne va pas": he → {il}, Does → {ne, pas}, Not → {ne, pas}, go → {va}. Frontier nodes are nodes whose span is exclusive, i.e. does not overlap the span of any node outside their own subtree: here S, NP, PRP, he, VP, VB, and go (Aux and RB are excluded because "Does" and "Not" both map into the discontinuous "ne ... pas"). The frontier graph determines where the tree may be cut, and rules are then extracted as before.
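The span and frontier computation can be sketched directly. The tree encoding and helper names below are my own, but the example reproduces the slide's frontier set (over preterminals):

```python
def spans(node, align, out):
    """node = (label, children) or (label, english_word). Record in `out`
    the set of aligned foreign positions under each node, keyed by id()."""
    label, rest = node
    if isinstance(rest, str):                  # preterminal over one English word
        s = set(align.get(rest, ()))
    else:
        s = set().union(*(spans(child, align, out) for child in rest))
    out[id(node)] = s
    return s

def frontier_nodes(tree, align):
    """Labels of frontier nodes: nodes whose foreign span does not overlap
    the span of any node outside their own subtree (the complement span)."""
    out = {}
    spans(tree, align, out)
    result = []
    def visit(node, complement):
        label, rest = node
        s = out[id(node)]
        if s and not (s & complement):         # exclusive span => frontier node
            result.append(label)
        if not isinstance(rest, str):
            for i, child in enumerate(rest):   # siblings join the complement
                siblings = set().union(
                    *(out[id(c)] for j, c in enumerate(rest) if j != i))
                visit(child, complement | siblings)
    visit(tree, set())
    return result

# "he does not go" / "il ne va pas" (il=0, ne=1, va=2, pas=3);
# "does" and "not" both align into the discontinuous "ne ... pas".
tree = ("S", [("NP", [("PRP", "he")]),
              ("VP", [("Aux", "does"), ("RB", "not"), ("VB", "go")])])
align = {"he": [0], "does": [1, 3], "not": [1, 3], "go": [2]}
print(frontier_nodes(tree, align))  # ['S', 'NP', 'PRP', 'VP', 'VB']
```

Aux and RB are correctly excluded: their spans {1, 3} overlap each other's complement, so no rule may be cut at them.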

47 Illustrating Rule Extraction
[Extracted rules, shown as input tree fragment → output string: the VP fragment over "Does Not" with a VB gap → "ne VB pas"; S over NP VP → "NP VP"; VB over "go" → "va".]

48 Minimality of Extracted Rules
Other rules can be composed from these minimal rules: composing the VP rule that yields "ne VB pas" with the VB → "va" rule for "go" produces a larger VP rule that yields "ne va pas" directly.

49 Probability Estimation
Just EM, with a modified inside-outside algorithm for the E-step; decoding is again parsing. Training can be done using off-the-shelf tree-transducer toolkits (Knight et al. 2004).

50 Evaluation
Coverage (how well the learnt rules explain the corpus): 100% coverage on the French-English and Chinese-English corpora. Translation results: the decoder was still a work in progress.

51 The Big Picture: Translation Models
[The roadmap slide once more: finally, tree-tree transducers.]

52 Tree-Tree Transducers
Linguistic tools: English and foreign parse trees from statistical parsers, alignments from GIZA++. Conditional model P(Tf|Te); models differ in how they factor it, and there are really many, many of them: CFG vs. dependency trees; trained by EM (most of them) or discriminatively (Collins 2006, which directly models P(Te|Tf)). Same caveat as before: the parse trees come from automatic parsers.

53 Discriminative Tree-Tree Models
Directly model P(Te|Tf); translate from German to English. Extended Projections (EPs) are just elementary trees from TAG with one verb, the lexical function words, and NP/PP placeholders. The system learns to map between tree fragments, German clause → EP, modeled as structured learning.

54 How to Decode?
Why is there no generative story? Because this is a direct model! Given a German string: parse it; break it into clauses; predict an EP for each clause (this is where the structured learning comes in); translate the German NPs and PPs using Pharaoh; map the translated phrases into the holes in the EP; stitch the clauses together to get the English translation.

55 How to Train
Training data: aligned clauses. Extraction procedure: parse the English and German sides; align their NPs and PPs using GIZA++; break the parse trees into clauses; order clauses based on verb position; discard sentences with different numbers of clauses. Training set: (e1, g1), …, (en, gn).

56 How to train (2)

57 How to Train (3)
(X, Y) is a training pair: just our good old perceptron friend!
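That perceptron, in its structured form, is a short loop: score the candidate outputs under the current weights and, on a mistake, promote the gold features and demote the predicted ones. The feature map and the clause/EP names below are hypothetical stand-ins, for illustration only:

```python
from collections import defaultdict

def phi(x, y):
    """Joint feature map Phi(x, y): here just one indicator feature per
    (input, output) pair; a toy stand-in for real clause/EP features."""
    return {(x, y): 1.0}

def predict(w, x, candidates):
    """argmax_y  w . Phi(x, y) over the candidate outputs."""
    return max(candidates, key=lambda y: sum(w[f] * v for f, v in phi(x, y).items()))

def structured_perceptron(data, candidates, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = predict(w, x, candidates[x])
            if y_hat != y_gold:                       # mistake-driven update
                for f, v in phi(x, y_gold).items():
                    w[f] += v                         # promote gold features
                for f, v in phi(x, y_hat).items():
                    w[f] -= v                         # demote predicted features
    return w

# Hypothetical training pairs: (German clause id, gold extended projection)
data = [("clause1", "EP_a"), ("clause2", "EP_b")]
candidates = {"clause1": ["EP_a", "EP_b"], "clause2": ["EP_a", "EP_b"]}
w = structured_perceptron(data, candidates)
```

The real system's argmax ranges over candidate EPs for a clause, and Phi inspects tree-fragment structure; the update rule is exactly this one.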

58 Results
German-English Europarl corpus: 750k training sentences → 441,000 training clauses; tested on 2,000 sentences. BLEU score: baseline 25.26, this system 23.96, largely because many restrictions were imposed. Human judgment: 62 equal, 16 better under this system, 22 better for the baseline.

59 Outline
- The Translation Problem
- The Noisy Channel Model
- Syntax-light SMT
- Why Syntax?
- Syntax-based SMT Models
- Summary

60 Summary
Syntax does help, but open questions remain:
- What is the right representation? Is it language-pair specific?
- How to deal with parser errors? Modeling the uncertainty of the parsing process.
- Large-scale syntax-based models: are they possible, and what are the trade-offs?
- Better parameter estimation! Should we trust the GIZA++ alignment results? Block translation vs. word-word?

61 Thanks

62 Related Work
- Fast parsers for synchronous grammars: grammar binarization, fast k-best parsing
- Re-ranking
- Syntax-driven evaluation measures
- Impact of parsing quality on overall system performance

63 SAMT

