Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation Chapters 5-8 Ethan Phelps-Goodman
Outline “Tricky” Syntactic Features Reranking with Perceptrons Reranking with Minimum Bayes Risk Conclusion In the interest of time I’m skipping some features. If there’s one you’d like to talk about, ask me.
“Tricky” Syntactic Features Constituent Alignment Features Markov Assumption Bi-gram model over elementary trees
Constituent Alignment Features Hypothesis: target sentences should be rewarded for having syntactic structure similar to the source sentence. Experiments: Tree to String Penalty, Tree to Tree Penalty, Constituent Label Probability. Uses: parse trees in both languages and word alignments. Is that hypothesis really true?
Tree to String Penalty Penalize target words that cross source constituents
Tree to Tree Penalty Penalize target constituents that don’t align to source constituents
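A minimal sketch of one plausible reading of the tree-to-string penalty (the tree-to-tree penalty would be analogous, defined over aligned constituents rather than words). The span representation, the alignment format, and the crossing test are assumptions for illustration, not the exact workshop definitions.

```python
# Hedged sketch: one plausible reading of the tree-to-string penalty.
# A source constituent is a span (i, j) of source positions; `alignments` is a
# set of (source_pos, target_pos) pairs. A target word "crosses" a constituent
# if it lies inside the target span the constituent projects to, yet is aligned
# only to source words outside the constituent.

def tree_to_string_penalty(source_spans, alignments):
    penalty = 0
    for (i, j) in source_spans:                                # each source constituent
        inside = {t for (s, t) in alignments if i <= s <= j}   # target positions it covers
        if not inside:
            continue
        lo, hi = min(inside), max(inside)                      # projected target span
        for t in range(lo, hi + 1):
            links = {s for (s, tt) in alignments if tt == t}
            if links and links.isdisjoint(range(i, j + 1)):
                penalty += 1                                   # crossing target word
    return penalty
```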
Constituent Label Probability Learns Pr( target node label | source node label ) e.g.: Pr( target = VP | source = NP ) = 0.019. Align tree nodes by finding minimum common ancestor of aligned leaves Training: ML counts from training data Also: Pr( target label, target leaf count | source label, source leaf count )
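A minimal sketch of the maximum-likelihood estimate of Pr(target label | source label), assuming a simple Node class and a target-to-source leaf alignment map (both assumptions for illustration); the node-alignment step takes the minimum common ancestor of the source leaves that a target node's leaves align to, as described above.

```python
from collections import defaultdict

# Hedged sketch of ML counts for Pr(target node label | source node label).
# Node is a hypothetical parse-tree class; align_t2s maps each target leaf
# position to the set of source leaf positions it is aligned to.

class Node:
    def __init__(self, label, children=(), leaf_index=None):
        self.label, self.children, self.leaf_index = label, list(children), leaf_index

    def leaf_indices(self):
        if not self.children:
            return [self.leaf_index]
        return [i for c in self.children for i in c.leaf_indices()]

def lca(node, leaves):
    """Lowest source node whose span covers all of `leaves` (minimum common ancestor)."""
    for child in node.children:
        if leaves <= set(child.leaf_indices()):
            return lca(child, leaves)
    return node

def label_counts(target_nodes, source_root, align_t2s):
    counts = defaultdict(lambda: defaultdict(int))
    for tnode in target_nodes:                         # every node of the target parse
        src_leaves = set()
        for t in tnode.leaf_indices():
            src_leaves |= set(align_t2s.get(t, []))
        if src_leaves:
            counts[lca(source_root, src_leaves).label][tnode.label] += 1
    return counts

def label_prob(counts, src_label, tgt_label):
    total = sum(counts[src_label].values())
    return counts[src_label][tgt_label] / total if total else 0.0
```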
Results Possible problems: noisy parses; noisy alignments; insensitivity of BLEU; improvements not statistically significant.
Markov Assumption for Tree Models The tree-based translation models from Chapter 4 were too slow (there was only time to parse 300 of the 1,000-best) and allowed limited reordering among the higher levels of the tree. Solution: split trees into independent tree fragments. Lesson: it is difficult to get complicated features to work when there is so much noise in the system.
Markov Example
TAG Elementary Trees Break into tree fragments by head word. Build an n-gram model on the tree fragments. Unigram model: P = ∏_i p(e_i, t_{e_i}, f_i, t_{f_i}). Bi-gram model: P = ∏_i p(e_i, t_{e_i}, f_i, t_{f_i} | e_{i-1}, t_{e_{i-1}}, f_{i-1}, t_{f_{i-1}}), where e_i and f_i are the source and target words and t_{e_i} and t_{f_i} are the source and target elementary trees. Intuition: a simple bi-gram model, but the basic unit is syntactically motivated.
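A minimal sketch of scoring a sentence pair under the bi-gram model, assuming the elementary-tree units (e_i, t_{e_i}, f_i, t_{f_i}) have already been extracted and that a conditional probability table estimated from training counts is available; the table and the back-off floor are stand-ins, not the workshop's estimator.

```python
import math

# Hedged sketch: bi-gram model whose basic unit is an aligned elementary-tree
# unit (e_i, te_i, f_i, tf_i) rather than a word. `bigram_prob` is a hypothetical
# conditional probability table estimated from training counts; unseen events
# fall back to a small floor value.

def elementary_tree_bigram_score(units, bigram_prob, floor=1e-7):
    logp = 0.0
    prev = None                                   # the previous elementary-tree unit
    for unit in units:                            # unit = (e_i, te_i, f_i, tf_i)
        logp += math.log(bigram_prob.get((unit, prev), floor))
        prev = unit
    return logp
```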
Finding TAG Elementary Trees Heuristically assign head nodes
Finding TAG Elementary Trees Split at head words
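A minimal sketch of the head-word split, assuming each node is a dict carrying its label, the index of its head word, and its children; grouping nodes by head word cuts the tree wherever a child's head differs from its parent's. This is a simplification (it records only node labels per fragment), not the workshop's exact procedure.

```python
# Hedged sketch: split a head-annotated parse into one fragment per head word.
# A node is a dict {"label": ..., "head": head_word_index, "children": [...]};
# the head annotation comes from the heuristic head-assignment step above.

def split_at_heads(node, fragments=None):
    if fragments is None:
        fragments = {}
    # every node belongs to the fragment of its head word, so the tree is cut
    # wherever a child's head word differs from its parent's
    fragments.setdefault(node["head"], []).append(node["label"])
    for child in node["children"]:
        split_at_heads(child, fragments)
    return fragments
```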
Results Why does performance improve w/ independence assumption? Just because coverage increases?
Outline “Tricky” Syntactic Features Reranking with Perceptrons Reranking with Minimum Bayes Risk Conclusion In the interest of time I’m skipping some features. If there’s one you’d like to talk about, ask me.
Why rerank? Reranking has been successful in POS tagging and parsing; it allows easy incorporation of new features; and it decreases decoding complexity. But the entire report is about reranking. Why would a perceptron be better than MER training of the log-linear model?
Reranking with a linear classifier Log-linear and Perceptron both give a linear reranking rule: ê = argmax_e ∑_m λ_m h_m(e, f). The difference is in training. Log-linear MER: optimize BLEU of the 1-best. Perceptron: attempt to separate "good" from "bad" in the 1,000-best. Note: λ_m trained with MER on the dev set.
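A minimal sketch of the shared linear rescoring rule, assuming each n-best candidate is a dict with a "features" vector and a "translation" string (a hypothetical representation); only the weight vector differs between MER-trained log-linear reranking and the perceptron.

```python
# Hedged sketch: the linear reranking rule shared by both approaches. The
# candidate representation (a dict with "features" and "translation") is an
# assumption; `lambdas` is the learned weight vector, however it was trained.

def rerank(candidates, lambdas):
    def score(c):
        return sum(l * h for l, h in zip(lambdas, c["features"]))
    return max(candidates, key=score)["translation"]
```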
Training Data For a single sentence in the dev set: sort the 1,000-best translations by BLEU score, count the top 1/3 as good and the bottom 1/3 as bad. This finds features that on average lead to translations with higher BLEU scores. This gives us our training data.
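A minimal sketch of building the training set for one dev sentence, assuming each candidate is a dict with "translation" and "features" keys and that a per-sentence BLEU approximation is passed in as a function; both representations are assumptions for illustration.

```python
# Hedged sketch: build the perceptron training set from one sentence's
# 1,000-best list. `bleu` is whatever per-sentence BLEU approximation is used
# (passed in as a function); each candidate carries a feature vector.

def make_training_examples(nbest, references, bleu):
    ranked = sorted(nbest, key=lambda c: bleu(c["translation"], references), reverse=True)
    third = len(ranked) // 3
    good = [(c["features"], +1) for c in ranked[:third]]   # top 1/3 labelled "good"
    bad = [(c["features"], -1) for c in ranked[-third:]]   # bottom 1/3 labelled "bad"
    return good + bad                                      # middle third is discarded
```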
Separating in feature space For a single sentence:
Separating in feature space For many sentences: the separator could sit at a different position for each sentence, but it is restricted to the same direction (the same weight vector) for all sentences.
Reranking The distance from the hyperplane is the score of a translation; in effect, the perceptron is trying to predict the BLEU score.
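A minimal sketch of the perceptron training loop over the labelled examples from all dev sentences, assuming dense feature vectors; the learned weight vector fixes the shared direction of the separator, and its dot product with a candidate's features then serves as the reranking score.

```python
# Hedged sketch: a vanilla perceptron over (feature_vector, label) pairs with
# label +1 for "good" and -1 for "bad". The resulting weight vector w is the
# shared direction of the separating hyperplane; a candidate's signed distance
# from that hyperplane is its reranking score.

def train_perceptron(examples, epochs=10, lr=1.0):
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for features, label in examples:
            if label * sum(wi * hi for wi, hi in zip(w, features)) <= 0:
                w = [wi + lr * label * hi for wi, hi in zip(w, features)]  # update on mistakes
    return w
```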
Features Baseline: the 12 features used in Och’s baseline. Full: Baseline + all features from the workshop. POS sequence: a 0/1 feature for every possible POS sequence. Parse trees: a 0/1 feature for every possible subtree. The last 2 need explaining.
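A minimal sketch of how the 0/1 features might be represented as a sparse map, assuming the candidate's POS tags and a precomputed set of subtree signatures are available; the keying scheme is an illustration, not the workshop's actual feature encoding.

```python
# Hedged sketch: 0/1 indicator features. The candidate's entire POS sequence is
# one binary feature, and each subtree signature in its parse is another; the
# keys here are illustrative, not the workshop's actual encoding.

def indicator_features(pos_tags, subtree_signatures):
    feats = {("POS", tuple(pos_tags)): 1}
    for sig in subtree_signatures:
        feats[("TREE", sig)] = 1
    return feats
```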
Results (BLEU) Log-linear reranking: baseline 31.6, workshop features 33.2. Perceptron reranking: baseline 30.9, workshop features 31.6, POS sequence 30.9, parse tree 30.5.
Analysis The dev set was too small to optimize the POS and tree features. Using only the POS sequence gives results as good as the much more complicated baseline.
Outline “Tricky” Syntactic Features Reranking with Perceptrons Reranking with Minimum Bayes Risk Conclusion This one will take a bit of background in machine learning theory.
An Example Say our 4-best list is: "Ball at the from Bob" .31, "Bob hit the ball" .24, "Bob struck the ball" .23, "Bob hit a ball" .22. MAP: choose the single most likely hypothesis, here the disfluent "Ball at the from Bob". MBR: weight each hypothesis by its similarity with the other hypotheses, so the three mutually similar "Bob ... ball" translations reinforce each other and "Bob hit the ball" wins. If they only get one slide from this section, this should be it.
MBR, formally (ê, â, T̂) = argmax_{(e', a', T')} ∑_{(e, a, T)} L((e', a', T'), (e, a, T); f, T(f)) · Pr(e, a, T | f), where f = source sentence; e, e' = target sentences; a, a' = word alignments; T, T', T(f) = parse trees (T(f) is the parse of the source); L((e', a', T'), (e, a, T); f, T(f)) = similarity between translations e and e'.
Making MBR tractable Restrict e to 1,000-best list Use translation model score for P(e,a | f )
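A minimal sketch of MBR selection over an N-best list, using the 4-best example above and a crude word-overlap similarity as a stand-in for the actual loss functions (BLEU, tree kernel, alignment match) described next; the probabilities are taken as given.

```python
# Hedged sketch: Minimum Bayes Risk selection over an N-best list. `gain` is a
# similarity function; `probs` are the (normalized) translation-model scores.
# With the toy overlap below, the mutually similar "Bob ... ball" hypotheses
# reinforce each other and outweigh the single most probable string.

def mbr_select(nbest, probs, gain):
    best, best_score = None, float("-inf")
    for e_prime in nbest:
        expected_gain = sum(p * gain(e_prime, e) for e, p in zip(nbest, probs))
        if expected_gain > best_score:
            best, best_score = e_prime, expected_gain
    return best

def overlap(a, b):
    """Crude word-overlap similarity, a stand-in for BLEU or a tree kernel."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

nbest = ["Ball at the from Bob", "Bob hit the ball", "Bob struck the ball", "Bob hit a ball"]
probs = [0.31, 0.24, 0.23, 0.22]
print(mbr_select(nbest, probs, overlap))   # picks "Bob hit the ball"
```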
Loss functions: translation similarity. “Tier 1,” string based: L(e, e') = BLEU score between e and e'. “Tier 2,” syntax based: L((e, T), (e', T')) = tree kernel between T and T'. “Tier 3,” alignment based: L((e', a', T'), (e, a, T); f, T(f)) = number of source-to-target node alignments that are the same in T and T'.
Results Using a loss based on a particular measurement leads to better performance on that measurement. BLEU score goes up by 0.3.
Outline “Tricky” Syntactic Features Reranking with Perceptrons Reranking with Minimum Bayes Risk Conclusion This one will take a bit of background in machine learning theory.
What Works (BLEU) Baseline 31.6; Model 1 (on 250M words) 32.5; all features from workshop 33.2; human upper bound 35.7. Simple Model 1 is the only feature to give a statistically significant improvement in isolation; the complex features are helpful in aggregate.
Their wish list Better evaluation metrics: parameter tuning for individual sentences; specific issues, such as missing content words. Better parameter tuning: larger dev set. Take divergence into account in syntax. Better quality parses: confidence measure in the parse.