1
1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department Johns Hopkins University Advisor: Sanjeev Khudanpur Co-advisor: Jason Eisner
2
2 Statistical Machine Translation Pipeline: generative training on bilingual data yields translation models; monolingual English yields language models; discriminative training on held-out bilingual data yields optimal weights; decoding then maps unseen sentences to translation outputs.
3
3 Training a Translation Model. 垫子 上 的 猫, dianzi shang de mao → a cat on the mat. Extracted pair: dianzi shang → the mat
4
4 Training a Translation Model. 垫子 上 的 猫, dianzi shang de mao → a cat on the mat. Extracted pairs: dianzi shang → the mat, mao → a cat
5
5 Training a Translation Model. 垫子 上 的 猫, dianzi shang de mao → a cat on the mat. Extracted pairs: dianzi shang de → on the mat, dianzi shang → the mat, mao → a cat
6
6 Training a Translation Model. 垫子 上 的 猫, dianzi shang de mao → a cat on the mat. Extracted pairs: de mao → a cat on, dianzi shang de → on the mat, dianzi shang → the mat, mao → a cat
7
7 Training a Translation Model. 垫子 上 的 猫, dianzi shang de mao → a cat on the mat. Extracted pairs: de → on, de mao → a cat on, dianzi shang de → on the mat, dianzi shang → the mat, mao → a cat
8
8 Decoding a Test Sentence. Translation is easy? 垫子 上 的 狗, dianzi shang de gou → the dog on the mat (derivation tree for dianzi shang de gou).
9
9 Translation Ambiguity: dianzi shang de mao → a cat on the mat (垫子 上 的 猫); zhongguo de shoudu → capital of China; wo de mao → my cat; zhifei de mao → zhifei 's cat.
10
10 dianzi shang de mao Joshua (chart parser)
11
11 dianzi shang de mao the mat a cat a cat on the mat a cat of the mat the mat ’s a cat Joshua (chart parser)
12
12 dianzi shang de mao hypergraph Joshua (chart parser)
13
13 A hypergraph is a compact data structure to encode exponentially many trees. (Figure labels: node, edge, hyperedge; FSA vs. packed forest.)
14
14 A hypergraph is a compact data structure to encode exponentially many trees.
15
15 A hypergraph is a compact data structure to encode exponentially many trees.
16
16 A hypergraph is a compact data structure to encode exponentially many trees.
17
17 A hypergraph is a compact data structure to encode exponentially many trees. Structure sharing
18
18 Why Hypergraphs? Contains a much larger hypothesis space than a k-best list. A general compact data structure: special cases include finite-state machines (e.g., lattices), AND/OR graphs, and packed forests; can be used for speech, parsing, tree-based MT systems, and many more.
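A minimal sketch of the kind of data structure this implies, in Python. The class and field names (Node, Hyperedge, incoming, tails) are illustrative choices, not Joshua's actual internals; the point is structure sharing: a node built once can serve as the tail of many hyperedges.

```python
# Sketch of a packed-forest / hypergraph data structure (assumed names, not Joshua's).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hyperedge:
    rule: str            # e.g. an SCFG rule such as "X -> (dianzi shang, the mat)"
    tails: List["Node"]  # antecedent (child) nodes; empty for terminal rules
    weight: float = 1.0  # semiring weight attached to this hyperedge

@dataclass
class Node:
    label: str                       # nonterminal plus span, e.g. "X[0,2]"
    incoming: List[Hyperedge] = field(default_factory=list)

    def add_edge(self, rule: str, tails: List["Node"], weight: float = 1.0):
        self.incoming.append(Hyperedge(rule, tails, weight))

# Structure sharing: the same child nodes feed two competing hyperedges,
# which is how exponentially many trees fit in linear space.
x02 = Node("X[0,2]"); x02.add_edge("X -> (dianzi shang, the mat)", [])
x34 = Node("X[3,4]"); x34.add_edge("X -> (mao, a cat)", [])
root = Node("X[0,4]")
root.add_edge("X -> (X0 de X1, X1 on X0)", [x02, x34])
root.add_edge("X -> (X0 de X1, X1 of X0)", [x02, x34])
```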
19
19 Weighted Hypergraph. Linear model: the score of a derivation of the foreign input is the dot product of a weight vector with the derivation's features. (Figure: example derivation weights p=2, p=1, p=3, p=2.)
20
20 Probabilistic Hypergraph. Log-linear model: each derivation's weight is normalized by Z; in the toy example Z = 2+1+3+2 = 8, giving p = 2/8, 1/8, 3/8, 2/8.
21
21 Probabilistic Hypergraph. The hypergraph defines a probability distribution over trees! The distribution is parameterized by Θ (toy probabilities: p = 2/8, 1/8, 3/8, 2/8).
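A small sketch of that log-linear parameterization, assuming made-up feature vectors and an arbitrary pairing of candidate strings with the slide's toy weights (2, 1, 3, 2); only the arithmetic of Z and p(d|x) is the point.

```python
import math

# Hypothetical feature vectors f(d) and weights Θ for four derivations of the toy
# sentence, chosen only so that exp(Θ·f) reproduces the slide's weights 3, 1, 2, 2.
theta = {"lm": 1.0}
derivations = {
    "a cat on the mat": {"lm": math.log(3.0)},
    "a cat of the mat": {"lm": math.log(1.0)},
    "the mat 's a cat": {"lm": math.log(2.0)},
    "the mat a cat":    {"lm": math.log(2.0)},
}

def score(feats):
    # linear model: Θ · f(d)
    return sum(theta[k] * v for k, v in feats.items())

weights = {y: math.exp(score(f)) for y, f in derivations.items()}
Z = sum(weights.values())                        # Z = 3 + 1 + 2 + 2 = 8
probs = {y: w / Z for y, w in weights.items()}   # p(d|x) = exp(Θ·f(d)) / Z
```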
22
22 Probabilistic Hypergraph. The hypergraph defines a probability distribution over trees, parameterized by Θ. Training: how do we set the parameters Θ? Decoding: which translation do we present to a user? Atomic inference: what atomic operations do we need to perform? Why are these problems difficult? Brute force would be too slow since there are exponentially many trees, so we need sophisticated dynamic programs; some problems are intractable and require approximations.
23
23 Inference, Training and Decoding on Hypergraphs. Atomic inference: finding one-best derivations, finding k-best derivations, computing expectations (e.g., of features). Training: Perceptron, conditional random field (CRF), minimum error rate training (MERT), minimum risk, and MIRA. Decoding: Viterbi decoding, maximum a posteriori (MAP) decoding, and minimum Bayes risk (MBR) decoding.
24
24 Outline Hypergraph as Hypothesis Space Unsupervised Discriminative Training ‣ minimum imputed risk ‣ contrastive language model estimation Variational Decoding First- and Second-order Expectation Semirings main focus
25
25 Training Setup Each training example consists of a foreign sentence (from which a hypergraph is generated to represent many possible translations) a reference translation x: dianzi shang de mao y: a cat on the mat Training adjust the parameters Θ so that the reference translation is preferred by the model
26
26 Supervised: Minimum Risk. Minimum Bayes Risk Training and Minimum Empirical Risk Training: the joint distribution over inputs x and references y is given by nature and unknown; the loss compares the MT decoder's output with the reference (e.g., negated BLEU). Special cases: MERT, CRF, Perceptron. What if the input x is missing?
27
27 Supervised: Minimum Empirical Risk. Minimum Empirical Risk Training: replace the unknown distribution over (x, y) with the uniform empirical distribution over the training data, and minimize the loss of the MT decoder's output against the reference (e.g., negated BLEU). Special cases: MERT, CRF, Perceptron. What if the input x is missing?
28
28 Unsupervised: Minimum Imputed Risk. Minimum Imputed Risk Training: when only the English side y is available, use a reverse model to impute the input x, then train the forward system to minimize the expected loss of translating the imputed input back against y; in effect a round-trip translation (cf. speech recognition).
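A sketch of that objective in Python, assuming hypothetical helpers reverse_kbest (imputed inputs x with probabilities q(x|y)) and forward_kbest (translations y' with probabilities p_theta(y'|x)), plus a stand-in sentence-level loss; none of these are Joshua APIs.

```python
def sentence_loss(hyp, ref):
    # stand-in for negated BLEU on a single sentence (toy overlap score)
    hyp_words, ref_words = hyp.split(), ref.split()
    overlap = len(set(hyp_words) & set(ref_words))
    return -overlap / max(len(ref_words), 1)

def imputed_risk(y_ref, reverse_kbest, forward_kbest, theta):
    """R(theta) ~ sum_x q(x|y) * sum_y' p_theta(y'|x) * loss(y', y_ref)."""
    risk = 0.0
    for x, q_x in reverse_kbest(y_ref):             # imputed inputs from the reverse model
        for y_hyp, p_y in forward_kbest(x, theta):  # forward translations of each imputed x
            risk += q_x * p_y * sentence_loss(y_hyp, y_ref)
    return risk
```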
29
29 Training the Reverse Model. The forward system and the reverse model are parameterized and trained separately; our goal is to train a good forward system, and the reverse model is held fixed while the forward system is trained.
30
30 Approximating the sum over imputed inputs. There are exponentially many imputed x, stored in a hypergraph; approximations include a k-best list, sampling, or a lattice (variational approximation plus lattice decoding, Dyer et al., 2008). A complication: the imputed inputs form a CFG forest and the forward system is an SCFG, and CFGs are not closed under composition, so the forest cannot be translated directly.
31
31 The Forward System. Deterministic decoding uses the one-best translation, but then the objective is not differentiable ☹. Randomized decoding uses a distribution over translations, so the expected loss is differentiable ☺.
32
32 Minimum Error Rate Training (MERT) (Och, 2003). The error surface of a metric such as BLEU (Papineni et al., 2001) is piecewise constant (not smoothed) and not amenable to gradient descent, so MERT cannot be scaled up to a large number of features (see also Smith and Eisner, 2006).
33
33 Experiments. Supervised training requires bitext; unsupervised training requires monolingual English; semi-supervised training interpolates the supervised and unsupervised objectives.
34
34 Semi-supervised Training Adding unsupervised data helps! 40K sent. pairs 551 features
35
35 Supervised vs. Unsupervised. Unsupervised training performs as well as (and often better than) the supervised case!
36
36 Supervised vs. Unsupervised. Unsupervised training performs as well as (and often better than) the supervised one! But is it a fair comparison? Unsupervised training uses 16 times as much data; for example (Chinese / English): Sup 100 / 16×100, Unsup 16×100 / 16×16×100. More experiments: different k-best sizes, different reverse models.
37
37 Unsupervised Training with Different Reverse Models Reverse Model WLM: with a language model NLM: without a language model A reasonably good reverse model is needed for the unsupervised training to work.
38
38 Unsupervised Training with Different Reverse Models Reverse Model WLM: with a language model NLM: without a language model A reasonably good reverse model is needed for the unsupervised training to work.
39
39 Unsupervised Training with Different k-best Size Performance does not change much with different k. k-best
40
40 Unsupervised Training with Different k-best Size k-best Performance does not change much with different k.
41
41 Outline Hypergraph as Hypothesis Space Unsupervised Discriminative Training ‣ minimum imputed risk ‣ contrastive language model estimation Variational Decoding First- and Second-order Expectation Semirings
42
42 Language Modeling. A language model assigns a probability to an English sentence y, typically with an n-gram model, which is locally normalized. A global log-linear model (whole-sentence maximum-entropy LM; Rosenfeld et al., 2001) is globally normalized over all English sentences of any length, with features such as the set of n-grams occurring in y; training it requires sampling and is slow ☹.
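A sketch of the contrast, assuming a bigram order and placeholder parameters (cond_prob for the n-gram model, theta for the whole-sentence model): the locally normalized model multiplies conditional word probabilities, while the globally normalized model only yields an unnormalized score, since its partition function ranges over all English sentences.

```python
import math
from collections import Counter

def ngram_logprob(sentence, cond_prob, n=2):
    """Locally normalized n-gram LM: sum of log p(w_i | history)."""
    words = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
    return sum(math.log(cond_prob(tuple(words[i - n + 1:i]), words[i]))
               for i in range(n - 1, len(words)))

def global_score(sentence, theta, n=2):
    """Globally normalized (whole-sentence max-ent) LM: Θ · f(y), where f(y)
    counts the n-grams in y.  Computing the true probability would require a
    partition function over all sentences, so only this score is cheap."""
    words = sentence.split()
    feats = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return sum(theta.get(gram, 0.0) * count for gram, count in feats.items())
```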
43
43 Contrastive Estimation (CE) (Smith and Eisner, 2005). Instead of normalizing the global log-linear model (Rosenfeld et al., 2001) over all sentences, normalize over a neighborhood (contrastive set): a set of alternate English sentences of y produced by a neighborhood function. This improves both speed and accuracy, and trains the model to recover the original English as much as possible; CE was not originally proposed for language modeling.
44
44 Contrastive Language Model Estimation. Step 1: extract a confusion grammar (CG), an English-to-English SCFG. Step 2: for each English sentence, generate a contrastive set (or neighborhood) using the CG. Step 3: discriminative training. (The CG serves as the neighborhood function, capturing paraphrase, insertion, and re-ordering confusions.)
45
45 Step 1: Extracting a Confusion Grammar (CG). Derive a CG from a bilingual grammar, using the Chinese side of each bilingual rule as the pivot to pair English sides into confusion rules. Our neighborhood function is therefore learned and MT-specific: the CG captures the confusions an MT system will have when translating an input.
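A toy sketch of the pivoting step: bilingual rules that share a Chinese source side are paired into English-to-English confusion rules. The rule strings are illustrative, not the thesis's actual grammar format.

```python
from collections import defaultdict
from itertools import permutations

# Toy bilingual rules as (chinese_side, english_side) pairs sharing one pivot.
bilingual_rules = [
    ("X0 de X1", "X1 on X0"),
    ("X0 de X1", "X1 of X0"),
    ("X0 de X1", "X0 's X1"),
    ("X0 de X1", "X0 X1"),
]

by_source = defaultdict(set)
for zh, en in bilingual_rules:
    by_source[zh].add(en)

confusion_rules = set()
for zh, en_sides in by_source.items():
    for e1, e2 in permutations(sorted(en_sides), 2):
        confusion_rules.add((e1, e2))   # English-to-English rule e1 -> e2

# e.g. ("X1 on X0", "X0 's X1") captures a confusion the MT system could make.
```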
46
46 Step 2: Generating Contrastive Sets. Translating "dianzi shang de mao"? Given the English sentence "a cat on the mat", the CG generates a contrastive set such as: the cat the mat; the cat 's the mat; the mat on the cat; the mat of the cat.
47
47 Step 3: Discriminative Training. Training objective: CE maximizes the conditional likelihood of the original English sentence relative to its contrastive set (alternatively, an expected-loss objective over the contrastive set can be used); training is iterative, regenerating contrastive sets with the CG as in Step 2.
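A minimal sketch of the CE objective under simplifying assumptions: feature_score stands in for the unnormalized log-linear score Θ·f(y) (here just a sum of per-word weights), and neighborhood(y) is the contrastive set generated in Step 2.

```python
import math

def feature_score(sentence, theta):
    # simplified stand-in for Θ · f(y); real features would look at confusion rules
    return sum(theta.get(word, 0.0) for word in sentence.split())

def ce_log_likelihood(corpus, neighborhood, theta):
    """Sum over y of log [ exp(Θ·f(y)) / sum_{y' in {y} ∪ N(y)} exp(Θ·f(y')) ]."""
    total = 0.0
    for y in corpus:
        candidates = [y] + list(neighborhood(y))   # original plus its contrastive set
        log_z = math.log(sum(math.exp(feature_score(c, theta)) for c in candidates))
        total += feature_score(y, theta) - log_z
    return total
```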
48
48 Applying the Contrastive Model. We can use the contrastive model as a regular language model; we can incorporate it into an end-to-end MT system as a feature; and we may also use it to generate paraphrase sentences (if the loss function measures semantic similarity), since the rules in the CG are symmetric.
49
49 Feature Set: target side of a confusion rule; rule bigram features; word penalty; baseline LM; two big features (whether glue rules or regular confusion rules are used); identity rules!
50
50 Test on Synthesized Hypergraphs of English Data: parse monolingual English with the confusion grammar to build a hypergraph (the neighborhood) for each sentence, rank the hypergraph with the trained contrastive LM, and compare the one-best English against the original sentence (BLEU score?).
51
51 Results on Synthesized Hypergraphs. Features: baseline LM (5-gram), word penalty, target side of a confusion rule, rule bigram features; all the features look only at the target sides of confusion rules. The contrastive LM recovers the original English better than a regular n-gram LM.
52
52 Results on MT Test Set Add CLM as a feature The contrastive LM helps to improve MT performance.
53
53 Adding Features on the CG itself: results on the English set and on the MT set. (Figure labels: paraphrasing model; glue rules or regular confusion rules?; one big feature.)
54
54 Results on Synthesized English Data
55
55 Results on English Test Set
56
56 Results on English Test Set
57
57 Results on English Test Set. Our contrastive LM performs well as an LM alone. Adding more features often helps. (Paraphrasing model.)
58
58 Results on MT Test Set. Only the RuleBigram features are used in MT decoding, but their weights may be trained jointly with the other features during CE.
59
59 Results on MT Test Set. Only the RuleBigram features are used in MT decoding, but their weights may be trained jointly with the other features during CE.
60
60 Results on MT Test Set. Our contrastive LM helps to improve MT performance, and the way the contrastive LM is trained does matter. Only the RuleBigram features are used in MT decoding, but their weights may be trained jointly with the other features during CE.
61
61 Summary for Discriminative Training. Supervised: Minimum Empirical Risk, which requires bitext. Unsupervised: Minimum Imputed Risk and Contrastive LM Estimation, which require only monolingual English.
62
62 Summary for Discriminative Training. Supervised training requires bitext and can have both TM and LM features. Unsupervised Minimum Imputed Risk requires monolingual English plus a reverse model, and can also have both TM and LM features. Unsupervised Contrastive LM Estimation requires monolingual English and can have LM features only.
63
63 Outline Hypergraph as Hypothesis Space Unsupervised Discriminative Training ‣ minimum imputed risk ‣ contrastive language model estimation Variational Decoding First- and Second-order Expectation Semirings
64
64 Variational Decoding. We want to do inference under p, but it is intractable (exact MAP decoding; Sima'an, 1996). Instead, we derive a simpler distribution q* from a tractable family Q (tractable estimation), then use q* as a surrogate for p in inference (tractable decoding).
65
65 Variational Decoding for MT: an Overview. Sentence-specific decoding: for a foreign sentence x, MAP decoding under p is intractable because p(y|x) = ∑_{d ∈ D(x,y)} p(d|x). Three steps; step 1: generate a hypergraph for the foreign sentence, which defines p(d|x).
66
66 Step 1: generate a hypergraph, which defines p(d|x). Step 2: estimate a model q*(y|x) from the hypergraph by minimizing KL divergence, so that q*(y|x) ≈ ∑_{d ∈ D(x,y)} p(d|x); q* is an n-gram model over output strings (approximate a hypergraph with a lattice!). Step 3: decode using q* on the hypergraph.
67
67 Outline Hypergraph as Hypothesis Space Unsupervised Discriminative Training ‣ minimum imputed risk ‣ contrastive language model estimation Variational Decoding First- and Second-order Expectation Semirings
68
68 Probabilistic Hypergraph. "Decoding" quantities: Viterbi, k-best, counting, ... First-order expectations: expectation, entropy, expected loss, cross-entropy, KL divergence, feature expectations, first-order gradient of Z. Second-order expectations: expectation over a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of expected loss or entropy. A semiring framework computes all of these. Recipe to compute a quantity: choose a semiring, specify a semiring weight for each hyperedge, run the inside algorithm.
69
69 Applications of Expectation Semirings: a Summary
70
70 Unsupervised Discriminative Training ‣ minimum imputed risk ‣ contrastive language model estimation Variational Decoding First- and Second-order Expectation Semirings Inference, Training and Decoding on Hypergraphs (Li et al., ACL 2009) (Li and Eisner, EMNLP 2009) (In Preparation)
71
71 My Other MT Research Training methods (supervised) Discriminative forest reranking with Perceptron (Li and Khudanpur, GALE book chapter 2009) Discriminative n-gram language models (Li and Khudanpur, AMTA 2008) Algorithms Oracle extraction from hypergraphs (Li and Khudanpur, NAACL 2009) Efficient intersection between n-gram LM and CFG (Li and Khudanpur, ACL SSST 2008) Others System combination (Smith et al., GALE book chapter 2009) Unsupervised translation induction for Chinese abbreviations (Li and Yarowsky, ACL 2008)
72
72 Research other than MT Information extraction Relation extraction between formal and informal phrases (Li and Yarowsky, EMNLP 2008) Spoken dialog management Optimal dialog in consumer-rating systems using a POMDP (Li et al., SIGDial 2008)
73
73 Joshua project: an open-source parsing-based MT toolkit (Li et al., 2009) that supports Hiero (Chiang, 2007) and SAMT (Venugopal et al., 2007). Team members: Zhifei Li, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Wren Thornton, Jonathan Weese, Juri Ganitkevitch, Lane Schwartz, and Omar Zaidan. All the methods presented have been implemented in Joshua! It relies only on a word aligner and the SRI LM toolkit!
74
74 Thank you! XieXie! 谢谢 !
78
78 Decoding over a hypergraph. Given a hypergraph of possible translations (generated for a given foreign sentence by an already-trained model), pick a single translation to output. (Why not just pick the tree with the highest weight?)
79
79 Spurious Ambiguity Statistical models in MT exhibit spurious ambiguity Many different derivations (e.g., trees or segmentations) generate the same translation string Tree-based MT systems derivation tree ambiguity Regular phrase-based MT systems phrase segmentation ambiguity
80
80 Spurious Ambiguity in Derivation Trees. Input: 机器 翻译 软件 (jiqi fanyi ruanjian). Rules: S → (机器, machine), S → (翻译, translation), S → (软件, software), the glue rule S → (S0 S1, S0 S1), and S → (S0 翻译 S1, S0 translation S1). Three different derivation trees all yield the same output, "machine translation software"; another translation is "machine transfer software".
81
81 MAP, Viterbi and N-best Approximations. Exact MAP decoding is NP-hard (Sima'an, 1996); approximations include the Viterbi approximation and the N-best approximation, "crunching" (May and Knight, 2006).
82
82 MAP vs. Approximations. Each translation string (red, blue, green) has one or more derivations, with probabilities 0.16, 0.14, 0.13, 0.12, 0.11, 0.10 in the figure. Viterbi keeps only the single best derivation per string (0.16, 0.14, 0.13); 4-best crunching sums probabilities within the 4-best derivation list (0.16, 0.28, 0.13); exact MAP sums over all derivations of each string (0.28, 0.44). Viterbi and crunching are efficient but ignore most derivations, while exact MAP decoding under spurious ambiguity is intractable on a hypergraph. Our goal: develop an approximation that considers all the derivations but still allows tractable decoding.
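A small illustration of why these decoders can disagree, with made-up derivation probabilities rather than the figure's exact numbers: Viterbi looks at single derivations, crunching sums within a k-best list, and MAP sums over all derivations of each string.

```python
from collections import defaultdict

# Illustrative derivation probabilities (not the slide's figure): each
# derivation d maps to its translation string Y(d).
derivations = [
    ("green translation", 0.16),
    ("red translation", 0.10), ("red translation", 0.10), ("red translation", 0.10),
    ("blue translation", 0.14), ("blue translation", 0.13),
]

def viterbi(ders):
    # best single derivation
    return max(ders, key=lambda d: d[1])[0]

def map_decode(ders):
    # sum over ALL derivations per string, then pick the best string
    totals = defaultdict(float)
    for y, p in ders:
        totals[y] += p
    return max(totals, key=totals.get)

def crunch(ders, k):
    # sum only within the k best derivations (May and Knight's crunching)
    top = sorted(ders, key=lambda d: -d[1])[:k]
    return map_decode(top)

print(viterbi(derivations))    # green translation (0.16 beats any single derivation)
print(crunch(derivations, 4))  # blue translation wins inside the 4-best list
print(map_decode(derivations)) # red translation (0.30) wins once all derivations count
```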
83
83 Variational Decoding Decoding using Variational approximation Decoding using a sentence-specific approximate distribution
84
84 Variational Decoding for MT: an Overview. Sentence-specific decoding: for a foreign sentence x, MAP decoding under p(y|x) is intractable. Three steps; step 1: generate a hypergraph for the foreign sentence, which defines p(y, d | x).
85
85 Step 1: generate a hypergraph, which defines p(d|x). Step 2: estimate a model q*(y|x) from the hypergraph, so that q*(y|x) ≈ ∑_{d ∈ D(x,y)} p(d|x); q* is an n-gram model over output strings. Step 3: decode using q* on the hypergraph.
86
86 Variational Inference We want to do inference under p, but it is intractable Instead, we derive a simpler distribution q* Then, we will use q* as a surrogate for p in inference p Q q* P
87
87 Variational Approximation. q* is the member of a family Q of distributions having minimum distance (KL divergence) to p; the term of the KL divergence that depends only on p is constant. Three questions: how to parameterize q? (as an n-gram model); how to estimate q*? (compute expected n-gram counts and normalize); how to use q* for decoding? (score the hypergraph with the n-gram model).
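A sketch of the estimate-and-score steps, assuming the expected n-gram counts under p have already been computed from the hypergraph (e.g., with an inside-outside style pass, not shown); the function names are mine.

```python
import math
from collections import defaultdict

def estimate_q(expected_ngram_counts, n=2):
    """Build q*(w | history) by normalizing expected n-gram counts.

    expected_ngram_counts: dict mapping an n-gram tuple to its expected count
    under p, as gathered from the hypergraph.
    """
    history_totals = defaultdict(float)
    for gram, count in expected_ngram_counts.items():
        history_totals[gram[:-1]] += count
    return {gram: count / history_totals[gram[:-1]]
            for gram, count in expected_ngram_counts.items()}

def q_logscore(sentence, q, n=2):
    """log q*(y): used to rescore each derivation's yield in the hypergraph."""
    words = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
    return sum(math.log(q.get(tuple(words[i - n + 1:i + 1]), 1e-12))
               for i in range(n - 1, len(words)))
```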
88
88 KL divergences under different variational models Larger n ==> better approximation q_n ==> smaller KL divergence from p The reduction of KL divergence happens mostly when switching from unigram to bigram
89
89 BLEU Results on Chinese-English NIST MT 2004 Tasks: variational decoding (new!) improves over Viterbi, MBR (Kumar and Byrne, 2004), and crunching (May and Knight, 2006).
90
90 Variational Inference We want to do inference under p, but it is intractable Instead, we derive a simpler distribution q* Then, we will use q* as a surrogate for p in inference p Q q* P intractable tractable
91
91 Outline Hypergraph as Hypothesis Space Unsupervised Discriminative Training ‣ minimum imputed risk ‣ contrastive language model estimation Variational Decoding First- and Second-order Expectation Semirings
92
92 First-order quantities: - expectation - entropy - Bayes risk - cross-entropy - KL divergence - feature expectations - first-order gradient of Z Second-order quantities: - expectation over product - interaction between features - Hessian matrix of Z - second-order gradient descent - gradient of expectation - gradient of entropy or Bayes risk Probabilistic Hypergraph “decoding” quantities: - Viterbi - K-best - Counting -...... A semiring framework to compute all of these
93
93 Compute Quantities on a Hypergraph: a Recipe. Three steps: ‣ choose a semiring (a set with plus and times operations, e.g., the integers with regular + and ×); ‣ specify a weight for each hyperedge (each weight is a semiring member); ‣ run the semiring-weighted inside algorithm (complexity is O(hypergraph size)).
94
94 Semirings. "Decoding"-time semirings: counting, Viterbi, k-best, etc. (Goodman, 1999). "Training"-time semirings: first-order expectation semirings (Eisner, 2002) and second-order expectation semirings (new). Applications of the semirings (new): entropy, risk, gradients of them, and many more.
95
95 dianzi 0 shang 1 de 2 mao 3. How many trees? Four ☺. Can we compute that with a semiring?
96
96 Compute the Number of Derivation Trees. Three steps: ‣ choose a semiring (the counting semiring: ordinary integers with regular + and ×); ‣ specify a weight for each hyperedge (k_e = 1 on every edge); ‣ run the inside algorithm.
97
97 Bottom-up process in computing the number of trees (all edge weights k_e = 1): k(v1) = k(e1) = 1; k(v2) = k(e2) = 1; k(v3) = k(e3)·k(v1)·k(v2) + k(e4)·k(v1)·k(v2) = 2; k(v4) = k(e5)·k(v1)·k(v2) + k(e6)·k(v1)·k(v2) = 2; k(v5) = k(e7)·k(v3) + k(e8)·k(v4) = 4.
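The same computation as a generic semiring-weighted inside algorithm in Python; the topology of the toy hypergraph (which tails each edge has) is my reconstruction from the slide labels v1..v5 and e1..e8, and the counting semiring reproduces the answer of 4 trees.

```python
def inside(nodes_in_topo_order, incoming, plus, times, edge_weight):
    """Semiring-weighted inside algorithm.  incoming[v] lists (edge_id, tails)."""
    value = {}
    for v in nodes_in_topo_order:
        total = None
        for e, tails in incoming[v]:
            term = edge_weight[e]
            for u in tails:
                term = times(term, value[u])
            total = term if total is None else plus(total, term)
        value[v] = total
    return value

# Reconstructed toy hypergraph from slides 95-97.
incoming = {
    "v1": [("e1", [])], "v2": [("e2", [])],
    "v3": [("e3", ["v1", "v2"]), ("e4", ["v1", "v2"])],
    "v4": [("e5", ["v1", "v2"]), ("e6", ["v1", "v2"])],
    "v5": [("e7", ["v3"]), ("e8", ["v4"])],
}
ones = {e: 1 for e in "e1 e2 e3 e4 e5 e6 e7 e8".split()}

# Counting semiring: ordinary integers with + and ×, k_e = 1 on every edge.
counts = inside(["v1", "v2", "v3", "v4", "v5"], incoming,
                plus=lambda a, b: a + b, times=lambda a, b: a * b,
                edge_weight=ones)
print(counts["v5"])   # 4 derivation trees, matching the slide
```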
98
98 the mat / a cat / a cat on the mat / a cat of the mat / the mat 's a cat. The translations have lengths 4, 5, 5, 5 with probabilities p = 2/8, 1/8, 3/8, 2/8. Expected translation length: 2/8×4 + 6/8×5 = 4.75. Variance: 2/8×(4-4.75)^2 + 6/8×(5-4.75)^2 ≈ 0.19.
99
99 First- and Second-order Expectation Semirings (Eisner, 2002). First-order: each member is a 2-tuple ⟨p, r⟩, with ⟨p1, r1⟩ ⊕ ⟨p2, r2⟩ = ⟨p1+p2, r1+r2⟩ and ⟨p1, r1⟩ ⊗ ⟨p2, r2⟩ = ⟨p1·p2, p1·r2 + p2·r1⟩. Second-order: each member is a 4-tuple ⟨p, r, s, t⟩, with componentwise ⊕ and ⟨p1, r1, s1, t1⟩ ⊗ ⟨p2, r2, s2, t2⟩ = ⟨p1·p2, p1·r2 + p2·r1, p1·s2 + p2·s1, p1·t2 + p2·t1 + r1·s2 + r2·s1⟩.
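A short sketch of those operations in Python, checked against the toy numbers of slide 98 (expected length 4.75, variance ≈ 0.19). For brevity the demo combines derivation-level tuples directly with ⊕; in a full hypergraph computation each derivation's tuple would arise from ⊗-products of hyperedge weights inside the inside algorithm sketched above.

```python
def plus1(a, b):
    # first-order ⊕: componentwise sum
    return (a[0] + b[0], a[1] + b[1])

def times1(a, b):
    # first-order ⊗ (used when multiplying edge weights along a derivation)
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

def plus2(a, b):
    # second-order ⊕: componentwise sum
    return tuple(x + y for x, y in zip(a, b))

def times2(a, b):
    # second-order ⊗
    p1, r1, s1, t1 = a
    p2, r2, s2, t2 = b
    return (p1 * p2, p1 * r2 + p2 * r1, p1 * s2 + p2 * s1,
            p1 * t2 + p2 * t1 + r1 * s2 + r2 * s1)

# Toy distribution from slide 98: (probability, translation length) pairs.
ders = [(2/8, 4), (1/8, 5), (3/8, 5), (2/8, 5)]

# First-order: a derivation with probability p and length L contributes <p, p*L>.
acc1 = (0.0, 0.0)
for p, L in ders:
    acc1 = plus1(acc1, (p, p * L))
print(acc1[1] / acc1[0])            # expected length = 4.75

# Second-order with r = s = length: contributes <p, p*L, p*L, p*L*L>.
acc2 = (0.0, 0.0, 0.0, 0.0)
for p, L in ders:
    acc2 = plus2(acc2, (p, p * L, p * L, p * L * L))
mean = acc2[1] / acc2[0]
print(acc2[3] / acc2[0] - mean ** 2)  # variance = 0.1875 ≈ 0.19
```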
100
100 First-order: each semiring member is a 2-tuple (fake numbers). Edge weights: ⟨1,2⟩, ⟨2,0⟩, ⟨3,3⟩, ⟨1,1⟩, ⟨2,2⟩, ⟨1,0⟩. Inside values: k(v1) = k(e1) = ⟨1,2⟩; k(v2) = k(e2) = ⟨1,2⟩; k(v3) = k(e3)⊗k(v1)⊗k(v2) ⊕ k(e4)⊗k(v1)⊗k(v2) = ⟨0,1⟩; k(v4) = k(e5)⊗k(v1)⊗k(v2) ⊕ k(e6)⊗k(v1)⊗k(v2) = ⟨1,2⟩; k(v5) = k(e7)⊗k(v3) ⊕ k(e8)⊗k(v4) = ⟨8, 4.75⟩.
101
101 Second-order: each semiring member is a 4-tuple (fake numbers). Edge weights: ⟨1,2,2,4⟩, ⟨2,0,0,0⟩, ⟨3,3,3,3⟩, ⟨1,1,1,1⟩, ⟨2,2,2,2⟩, ⟨1,0,0,0⟩. Inside values, computed with the same recursion as before: k(v1) = k(e1) = ⟨1,2,2,4⟩; k(v2) = k(e2) = ⟨1,2,2,4⟩; k(v3) = ⟨1,1,1,1⟩; k(v4) = ⟨1,2,1,3⟩; k(v5) = ⟨8, 4.5, 4.5, 5⟩.
102
102 Expectations on Hypergraphs. r(d) is a function of a derivation d, e.g., the length of the translation yielded by d. The expectation over a hypergraph, ∑_d p(d)·r(d), has exponentially many terms, but it can be computed efficiently when r(d) is additively decomposed over hyperedges; translation length, for example, is additively decomposed.
103
103 Second-order Expectations on Hypergraphs. The expectation of a product over a hypergraph, ∑_d p(d)·r(d)·s(d), likewise has exponentially many terms; it can be computed when r and s are additively decomposed (r and s can be identical or different functions).
104
104 Compute the expectation using the expectation semiring. Why? Entropy is an expectation. Here p_e is the transition probability or log-linear score at edge e; what should r_e be? Since log p(d) = ∑_{e∈d} log p_e is additively decomposed, r_e can be built from log p_e, and the entropy follows from the resulting expectation.
105
105 Compute the expectation using the expectation semiring. Why? Cross-entropy is also an expectation. With p_e the transition probability or log-linear score at edge e, note that log q(d) is additively decomposed, so r_e can be built from log q_e, just as the entropy used log p_e.
106
106 Compute the expectation using the expectation semiring. Why? The Bayes risk is also an expectation (Tromble et al., 2008): the loss L(Y(d)) is additively decomposed, so r_e can be taken to be the loss contribution of edge e.
107
107 Applications of Expectation Semirings: a Summary
108
108 First-order quantities: - expectation - entropy - Bayes risk - cross-entropy - KL divergence - feature expectations - first-order gradient of Z Second-order quantities: - Expectation over product - interaction between features - Hessian matrix of Z - second-order gradient descent - gradient of expectation - gradient of entropy or Bayes risk Probabilistic Hypergraph “decoding” quantities: - Viterbi - K-best - Counting -...... A semiring framework to compute all of these