A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University
JOSHUA: a scalable open-source parsing-based MT decoder
Written in Java
Chart parsing, beam and cube pruning, k-best extraction over a hypergraph, and m-gram LM integration (Chiang, 2007)
Parallel decoding
Distributed LM (Zhang et al., 2006; Brants et al., 2007)
Equivalent LM state maintenance (new!)
We plan to add more functions soon.
Chart-parsing
Grammar formalism: Synchronous Context-Free Grammar (SCFG)
Decoding: bottom-up chart parsing
The decoder maintains a chart, which contains an array of cells (or bins); each cell maintains a list of items.
The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved.
The hypotheses are stored in a hypergraph (sketched below).
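For concreteness, here is a minimal sketch of that bottom-up loop in Java, the decoder's language. The class and field names (ChartSketch, Item, cells) are illustrative assumptions rather than the actual JOSHUA code, and rule application is only outlined in comments.

```java
import java.util.ArrayList;
import java.util.List;

class ChartSketch {
    static class Item {
        String lhs;      // left-hand-side nonterminal of the item, e.g. "X"
        int i, j;        // source span [i, j) the item covers
        double cost;     // model cost accumulated so far
    }

    // cells[i][j] holds the items proved so far for source span [i, j)
    private List<Item>[][] cells;

    @SuppressWarnings("unchecked")
    void parse(String[] source) {
        int n = source.length;
        cells = new List[n + 1][n + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= n; j++)
                cells[i][j] = new ArrayList<>();

        // bottom-up: prove items for short spans before longer ones
        for (int width = 1; width <= n; width++) {
            for (int i = 0; i + width <= n; i++) {
                int j = i + width;
                // axioms: apply lexical rules whose source side matches source[i..j)
                // inference: for each split point k in (i, j), combine items from
                // cells[i][k] and cells[k][j] using rules with nonterminals, adding
                // the resulting items to cells[i][j] subject to beam/cube pruning
            }
        }
        // decoding succeeds once a goal item (start symbol over [0, n)) is proved
    }
}
```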
Hypergraph
[Figure: an example hypergraph for the source 垫子0 上1 的2 猫3, with items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat (source span plus the item's left/right LM state words), connected by hyperedges labeled with rules such as X (猫, a cat), X (垫子 上, the mat), X (X0 的 X1, X0 X1), X (X0 的 X1, X0 ’s X1), X (X0 的 X1, X1 of X0), X (X0 的 X1, X1 on X0), and (X0, X0) leading to the goal item S.]
A hypergraph is a compact way to represent an exponential number of derivation trees.
It contains a list of items and hyperedges.
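A minimal sketch of the two data structures in Java, using assumed names (HGItem, HyperEdge) rather than JOSHUA's actual classes: each item records its span, nonterminal, and LM states, while each hyperedge records the rule applied plus pointers to its antecedent items.

```java
import java.util.List;

class HGItem {
    String lhs;                    // left-hand-side nonterminal, e.g. "X" or "S"
    int i, j;                      // source span covered, e.g. [0, 4)
    String[] leftLmState;          // leftmost m-1 target words, e.g. {"a", "cat"}
    String[] rightLmState;         // rightmost m-1 target words, e.g. {"the", "mat"}
    List<HyperEdge> incomingEdges; // the alternative deductions that proved this item
}

class HyperEdge {
    String rule;                   // e.g. "X (X0 的 X1, X1 on X0)"
    List<HGItem> antecedents;      // the items the rule was applied to
    double cost;                   // cost of this deduction step
}
```

Because many derivations share the same items, the hypergraph can store exponentially many trees in polynomial space.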
Hypergraph and Trees
[Figure: the four derivation trees packed into the hypergraph above for the source 垫子0 上1 的2 猫3, built from X (猫, a cat), X (垫子 上, the mat), (X0, X0), and one of the four glue rules X (X0 的 X1, X0 X1), X (X0 的 X1, X0 ’s X1), X (X0 的 X1, X1 of X0), X (X0 的 X1, X1 on X0), yielding the translations "the mat a cat", "the mat ’s a cat", "a cat of the mat", and "a cat on the mat".]
A hypergraph is a compact way to represent an exponential number of derivation trees.
How to Integrate an m-gram LM?
Three functions:
Accumulate probability
Estimate future cost
State extraction
[Figure: a worked example with a 3-gram LM for the source 奥运会0 将1 在2 中国3 的4 北京5 举行。6, translated as "the olympic game will be held in beijing of china .". Creating the item X | 3, 6 | beijing of | of china introduces the new 3-gram "beijing of china"; applying X (将 在 X0 举行。, will be held in X0 .) to it introduces the new 3-grams "will be held", "be held in", "held in beijing", and "in beijing of", whose LM probability 0.4 * 0.2 * 0.5 = 0.04, combined with the future probability P(beijing of) = 0.01, gives the estimated total probability 0.01 * 0.04 = 0.004 used for pruning, and the new state X | 1, 7 | will be | china . is extracted. Further rules such as S (S0 X1, S0 X1) and S (<s> S0 </s>, <s> S0 </s>) lead to the goal item S | 0, 7 | <s> the | . </s>.]
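A simplified Java sketch of the three functions is shown below. The interface and helper names are assumptions, not JOSHUA's API, and for clarity the sketch assumes each antecedent's full target word sequence is available (no elided middle words); the real decoder works with just the m-1 boundary words on each side.

```java
import java.util.List;

class LmIntegrationSketch {
    interface NgramLm {
        double logProb(List<String> ngram);            // log P(last word | preceding words)
        double logProbLowerOrder(List<String> ngram);  // same, for n-grams shorter than m
    }

    /** Accumulate: score every m-gram that becomes complete at this hyperedge,
     *  i.e. every m-gram not already contained inside a single antecedent.
     *  'antecedentSpans' are [start, end) positions of antecedents in 'words'. */
    static double accumulate(List<String> words, List<int[]> antecedentSpans,
                             int m, NgramLm lm) {
        double logProb = 0.0;
        for (int end = m; end <= words.size(); end++) {
            int start = end - m;
            if (!insideOneAntecedent(start, end, antecedentSpans)) {
                logProb += lm.logProb(words.subList(start, end));  // new m-gram
            }
        }
        return logProb;
    }

    /** Estimate: boundary words that still lack full left context are scored with
     *  lower-order probabilities, giving the future-cost estimate used in pruning. */
    static double estimateFutureCost(List<String> words, int m, NgramLm lm) {
        double estimate = 0.0;
        for (int end = 1; end < m && end <= words.size(); end++) {
            estimate += lm.logProbLowerOrder(words.subList(0, end));
        }
        return estimate;
    }

    /** State extraction: only the leftmost and rightmost m-1 words can appear
     *  in m-grams formed later, so they form the new item's LM state. */
    static List<String> leftState(List<String> words, int m) {
        return words.subList(0, Math.min(m - 1, words.size()));
    }
    static List<String> rightState(List<String> words, int m) {
        return words.subList(Math.max(0, words.size() - (m - 1)), words.size());
    }

    private static boolean insideOneAntecedent(int start, int end, List<int[]> spans) {
        for (int[] s : spans) {
            if (start >= s[0] && end <= s[1]) return true;
        }
        return false;
    }
}
```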
Equivalent State Maintenance: Overview
In a straightforward implementation, different LM state words lead to different items:
X | 0, 3 | below cat | some rat
X | 0, 3 | below cats | many rat
X | 0, 3 | under cat | some rat
X | 0, 3 | below cat | many rat
(produced by rules such as X (在 X0 的 X1 下, below X1 of X0), X (在 X0 的 X1 下, under the X1 of X0), and so on)
We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard:
X | 0, 3 | below * | * rat
By merging items, we can explore a larger hypothesis space using less time.
We only merge items when the length of the English span satisfies l ≥ m - 1.
Now, how do we decide which LM state words can be replaced?
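The merging itself can be as simple as hashing items by a state signature. The sketch below uses assumed names (ItemSignatureSketch, WILDCARD), not the JOSHUA implementation; the point is that once some LM state words have been replaced by the wildcard, items with identical signatures collapse into one chart entry.

```java
import java.util.HashMap;
import java.util.Map;

class ItemSignatureSketch {
    static final String WILDCARD = "*";   // stands for "any word" in an LM state

    // Two items are the same search state iff they agree on the nonterminal,
    // the source span, and the (possibly wildcarded) left/right LM states,
    // so this string can serve as the hash key of a chart cell.
    static String signature(String lhs, int i, int j,
                            String[] leftLmState, String[] rightLmState) {
        return lhs + " | " + i + ", " + j + " | "
                + String.join(" ", leftLmState) + " | "
                + String.join(" ", rightLmState);
    }

    public static void main(String[] args) {
        Map<String, Integer> cell = new HashMap<>();  // signature -> count of merged items
        // after equivalent-state maintenance, both items below carry the state
        // "below *" / "* rat", so they fall into the same chart entry
        cell.merge(signature("X", 0, 3, new String[]{"below", WILDCARD},
                             new String[]{WILDCARD, "rat"}), 1, Integer::sum);
        cell.merge(signature("X", 0, 3, new String[]{"below", WILDCARD},
                             new String[]{WILDCARD, "rat"}), 1, Integer::sum);
        System.out.println(cell);   // prints a single entry with count 2
    }
}
```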
Back-off Parameterization of m-gram LMs
Before discussing how to merge items, we need to understand the back-off principle.
LM probability computation: if the full m-gram is listed, use its probability directly; otherwise multiply the back-off weight of the context by the lower-order probability.
Example ARPA-style bigram entries (log10 probability, bigram, optional log10 back-off weight):
-4.250922  party files
-4.741889  party filled
-4.250922  party finance      -0.1434139
-4.741889  party financed
-4.741889  party finances     -0.2361806
-4.741889  party financially
-3.33127   party financing    -0.1119054
-3.277455  party finished     -0.4362795
-4.012205  party fired
-4.741889  party fires
Observations:
A larger m leads to more back-off.
The default back-off weight is 1: for an m-gram not listed, β(·) = 1.
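The computation can be sketched as the usual recursive back-off lookup over an ARPA-style table. The map-based storage and the unknown-word penalty below are illustrative assumptions, not JOSHUA's internal representation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BackoffLmSketch {
    private final Map<String, Double> logProb = new HashMap<>();    // listed n-grams
    private final Map<String, Double> logBackoff = new HashMap<>(); // listed contexts

    /** log P(last word of 'ngram' | the preceding words), with recursive back-off:
     *  a larger m means the full m-gram is less likely to be listed, hence more
     *  back-off steps; every unlisted context contributes beta = 1 (log 0). */
    double score(List<String> ngram) {
        String key = String.join(" ", ngram);
        Double p = logProb.get(key);
        if (p != null) {
            return p;                                   // the m-gram is listed
        }
        if (ngram.size() == 1) {
            return -99.0;                               // unknown-word penalty (assumed)
        }
        List<String> context = ngram.subList(0, ngram.size() - 1);
        double beta = logBackoff.getOrDefault(String.join(" ", context), 0.0);
        return beta + score(ngram.subList(1, ngram.size()));   // back off to (m-1)-gram
    }
}
```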
Equivalent State Maintenance: Right-side
Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
For the case of a 4-gram LM: if IS-A-PREFIX(e_{l-2} e_{l-1} e_l) = no, the back-off weight is one and the probability is independent of e_{l-2}:
P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) * β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l)
state words          | state prefix checked | IS-A-PREFIX | equivalent state | future words
e_{l-2} e_{l-1} e_l  | e_{l-2} e_{l-1} e_l  | no          | * e_{l-1} e_l    | e_{l+1} e_{l+2} e_{l+3} ...
* e_{l-1} e_l        | e_{l-1} e_l          | no          | * e_l            | e_{l+1} e_{l+2} e_{l+3} ...
* e_l                | e_l                  | no          | *                | e_{l+1} e_{l+2} e_{l+3} ...
Note: IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no.
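A small Java sketch of this right-side procedure, where isAPrefix is an assumed LM lookup (is the word sequence a prefix of any n-gram listed in the LM?): starting from the full right state, the leftmost remaining state word is dropped whenever the remaining words cannot start any listed n-gram, because every future m-gram using that word would back off with a back-off weight of one.

```java
import java.util.ArrayList;
import java.util.List;

class RightStateSketch {
    interface PrefixLookup {
        boolean isAPrefix(List<String> words);  // prefix of some listed n-gram?
    }

    /** Returns the equivalent right state, with dropped words replaced by "*". */
    static List<String> equivalentRightState(List<String> rightState, PrefixLookup lm) {
        List<String> state = new ArrayList<>(rightState);
        int keepFrom = 0;
        // If the remaining state cannot start any listed n-gram, its leftmost
        // word can never influence a future probability, so it is dropped.
        while (keepFrom < state.size()
                && !lm.isAPrefix(state.subList(keepFrom, state.size()))) {
            state.set(keepFrom, "*");
            keepFrom++;
        }
        return state;
    }
}
```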
Equivalent State Maintenance: Left-side
Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
For the case of a 4-gram LM: if IS-A-SUFFIX(e_1 e_2 e_3) = no, then
P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) * β(e_0 e_1 e_2)
Here P(e_3 | e_1 e_2) is a finalized probability that can be computed now, and the only part that still depends on future words, β(e_0 e_1 e_2), does not involve e_3; so e_3 can be dropped from the state, remembering to factor in the back-off weight later.
state words | state suffix checked | IS-A-SUFFIX | equivalent state | future words
e_1 e_2 e_3 | e_1 e_2 e_3          | no          | e_1 e_2 *        | ... e_{-2} e_{-1} e_0
e_1 e_2 *   | e_1 e_2              | no          | e_1 *            | ... e_{-2} e_{-1} e_0
e_1 *       | e_1                  | no          | *                | ... e_{-2} e_{-1} e_0
Similarly, once e_2 and e_1 are dropped, their probabilities are finalized with lower-order m-grams, with the back-off weights factored in later:
P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) * β(e_0 e_1) * β(e_{-1} e_0 e_1)
P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) * β(e_0) * β(e_{-1} e_0) * β(e_{-2} e_{-1} e_0)
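The symmetric left-side procedure can be sketched the same way, with isASuffix as an assumed LM lookup (is the word sequence a suffix of any listed n-gram?): the rightmost remaining state word is dropped whenever the remaining words cannot end any listed n-gram, since its probability can then be finalized with a lower-order m-gram and only the back-off weights remain to be factored in later.

```java
import java.util.ArrayList;
import java.util.List;

class LeftStateSketch {
    interface SuffixLookup {
        boolean isASuffix(List<String> words);  // suffix of some listed n-gram?
    }

    /** Returns the equivalent left state, with dropped words replaced by "*". */
    static List<String> equivalentLeftState(List<String> leftState, SuffixLookup lm) {
        List<String> state = new ArrayList<>(leftState);
        int keepTo = state.size();
        // If the remaining state cannot end any listed n-gram, no future left
        // context can complete a listed m-gram for the rightmost state word,
        // so that word can be dropped and its probability finalized now.
        while (keepTo > 0 && !lm.isASuffix(state.subList(0, keepTo))) {
            state.set(keepTo - 1, "*");
            keepTo--;
        }
        return state;
    }
}
```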
Equivalent State Maintenance: Summary
Original cost function and modified cost function, each consisting of:
Finalized probability
Estimated probability
State extraction
Experimental Results: Decoding Speed
System training
Task: Chinese-to-English translation
Sub-sampling a bitext of about 3M sentence pairs yields 570k sentence pairs
LM training data: Gigaword and the English side of the bitext
Decoding speed
Number of rules: 3M
Number of m-grams: 49M
JOSHUA is 38 times faster than the baseline!
Experimental Results: Distributed LM
Distributed language model: eight 7-gram LMs
Decoding speed: 12.2 sec/sent
Experimental Results: Equivalent LM States
[Figure: search effort versus search quality, with and without equivalent LM state maintenance, with points labeled 30, 50, 70, 90, 120, 150, and 200.]
Sparse LM: a 7-gram LM built on about 19M words
Dense LM: a 3-gram LM built on about 130M words
The equivalent LM state maintenance is slower than the regular method:
Back-off happens less frequently
Suffix/prefix information lookup is inefficient
Summary
We describe a scalable parsing-based MT decoder.
The decoder has been successfully used for decoding millions of sentences in a large-scale discriminative training task.
We propose a method to maintain equivalent LM states.
The decoder is available at http://www.cs.jhu.edu/~zfli/
Acknowledgements
Thanks to Philip Resnik for letting me use the UMD Python decoder.
Thanks to the UMD MT group members for very helpful discussions.
Thanks to David Chiang for Hiero and his original implementation in Python.
Thank you!
Grammar Formalism
Synchronous Context-Free Grammar (SCFG)
Ts: a set of source-language terminal symbols
Tt: a set of target-language terminal symbols
N: a shared set of nonterminal symbols
A set of rules of the form X (γ, α), where X ∈ N, γ is a string over N ∪ Ts, α is a string over N ∪ Tt, and the nonterminals in γ and α are in one-to-one correspondence
A typical rule looks like: X (X0 的 X1, X1 of X0)
Chart-parsing
Grammar formalism: Synchronous Context-Free Grammar (SCFG)
The decoding task is defined as chart parsing.
It maintains a chart, which contains an array of cells (or bins); a cell maintains a list of items.
The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved.
The hypotheses are stored in a structure called a hypergraph.
m-gram LM Integration
Three functions:
Accumulate probability
Estimate future cost
State extraction
Cost function components:
Finalized probability
Estimated probability
State extraction
Parallel and Distributed Decoding
Parallel decoding
Divide the test set into multiple parts
Each part is decoded by a separate thread
The threads share the language/translation models in memory
Distributed Language Model (DLM)
Training
Divide the corpora into multiple parts
Train an LM on each part
Find the optimal interpolation weights among the LMs by maximizing the likelihood of a dev set
Decoding
Load the LMs onto different servers
The decoder remotely calls the servers to obtain the probabilities, then interpolates them on the fly
To save communication overhead, a cache is maintained (see the sketch below)
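A minimal sketch of the decoding-side client, with assumed interfaces (RemoteLmServer, logProb) rather than JOSHUA's actual remote API: each requested n-gram probability is fetched from every server, interpolated with the tuned weights, and cached so that repeated requests avoid a network round trip.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DistributedLmClientSketch {
    interface RemoteLmServer {
        double logProb(List<String> ngram);   // remote call to one LM server
    }

    private final List<RemoteLmServer> servers;
    private final double[] weights;                    // interpolation weights (tuned on dev)
    private final Map<String, Double> cache = new HashMap<>();

    DistributedLmClientSketch(List<RemoteLmServer> servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    double logProb(List<String> ngram) {
        String key = String.join(" ", ngram);
        Double cached = cache.get(key);
        if (cached != null) return cached;             // cache hit: no communication

        double prob = 0.0;
        for (int k = 0; k < servers.size(); k++) {     // interpolate on the fly
            prob += weights[k] * Math.exp(servers.get(k).logProb(ngram));
        }
        double result = Math.log(prob);
        cache.put(key, result);
        return result;
    }
}
```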
Chart-parsing
The decoding task is defined as chart parsing.
It maintains a chart, which contains an array of cells (or bins); a cell maintains a list of items.
The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved.
The hypotheses are stored in a structure called a hypergraph.
State of an item: source span, left-side nonterminal symbol, and left/right LM states
Decoding complexity: since each item records up to m-1 boundary words on each side, the number of distinct items, and hence the decoding cost, grows sharply with the LM order m.
Hypergraph
A hypergraph consists of a set of nodes and hyperedges; in parsing, they correspond to items and deductive steps, respectively.
Roughly, a hyperedge can be thought of as a rule with pointers.
State of an item: source span, left-side nonterminal symbol, and left/right LM states
A hypergraph is a compact way to represent an exponential number of derivation trees.
[Figure: the example hypergraph for 垫子0 上1 的2 猫3 again, showing items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | a cat | the mat, X | 0, 4 | the mat | a cat, and the goal item S, with hyperedges labeled by rules such as X (垫子 上, the mat), X (猫, a cat), X (X0 的 X1, X0 X1), X (X0 的 X1, X0 ’s X1), X (X0 的 X1, X1 of X0), X (X0 的 X1, X1 on X0), and (X0, X0).]