
1 A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance. Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

2 JOSHUA: a scalable open-source parsing-based MT decoder. Written in Java. Features: chart parsing, beam and cube pruning (Chiang, 2007), k-best extraction over a hypergraph, m-gram LM integration, parallel decoding, distributed LM (Zhang et al., 2006; Brants et al., 2007), and (new!) equivalent LM state maintenance. We plan to add more functions soon.

3 Chart-parsing. Grammar formalism: synchronous context-free grammar (SCFG). Decoding is bottom-up chart parsing: the decoder maintains a chart, which contains an array of cells (bins), and each cell maintains a list of items. The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved. The hypotheses are stored in a hypergraph. A sketch of the chart structure follows.
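The following is a minimal Java sketch of that chart structure; the class and field names (Chart, Cell, Item) are illustrative, not the actual JOSHUA classes.

```java
import java.util.ArrayList;
import java.util.List;

// One item proved by the parser: a nonterminal over a source span.
class Item {
    String lhs;   // left-hand-side nonterminal, e.g. "X" or "S"
    int i, j;     // source span [i, j) covered by this item
    // The left/right LM state words discussed later would also live here.

    Item(String lhs, int i, int j) { this.lhs = lhs; this.i = i; this.j = j; }
}

// A cell (bin) holds the list of items proved for one span.
class Cell {
    List<Item> items = new ArrayList<>();
}

// The chart is an array of cells, one per source span.
class Chart {
    Cell[][] cells;   // cells[i][j] covers source words i .. j-1

    Chart(int sentenceLength) {
        cells = new Cell[sentenceLength][sentenceLength + 1];
        for (int i = 0; i < sentenceLength; i++)
            for (int j = i + 1; j <= sentenceLength; j++)
                cells[i][j] = new Cell();
    }
}
```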

4 Hypergraph. The figure shows the hypergraph built for the source sentence 垫子0 上1 的2 猫3 ("the cat on the mat"). Nodes are items such as X | 3, 4 | a cat | NA, X | 0, 2 | the mat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat; hyperedges apply rules such as X → (猫, a cat), X → (垫子 上, the mat), X → (X0 的 X1, X0 X1), X → (X0 的 X1, X0 's X1), X → (X0 的 X1, X1 of X0), and X → (X0 的 X1, X1 on X0), with S → (X0, X0) producing the goal item.
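Below is a minimal Java sketch of how such nodes (items) and hyperedges could be represented; all names are hypothetical and not taken from the JOSHUA source.

```java
import java.util.List;

// A node in the hypergraph: one item, identified by its span and LM state.
class HGNode {
    int i, j;                       // source span, e.g. (0, 4)
    String leftLMState;             // e.g. "a cat"
    String rightLMState;            // e.g. "the mat", or null (NA) for short spans
    List<HyperEdge> incomingEdges;  // alternative ways of deriving this item
}

// A hyperedge: one application of a rule to antecedent items.
class HyperEdge {
    String rule;                    // e.g. "X -> (X0 的 X1, X1 on X0)"
    List<HGNode> antecedents;       // the items plugged into X0, X1, ...
}
```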

5 Hypergraph and Trees. The figure unpacks the hypergraph for 垫子0 上1 的2 猫3 into its four derivation trees, one per choice of hyperedge at the top X node: "the mat a cat" via X → (X0 的 X1, X0 X1), "the mat 's a cat" via X → (X0 的 X1, X0 's X1), "a cat on the mat" via X → (X0 的 X1, X1 on X0), and "a cat of the mat" via X → (X0 的 X1, X1 of X0). Every tree also uses X → (猫, a cat), X → (垫子 上, the mat), and the goal rule S → (X0, X0).

6 How to Integrate an m-gram LM? The figure steps through decoding 奥运会0 将1 在2 中国3 的4 北京5 举行。6 into "the olympic game will be held in beijing of china ." using rules such as X → (奥运会, the olympic game), X → (中国, china), X → (北京, beijing), X → (X0 的 X1, X1 of X0), X → (将 在 X0 举行。, will be held in X0 .), S → (X0, X0), S → (S0 X1, S0 X1), and S → (S0, S0). Each item records its left and right LM state words, e.g. X | 3, 6 | beijing of | of china, X | 1, 7 | will be | china ., and S | 0, 7 | the olympic | china . Three functions are needed: accumulate probability, estimate future cost, and state extraction. For example, combining the scores already accumulated (0.4 and 0.2) with the probability of the newly completed 3-gram "beijing of china" (0.5) gives an accumulated probability of 0.04 = 0.4 * 0.2 * 0.5; applying X → (将 在 X0 举行。, will be held in X0 .) completes further new 3-grams such as "will be held", "be held in", "held in beijing", and "in beijing of". The future probability of the left state, P(beijing of) = 0.01, gives an estimated total probability of 0.01 * 0.04 = 0.0004.
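A small Java sketch of the three operations, using the example numbers above; the split of 0.4 and 0.2 between the antecedent item and the rule is an assumption made for illustration, as are all class and variable names.

```java
public class LMIntegrationExample {
    public static void main(String[] args) {
        double antecedentProb = 0.4;   // score already carried by an antecedent item (assumed)
        double ruleProb       = 0.2;   // score of the applied rule (assumed)
        double newTrigramProb = 0.5;   // P(china | beijing of), the newly completed 3-gram

        // 1) Accumulate probability: multiply in the newly completed m-gram score.
        double accumulated = antecedentProb * ruleProb * newTrigramProb;   // 0.04

        // 2) Estimate future cost: the left LM state "beijing of" still lacks left
        //    context, so a lower-order estimate P(beijing of) is used for pruning only.
        double futureProb = 0.01;
        double estimatedTotal = futureProb * accumulated;                  // 0.0004

        // 3) State extraction: keep only the boundary words as the item's LM state.
        String leftState = "beijing of", rightState = "of china";

        System.out.printf("accumulated=%.2f estimated=%.4f state=[%s | %s]%n",
                accumulated, estimatedTotal, leftState, rightState);
    }
}
```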

7 Equivalent State Maintenance: overview. In a straightforward implementation, different LM state words lead to different items: applying rules such as X → (在 X0 的 X1 下, below X1 of X0), X → (在 X0 的 X1 下, under X1 of X0), X → (在 X0 的 X1 下, below the X1 of X0), and X → (在 X0 的 X1 下, under the X1 of X0) yields separate items like X | 0, 3 | below cat | some rat, X | 0, 3 | below cat | many rat, X | 0, 3 | below cats | many rat, and X | 0, 3 | under cat | some rat. We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard, e.g. X | 0, 3 | below * | * rat. By merging items, we can explore a larger hypothesis space using less time. We only merge items when the length l of the English span satisfies l ≥ m-1.
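A minimal Java sketch of the merging step, keyed on the span plus the wildcarded LM states; the class EquivalentStateTable and its method are hypothetical, not the JOSHUA implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Items whose spans and wildcarded LM states coincide are collapsed into one.
class EquivalentStateTable<ItemT> {
    private final Map<String, ItemT> merged = new HashMap<>();

    /** Returns the existing equivalent item if one is already stored, else registers this one. */
    ItemT addOrMerge(int i, int j, String[] leftState, String[] rightState, ItemT item) {
        // e.g. "0|3|below *|* rat" for the merged item on this slide
        String signature = i + "|" + j + "|"
                + String.join(" ", leftState) + "|" + String.join(" ", rightState);
        return merged.computeIfAbsent(signature, key -> item);
    }
}
```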

8 Back-off Parameterization of m-gram LMs. LM probability computation follows the standard back-off recursion (equation shown on the slide): if the m-gram is listed, its probability is used directly; otherwise the model backs off to the (m-1)-gram, multiplying in the back-off weight β of the history. Observations: a larger m leads to more back-off; the default back-off weight is 1, i.e. for an m-gram not listed, β(.) = 1. Example ARPA entries (log10 probability, m-gram, optional back-off weight):
-4.250922 party files
-4.741889 party filled
-4.250922 party finance -0.1434139
-4.741889 party financed
-4.741889 party finances -0.2361806
-4.741889 party financially
-3.33127 party financing -0.1119054
-3.277455 party finished -0.4362795
-4.012205 party fired
-4.741889 party fires
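The back-off computation can be written as a small recursion. The sketch below uses hypothetical method names (listedProb, backoffWeight) rather than any real LM API, and works with plain probabilities even though ARPA files store log10 values.

```java
import java.util.List;

abstract class BackoffLM {
    /** P(m-gram) if the m-gram is explicitly listed in the model, else null. */
    abstract Double listedProb(List<String> ngram);

    /** beta(history) if the history is listed with a back-off weight, else 1.0 (the default). */
    abstract double backoffWeight(List<String> history);

    /** P(last word of ngram | preceding words), with back-off. */
    double prob(List<String> ngram) {
        Double p = listedProb(ngram);
        if (p != null) return p;                     // m-gram listed: use it directly
        if (ngram.size() == 1) return 0.0;           // unseen unigram (simplified)
        List<String> history = ngram.subList(0, ngram.size() - 1);
        List<String> shorter = ngram.subList(1, ngram.size());
        // Not listed: back off to the (m-1)-gram, scaled by the history's back-off weight.
        return backoffWeight(history) * prob(shorter);
    }
}
```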

9 Equivalent State Maintenance: right-side state words. The right-side state is the last m-1 words of the item's translation, e.g. e_{l-2} e_{l-1} e_l for a 4-gram LM, and the future words are e_{l+1} e_{l+2} e_{l+3} ... If IS-A-PREFIX(e_{l-2} e_{l-1} e_l) = no, the state is equivalent to * e_{l-1} e_l: for a 4-gram LM, P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l), because the back-off weight is one, so the probability is independent of e_{l-2}. Note that IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no, so the reduction can continue: * e_{l-1} e_l becomes * * e_l, and * * e_l becomes * * * when IS-A-PREFIX(e_l) = no. Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
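A Java sketch of the right-side simplification, assuming a hypothetical isAPrefix test over the LM's listed m-grams; it scans the state from the oldest word outward, mirroring the table above.

```java
import java.util.Arrays;

abstract class RightStateSimplifier {
    /** True if some listed m-gram in the LM starts with these words. */
    abstract boolean isAPrefix(String[] words);

    /** Replace right-state words by "*" while no listed m-gram extends them. */
    String[] simplifyRightState(String[] state) {        // state = last m-1 words
        String[] result = state.clone();
        for (int start = 0; start < state.length; start++) {
            String[] suffix = Arrays.copyOfRange(state, start, state.length);
            if (isAPrefix(suffix)) break;                 // this word can still matter
            // No listed m-gram begins with this suffix, so every future m-gram
            // will back off with weight 1 and the word at `start` can be ignored.
            result[start] = "*";
        }
        return result;
    }
}
```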

10 Equivalent State Maintenance: left-side state words. The left-side state is the first m-1 words, e.g. e_1 e_2 e_3 for a 4-gram LM, and the future words are the left context ... e_{-2} e_{-1} e_0 that arrives later. If IS-A-SUFFIX(e_1 e_2 e_3) = no, the state is equivalent to e_1 e_2 *: for a 4-gram LM, P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) β(e_0 e_1 e_2), so the probability of e_3 can be finalized now and the remaining back-off weight is independent of e_3 (remember to factor in back-off weights later). Likewise, IS-A-SUFFIX(e_1 e_2) = no reduces the state to e_1 * *, since P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) β(e_0 e_1) β(e_{-1} e_0 e_1), and IS-A-SUFFIX(e_1) = no reduces it to * * *, with finalized probability P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) β(e_0) β(e_{-1} e_0) β(e_{-2} e_{-1} e_0). Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
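The symmetric sketch for the left-side state, again with a hypothetical isASuffix test; it scans from the newest state word inward, finalizing each word's probability as soon as additional left context can no longer change it (up to back-off weights factored in later).

```java
import java.util.Arrays;

abstract class LeftStateSimplifier {
    /** True if some listed m-gram in the LM ends with these words. */
    abstract boolean isASuffix(String[] words);

    /** Replace left-state words by "*" once their probability can be finalized. */
    String[] simplifyLeftState(String[] state) {          // state = first m-1 words
        String[] result = state.clone();
        for (int end = state.length; end > 0; end--) {
            String[] prefix = Arrays.copyOfRange(state, 0, end);
            if (isASuffix(prefix)) break;                  // this word still needs more left context
            // No listed m-gram ends in this prefix, so the word at position end-1
            // can be scored now; only back-off weights of future histories remain.
            result[end - 1] = "*";
        }
        return result;
    }
}
```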

11 Equivalent State Maintenance: summary. The slide contrasts the original and modified cost functions for the three operations of LM integration: finalized probability, estimated probability, and state extraction (equations shown on the slide).

12 Experimental Results: Decoding Speed. System training: the task is Chinese-to-English translation; sub-sampling a bitext of about 3M sentence pairs yields 570k sentence pairs; the LM training data is Gigaword plus the English side of the bitext. Decoding setup: 3M rules and 49M m-grams. The decoder is 38 times faster than the baseline!

13 Experimental Results: Distributed LM. With a distributed language model of eight 7-gram LMs, the decoding speed is 12.2 sec/sent.

14 Experimental Results: Equivalent LM States. The plots show search effort versus search quality with and without equivalent LM state maintenance, for a sparse LM (a 7-gram LM built on about 19M words) and a dense LM (a 3-gram LM built on about 130M words). The equivalent LM state maintenance is slower than the regular method here: back-off happens less frequently, and the suffix/prefix information lookup is inefficient.

15 Summary. We describe a scalable parsing-based MT decoder. The decoder has been successfully used for decoding millions of sentences in a large-scale discriminative training task. We propose a method to maintain equivalent LM states. The decoder is available at http://www.cs.jhu.edu/~zfli/

16 Acknowledgements. Thanks to Philip Resnik for letting me use the UMD Python decoder, to the UMD MT group members for very helpful discussions, and to David Chiang for Hiero and his original implementation in Python.

17 Thank you!

18

19 Grammar Formalism. A synchronous context-free grammar (SCFG) consists of Ts, a set of source-language terminal symbols; Tt, a set of target-language terminal symbols; N, a shared set of nonterminal symbols; and a set of rules of the form shown on the slide, each rewriting a nonterminal into a source string over Ts and N together with a target string over Tt and N, with the nonterminals co-indexed. A typical rule looks like X → (X0 的 X1, X1 of X0).
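As a concrete picture of the formalism, here is a minimal Java sketch of one SCFG rule; the field names are illustrative only.

```java
// One SCFG rule: a nonterminal rewrites into a source string and a target
// string whose nonterminals (X0, X1, ...) are co-indexed across the two sides.
class SCFGRule {
    String lhs;             // left-hand-side nonterminal from N, e.g. "X"
    String[] sourceSide;    // e.g. {"X0", "的", "X1"}  (terminals from Ts plus nonterminals)
    String[] targetSide;    // e.g. {"X1", "of", "X0"}  (terminals from Tt plus the same nonterminals)
    double weight;          // rule score used by the decoder

    SCFGRule(String lhs, String[] src, String[] tgt, double weight) {
        this.lhs = lhs; this.sourceSide = src; this.targetSide = tgt; this.weight = weight;
    }
}
```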

20 Chart-parsing. Grammar formalism: synchronous context-free grammar (SCFG). The decoding task is defined as finding the best derivation of the source sentence under the grammar (equation shown on the slide). Chart parsing maintains a chart, which contains an array of cells or bins; a cell maintains a list of items. The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved. The hypotheses are stored in a structure called a hypergraph.

21 m-gram LM Integration. Three functions: accumulate probability, estimate future cost, and state extraction. The cost function combines the finalized probability, the estimated probability, and the extracted state (equations shown on the slide).

22 Parallel and Distributed Decoding. Parallel decoding: divide the test set into multiple parts; each part is decoded by a separate thread; the threads share the language/translation models in memory. Distributed language model (DLM). Training: divide the corpora into multiple parts, train a LM on each part, and find the optimal weights among the LMs by maximizing the likelihood of a dev set. Decoding: load the LMs into different servers; the decoder remotely calls the servers to obtain the probabilities and then interpolates the probabilities on the fly; to save communication overhead, a cache is maintained.
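A sketch of the distributed-LM lookup just described, with hypothetical interfaces for the remote servers; the interpolation weights, the on-the-fly mixing, and the cache follow the bullet points above, but this is not the actual JOSHUA code.

```java
import java.util.HashMap;
import java.util.Map;

interface RemoteLMServer {
    double prob(String ngram);          // remote call returning P(ngram) from one sub-LM
}

class DistributedLM {
    private final RemoteLMServer[] servers;
    private final double[] weights;     // interpolation weights tuned on a dev set
    private final Map<String, Double> cache = new HashMap<>();

    DistributedLM(RemoteLMServer[] servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    double prob(String ngram) {
        Double cached = cache.get(ngram);
        if (cached != null) return cached;              // avoid a remote round trip
        double p = 0.0;
        for (int k = 0; k < servers.length; k++)
            p += weights[k] * servers[k].prob(ngram);   // interpolate on the fly
        cache.put(ngram, p);
        return p;
    }
}
```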

23 Chart-parsing. The decoding task is defined as finding the best derivation of the source sentence (equation shown on the slide). Chart parsing maintains a chart, which contains an array of cells or bins; a cell maintains a list of items. The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved. The hypotheses are stored in a structure called a hypergraph. State of an item: source span, left-side nonterminal symbol, and left/right LM state. Decoding complexity: given on the slide as an equation.

24 Hypergraph. A hypergraph consists of a set of nodes and hyperedges; in parsing they correspond to items and deductive steps, respectively. Roughly, a hyperedge can be thought of as a rule with pointers. The state of an item is its source span, left-side nonterminal symbol, and left/right LM state. The figure repeats the earlier hypergraph for 垫子0 上1 的2 猫3, with items such as X | 3, 4 | a cat | NA, X | 0, 2 | the mat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat, hyperedges labeled by rules such as X → (猫, a cat), X → (垫子 上, the mat), X → (X0 的 X1, X0 X1), X → (X0 的 X1, X0 's X1), X → (X0 的 X1, X1 of X0), and X → (X0 的 X1, X1 on X0), and the goal item from S → (X0, X0).

