A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University




Similar presentations
Statistical Machine Translation

Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.
Fast Algorithms For Hierarchical Range Histogram Constructions
Grammars, constituency and order A grammar describes the legal strings of a language in terms of constituency and order. For example, a grammar for a fragment.
May 2006CLINT-LN Parsing1 Computational Linguistics Introduction Approaches to Parsing.
Measuring the Influence of Long Range Dependencies with Neural Network Language Models Le Hai Son, Alexandre Allauzen, Franc¸ois Yvon Univ. Paris-Sud and.
Chapter Chapter Summary Languages and Grammars Finite-State Machines with Output Finite-State Machines with No Output Language Recognition Turing.
Lattices Segmentation and Minimum Bayes Risk Discriminative Training for Large Vocabulary Continuous Speech Recognition Vlasios Doumpiotis, William Byrne.
CPSC Compiler Tutorial 9 Review of Compiler.
Parsing with PCFG Ling 571 Fei Xia Week 3: 10/11-10/13/05.
1 A Tree Sequence Alignment- based Tree-to-Tree Translation Model Authors: Min Zhang, Hongfei Jiang, Aiti Aw, et al. Reporter: 江欣倩 Professor: 陳嘉平.
PZ02A - Language translation
Context-Free Grammars Lecture 7
Statistical Phrase-Based Translation Authors: Koehn, Och, Marcu Presented by Albert Bertram Titles, charts, graphs, figures and tables were extracted from.
ISBN Chapter 4 Lexical and Syntax Analysis The Parsing Problem Recursive-Descent Parsing.
1 Reverse of a Regular Language. 2 Theorem: The reverse of a regular language is a regular language Proof idea: Construct NFA that accepts : invert the.
1 Simplifications of Context-Free Grammars. 2 A Substitution Rule Substitute Equivalent grammar.
Chapter 3: Formal Translation Models
Parsing SLP Chapter 13. 7/2/2015 Speech and Language Processing - Jurafsky and Martin 2 Outline  Parsing with CFGs  Bottom-up, top-down  CKY parsing.
Context-Free Grammar CSCI-GA.2590 – Lecture 3 Ralph Grishman NYU.
1.3 Executing Programs. How is Computer Code Transformed into an Executable? Interpreters Compilers Hybrid systems.
1 Introduction to Parsing Lecture 5. 2 Outline Regular languages revisited Parser overview Context-free grammars (CFG’s) Derivations.
Syntax & Semantic Introduction Organization of Language Description Abstract Syntax Formal Syntax The Way of Writing Grammars Formal Semantic.
Parallel Applications Parallel Hardware Parallel Software IT industry (Silicon Valley) Users Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University.
Efficient Minimal Perfect Hash Language Models David Guthrie, Mark Hepple, Wei Liu University of Sheffield.
1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department.
Statistical NLP: Lecture 8 Statistical Inference: n-gram Models over Sparse Data (Ch 6)
CS 326 Programming Languages, Concepts and Implementation Instructor: Mircea Nicolescu Lecture 2.
Some Probability Theory and Computational models A short overview.
2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.
Scalable Inference and Training of Context- Rich Syntactic Translation Models Michel Galley, Jonathan Graehl, Keven Knight, Daniel Marcu, Steve DeNeefe.
An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,
NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
11 Chapter 14 Part 1 Statistical Parsing Based on slides by Ray Mooney.
Bernd Fischer RW713: Compiler and Software Language Engineering.
Efficient Language Model Look-ahead Probabilities Generation Using Lower Order LM Look-ahead Information Langzhou Chen and K. K. Chin Toshiba Research.
Coarse-to-Fine Efficient Viterbi Parsing Nathan Bodenstab OGI RPE Presentation May 8, 2006.
What’s in a translation rule? Paper by Galley, Hopkins, Knight & Marcu Presentation By: Behrang Mohit.
Introduction to Parsing
Topic #1: Introduction EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.
Chapter 3 Part II Describing Syntax and Semantics.
Programming Languages and Design Lecture 3 Semantic Specifications of Programming Languages Instructor: Li Ma Department of Computer Science Texas Southern.
Basic Parsing Algorithms: Earley Parser and Left Corner Parsing
Introduction to Compiling
A non-contiguous Tree Sequence Alignment-based Model for Statistical Machine Translation Jun Sun ┼, Min Zhang ╪, Chew Lim Tan ┼ ┼╪
Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University of Seoul) Chao-Yue Lai (UC Berkeley) Slav Petrov (Google Research) Kurt Keutzer (UC Berkeley)
The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech Frank Seide IEEE Transactions on Speech and Audio Processing 2005.
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
Natural Language Processing Statistical Inference: n-grams
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen Department of Computer Science University of Texas-Pan American.
A Syntax-Driven Bracketing Model for Phrase-Based Translation Deyi Xiong, et al. ACL 2009.
Maximum Entropy techniques for exploiting syntactic, semantic and collocational dependencies in Language Modeling Sanjeev Khudanpur, Jun Wu Center for.
The estimation of stochastic context-free grammars using the Inside-Outside algorithm Oh-Woog Kwon KLE Lab. CSE POSTECH.
LING 575 Lecture 5 Kristina Toutanova MSR & UW April 27, 2010 With materials borrowed from Philip Koehn, Chris Quirk, David Chiang, Dekai Wu, Aria Haghighi.
1 Minimum Bayes-risk Methods in Automatic Speech Recognition Vaibhava Geol And William Byrne IBM ; Johns Hopkins University 2003 by CRC Press LLC 2005/4/26.
Statistical Machine Translation Part II: Word Alignments and EM
End-To-End Memory Networks
Statistical NLP Winter 2009
Parsing in Multiple Languages
Simplifications of Context-Free Grammars
DHT Routing Geometries and Chord
Zhifei Li and Sanjeev Khudanpur Johns Hopkins University
A Path-based Transfer Model for Machine Translation
Rule Markov Models for Fast Tree-to-String Translation
Neural Machine Translation by Jointly Learning to Align and Translate
Presentation transcript:

A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

JOSHUA: a scalable open-source parsing-based MT decoder
Written in Java
Chart parsing
Beam and cube pruning (Chiang, 2007)
K-best extraction over a hypergraph
m-gram LM integration
Parallel decoding
Distributed LM (Zhang et al., 2006; Brants et al., 2007)
Equivalent LM state maintenance (new)
We plan to add more functions soon.

Chart parsing
Grammar formalism: Synchronous Context-Free Grammar (SCFG).
Chart parsing: bottom-up parsing. It maintains a chart, which contains an array of cells or bins; a cell maintains a list of items. The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved. The hypotheses are stored in a hypergraph.
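As an illustration of the chart layout just described, here is a minimal Java sketch over the running example 垫子 上 的 猫. All class and variable names are hypothetical, the two-rule grammar is hard-coded, and no pruning or LM is involved, so this is a sketch of the data flow rather than the actual decoder.

```java
import java.util.*;

/** A minimal sketch of the chart described above (hypothetical names, toy grammar). */
public class ChartSketch {

    /** An item: a proved constituent over a source span. */
    static final class Item {
        final String lhs;          // left-side nonterminal, e.g. "X"
        final int i, j;            // source span [i, j)
        final List<String> target; // target-side words produced so far
        Item(String lhs, int i, int j, List<String> target) {
            this.lhs = lhs; this.i = i; this.j = j; this.target = target;
        }
        @Override public String toString() {
            return lhs + " | " + i + "," + j + " | " + String.join(" ", target);
        }
    }

    public static void main(String[] args) {
        String[] source = {"垫子", "上", "的", "猫"};
        int n = source.length;

        // The chart: one cell (bin) per source span; each cell holds a list of items.
        @SuppressWarnings("unchecked")
        List<Item>[][] chart = new List[n + 1][n + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= n; j++) chart[i][j] = new ArrayList<>();

        // Axioms: phrasal rules matched directly against the source (toy grammar).
        chart[0][2].add(new Item("X", 0, 2, List.of("the", "mat"))); // X -> <垫子 上, the mat>
        chart[3][4].add(new Item("X", 3, 4, List.of("a", "cat")));   // X -> <猫, a cat>

        // Bottom-up: prove larger items from smaller ones with X -> <X0 的 X1, X1 on X0>.
        for (int span = 2; span <= n; span++)
            for (int i = 0; i + span <= n; i++) {
                int j = i + span;
                for (int k = i + 1; k < j - 1; k++)
                    if ("的".equals(source[k]))                      // the terminal of the rule
                        for (Item left : chart[i][k])
                            for (Item right : chart[k + 1][j]) {
                                List<String> t = new ArrayList<>(right.target);
                                t.add("on");
                                t.addAll(left.target);
                                chart[i][j].add(new Item("X", i, j, t));
                            }
            }

        // Items covering the whole span are glued by S -> <X0, X0> into the goal item.
        for (Item item : chart[0][n]) System.out.println("goal: " + item);
    }
}
```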

Hypergraph
(Figure: the hypergraph built for the source 垫子0 上1 的2 猫3. Its items include X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat, connected by hyperedges for rules such as X → (猫, a cat), X → (垫子 上, the mat), X → (X0 的 X1, X0 X1), X → (X0 的 X1, X0 's X1), X → (X0 的 X1, X1 of X0), X → (X0 的 X1, X1 on X0), and the goal item produced by S → (X0, X0).)
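The hypergraph itself can be represented with two small classes, sketched below with hypothetical names (not Joshua's actual code): a node stores an item, and a hyperedge stores the applied rule plus pointers to its antecedent items.

```java
import java.util.*;

/** A sketch of the hypergraph structure described above (hypothetical names). */
public class HypergraphSketch {
    /** A node is an item: nonterminal, source span, and left/right LM state words. */
    static final class Node {
        String lhs; int i, j;
        List<String> leftState, rightState;           // boundary words kept for LM scoring
        List<Hyperedge> incoming = new ArrayList<>(); // alternative derivations of this item
    }
    /** A hyperedge is one deductive step: a rule plus pointers to its antecedent items. */
    static final class Hyperedge {
        String rule;
        List<Node> antecedents;
        double score;
    }

    public static void main(String[] args) {
        Node theMat = new Node(); theMat.lhs = "X"; theMat.i = 0; theMat.j = 2;
        Node aCat   = new Node(); aCat.lhs = "X";   aCat.i = 3;   aCat.j = 4;
        Node top    = new Node(); top.lhs = "X";    top.i = 0;    top.j = 4;

        Hyperedge e = new Hyperedge();
        e.rule = "X -> <X0 的 X1, X1 on X0>";
        e.antecedents = List.of(theMat, aCat);
        top.incoming.add(e);  // the same node could also get a hyperedge for "X1 of X0", etc.

        System.out.println(top.lhs + " | " + top.i + "," + top.j
                + " has " + top.incoming.size() + " incoming hyperedge(s)");
    }
}
```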

Hypergraph and Trees
(Figure: four derivation trees extracted from the hypergraph for 垫子0 上1 的2 猫3. Each combines X → (猫, a cat) and X → (垫子 上, the mat) under a different rule for 的 and the glue rule S → (X0, X0): X → (X0 的 X1, X0 X1) yields "the mat a cat", X → (X0 的 X1, X0 's X1) yields "the mat 's a cat", X → (X0 的 X1, X1 on X0) yields "a cat on the mat", and X → (X0 的 X1, X1 of X0) yields "a cat of the mat".)

How to Integrate an m-gram LM?
Three functions: accumulate probability, estimate future cost, state extraction.
Example (figure): the source 奥运会0 将1 在2 中国3 的4 北京5 举行。6 is translated as "the olympic game will be held in beijing of china ." using rules such as X → (中国, china), X → (北京, beijing), X → (奥运会, the olympic game), X → (X0 的 X1, X1 of X0), X → (将 在 X0 举行。, will be held in X0 .), S → (X0, X0), and S → (S0 X1, S0 X1); the chart items include X | 3, 4 | china | NA, X | 5, 6 | beijing | NA, X | 0, 1 | the olympic | olympic game, X | 3, 6 | beijing of | of china, X | 1, 7 | will be | china ., S | 0, 7 | the olympic | china ., and S | 0, 7 | the | .
Accumulate probability: combining items creates new m-grams, e.g., the new 3-gram "beijing of china" when building X | 3, 6 (0.04 = 0.4 * 0.2 * ), and later the new 3-grams "will be held", "be held in", "held in beijing", "in beijing of" when building X | 1, 7.
Estimate future cost: the left-state words are scored with the context available so far, e.g., P(beijing of) = 0.01, giving an estimated total probability of 0.01 * 0.04 = 0.004.
State extraction: each new item keeps only its boundary words as LM state, e.g., "beijing of" and "of china" for X | 3, 6.
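To make the three functions concrete, here is a minimal Java sketch. It is not Joshua's actual code: the NgramLm interface and all names are hypothetical, and it operates on full word lists rather than compact LM states for readability. It accumulates the probabilities of the m-grams created at the seam of two combined hypotheses, estimates the future cost of the left-state words, and extracts the boundary-word state.

```java
import java.util.*;

public class LmIntegrationSketch {

    /** Assumed LM interface (hypothetical): log P(last word | preceding words). */
    interface NgramLm {
        double logProb(List<String> ngram);
    }

    /** Accumulate probability: score only the new m-grams created at the seam. */
    static double seamLogProb(List<String> left, List<String> right, int m, NgramLm lm) {
        List<String> words = new ArrayList<>(left);
        words.addAll(right);
        double logp = 0.0;
        // New m-grams end on one of the first m-1 words of `right` and start inside `left`.
        for (int end = left.size() + 1; end <= Math.min(words.size(), left.size() + m - 1); end++) {
            int start = Math.max(0, end - m);
            logp += lm.logProb(words.subList(start, end));
        }
        return logp;
    }

    /** Estimate future cost: score the left-state words with whatever context exists so far. */
    static double estimateLogProb(List<String> leftState, NgramLm lm) {
        double logp = 0.0;
        for (int end = 1; end <= leftState.size(); end++)
            logp += lm.logProb(leftState.subList(0, end));  // e.g. P(beijing) * P(of | beijing)
        return logp;
    }

    /** State extraction: keep only the first and last m-1 words of the combined string. */
    static List<List<String>> extractState(List<String> words, int m) {
        int k = m - 1;
        List<String> leftState  = new ArrayList<>(words.subList(0, Math.min(k, words.size())));
        List<String> rightState = new ArrayList<>(words.subList(Math.max(0, words.size() - k), words.size()));
        return List.of(leftState, rightState);
    }

    public static void main(String[] args) {
        NgramLm lm = ngram -> -1.0;  // dummy LM: every n-gram gets log-prob -1
        System.out.println("seam log-prob = "
                + seamLogProb(List.of("will", "be"), List.of("held", "in", "beijing"), 3, lm));
        System.out.println("state = "
                + extractState(List.of("will", "be", "held", "in", "beijing"), 3));
    }
}
```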

Equivalent State Maintenance: overview
In a straightforward implementation, different LM state words lead to different items: rules such as X → (在 X0 的 X1 下, under X1 of X0), X → (在 X0 的 X1 下, below X1 of X0), X → (在 X0 的 X1 下, under the X1 of X0), and X → (在 X0 的 X1 下, below the X1 of X0) produce items such as X | 0, 3 | under cat | some rat, X | 0, 3 | below cat | some rat, X | 0, 3 | below cat | many rat, and X | 0, 3 | below cats | many rat.
We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard, e.g., X | 0, 3 | below * | * rat.
By merging items, we can explore a larger hypothesis space in less time. We only merge items when the length l of the English span satisfies l ≥ m − 1.
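A minimal sketch of how such merging might be keyed (hypothetical names, not Joshua's actual code): items are binned by a signature in which the ignorable LM state words have already been replaced by the wildcard, so equivalent items collapse into a single chart entry.

```java
import java.util.*;

public class ItemMergingSketch {
    /** Items sharing this signature live in the same chart entry (hypothetical layout). */
    static String signature(String lhs, int i, int j,
                            List<String> leftState, List<String> rightState) {
        return lhs + " | " + i + "," + j + " | "
                + String.join(" ", leftState) + " | " + String.join(" ", rightState);
    }

    public static void main(String[] args) {
        Map<String, List<String>> bins = new HashMap<>();
        // After the equivalence procedure has replaced ignorable state words with "*",
        // both derivations below map to the same key and are merged into one item.
        String key = signature("X", 0, 3, List.of("below", "*"), List.of("*", "rat"));
        bins.computeIfAbsent(key, k -> new ArrayList<>()).add("derivation via 'below X1 of X0'");
        bins.computeIfAbsent(key, k -> new ArrayList<>()).add("derivation via 'below the X1 of X0'");
        System.out.println(bins);  // one merged item holding two derivations
    }
}
```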

Back-off Parameterization of m-gram LMs
LM probability computation: P(e_l | e_{l-m+1} ... e_{l-1}) is looked up directly if the m-gram is listed in the LM; otherwise it backs off, P(e_l | e_{l-m+1} ... e_{l-1}) = β(e_{l-m+1} ... e_{l-1}) · P(e_l | e_{l-m+2} ... e_{l-1}).
Observations: a larger m leads to more backoff; the default backoff weight is 1, i.e., for an m-gram that is not listed, β(·) = 1.
(LM excerpt from the figure: party files, party filled, party finance, party financed, party finances, party financially, party financing, party finished, party fired, party fires.)
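The back-off computation can be sketched as a small recursive lookup. The tables below hold made-up toy values and the names are hypothetical; the point is only that an unlisted context contributes a back-off weight that defaults to 1.

```java
import java.util.*;

public class BackoffLmSketch {
    static final Map<String, Double> prob = new HashMap<>();  // listed n-grams
    static final Map<String, Double> beta = new HashMap<>();  // back-off weights of listed contexts

    static double p(List<String> ngram) {
        String key = String.join(" ", ngram);
        if (prob.containsKey(key)) return prob.get(key);      // the m-gram is listed
        if (ngram.size() == 1) return 1e-7;                   // toy floor for unseen unigrams
        String context = String.join(" ", ngram.subList(0, ngram.size() - 1));
        double b = beta.getOrDefault(context, 1.0);           // default back-off weight is 1
        return b * p(ngram.subList(1, ngram.size()));         // back off to the shorter n-gram
    }

    public static void main(String[] args) {
        prob.put("cat", 0.01); prob.put("the cat", 0.2); beta.put("the", 0.5);
        // "a the cat" is unlisted, so beta("a the") defaults to 1 and we back off to P(cat | the).
        System.out.println(p(List.of("a", "the", "cat")));    // prints 0.2
    }
}
```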

Equivalent State Maintenance: right-side state words
The right-side state is simplified word by word from the left (for a 4-gram LM the state is e_{l-2} e_{l-1} e_l, and the future words are e_{l+1} e_{l+2} e_{l+3} ...):
If IS-A-PREFIX(e_{l-2} e_{l-1} e_l) = no, the state e_{l-2} e_{l-1} e_l is equivalent to * e_{l-1} e_l.
If IS-A-PREFIX(e_{l-1} e_l) = no, the state * e_{l-1} e_l is equivalent to * * e_l.
If IS-A-PREFIX(e_l) = no, the state * * e_l is equivalent to * * *.
Why this is safe: IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no, so for a 4-gram LM P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l), because the backoff weight is one; the result is independent of e_{l-2}.
Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
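A sketch of the right-side simplification under these assumptions (IS-A-PREFIX is passed in as a predicate; all names are hypothetical, not Joshua's API):

```java
import java.util.*;
import java.util.function.Predicate;

public class RightStateSketch {
    /**
     * Replace right-state words with "*" from left to right while the remaining words
     * are not the prefix of any listed n-gram: future probabilities then cannot depend
     * on the dropped words, and the relevant back-off weights default to 1.
     */
    static List<String> simplifyRightState(List<String> state, Predicate<List<String>> isAPrefix) {
        List<String> s = new ArrayList<>(state);
        for (int k = 0; k < s.size(); k++) {
            List<String> remaining = new ArrayList<>(s.subList(k, s.size()));
            if (isAPrefix.test(remaining)) break;  // these words are still needed for future words
            s.set(k, "*");                         // ignorable for all future LM queries
        }
        return s;
    }

    public static void main(String[] args) {
        // Toy IS-A-PREFIX stand-in: only sequences ending in "cat" start some listed n-gram.
        Predicate<List<String>> isAPrefix = seq -> seq.get(seq.size() - 1).equals("cat");
        System.out.println(simplifyRightState(List.of("some", "old", "rat"), isAPrefix)); // [*, *, *]
        System.out.println(simplifyRightState(List.of("a", "big", "cat"), isAPrefix));    // [a, big, cat]
    }
}
```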

Equivalent State Maintenance: left-side state words
The left-side state is simplified word by word from the right (for a 4-gram LM the state is e_1 e_2 e_3, and the still-unknown words to its left are ... e_{-2} e_{-1} e_0):
If IS-A-SUFFIX(e_1 e_2 e_3) = no, the state e_1 e_2 e_3 is equivalent to e_1 e_2 *. For a 4-gram LM, P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) β(e_0 e_1 e_2): the probability factor can be finalized now, and the remaining backoff weight is independent of e_3.
If IS-A-SUFFIX(e_1 e_2) = no, the state e_1 e_2 * is equivalent to e_1 * *, since P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) β(e_0 e_1) β(e_{-1} e_0 e_1).
If IS-A-SUFFIX(e_1) = no, the state e_1 * * is equivalent to * * *; the finalized probability is P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) β(e_0) β(e_{-1} e_0) β(e_{-2} e_{-1} e_0).
Remember to factor in the backoff weights later.
Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
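The mirror-image sketch for the left-side state (IS-A-SUFFIX is again a supplied predicate; names are hypothetical):

```java
import java.util.*;
import java.util.function.Predicate;

public class LeftStateSketch {
    /**
     * Replace left-state words with "*" from right to left while the remaining words
     * are not the suffix of any listed n-gram: those words' probabilities can be
     * finalized now, and only the back-off weights of the still-unknown left context
     * must be factored in later.
     */
    static List<String> simplifyLeftState(List<String> state, Predicate<List<String>> isASuffix) {
        List<String> s = new ArrayList<>(state);
        for (int k = s.size(); k >= 1; k--) {
            List<String> remaining = new ArrayList<>(s.subList(0, k));
            if (isASuffix.test(remaining)) break;  // the unknown left context still matters here
            s.set(k - 1, "*");                     // probability of this word is finalized now
        }
        return s;
    }

    public static void main(String[] args) {
        // Toy IS-A-SUFFIX stand-in: only sequences starting with "below" end some listed n-gram.
        Predicate<List<String>> isASuffix = seq -> seq.get(0).equals("below");
        System.out.println(simplifyLeftState(List.of("under", "the", "mat"), isASuffix)); // [*, *, *]
        System.out.println(simplifyLeftState(List.of("below", "the", "mat"), isASuffix)); // [below, the, mat]
    }
}
```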

Equivalent State Maintenance: summary
(Figure: the original cost function versus the modified cost function, covering the finalized probability, the estimated probability, and state extraction.)

Experimental Results: decoding speed
System training: the task is Chinese-to-English translation; sub-sampling a bitext of about 3M sentence pairs yields 570k sentence pairs; the LM training data are Gigaword and the English side of the bitext.
Decoding speed: number of rules: 3M; number of m-grams: 49M; 38 times faster than the baseline!

Experimental Results: distributed LM
Distributed language model: eight 7-gram LMs; decoding speed: 12.2 sec/sent.

Experimental Results: equivalent LM states
(Figure: search effort versus search quality with equivalent LM state maintenance.)
Sparse LM: a 7-gram LM built on about 19M words. Dense LM: a 3-gram LM built on about 130M words.
The equivalent LM state maintenance is slower than the regular method, because backoff happens less frequently and the suffix/prefix information lookup is inefficient.

Summary
We describe a scalable parsing-based MT decoder. The decoder has been successfully used to decode millions of sentences in a large-scale discriminative training task. We propose a method to maintain equivalent LM states. The decoder is available at

Acknowledgements
Thanks to Philip Resnik for letting me use the UMD Python decoder. Thanks to the UMD MT group members for very helpful discussions. Thanks to David Chiang for Hiero and his original implementation in Python.

Thank you!

Grammar Formalism
Synchronous Context-Free Grammar (SCFG):
T_s: a set of source-language terminal symbols
T_t: a set of target-language terminal symbols
N: a shared set of nonterminal symbols
A set of rules, each rewriting a nonterminal in N into a pair of strings, one over N and T_s and one over N and T_t, with a one-to-one correspondence between the nonterminals on the two sides. A typical rule looks like: X → (X0 的 X1, X1 of X0).
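For concreteness, an SCFG rule can be held in a small data object like the following sketch (field names are hypothetical; the indices on X0 and X1 encode the nonterminal correspondence between the two sides):

```java
import java.util.*;

public class ScfgRuleSketch {
    /** One SCFG rule: a nonterminal rewriting into a source string and a target string. */
    static final class Rule {
        final String lhs;                // nonterminal from the shared set N, e.g. "X"
        final List<String> sourceSide;   // over N and T_s, e.g. [X0, 的, X1]
        final List<String> targetSide;   // over N and T_t, e.g. [X1, of, X0]
        Rule(String lhs, List<String> sourceSide, List<String> targetSide) {
            this.lhs = lhs; this.sourceSide = sourceSide; this.targetSide = targetSide;
        }
        @Override public String toString() {
            return lhs + " -> <" + String.join(" ", sourceSide) + ", "
                                 + String.join(" ", targetSide) + ">";
        }
    }

    public static void main(String[] args) {
        Rule r = new Rule("X", List.of("X0", "的", "X1"), List.of("X1", "of", "X0"));
        System.out.println(r);  // X -> <X0 的 X1, X1 of X0>
    }
}
```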

Chart parsing
Grammar formalism: Synchronous Context-Free Grammar (SCFG). The decoding task is defined as finding the best derivation for the source sentence, and is carried out by chart parsing.
Chart parsing maintains a chart, which contains an array of cells or bins; a cell maintains a list of items. The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved. The hypotheses are stored in a structure called a hypergraph.

m-gram LM Integration
Three functions: accumulate probability, estimate future cost, state extraction.
Cost function: finalized probability; estimated probability; state extraction.

Parallel and Distributed Decoding
Parallel decoding: divide the test set into multiple parts; each part is decoded by a separate thread; the threads share the language/translation models in memory.
Distributed Language Model (DLM):
Training: divide the corpora into multiple parts; train an LM on each part; find the optimal weights among the LMs by maximizing the likelihood of a dev set.
Decoding: load the LMs into different servers; the decoder remotely calls the servers to obtain the probabilities and then interpolates them on the fly; to save communication overhead, a cache is maintained.
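A minimal sketch of the client side of such a distributed LM, assuming a hypothetical LmServer stub (this is not Joshua's actual API): the probabilities returned by the servers are interpolated on the fly and memoized in a cache to save round trips.

```java
import java.util.*;

public class DistributedLmSketch {

    /** Stand-in for a remote LM server stub (hypothetical interface). */
    interface LmServer {
        double prob(String ngram);
    }

    private final List<LmServer> servers;
    private final double[] weights;                 // tuned to maximize dev-set likelihood
    private final Map<String, Double> cache = new HashMap<>();

    DistributedLmSketch(List<LmServer> servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    /** Interpolate the server probabilities on the fly; cache to save communication. */
    double prob(String ngram) {
        return cache.computeIfAbsent(ngram, key -> {
            double p = 0.0;
            for (int i = 0; i < servers.size(); i++)
                p += weights[i] * servers.get(i).prob(key);  // one remote call per server
            return p;
        });
    }

    public static void main(String[] args) {
        List<LmServer> servers = List.of(ng -> 0.2, ng -> 0.4);  // two toy "servers"
        DistributedLmSketch dlm = new DistributedLmSketch(servers, new double[] {0.5, 0.5});
        System.out.println(dlm.prob("beijing of china"));        // 0.3; a repeat hits the cache
    }
}
```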

Chart parsing
The decoding task is defined as finding the best derivation for the source sentence, and is carried out by chart parsing. It maintains a chart, which contains an array of cells or bins; a cell maintains a list of items. The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved. The hypotheses are stored in a structure called a hypergraph.
State of an item: source span, left-side nonterminal symbol, and left/right LM state.
Decoding complexity: grows quickly with m, since every item must carry the m − 1 leftmost and rightmost target words as its LM state.

Hypergraph
A hypergraph consists of a set of nodes and hyperedges; in parsing, they correspond to items and deductive steps, respectively. Roughly, a hyperedge can be thought of as a rule with pointers to its antecedent items.
State of an item: source span, left-side nonterminal symbol, and left/right LM state.
(Figure: the same hypergraph example as earlier, built for the source 垫子0 上1 的2 猫3, with items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat, and hyperedges for rules such as X → (猫, a cat), X → (垫子 上, the mat), X → (X0 的 X1, X0 X1), X → (X0 的 X1, X0 's X1), X → (X0 的 X1, X1 of X0), X → (X0 的 X1, X1 on X0), and the goal item produced by S → (X0, X0).)