Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

Presentation transcript:

A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

JOSHUA: a scalable open-source parsing-based MT decoder
- Written in Java
- Chart parsing, beam and cube pruning, k-best extraction over a hypergraph, and m-gram LM integration (Chiang, 2007)
- Parallel decoding
- Distributed LM (Zhang et al., 2006; Brants et al., 2007)
- Equivalent LM state maintenance (new)
- We plan to add more functions soon

Chart-parsing
- Grammar formalism: Synchronous Context-Free Grammar (SCFG)
- Chart parsing: bottom-up parsing
  - It maintains a chart, which contains an array of cells or bins
  - A cell maintains a list of items
  - Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  - The hypotheses are stored in a hypergraph
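
To make the chart/cell/item bookkeeping concrete, here is a minimal sketch in Java (the language Joshua is written in). It is deliberately simplified relative to the decoder described here: it parses with a monolingual CFG in Chomsky normal form rather than an SCFG, items carry no LM state, and the toy grammar and class names are made up for illustration.

```java
import java.util.*;

/** Toy bottom-up chart parser (CKY) -- a simplified illustration of the
 *  chart/cell/item bookkeeping on this slide, NOT the Joshua decoder. */
public class ToyChartParser {

    // Lexical rules A -> w (the axioms) and binary rules A -> B C (the inference
    // rules).  This tiny grammar is made up for illustration.
    static final Map<String, Set<String>> LEX = Map.of(
        "the", Set.of("DT"), "cat", Set.of("NN"),
        "sat", Set.of("VB"), "mat", Set.of("NN"),
        "on",  Set.of("IN"));
    static final Map<String, Set<String>> BIN = Map.of(
        "DT NN", Set.of("NP"), "VB PP", Set.of("VP"),
        "IN NP", Set.of("PP"), "NP VP", Set.of("S"));

    public static boolean parses(String[] words) {
        int n = words.length;
        // chart[i][j] is the cell (bin) for span [i, j); it holds the items,
        // here just the nonterminals proved over that span.
        @SuppressWarnings("unchecked")
        Set<String>[][] chart = new Set[n + 1][n + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= n; j++) chart[i][j] = new HashSet<>();

        // Axioms: lexical items over width-1 spans.
        for (int i = 0; i < n; i++)
            chart[i][i + 1].addAll(LEX.getOrDefault(words[i], Set.of()));

        // Bottom-up: prove larger items from smaller ones.
        for (int width = 2; width <= n; width++) {
            for (int i = 0; i + width <= n; i++) {
                for (int k = i + 1; k < i + width; k++)
                    for (String b : chart[i][k])
                        for (String c : chart[k][i + width])
                            chart[i][i + width].addAll(
                                BIN.getOrDefault(b + " " + c, Set.of()));
            }
        }
        // Goal item: S over the whole sentence.
        return chart[0][n].contains("S");
    }

    public static void main(String[] args) {
        System.out.println(parses("the cat sat on the mat".split(" ")));  // true
    }
}
```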

Hypergraph
- A hypergraph is a compact way to represent an exponential number of derivation trees
- It contains a list of items and hyperedges; each item records an LM state (the boundary words shown in the figure)
[Figure: hypergraph for the source sentence 垫子0 上1 的2 猫3, with items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat, hyperedges labeled with rules such as X (猫, a cat), X (垫子 上, the mat), X (X0 的 X1, X0 X1), X (X0 的 X1, X0 's X1), X (X0 的 X1, X1 of X0), and X (X0 的 X1, X1 on X0), and a goal item reached through the rule S (X0, X0).]
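
As a rough illustration of the items and hyperedges in the figure, the following Java sketch models a node as a source span plus a nonterminal and left/right LM-state words, and a hyperedge as a rule with pointers to its antecedent items. The class and field names are hypothetical, not Joshua's actual data structures.

```java
import java.util.*;

/** A minimal sketch of the hypergraph on this slide; names are illustrative. */
public class HypergraphSketch {

    static class Item {                       // a node in the hypergraph
        final int start, end;                 // source span, e.g. [0, 4)
        final String lhs;                     // left-hand-side nonterminal, e.g. "X"
        final List<String> leftLmState;       // leftmost target words, e.g. [a, cat]
        final List<String> rightLmState;      // rightmost target words, e.g. [the, mat]
        final List<HyperEdge> incoming = new ArrayList<>();  // ways to derive this item

        Item(int start, int end, String lhs, List<String> left, List<String> right) {
            this.start = start; this.end = end; this.lhs = lhs;
            this.leftLmState = left; this.rightLmState = right;
        }
    }

    static class HyperEdge {                  // roughly "a rule with pointers"
        final String rule;                    // e.g. "X -> (X0 的 X1, X1 on X0)"
        final List<Item> antecedents;         // the items the rule was applied to

        HyperEdge(String rule, List<Item> antecedents) {
            this.rule = rule; this.antecedents = antecedents;
        }
    }

    public static void main(String[] args) {
        // Two antecedent items and one consequent item, mirroring the figure.
        Item mat = new Item(0, 2, "X", List.of("the", "mat"), List.of("the", "mat"));
        Item cat = new Item(3, 4, "X", List.of("a", "cat"), List.of("a", "cat"));
        Item top = new Item(0, 4, "X", List.of("a", "cat"), List.of("the", "mat"));
        top.incoming.add(new HyperEdge("X -> (X0 的 X1, X1 on X0)", List.of(mat, cat)));
        System.out.println(top.incoming.size() + " derivation(s) of the top item");
    }
}
```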

Hypergraph and Trees
- A hypergraph is a compact way to represent an exponential number of derivation trees
[Figure: the four derivation trees packed into the hypergraph for 垫子0 上1 的2 猫3, all built from X (垫子 上, the mat), X (猫, a cat), and S (X0, X0), yielding "the mat a cat" via X (X0 的 X1, X0 X1), "the mat 's a cat" via X (X0 的 X1, X0 's X1), "a cat of the mat" via X (X0 的 X1, X1 of X0), and "a cat on the mat" via X (X0 的 X1, X1 on X0).]

How to Integrate an m-gram LM?
Three functions:
- Accumulate probability: score the m-grams that become complete when items are combined
- Estimate future cost: boundary words are scored with the context available so far, for pruning
- State extraction: keep only the boundary words as the item's LM state
[Figure: derivation of "the olympic game will be held in china of beijing ." from the source 奥运会0 将1 在2 中国3 的4 北京5 举行。6, with rules such as X (奥运会, the olympic game), X (北京, beijing), X (中国, china), X (X0 的 X1, X1 of X0), X (将 在 X0 举行。, will be held in X0 .), S (X0, X0), S (S0 X1, S0 X1), S (<s> S0 </s>, <s> S0 </s>); chart items such as X | 5, 6 | beijing | NA, X | 3, 4 | china | NA, X | 3, 6 | beijing of | of china, X | 1, 7 | will be | china ., S | 0, 7 | the olympic | china ., S | 0, 7 | <s> the | . </s>; new 3-grams "will be held", "be held in", "held in beijing", "in beijing of", "beijing of china"; LM probability 0.04 = 0.4 * 0.2 * 0.5, future probability P(beijing of) = 0.01, estimated total probability 0.01 * 0.04 = 0.004.]
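
The sketch below illustrates the three functions for a trigram LM under simplifying assumptions: the hypothesis is a flat word sequence (a real rule application interleaves target words with nonterminal gaps), and the LM is passed in as an arbitrary function. It is meant to show the division of labor, not Joshua's implementation.

```java
import java.util.*;
import java.util.function.Function;

/** A sketch of the three LM-integration steps for a trigram LM (m = 3). */
public class LmIntegrationSketch {

    static final int M = 3;  // m-gram order

    /** Accumulate probability: score every full m-gram inside the word sequence. */
    static double accumulate(List<String> words, Function<List<String>, Double> prob) {
        double logProb = 0.0;
        for (int i = M - 1; i < words.size(); i++)          // only complete m-grams
            logProb += Math.log(prob.apply(words.subList(i - M + 1, i + 1)));
        return logProb;
    }

    /** Estimate future cost: the leftmost m-1 words will later gain left context,
     *  so score them for pruning with whatever (shorter) context is available. */
    static double futureCost(List<String> words, Function<List<String>, Double> prob) {
        double logProb = 0.0;
        for (int i = 0; i < Math.min(M - 1, words.size()); i++)
            logProb += Math.log(prob.apply(words.subList(0, i + 1)));
        return logProb;
    }

    /** State extraction: keep only the m-1 boundary words on each side. */
    static List<List<String>> extractState(List<String> words) {
        int k = Math.min(M - 1, words.size());
        return List.of(new ArrayList<>(words.subList(0, k)),
                       new ArrayList<>(words.subList(words.size() - k, words.size())));
    }

    public static void main(String[] args) {
        // A made-up uniform "LM" just so the sketch runs end to end.
        Function<List<String>, Double> lm = ngram -> 0.1;
        List<String> hyp = List.of("will", "be", "held", "in", "beijing");
        System.out.println(accumulate(hyp, lm) + " " + futureCost(hyp, lm));
        System.out.println(extractState(hyp));  // [[will, be], [in, beijing]]
    }
}
```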

Equivalent State Maintenance: overview
- In a straightforward implementation, different LM state words lead to different items: applying rules such as X (在 X0 的 X1 下, below X1 of X0) and X (在 X0 的 X1 下, under X1 of X0) yields distinct items like X | 0, 3 | below cat | some rat, X | 0, 3 | below cats | many rat, and X | 0, 3 | under cat | some rat
- We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard, e.g., X | 0, 3 | below * | * rat
- By merging items, we can explore a larger hypothesis space in less time
- We only merge items when the length of the English span satisfies l ≥ m - 1
- How to decide which LM state words can be replaced is shown on the following slides

Back-off Parameterization of m-gram LMs
- LM probability computation: if an m-gram is listed, use its probability; otherwise back off to the lower-order m-gram, multiplying by the back-off weight β of the truncated context
- Observations:
  - A larger m leads to more back-off
  - The default back-off weight is 1: for an m-gram not listed, β(.) = 1
- Example ARPA-style bigram entries (log10 probability, bigram, log10 back-off weight where listed):
  -4.250922 party files
  -4.741889 party filled
  -4.250922 party finance -0.1434139
  -4.741889 party financed
  -4.741889 party finances -0.2361806
  -4.741889 party financially
  -3.33127 party financing -0.1119054
  -3.277455 party finished -0.4362795
  -4.012205 party fired
  -4.741889 party fires
- Before discussing how to merge items, we need to understand this back-off principle
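
A small sketch of the back-off computation just described, using a made-up table in the spirit of the ARPA excerpt above; the table contents and the stand-in value for unknown unigrams are illustrative only.

```java
import java.util.*;

/** Sketch of back-off m-gram probability computation: use the listed m-gram
 *  if present, otherwise multiply the back-off weight of the context (default
 *  1, i.e. log10 = 0, when unlisted) with the lower-order probability. */
public class BackoffLmSketch {

    // log10 probabilities for listed n-grams and log10 back-off weights,
    // keyed by the space-joined n-gram; a tiny made-up table.
    static final Map<String, Double> LOGPROB = Map.of(
        "party", -3.0, "finance", -4.0, "financed", -5.0, "party finance", -4.250922);
    static final Map<String, Double> BACKOFF = Map.of(
        "party", -0.5, "party finance", -0.1434139);

    /** log10 P(w | context), backing off recursively. */
    static double logProb(List<String> context, String w) {
        String full = String.join(" ", concat(context, w));
        if (LOGPROB.containsKey(full)) return LOGPROB.get(full);
        if (context.isEmpty()) return -99.0;   // unknown unigram, a stand-in value
        // beta(context) defaults to 1 (log10 = 0) when the context is not listed.
        double beta = BACKOFF.getOrDefault(String.join(" ", context), 0.0);
        return beta + logProb(context.subList(1, context.size()), w);
    }

    static List<String> concat(List<String> xs, String w) {
        List<String> out = new ArrayList<>(xs); out.add(w); return out;
    }

    public static void main(String[] args) {
        System.out.println(logProb(List.of("party"), "finance"));   // listed bigram: -4.250922
        // "party financed" is not listed, so back off: beta("party") + P(financed) = -5.5
        System.out.println(logProb(List.of("party"), "financed"));
    }
}
```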

Equivalent State Maintenance: Right-side
- Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
- For a 4-gram LM: P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) · β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l), where the result is independent of e_{l-2} and the back-off weight is one.
- State words e_{l-2} e_{l-1} e_l are followed by future words e_{l+1} e_{l+2} e_{l+3} ...:
  - IS-A-PREFIX(e_{l-2} e_{l-1} e_l) = no  →  equivalent state * e_{l-1} e_l
  - IS-A-PREFIX(e_{l-1} e_l) = no          →  equivalent state * e_l
  - IS-A-PREFIX(e_l) = no                  →  equivalent state *
- Note: IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no.
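
A possible rendering of this right-side procedure as code, assuming an IS-A-PREFIX predicate that reports whether a word sequence is a prefix of any n-gram listed in the LM; the predicate used in main is a toy stand-in.

```java
import java.util.*;
import java.util.function.Predicate;

/** Sketch of right-side equivalent-state extraction for a 4-gram LM (state
 *  holds up to m-1 = 3 words): scan the state words from left to right and
 *  replace a word with "*" when the remaining state words starting at that
 *  position are not a prefix of any listed n-gram, since future words will
 *  then back off past it with back-off weight 1. */
public class RightStateSketch {

    static List<String> equivalentRightState(List<String> state,
                                             Predicate<List<String>> isAPrefix) {
        List<String> out = new ArrayList<>(state);
        // Try to drop e_{l-2} first, then e_{l-1}, mirroring the table above.
        for (int i = 0; i < state.size(); i++) {
            List<String> rest = state.subList(i, state.size());
            if (isAPrefix.test(rest)) break;   // this word still matters
            out.set(i, "*");                   // this word can be ignored
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical prefix test: pretend the LM only lists n-grams starting with "rat".
        Predicate<List<String>> isAPrefix = ngram -> ngram.get(0).equals("rat");
        System.out.println(equivalentRightState(List.of("many", "gray", "rat"), isAPrefix));
        // -> [*, *, rat]
    }
}
```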

Equivalent State Maintenance: Left-side
- Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
- For a 4-gram LM: P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) · β(e_0 e_1 e_2), where P(e_3 | e_1 e_2) is a finalized probability and the back-off weight β(e_0 e_1 e_2) is independent of e_3; remember to factor in the back-off weights later.
- State words e_1 e_2 e_3 are preceded by future words ... e_{-2} e_{-1} e_0:
  - IS-A-SUFFIX(e_1 e_2 e_3) = no  →  equivalent state e_1 e_2 *
  - IS-A-SUFFIX(e_1 e_2) = no      →  equivalent state e_1 *
  - IS-A-SUFFIX(e_1) = no          →  equivalent state *
- In the fully backed-off case:
  P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) · β(e_0) · β(e_{-1} e_0) · β(e_{-2} e_{-1} e_0)
  P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) · β(e_0 e_1) · β(e_{-1} e_0 e_1)
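
The left-side procedure is symmetric; here is a sketch under the same assumptions, with IS-A-SUFFIX supplied as a predicate. Recording and later applying the outstanding back-off weights is omitted.

```java
import java.util.*;
import java.util.function.Predicate;

/** Sketch of left-side equivalent-state extraction for a 4-gram LM: scan the
 *  state words from right to left and replace a word with "*" when the state
 *  words up to and including it are not a suffix of any listed n-gram, since
 *  its probability can then be finalized with the context already available
 *  and only back-off weights from the future left context remain. */
public class LeftStateSketch {

    static List<String> equivalentLeftState(List<String> state,
                                            Predicate<List<String>> isASuffix) {
        List<String> out = new ArrayList<>(state);
        // Try to drop e_3 first, then e_2, then e_1, mirroring the table above.
        for (int i = state.size() - 1; i >= 0; i--) {
            List<String> upToI = state.subList(0, i + 1);
            if (isASuffix.test(upToI)) break;  // future left context still matters
            out.set(i, "*");                   // finalize P(e_i | e_1 .. e_{i-1}) now
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical suffix test: pretend the LM only lists n-grams ending in "under".
        Predicate<List<String>> isASuffix = ngram -> ngram.get(ngram.size() - 1).equals("under");
        System.out.println(equivalentLeftState(List.of("under", "the", "table"), isASuffix));
        // -> [under, *, *]
    }
}
```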

Equivalent State Maintenance: summary
- Original cost function vs. modified cost function
- Finalized probability, estimated probability, state extraction
[The original and modified cost-function equations appeared as figures on this slide.]

Experimental Results: Decoding Speed
- Task: Chinese-to-English translation
- System training: sub-sampling a bitext of about 3M sentence pairs to obtain 570k sentence pairs
- LM training data: Gigaword and the English side of the bitext
- Decoding setup: 3M rules, 49M m-grams
- 38 times faster than the baseline!

Experimental Results: Distributed LM
- Distributed language model: eight 7-gram LMs
- Decoding speed: 12.2 sec/sentence

Experimental Results: Equivalent LM States
- Search effort versus search quality, with and without equivalent LM state maintenance
- Sparse LM: a 7-gram LM built on about 19M words
- Dense LM: a 3-gram LM built on about 130M words
- With the dense LM, equivalent LM state maintenance is slower than the regular method:
  - back-off happens less frequently
  - suffix/prefix information lookup is inefficient
[Figure: plot of search quality against search effort for the sparse and dense LMs.]

Summary
- We describe a scalable parsing-based MT decoder
- The decoder has been successfully used to decode millions of sentences in a large-scale discriminative training task
- We propose a method to maintain equivalent LM states
- The decoder is available at http://www.cs.jhu.edu/~zfli/

Acknowledgements
- Thanks to Philip Resnik for letting me use the UMD Python decoder
- Thanks to the UMD MT group members for very helpful discussions
- Thanks to David Chiang for Hiero and his original implementation in Python

Thank you!

Grammar Formalism
Synchronous Context-Free Grammar (SCFG):
- Ts: a set of source-language terminal symbols
- Tt: a set of target-language terminal symbols
- N: a shared set of nonterminal symbols
- A set of rules, each pairing a source side over N ∪ Ts with a target side over N ∪ Tt and co-indexing their nonterminals; a typical rule looks like: X → (X0 的 X1, X1 of X0)
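
For concreteness, a rule of this shape can be represented roughly as follows (using a Java record for brevity); the layout is illustrative and does not follow Joshua's grammar file format.

```java
import java.util.*;

/** A small illustration of the SCFG rule shape described on this slide:
 *  a left-hand-side nonterminal plus source and target sides that mix
 *  terminals with co-indexed nonterminals (X0, X1, ...). */
public class ScfgRuleSketch {

    record Rule(String lhs, List<String> source, List<String> target) {
        @Override public String toString() {
            return lhs + " -> (" + String.join(" ", source) + ", " + String.join(" ", target) + ")";
        }
    }

    public static void main(String[] args) {
        // A typical rule from the running example: X -> (X0 的 X1, X1 of X0).
        Rule r = new Rule("X", List.of("X0", "的", "X1"), List.of("X1", "of", "X0"));
        System.out.println(r);
    }
}
```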

Chart-parsing
- Grammar formalism: Synchronous Context-Free Grammar (SCFG)
- The decoding task is defined as chart parsing:
  - It maintains a chart, which contains an array of cells or bins
  - A cell maintains a list of items
  - Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  - The hypotheses are stored in a structure called a hypergraph

m-gram LM Integration
Three functions:
- Accumulate probability
- Estimate future cost
- State extraction
Cost function:
- Finalized probability
- Estimated probability
- State extraction

Parallel and Distributed Decoding
Parallel decoding:
- Divide the test set into multiple parts
- Each part is decoded by a separate thread
- The threads share the language/translation models in memory
Distributed language model (DLM):
- Training:
  - Divide the corpora into multiple parts
  - Train an LM on each part
  - Find the optimal interpolation weights among the LMs by maximizing the likelihood of a dev set
- Decoding:
  - Load the LMs into different servers
  - The decoder remotely calls the servers to obtain the probabilities
  - The decoder then interpolates the probabilities on the fly
  - To save communication overhead, a cache is maintained
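
A sketch of the lookup path described above, with in-memory functions standing in for the remote LM servers; the weights, cache policy, and class names are illustrative assumptions rather than Joshua's actual distributed-LM client.

```java
import java.util.*;
import java.util.function.Function;

/** Sketch of distributed-LM lookup: query one "server" per LM slice,
 *  interpolate with tuned weights, and cache results to save communication. */
public class DistributedLmSketch {

    private final List<Function<String, Double>> servers;  // stand-ins for RPC clients
    private final double[] weights;                        // tuned on a dev set
    private final Map<String, Double> cache = new HashMap<>();

    DistributedLmSketch(List<Function<String, Double>> servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    /** Interpolated probability of a space-joined n-gram, with caching. */
    double prob(String ngram) {
        return cache.computeIfAbsent(ngram, key -> {
            double p = 0.0;
            for (int i = 0; i < servers.size(); i++)
                p += weights[i] * servers.get(i).apply(key);   // one remote call per LM
            return p;
        });
    }

    public static void main(String[] args) {
        // Two made-up "servers" returning constant probabilities, weights 0.7 / 0.3.
        Function<String, Double> lmA = ngram -> 0.02;
        Function<String, Double> lmB = ngram -> 0.05;
        DistributedLmSketch dlm =
            new DistributedLmSketch(List.of(lmA, lmB), new double[]{0.7, 0.3});
        System.out.println(dlm.prob("will be held"));  // 0.7*0.02 + 0.3*0.05 ≈ 0.029
        System.out.println(dlm.prob("will be held"));  // served from the cache
    }
}
```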

Chart-parsing
- The decoding task is defined as chart parsing:
  - It maintains a chart, which contains an array of cells or bins
  - A cell maintains a list of items
  - Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  - The hypotheses are stored in a structure called a hypergraph
- State of an item: source span, left-side nonterminal symbol, and left/right LM state
- Decoding complexity

Hypergraph
- A hypergraph consists of a set of nodes and hyperedges; in parsing, they correspond to items and deductive steps, respectively
- Roughly, a hyperedge can be thought of as a rule with pointers
- State of an item: source span, left-side nonterminal symbol, and left/right LM state
- A hypergraph is a compact way to represent an exponential number of derivation trees
[Figure: the hypergraph for 垫子0 上1 的2 猫3 ("a cat on the mat"), with items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | the mat | a cat, and X | 0, 4 | a cat | the mat, hyperedges labeled with the rules X (垫子 上, the mat), X (猫, a cat), X (X0 的 X1, X0 X1), X (X0 的 X1, X0 's X1), X (X0 的 X1, X1 of X0), and X (X0 的 X1, X1 on X0), and the goal item reached via S (X0, X0).]