
DP-based Search Algorithms for Statistical Machine Translation
Presented by Mauricio Zuluaga
Based on Christoph Tillmann's presentation and "Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation", C. Tillmann, H. Ney

Computational Challenges in M.T.
Source sentence f (French), target sentence e (English).
Bayes' rule: Pr(e|f) = Pr(e) · Pr(f|e) / Pr(f)

Computational Challenges in M.T.
Estimating the language model probability Pr(e) (L.M. problem; trigram).
Estimating the translation model probability Pr(f|e) (translation problem).
Finding an efficient way to search for the English sentence that maximizes the product Pr(e) · Pr(f|e) (search problem). During the search we want to focus only on the most likely hypotheses.
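A minimal sketch of this decision rule, assuming toy probability tables (all numbers and words are invented for illustration): pick the hypothesis e that maximizes Pr(e) · Pr(f|e), which by Bayes' rule also maximizes Pr(e|f), since Pr(f) is constant for a fixed input.

```python
import math

# Toy noisy-channel decoder: argmax_e Pr(e) * Pr(f|e), scored in log space.
# Both probability tables below are invented for this sketch.
lm = {"the house": 0.04, "house the": 0.001}           # Pr(e), toy language model
tm = {("la maison", "the house"): 0.2,                 # Pr(f|e), toy translation model
      ("la maison", "house the"): 0.2}

def decode(f, hypotheses):
    """Return the hypothesis maximizing log Pr(e) + log Pr(f|e)."""
    def score(e):
        return math.log(lm[e]) + math.log(tm[(f, e)])
    return max(hypotheses, key=score)

print(decode("la maison", ["the house", "house the"]))  # prints "the house"
```

Here the translation model cannot distinguish the two word orders, so the language model decides, which is exactly the division of labor the noisy-channel factorization is meant to achieve.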

Approach based on Bayes' rule (diagram): Source Language Text → Transformation → Global Search over Language Model and Translation Model → Inverse Transformation → Target Language Text.

Model Details
Trigram language model.
Translation model (simplified):
1. Lexicon probabilities
2. Fertilities
3. Class-based distortion probabilities: "Here, j is the currently covered input sentence position and j′ is the previously covered input sentence position. The input sentence length J is included, since we would like to think of the distortion probability as normalized according to J." [Tillmann]
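To illustrate how these components combine, here is a toy scorer for a single hypothesis under the simplified model: a trigram language model, lexicon probabilities, and a distortion probability conditioned on the previously covered source position j′. All tables, words, and the coarse p(j | j′) stand-in for the class-based distortion model are invented for this sketch.

```python
import math

# Score one (target words, source words, alignment) triple under the
# simplified model components: trigram LM, lexicon, distortion.
# All probability tables are made up for illustration.
trigram = {("<s>", "<s>", "my"): 0.1, ("<s>", "my", "colleague"): 0.2}
lexicon = {("mein", "my"): 0.5, ("Kollege", "colleague"): 0.4}
distortion = {(1, 0): 0.6, (2, 1): 0.7}   # p(j | j'), coarse stand-in

def score(e_words, f_words, alignment):
    """alignment[i] = source position j (1-based) covered by e_words[i]."""
    logp, hist, j_prev = 0.0, ("<s>", "<s>"), 0
    for e, j in zip(e_words, alignment):
        logp += math.log(trigram[(hist[0], hist[1], e)])   # LM probability
        logp += math.log(lexicon[(f_words[j - 1], e)])     # lexicon probability
        logp += math.log(distortion[(j, j_prev)])          # distortion probability
        hist, j_prev = (hist[1], e), j
    return logp

print(score(["my", "colleague"], ["mein", "Kollege"], [1, 2]))
```

The real model additionally multiplies in fertility probabilities and conditions the distortion on word classes; this sketch only shows the product structure of the score.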

Model Details (Model 4 vs. Model 3)
Same as Model 3 except in the handling of distortion probabilities. In Model 4 there are two separate distortion probabilities: one for the head of a tablet and one for the remaining words of the tablet. The probability depends on the previous tablet and on the identity (class) of the French word being placed (e.g., adjectives appear before nouns in English but after them in French). "We expect d1(−1 | A(e), B(f)) to be larger than d1(+1 | A(e), B(f)) when e is an adjective and f is a noun. Indeed, this is borne out in the trained distortion probabilities for Model 4", where d1(−1 | A(government's), B(développement)) comes out larger than d1(+1 | A(government's), B(développement)). A and B are class functions of the English and French words (in this implementation |A| = |B| = 50 classes).

Decoder
This is the part where we have to be efficient! Others have followed different approaches for decoders.
"Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation", C. Tillmann, H. Ney: a DP-based beam search decoder for IBM Model 4 (the one described in the previous paper).

Example Alignment
Word-to-word alignment (source to target); the alignment is hidden.
Source (German): In diesem Fall kann mein Kollege Sie am vierten Mai nicht besuchen.
Target (English): In this case my colleague can not visit you on the fourth of May.

Inverted Alignments
Inverted alignment (target to source): each target position i is assigned a source position (figure of source vs. target positions omitted).
Coverage constraint: introduce a coverage vector.

Traveling Salesman Problem
Problem: visit J cities; costs are attached to transitions between cities; visit each city exactly once, minimizing the overall costs.
Solved by dynamic programming (Held-Karp 1962).
In the MT analogy, cities correspond to source sentence positions (words; coverage constraint), and costs are the negative logarithm of the product of the translation, alignment, and language model probabilities.

Traveling Salesman Problem
DP with auxiliary quantity D(C, j): the cost of the shortest path from city 1 to city j visiting all cities in C.
Complexity using DP: O(J² · 2^J).
The order in which the cities in C are visited is not important; only the cost of the best path reaching j has to be stored.
Remember: the minimum edit distance formulation was also a DP search problem.
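The Held-Karp recursion above can be sketched directly; D[(C, j)] stores the cost of the best path from the start city 0 to city j covering exactly the set C. The cost matrix below is an invented 4-city example.

```python
from itertools import combinations

# Held-Karp DP for the TSP: D[(C, j)] = cost of the shortest path from
# city 0 to city j visiting exactly the cities in C. O(J^2 * 2^J).
def held_karp(cost):
    J = len(cost)
    D = {(frozenset([0]), 0): 0.0}                      # base case: at city 0
    for size in range(2, J + 1):                        # cardinality-synchronous
        for subset in combinations(range(1, J), size - 1):
            C = frozenset(subset) | {0}
            for j in subset:
                # best predecessor k among the already-visited cities
                D[(C, j)] = min(D[(C - {j}, k)] + cost[k][j]
                                for k in C - {j} if (C - {j}, k) in D)
    full = frozenset(range(J))
    return min(D[(full, j)] + cost[j][0] for j in range(1, J))

cost = [[0, 2, 9, 10],
        [1, 0, 6, 4],
        [15, 7, 0, 8],
        [6, 3, 12, 0]]
print(held_karp(cost))  # prints 21, the optimal tour cost here
```

Note the cardinality-synchronous outer loop: exactly the traversal order the MT search algorithm on the later slides reuses.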

DP lattice of states (C, j) for J = 5, grouped by cardinality |C|:
|C| = 1: ({1},1)
|C| = 2: ({1,2},2) ({1,3},3) ({1,4},4) ({1,5},5)
|C| = 3: ({1,2,3},2) ({1,2,3},3) ({1,2,4},2) ({1,2,4},4) ({1,2,5},2) ({1,2,5},5) ({1,3,4},3) ({1,3,4},4) ({1,3,5},3) ({1,3,5},5) ({1,4,5},4) ({1,4,5},5)
|C| = 4: ({1,2,3,4},2) ({1,2,3,4},3) ({1,2,3,4},4) ({1,2,3,5},2) ({1,2,3,5},3) ({1,2,3,5},5) ({1,2,4,5},2) ({1,2,4,5},4) ({1,2,4,5},5) ({1,3,4,5},3) ({1,3,4,5},4) ({1,3,4,5},5)
|C| = 5: ({1,2,3,4,5},2) ({1,2,3,4,5},3) ({1,2,3,4,5},4) ({1,2,3,4,5},5) → Final

M.T. Recursion Equation
Recursion (schematically, with a bigram-style state for readability):
Q(e, C, j) = p(f_j | e) · max over e′ and j′ ∈ C∖{j} of [ p(j | j′, J) · p(e | e′) · Q(e′, C∖{j}, j′) ]
Complexity: exponential in J as for the TSP, with additional factors for the target language vocabulary E (still too large…).
Maximum approximation: keep only the maximizing predecessor hypothesis instead of summing over all of them.
*Q(e, C, j) is the probability of the best partial hypothesis (e_1 … e_i, b_1 … b_i) where C = {b_k | k = 1 … i}, b_i = j, e_i = e, and e_{i−1} = e′.

DP-based Search Algorithm
Input: source string f_1 … f_J
Initialization
for each cardinality c = 1, …, J do
  for each pair (C, j), where |C| = c and j ∈ C, do
    for each target word e do
      evaluate Q(e, C, j)
Trace back: find the best (shortest) tour and recover the optimal target word sequence.

IBM-Style Re-ordering (S3)
Procedural restriction: select one of the first 4 empty source positions (to extend the hypothesis).
This window gives an upper bound on the word re-ordering complexity over the source positions j = 1 … J.
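A sketch of this restriction (function name and 1-based indexing are our choices, not from the paper): the candidate extensions of a partial hypothesis are simply the first 4 uncovered source positions.

```python
# IBM-style (S3) re-ordering restriction: a hypothesis may next cover
# only one of the first `window` still-empty source positions.
def allowed_extensions(coverage, J, window=4):
    """Return the source positions (1-based) a hypothesis may cover next."""
    empty = [j for j in range(1, J + 1) if j not in coverage]
    return empty[:window]

print(allowed_extensions({1, 3}, J=7))  # prints [2, 4, 5, 6]
```

Because each state has at most 4 successors instead of up to J, the number of reachable coverage vectors, and hence the search effort, is sharply reduced.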

Verb Group Re-ordering (GE)
Source (German): In diesem Fall kann mein Kollege Sie am vierten Mai nicht besuchen.
Target (English): In this case my colleague can not visit you on the fourth of May.
Complexity: mostly monotonic traversal from left to right.

Beam Search Pruning
Search proceeds cardinality-synchronously over coverage vectors.
Three pruning types:
1. Coverage pruning
2. Cardinality pruning
3. Observation pruning (the number of target words produced by a source word f is limited)

Beam Search Pruning
Four kinds of thresholds:
the coverage pruning threshold t_C
the coverage histogram threshold n_C
the cardinality pruning threshold t_c (looks only at the cardinality)
the cardinality histogram threshold n_c (looks only at the cardinality)
Define new probabilities based on the uncovered positions (using only trigram and lexicon probabilities). Maintain only the hypotheses above the thresholds.

Beam Search Pruning
Compute the best score and apply the thresholds:
1. For each coverage vector C: coverage pruning with threshold t_C, plus histogram pruning with n_C.
2. For each cardinality c: cardinality pruning with threshold t_c, plus histogram pruning with n_c.
Observation pruning: for each source word f, select only the best target words e.
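A toy illustration of the combined threshold and histogram pruning applied to a group of hypotheses (sharing a coverage vector or a cardinality); all scores are invented and hypotheses are just labels here.

```python
# Prune a group of hypotheses: keep those within margin t of the best
# log-score (threshold pruning), and at most n of them (histogram pruning).
def prune(hyps, t, n):
    """hyps: list of (log_score, hypothesis) pairs. Return survivors."""
    if not hyps:
        return []
    best = max(s for s, _ in hyps)
    kept = [(s, h) for s, h in hyps if s >= best - t]   # threshold pruning
    kept.sort(key=lambda x: -x[0])
    return kept[:n]                                     # histogram pruning

hyps = [(-1.0, "a"), (-1.5, "b"), (-4.0, "c"), (-1.2, "d")]
print(prune(hyps, t=2.0, n=2))  # prints [(-1.0, 'a'), (-1.2, 'd')]
```

In the decoder this is applied once per coverage vector (t_C, n_C) and once per cardinality (t_c, n_c), so the beam width adapts to how crowded each group of hypotheses is.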

German-English Verbmobil
German to English, IBM Model 4.
Evaluation measures: m-WER and SSER.
Training: 58K sentence pairs.
Vocabulary: 8K (German), 5K (English).
Test-331: held-out data, used to tune the scaling factors for the language and distortion models.
Test-147: evaluation.

Effect of Coverage Pruning
Table: CPU time [sec] and m-WER [%] as a function of the coverage pruning threshold, for the GE and S3 re-ordering restrictions.

TEST-147: Translation Results
Table: CPU time [sec], m-WER [%], and SSER [%] for the re-ordering strategies MON (no re-ordering), GE (verb group), and S3 (like the IBM patent).

References
Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation, C. Tillmann, H. Ney
A DP based Search Using Monotone Alignments in Statistical Translation, C. Tillmann, S. Vogel, H. Ney, A. Zubiaga
The Mathematics of Statistical Machine Translation: Parameter Estimation, Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer
Accelerated DP Based Search for Statistical Translation, C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf
Word Re-ordering and DP-based Search in Statistical Machine Translation, H. Ney, C. Tillmann