
CSE 517 Natural Language Processing, Winter 2015
Phrase-Based Translation
Yejin Choi
Slides from Philipp Koehn, Dan Klein, Luke Zettlemoyer
(Instructor's notes: didn't get to the perceptron part; an algorithms slide for decoding could be nice; had to fill in many details when going over the search options.)

Phrase-Based Systems

Pipeline: sentence-aligned corpus → word alignments → phrase table (translation model).

Example phrase table entries:
cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
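As a concrete illustration (not from the slides), a minimal sketch of how such a "src ||| tgt ||| score" phrase table might be loaded for lookup; the format and scores are simply the entries shown above.

```python
from collections import defaultdict

def load_phrase_table(lines):
    """Parse 'src ||| tgt ||| score' lines into: source phrase -> [(target, score), ...]."""
    table = defaultdict(list)
    for line in lines:
        src, tgt, score = (field.strip() for field in line.split("|||"))
        table[src].append((tgt, float(score)))
    return table

entries = [
    "cat ||| chat ||| 0.9",
    "the cat ||| le chat ||| 0.8",
    "my house ||| ma maison ||| 0.9",
]
table = load_phrase_table(entries)
print(table["the cat"])   # [('le chat', 0.8)]
```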

Phrase Translation Tables

- The phrase table defines the space of possible translations; each entry has an associated "probability".
- One learned example: the entries for "den Vorschlag", learned from Europarl data.
- The table is noisy, has errors, and its entries do not necessarily match our linguistic intuitions about consistency.

Phrase Translation Model

Distortion Model

Extracting Phrases

- We will use word alignments to find phrases.
- Question: what is the best set of phrases?

Extracting Phrases

A phrase pair must:
- contain at least one alignment edge, and
- contain all alignment edges of the words inside it (no word in the pair may be aligned to a word outside it).

Extract all such phrase pairs!

Phrase Pair Extraction Example

For the sentence pair "Maria no daba una bofetada a la bruja verde" / "Mary did not slap the green witch", the consistent phrase pairs are:
- (Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch), (verde, green)
- (Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch)
- (Maria no daba una bofetada, Mary did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch)
- (Maria no daba una bofetada a la, Mary did not slap the), (daba una bofetada a la bruja verde, slap the green witch)
- (Maria no daba una bofetada a la bruja verde, Mary did not slap the green witch)
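A minimal Python sketch (an illustration, not the course's reference code) of the consistent-phrase extraction rule described above. The word alignment below is an assumption chosen to reproduce the example, and the usual extension over unaligned boundary words (and the phrase-length cap used in real systems) is omitted for brevity.

```python
def extract_phrases(src, tgt, alignment):
    """src, tgt: token lists; alignment: set of (src_index, tgt_index) links.
    Returns all consistent phrase pairs."""
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, len(src)):
            # target positions linked to some source word inside [s1, s2]
            linked = [j for (i, j) in alignment if s1 <= i <= s2]
            if not linked:
                continue                      # must contain at least one alignment edge
            t1, t2 = min(linked), max(linked)
            # consistency: no target word inside [t1, t2] is aligned outside [s1, s2]
            if any(t1 <= j <= t2 and not (s1 <= i <= s2) for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

src = "Maria no daba una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# assumed alignment, for illustration only
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (5, 4), (6, 4), (7, 6), (8, 5)}
for f, e in extract_phrases(src, tgt, alignment):
    print(f, "|||", e)
```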

Linguistic Phrases?

Phrase Size

- Phrases do help.
- But they don't need to be long.
- Why should this be?

Bidirectional Alignment

Alignment Heuristics
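The two slides above are image-only in the transcript. As rough background on what the usual bidirectional-alignment heuristics do (intersection of the two directional alignments as a high-precision starting point, union as a high-recall upper bound, with grow-diag-final-style heuristics in between), a minimal sketch:

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Both inputs are sets of (src_idx, tgt_idx) links from the two directional
    alignment runs, normalized to the same coordinate order. Intersection is high
    precision; union is high recall; heuristics interpolate between the two."""
    return src_to_tgt & tgt_to_src, src_to_tgt | tgt_to_src

e2f = {(0, 0), (1, 1), (2, 2)}
f2e = {(0, 0), (1, 1), (1, 2)}
inter, uni = symmetrize(e2f, f2e)
print(sorted(inter))  # [(0, 0), (1, 1)]
print(sorted(uni))    # [(0, 0), (1, 1), (1, 2), (2, 2)]
```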

Size of Phrase Translation Table

Why not Learn Phrases w/ EM?

Phrase Scoring

- Learning phrase weights directly has been tried several times: [Marcu and Wong, 02], [DeNero et al., 06], and others.
- It seems not to work well, for a variety of partially understood reasons.
- Main issue: big chunks get all the weight, and obvious priors don't help (though see [DeNero et al., 08]).
- Example sentence pair from the slide: "les chats aiment le poisson frais ." / "cats like fresh fish ."
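What is typically done instead (standard background, not shown explicitly in the transcript): phrases are extracted heuristically from the word alignments and then scored by relative frequency over the extracted phrase pairs, e.g.

\[
g(\bar f, \bar e) \;=\; \log \frac{\operatorname{count}(\bar f, \bar e)}{\operatorname{count}(\bar f)}
\]

Systems usually include relative-frequency scores in both translation directions, plus lexical weighting.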

Translation: Codebreaking?

"Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"

Warren Weaver (1955:18, quoting a letter he wrote in 1947)

Examples from Liang Huang (a series of image-only example slides)

Scoring

Basic approach: sum up the phrase translation scores and a language model score.
- Define y = p1 p2 … pL to be a translation with phrase pairs pi.
- Define e(y) to be the output English sentence in y.
- Let h(·) be the log probability under a trigram language model.
- Let g(·) be a phrase pair score (from the previous slide).
- The full translation score f(y) combines these (the formula was shown as an image; a reconstruction follows).
- Goal: compute the best translation, the y maximizing f(y).
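Reconstructed from the definitions above (the slide itself shows the formula only as an image), the score and the decoding goal are presumably:

\[
f(y) \;=\; h\bigl(e(y)\bigr) \;+\; \sum_{k=1}^{L} g(p_k),
\qquad
y^{*} \;=\; \arg\max_{y \in \mathcal{Y}(x)} f(y)
\]

where \(\mathcal{Y}(x)\) is the set of valid phrase-based derivations for the source sentence \(x\).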

The Pharaoh Decoder

Scores at each step include the language model (LM) and the translation model (TM).

The Pharaoh Decoder

The space of possible translations:
- The phrase table constrains the possible translations.
- The output sentence is built left to right, but source phrases can match any part of the source sentence.
- Each source word can only be translated once.
- Each source word must be translated.

Scoring

In practice, much as for the alignment models, we also include a distortion penalty.
- Define y = p1 p2 … pL to be a translation with phrase pairs pi.
- Let s(pi) be the start position of the foreign phrase.
- Let t(pi) be the end position of the foreign phrase.
- Define η to be the distortion parameter (usually negative!).
- Then we can define a score with a distortion penalty (reconstruction follows).
- Goal: compute the best translation, the y maximizing this score.
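Again the formula itself is only an image on the slide; reconstructed from the definitions above (following the standard absolute-distortion penalty), it is presumably:

\[
f(y) \;=\; h\bigl(e(y)\bigr) \;+\; \sum_{k=1}^{L} g(p_k)
\;+\; \sum_{k=0}^{L-1} \eta \,\bigl|\, t(p_k) + 1 - s(p_{k+1}) \,\bigr|
\]

with the convention \(t(p_0) = 0\), so every jump between consecutive source phrases (including the first phrase's distance from the sentence start) is penalized.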

Hypothesis Expansion

Hypothesis Explosion!

- Q: How much time does it take to find the best translation?
- There are exponentially many translations in the length of the source sentence.
- The problem is NP-hard, just as for word-based translation models.
- So we will use approximate search techniques!

Hypothesis Lattices

Two partial hypotheses can be recombined if:
- their last two English words match, and
- their foreign-word coverage vectors match.

Decoder Pseudocode

Initialization: set beam Q = {q0}, where q0 is the initial state with no words translated.
For i = 0 … n-1  [where n is the input sentence length]
    For each state q ∈ beam(Q) and phrase p ∈ ph(q):
        q' = next(q, p)      [compute the new state]
        Add(Q, q', q, p)     [add the new state to the beam]

Notes:
- ph(q): the set of phrases that can be appended to the partial translation in state q
- next(q, p): extends the translation in q with p and records which input words have been translated
- Add(Q, q', q, p): updates the beam; q' is added to Q if it is among the top-n highest-scoring partial translations overall

Decoder Pseudocode: State Representations

(Same loop as the previous slide.)

Possible state representations:
- Full: q = (e, b, α), e.g. ("Joe did not give", 11000000, 0.092)
  - e is the partial English sentence
  - b is a bit vector recording which source words have been translated
  - α is the score of the translation so far

Decoder Pseudocode: Compact States

(Same loop as the previous slides.)

Possible state representations:
- Full: q = (e, b, α), e.g. ("Joe did not give", 11000000, 0.092)
- Compact: q = (e1, e2, b, r, α), e.g. ("not", "give", 11000000, 4, 0.092)
  - e1 and e2 are the last two words of the partial translation
  - r is the end position of the last translated source phrase (needed for the distortion penalty)
- The compact representation is more efficient, but requires back pointers to recover the final translation.
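A minimal Python sketch (an illustration, not the course's reference implementation) of the compact state and the next(q, p) operation from the pseudocode; the Phrase representation, the lm_logprob callable, and the distortion-limit value are assumptions.

```python
from collections import namedtuple

# Compact decoder state: last two English words, coverage bit vector,
# end position of the last translated source phrase, and score so far.
State = namedtuple("State", ["e1", "e2", "bits", "r", "alpha"])

# A phrase pair: source span [s, t] (inclusive), English words, and score g(p).
Phrase = namedtuple("Phrase", ["s", "t", "english", "score"])

def ph(q, phrases, d_limit=4):
    """Phrases that can extend state q: source span not yet covered,
    and within the distortion limit of the last phrase's end position."""
    return [p for p in phrases
            if not any(q.bits[i] for i in range(p.s, p.t + 1))
            and abs(q.r + 1 - p.s) <= d_limit]

def next_state(q, p, lm_logprob, eta=-0.1):
    """Extend q with phrase p; the score adds g(p), the LM term, and the distortion penalty."""
    bits = list(q.bits)
    for i in range(p.s, p.t + 1):
        bits[i] = 1
    alpha = q.alpha + p.score + eta * abs(q.r + 1 - p.s)
    e1, e2 = q.e1, q.e2
    for w in p.english:
        alpha += lm_logprob(e1, e2, w)   # trigram LM score h(), assumed callable
        e1, e2 = e2, w
    return State(e1, e2, tuple(bits), p.t, alpha)
```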

Pruning

- Problem: easy partial analyses are cheaper, so they crowd out harder ones in a single beam.
- Solution 1: a separate beam for each number of translated foreign words.
- Solution 2: estimate forward costs (A*-like future cost estimation).

Stack Decoding (image-only slides)

Decoder Pseudocode (Multibeam)

Initialization: set Q0 = {q0} and Qi = {} for i = 1 … n  [n is the input sentence length]
For i = 0 … n-1:
    For each state q ∈ beam(Qi) and phrase p ∈ ph(q):
        q' = next(q, p)
        Add(Qj, q', q, p), where j = len(q')

Notes:
- Qi is a beam holding the partial translations in which exactly i input words have been translated
- len(q) is the number of bits equal to one in q's coverage vector (the number of words that have been translated)

A Python sketch of this multibeam loop, building on the state sketch above, follows.
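Continuing the earlier sketch (again an illustration under assumed helper names, not the course's reference code), the multibeam loop might look like:

```python
import heapq

def decode(n, phrases, lm_logprob, beam_size=100):
    """n: source length; phrases: list of Phrase pairs for this sentence.
    Returns the best-scoring complete state. Back pointers are omitted for
    brevity, so this recovers only the score; a real decoder would also
    record the (q, p) history to read off the English output."""
    q0 = State("<s>", "<s>", (0,) * n, -1, 0.0)
    beams = [dict() for _ in range(n + 1)]          # beams[i]: states covering i words
    beams[0][(q0.e1, q0.e2, q0.bits)] = q0
    for i in range(n):
        # keep only the top `beam_size` states in this beam
        kept = heapq.nlargest(beam_size, beams[i].values(), key=lambda q: q.alpha)
        for q in kept:
            for p in ph(q, phrases):
                q2 = next_state(q, p, lm_logprob)
                j = sum(q2.bits)
                # recombination per the lattice slide (last two words + coverage);
                # with a distortion penalty, r should strictly be part of the key too
                key = (q2.e1, q2.e2, q2.bits)
                if key not in beams[j] or q2.alpha > beams[j][key].alpha:
                    beams[j][key] = q2
    return max(beams[n].values(), key=lambda q: q.alpha) if beams[n] else None
```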

Making it Work (better) (a series of image-only slides)

No Data like More Data!

We discussed this for language models, but now we can understand it for the full translation model!

Tuning for MT

- Features encapsulate lots of information.
- Basic MT systems have around 6 features: P(e|f), P(f|e), lexical weighting, the language model, …
- How do we tune the feature weights?
- Idea 1: use your favorite classifier.
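For context (standard background; the exact formula is not in the transcript), these features are typically combined log-linearly, and "tuning" means choosing the feature weights λ:

\[
\text{score}(e \mid f) \;=\; \sum_{i} \lambda_i \, h_i(e, f),
\qquad
\hat{e} \;=\; \arg\max_{e} \sum_{i} \lambda_i \, h_i(e, f)
\]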

Why Tuning is Hard

- Problem 1: there are latent variables (alignments and segmentations).
- One possibility: forced decoding (but it can go badly).

Why Tuning is Hard

- Problem 2: there are many right answers.
- The reference translation(s) are just a few of the options; there is no good characterization of the whole class.
- BLEU isn't perfect, and even if you trust it, it's a corpus-level metric, not a sentence-level one.