Statistical NLP Spring 2010 Lecture 17: Word / Phrase MT Dan Klein – UC Berkeley

Corpus-Based MT
Modeling correspondences between languages, learned from a sentence-aligned parallel corpus:
Yo lo haré mañana / I will do it tomorrow
Hasta pronto / See you soon
Hasta pronto / See you around
Novel sentence: Yo lo haré pronto
Machine translation system (model of translation) proposes: I will do it soon / I will do it around / See you tomorrow

Unsupervised Word Alignment
- Input: a bitext: pairs of translated sentences
- Output: alignments: pairs of translated words
- When words have unique sources, we can represent the alignment as a (forward) alignment function a from French to English positions
Example: nous acceptons votre opinion . / we accept your view .

Evaluating TMs
- How do we measure quality of a word-to-word model?
- Method 1: use in an end-to-end translation system
  - Hard to measure translation quality
    - Option: human judges
    - Option: reference translations (NIST, BLEU)
    - Option: combinations (HTER)
  - Actually, no one uses word-to-word models alone as TMs
- Method 2: measure quality of the alignments produced
  - Easy to measure
  - Hard to know what the gold alignments should be
  - Often does not correlate well with translation quality (like perplexity in LMs)

Alignment Error Rate
Given sure alignments S, possible alignments P (with S ⊆ P), and predicted alignments A:
Precision = |A ∩ P| / |A|
Recall = |A ∩ S| / |S|
AER(A; S, P) = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
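A minimal sketch of this computation in Python (not from the lecture; names are illustrative), assuming each alignment is a set of (English position, French position) pairs:

```python
def alignment_scores(sure, possible, predicted):
    """Precision / recall / AER over alignment link sets (Och & Ney style).

    sure, possible, predicted: sets of (english_pos, french_pos) pairs,
    with sure assumed to be a subset of possible. Lower AER is better.
    """
    a_and_s = len(predicted & sure)
    a_and_p = len(predicted & possible)
    precision = a_and_p / len(predicted)
    recall = a_and_s / len(sure)
    aer = 1 - (a_and_s + a_and_p) / (len(predicted) + len(sure))
    return precision, recall, aer
```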

IBM Models 1/2
E: Thank you , I shall do so gladly
F: Gracias , lo haré de muy buen grado
A: one alignment variable per French position, pointing to an English position
Model parameters:
- Transitions: P(A_2 = 3)
- Emissions: P(F_1 = Gracias | E_{A_1} = Thank)

Problems with Model 1
There’s a reason they designed Models 2-5! Problems: alignments jump around, and everything gets aligned to rare words.
Experimental setup:
- Training data: 1.1M sentences of French-English text, Canadian Hansards
- Evaluation metric: Alignment Error Rate (AER)
- Evaluation data: 447 hand-aligned sentences

Intersected Model 1
- Post-intersection: standard practice is to train models in each direction, then intersect their predictions [Och and Ney, 03]
- The second model is basically a filter on the first:
  - Precision jumps, recall drops
  - We end up not guessing hard alignments

Model          P/R      AER
Model 1 E→F    82/–     –
Model 1 F→E    85/–     –
Model 1 AND    96/46    34.8
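A sketch of the post-intersection step, assuming each directional model's output is a set of (i, j) links in a shared orientation (a hypothetical helper, not GIZA++'s actual interface):

```python
def intersect_alignments(e_to_f_links, f_to_e_links):
    """Keep only links that both directional Model 1 runs agree on.
    Agreement filtering is why precision jumps while recall drops:
    any link either model is unsure about disappears."""
    return e_to_f_links & f_to_e_links
```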

Joint Training?
Overall:
- Similar high precision to post-intersection
- But recall is much higher
- More confident about positing non-null alignments

Model          P/R      AER
Model 1 E→F    82/–     –
Model 1 F→E    85/–     –
Model 1 AND    96/46    34.8
Model 1 INT    93/69    19.5

Monotonic Translation
Le Japon secoué par deux nouveaux séismes
Japan shaken by two new quakes

Local Order Change
Le Japon est au confluent de quatre plaques tectoniques
Japan is at the junction of four tectonic plates

IBM Model 2
- Alignments tend to the diagonal (broadly at least)
- Other schemes for biasing alignments towards the diagonal:
  - Relative vs. absolute alignment
  - Asymmetric distances
  - Learning a full multinomial over distances

EM for Models 1/2
- Model parameters: translation probabilities (1+2), distortion parameters (2 only)
- Start with uniform parameters, including translations from NULL
- For each sentence:
  - For each French position j:
    - Calculate the posterior over English positions, P(A_j = i | f, e) (or just use the best single alignment)
    - Increment the count of word f_j with word e_i by these amounts
  - Also re-estimate distortion probabilities for Model 2
- Iterate until convergence
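A compact sketch of the Model 1 case (translation probabilities only; no distortion, no smoothing), assuming the bitext is a list of (french_tokens, english_tokens) pairs:

```python
from collections import defaultdict

def train_model1(bitext, iterations=10):
    """IBM Model 1 EM. bitext: list of (french_tokens, english_tokens) pairs.
    Returns t where t[f][e] approximates P(f | e); a NULL token is
    prepended to every English sentence."""
    # Uniform initialization over the French vocabulary.
    f_vocab = {f for fs, _ in bitext for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: defaultdict(lambda: uniform))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # expected c(f, e)
        total = defaultdict(float)                        # expected c(e)
        for fs, es in bitext:
            es = ['<NULL>'] + list(es)
            for f in fs:
                # E-step: posterior over which English word generated f.
                z = sum(t[f][e] for e in es)
                for e in es:
                    p = t[f][e] / z
                    count[f][e] += p
                    total[e] += p
        # M-step: renormalize expected counts into probabilities.
        for f in count:
            for e in count[f]:
                t[f][e] = count[f][e] / total[e]
    return t
```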

Example: Model 2 Helps

Phrase Movement
Des tremblements de terre ont à nouveau touché le Japon jeudi 4 novembre.
On Tuesday Nov. 4, earthquakes rocked Japan once again

The HMM Model
E: Thank you , I shall do so gladly
F: Gracias , lo haré de muy buen grado
A: one alignment variable per French position, pointing to an English position
Model parameters:
- Transitions: P(A_2 = 3 | A_1 = 1)
- Emissions: P(F_1 = Gracias | E_{A_1} = Thank)

The HMM Model
- Model 2 preferred global monotonicity
- We want local monotonicity: most jumps are small
- HMM model (Vogel 96): transitions condition on the previous alignment position (a distance-based sketch follows below)
- Re-estimate using the forward-backward algorithm
- Handling nulls requires some care
- What are we still missing?
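One way to realize "most jumps are small" is to let transitions depend only on the jump size. A sketch of that reparameterization (the forward-backward re-estimation itself is omitted; names are illustrative):

```python
def jump_transition_probs(jump_counts, I, eps=1e-6):
    """Transitions that depend only on the jump size:
    P(A_j = i2 | A_{j-1} = i1) is proportional to c(i2 - i1), so mass
    concentrates on small local jumps. jump_counts: dict mapping jump
    size -> (expected) count from EM; I: English sentence length;
    eps smooths jumps never seen in training."""
    probs = {}
    for i1 in range(1, I + 1):
        z = sum(jump_counts.get(i2 - i1, 0.0) + eps for i2 in range(1, I + 1))
        for i2 in range(1, I + 1):
            probs[(i1, i2)] = (jump_counts.get(i2 - i1, 0.0) + eps) / z
    return probs
```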

HMM Examples

AER for HMMs

Model          AER
Model 1 INT    19.5
HMM E→F        11.4
HMM F→E        10.8
HMM AND        7.1
HMM INT        4.7
GIZA M4 AND    6.9

IBM Models 3/4/5 [from Al-Onaizan and Knight, 1998]
Mary did not slap the green witch
n(3|slap): Mary not slap slap slap the green witch
P(NULL): Mary not slap slap slap NULL the green witch
t(la|the): Mary no daba una bofetada a la verde bruja
d(j|i): Mary no daba una bofetada a la bruja verde

Examples: Translation and Fertility

Example: Idioms
il hoche la tête / he is nodding

Example: Morphology

Some Results [Och and Ney 03]

Decoding
In these word-to-word models:
- Finding best alignments is easy
- Finding translations is hard (why?)

Bag “Generation” (Decoding)

Bag Generation as a TSP
- Imagine bag generation with a bigram LM
- Words are nodes
- Edge weights are P(w | w′)
- Valid sentences are Hamiltonian paths
- Not the best news for word-based MT!
Example bag: { it, is, not, clear, . }
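A brute-force sketch that makes the TSP connection concrete: scoring every permutation of the bag with an assumed bigram_logprob(w1, w2) helper. The best ordering is a max-weight Hamiltonian path, so this enumeration is O(n!) — fine for a toy bag, hopeless at scale:

```python
from itertools import permutations

def best_bag_order(words, bigram_logprob):
    """Recover an order for a bag of words by scoring every permutation
    with a bigram LM. bigram_logprob(w1, w2) is an assumed helper that
    returns log P(w2 | w1)."""
    best, best_score = None, float('-inf')
    for perm in permutations(words):
        score = sum(bigram_logprob(w1, w2) for w1, w2 in zip(perm, perm[1:]))
        if score > best_score:
            best, best_score = perm, score
    return best
```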

IBM Decoding as a TSP

Decoding, Anyway
- Simplest possible decoder: enumerate sentences, score each with TM and LM
- Greedy decoding: assign each French word its most likely English translation
- Operators:
  - Change a translation
  - Insert a word into the English (zero-fertility French)
  - Remove a word from the English (null-generated French)
  - Swap two adjacent English words
- Do hill-climbing (or annealing)

Greedy Decoding

Stack Decoding
- Stack decoding:
  - Beam search
  - Usually A* estimates for completion cost
  - One stack per candidate sentence length
- Other methods: dynamic programming decoders are possible if we make assumptions about the set of allowable permutations
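A schematic beam-search decoder in this spirit. Here hypotheses are binned by the number of source words covered (one common choice of stack organization), the A* completion-cost estimate is left out, and translate_options and lm_score are assumed helpers:

```python
import heapq

def stack_decode(src_words, translate_options, lm_score, beam=100):
    """Sketch of stack decoding. A hypothesis is (score, covered, prefix);
    one stack per number of covered source words, beam-pruned.
    translate_options(word) -> [(english_word, translation_logprob), ...]
    lm_score(prefix, word) -> n-gram log-probability (assumed helpers)."""
    n = len(src_words)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, frozenset(), ()))
    for size in range(n):
        # Expand only the `beam` best hypotheses in this stack.
        for score, covered, prefix in heapq.nlargest(
                beam, stacks[size], key=lambda h: h[0]):
            for j in range(n):
                if j in covered:
                    continue
                for e_word, t_logprob in translate_options(src_words[j]):
                    stacks[size + 1].append((
                        score + t_logprob + lm_score(prefix, e_word),
                        covered | {j},
                        prefix + (e_word,)))
    finished = stacks[n]
    return max(finished, key=lambda h: h[0])[2] if finished else None
```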

Phrase-Based Systems
Pipeline: sentence-aligned corpus → word alignments → phrase table (translation model)
Example phrase table entries:
cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
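Loading a table in the ||| format shown above might look like the sketch below; real Moses-style tables carry several feature scores per pair, so the single-score assumption here is a simplification:

```python
def load_phrase_table(path):
    """Parse 'src ||| tgt ||| score' lines into a dict:
    table[src_phrase] -> list of (tgt_phrase, score)."""
    table = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            src, tgt, score = [field.strip() for field in line.split('|||')]
            table.setdefault(src, []).append((tgt, float(score)))
    return table
```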

Phrase-Based Decoding
这 7 人 中包括 来自 法国 和 俄罗斯 的 宇航 员 .
(Gloss: these 7 people include astronauts coming from France and Russia.)
Decoder design is important: [Koehn et al. 03]

The Pharaoh “Model” [Koehn et al, 2003]
Score components: segmentation, translation, distortion
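The three labels correspond to the factors of the standard phrase-based score from Koehn et al. 03, reconstructed here since the slide showed it as a figure: after segmenting the source into phrases f̄_1 … f̄_I,

```latex
\hat{e} = \arg\max_{e} \; p_{\mathrm{LM}}(e) \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(a_i - b_{i-1}),
\qquad d(x) = \alpha^{|x|}
```

where a_i is the start position of the source phrase that produced the i-th English phrase and b_{i-1} is the end position of the previous one.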

The Pharaoh “Model” Where do we get these counts?

Phrase Weights

Phrase Scoring
Example (aligned pair from the slide’s figure): les chats aiment le poisson frais . / cats like fresh fish .
- Learning weights has been tried, several times:
  - [Marcu and Wong, 02]
  - [DeNero et al, 06]
  - … and others
- Seems not to work well, for a variety of partially understood reasons
- Main issue: big chunks get all the weight; obvious priors don’t help
- Though, see [DeNero et al 08]
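The counts the previous slide asks about come from phrase pairs extracted off the aligned corpus; the standard heuristic estimator is plain relative frequency, sketched here (the input format is an assumption):

```python
from collections import defaultdict

def relative_frequency_scores(phrase_pair_counts):
    """phi(f_phrase | e_phrase) = count(e, f) / count(e), computed from
    counts of extracted phrase pairs. phrase_pair_counts: dict mapping
    (e_phrase, f_phrase) -> corpus count."""
    e_totals = defaultdict(float)
    for (e, f), c in phrase_pair_counts.items():
        e_totals[e] += c
    return {(e, f): c / e_totals[e]
            for (e, f), c in phrase_pair_counts.items()}
```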

Phrase Size
- Phrases do help
- But they don’t need to be long
- Why should this be?

Lexical Weighting
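For reference, since the slide's content was a figure: the usual lexical weight from Koehn et al. 03 scores each foreign word in the phrase by the average word-translation probability of the English words it aligns to under the internal alignment a,

```latex
p_w(\bar{f} \mid \bar{e}, a) \;=\; \prod_{i=1}^{|\bar{f}|} \frac{1}{|\{\, j \mid (i, j) \in a \,\}|} \sum_{(i, j) \in a} w(f_i \mid e_j)
```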

The Pharaoh Decoder
Probabilities at each step include LM and TM

Hypothesis Lattices

Pruning
- Problem: easy partial analyses are cheaper
- Solution 1: use beams per foreign subset
- Solution 2: estimate forward costs (A*-like)

WSD?
- Remember when we discussed WSD?
- Word-based MT systems rarely have a WSD step
- Why not?

Overview: Extracting Phrases
Pipeline: sentence-aligned corpus → directional word alignments → intersected and grown word alignments → phrase table (translation model)
Example phrase table entries (as above): cat ||| chat ||| 0.9, the cat ||| le chat ||| 0.8, dog ||| chien ||| 0.8, house ||| maison ||| 0.6, my house ||| ma maison ||| 0.9, language ||| langue ||| 0.9, …

Complex Configurations

Extracting Phrases
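A sketch of the standard consistency-based extraction: a source/target span pair counts as a phrase if no alignment link crosses the rectangle's border. Extension over unaligned boundary words is omitted for brevity:

```python
def extract_phrases(e_words, f_words, alignment, max_len=7):
    """Extract all phrase pairs consistent with a word alignment.
    alignment: set of (i, j) links, i indexing English, j indexing French."""
    phrases = []
    for i1 in range(len(e_words)):
        for i2 in range(i1, min(i1 + max_len, len(e_words))):
            # French positions linked to the English span [i1, i2].
            js = [j for (i, j) in alignment if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            if j2 - j1 >= max_len:
                continue
            # Consistency: no link from inside [j1, j2] to outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                phrases.append((' '.join(e_words[i1:i2 + 1]),
                                ' '.join(f_words[j1:j2 + 1])))
    return phrases
```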

Bidirectional Alignment

Alignment Heuristics

Sources of Alignments