Parameter estimation in IBM Models: Ling 572 Fei Xia Week ??

Outline
IBM Model 1 review (from LING 571):
– Word alignment
– Modeling
– Training formulae

IBM Model Basics
Classic paper: Brown et al. (1993)
Translation direction: F → E (Fr → Eng)
Resource required:
– Parallel data (a set of “sentence” pairs)
Main concepts:
– Source-channel model
– Hidden word alignment
– EM training

Intuition
Sentence pairs (assume the word mapping is one-to-one):
– (1) S: a b c d e   T: l m n o p
– (2) S: c a e   T: p n m
– (3) S: d a c   T: n p l
⇒ (b, o), (d, l), (e, m), and either {(a, p), (c, n)} or {(a, n), (c, p)}

Source-channel model for MT
Eng sent → noisy channel → Fr sent
Two types of parameters:
– Language model: P(E)
– Translation model: P(F | E)
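
The two parameter types combine at decoding time through Bayes' rule (standard noisy-channel reasoning, added here as a reminder):

```latex
\hat{E} = \arg\max_{E} P(E \mid F) = \arg\max_{E} P(E)\, P(F \mid E)
```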

Word alignment
a(j) = i, also written a_j = i; a = (a_1, …, a_m)
Ex:
– F: f_1 f_2 f_3 f_4 f_5
– E: e_1 e_2 e_3 e_4
– a_4 = 3
– a = (0, 1, 1, 3, 2)

An alignment, a, is a function from Fr word positions to Eng word positions: a(j) = i means that f_j is generated by e_i. The constraint: each Fr word is generated by exactly one Eng word (including e_0), i.e., a maps each j in {1, …, m} to exactly one i in {0, …, l}.
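
A tiny Python illustration of this definition, using the example alignment from the previous slide (the variable names are my own, not from the slides):

```python
# a[j-1] = i means Fr word f_j is generated by Eng word e_i (i = 0 is the NULL word e_0).
F = ["f1", "f2", "f3", "f4", "f5"]
E = ["NULL", "e1", "e2", "e3", "e4"]      # position 0 holds e_0

a = [0, 1, 1, 3, 2]                       # exactly one Eng position per Fr word

for j, i in enumerate(a, start=1):
    print(f"{F[j-1]} (f_{j}) is generated by {E[i]}; a_{j} = {i}")
```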

Modeling P(F | E) with alignment
P(F | E) = Σ_a P(F, a | E), summing over all possible hidden alignments a

Notation
E: the Eng sentence, E = e_1 … e_l
e_i: the i-th Eng word
F: the Fr sentence, F = f_1 … f_m
f_j: the j-th Fr word
e_0: the Eng NULL word
f_0: the Fr NULL word
a_j: the position of the Eng word that generates f_j

Notation (cont)
l: Eng sentence length
m: Fr sentence length
i: Eng word position
j: Fr word position
e: an Eng word
f: a Fr word

Generative process
To generate F from E:
– Pick a length m for F, with prob P(m | l)
– Choose an alignment a, with prob P(a | E, m)
– Generate the Fr sent given the Eng sent and the alignment, with prob P(F | E, a, m)
Another way to look at it:
– Pick a length m for F, with prob P(m | l)
– For j = 1 to m:
  – Pick an Eng word position a_j, with prob P(a_j | j, m, l)
  – Pick a Fr word f_j according to the Eng word e_i, where i = a_j, with prob P(f_j | e_i)
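
A small Python sketch of the second view of this generative story, under Model 1's assumptions (uniform alignment probability); the data structures `length_prob` and `t` and all names here are illustrative, not from the slides:

```python
import random

def generate_F(E, length_prob, t):
    """Sample a Fr sentence from an Eng one, following the generative story:
    pick m, then for each position pick a_j and then f_j ~ t(. | e_{a_j})."""
    l = len(E)
    E0 = ["NULL"] + E                           # e_0 is the Eng NULL word
    # Pick a length m for F, with prob P(m | l); length_prob[l] maps m -> prob.
    lengths, probs = zip(*length_prob[l].items())
    m = random.choices(lengths, weights=probs)[0]
    F, a = [], []
    for _ in range(m):
        a_j = random.randint(0, l)              # Model 1: a_j uniform over 0..l
        e = E0[a_j]
        words, wprobs = zip(*t[e].items())      # t[e] maps f -> t(f | e)
        F.append(random.choices(words, weights=wprobs)[0])
        a.append(a_j)
    return F, a
```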

Decomposition
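
Following the generative story, the joint probability of F and its alignment a decomposes as (a standard reconstruction):

```latex
P(F, a \mid E) = P(m \mid l)\; P(a \mid E, m)\; P(F \mid E, a, m)
```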

Approximation
– The Fr sent length depends only on the Eng sent length: P(m | E) = P(m | l)
– Each Fr word depends only on the Eng word that generates it: P(f_j | E, a, m, f_1 … f_{j-1}) = t(f_j | e_{a_j})
– Estimating P(a | E, m): all alignments are equally likely, so P(a | E, m) = 1 / (l+1)^m

Decomposition
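
Substituting the approximations into that decomposition and summing over alignments gives (standard Model 1 algebra):

```latex
P(F \mid E) = \sum_{a} P(F, a \mid E)
            = \frac{P(m \mid l)}{(l+1)^{m}} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \; \prod_{j=1}^{m} t(f_j \mid e_{a_j})
```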

Final formula and parameters for Model 1
Two types of parameters:
– Length prob: P(m | l)
– Translation prob: P(f_j | e_i), or t(f_j | e_i)
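
The nested sums factor into a product of per-position sums, which is the final Model 1 formula (reconstructed):

```latex
P(F \mid E) = \frac{P(m \mid l)}{(l+1)^{m}} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
```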

Training
Mathematically motivated:
– There is an objective function to optimize
– Several clever tricks are used
The resulting formulae:
– match intuition
– can be calculated efficiently
EM algorithm:
– Hill climbing: each iteration is guaranteed not to decrease the objective function
– It is not guaranteed to reach the global optimum

Training: Fractional counts
Let Ct(f, e) be the fractional count of the (f, e) pair in the training data, given the alignment prob P: for each sentence pair, sum over alignments a the alignment prob P(a | E, F) times the actual count of times e and f are linked in (E, F) by alignment a.
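
Written out, with δ(x, y) = 1 when x = y and 0 otherwise (reconstructing the annotated equation on the slide):

```latex
Ct(f, e) = \sum_{(E, F)} \; \sum_{a} P(a \mid E, F) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j})
```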

Estimating P(a | E, F)
We could enumerate all the alignments and estimate P(a | E, F) directly.
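
By the definition of conditional probability (the standard way to expand this):

```latex
P(a \mid E, F) = \frac{P(F, a \mid E)}{P(F \mid E)} = \frac{P(F, a \mid E)}{\sum_{a'} P(F, a' \mid E)}
```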

Formulae so far ⇒ a new estimate for t(f | e)
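
The new estimate normalizes the fractional counts per Eng word (the standard Model 1 M-step update):

```latex
t(f \mid e) = \frac{Ct(f, e)}{\sum_{f'} Ct(f', e)}
```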

The algorithm
1. Start with an initial estimate of t(f | e), e.g., a uniform distribution
2. Calculate P(a | F, E)
3. Calculate Ct(f, e); normalize to get t(f | e)
4. Repeat Steps 2-3 until the “improvement” is too small

No need to enumerate all word alignments
Luckily, for Model 1 there is a way to calculate Ct(f, e) efficiently.
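
The reason: under Model 1 the alignment posterior factorizes across Fr positions, so the expected counts require only a sum over Eng positions rather than a sum over all (l+1)^m alignments:

```latex
P(a_j = i \mid E, F) = \frac{t(f_j \mid e_i)}{\sum_{i'=0}^{l} t(f_j \mid e_{i'})}
```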

The algorithm
1. Start with an initial estimate of t(f | e), e.g., a uniform distribution
2. Calculate P(a | F, E)
3. Calculate Ct(f, e); normalize to get t(f | e)
4. Repeat Steps 2-3 until the “improvement” is too small
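
A compact Python sketch of this loop for Model 1, with the E-step computed efficiently via the factorized posterior above; the bitext format (token lists), the NULL handling, and the fixed iteration count in place of a convergence test are my own choices:

```python
from collections import defaultdict

def train_model1(bitext, num_iters=10):
    """bitext: list of (F, E) sentence pairs, each a list of word strings.
    Returns t[e][f] = t(f | e). A NULL word is added to every Eng sentence."""
    bitext = [(F, ["NULL"] + E) for F, E in bitext]
    f_vocab = {f for F, _ in bitext for f in F}
    # Step 1: uniform initial estimate of t(f | e).
    t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(f_vocab)))

    for _ in range(num_iters):
        count = defaultdict(lambda: defaultdict(float))   # fractional counts Ct(f, e)
        for F, E in bitext:
            for f in F:
                # Steps 2-3 (E-step): P(a_j = i | E, F) = t(f | e_i) / sum_i' t(f | e_i'),
                # accumulated directly into the fractional counts Ct(f, e).
                z = sum(t[e][f] for e in E)
                for e in E:
                    count[e][f] += t[e][f] / z
        # Step 3 (M-step): normalize the counts to get the new t(f | e).
        for e, row in count.items():
            total = sum(row.values())
            for f, c in row.items():
                t[e][f] = c / total
    return t
```

On a toy bitext like the one in the intuition slide, a few iterations should already concentrate t(f | e) on the consistently co-occurring pairs.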

Summary of Model 1
Modeling:
– Pick the length of F with prob P(m | l)
– For each position j:
  – Pick an Eng word position a_j, with prob P(a_j | j, m, l)
  – Pick a Fr word f_j according to the Eng word e_i, with prob t(f_j | e_i), where i = a_j
– The resulting formula can be calculated efficiently
Training: the EM algorithm; the update can be done efficiently.
Finding the best alignment: can be done easily (see below).
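
Because the alignment positions are independent given E under Model 1, the best (Viterbi) alignment can be read off one position at a time:

```latex
\hat{a}_j = \arg\max_{0 \le i \le l} t(f_j \mid e_i), \qquad j = 1, \dots, m
```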

New stuff

EM algorithm
EM: expectation maximization
In a model with hidden variables (e.g., word alignments), how can we estimate the model parameters? EM does the following:
– E-step: take the current model parameters and calculate the expected values of the hidden data.
– M-step: use the expected values to re-estimate the model parameters, maximizing the likelihood of the training data.
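
In general notation (observed data X, hidden data Z, parameters θ), the two steps are usually written as follows; this is the textbook formulation, added here for reference:

```latex
\text{E-step: } Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\big[\log P(X, Z \mid \theta)\big]
\qquad
\text{M-step: } \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})
```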

Objective function
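
For Model 1 the objective is the log-likelihood of the observed Fr sentences given their Eng counterparts, with the alignments summed out (a standard reconstruction of the slide's formula):

```latex
L(\theta) = \sum_{(E, F)} \log P(F \mid E; \theta)
          = \sum_{(E, F)} \log \sum_{a} P(F, a \mid E; \theta)
```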

Training Summary
Mathematically motivated:
– There is an objective function to optimize
– Several clever tricks are used
The resulting formulae:
– match intuition
– can be calculated efficiently
EM algorithm:
– Hill climbing: each iteration is guaranteed not to decrease the objective function
– It is not guaranteed to reach the global optimum