1 An Introduction to Statistical Machine Translation Dept. of CSIE, NCKU Yao-Sheng Chang Date: 2011.04.12.

2 Outline
 Introduction
 Peter Brown et al., "The Mathematics of Statistical Machine Translation: Parameter Estimation," Computational Linguistics, vol. 19, no. 2, 1993.
 Model 1

3 Introduction (1)
 Machine translation has become practical
   Statistical methods, information theory
   Faster computers, larger storage
   Machine-readable corpora
 Statistical methods have proven their value
   Automatic speech recognition
   Lexicography, natural language processing

4 Introduction (2)
 Translation involves many cultural aspects
 We consider only the translation of individual sentences, and merely acceptable ones
 Every sentence in one language is a possible translation of any sentence in the other
 Assign to each pair (S, T) a probability Pr(T|S): the probability that a translator will produce T in the target language when presented with S in the source language

5 Statistical Machine Translation (SMT)
 Noisy-channel problem

6 Fundamentals of SMT
 Given a French string f, the job of our translation system is to find the string e that the speaker had in mind when producing f. By Bayes' theorem, Pr(e|f) = Pr(e) Pr(f|e) / Pr(f).
 Since the denominator Pr(f) is constant for a given f, the best e is the one with the greatest Pr(e) Pr(f|e).

7 Practical Challenges
 Computation of the translation model Pr(f|e)
 Computation of the language model Pr(e)
 Decoding (i.e., searching for the e that maximizes Pr(f|e) · Pr(e))
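
To make the noisy-channel picture concrete, here is a minimal Python sketch (not part of the original slides): it scores a handful of candidate English strings with toy language-model and translation-model probabilities and picks the argmax. The candidate list and all probability values are invented purely for illustration.

```python
# Toy noisy-channel decoding: e* = argmax_e Pr(e) * Pr(f|e)
# All probabilities below are invented for illustration only.

candidates = ["the house", "house the", "a house"]

# Hypothetical language model Pr(e)
lm = {"the house": 0.5, "house the": 0.01, "a house": 0.3}

# Hypothetical translation model Pr(f|e) for the French input "la maison"
tm = {"the house": 0.2, "house the": 0.2, "a house": 0.05}

f = "la maison"
best_e = max(candidates, key=lambda e: lm[e] * tm[e])
print(best_e)  # -> "the house"
```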

8 Alignment of case 1

9 Alignment of case 2

10 Alignment of case 3

11 Formulation of Alignment (1)
 Let e = e_1^l = e_1 e_2 … e_l and f = f_1^m = f_1 f_2 … f_m.
 An alignment between a pair of strings e and f is a mapping of every word f_j to some word e_i.
 In other words, an alignment a between e and f says that the word f_j, 1 ≤ j ≤ m, is generated by the word e_{a_j}, with a_j ∈ {0, 1, …, l}.
 There are (l+1)^m different alignments between e and f (including NULL, i.e., a_j = 0 when f_j is mapped to no English word).
 e = e_1 e_2 … e_i … e_l, f = f_1 f_2 … f_j … f_m; a_j = i means f_j is connected to e_i.
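
As a quick sanity check on the (l+1)^m count, here is a small Python sketch (not from the slides) that enumerates all alignment vectors a_1 … a_m for a toy sentence pair, with position 0 standing for NULL. The example sentences are invented.

```python
from itertools import product

e = ["NULL", "the", "house"]  # English words, index 0 = NULL; l = 2
f = ["la", "maison"]          # French words; m = 2
l, m = len(e) - 1, len(f)

# Each alignment is a vector (a_1, ..., a_m) with a_j in {0, ..., l}.
alignments = list(product(range(l + 1), repeat=m))
assert len(alignments) == (l + 1) ** m  # (2+1)^2 = 9

for a in alignments:
    print([(f[j], e[a[j]]) for j in range(m)])
```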

12 Formulation of Alignment (2)
 Probability of an alignment
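
The slide's equation is not reproduced in the transcript. In Brown et al. (1993), the joint probability of a French string and an alignment given the English string is decomposed, without loss of generality, as

$$\Pr(\mathbf{f}, a \mid \mathbf{e}) = \Pr(m \mid \mathbf{e}) \prod_{j=1}^{m} \Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, m, \mathbf{e})\, \Pr(f_j \mid a_1^{j}, f_1^{j-1}, m, \mathbf{e}).$$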

13 Translation Model
 The alignment a can be represented by a series a_1^m = a_1 a_2 … a_m of m values, each between 0 and l, such that if the word in position j of the French string is connected to the word in position i of the English string, then a_j = i, and if it is not connected to any English word, then a_j = 0 (NULL).

14 IBM Model I (1)

15 IBM Model I (2)
 The alignment is determined by specifying the values of a_j for j from 1 to m, each of which can take any value from 0 to l.
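
The Model 1 equations themselves do not survive in the transcript. In Brown et al.'s Model 1, the string-length probability is taken to be a small constant ε, the alignment probabilities are uniform over the l+1 English positions, and the word translation probabilities t(f|e) are the only parameters, giving

$$\Pr(\mathbf{f}, a \mid \mathbf{e}) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j}),
\qquad
\Pr(\mathbf{f} \mid \mathbf{e}) = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}).$$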

16 Constrained Maximization
 We wish to adjust the translation probabilities t(f|e) so as to maximize Pr(f|e), subject to the constraint that for each e, Σ_f t(f|e) = 1.

17 Lagrange Multipliers (1)
 Method of Lagrange multipliers with one constraint: if f(x, y) has a maximum or minimum subject to the constraint g(x, y) = 0, then it occurs at one of the critical points of the function F defined by F(x, y, λ) = f(x, y) + λ g(x, y).
 f(x, y) is called the objective function.
 g(x, y) = 0 is called the constraint equation.
 F(x, y, λ) is called the Lagrange function.
 λ is called the Lagrange multiplier.

18 Lagrange Multipliers (2)
 Example 1: maximize an objective function subject to a constraint.
 Form the Lagrange function F and set its partial derivatives to zero, giving equations (2)–(4).
 Substituting into (2) and (3) yields (5) and (6); substituting (5) and (6) into (4) determines λ, from which the critical point and hence the maximum value follow.
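
Since the example's equations are not reproduced in the transcript, here is a minimal worked example of the same procedure (chosen for illustration, not necessarily the one on the original slide): maximize f(x, y) = xy subject to g(x, y) = x + y − 1 = 0.

$$F(x, y, \lambda) = xy + \lambda (x + y - 1)$$

$$\frac{\partial F}{\partial x} = y + \lambda = 0, \qquad \frac{\partial F}{\partial y} = x + \lambda = 0, \qquad \frac{\partial F}{\partial \lambda} = x + y - 1 = 0$$

The first two equations give x = y = −λ; substituting into the third gives x = y = 1/2 and λ = −1/2, so the maximum value is f(1/2, 1/2) = 1/4.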

19 Lagrange Multipliers (3)
 Following standard practice for constrained maximization, we introduce Lagrange multipliers λ_e and seek an unconstrained extremum of the auxiliary function h(t, λ).
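
The auxiliary function itself is missing from the transcript; in Brown et al. (1993) it is

$$h(t, \lambda) = \Pr(\mathbf{f} \mid \mathbf{e}) - \sum_{e} \lambda_e \left( \sum_{f} t(f \mid e) - 1 \right).$$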

20 Derivation (1)
 The partial derivative of h with respect to t(f|e) is

$$\frac{\partial h}{\partial t(f \mid e)} = \frac{\epsilon}{(l+1)^m} \sum_{a} \sum_{j=1}^{m} \delta(f, f_j)\,\delta(e, e_{a_j}) \prod_{k \neq j} t(f_k \mid e_{a_k}) \;-\; \lambda_e,$$

 where δ is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise.

21 Derivation (2)
 We call the expected number of times that e connects to f in the translation (f|e) the count of f given e for (f|e), and denote it by c(f|e; f, e). By definition:
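
The defining equation does not appear in the transcript; in Brown et al. (1993) this count is

$$c(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{a} \Pr(a \mid \mathbf{f}, \mathbf{e}) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j}).$$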

22 Derivation (3)
 Replacing λ_e by λ_e Pr(f|e), Equation (11) can be written very compactly (see below).
 In practice, our training data consists of a set of translations (f^(1)|e^(1)), (f^(2)|e^(2)), …, (f^(S)|e^(S)), so the equation becomes a sum of counts over the whole corpus (see below).
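
Neither equation is reproduced in the transcript; in Brown et al. (1993) they are

$$t(f \mid e) = \lambda_e^{-1}\, c(f \mid e; \mathbf{f}, \mathbf{e})$$

and, for a corpus of S translation pairs,

$$t(f \mid e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f \mid e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}),$$

where λ_e now serves only as a normalizing constant ensuring that Σ_f t(f|e) = 1.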

23 Derivation (4)
 Rearranging the sum over the (l+1)^m alignments into a product of sums gives an expression that can be evaluated efficiently.
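
The key identity, again following Brown et al. (1993), is

$$\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) = \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i),$$

so that the count can be computed as

$$c(f \mid e; \mathbf{f}, \mathbf{e}) = \frac{t(f \mid e)}{t(f \mid e_0) + \cdots + t(f \mid e_l)} \sum_{j=1}^{m} \delta(f, f_j) \sum_{i=0}^{l} \delta(e, e_i).$$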

24 Derivation (5)
 Thus, the number of operations necessary to calculate a count is proportional to l + m rather than to (l+1)^m, as Equation (12) might suggest.
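
Putting the pieces together, the derivation yields an iterative (EM) procedure: compute expected counts c(f|e) with the current t(f|e), then renormalize. Below is a minimal Python sketch of that loop, written as an illustration under the assumptions above rather than as the authors' original code; sentence pairs are token lists and a NULL token is prepended to each English sentence. The toy corpus is invented.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """corpus: list of (french_tokens, english_tokens) pairs."""
    # Prepend NULL so every French word may align to an empty English position.
    corpus = [(f, ["NULL"] + e) for f, e in corpus]

    # Uniform initialization of t(f|e) over the French vocabulary.
    f_vocab = {fw for f, _ in corpus for fw in f}
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f|e)
        total = defaultdict(float)   # sum over f of c(f|e)

        # E-step: collect expected counts using the factorized sum over alignments.
        for f_sent, e_sent in corpus:
            for fw in f_sent:
                denom = sum(t[(fw, ew)] for ew in e_sent)  # t(f|e_0)+...+t(f|e_l)
                for ew in e_sent:
                    c = t[(fw, ew)] / denom
                    count[(fw, ew)] += c
                    total[ew] += c

        # M-step: renormalize so that sum_f t(f|e) = 1 for each e.
        for (fw, ew), c in count.items():
            t[(fw, ew)] = c / total[ew]

    return dict(t)

# Toy usage with an invented two-pair corpus:
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = train_ibm_model1(corpus)
print(sorted(t.items(), key=lambda kv: -kv[1])[:5])
```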

25 EM Algorithm

26 EM Algorithm

27 Introduction (1) In statistical computing, an expectation-maximization (EM) algorithm is a method for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM is frequently used for data clustering in machine learning and computer vision.

28 Introduction (2) EM alternates between an expectation (E) step, which computes an expectation of the likelihood by treating the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step. The parameters found in the M step are then used to begin another E step, and the process is repeated.
 (From: Wikipedia, "Expectation–maximization algorithm")

29
 The EM algorithm is a soft version of K-means clustering.
 The idea is that the observed data are generated by several underlying causes.
 Each cause contributes independently to the generation process, but we only see the final mixture, without information about which cause contributed what.

30
 Observable data: X = {x_1, …, x_n}, where each x_i is an observed data vector.
 Unobservable / hidden data: Z = {z_1, …, z_n}, where each z_i = (z_{i1}, …, z_{ik}).
 Each z_{ij} can be interpreted as a cluster-membership probability; the component z_{ij} is 1 if object i is a member of cluster j.

31 Initial Assumption
 Suppose we have a data set X = {x_1, …, x_n}, where each x_i ∈ R^d is the vector corresponding to the i-th data point. Further, assume the samples are drawn from a mixture of k Gaussians N(μ_1, Σ_1), …, N(μ_k, Σ_k).
 The p.d.f. of the multivariate normal distribution is

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right).$$

 In one dimension, a normal distribution in a variate x with mean μ and variance σ² has probability density

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

32 E-step
 Let Z = [z_{ij}] be an n × k matrix, where z_{ij} is the probability that data point x_i was generated by cluster j. If we write π_j for the weight of cluster j (with Σ_j π_j = 1), then by Bayes' formula we have

$$z_{ij} = \frac{\pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l\,\mathcal{N}(x_i \mid \mu_l, \Sigma_l)}.$$

33 M-step
 Re-estimate the parameters from the expected memberships z_{ij}:

$$\pi_j = \frac{1}{n}\sum_{i=1}^{n} z_{ij}, \qquad \mu_j = \frac{\sum_{i=1}^{n} z_{ij}\,x_i}{\sum_{i=1}^{n} z_{ij}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{n} z_{ij}\,(x_i-\mu_j)(x_i-\mu_j)^{\top}}{\sum_{i=1}^{n} z_{ij}}.$$

34 Log Likelihood
 The log likelihood of the data set X given the parameters Θ = {π_j, μ_j, Σ_j} is

$$\log L(X \mid \Theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j),$$

 where π_j is the weight of cluster j. Notice that Σ_{j=1}^{k} π_j = 1.
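
To tie the E-step, M-step, and log likelihood together, here is a compact numpy sketch of EM for a Gaussian mixture, written as a minimal illustration of the update formulas above (full-covariance Gaussians assumed; not the original slides' code). The synthetic data set at the end is invented for the usage example.

```python
import numpy as np

def gmm_em(X, k, iterations=50, seed=0):
    """EM for a k-component Gaussian mixture. X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                   # mixture weights pi_j
    mu = X[rng.choice(n, k, replace=False)]    # initialize means from data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])

    for _ in range(iterations):
        # E-step: responsibilities z[i, j] = pi_j N(x_i | mu_j, Sigma_j) / sum_l ...
        z = np.empty((n, k))
        for j in range(k):
            diff = X - mu[j]
            inv = np.linalg.inv(sigma[j])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[j]))
            z[:, j] = pi[j] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        log_likelihood = np.sum(np.log(z.sum(axis=1)))
        z /= z.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and covariances from responsibilities.
        nk = z.sum(axis=0)
        pi = nk / n
        mu = (z.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (z[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)

    return pi, mu, sigma, log_likelihood

# Toy usage on synthetic 2-D data drawn from two invented Gaussians:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
pi, mu, sigma, ll = gmm_em(X, k=2)
print(pi, mu, ll)
```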

35 Worked Example (1)

36 Worked Example (2)