EM Algorithm and Applications. Lecture #9. Background readings: Chapters 11.2 and 11.6 of the textbook, Biological Sequence Analysis, Durbin et al., 2001.


2. Reminder: Relative Entropy. Let p, q be two probability distributions on the same sample space. The relative entropy between p and q is defined by H(p||q) = D(p||q) = ∑_x p(x) log[p(x)/q(x)] = ∑_x p(x) log(1/q(x)) − ∑_x p(x) log(1/p(x)), where the second sum is the entropy H(p). It measures "the inefficiency of assuming distribution q when the correct distribution is p".

3. Non-negativity of relative entropy. Claim (proved last week): D(p||q) = ∑_x p(x) log[p(x)/q(x)] = ∑_x p(x) log(1/q(x)) − ∑_x p(x) log(1/p(x)) ≥ 0, with equality if and only if q = p. This claim is used in the correctness proof of the EM algorithm, which we present next.
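Here is a minimal numerical sketch of the claim (my own addition, not from the lecture); the function name relative_entropy and the two example distributions are arbitrary choices.

```python
import math

def relative_entropy(p, q):
    """D(p||q) = sum_x p(x) * log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.25, 0.75]   # the "correct" distribution
q = [0.5, 0.5]     # the assumed distribution
print(relative_entropy(p, q))  # positive, as the claim states
print(relative_entropy(p, p))  # 0 exactly when q = p
```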

4. EM algorithm: approximating the MLE from incomplete data. Finding the MLE parameters is a nonlinear optimization problem. General idea of EM: use the "current point" θ to construct an alternative function Q_θ (which is "nice"). Guarantee: if Q_θ(λ) > Q_θ(θ), then λ has higher likelihood than θ. [Figure on slide: the log-likelihood curve log P(x|λ) as a function of λ, together with the auxiliary curve E[log P(x,y|λ)], with the points θ and λ marked.]

5. The EM algorithm. Consider a model where, for observed data x and model parameters θ, p(x|θ) = ∑_y p(x,y|θ) (the y are the "hidden data"). The EM algorithm receives x and parameters θ, and returns new parameters λ* such that p(x|λ*) > p(x|θ). Note: in the Durbin et al. book, the initial parameters are denoted by θ⁰ and the new parameters by θ.

6. The EM algorithm. Finding the λ* which maximizes p(x|λ*) = ∑_y p(x,y|λ*) is equivalent to finding the λ* which maximizes the logarithm log p(x|λ*) = log(∑_y p(x,y|λ*)), which is what the EM algorithm attempts to do. In the following we: 1. present the EM algorithm; 2. give a few examples of implementations; 3. prove its correctness.

7. The EM algorithm. In each iteration the EM algorithm does the following:
- (E step) Calculate Q_θ(λ) = ∑_y p(y|x,θ) log p(x,y|λ).
- (M step) Find the λ* which maximizes Q_θ(λ).
(The next iteration sets θ ← λ* and repeats.)
Comments: 1. When θ is clear from context, we write Q(λ) instead of Q_θ(λ). 2. At the M step we only need Q_θ(λ*) > Q_θ(θ); this relaxation yields the so-called Generalized EM algorithm. It is important when it is hard to find the optimal λ*.
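To make the iteration structure concrete, here is a minimal generic sketch of the E/M loop for a model whose hidden data y ranges over a small finite set. This is my own illustration, not code from the lecture; the callable names (joint_prob, maximize_q) and the convergence tolerance are assumptions.

```python
import math

def em(x, theta0, hidden_values, joint_prob, maximize_q, tol=1e-8, max_iter=100):
    """Generic EM loop: alternate E and M steps until the likelihood stops improving.

    joint_prob(x, y, theta) -> p(x, y | theta)
    maximize_q(weights)     -> parameters lambda maximizing
                               sum_y weights[y] * log p(x, y | lambda)
    """
    theta = theta0
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E step: posterior weight p(y | x, theta) of every hidden completion y
        joint = {y: joint_prob(x, y, theta) for y in hidden_values}
        px = sum(joint.values())              # p(x | theta) = sum_y p(x, y | theta)
        ll = math.log(px)
        if ll - prev_ll < tol:                # likelihood no longer improves: stop
            break
        prev_ll = ll
        weights = {y: joint[y] / px for y in hidden_values}
        # M step: choose new parameters maximizing Q_theta(lambda)
        theta = maximize_q(weights)
    return theta
```

The two-coin and ABO examples that follow are instances of this skeleton: joint_prob multiplies the outcome probabilities along a hidden completion, and maximize_q simply normalizes the expected counts.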

8. Example: Baum-Welch = EM for HMMs. The Baum-Welch algorithm is the EM algorithm for HMMs. E step for an HMM: Q_θ(λ) = ∑_s p(s|x,θ) log p(s,x|λ), where λ are the new parameters {a_kl, e_k(b)}. (The A_kl and E_k(b) are the counts of state transitions and symbol emissions in (s,x).)

9. Baum-Welch = EM for HMMs. M step for an HMM: find the λ* which maximizes Q_θ(λ). As we proved, λ* is given by the relative frequencies of the A_kl's and the E_k(b)'s: a*_kl = A_kl / ∑_l' A_kl' and e*_k(b) = E_k(b) / ∑_b' E_k(b').
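To illustrate just this M step, here is a minimal sketch (my own, not from the lecture) of the normalization, assuming the expected counts A[k][l] and E[k][b] have already been accumulated in the E step (for instance by the forward-backward algorithm); all names and the example numbers are made up.

```python
def baum_welch_m_step(A, E):
    """Turn expected counts into new HMM parameters by row normalization.

    A[k][l] : expected number of k -> l transitions
    E[k][b] : expected number of emissions of symbol b from state k
    Returns (a, e) with a[k][l] = A[k][l] / sum_l' A[k][l'] and
                        e[k][b] = E[k][b] / sum_b' E[k][b'].
    """
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in A[k]} for k in A}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in E[k]} for k in E}
    return a, e

# Example with made-up expected counts for a 2-state HMM over the alphabet {H, T}:
A = {0: {0: 8.0, 1: 2.0}, 1: {0: 3.0, 1: 7.0}}
E = {0: {'H': 6.0, 'T': 4.0}, 1: {'H': 1.0, 'T': 9.0}}
print(baum_welch_m_step(A, E))
```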

10. A simple example: EM for 2 coin tosses. Consider the following experiment: given a coin with two possible outcomes, H (head) and T (tail), with probabilities q_H and q_T = 1 − q_H, the coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T, *). We wish to apply the EM algorithm to get parameters that increase the likelihood of the data. Let the initial parameters be θ = (q_H, q_T) = (¼, ¾).

11. EM for 2 coin tosses (cont.). The hidden data which can produce x are the sequences y_1 = (T,H) and y_2 = (T,T). Hence the likelihood of x with parameters (q_H, q_T) is p(x|θ) = P(x,y_1|θ) + P(x,y_2|θ) = q_H q_T + q_T². For the initial parameters θ = (¼, ¾), we have p(x|θ) = ¾·¼ + ¾·¾ = ¾. Note that in this case P(x,y_i|θ) = P(y_i|θ) for i = 1, 2; we can always define y so that (x,y) = y (otherwise we set y' ≡ (x,y) and replace the "y"s by "y'"s).

12. EM for 2 coin tosses - E step. Calculate Q_θ(λ) = Q_θ(q_H, q_T); note that here q_H, q_T are the free variables. Q_θ(λ) = p(y_1|x,θ) log p(x,y_1|λ) + p(y_2|x,θ) log p(x,y_2|λ), with p(y_1|x,θ) = p(y_1,x|θ)/p(x|θ) = (¾·¼)/(¾) = ¼ and p(y_2|x,θ) = p(y_2,x|θ)/p(x|θ) = (¾·¾)/(¾) = ¾. Thus we have Q_θ(λ) = ¼ log p(x,y_1|λ) + ¾ log p(x,y_2|λ).

13. EM for 2 coin tosses - E step. For a sequence y of coin tosses, let N_H(y) be the number of H's in y, and N_T(y) the number of T's in y. Then log p(y|λ) = N_H(y) log q_H + N_T(y) log q_T. In our example y_1 = (T,H) and y_2 = (T,T), hence N_H(y_1) = N_T(y_1) = 1, N_H(y_2) = 0, N_T(y_2) = 2.

14. Example: 2 coin tosses - E step. Thus ¼ log p(x,y_1|λ) = ¼ (N_H(y_1) log q_H + N_T(y_1) log q_T) = ¼ (log q_H + log q_T), and ¾ log p(x,y_2|λ) = ¾ (N_H(y_2) log q_H + N_T(y_2) log q_T) = ¾ (2 log q_T). Substituting in the equation for Q_θ(λ): Q_θ(λ) = ¼ log p(x,y_1|λ) + ¾ log p(x,y_2|λ) = (¼ N_H(y_1) + ¾ N_H(y_2)) log q_H + (¼ N_T(y_1) + ¾ N_T(y_2)) log q_T, i.e. Q_θ(λ) = N_H log q_H + N_T log q_T with N_H = ¼ and N_T = 7/4.

15. EM for 2 coin tosses - M step. Find the λ* which maximizes Q_θ(λ) = N_H log q_H + N_T log q_T = ¼ log q_H + (7/4) log q_T. We saw earlier that this is maximized by the relative frequencies of the expected counts: q_H* = N_H/(N_H + N_T) = (¼)/2 = 1/8 and q_T* = N_T/(N_H + N_T) = (7/4)/2 = 7/8. [The optimal parameters (0,1) will never be reached by the EM algorithm!]
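The whole toy example fits in a few lines. The sketch below (my own, not from the slides) runs one E/M iteration numerically and reproduces the numbers above: posterior weights ¼ and ¾, expected counts N_H = ¼ and N_T = 7/4, updated parameters (1/8, 7/8), and a likelihood that rises from ¾ to 7/8, as the EM guarantee promises.

```python
def likelihood(qH, qT):
    # p(x | theta) for x = (T, *): completions y1 = (T, H) and y2 = (T, T)
    return qT * qH + qT * qT

def em_step(qH, qT):
    """One EM iteration for the two-coin-toss example."""
    # E step: posterior weights of the two hidden completions
    p_y1, p_y2 = qT * qH, qT * qT
    px = p_y1 + p_y2
    w1, w2 = p_y1 / px, p_y2 / px          # 1/4 and 3/4 for theta = (1/4, 3/4)
    # Expected head/tail counts N_H and N_T
    NH = w1 * 1 + w2 * 0                   # y1 contains one H, y2 contains none
    NT = w1 * 1 + w2 * 2                   # y1 contains one T, y2 contains two
    # M step: relative frequencies of the expected counts
    return NH / (NH + NT), NT / (NH + NT)

qH, qT = 0.25, 0.75
print(likelihood(qH, qT))     # 0.75
qH, qT = em_step(qH, qT)
print(qH, qT)                 # 0.125, 0.875
print(likelihood(qH, qT))     # 0.875 > 0.75
```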

16. EM for a single random variable (one die). Now the probability of each y (≡ (x,y)) is given by a sequence of die tosses. The die has m outcomes, with probabilities q_1,...,q_m. Let N_l(y) = #(times outcome l occurs in y), and let N_l be the expected value of N_l(y) given x and θ: N_l = E(N_l|x,θ) = ∑_y p(y|x,θ) N_l(y). Then, just as in the coin example, we have Q_θ(λ) = ∑_l N_l log q_l.

17. Q_θ(λ) for one die. [Formula slide: the derivation of Q_θ(λ) in terms of the expected counts N_l.]

18. EM algorithm for n independent observations x_1,…,x_n. Expectation step: it can be shown that, if the x_j are independent, then the expected counts decompose into a sum over the observations: N_l = ∑_{j=1}^n ∑_{y_j} p(y_j|x_j,θ) N_l(y_j).

19. Example: the ABO locus. A locus is a particular place on the chromosome. Each locus's state (called the genotype) consists of two alleles, one paternal and one maternal. Some loci (plural of locus) determine observable features; the ABO locus, for example, determines blood type. The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}: the first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportions of the 6 genotypes in the population. If we randomly sampled N individuals and found that N_{a/a} have genotype a/a, N_{a/b} have genotype a/b, etc., then the MLE would be given by the relative frequencies, e.g. q*_{a/a} = N_{a/a}/N.

20. The ABO locus (cont.). However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)? The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?

21. The ABO locus (cont.). The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles q_a, q_b, q_o in the population determine the frequencies of the genotypes as follows: q_{a/b} = 2 q_a q_b, q_{a/o} = 2 q_a q_o, q_{b/o} = 2 q_b q_o, q_{a/a} = q_a², q_{b/b} = q_b², q_{o/o} = q_o². In fact, the Hardy-Weinberg equilibrium rule follows from modeling this problem as observed data x with hidden data y:

22. The ABO locus (cont.). The die's outcomes are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O. Each blood type is determined by two successive random samplings of alleles, i.e. an "ordered genotype pair"; this is the hidden data. For instance, blood type A corresponds to the ordered genotype pairs (a,a), (a,o) and (o,a). So we have three parameters of one die, q_a, q_b, q_o, that we need to estimate.

23. EM setting for the ABO locus. The observed data x = (x_1,...,x_n) is a sequence of letters (blood types) from the alphabet {A, B, AB, O}; e.g. (B,A,B,B,O,A,B,A,O,B,AB) are the observations (x_1,...,x_11). The hidden data (i.e. the y's) for each letter x_j is the set of ordered pairs of alleles that generates it; for instance, for A it is the set {aa, ao, oa}. The parameters θ = {q_a, q_b, q_o} are the probabilities of the alleles. We need to find the parameters {q_a, q_b, q_o} that maximize the likelihood of the given data. We do this by the EM algorithm:

24. EM for the ABO locus. For each observed blood type x_j ∈ {A, B, AB, O} and for each allele z ∈ {a, b, o} we compute N_z(x_j), the expected number of times that z appears in x_j: N_z(x_j) = ∑_{y_j} p(y_j|x_j,θ) N_z(y_j), where the sum is taken over the ordered genotype pairs y_j that can produce x_j, and N_z(y_j) is the number of times allele z occurs in the pair y_j. E.g., N_a((o,b)) = 0; N_b((o,b)) = N_o((o,b)) = 1.

25. EM for the ABO locus. The computation for blood type B: P(B|θ) = P((b,b)|θ) + P((b,o)|θ) + P((o,b)|θ) = q_b² + 2 q_b q_o. Since N_b((b,b)) = 2 and N_b((b,o)) = N_b((o,b)) = N_o((o,b)) = N_o((b,o)) = 1, the expected numbers of occurrences of b and o in B, N_b(B) and N_o(B), are given by N_b(B) = (2q_b² + 2q_b q_o)/(q_b² + 2q_b q_o) and N_o(B) = 2q_b q_o/(q_b² + 2q_b q_o). Observe that N_b(B) + N_o(B) = 2.

26. EM for the ABO locus. Similarly, P(A|θ) = q_a² + 2 q_a q_o; P(AB|θ) = P((b,a)|θ) + P((a,b)|θ) = 2 q_a q_b; P(O|θ) = P((o,o)|θ) = q_o². N_a(A) and N_o(A) are computed like N_b(B) and N_o(B), with a in place of b. For the remaining types, N_a(AB) = N_b(AB) = 1 and N_o(O) = 2 [and N_b(O) = N_a(O) = N_o(AB) = N_b(A) = N_a(B) = 0].

27. E step: compute N_a, N_b and N_o. Let #(A)=3, #(B)=5, #(AB)=1, #(O)=2 be the numbers of observations of blood types A, B, AB and O respectively (n = 11 individuals). Summing the per-observation expectations gives N_a = #(A)·N_a(A) + #(AB)·N_a(AB), N_b = #(B)·N_b(B) + #(AB)·N_b(AB), and N_o = #(A)·N_o(A) + #(B)·N_o(B) + #(O)·N_o(O). M step: set λ* = (q_a*, q_b*, q_o*) by normalizing, q_z* = N_z / (N_a + N_b + N_o) = N_z / 2n.
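Putting slides 23-27 together, here is a minimal sketch (my own, not from the lecture) of the full EM iteration for the ABO example with the observation counts #(A)=3, #(B)=5, #(AB)=1, #(O)=2; the starting allele frequencies and the number of iterations are arbitrary choices.

```python
# Hidden completions: the ordered allele pairs that produce each blood type.
COMPLETIONS = {
    'A':  [('a', 'a'), ('a', 'o'), ('o', 'a')],
    'B':  [('b', 'b'), ('b', 'o'), ('o', 'b')],
    'AB': [('a', 'b'), ('b', 'a')],
    'O':  [('o', 'o')],
}

def em_step(q, counts):
    """One EM iteration for the allele frequencies q = {'a':.., 'b':.., 'o':..}."""
    N = {'a': 0.0, 'b': 0.0, 'o': 0.0}
    for blood_type, n_obs in counts.items():
        pairs = COMPLETIONS[blood_type]
        joint = [q[z1] * q[z2] for z1, z2 in pairs]   # p(y, x_j | theta)
        px = sum(joint)                               # p(x_j | theta)
        for (z1, z2), pj in zip(pairs, joint):
            w = pj / px                               # p(y | x_j, theta)
            N[z1] += n_obs * w                        # expected allele counts
            N[z2] += n_obs * w
    total = sum(N.values())                           # = 2n (two alleles per person)
    return {z: N[z] / total for z in N}               # M step: relative frequencies

counts = {'A': 3, 'B': 5, 'AB': 1, 'O': 2}
q = {'a': 1/3, 'b': 1/3, 'o': 1/3}                    # arbitrary starting point
for _ in range(20):
    q = em_step(q, counts)
print(q)   # allele frequencies after 20 EM iterations
```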

28. EM for a general discrete stochastic process. This time the experiment (x,y) is generated by a general stochastic process. The only assumption we make is that the outcome of each experiment consists of a (finite) sequence of samplings of r discrete random variables ("dice") Z_1,...,Z_r, where each of the Z_i's can be sampled several times. This can be realized by a probabilistic acyclic state machine: at each state some Z_i is sampled, and the next state is determined by the outcome, until a final state is reached. As before, we wish to maximize the likelihood of the observation x with hidden data y, i.e. maximize p(x|λ) = ∑_y p(x,y|λ).

29. EM for processes with many dice. Example: in an HMM, the random variables are the transition and emission probabilities a_kl and e_k(b); x is the visible information, y is the sequence s of states, and (x,y) is the complete HMM path. As before, we can redefine y so that (x,y) = y. [Figure: the HMM chain of states s_1, s_2,..., s_L with emitted symbols X_1, X_2,..., X_L.]

30. EM for processes with many dice. Each random variable Z_k (k = 1,...,r) has m_k values z_{k,1},...,z_{k,m_k} with probabilities {q_{kl} | l = 1,...,m_k}. Each y defines a sequence of outcomes (z_{k_1,l_1},...,z_{k_n,l_n}) of the random variables used in y; in the HMM, these are the specific transitions and emissions defined by the states and outputs of the sequence y. Let N_{kl}(y) = #(times z_{kl} appears in y).

31. EM for processes with many dice. Define N_kl as the expected value of N_kl(y), given x and θ: N_kl = E(N_kl|x,θ) = ∑_y p(y|x,θ) N_kl(y). Then, similarly to the single-die case, we have Q_θ(λ) = ∑_k ∑_l N_kl log q_kl.

32. Q_θ(λ) for processes with many dice. [Formula slide: Q_θ(λ) expressed in terms of the expected counts N_kl.]

33. EM algorithm for processes with many dice. Similarly to the one-die case we get: Expectation step: set N_kl to E(N_kl(y)|x,θ), i.e. N_kl = ∑_y p(y|x,θ) N_kl(y). Maximization step: set q_kl = N_kl / (∑_l' N_kl').
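As an illustration of how these two steps look in code for a general "many dice" process, here is a sketch of my own (not from the lecture); the parameters are represented as a nested dict q[k][l], and completions, joint_prob and events are assumed to be supplied by the specific model.

```python
def e_step(x, completions, joint_prob, events, q):
    """Compute N_kl = sum_y p(y|x,theta) N_kl(y), where events(y) lists the
    (k, l) outcomes sampled along the hidden completion y."""
    joint = {y: joint_prob(x, y, q) for y in completions}
    px = sum(joint.values())                       # p(x | theta)
    N = {k: {l: 0.0 for l in q[k]} for k in q}
    for y, pj in joint.items():
        w = pj / px                                # p(y | x, theta)
        for k, l in events(y):
            N[k][l] += w
    return N

def m_step(N):
    """Set q_kl = N_kl / sum_l' N_kl'."""
    return {k: {l: N[k][l] / sum(N[k].values()) for l in N[k]} for k in N}
```

For an HMM, events(y) would list the transition and emission outcomes along the hidden state path y, which recovers the Baum-Welch updates sketched earlier.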

34. EM algorithm for n independent observations x_1,…,x_n. Expectation step: it can be shown that, if the x_j are independent, then the expected counts decompose into a sum over the observations: N_kl = ∑_{j=1}^n ∑_{y_j} p(y_j|x_j,θ) N_kl(y_j).

35. Correctness proof of EM. Theorem: Let x = {y : y ∈ Y} be a collection of events, as in the setting of the EM algorithm, and let Q_θ(λ) = ∑_y p(y|x,θ) log p(y|λ). Then the following holds: if Q_θ(λ*) > Q_θ(θ), then P(x|λ*) > P(x|θ).

36. Proof: By the definition of conditional probability, for each y we have p(x|λ) p(y|x,λ) = p(y,x|λ) = p(y|λ) (recall that y determines x), and hence log p(x|λ) = log p(y|λ) − log p(y|x,λ). Multiplying both sides by p(y|x,θ) and summing over y leaves the left-hand side unchanged, since ∑_y p(y|x,θ) = 1; hence log p(x|λ) = ∑_y p(y|x,θ) [log p(y|λ) − log p(y|x,λ)].

37. Proof (end): log p(x|λ) = ∑_y p(y|x,θ) log p(y|λ) − ∑_y p(y|x,θ) log p(y|x,λ), and the first sum is exactly Q_θ(λ). Substituting λ = λ* and λ = θ and then subtracting, we get log p(x|λ*) − log p(x|θ) = Q(λ*) − Q(θ) + D(p(y|x,θ) || p(y|x,λ*)) ≥ Q(λ*) − Q(θ) > 0, since the relative entropy term is non-negative and Q(λ*) > Q(θ) by assumption. QED

38. EM in Practice.
Initial parameters:
- Random parameters setting
- "Best" guess from other source
Stopping criteria:
- Small change in likelihood of data
- Small change in parameter values
Avoiding bad local maxima:
- Multiple restarts
- Early "pruning" of unpromising ones
(A sketch of these heuristics follows below.)
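As a closing illustration (my own sketch, not from the lecture) of how these practical heuristics combine: random restarts, a likelihood-change stopping criterion, and keeping the best run. The em_step and log_likelihood callables are assumed to come from the model at hand, e.g. the two-coin or ABO examples above.

```python
import math

def run_em(init_params, em_step, log_likelihood, tol=1e-6, max_iter=500):
    """Run EM from one starting point until the log-likelihood change is small."""
    params = init_params
    prev = log_likelihood(params)
    curr = prev
    for _ in range(max_iter):
        params = em_step(params)
        curr = log_likelihood(params)
        if curr - prev < tol:          # stopping criterion: small change in likelihood
            break
        prev = curr
    return params, curr

def em_with_restarts(random_init, em_step, log_likelihood, n_restarts=10):
    """Multiple random restarts; keep the run with the best final likelihood."""
    best_params, best_ll = None, -math.inf
    for _ in range(n_restarts):
        params, ll = run_em(random_init(), em_step, log_likelihood)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll
```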