Published by Stephen McBride. Modified over 9 years ago.
1 EM algorithm and applications
Lecture #9
Background readings: Chapters 11.2, 11.6 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.
2 Reminder: Relative Entropy
Let p, q be two probability distributions on the same sample space. The relative entropy between p and q is defined by
$$H(p\|q) = D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = \sum_x p(x)\log\frac{1}{q(x)} - \sum_x p(x)\log\frac{1}{p(x)}.$$
The subtracted term, $\sum_x p(x)\log\frac{1}{p(x)}$, is the entropy $H(p)$. The relative entropy is "the inefficiency of assuming distribution q when the correct distribution is p".
3 Non-negativity of relative entropy
Claim (proved last week):
$$D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = \sum_x p(x)\log\frac{1}{q(x)} - \sum_x p(x)\log\frac{1}{p(x)} \ge 0,$$
with equality if and only if q = p. This claim is used in the correctness proof of the EM algorithm, which we present next.
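The claim can be checked numerically on small distributions. The following is a minimal sketch, assuming natural logarithms and that q(x) > 0 wherever p(x) > 0; the function name `relative_entropy` and the two example distributions are illustrative choices, not part of the lecture.

```python
import math

def relative_entropy(p, q):
    """D(p||q) = sum_x p(x) * log(p(x) / q(x)); terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Illustrative distributions on a two-point sample space (assumed values).
p = [0.25, 0.75]
q = [0.5, 0.5]

d_pq = relative_entropy(p, q)   # strictly positive because q != p
d_pp = relative_entropy(p, p)   # zero because the distributions coincide
```

Note that the relative entropy is not symmetric: `relative_entropy(q, p)` generally differs from `relative_entropy(p, q)`.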
4 EM algorithm: approximating MLE from incomplete data
• Finding MLE parameters is a nonlinear optimization problem.
• Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice").
• Guarantee: the maximum of the new function has a likelihood at least as high as that of the current point.
[Figure: the curve $\log P(x|\lambda)$ and the alternative function $E[\log P(x,y|\lambda)]$, plotted against $\lambda$.]
5 The EM algorithm
Consider a model where, for observed data x and model parameters θ, $p(x|\theta)$ is defined by
$$p(x|\theta) = \sum_y p(x,y|\theta),$$
where y are the "hidden data". The EM algorithm receives x and parameters θ, and returns new parameters $\lambda^*$ such that $p(x|\lambda^*) \ge p(x|\theta)$.
Note: in the book by Durbin et al., the initial parameters are denoted by θ0, and the new parameters by θ.
6 The EM algorithm
Maximizing $p(x|\lambda^*) = \sum_y p(x,y|\lambda^*)$ is equivalent to maximizing its logarithm,
$$\log p(x|\lambda^*) = \log \sum_y p(x,y|\lambda^*),$$
which is what the EM algorithm does. In the following we:
1. Present the EM algorithm.
2. Give a few examples of implementations.
3. Prove its correctness.
7 The EM algorithm
In each iteration the EM algorithm does the following.
• (E step): Calculate $Q_\theta(\lambda) = \sum_y p(y|x,\theta)\log p(x,y|\lambda)$.
• (M step): Find $\lambda^*$ which maximizes $Q_\theta(\lambda)$.
(The next iteration sets $\theta \leftarrow \lambda^*$ and repeats.)
Comments:
1. When θ is clear, we shall use $Q(\lambda)$ instead of $Q_\theta(\lambda)$.
2. At the M step we only need that $Q_\theta(\lambda^*) > Q_\theta(\theta)$. This change yields the so-called Generalized EM algorithm. It is important when it is hard to find the optimal $\lambda^*$.
8 Example: EM for 2 coin tosses
Consider the following experiment: we are given a coin with two possible outcomes, H (head) and T (tail), with probabilities $q_H$ and $q_T = 1 - q_H$. The coin is tossed twice, but only the first outcome, T, is seen. So the data is x = (T,*). We wish to apply the EM algorithm to get parameters that increase the likelihood of the data. Let the initial parameters be $\theta = (q_H, q_T) = (1/4, 3/4)$.
9 Example: 2 coin tosses (cont.)
The hidden data which can produce x are the sequences $y_1 = (T,H)$ and $y_2 = (T,T)$ (note that with this definition $(x,y_i) = y_i$). The likelihood of x with parameters $(q_H, q_T)$ is $q_T q_H + q_T^2$. For the initial parameters $\theta = (1/4, 3/4)$ we have:
$$p(x|\theta) = p(x,y_1|\theta) + p(x,y_2|\theta) = \tfrac{3}{4}\cdot\tfrac{1}{4} + \tfrac{3}{4}\cdot\tfrac{3}{4} = \tfrac{3}{4}.$$
Note that in this case $p(x,y_i|\lambda) = p(y_i|\lambda)$ for i = 1, 2. We can always define y so that (x,y) = y (otherwise we set y' ≡ (x,y) and replace the "y"s by "y'"s).
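10 Example: 2 coin tosses - E step
Calculate $Q_\theta(\lambda) = Q_\theta(q_H, q_T)$. Note: $q_H, q_T$ are variables.
$$Q_\theta(\lambda) = p(y_1|x,\theta)\log p(x,y_1|\lambda) + p(y_2|x,\theta)\log p(x,y_2|\lambda)$$
$$p(y_1|x,\theta) = \frac{p(y_1,x|\theta)}{p(x|\theta)} = \frac{\tfrac{3}{4}\cdot\tfrac{1}{4}}{\tfrac{3}{4}} = \tfrac{1}{4}, \qquad p(y_2|x,\theta) = \frac{p(y_2,x|\theta)}{p(x|\theta)} = \frac{\tfrac{3}{4}\cdot\tfrac{3}{4}}{\tfrac{3}{4}} = \tfrac{3}{4}$$
Thus we have
$$Q_\theta(\lambda) = \tfrac{1}{4}\log p(x,y_1|\lambda) + \tfrac{3}{4}\log p(x,y_2|\lambda).$$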
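The posterior weights in this E step can be reproduced with exact arithmetic. This is a small sketch under the slide's setup (x = (T,*), initial q_H = 1/4); the helper name `joint` and the use of `fractions.Fraction` are illustrative choices.

```python
from fractions import Fraction

def joint(y, q_h):
    """p(y | q_H): product of the per-toss probabilities of the sequence y."""
    q_t = 1 - q_h
    prob = Fraction(1)
    for toss in y:
        prob *= q_h if toss == "H" else q_t
    return prob

q_h = Fraction(1, 4)                       # initial parameter theta
hidden = [("T", "H"), ("T", "T")]          # y_1 and y_2, the completions of x = (T, *)
p_x = sum(joint(y, q_h) for y in hidden)   # p(x | theta)
posterior = [joint(y, q_h) / p_x for y in hidden]   # p(y_i | x, theta)
```

With these inputs `p_x` is 3/4 and `posterior` is [1/4, 3/4], matching the values on the slide.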
11 Example: 2 coin tosses - E step
For a sequence y of coin tosses, let $N_H(y)$ be the number of H's in y, and $N_T(y)$ be the number of T's in y. Then
$$\log p(y|\lambda) = N_H(y)\log q_H + N_T(y)\log q_T.$$
[In our example: $\log p(y_1|\lambda) = \log q_H + \log q_T$ and $\log p(y_2|\lambda) = 2\log q_T$.]
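12 Example: 2 coin tosses - E step
Thus
$$\tfrac{1}{4}\log p(x,y_1|\lambda) = \tfrac{1}{4}\big(N_H(y_1)\log q_H + N_T(y_1)\log q_T\big) = \tfrac{1}{4}(\log q_H + \log q_T)$$
$$\tfrac{3}{4}\log p(x,y_2|\lambda) = \tfrac{3}{4}\big(N_H(y_2)\log q_H + N_T(y_2)\log q_T\big) = \tfrac{3}{4}(2\log q_T)$$
Substituting in the equation for $Q_\theta(\lambda)$:
$$Q_\theta(\lambda) = \tfrac{1}{4}\log p(x,y_1|\lambda) + \tfrac{3}{4}\log p(x,y_2|\lambda) = \big(\tfrac{1}{4}N_H(y_1) + \tfrac{3}{4}N_H(y_2)\big)\log q_H + \big(\tfrac{1}{4}N_T(y_1) + \tfrac{3}{4}N_T(y_2)\big)\log q_T$$
$$Q_\theta(\lambda) = N_H\log q_H + N_T\log q_T, \qquad N_H = \tfrac{1}{4},\quad N_T = \tfrac{7}{4}.$$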
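13 Example: 2 coin tosses - M step
Find $\lambda^*$ which maximizes $Q_\theta(\lambda)$:
$$Q_\theta(\lambda) = N_H\log q_H + N_T\log q_T = \tfrac{1}{4}\log q_H + \tfrac{7}{4}\log q_T.$$
We saw earlier that this is maximized when:
$$q_H = \frac{N_H}{N_H + N_T} = \frac{1}{8}, \qquad q_T = \frac{N_T}{N_H + N_T} = \frac{7}{8}.$$
[The optimal parameters (0,1) will never be reached by the EM algorithm!]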
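Iterating the E and M steps shows why the optimum (0,1) is approached but never reached. The sketch below assumes the same data x = (T,*) and starting point q_H = 1/4; the function name `em_step` and the choice of three iterations are illustrative.

```python
from fractions import Fraction

def em_step(q_h):
    """One EM iteration for x = (T, *); returns the updated q_H."""
    q_t = 1 - q_h
    p_y1 = q_t * q_h                    # p(x, y_1 | theta), y_1 = (T, H)
    p_y2 = q_t * q_t                    # p(x, y_2 | theta), y_2 = (T, T)
    p_x = p_y1 + p_y2
    w1, w2 = p_y1 / p_x, p_y2 / p_x     # E step: posteriors of y_1 and y_2
    n_h = w1 * 1 + w2 * 0               # expected number of H's
    n_t = w1 * 1 + w2 * 2               # expected number of T's
    return n_h / (n_h + n_t)            # M step: normalised expected counts

q_h = Fraction(1, 4)
trajectory = [q_h]
for _ in range(3):
    q_h = em_step(q_h)
    trajectory.append(q_h)
```

Here each iteration halves q_H (1/4, 1/8, 1/16, 1/32), so the likelihood p(x|θ) = q_T keeps increasing while q_H stays strictly positive.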
14 EM for single random variable (dice)
Now, the probability of each y (≡ (x,y)) is given by a sequence of dice tosses. The dice has m outcomes, with probabilities $q_1,\dots,q_m$. Let $N_l(y)$ = #(outcome l occurs in y). Then
$$\log p(y|\lambda) = \sum_{l=1}^{m} N_l(y)\log q_l.$$
Let $N_l$ be the expected value of $N_l(y)$, given x and θ:
$$N_l = E(N_l|x,\theta) = \sum_y p(y|x,\theta)\,N_l(y).$$
Then we have:
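15 Q(λ) for one dice
$$Q_\theta(\lambda) = \sum_y p(y|x,\theta)\log p(y|\lambda) = \sum_y p(y|x,\theta)\sum_{l=1}^{m} N_l(y)\log q_l = \sum_{l=1}^{m} N_l\log q_l,$$
where $N_l = \sum_y p(y|x,\theta)\,N_l(y)$.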
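The maximizer of this Q function follows from the non-negativity of relative entropy. The derivation below is a sketch; the symbols $N = \sum_l N_l$ and $p_l = N_l / N$ are introduced here for convenience and are not part of the slides.

```latex
% Write Q in terms of the normalised expected counts p_l = N_l / N.
Q_\theta(\lambda) = \sum_{l=1}^{m} N_l \log q_l = N \sum_{l=1}^{m} p_l \log q_l .
% Compare with the value obtained at q = p:
N \sum_{l} p_l \log q_l - N \sum_{l} p_l \log p_l
  = -N \sum_{l} p_l \log \frac{p_l}{q_l}
  = -N \, D(p \,\|\, q) \le 0 ,
% with equality if and only if q = p. Hence Q is maximised at
q_l^{*} = p_l = \frac{N_l}{\sum_{l'} N_{l'}} .
```

For the two-coin example this gives $q_H^* = (1/4)/2 = 1/8$ and $q_T^* = (7/4)/2 = 7/8$.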
16 EM algorithm for n independent observations $x_1,\dots,x_n$: Expectation step
It can be shown that, if the $x_j$ are independent, then:
$$N_l = \sum_{j=1}^{n} E(N_l|x_j,\theta) = \sum_{j=1}^{n}\sum_y p(y|x_j,\theta)\,N_l(y).$$
17 Example: The ABO locus
A locus is a particular place on the chromosome. Each locus' state (called genotype) consists of two alleles, one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.
The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O.
We wish to estimate the proportion in a population of the 6 genotypes. Suppose we randomly sampled N individuals and found that $N_{a/a}$ have genotype a/a, $N_{a/b}$ have genotype a/b, etc. Then the MLE is given by:
$$q_{a/a} = \frac{N_{a/a}}{N},\quad q_{a/b} = \frac{N_{a/b}}{N},\ \text{etc.}$$
18 The ABO locus (cont.)
However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)? The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?
19 The ABO locus (cont.)
The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles $q_a, q_b, q_o$ in the population determine the frequencies of the genotypes as follows:
$$q_{a/b} = 2q_aq_b,\quad q_{a/o} = 2q_aq_o,\quad q_{b/o} = 2q_bq_o,\quad q_{a/a} = q_a^2,\quad q_{b/b} = q_b^2,\quad q_{o/o} = q_o^2.$$
So now we have three parameters of one dice that we need to estimate.
The Hardy-Weinberg equilibrium rule follows from modeling this problem as data x with hidden data y: we have three possible alleles a, b and o. The observed data are the blood types A, B, AB or O, determined by two successive random samplings of alleles, which define an "ordered genotype pair", the hidden data. For instance, blood type A corresponds to the ordered genotype pairs (a,a), (a,o) and (o,a).
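The Hardy-Weinberg rule can be written as a short function. This is a sketch only: the function names and the allele frequencies (0.3, 0.2, 0.5) are assumed values chosen for illustration, not data from the lecture.

```python
def genotype_frequencies(q_a, q_b, q_o):
    """Hardy-Weinberg genotype frequencies from the three allele frequencies."""
    return {
        "a/a": q_a ** 2, "b/b": q_b ** 2, "o/o": q_o ** 2,
        "a/b": 2 * q_a * q_b, "a/o": 2 * q_a * q_o, "b/o": 2 * q_b * q_o,
    }

def blood_type_frequencies(q_a, q_b, q_o):
    """Collapse the six genotypes into the four observable blood types."""
    g = genotype_frequencies(q_a, q_b, q_o)
    return {
        "A": g["a/a"] + g["a/o"],
        "B": g["b/b"] + g["b/o"],
        "AB": g["a/b"],
        "O": g["o/o"],
    }

# Hypothetical allele frequencies, used only to exercise the functions.
freqs = blood_type_frequencies(0.3, 0.2, 0.5)
```

Because the six genotype frequencies are the terms of $(q_a + q_b + q_o)^2$, they sum to 1 whenever the allele frequencies do.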
20 The Likelihood Function
The probabilities of the six genotypes $x_{a/a}, x_{a/o}, x_{b/b}, x_{b/o}, x_{a/b}, x_{o/o}$ are defined by the parameters $\lambda = \{q_a, q_b, q_o\}$. E.g., $P(X = x_{a/b}|\lambda) = P(\{(a,b),(b,a)\}|\lambda) = 2q_aq_b$. Similarly, $P(X = x_{o/o}|\lambda) = q_oq_o = q_o^2$, and so on for the other four genotypes. So all we need is to find the parameters $\lambda = \{q_a, q_b, q_o\}$.
21 The Likelihood Function
We wish to estimate the parameters by sampling data and then using MLE. This is naturally handled by EM, because the sampled data (the blood types) have hidden data (the ordered genotype pairs). Assume the sampled data is {B,A,B,B,O,A,B,A,O,B,AB}. What is its probability, for given parameters λ?
$$P(\text{data}|\lambda) = P(B|\lambda)^5\,P(A|\lambda)^3\,P(AB|\lambda)\,P(O|\lambda)^2 = (q_b^2 + 2q_bq_o)^5\,(q_a^2 + 2q_aq_o)^3\,(2q_aq_b)\,(q_o^2)^2.$$
Obtaining the maximum of this function yields the MLE. We use the EM algorithm to replace λ by $\lambda^*$, which increases the likelihood.
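The likelihood of the sample can be evaluated directly for any candidate parameters. The sketch below works with the log-likelihood; the function names and the uniform allele frequencies (1/3 each) used to evaluate it are illustrative assumptions.

```python
import math
from collections import Counter

SAMPLE = ["B", "A", "B", "B", "O", "A", "B", "A", "O", "B", "AB"]

def blood_type_prob(q_a, q_b, q_o):
    """Probability of each blood type under Hardy-Weinberg equilibrium."""
    return {
        "A": q_a ** 2 + 2 * q_a * q_o,
        "B": q_b ** 2 + 2 * q_b * q_o,
        "AB": 2 * q_a * q_b,
        "O": q_o ** 2,
    }

def log_likelihood(sample, q_a, q_b, q_o):
    """log P(sample | lambda) for independent observations."""
    probs = blood_type_prob(q_a, q_b, q_o)
    return sum(n * math.log(probs[t]) for t, n in Counter(sample).items())

counts = Counter(SAMPLE)
ll_uniform = log_likelihood(SAMPLE, 1 / 3, 1 / 3, 1 / 3)   # assumed starting point
```

Comparing `log_likelihood` before and after an EM update is a simple way to confirm that the update did not decrease the likelihood.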
22 ABO loci as a special case of HMM
Model the ABO sampling as an HMM with 6 states (genotypes): a/a, a/b, a/o, b/b, b/o, o/o, and 4 outputs (blood types): A, B, AB, O. Assume 3 transition types: a, b and o, where a state is determined by 2 successive transitions. The probability of transition x is $q_x$. Emission is done every other state, and is determined by the state. E.g., $e_{a/o}(A) = 1$, since a/o produces blood type A.
[Figure: example HMM path with transitions a, b, o, states a/o and a/b, and emissions A and AB.]
23 A faster and simpler EM for ABO loci
The problem can be solved via the Baum-Welch EM training. This is quite inefficient: for L samples it requires running the forward and backward algorithms on an HMM of length 2L, even though there are only 6 distinct genotypes. Direct application of the EM algorithm yields a simpler and more efficient way: consider the input data {B,A,B,B,O,A,B,A,O,B,AB} as observations $x_1,\dots,x_{11}$. The hidden data of an observation are the ordered genotype pairs which produce it. E.g., for O it is (o,o), and for B it is (o,b), (b,o) and (b,b).
24 A faster EM for ABO loci
For each ordered genotype pair y we have $N_a(y)$, $N_b(y)$ and $N_o(y)$. E.g., $N_a(o,b) = 0$; $N_b(o,b) = N_o(o,b) = 1$. For each observation of blood type $x_j$ and for each allele z in {a,b,o} we compute $N_z^j$, the expected number of times that z appears in the hidden genotype pair of $x_j$. So we apply the EM algorithm with initial parameters $\lambda = (q_a, q_b, q_o)$ to get better parameters.
25 A faster EM for ABO loci
The computation for blood type B:
$$P(B|\lambda) = P((b,b)|\lambda) + P((b,o)|\lambda) + P((o,b)|\lambda) = q_b^2 + 2q_bq_o.$$
Since $N_b((b,b)) = 2$, and $N_b((b,o)) = N_b((o,b)) = N_o((o,b)) = N_o((b,o)) = 1$, the expected numbers of occurrences of o and b in B, $N_o^B$ and $N_b^B$, are given by:
$$N_o^B = \frac{2q_bq_o}{q_b^2 + 2q_bq_o}, \qquad N_b^B = \frac{2q_b^2 + 2q_bq_o}{q_b^2 + 2q_bq_o}.$$
Observe that $N_b^B + N_o^B = 2$.
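The expected allele counts for a blood type can also be obtained by enumerating the ordered genotype pairs that produce it. This is a sketch: the lookup table `BLOOD_TYPE`, the function name and the allele frequencies are assumptions made for illustration.

```python
from itertools import product

# Blood type determined by the unordered set of alleles in a genotype pair.
BLOOD_TYPE = {
    frozenset("a"): "A", frozenset("ao"): "A",
    frozenset("b"): "B", frozenset("bo"): "B",
    frozenset("ab"): "AB", frozenset("o"): "O",
}

def expected_allele_counts(blood_type, q):
    """E[N_z(y) | blood type, q] for each allele z, summing over ordered pairs y."""
    pairs = [y for y in product("abo", repeat=2)
             if BLOOD_TYPE[frozenset(y)] == blood_type]
    weights = [q[y[0]] * q[y[1]] for y in pairs]      # p(y | q)
    total = sum(weights)                              # P(blood type | q)
    return {z: sum(w * y.count(z) for y, w in zip(pairs, weights)) / total
            for z in "abo"}

q = {"a": 0.3, "b": 0.2, "o": 0.5}    # hypothetical allele frequencies
n_B = expected_allele_counts("B", q)
```

For blood type B the counts of b and o sum to 2, as observed on the slide, and the count of a is 0.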
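26 A faster EM for ABO loci
Similarly, $P(A|\lambda) = q_a^2 + 2q_aq_o$, and by the same computation as for B:
$$N_o^A = \frac{2q_aq_o}{q_a^2 + 2q_aq_o}, \qquad N_a^A = \frac{2q_a^2 + 2q_aq_o}{q_a^2 + 2q_aq_o}.$$
$P(AB|\lambda) = P((b,a)|\lambda) + P((a,b)|\lambda) = 2q_aq_b$; $P(O|\lambda) = P((o,o)|\lambda) = q_o^2$.
$N_a^{AB} = N_b^{AB} = 1$ and $N_o^O = 2$.
[$N_b^O = N_a^O = N_o^{AB} = N_b^A = N_a^B = 0$]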
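27 E step: compute $N_a$, $N_b$ and $N_o$
Let #(A)=3, #(B)=5, #(AB)=1, #(O)=2 be the numbers of observations of A, B, AB, and O respectively. Then
$$N_a = \#(A)\,N_a^A + \#(AB)\,N_a^{AB},\qquad N_b = \#(B)\,N_b^B + \#(AB)\,N_b^{AB},\qquad N_o = \#(A)\,N_o^A + \#(B)\,N_o^B + \#(O)\,N_o^O.$$
M step: set $\lambda^* = (q_a^*, q_b^*, q_o^*)$, where
$$q_z^* = \frac{N_z}{N_a + N_b + N_o}\quad\text{for } z \in \{a, b, o\}.$$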
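Putting the E and M steps together gives a compact EM loop for the ABO data. The following is a sketch under the assumptions already made in the lecture (Hardy-Weinberg equilibrium, independent observations); the uniform starting point, the fixed count of 50 iterations and all function names are illustrative choices.

```python
from collections import Counter
from itertools import product

def blood_type_of(pair):
    """Blood type produced by an ordered genotype pair such as ('a', 'o')."""
    alleles = set(pair)
    if alleles == {"a", "b"}:
        return "AB"
    if "a" in alleles:
        return "A"
    if "b" in alleles:
        return "B"
    return "O"

def em_step(counts, q):
    """E step: expected allele counts; M step: normalise them."""
    n = {z: 0.0 for z in "abo"}
    for blood_type, n_obs in counts.items():
        pairs = [y for y in product("abo", repeat=2)
                 if blood_type_of(y) == blood_type]
        total = sum(q[y[0]] * q[y[1]] for y in pairs)    # P(blood type | q)
        for y in pairs:
            posterior = q[y[0]] * q[y[1]] / total        # p(y | blood type, q)
            for z in y:
                n[z] += n_obs * posterior
    norm = sum(n.values())                               # equals 2 * (number of observations)
    return {z: n[z] / norm for z in "abo"}

counts = Counter(["B", "A", "B", "B", "O", "A", "B", "A", "O", "B", "AB"])
start = {"a": 1 / 3, "b": 1 / 3, "o": 1 / 3}             # assumed initial parameters
first = em_step(counts, start)

q = start
for _ in range(50):
    q = em_step(counts, q)
```

From the uniform start, the first update gives $q_a = 5/22$, $q_b = 23/66$, $q_o = 28/66$; the total expected count is 22 because each of the 11 observations contributes two alleles.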
28 EM for stochastic processes with many dices
This time (x,y) is generated by a general stochastic process, which employs r discrete random variables (dices) $Z_1,\dots,Z_r$. A finite process of this type can be viewed as a probabilistic acyclic state machine, where at each state one of the random variables $Z_i$ is sampled, and then the next state is determined, until a final state is reached. As before, we wish to maximize the likelihood of the observation x with hidden data y, i.e., maximize $p(x|\lambda) = \sum_y p(x,y|\lambda)$.
29 EM for processes with many dices
In an HMM, the random variables are the transitions and emissions, with the transition probabilities $a_{kl}$ and the emission probabilities $e_k(b)$.
• x stands for the visible information
• y stands for the sequence s of states
• (x,y) stands for the complete HMM
[Figure: HMM chain of states $s_1, s_2, \dots, s_{L-1}, s_L$ with emissions $X_1, X_2, \dots, X_{L-1}, X_L$.]
As before, we can redefine y so that (x,y) = y.
30 EM for processes with many dices
Each random variable $Z_k$ (k = 1,...,r) has $m_k$ values $z_{k,1},\dots,z_{k,m_k}$ with probabilities $\{q_{kl} \mid l = 1,\dots,m_k\}$. Each y defines a sequence of outcomes $(z_{k_1,l_1},\dots,z_{k_n,l_n})$ of the random variables used in y. In the HMM, these are the specific transitions and emissions, defined by the states and outputs of the sequence y. Let $N_{kl}(y)$ = #($z_{kl}$ appears in y).
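31 EM for processes with many dices
Similarly to the dice case, we have:
$$\log p(y|\lambda) = \sum_{k=1}^{r}\sum_{l=1}^{m_k} N_{kl}(y)\log q_{kl}.$$
Define $N_{kl}$ as the expected value of $N_{kl}(y)$, given x and θ:
$$N_{kl} = E(N_{kl}|x,\theta) = \sum_y p(y|x,\theta)\,N_{kl}(y).$$
Then we have: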
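32 Q(λ) for processes with many dices
$$Q_\theta(\lambda) = \sum_y p(y|x,\theta)\log p(y|\lambda) = \sum_{k=1}^{r}\sum_{l=1}^{m_k} N_{kl}\log q_{kl},$$
where $N_{kl} = \sum_y p(y|x,\theta)\,N_{kl}(y)$.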
33 EM algorithm for processes with many dices
Similarly to the one dice case we get:
Expectation step: set $N_{kl}$ to $E(N_{kl}(y)|x,\theta)$, i.e.,
$$N_{kl} = \sum_y p(y|x,\theta)\,N_{kl}(y).$$
Maximization step: set
$$q_{kl} = \frac{N_{kl}}{\sum_{l'} N_{kl'}}.$$
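The maximization step is a per-variable normalization of the expected counts. A minimal sketch, assuming the expected counts are stored as a dictionary of dictionaries keyed by random variable k and outcome l (a representation chosen here for illustration):

```python
def m_step(expected_counts):
    """q_kl = N_kl / sum over l' of N_kl', computed separately for each variable k."""
    params = {}
    for k, counts in expected_counts.items():
        total = sum(counts.values())
        params[k] = {l: n / total for l, n in counts.items()}
    return params

# Expected counts from the two-coin example: N_H = 1/4, N_T = 7/4.
new_params = m_step({"coin": {"H": 0.25, "T": 1.75}})
```

With the two-coin counts this reproduces the update $q_H = 1/8$, $q_T = 7/8$.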
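34 EM algorithm for n independent observations $x_1,\dots,x_n$: Expectation step
It can be shown that, if the $x_j$ are independent, then:
$$N_{kl} = \sum_{j=1}^{n} E(N_{kl}|x_j,\theta) = \sum_{j=1}^{n}\sum_y p(y|x_j,\theta)\,N_{kl}(y).$$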
35 Application to HMM
For an HMM, the random variables $z_{kl}$ are the state transitions and symbol emissions from state k, and $q_{kl}$ are the corresponding probabilities $a_{kl}$ and $e_k(b)$.
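36 EM algorithm for HMM (the Baum-Welch training)
Expectation step (single observation x):
$A_{kl}$, the expected number of (k,l) transitions, $A_{kl} = \sum_s p(s|x,\theta)\,N_{kl}(x,s)$, is computed by:
$$A_{kl} = \frac{1}{p(x|\theta)}\sum_i f_k(i)\,a_{kl}\,e_l(x_{i+1})\,b_l(i+1).$$
$E_k(b)$, the expected number of emissions of b from state k, $E_k(b) = \sum_s p(s|x,\theta)\,E_k(b;x,s)$, is computed by:
$$E_k(b) = \frac{1}{p(x|\theta)}\sum_{i:\,x_i = b} f_k(i)\,b_k(i),$$
where $f_k(i)$ and $b_k(i)$ are the forward and backward variables.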
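For a very short sequence, the definition $A_{kl} = \sum_s p(s|x,\theta)\,N_{kl}(x,s)$ can be evaluated literally by enumerating every state path. The sketch below does exactly that; it is not the forward-backward computation used by Baum-Welch, and the two-state toy HMM (states "F" and "L", symbols "H" and "T", and all probabilities) is a hypothetical example.

```python
from itertools import product

def expected_transition_counts(x, states, init, trans, emit):
    """A_kl = sum over paths s of p(s | x) * N_kl(s), by brute-force enumeration."""
    joint = {}
    for s in product(states, repeat=len(x)):
        p = init[s[0]] * emit[s[0]][x[0]]
        for i in range(1, len(x)):
            p *= trans[s[i - 1]][s[i]] * emit[s[i]][x[i]]
        joint[s] = p                                  # p(x, s | theta)
    p_x = sum(joint.values())                         # p(x | theta)
    counts = {(k, l): 0.0 for k in states for l in states}
    for s, p in joint.items():
        for i in range(1, len(s)):
            counts[(s[i - 1], s[i])] += p / p_x       # weight by p(s | x, theta)
    return counts, p_x

# Hypothetical toy HMM, used only to exercise the function.
states = ("F", "L")
init = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.2, "L": 0.8}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}

counts, p_x = expected_transition_counts("HHTH", states, init, trans, emit)
```

The enumeration costs time exponential in the sequence length, which is why practical training relies on the forward and backward algorithms; the brute-force version is useful mainly as a correctness check on tiny inputs.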
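37 EM algorithm for HMM (the Baum-Welch training)
Expectation step (n observations $x_1,\dots,x_n$):
$A_{kl}$, the expected number of (k,l) transitions, $A_{kl} = \sum_j\sum_s p(s|x_j,\theta)\,N_{kl}(x_j,s)$, is computed by summing the single-observation formula over the observations:
$$A_{kl} = \sum_j \frac{1}{p(x_j|\theta)}\sum_i f_k^j(i)\,a_{kl}\,e_l(x_{j,i+1})\,b_l^j(i+1).$$
$E_k(b)$, the expected number of emissions of b from state k, $E_k(b) = \sum_j\sum_s p(s|x_j,\theta)\,E_k(b;x_j,s)$, is computed by:
$$E_k(b) = \sum_j \frac{1}{p(x_j|\theta)}\sum_{i:\,x_{j,i} = b} f_k^j(i)\,b_k^j(i),$$
where $f^j$ and $b^j$ are the forward and backward variables for observation $x_j$, and $x_{j,i}$ is the i-th symbol of $x_j$.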
38 EM algorithm for HMM (the Baum-Welch training)
Maximization step: the new parameters are given by:
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}.$$
39 Correctness proof of EM
Theorem: If $\lambda^*$ maximizes $Q(\lambda) = \sum_y p(y|x,\theta)\log p(y|\lambda)$, then $P(x|\lambda^*) \ge P(x|\theta)$.
Comment: The proof remains valid if we assume only that $Q(\lambda^*) \ge Q(\theta)$.
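40 Proof (cont.)
By the definition of conditional probability, for each y we have $p(x|\lambda)\,p(y|x,\lambda) = p(y,x|\lambda)$, and hence:
$$\log p(x|\lambda) = \log p(y,x|\lambda) - \log p(y|x,\lambda).$$
Multiplying both sides by $p(y|x,\theta)$ and summing over y (recall that $\sum_y p(y|x,\theta) = 1$ and that (x,y) = y), we get:
$$\log p(x|\lambda) = \sum_y p(y|x,\theta)\log p(x|\lambda) = \sum_y p(y|x,\theta)\big[\log p(y|\lambda) - \log p(y|x,\lambda)\big].$$
(Next..)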
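41 Proof (end)
$$\log p(x|\lambda) = \sum_y p(y|x,\theta)\log p(y|\lambda) - \sum_y p(y|x,\theta)\log p(y|x,\lambda),$$
where the first sum is $Q_\theta(\lambda)$. Substituting $\lambda = \lambda^*$ and $\lambda = \theta$, and then subtracting, we get
$$\log p(x|\lambda^*) - \log p(x|\theta) = Q(\lambda^*) - Q(\theta) + D\big(p(y|x,\theta)\,\|\,p(y|x,\lambda^*)\big) \ge Q(\lambda^*) - Q(\theta) \ge 0.$$
The first inequality holds because the relative entropy is non-negative, and the second because $\lambda^*$ maximizes $Q(\lambda)$. QED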
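The identity at the heart of the proof can be verified numerically on the two-coin example, with θ given by q_H = 1/4 and λ* by q_H = 1/8. This is an illustrative check, not part of the proof; all function names are assumptions.

```python
import math

HIDDEN = [("T", "H"), ("T", "T")]   # completions y_1, y_2 of x = (T, *)

def joint(y, q_h):
    """p(x, y | q_H) = p(y | q_H) for a toss sequence y."""
    return math.prod(q_h if toss == "H" else 1 - q_h for toss in y)

def likelihood(q_h):
    return sum(joint(y, q_h) for y in HIDDEN)

def posterior(q_h):
    p_x = likelihood(q_h)
    return [joint(y, q_h) / p_x for y in HIDDEN]

def q_function(theta, lam):
    """Q_theta(lambda) = sum_y p(y | x, theta) * log p(x, y | lambda)."""
    return sum(w * math.log(joint(y, lam))
               for w, y in zip(posterior(theta), HIDDEN))

def relative_entropy(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

theta, lam_star = 0.25, 0.125
lhs = math.log(likelihood(lam_star)) - math.log(likelihood(theta))
gain = q_function(theta, lam_star) - q_function(theta, theta)
kl = relative_entropy(posterior(theta), posterior(lam_star))
```

Here `lhs` equals `gain + kl` up to rounding, and both `gain` and `kl` are non-negative, so the likelihood does not decrease.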
42 EM in Practice
Initial parameters:
• Random parameter settings
• "Best" guess from another source
Stopping criteria:
• Small change in likelihood of data
• Small change in parameter values
Avoiding bad local maxima:
• Multiple restarts
• Early "pruning" of unpromising ones
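These practical points can be combined in a small driver. The sketch below assumes the model supplies two callables, `step` (one EM iteration) and `log_likelihood`; the function names, the tolerance, the number of restarts and the random initialisation range are all illustrative choices. It is exercised on the two-coin example, where one EM iteration halves q_H and the likelihood of x = (T,*) is 1 - q_H.

```python
import math
import random

def run_em(step, log_likelihood, params, tol=1e-9, max_iter=1000):
    """Iterate `step` until the log-likelihood gain falls below `tol`."""
    ll = log_likelihood(params)
    for _ in range(max_iter):
        new_params = step(params)
        new_ll = log_likelihood(new_params)
        gain = new_ll - ll
        params, ll = new_params, new_ll
        if gain < tol:                  # stopping criterion: small change in likelihood
            break
    return params, ll

def best_of_restarts(step, log_likelihood, sample_init, n_restarts=5, seed=0):
    """Run EM from several random starting points and keep the best result."""
    rng = random.Random(seed)
    runs = [run_em(step, log_likelihood, sample_init(rng))
            for _ in range(n_restarts)]
    return max(runs, key=lambda run: run[1])

# Two-coin example: for x = (T, *), N_H = q_H and N_H + N_T = 2, so q_H is halved.
best_q_h, best_ll = best_of_restarts(
    step=lambda q_h: q_h / 2,
    log_likelihood=lambda q_h: math.log(1 - q_h),
    sample_init=lambda rng: rng.uniform(0.05, 0.95),
)
```

A stopping rule based on the change in parameter values, or pruning of restarts whose likelihood lags after a few iterations, could be added in the same loop.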