1
EM algorithm and applications
2
Relative Entropy
Let p, q be two probability distributions on the same sample space. The relative entropy between p and q is defined by
H(p||q) = ∑_x p(x) log[p(x)/q(x)] = ∑_x p(x) log(1/q(x)) − ∑_x p(x) log(1/p(x)).
"The inefficiency of assuming distribution q when the correct distribution is p."
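A minimal sketch (not from the slides) of the definition above, using base-2 logarithms; the function name relative_entropy and the example distributions are illustrative only.

```python
import numpy as np

def relative_entropy(p, q):
    """H(p||q) = sum_x p(x) * log2(p(x)/q(x)).

    p, q: probability vectors over the same sample space.
    Terms with p(x) = 0 contribute 0 by the usual convention.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# The extra bits per symbol paid for coding data from p with a code built for q.
p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(relative_entropy(p, q))  # > 0 whenever p != q
print(relative_entropy(p, p))  # = 0
```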
3
EM algorithm: approximating the MLE from incomplete data. Finding the MLE parameters is a nonlinear optimization problem. General idea of EM: use the "current point" θ to construct an alternative function Q_θ (which is "nice"); if Q_θ(λ) > Q_θ(θ), then λ has a higher likelihood than θ. (The slide's figure plots the curves log P(x|λ) and E[log P(x,y|λ)] against λ, with the current point θ and the improved point λ marked.)
4
The EM algorithm
Consider a model where, for observed data x and model parameters θ: p(x|θ) = ∑_y p(x,y|θ) (y are the "hidden data"). The EM algorithm receives x and parameters θ, and returns new parameters λ*, s.t. p(x|λ*) > p(x|θ).
5
The EM algorithm
Finding λ* which maximizes p(x|λ*) = ∑_y p(x,y|λ*) is equivalent to finding λ* which maximizes the logarithm
log p(x|λ*) = log(∑_y p(x,y|λ*)),
which is what the EM algorithm attempts to do.
6
The EM algorithm
In each iteration the EM algorithm does the following.
(E step): Calculate Q_θ(λ) = ∑_y p(y|x,θ) log p(x,y|λ).
(M step): Find λ* which maximizes Q_θ(λ).
(The next iteration sets θ ← λ* and repeats.)
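A minimal sketch of this iteration for a model whose hidden data y range over a finite set; the helpers joint_prob (computing p(x,y|λ)) and maximize_Q (the problem-specific M step) are hypothetical placeholders, not part of the slides.

```python
import numpy as np

def em_step(x, theta, hidden_values, joint_prob, maximize_Q):
    """One EM iteration for a model with finitely many hidden values y."""
    # E step: posterior weights p(y | x, theta) = p(x,y|theta) / p(x|theta).
    joint = np.array([joint_prob(x, y, theta) for y in hidden_values])
    posterior = joint / joint.sum()

    # Q_theta(lam) = sum_y p(y|x,theta) * log p(x,y|lam).
    def Q(lam):
        return sum(w * np.log(joint_prob(x, y, lam))
                   for w, y in zip(posterior, hidden_values))

    # M step: problem-specific maximization of Q over lam.
    return maximize_Q(Q, posterior, hidden_values)

def run_em(x, theta0, hidden_values, joint_prob, maximize_Q, n_iter=20):
    theta = theta0
    for _ in range(n_iter):
        theta = em_step(x, theta, hidden_values, joint_prob, maximize_Q)
    return theta
```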
7
Example: Baum-Welch = EM for HMM
The Baum-Welch algorithm is the EM algorithm for HMMs. E step for HMM:
Q_θ(λ) = ∑_s p(s|x,θ) log p(s,x|λ),
where λ are the new parameters {a_kl, e_k(b)}. (The A_kl and E_k(b) are the counts of state transitions and symbol emissions in (s,x).)
8
Baum-Welch = EM for HMM
M step for HMM: Find λ* which maximizes Q_θ(λ). As we proved, λ* is given by the relative frequencies of the A_kl's and the E_k(b)'s.
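A small sketch of this M step only, assuming the expected transition counts A[k, l] and expected emission counts E[k, b] have already been computed in the E step (e.g. by the forward-backward algorithm); the numbers in the example are illustrative.

```python
import numpy as np

def baum_welch_m_step(A, E):
    """Re-estimate HMM parameters from expected counts.

    A: shape (K, K); A[k, l] = expected number of k -> l transitions.
    E: shape (K, B); E[k, b] = expected number of emissions of symbol b in state k.
    Returns (a, e): transition and emission probabilities as relative frequencies.
    """
    a = A / A.sum(axis=1, keepdims=True)
    e = E / E.sum(axis=1, keepdims=True)
    return a, e

# Illustrative expected counts for 2 states over a 2-letter alphabet.
A = np.array([[8.0, 2.0], [3.0, 7.0]])
E = np.array([[6.0, 4.0], [1.0, 9.0]])
a, e = baum_welch_m_step(A, E)
print(a)  # each row sums to 1
print(e)
```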
9
A simple example: EM for 2 coin tosses
Given a coin with two possible outcomes: H (head) and T (tail), with probabilities q_H, q_T = 1 − q_H. The coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T,*). We wish to apply the EM algorithm to get parameters that increase the likelihood of the data. Let the initial parameters be θ = (q_H, q_T) = (¼, ¾).
10
EM for 2 coin tosses
The hidden data which can produce x are the sequences y_1 = (T,H) and y_2 = (T,T). Hence the likelihood of x with parameters (q_H, q_T) is
p(x|θ) = P(x,y_1|θ) + P(x,y_2|θ) = q_H q_T + q_T².
For the initial parameters θ = (¼, ¾), we have: p(x|θ) = ¾·¼ + ¾·¾ = ¾.
Note that in this case P(x,y_i|θ) = P(y_i|θ), for i = 1,2; we can always define y so that (x,y) ≡ y (otherwise we set y' ≡ (x,y) and replace the "y"s by "y'"s).
11
EM for 2 coin tosses - E step
Calculate Q_θ(λ) = Q_θ(q_H, q_T):
Q_θ(λ) = p(y_1|x,θ) log p(x,y_1|λ) + p(y_2|x,θ) log p(x,y_2|λ)
p(y_1|x,θ) = p(y_1,x|θ)/p(x|θ) = (¾·¼)/(¾) = ¼
p(y_2|x,θ) = p(y_2,x|θ)/p(x|θ) = (¾·¾)/(¾) = ¾
Thus we have Q_θ(λ) = ¼ log p(x,y_1|λ) + ¾ log p(x,y_2|λ).
12
EM for 2 coin tosses - E step
For a sequence y of coin tosses, let N_H(y) be the number of H's in y, and N_T(y) be the number of T's in y. Then
log p(y|λ) = N_H(y) log q_H + N_T(y) log q_T.
In our example: y_1 = (T,H); y_2 = (T,T), hence:
N_H(y_1) = N_T(y_1) = 1, N_H(y_2) = 0, N_T(y_2) = 2.
13
Example: 2 coin tosses - E step
Thus
¼ log p(x,y_1|λ) = ¼ (N_H(y_1) log q_H + N_T(y_1) log q_T) = ¼ (log q_H + log q_T)
¾ log p(x,y_2|λ) = ¾ (N_H(y_2) log q_H + N_T(y_2) log q_T) = ¾ (2 log q_T)
Substituting in the equation for Q_θ(λ):
Q_θ(λ) = ¼ log p(x,y_1|λ) + ¾ log p(x,y_2|λ) = (¼ N_H(y_1) + ¾ N_H(y_2)) log q_H + (¼ N_T(y_1) + ¾ N_T(y_2)) log q_T = N_H log q_H + N_T log q_T,
where N_H = ¼ and N_T = 7/4.
14
EM for 2 coin tosses - M step
Find λ* which maximizes Q_θ(λ) = N_H log q_H + N_T log q_T = ¼ log q_H + 7/4 log q_T.
We saw earlier that this is maximized by the relative frequencies:
q_H* = N_H / (N_H + N_T) = 1/8, q_T* = N_T / (N_H + N_T) = 7/8.
The optimal parameters (0,1) will never be reached by the EM algorithm!
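A short sketch (not from the slides) that runs this two-toss example for several EM iterations; it reproduces N_H = 1/4, N_T = 7/4 on the first iteration and shows q_H shrinking toward 0 without ever reaching it.

```python
def em_two_coin_tosses(q_H=0.25, n_iter=10):
    """EM for a coin tossed twice where only the first outcome, T, is observed."""
    for it in range(n_iter):
        q_T = 1.0 - q_H
        # Hidden completions of x = (T, *): y1 = (T, H), y2 = (T, T).
        p_y1 = q_T * q_H                   # p(x, y1 | theta)
        p_y2 = q_T * q_T                   # p(x, y2 | theta)
        p_x = p_y1 + p_y2                  # p(x | theta)
        w1, w2 = p_y1 / p_x, p_y2 / p_x    # E step: p(y | x, theta)
        # Expected counts of heads and tails.
        N_H = w1 * 1 + w2 * 0
        N_T = w1 * 1 + w2 * 2
        # M step: relative frequencies.
        q_H = N_H / (N_H + N_T)
        print(f"iteration {it}: N_H={N_H:.4f}, N_T={N_T:.4f}, new q_H={q_H:.4f}")
    return q_H

em_two_coin_tosses()   # first iteration: N_H=0.25, N_T=1.75, new q_H=0.125
```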
15
EM for single random variable (dice)
Now the probability of each y (≡ (x,y)) is given by a sequence of dice tosses. The dice has m outcomes, with probabilities q_1,…,q_m. Let N_l(y) = #(times outcome l occurs in y). Then
log p(y|λ) = ∑_l N_l(y) log q_l.
Let N_l be the expected value of N_l(y), given x and θ:
N_l = E(N_l|x,θ) = ∑_y p(y|x,θ) N_l(y).
16
Q_θ(λ) for one dice
As in the coin example, Q_θ(λ) = ∑_y p(y|x,θ) log p(x,y|λ) = ∑_{l=1}^m N_l log q_l, which is maximized by the relative frequencies q_l* = N_l / (N_1 + … + N_m).
17
EM algorithm for n independent observations x_1,…,x_n: Expectation step
It can be shown that, if the x_j are independent, then the expected counts decompose by observation:
N_l = ∑_{j=1}^n E(N_l | x_j, θ) = ∑_{j=1}^n ∑_{y_j} p(y_j|x_j,θ) N_l(y_j),
so each observation's contribution can be computed separately and then summed.
18
Example: The ABO locus
A locus is a particular place on the chromosome. Each locus' state (called its genotype) consists of two alleles – one paternal and one maternal. Some loci (plural of locus) determine distinguished features; the ABO locus, for example, determines blood type. The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportions in a population of the 6 genotypes. Suppose we randomly sampled N individuals and found that N_{a/a} have genotype a/a, N_{a/b} have genotype a/b, etc. Then the MLE is given by the observed relative frequencies: q_{a/a} = N_{a/a}/N, q_{a/b} = N_{a/b}/N, and so on.
19
The ABO locus
However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)? The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?
20
The ABO locus
The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles q_a, q_b, q_o in the population determine the frequencies of the genotypes as follows:
q_{a/b} = 2 q_a q_b, q_{a/o} = 2 q_a q_o, q_{b/o} = 2 q_b q_o, q_{a/a} = (q_a)², q_{b/b} = (q_b)², q_{o/o} = (q_o)².
In fact, the Hardy-Weinberg equilibrium rule follows from modeling this problem as observed data x with hidden data y.
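A tiny sketch of the rule above (the allele frequencies in the example are illustrative only).

```python
def genotype_frequencies(q_a, q_b, q_o):
    """Genotype frequencies implied by allele frequencies under Hardy-Weinberg equilibrium."""
    assert abs(q_a + q_b + q_o - 1.0) < 1e-9
    return {
        "a/a": q_a ** 2,
        "a/o": 2 * q_a * q_o,
        "b/b": q_b ** 2,
        "b/o": 2 * q_b * q_o,
        "a/b": 2 * q_a * q_b,
        "o/o": q_o ** 2,
    }

freqs = genotype_frequencies(0.3, 0.2, 0.5)   # illustrative allele frequencies
print(freqs)
print(sum(freqs.values()))  # sums to 1
```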
21
The ABO locus
The dice's outcomes are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O. Each blood type is determined by two successive random samplings of alleles, i.e. by an "ordered genotype pair" – this is the hidden data. For instance, blood type A corresponds to the ordered genotype pairs (a,a), (a,o) and (o,a). So we have three parameters of one dice – q_a, q_b, q_o – that we need to estimate.
22
EM setting for the ABO locus problem
The observed data x = (x_1,…,x_n) is a sequence of letters (blood types) from the alphabet {A, B, AB, O}, e.g. (B,A,B,B,O,A,B,A,O,B,AB) are observations (x_1,…,x_11). The hidden data (i.e. the y's) for each letter x_j is the set of ordered pairs of alleles that can generate it; for instance, for A it is the set {(a,a), (a,o), (o,a)}. The parameters λ = {q_a, q_b, q_o} are the probabilities of the alleles. We need to find the parameters λ* = {q_a*, q_b*, q_o*} that maximize the likelihood of the given data. We do this by the EM algorithm:
23
EM for ABO locus problem
For each observed blood type x_j ∈ {A, B, AB, O} and for each allele z ∈ {a, b, o} we compute N_z(x_j), the expected number of times that z appears in x_j:
N_z(x_j) = ∑_{y_j} p(y_j|x_j,θ) N_z(y_j),
where the sum is taken over the ordered "genotype pairs" y_j that can generate x_j, and N_z(y_j) is the number of times allele z occurs in the pair y_j. E.g., N_a((o,b)) = 0; N_b((o,b)) = N_o((o,b)) = 1.
24
EM for ABO locus problem
The computation for blood type B:
P(B|θ) = P((b,b)|θ) + P((b,o)|θ) + P((o,b)|θ) = q_b² + 2 q_b q_o.
Since N_b((b,b)) = 2, and N_b((b,o)) = N_b((o,b)) = N_o((o,b)) = N_o((b,o)) = 1, the expected numbers of occurrences of b and o in a type-B individual are:
N_b(B) = (2 q_b² + 2 q_b q_o) / (q_b² + 2 q_b q_o),  N_o(B) = 2 q_b q_o / (q_b² + 2 q_b q_o).
Observe that N_b(B) + N_o(B) = 2.
25
EM for ABO loci
Similarly, P(A|θ) = q_a² + 2 q_a q_o; P(AB|θ) = P((b,a)|θ) + P((a,b)|θ) = 2 q_a q_b; P(O|θ) = P((o,o)|θ) = q_o².
N_a(AB) = N_b(AB) = 1, N_o(O) = 2, and N_b(O) = N_a(O) = N_o(AB) = N_b(A) = N_a(B) = 0.
(N_a(A) and N_o(A) are obtained exactly as N_b(B) and N_o(B), with a in place of b.)
26
E step: compute N_a, N_b and N_o
Let #(A) = 3, #(B) = 5, #(AB) = 1, #(O) = 2 be the number of observations of A, B, AB, and O respectively. Then
N_a = #(A)·N_a(A) + #(AB)·N_a(AB), N_b = #(B)·N_b(B) + #(AB)·N_b(AB), N_o = #(A)·N_o(A) + #(B)·N_o(B) + #(O)·N_o(O).
M step: set λ* = (q_a*, q_b*, q_o*) to the relative frequencies
q_z* = N_z / (N_a + N_b + N_o) for z ∈ {a, b, o}, where N_a + N_b + N_o = 2n (two alleles per individual).
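A sketch (not from the slides) of the full EM loop for this example, using the observation counts above and an arbitrary starting point; the structure follows the E and M steps just described.

```python
def em_abo(counts, q_a=1/3, q_b=1/3, n_iter=20):
    """EM for the allele frequencies q_a, q_b, q_o given blood-type counts.

    counts: dict with keys "A", "B", "AB", "O" giving the number of
    individuals observed with each blood type.
    """
    for _ in range(n_iter):
        q_o = 1.0 - q_a - q_b
        # E step: expected allele counts per individual of each blood type.
        pA = q_a ** 2 + 2 * q_a * q_o
        pB = q_b ** 2 + 2 * q_b * q_o
        Na_A = (2 * q_a ** 2 + 2 * q_a * q_o) / pA   # expected a's in a type-A individual
        No_A = (2 * q_a * q_o) / pA
        Nb_B = (2 * q_b ** 2 + 2 * q_b * q_o) / pB
        No_B = (2 * q_b * q_o) / pB
        N_a = counts["A"] * Na_A + counts["AB"] * 1
        N_b = counts["B"] * Nb_B + counts["AB"] * 1
        N_o = counts["A"] * No_A + counts["B"] * No_B + counts["O"] * 2
        # M step: relative frequencies (N_a + N_b + N_o = 2 * number of individuals).
        total = N_a + N_b + N_o
        q_a, q_b = N_a / total, N_b / total
    return q_a, q_b, 1.0 - q_a - q_b

print(em_abo({"A": 3, "B": 5, "AB": 1, "O": 2}))
```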
27
Example: the Motif Finding Problem
Given a set of DNA sequences:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
Find the motif in each of the individual sequences.
28
Motif Finding Problem: a reformulation
Collect all substrings of the same length k from the input sequences. With N sequences of the same length L, n = N(L − k + 1) substrings can be derived. Find a significant number of substrings that can be described by a profile model.
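A small sketch of this reformulation step (the two sequences shown and k = 8 are illustrative): collect all length-k substrings from the input; each one is a candidate occurrence of the motif.

```python
def all_kmers(sequences, k):
    """Collect every length-k substring from each input sequence."""
    kmers = []
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.append(seq[i:i + k])
    return kmers

sequences = [
    "cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
]
kmers = all_kmers(sequences, k=8)
print(len(kmers))  # N * (L - k + 1) substrings
```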
29
Fitting a mixture model by EM
A finite mixture model: the data X = (X_1,…,X_n) arise from two or more groups, with g component models θ = (θ_1,…,θ_g). Indicator vectors Z = (Z_1,…,Z_n), where Z_i = (Z_i1,…,Z_ig), and Z_ij = 1 if X_i is from group j, and 0 otherwise; for any given i, all Z_ij are 0 except one j. The mixing proportions are λ = (λ_1,…,λ_g), with P(Z_ij = 1 | λ) = λ_j. Here g = 2: class 1 (the motif) is given by position-specific multinomial distributions, and class 2 (the background) by a general multinomial distribution.
30
Complete data likelihood
Under the assumption that the pairs (Z_i, X_i) are mutually independent, their joint density may be written
P(Z, X | λ, θ) = ∏_i ∏_j [λ_j P(X_i|θ_j)]^{Z_ij}.
The log likelihood of the model is thus
log L(λ, θ | Z, X) = ∑_i ∑_j Z_ij log[λ_j P(X_i|θ_j)].
The EM algorithm iteratively computes the expectation of this complete-data log likelihood given the observed data X and initial estimates λ' and θ' of λ and θ (the E-step), and then maximizes the result in the free variables λ and θ, leading to new estimates λ'' and θ'' (the M-step).
31
Mixture models: the E-step
Since the log likelihood is a sum over i and j of terms multiplying Z_ij, and these are independent across i, we need only consider the expectation of one such term, given X_i. Using initial parameter values λ' and θ', and the fact that the Z_ij are binary, we get
E(Z_ij | X, λ', θ') = λ'_j P(X_i|θ'_j) / ∑_k λ'_k P(X_i|θ'_k) = Z'_ij.
32
Mixture models: the M-step
Now we want to maximize the result of an E-step:
∑_i ∑_j Z'_ij log λ_j + ∑_i ∑_j Z'_ij log P(X_i|θ_j).
The maximization over λ is independent of the rest and is readily achieved by
λ''_j = ∑_i Z'_ij / n.
33
Mixture models: the M-step
Note that P(X_i|θ_1) = ∏_j ∏_k f_jk^{I(k,X_ij)} and P(X_i|θ_2) = ∏_j ∏_k f_0k^{I(k,X_ij)}, where X_ij is the letter in the jth position of sample i, and I(k,a) = 1 if a = a_k, and 0 otherwise. Let
c_0k = ∑_i ∑_j Z'_i2 I(k, X_ij)  and  c_jk = ∑_i Z'_i1 I(k, X_ij).
Here c_0k is the expected number of times letter a_k appears in the background, and c_jk the expected number of times a_k appears at position j of occurrences of the motif in the data. The maximization over the letter frequencies gives
f''_jk = c_jk / ∑_{k=1}^L c_jk,   j = 0, 1,…,w; k = 1,…,L.
In practice, care must be taken to avoid zero frequencies, so one adds pseudo-counts: small constants β_k with ∑_{k=1}^L β_k = β, giving
f''_jk = (c_jk + β_k) / (∑_{k=1}^L c_jk + β),   j = 0, 1,…,w; k = 1,…,L.
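To tie the pieces together, a compact sketch (not from the slides) of the E and M steps above for DNA substrings, with class 1 the motif (position-specific frequencies f[j, k]) and class 2 the background (f0[k]); the pseudo-count value, the random initialization, and the helper one_hot are illustrative choices only.

```python
import numpy as np

ALPHABET = "acgt"

def one_hot(substrings):
    """I[i, j, k] = 1 iff substring i has alphabet letter k at position j."""
    n, w = len(substrings), len(substrings[0])
    I = np.zeros((n, w, len(ALPHABET)))
    for i, s in enumerate(substrings):
        for j, ch in enumerate(s):
            I[i, j, ALPHABET.index(ch)] = 1.0
    return I

def mixture_em(substrings, n_iter=50, pseudo=0.1, seed=0):
    """Two-component mixture (class 1 = motif, class 2 = background) fitted by EM."""
    rng = np.random.default_rng(seed)
    I = one_hot(substrings)                       # shape (n, w, L)
    n, w, L = I.shape
    lam = np.array([0.5, 0.5])                    # mixing proportions lambda_1, lambda_2
    f = rng.dirichlet(np.ones(L), size=w)         # motif model: f[j, k], position-specific
    f0 = np.full(L, 1.0 / L)                      # background model: f0[k]
    for _ in range(n_iter):
        # E step: Z'_i1 proportional to lam_1 * prod_j f[j, X_ij]; similarly for background.
        p1 = lam[0] * np.exp((I * np.log(f)).sum(axis=(1, 2)))
        p2 = lam[1] * np.exp((I * np.log(f0)).sum(axis=(1, 2)))
        z1 = p1 / (p1 + p2)
        z2 = 1.0 - z1
        # M step: mixing proportions lambda''_j = sum_i Z'_ij / n ...
        lam = np.array([z1.mean(), z2.mean()])
        # ... and letter frequencies from expected counts, with pseudo-counts.
        c = (z1[:, None, None] * I).sum(axis=0)          # c[j, k]: expected motif counts
        c0 = (z2[:, None, None] * I).sum(axis=(0, 1))    # c0[k]: expected background counts
        f = (c + pseudo) / (c + pseudo).sum(axis=1, keepdims=True)
        f0 = (c0 + pseudo) / (c0 + pseudo).sum()
    return lam, f, f0
```

Applied to the substrings produced by the reformulation step (e.g. mixture_em(all_kmers(sequences, 8))), the learned f approximates the motif profile and lam[0] the fraction of substrings explained by the motif class.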