Parameter Estimation and Relative Entropy. Lecture #8. Background Readings: Chapters 3.3, 11.2 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.

2 Reminder: Hidden Markov Model. Markov chain transition probabilities: p(S_{i+1} = t | S_i = s) = a_st. Emission probabilities: p(X_i = b | S_i = s) = e_s(b). [Diagram: state chain S_1 → S_2 → … → S_{L-1} → S_L, each state S_i emitting symbol x_i.]

3 Reminder: Finding ML parameters for HMM when paths are known. Let A_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We look for parameters θ = {a_kl, e_k(b)} that maximize the likelihood of the training data.

4 Optimal ML parameters when all paths are known. The optimal ML parameters θ are obtained by normalizing the counts: a_kl = A_kl / Σ_l' A_kl' and e_k(b) = E_k(b) / Σ_b' E_k(b').
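The known-path case above amounts to counting and normalizing. A minimal sketch in Python (the state names, symbols and training pairs are invented for illustration):

```python
# ML estimation for an HMM when the state paths are known: count transitions
# A_kl and emissions E_k(b) in the training set, then normalize each row.
from collections import defaultdict

def ml_estimate(training):
    """training: list of (path, sequence) pairs, e.g. ('kkl', 'aab')."""
    A = defaultdict(lambda: defaultdict(int))   # A[k][l] = #(k -> l transitions)
    E = defaultdict(lambda: defaultdict(int))   # E[k][b] = #(b emitted from k)
    for path, seq in training:
        for i in range(len(path) - 1):
            A[path[i]][path[i + 1]] += 1
        for s, b in zip(path, seq):
            E[s][b] += 1
    # a_kl = A_kl / sum_l' A_kl',  e_k(b) = E_k(b) / sum_b' E_k(b')
    a = {k: {l: n / sum(row.values()) for l, n in row.items()} for k, row in A.items()}
    e = {k: {b: n / sum(row.values()) for b, n in row.items()} for k, row in E.items()}
    return a, e
```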

5 Case 2: Finding ML parameters when state paths are unknown. In this case only the values of the x_i's of the input sequences are known. This is an ML problem with "missing data". We wish to find θ* so that p(x^1,..., x^n | θ*) = MAX_θ {p(x^1,..., x^n | θ)}.

6 Case 2: State paths are unknown. For a given θ we have: p(x^1,..., x^n | θ) = p(x^1 | θ) · · · p(x^n | θ) (since the x^j are independent). For each sequence x, p(x|θ) = Σ_s p(x,s|θ), the sum taken over all state paths s which emit x.

7 Case 2: State paths are unknown. For the n sequences (x^1,..., x^n): p(x^1,..., x^n | θ) = Σ p(x^1,..., x^n, s^1,..., s^n | θ), where the summation is taken over all tuples of n state paths (s^1,..., s^n) which generate (x^1,..., x^n). We will assume that n = 1.

8 Case 2: State paths are unknown. So we need to maximize p(x|θ) = Σ_s p(x,s|θ), where the summation is over all state paths s which produce the output sequence x. Finding θ* which maximizes Σ_s p(x,s|θ) is hard. [Unlike finding θ* which maximizes p(x,s|θ) for a single pair (x,s).]

9 ML Parameter Estimation for HMM. The general process for finding θ in this case is: 1. Start with an initial value of θ. 2. Find θ' so that p(x|θ') > p(x|θ). 3. Set θ = θ'. 4. Repeat until some convergence criterion is met. A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.

10 Baum-Welch training. We start with some values of a_kl and e_k(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* s.t. p(x|θ*) > p(x|θ). This is done by "imitating" the algorithm for Case 1, where all states are known.

11 Baum-Welch training. In Case 1 we computed the optimal values of a_kl and e_k(b) (for the optimal θ) by simply counting the number A_kl of transitions from state k to state l, and the number E_k(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.

12 Baum-Welch training. When the states are unknown, the counting process is replaced by an averaging process: for each edge s_{i-1} → s_i we compute the expected number of "k to l" transitions, for all possible pairs (k,l), over this edge. Then, for each k and l, we take A_kl to be the sum over all edges.

13 Baum-Welch training. Similarly, for each emission s_i → x_i and each state k, we compute the expected number of times that s_i = k, which is the expected number of "k → b" emissions on this edge (where b = x_i). Then we take E_k(b) to be the sum over all such edges. These expected values are computed as follows:

14 Baum-Welch: step 1a. Count the expected number of state transitions. For each i and for each pair (k,l), compute the posterior state transition probabilities P(s_{i-1}=k, s_i=l | x, θ). For this, we use the forward and backward algorithms.

15 Reminder: finding posterior state probabilities. f_k(i) = p(x_1,…,x_i, s_i=k), the probability that a path emits (x_1,...,x_i) and state s_i = k. b_k(i) = p(x_{i+1},…,x_L | s_i=k), the probability that a path emits (x_{i+1},...,x_L), given that state s_i = k. Then p(s_i=k, x) = f_k(i) b_k(i), since by the Markov property the later emissions depend on the earlier ones only through s_i. The values {f_k(i), b_k(i)} for every i, k are computed by one run each of the forward and backward algorithms.
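The forward and backward values can be computed with the standard recursions. The sketch below assumes a uniform initial state distribution (the slides do not fix one) and represents a and e as nested dicts; for real sequences one would work in log space or with scaling to avoid underflow:

```python
# Forward/backward recursions for an HMM.
# f[i][k] = p(x_1..x_{i+1}, s_{i+1} = k); b[i][k] = p(x_{i+2}..x_L | s_{i+1} = k)
# (0-based list positions for the 1-based indices of the slides).

def forward(x, states, a, e):
    f = [{k: e[k][x[0]] / len(states) for k in states}]   # uniform start (assumed)
    for c in x[1:]:
        f.append({l: e[l][c] * sum(f[-1][k] * a[k][l] for k in states)
                  for l in states})
    return f

def backward(x, states, a, e):
    b = [{k: 1.0 for k in states}]                        # b_k(L) = 1
    for c in reversed(x[1:]):
        b.insert(0, {k: sum(a[k][l] * e[l][c] * b[0][l] for l in states)
                     for k in states})
    return b
```

A useful sanity check is that Σ_k f_k(i) b_k(i) = p(x|θ) gives the same value for every position i.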

16 Baum-Welch: Step 1a (cont). Claim: P(s_{i-1}=k, s_i=l | x, θ) = f_k(i-1) a_kl e_l(x_i) b_l(i) / p(x|θ). (Here a_kl and e_l(x_i) are the parameters defined by θ, and f_k(i-1), b_l(i) are the forward and backward functions.)

17 Step 1a: Computing P(s_{i-1}=k, s_i=l | x, θ). P(x_1,…,x_L, s_{i-1}=k, s_i=l | θ) = P(x_1,…,x_{i-1}, s_{i-1}=k | θ) · a_kl e_l(x_i) · P(x_{i+1},…,x_L | s_i=l, θ) = f_k(i-1) a_kl e_l(x_i) b_l(i), where the first factor is computed via the forward algorithm and the last via the backward algorithm. Dividing by p(x|θ) gives p(s_{i-1}=k, s_i=l | x, θ) = f_k(i-1) a_kl e_l(x_i) b_l(i) / p(x|θ).

18 Step 1a (end). For each pair (k,l), compute the expected number A_kl of state transitions from k to l, as the sum of the expected numbers over all L edges: A_kl = Σ_i P(s_{i-1}=k, s_i=l | x, θ) = (1/p(x|θ)) Σ_i f_k(i-1) a_kl e_l(x_i) b_l(i).

19 Step 1a for many sequences. Exercise: Prove that when we have n input sequences (x^1,..., x^n), then A_kl is given by the sum of the per-sequence expected counts: A_kl = Σ_{j=1}^{n} (1/p(x^j|θ)) Σ_i f_k^j(i-1) a_kl e_l(x^j_i) b_l^j(i), where f^j, b^j are the forward and backward functions computed for sequence x^j.

20 Baum-Welch: Step 1b. Count the expected number of symbol emissions. For each state k and each symbol b: for each i where x_i = b, compute the expected number of times that s_i = k, i.e., the posterior probability P(s_i=k | x, θ) = f_k(i) b_k(i) / p(x|θ).

21 Baum-Welch: Step 1b. For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that s_i = k, over all i's for which x_i = b: E_k(b) = (1/p(x|θ)) Σ_{i: x_i=b} f_k(i) b_k(i).

22 Step 1b for many sequences. Exercise: when we have n sequences (x^1,..., x^n), the expected number of emissions of b from k is given by the sum of the per-sequence expected counts: E_k(b) = Σ_{j=1}^{n} (1/p(x^j|θ)) Σ_{i: x^j_i=b} f_k^j(i) b_k^j(i).

23 Summary of Steps 1a and 1b: the E part of the Baum-Welch training. These steps compute the expected numbers A_kl of k→l transitions for all pairs of states k and l, and the expected numbers E_k(b) of emissions of symbol b from state k, for all states k and symbols b. The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

24 Baum-Welch: step 2. Use the A_kl's and E_k(b)'s to compute the new values of a_kl and e_k(b) by normalizing: a_kl = A_kl / Σ_l' A_kl', e_k(b) = E_k(b) / Σ_b' E_k(b'). These values define θ*. The correctness of the EM algorithm implies that p(x^1,..., x^n | θ*) ≥ p(x^1,..., x^n | θ), i.e., θ* does not decrease the probability of the data. This procedure is iterated until some convergence criterion is met.
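Steps 1a, 1b and 2 for a single sequence can be combined into one function. The following is an illustrative sketch, assuming a uniform initial state distribution and using unscaled probabilities (adequate only for short toy sequences; real implementations use log space or scaling):

```python
# One Baum-Welch iteration (E step + M step) for a single sequence x.
# Returns updated (a, e) with p(x | theta') >= p(x | theta), plus p(x | theta).
from collections import defaultdict

def baum_welch_step(x, states, a, e):
    L = len(x)
    # forward: f[i][k] = p(x_1..x_{i+1}, s_{i+1} = k), uniform start (assumed)
    f = [{k: e[k][x[0]] / len(states) for k in states}]
    for c in x[1:]:
        f.append({l: e[l][c] * sum(f[-1][k] * a[k][l] for k in states)
                  for l in states})
    # backward: b[i][k] = p(x_{i+2}..x_L | s_{i+1} = k)
    b = [{k: 1.0 for k in states} for _ in range(L)]
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    px = sum(f[L - 1][k] for k in states)          # p(x | theta)
    # E step: expected transition counts A_kl and emission counts E_k(b)
    A = {k: {l: sum(f[i - 1][k] * a[k][l] * e[l][x[i]] * b[i][l]
                    for i in range(1, L)) / px for l in states}
         for k in states}
    E = defaultdict(lambda: defaultdict(float))
    for i in range(L):
        for k in states:
            E[k][x[i]] += f[i][k] * b[i][k] / px
    # M step: normalize, exactly as in the known-path case
    a2 = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e2 = {k: {s: E[k][s] / sum(E[k].values()) for s in E[k]} for k in states}
    return a2, e2, px
```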

25 Viterbi training: maximizing the probability of the most probable path. States are unknown. Viterbi training attempts to maximize the probability of a most probable path, i.e., the value of p(s(x^1),..,s(x^n), x^1,..,x^n | θ), where s(x^j) is the most probable (under θ) path for x^j. We assume only one sequence (n = 1).

26 Viterbi training (cont). Start from given values of a_kl and e_k(b), which define prior values of θ. Each iteration: Step 1: Use Viterbi's algorithm to find a most probable path s(x), which maximizes p(s(x), x|θ).

27 Viterbi training (cont). Step 2: Use the ML method for HMM with known paths to find θ* which maximizes p(s(x), x|θ*). Note: In Step 1 the maximizing argument is the path s(x); in Step 2 it is the parameters θ*.

28 Viterbi training (cont). Step 3: Set θ = θ*, and repeat. Stop when the paths are not changed. Claim 2 (Exercise): If s(x) is the optimal path in Step 1 of two different iterations, then in both iterations θ has the same values, and hence p(s(x), x|θ) will not increase in any later iteration. Hence the algorithm can terminate in this case.

29 Viterbi training (end). Exercise: generalize the algorithm to the case where there are n training sequences x^1,..,x^n: find paths {s(x^1),..,s(x^n)} and parameters θ so that {s(x^1),..,s(x^n)} are most probable paths for x^1,..,x^n under θ.
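The whole Viterbi-training loop for one sequence can be sketched as follows. Two assumptions not in the slides: a uniform initial state distribution, and +1 pseudocounts in the re-estimation step so that no probability becomes zero:

```python
# Illustrative Viterbi training: alternate between the most probable path
# (Step 1) and ML re-estimation from that path as if it were known (Step 2).
def viterbi(x, states, a, e):
    # v[k] = max over state paths ending in k of p(path, x_1..x_i)
    v = {k: e[k][x[0]] / len(states) for k in states}   # uniform start (assumed)
    ptr = []                                            # back-pointers
    for c in x[1:]:
        step = {l: max((v[k] * a[k][l], k) for k in states) for l in states}
        ptr.append({l: step[l][1] for l in states})
        v = {l: e[l][c] * step[l][0] for l in states}
    path = [max(v, key=v.get)]
    for back in reversed(ptr):                          # trace back
        path.append(back[path[-1]])
    return ''.join(reversed(path))

def viterbi_training(x, states, symbols, a, e, rounds=10):
    prev = None
    for _ in range(rounds):
        s = viterbi(x, states, a, e)                    # Step 1
        if s == prev:                                   # paths unchanged (Claim 2)
            break
        prev = s
        # Step 2: count along s with +1 pseudocounts, then normalize
        A = {k: {l: 1 + sum(s[i] == k and s[i + 1] == l
                            for i in range(len(s) - 1)) for l in states}
             for k in states}
        a = {k: {l: n / sum(A[k].values()) for l, n in A[k].items()}
             for k in states}
        E = {k: {b: 1 + sum(si == k and xi == b for si, xi in zip(s, x))
                 for b in symbols} for k in states}
        e = {k: {b: n / sum(E[k].values()) for b, n in E[k].items()}
             for k in states}
    return s, a, e
```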

30 The EM algorithm. Baum-Welch training is a special case of a general algorithm for approximating ML parameters in the case of "missing data": the EM algorithm. The correctness proof of the EM algorithm uses the concept of "relative entropy". Next we define this concept.

31 Entropy: Definition. Consider a probability space X of k events x_1,..,x_k. The Shannon entropy H(X) is defined by: H(X) = -Σ_i p(x_i) log(p(x_i)) = Σ_i p(x_i) log(1/p(x_i)). It is a measure of the "uncertainty" of the probability space. Why?

32 Entropy as expected length of a random walk from root to leaf on a binary tree. Consider the following experiment on a full binary tree: take a random walk from the root to a leaf, and count the number of steps in it. Let L be the expected number of steps in this experiment. Let p(x) = the probability to reach a leaf x, and l(x) = the distance from the root to x. Then L = Σ_x p(x)·l(x), the summation taken over all leaves x. In the tree here (three leaves at depth 2, two at depth 3), L = 3·(1/4 · 2) + 2·(1/8 · 3) = 2.25.

33 Entropy as expected length… (cont.). Note that p(x) = 2^{-l(x)}, i.e. l(x) = -log(p(x)). Thus L = H(X) = -Σ_i p(x_i) log(p(x_i)) = Σ_i p(x_i) log(1/p(x_i)), where X = {x_i} is the set of leaves in the tree. In the "binary tree experiment", entropy is the expected length of a random walk from the root to a leaf.
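The slide's numeric example can be checked directly: with three leaves of probability 1/4 at depth 2 and two of probability 1/8 at depth 3, the expected walk length equals the entropy:

```python
# Numeric check: expected walk length L equals the Shannon entropy H(X)
# for the example tree (leaf probabilities are powers of 1/2).
import math

leaves = [(0.25, 2)] * 3 + [(0.125, 3)] * 2    # (probability, depth) per leaf

L = sum(p * d for p, d in leaves)              # expected number of steps
H = -sum(p * math.log2(p) for p, _ in leaves)  # Shannon entropy
```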

34 Entropy as expected length… (cont.). Assume now that each leaf corresponds to a letter x which is transmitted over a communication channel with probability p(x). Associate bits with the edges, so that x is now represented by a binary word of length l(x). Then H(X) is the expected number of bits transmitted per letter. We will see soon that it is the minimal possible expected length over all possible encodings of the letters as binary strings.

35 Entropy: Generalization. Not every distribution over k outcomes corresponds to a random walk from the root to the leaves of a binary tree with k leaves. To represent an arbitrary (finite) distribution, we allow an outgoing edge to have any probability p in [0,1] (its sibling edge then has probability q = 1-p). We need to define the length l = l(p) of an edge as a function of its probability p.

36 Generalization. We wish l to satisfy the following: 1. l(p_1) + l(p_2) = l(p_1 p_2) [the length of a path to a leaf is determined by the leaf's probability]. 2. l(0.5) = 1 [as in the equiprobable case]. Claim: l(p) = log(1/p) is the only continuous length function on the edges of binary trees with probabilities which satisfies 1 and 2.

37 Relative Entropy. Let p, q be two probability distributions on the same sample space. The relative entropy between p and q is defined by D(p||q) = Σ_x p(x) log[p(x)/q(x)] = Σ_x p(x) log(1/q(x)) - Σ_x p(x) log(1/p(x)). "The inefficiency of assuming distribution q when the correct distribution is p."

38 Non-negativity of relative entropy. Claim: D(p||q) = Σ_x p(x) log[p(x)/q(x)] ≥ 0, with equality only if q = p. Proof: We may take the log to base e, i.e. log x = ln x. Then ln x ≤ x-1, with equality only if x = 1. Thus -D(p||q) = Σ_x p(x) ln[q(x)/p(x)] ≤ Σ_x p(x)[q(x)/p(x) - 1] = Σ_x [q(x) - p(x)] = 0.
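D(p||q) is easy to compute directly, and small examples confirm the claim (the distributions below are arbitrary illustrations):

```python
# Relative entropy D(p||q) = sum_x p(x) log2(p(x)/q(x)).
import math

def relative_entropy(p, q):
    """p, q: dicts mapping the same sample space to probabilities."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)
```

Note that D is not symmetric: in general D(p||q) ≠ D(q||p), so relative entropy is not a metric.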

39 Relative entropy as the average score for sequence comparisons. Recall that we defined the scoring function via s(a,b) = log[P(a,b)/Q(a,b)], where Q(a,b) = Q(a)Q(b). Note that the average score under P is then the relative entropy D(P||Q) = Σ_{a,b} P(a,b) log[P(a,b)/Q(a,b)].

40 The EM algorithm. Consider a model where, for data x and model parameters θ, p(x|θ) is defined by: p(x|θ) = Σ_y p(x,y|θ); the y are the "hidden data". The EM algorithm receives x and parameters θ_0, and returns θ' s.t. p(x|θ') > p(x|θ_0), or equivalently, log p(x|θ') > log p(x|θ_0).

41 The EM algorithm. The EM algorithm works in iterations. Each iteration has an input parameter θ_0 and outputs a new parameter θ', which is the input to the next iteration. θ' is defined from θ_0 as follows: (E step): Calculate Q_{θ_0}(θ) = Σ_y p(y|x,θ_0) log p(x,y|θ). When θ_0 is clear, we shall use Q(θ) instead of Q_{θ_0}(θ). (M step): Set θ' = argmax_θ Q(θ). Comment: At the M step one can actually choose any θ' as long as Q(θ') > Q(θ_0). This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.

42 Claim: p(x|θ') ≥ p(x|θ_0). We shall prove the claim in several steps. We try to maximize log p(x|θ). Step 1: For each y we have log p(x|θ) = log p(y,x|θ) - log p(y|x,θ). Also, Σ_y p(y|x,θ_0) = 1. Hence log p(x|θ) = Σ_y p(y|x,θ_0) [log p(y,x|θ) - log p(y|x,θ)].

43 Proof of Claim. log p(x|θ) = Σ_y p(y|x,θ_0) log p(y,x|θ) + Σ_y p(y|x,θ_0) log[1/p(y|x,θ)] = Q(θ) + Σ_y p(y|x,θ_0) log[1/p(y|x,θ)].

44 Proof of Claim (end). log p(x|θ) - log p(x|θ_0) = Q(θ) - Q(θ_0) + D(p(y|x,θ_0) || p(y|x,θ)) ≥ Q(θ) - Q(θ_0), since the relative entropy term is ≥ 0. Thus, setting θ' = argmax_θ Q(θ) guarantees the claim.

45 Application to HMM. Consider the case when θ is given by parameters q_kl, and p(x,y|θ) = Π_kl q_kl^{N_kl(x,y)}, where N_kl(x,y) is the number of times the (k,l) event occurs in (x,y). Then Q(θ) = Σ_y p(y|x,θ_0) log p(x,y|θ) = Σ_y p(y|x,θ_0) Σ_kl N_kl(x,y) log q_kl.

46 Application to HMM (cont). We need θ that maximizes: Q(θ) = Σ_y p(y|x,θ_0) Σ_kl N_kl(x,y) log q_kl = Σ_kl [Σ_y p(y|x,θ_0) N_kl(x,y)] log q_kl = Σ_kl N_kl log q_kl, where N_kl ≡ E(N_kl(x,y) | x, θ_0).

47 Summary. We need θ = {q_kl} that maximizes Q(θ) = Σ_kl N_kl log q_kl, subject to: for each k, Σ_l q_kl = 1. We saw that Q(θ) is maximized when, for each k: q_kl = N_kl / Σ_l' N_kl'.
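The maximizer q_kl = N_kl / Σ_l' N_kl' can be checked numerically against other points of the simplex (the expected counts below are invented for the example):

```python
# Sanity check for the M step: with fixed expected counts N_kl, the row
# Q(theta) = sum_l N_kl log q_kl is maximized, subject to sum_l q_kl = 1,
# by the normalized counts.
import math

N = {'k': 3.0, 'l': 1.0}                       # expected counts out of one state
total = sum(N.values())
q_star = {l: n / total for l, n in N.items()}  # q_kl = N_kl / sum_l' N_kl'

def Q(q):
    # contribution of this state's row to the expected complete log-likelihood
    return sum(N[l] * math.log(q[l]) for l in q)
```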

48 Application to HMM (cont). For HMM, the q_kl are the transition probabilities a_kl and the emission probabilities e_k(b).