1 Parameter Estimation and Relative Entropy
Lecture #8. Background readings: Chapters 3.3 and 11.2 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.
2 Reminder: Hidden Markov Model
Markov chain transition probabilities: $p(S_{i+1} = t \mid S_i = s) = a_{st}$.
Emission probabilities: $p(X_i = b \mid S_i = s) = e_s(b)$.
[Figure: HMM drawn as a chain of hidden states $S_1, \dots, S_L$, each state $S_i$ emitting a symbol $x_i$.]
3 Reminder: Finding ML parameters for HMM when paths are known
Let $A_{kl}$ = #(transitions from $k$ to $l$) in the training set, and $E_k(b)$ = #(emissions of symbol $b$ from state $k$) in the training set. We look for parameters $\theta = \{a_{kl}, e_k(b)\}$ that maximize the likelihood of the training data: $\theta^* = \arg\max_\theta \prod_{k,l} a_{kl}^{A_{kl}} \prod_{k,b} e_k(b)^{E_k(b)}$.
4 Optimal ML parameters when the state path is known
The optimal ML parameters $\theta$ are defined by: $a_{kl} = \dfrac{A_{kl}}{\sum_{l'} A_{kl'}}$ and $e_k(b) = \dfrac{E_k(b)}{\sum_{b'} E_k(b')}$.
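The count-and-normalize rule for known paths can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name `ml_estimate`, the fair/loaded state labels `F`/`L` and the toy sequences are assumptions, and rows with no observations are simply set to zero.

```python
from collections import Counter


def ml_estimate(paths, seqs, states, alphabet):
    """Count transitions/emissions along known paths, then normalize per state."""
    A = Counter()  # A[(k, l)] = number of k -> l transitions
    E = Counter()  # E[(k, b)] = number of times state k emits symbol b
    for s, x in zip(paths, seqs):
        for k, l in zip(s, s[1:]):
            A[(k, l)] += 1
        for k, b in zip(s, x):
            E[(k, b)] += 1
    a, e = {}, {}
    for k in states:
        a_tot = sum(A[(k, l)] for l in states)
        e_tot = sum(E[(k, b)] for b in alphabet)
        for l in states:
            a[(k, l)] = A[(k, l)] / a_tot if a_tot else 0.0
        for b in alphabet:
            e[(k, b)] = E[(k, b)] / e_tot if e_tot else 0.0
    return a, e


# Hypothetical training pair: state path "FFLLF" emitting "HTHHT".
a, e = ml_estimate(["FFLLF"], ["HTHHT"], "FL", "HT")
```

In practice one would usually add pseudocounts before normalizing, so that transitions or emissions unseen in a small training set do not get probability zero.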
5 Case 2: Finding ML parameters when the state path is unknown
In this case only the values of the $x_i$'s of the input sequences are known. This is an ML problem with “missing data”. We wish to find $\theta^*$ such that $p(x \mid \theta^*) = \max_\theta p(x \mid \theta)$.
6 Case 2: State paths are unknown
Usually we have $n > 1$ independent sample sequences $x^1, \dots, x^n$. For a given $\theta$ we have: $p(x^1, \dots, x^n \mid \theta) = p(x^1 \mid \theta) \cdots p(x^n \mid \theta)$.
For a single sequence $x$, $p(x \mid \theta) = \sum_s p(x, s \mid \theta)$, where the sum is taken over all state paths $s$ which emit $x$.
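7 Case 2: State paths are unknown
For $n$ independent sequences $(x^1, \dots, x^n)$: $p(x^1, \dots, x^n \mid \theta) = \prod_{j=1}^{n} \sum_{s^j} p(x^j, s^j \mid \theta) = \sum_{(s^1, \dots, s^n)} \prod_{j=1}^{n} p(x^j, s^j \mid \theta)$, where the summation is taken over all tuples of $n$ state paths $(s^1, \dots, s^n)$. We will assume that $n = 1$.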
8 Case 2: State paths are unknown
We will be interested in conditional probabilities when the sequences $(x^1, \dots, x^n)$ are given. The basic equation is: $p(s \mid x, \theta) = \dfrac{p(x, s \mid \theta)}{p(x \mid \theta)}$.
9 $A_{kl}$ and $E_k(b)$ when states are unknown
$A_{kl}$ and $E_k(b)$ are computed according to the current distribution $\theta$, that is:
$A_{kl} = \sum_s A^s_{kl}\, p(s \mid x, \theta)$, where $A^s_{kl}$ is the number of $k$ to $l$ transitions in the sequence $s$.
$E_k(b) = \sum_s E^s_k(b)\, p(s \mid x, \theta)$, where $E^s_k(b)$ is the number of times $k$ emits $b$ in the sequence $s$ with output $x$.
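For a very short sequence, these expected counts can be computed literally, by enumerating every state path and weighting its counts by $p(s \mid x, \theta)$. The sketch below is exponential in the sequence length and is meant only to make the definition concrete. The two-state model, its numbers and the initial distribution `PI` are illustrative assumptions (the slides do not specify how the chain starts).

```python
from itertools import product

# Hypothetical two-state HMM (fair/loaded coin); all numbers are illustrative.
STATES = ("F", "L")
PI = {"F": 0.5, "L": 0.5}  # assumed initial distribution, not from the slides
A_PROB = {("F", "F"): 0.9, ("F", "L"): 0.1, ("L", "F"): 0.2, ("L", "L"): 0.8}
E_PROB = {("F", "H"): 0.5, ("F", "T"): 0.5, ("L", "H"): 0.9, ("L", "T"): 0.1}


def joint(x, s):
    """p(x, s | theta) for one state path s."""
    p = PI[s[0]] * E_PROB[(s[0], x[0])]
    for i in range(1, len(x)):
        p *= A_PROB[(s[i - 1], s[i])] * E_PROB[(s[i], x[i])]
    return p


def expected_counts(x):
    """Return p(x | theta), A_kl and E_k(b) by summing over every path."""
    paths = list(product(STATES, repeat=len(x)))
    px = sum(joint(x, s) for s in paths)
    A, E = {}, {}
    for s in paths:
        w = joint(x, s) / px  # p(s | x, theta)
        for k, l in zip(s, s[1:]):
            A[(k, l)] = A.get((k, l), 0.0) + w
        for k, b in zip(s, x):
            E[(k, b)] = E.get((k, b), 0.0) + w
    return px, A, E


px, A, E = expected_counts("HHTH")
```

Since every path of length 4 has exactly 3 transitions and 4 emissions, the expected counts sum to 3 and 4 respectively, which is a convenient sanity check.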
10 Baum-Welch: Step 1a
Count the expected number of state transitions. For each $i$, $k$, $l$, compute the state transition probabilities under the current $\theta$: $P(s_{i-1} = k, s_i = l \mid x, \theta)$. For this, we use the forward and backward algorithms.
11 Step 1a: Computing $P(s_{i-1} = k, s_i = l \mid x, \theta)$
$P(x_1, \dots, x_L, s_{i-1} = k, s_i = l \mid \theta) = P(x_1, \dots, x_{i-1}, s_{i-1} = k \mid \theta)\, a_{kl}\, e_l(x_i)\, P(x_{i+1}, \dots, x_L \mid s_i = l, \theta) = F_k(i-1)\, a_{kl}\, e_l(x_i)\, B_l(i)$,
where $F_k(i-1)$ is computed via the forward algorithm and $B_l(i)$ via the backward algorithm. Dividing by $p(x \mid \theta)$ gives $P(s_{i-1} = k, s_i = l \mid x, \theta) = \dfrac{F_k(i-1)\, a_{kl}\, e_l(x_i)\, B_l(i)}{p(x \mid \theta)}$.
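12 Step 1a (end)
For each pair $(k, l)$, compute the expected number of state transitions from $k$ to $l$ as the sum of the expected number of $k$ to $l$ transitions over all $L$ edges: $A_{kl} = \sum_{i=1}^{L} P(s_{i-1} = k, s_i = l \mid x, \theta) = \dfrac{1}{p(x \mid \theta)} \sum_{i=1}^{L} F_k(i-1)\, a_{kl}\, e_l(x_i)\, B_l(i)$.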
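Step 1a can be sketched with plain lists as below. Several choices here are assumptions made for illustration: states and symbols are integer indices, positions are 0-based, the chain starts from an explicit initial distribution `pi` instead of a begin state (so the sum runs over the $L-1$ transitions between emitted symbols), and no scaling is applied, so the code is only suitable for short sequences.

```python
def forward(x, pi, a, e):
    """F[i][k] = p(x_0..x_i, s_i = k | theta)."""
    K = len(pi)
    F = [[pi[k] * e[k][x[0]] for k in range(K)]]
    for i in range(1, len(x)):
        F.append([sum(F[i - 1][k] * a[k][l] for k in range(K)) * e[l][x[i]]
                  for l in range(K)])
    return F


def backward(x, a, e):
    """B[i][k] = p(x_{i+1}..x_{L-1} | s_i = k, theta)."""
    K, L = len(a), len(x)
    B = [[1.0] * K for _ in range(L)]
    for i in range(L - 2, -1, -1):
        B[i] = [sum(a[k][l] * e[l][x[i + 1]] * B[i + 1][l] for l in range(K))
                for k in range(K)]
    return B


def expected_transitions(x, pi, a, e):
    """A[k][l] = sum over i of P(s_{i-1} = k, s_i = l | x, theta)."""
    K = len(pi)
    F, B = forward(x, pi, a, e), backward(x, a, e)
    px = sum(F[-1])  # p(x | theta)
    A = [[0.0] * K for _ in range(K)]
    for i in range(1, len(x)):
        for k in range(K):
            for l in range(K):
                A[k][l] += F[i - 1][k] * a[k][l] * e[l][x[i]] * B[i][l] / px
    return A, px


# Hypothetical model: 2 states, 2 symbols.
pi = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [[0.5, 0.5], [0.9, 0.1]]
x = [0, 0, 1, 0]
A, px = expected_transitions(x, pi, a, e)
```

For long sequences the products underflow, so a practical version would rescale each column of `F` and `B` or work in log space.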
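13 Step 1a for many sequences
Exercise: Prove that when we have $n$ independent input sequences $(x^1, \dots, x^n)$, then $A_{kl}$ is given by: $A_{kl} = \sum_{j=1}^{n} \dfrac{1}{p(x^j \mid \theta)} \sum_{i} F^j_k(i-1)\, a_{kl}\, e_l(x^j_i)\, B^j_l(i)$, where $F^j$ and $B^j$ are the forward and backward values computed for $x^j$.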
14 Baum-Welch: Step 1b
Count the expected number of symbol emissions. For each state $k$ and each symbol $b$, and for each $i$ where $x_i = b$, compute the probability that $s_i = k$: $P(s_i = k \mid x, \theta) = \dfrac{F_k(i)\, B_k(i)}{p(x \mid \theta)}$.
15 Baum-Welch: Step 1b
For each state $k$ and each symbol $b$, compute the expected number of emissions of $b$ from $k$ as the sum of the expected number of times that $s_i = k$, over all $i$'s for which $x_i = b$: $E_k(b) = \dfrac{1}{p(x \mid \theta)} \sum_{i:\, x_i = b} F_k(i)\, B_k(i)$.
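16 Step 1b for many sequences
Exercise: when we have $n$ sequences $(x^1, \dots, x^n)$, the expected number of emissions of $b$ from $k$ is given by: $E_k(b) = \sum_{j=1}^{n} \dfrac{1}{p(x^j \mid \theta)} \sum_{i:\, x^j_i = b} F^j_k(i)\, B^j_k(i)$.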
17 Summary of Steps 1a and 1b: the E part of the Baum-Welch training
These steps compute the expected numbers $A_{kl}$ of $k \to l$ transitions, for all pairs of states $k$ and $l$, and the expected numbers $E_k(b)$ of emissions of symbol $b$ from state $k$, for all states $k$ and symbols $b$. The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.
18 Baum-Welch: Step 2
Use the $A_{kl}$'s and $E_k(b)$'s to compute the new values of $a_{kl}$ and $e_k(b)$. These values define $\theta^*$. The correctness of the EM algorithm implies that $p(x^1, \dots, x^n \mid \theta^*) \ge p(x^1, \dots, x^n \mid \theta)$, i.e., $\theta^*$ does not decrease the probability of the data. This procedure is iterated until some convergence criterion is met.
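Putting the E and M steps together, one Baum-Welch iteration for a single sequence might look like the sketch below. The helper names, the toy model and the fixed initial distribution `pi` (which is not re-estimated here) are assumptions for illustration, and no scaling or pseudocounts are used.

```python
def forward(x, pi, a, e):
    """F[i][k] = p(x_0..x_i, s_i = k | theta), 0-based positions."""
    K = len(pi)
    F = [[pi[k] * e[k][x[0]] for k in range(K)]]
    for i in range(1, len(x)):
        F.append([sum(F[i - 1][k] * a[k][l] for k in range(K)) * e[l][x[i]]
                  for l in range(K)])
    return F


def backward(x, a, e):
    """B[i][k] = p(x_{i+1}..x_{L-1} | s_i = k, theta)."""
    K, L = len(a), len(x)
    B = [[1.0] * K for _ in range(L)]
    for i in range(L - 2, -1, -1):
        B[i] = [sum(a[k][l] * e[l][x[i + 1]] * B[i + 1][l] for l in range(K))
                for k in range(K)]
    return B


def baum_welch_step(x, pi, a, e):
    """One E step plus one M step; returns (new_a, new_e, p(x | old theta))."""
    K, M = len(a), len(e[0])
    F, B = forward(x, pi, a, e), backward(x, a, e)
    px = sum(F[-1])
    A = [[0.0] * K for _ in range(K)]  # expected transition counts
    E = [[0.0] * M for _ in range(K)]  # expected emission counts
    for i in range(len(x)):
        for l in range(K):
            E[l][x[i]] += F[i][l] * B[i][l] / px
            if i > 0:
                for k in range(K):
                    A[k][l] += F[i - 1][k] * a[k][l] * e[l][x[i]] * B[i][l] / px
    # M step: normalize the expected counts, as in the known-path case.
    new_a = [[A[k][l] / sum(A[k]) for l in range(K)] for k in range(K)]
    new_e = [[E[k][b] / sum(E[k]) for b in range(M)] for k in range(K)]
    return new_a, new_e, px


# Hypothetical starting point and data (symbols coded as 0/1).
pi = [0.5, 0.5]
a = [[0.7, 0.3], [0.4, 0.6]]
e = [[0.6, 0.4], [0.2, 0.8]]
x = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1]

likelihoods = []
for _ in range(10):
    a, e, px = baum_welch_step(x, pi, a, e)
    likelihoods.append(px)
```

The recorded values in `likelihoods` should form a non-decreasing sequence, which is the property stated on this slide; a typical stopping rule is to quit when the increase in log-likelihood falls below a small threshold.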
19 Viterbi training: maximizing the probability of the most probable path
States are unknown. Viterbi training attempts to maximize the probability of a most probable path, i.e., the value of $p(s(x^1), \dots, s(x^n), x^1, \dots, x^n \mid \theta)$, where $s(x^j)$ is the most probable (under $\theta$) path for $x^j$. We assume only one sequence ($n = 1$).
20 Viterbi training (cont)
Start from given values of $a_{kl}$ and $e_k(b)$, which define the initial value of $\theta$. Each iteration:
Step 1: Use Viterbi's algorithm to find a most probable path $s(x)$, which maximizes $p(s(x), x \mid \theta)$.
21 Viterbi training (cont)
Step 2: Use the ML method for an HMM with known state paths to find $\theta^*$ which maximizes $p(s(x), x \mid \theta^*)$.
Note: in Step 1 the maximizing argument is the path $s(x)$; in Step 2 it is the parameters $\theta^*$.
22 Viterbi training (cont)
Step 3: Set $\theta = \theta^*$, and repeat. Stop when the paths no longer change.
Claim 2 (Exercise): If $s(x)$ is the optimal path in Step 1 of two different iterations, then in both iterations $\theta$ has the same values, and hence $p(s(x), x \mid \theta)$ will not increase in any later iteration. Hence the algorithm can terminate in this case.
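The three steps of Viterbi training can be sketched for a single sequence as follows. This is an illustrative sketch under several assumptions: integer-coded states and symbols, an explicit initial distribution `pi` that is kept fixed, plain probabilities instead of logs (fine only for short sequences), and a fallback that keeps the old row of parameters for any state the path never visits.

```python
def viterbi(x, pi, a, e):
    """Return a most probable state path for x and its joint probability."""
    K = len(pi)
    V = [pi[k] * e[k][x[0]] for k in range(K)]
    back = []  # back[i-1][l] = best predecessor of state l at position i
    for i in range(1, len(x)):
        ptr, nxt = [], []
        for l in range(K):
            k_best = max(range(K), key=lambda k: V[k] * a[k][l])
            ptr.append(k_best)
            nxt.append(V[k_best] * a[k_best][l] * e[l][x[i]])
        back.append(ptr)
        V = nxt
    s = [max(range(K), key=lambda k: V[k])]
    for ptr in reversed(back):
        s.append(ptr[s[-1]])
    s.reverse()
    return s, max(V)


def reestimate(x, s, a, e):
    """Step 2: ML parameters from counts along the single path s."""
    K, M = len(a), len(e[0])
    A = [[0] * K for _ in range(K)]
    E = [[0] * M for _ in range(K)]
    for k, l in zip(s, s[1:]):
        A[k][l] += 1
    for k, b in zip(s, x):
        E[k][b] += 1
    # Assumption: a row without any counts keeps its previous values.
    new_a = [[c / sum(A[k]) for c in A[k]] if sum(A[k]) else a[k][:]
             for k in range(K)]
    new_e = [[c / sum(E[k]) for c in E[k]] if sum(E[k]) else e[k][:]
             for k in range(K)]
    return new_a, new_e


def viterbi_training(x, pi, a, e, max_iter=50):
    prev = None
    for _ in range(max_iter):
        s, _p = viterbi(x, pi, a, e)   # Step 1
        if s == prev:                  # path unchanged: stop
            break
        a, e = reestimate(x, s, a, e)  # Steps 2 and 3
        prev = s
    return s, a, e


# Hypothetical model in which state k prefers to emit symbol k.
pi = [0.5, 0.5]
a0 = [[0.8, 0.2], [0.2, 0.8]]
e0 = [[0.7, 0.3], [0.3, 0.7]]
x = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
s_hat, a_hat, e_hat = viterbi_training(x, pi, a0, e0)
```

On this toy input the procedure stops after the second pass, because the path found under the re-estimated parameters equals the previous one. Note that the re-estimated emission probabilities become 0 or 1 here, which illustrates why pseudocounts are commonly added in practice.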
23 Viterbi training (end)
Exercise: generalize the algorithm for the case where there are $n$ training sequences $x^1, \dots, x^n$, to find paths $\{s(x^1), \dots, s(x^n)\}$ and parameters $\theta$ so that $\{s(x^1), \dots, s(x^n)\}$ are most probable paths for $x^1, \dots, x^n$ under $\theta$.
24 The EM algorithm
Baum-Welch training is a special case of a general algorithm for approximating ML parameters in the case of “missing data”: the EM algorithm. The correctness proof of the EM algorithm uses the concept of “relative entropy”, which is important in its own right. The next few slides present the concepts of entropy and relative entropy.
25 Entropy: Definition
Consider a probability space $X$ of $k$ events $x_1, \dots, x_k$. The Shannon entropy $H(X)$ is defined by: $H(X) = -\sum_i p(x_i) \log p(x_i) = \sum_i p(x_i) \log \dfrac{1}{p(x_i)} = E\left[\log \dfrac{1}{p(x)}\right]$.
It is a measure of the “uncertainty” of the probability space: a large entropy corresponds to high uncertainty.
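As a small illustration of the definition, the function below (the name `entropy` and the example distributions are chosen here, not taken from the slides) computes the entropy of a finite distribution given as a list of probabilities, skipping zero-probability events by the usual convention $0 \log(1/0) = 0$.

```python
import math


def entropy(probs, base=2.0):
    """Shannon entropy of a finite distribution; zero terms contribute 0."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)


h_coin = entropy([0.5, 0.5])        # fair coin
h_sure = entropy([1.0, 0.0])        # no uncertainty at all
h_four = entropy([0.25] * 4)        # four equally likely events
h_skew = entropy([0.7, 0.1, 0.1, 0.1])
```

The uniform distribution over four events has the largest entropy among these, and the deterministic one has entropy zero, matching the “uncertainty” reading.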
26 Some intuition: Entropy as expected length of random walk from root to leaf on binary tree
Consider the following experiment on a full binary tree: take a random walk from the root to a leaf, and report the leaf's name. Let $p(x)$ be the probability of reaching a leaf $x$, and $l(x)$ the distance from the root to $x$. Then the sum $H = \sum_x p(x)\, l(x)$ is the expected number of steps in this experiment. In the tree here (three leaves at depth 2 and two leaves at depth 3), $H = 3 \cdot (\tfrac{1}{4} \cdot 2) + 2 \cdot (\tfrac{1}{8} \cdot 3) = 2.25$.
27 Entropy as expected length… (cont.)
Note that $p(x) = 2^{-l(x)}$, i.e. $l(x) = -\log_2 p(x)$. Thus $H = H(X) = -\sum_i p(x_i) \log p(x_i) = \sum_i p(x_i) \log \dfrac{1}{p(x_i)}$, where $X = \{x_i\}$ is the set of leaves in the tree. In the “binary tree experiment”, entropy is the expected length of a random walk from the root to a leaf.
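The identity between expected walk length and entropy can be checked numerically for a tree with three leaves at depth 2 and two leaves at depth 3. The list `depths` below encodes only the leaf depths; which leaf label sits at which depth is not needed for the calculation.

```python
import math

# Leaf depths of a full binary tree: three leaves at depth 2, two at depth 3.
depths = [2, 2, 2, 3, 3]

# A fair random walk reaches a leaf at depth l with probability 2^-l.
probs = [2.0 ** -l for l in depths]

expected_steps = sum(p * l for p, l in zip(probs, depths))
entropy_bits = sum(p * math.log2(1.0 / p) for p in probs)
```

Both quantities come out as 2.25, and the leaf probabilities sum to 1, as they must for a full binary tree.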
28 Entropy as expected length… (cont.)
Assume now that each leaf corresponds to a letter $x$, which is transmitted over a communication channel with probability $p(x) = 2^{-l(x)}$. Associate bits to edges, so that $x$ is now represented by a binary word of length $l(x) = -\log_2 p(x)$. Then $H(X)$ is the expected number of bits transmitted per letter (indeed, entropy is measured in bits). What happens when $p(x) \ne 2^{-l(x)}$?
[Figure: binary tree with leaves a, b, c, d, e. Entropy of this encoding: 2.25 bits.]
29 Relative Entropy
Assume now that in our tree, the transmission probabilities $p(x)$ are different from the “random walk” probabilities $q(x) = 2^{-l(x)}$. Now, the expected number of bits transmitted per letter is $\sum_x p(x)\, l(x) = \sum_x p(x) \log \dfrac{1}{q(x)}$. We will show that this expected length is larger than the entropy $\sum_x p(x) \log \dfrac{1}{p(x)}$.
If $p(x) = 0.2$ for all letters $x$, then the expected length is 2.4 bits, compared to 2.25 bits when $p(x) = 2^{-l(x)}$.
30 Relative Entropy: Definition
Let $p$, $q$ be two probability distributions on the same sample space. The relative entropy between $p$ and $q$ is defined by $D(p \| q) = \sum_x p(x) \log \dfrac{1}{q(x)} - \sum_x p(x) \log \dfrac{1}{p(x)} = \sum_x p(x) \log \dfrac{p(x)}{q(x)}$.
“The inefficiency of assuming distribution $q$ when the correct distribution is $p$”.
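The five-letter example can be reproduced numerically. In this sketch, `q` is the random-walk distribution of a tree with leaf depths 2, 2, 2, 3, 3 and `p` is the uniform distribution with $p(x) = 0.2$; the function names are chosen here for illustration.

```python
import math


def cross_entropy(p, q):
    """Expected code length sum_x p(x) log2(1/q(x)), in bits."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)


def relative_entropy(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


q = [0.25, 0.25, 0.25, 0.125, 0.125]  # q(x) = 2^-l(x)
p = [0.2] * 5                         # actual transmission probabilities

expected_len = cross_entropy(p, q)    # about 2.4 bits
entropy_p = cross_entropy(p, p)       # H(p) = log2(5), about 2.32 bits
d_pq = relative_entropy(p, q)         # the gap between the two
```

The gap `d_pq` is the number of extra bits per letter paid for using a code designed for `q` when the letters actually follow `p`.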
31 Non-negativity of relative entropy
Claim: $D(p \| q) = \sum_x p(x) \log \dfrac{p(x)}{q(x)} \ge 0$. Equality holds only if $q = p$.
Proof. We may take the log to base $e$, i.e., $\log t = \ln t$. Then, for all $t > 0$, $\ln t \le t - 1$, with equality only if $t = 1$. Thus, setting $t = q(x)/p(x)$, we get
$-D(p \| q) = \sum_x p(x) \ln \dfrac{q(x)}{p(x)} \le \sum_x p(x) \left[ \dfrac{q(x)}{p(x)} - 1 \right] = \sum_x [q(x) - p(x)] = \sum_x q(x) - \sum_x p(x) = 0$.
32 Entropy: Generalization for other probability distributions
The formula $H(p) = \sum_x p(x) \log \dfrac{1}{p(x)} \le \sum_x p(x) \log \dfrac{1}{q(x)}$ holds also for distributions $p$, $q$ that do not correspond to random walks from root to leaves on binary trees.
To represent the entropy of any (finite) distribution by the “random walk” concept, we allow (binary) trees in which outgoing edges may have any probability $p$ in $[0, 1]$. We need to define the length $l = l(p)$ of edges as a function of their probability $p$.
[Figure: a node with two outgoing edges of probabilities $p$ and $1-p$, with edge lengths $l(p) = ?$ and $l(1-p) = ?$.]
33 Generalization for n = 2
Consider first the case of $n = 2$ (a possibly biased coin, where $p$ is the probability of heads). If we define $l(p) = \log_2(1/p) = -\log_2 p$, we get $H(p, 1-p) = p \log \dfrac{1}{p} + (1-p) \log \dfrac{1}{1-p}$. The maximum entropy (1 bit) is attained when there is maximum uncertainty ($p = 0.5$).
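A quick numerical look at the coin case: the sketch below evaluates $H(p, 1-p)$ on a grid of values of $p$ (the grid resolution and the name `binary_entropy` are arbitrary choices) and locates the maximum.

```python
import math


def binary_entropy(p):
    """H(p, 1-p) in bits, with the convention 0 * log(1/0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1.0 / p) + (1.0 - p) * math.log2(1.0 / (1.0 - p))


grid = [i / 100 for i in range(101)]
best_p = max(grid, key=binary_entropy)
best_h = binary_entropy(best_p)
```

On this grid the maximum is attained at $p = 0.5$ with value 1 bit, and the function is symmetric around that point.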
34 Generalization for all n
Claim: The only continuous length function $l(p)$ on the edges of (binary) trees with probabilities which satisfies the following two conditions:
1. $l(p_1) + l(p_2) = l(p_1 p_2)$ [the length of a path to a leaf is determined by the leaf's probability],
2. $l(0.5) = 1$ [as in the equiprobable case for $n = 2$],
is $l(p) = \log_2(1/p)$.
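One way to see why the claim holds is the following sketch. The auxiliary function $g$ is introduced here only for illustration; the argument uses the standard fact that a continuous additive function on $[0, \infty)$ is linear.

```latex
\begin{align*}
g(t) &:= l(2^{-t}), \qquad t \ge 0 \\
g(t_1 + t_2) &= l(2^{-t_1} \cdot 2^{-t_2}) = l(2^{-t_1}) + l(2^{-t_2}) = g(t_1) + g(t_2)
  && \text{(condition 1)} \\
g \text{ continuous and additive} &\implies g(t) = c\,t \text{ for some constant } c \\
g(1) = l(0.5) = 1 &\implies c = 1
  && \text{(condition 2)} \\
l(p) &= g\bigl(\log_2(1/p)\bigr) = \log_2(1/p)
\end{align*}
```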
35 Relative entropy as average score for sequence comparisons
Recall that we have defined the scoring function via $\sigma(a, b) = \log \dfrac{P(a, b)}{Q(a)\, Q(b)}$.
Note that the average score of “matching” pairs is the relative entropy $D(P \| Q) = \sum_{a,b} P(a, b) \log \dfrac{P(a, b)}{Q(a, b)}$, where $Q(a, b) = Q(a)\, Q(b)$.
Here, large relative entropy means that $P(a, b)$, the distribution defined by the “match” model, is significantly different from $Q(a)\, Q(b)$, the distribution defined by the “random” model. (This relative entropy is called “the mutual information of P and Q”, denoted $I(P, Q)$. $I(P, Q) = 0$ iff $P = Q$.)
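The relation between the average match score and relative entropy can be illustrated on a two-letter alphabet. Everything in this sketch is hypothetical: the joint distribution `P`, the choice of taking `Q` to be the marginal of `P` (valid here because `P` is symmetric), and the function names.

```python
import math

# Hypothetical "match" model over aligned pairs of letters from {A, B}.
P = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}


def marginal(joint):
    """Marginal of the first letter (assumes a symmetric joint distribution)."""
    Q = {}
    for (a, _b), p in joint.items():
        Q[a] = Q.get(a, 0.0) + p
    return Q


def score(a, b, joint, Q):
    """Log-odds score log2[P(a,b) / (Q(a) Q(b))]."""
    return math.log2(joint[(a, b)] / (Q[a] * Q[b]))


def average_score(joint, Q):
    """Expected score under the match model = D(P || Q x Q)."""
    return sum(p * score(a, b, joint, Q) for (a, b), p in joint.items() if p > 0)


Q = marginal(P)
avg = average_score(P, Q)

# If the two letters are independent, the average score vanishes.
P_indep = {(a, b): Q[a] * Q[b] for a in Q for b in Q}
avg_indep = average_score(P_indep, Q)
```

Matching pairs get positive scores and mismatching pairs negative ones, and the weighted average is positive whenever the match model differs from the random model.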