1 Parameter Estimation for HMM. Background reading: Chapter 3.3 of Biological Sequence Analysis, Durbin et al., 2001.
2 Reminder: Hidden Markov Model. Markov chain transition probabilities: p(S_{i+1} = t | S_i = s) = a_st. Emission probabilities: p(X_i = b | S_i = s) = e_s(b). [Diagram: hidden states S_1 → S_2 → … → S_L, each S_i emitting the symbol x_i.]
3 Reminder: Most Probable State Path. Given an output sequence x = (x_1,…,x_L), a most probable path s* = (s*_1,…,s*_L) is one which maximizes p(s|x).
4 Reminder: Viterbi's algorithm for the most probable path.
Initialization: we add the special initial state 0; v_0(0) = 1, v_k(0) = 0 for k > 0.
For i = 1 to L do, for each state l: v_l(i) = e_l(x_i) · MAX_k {v_k(i-1) a_kl}, and ptr_i(l) = argmax_k {v_k(i-1) a_kl} [storing the previous state for reconstructing the path].
Termination: the probability of the most probable path is p(s_1*,…,s_L*; x_1,…,x_L) = MAX_k {v_k(L)}.
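To make the recursion concrete, here is a minimal Python sketch of Viterbi decoding (not part of the original slides; the two-state model, its probabilities, and the example sequence are invented for illustration, and the initial-state probabilities play the role of the special state 0):

import math

states = ["+", "-"]
a = {("+", "+"): 0.7, ("+", "-"): 0.3,   # transition probabilities a_kl
     ("-", "+"): 0.2, ("-", "-"): 0.8}
e = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # emission probabilities e_k(b)
     "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
start = {"+": 0.5, "-": 0.5}             # transitions out of the special initial state 0

def viterbi(x):
    # v_l(i) is kept in log space to avoid numerical underflow on long sequences
    v = [{k: math.log(start[k]) + math.log(e[k][x[0]]) for k in states}]
    ptr = []
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for l in states:
            # v_l(i) = e_l(x_i) * MAX_k {v_k(i-1) a_kl},  ptr_i(l) = argmax_k {v_k(i-1) a_kl}
            best_k = max(states, key=lambda k: v[i - 1][k] + math.log(a[(k, l)]))
            ptr[-1][l] = best_k
            v[i][l] = math.log(e[l][x[i]]) + v[i - 1][best_k] + math.log(a[(best_k, l)])
    # termination: MAX_k v_k(L) is the probability of the most probable path
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for back in reversed(ptr):           # follow the stored pointers to reconstruct s*
        path.append(back[path[-1]])
    return list(reversed(path)), math.exp(v[-1][last])

best_path, best_prob = viterbi("CGCGATAT")
print(best_path, best_prob)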
5 Predicting CpG islands via the most probable path. Output symbols: A, C, G, T (4 letters). Markov chain states: 4 "-" states and 4 "+" states, one pair for each letter (8 states). The transition probabilities a_st and emission probabilities e_k(b) will be discussed soon. The most probable path found by Viterbi's algorithm predicts CpG islands. An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the annotated ones; in addition, quite a few false negatives are found.
6 Reminder: finding the most probable state.
1. The forward algorithm finds {f_k(i) = P(x_1,…,x_i, s_i = k) : k = 1,…,m}.
2. The backward algorithm finds {b_k(i) = P(x_{i+1},…,x_L | s_i = k) : k = 1,…,m}.
3. Return {p(s_i = k | x) = f_k(i) b_k(i) / p(x) : k = 1,…,m}, where p(x) = ∑_k f_k(L).
To compute this for every i, simply run the forward and backward algorithms once and compute f_k(i) b_k(i) for every i, k.
f_l(i) = p(x_1,…,x_i, s_i = l) is the probability of emitting (x_1,…,x_i) along a path in which s_i = l; b_l(i) = p(x_{i+1},…,x_L | s_i = l) is the probability of emitting (x_{i+1},…,x_L) given that s_i = l.
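A minimal Python sketch of the forward, backward, and posterior computations (again not from the slides; the toy two-state model mirrors the Viterbi sketch above):

states = ["+", "-"]
a = {("+", "+"): 0.7, ("+", "-"): 0.3, ("-", "+"): 0.2, ("-", "-"): 0.8}
e = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
start = {"+": 0.5, "-": 0.5}

def forward(x):
    # f_k(i) = P(x_1..x_i, s_i = k)
    f = [{k: start[k] * e[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({l: e[l][x[i]] * sum(f[i - 1][k] * a[(k, l)] for k in states)
                  for l in states})
    return f

def backward(x):
    # b_k(i) = P(x_{i+1}..x_L | s_i = k)
    b = [{} for _ in x]
    b[-1] = {k: 1.0 for k in states}
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(a[(k, l)] * e[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return b

def posteriors(x):
    f, b = forward(x), backward(x)
    px = sum(f[-1][k] for k in states)          # p(x)
    post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
    return post, f, b, px

post, f, b, px = posteriors("CGCGATAT")
print(post[3])   # posterior state distribution at position 4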
7 Finding the probability that a letter is in a CpG island, via the algorithm for the most probable state. The probability that an occurrence of G (at position i) is in a CpG island (a "+" state) is: ∑_{s+} p(S_i = s+ | x) = ∑_{s+} f_{s+}(i) b_{s+}(i) / p(x), where the summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?).
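With the posteriors from the sketch above, this sum is a one-liner; in the full 8-state CpG model the "+" states would be something like ("A+", "C+", "G+", "T+") (the names are illustrative):

# Probability that position i lies in a CpG island: sum the posteriors of the "+" states.
def p_in_island(post, i, plus_states=("+",)):
    return sum(post[i][s] for s in plus_states)

print(p_in_island(post, 3))   # uses `post` from the forward-backward sketch above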
8 Parameter Estimation for HMM. An HMM model is defined by the parameters a_kl and e_k(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters.
9 Parameter Estimation for HMM. To determine the values of (the parameters in) θ, use a training set {x^1,…,x^n}, where each x^j is a sequence which is assumed to fit the model. Given the parameters θ, each sequence x^j has an assigned probability p(x^j | θ) (or p(x^j | θ, HMM)).
10 ML Parameter Estimation for HMM. The elements of the training set {x^1,…,x^n} are assumed to be independent, so p(x^1,…,x^n | θ) = ∏_j p(x^j | θ). ML parameter estimation looks for the θ which maximizes this likelihood. The exact method for finding or approximating this θ depends on the nature of the training set used.
11 Data for HMM. Possible properties of (the sequences in) the training set:
1. For each x^j, what is our information on the states s_i? (The symbols x_i are usually known.)
2. The size (number of sequences) of the training set.
12 Case 1: Sequences are fully known. We know the complete structure of each sequence in the training set {x^1,…,x^n}. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and all symbols b. By the ML method, we look for the parameters θ* which maximize the probability of the sample set: p(x^1,…,x^n | θ*) = MAX_θ p(x^1,…,x^n | θ).
13 Case 1: the ML method. Let m_kl = |{i : s_{i-1} = k, s_i = l}| (in x^j) and m_k(b) = |{i : s_i = k, x_i = b}| (in x^j). For each x^j we have: p(x^j | θ) = ∏_{k,l} a_kl^{m_kl} · ∏_k ∏_b e_k(b)^{m_k(b)}.
14 Case 1 (cont.). By the independence of the x^j's, p(x^1,…,x^n | θ) = ∏_j p(x^j | θ). Thus, if A_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set, we have: p(x^1,…,x^n | θ) = ∏_{k,l} a_kl^{A_kl} · ∏_k ∏_b e_k(b)^{E_k(b)}.
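In the fully-known case these counts come from a single pass over the labelled training sequences. A Python sketch (the labelled toy data is invented for illustration):

# Count transitions A_kl and emissions E_k(b) from fully labelled training sequences.
from collections import Counter

# Each training example is (symbol sequence, state path) -- toy labelled data.
training = [("CGCG", "++++"), ("ATAT", "----"), ("ACGT", "-++-")]

A = Counter()   # A[(k, l)] = #(transitions from state k to state l)
E = Counter()   # E[(k, b)] = #(emissions of symbol b from state k)
for x, s in training:
    for i in range(len(x)):
        E[(s[i], x[i])] += 1
        if i > 0:
            A[(s[i - 1], s[i])] += 1

print(A)
print(E)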
15 Case 1 (cont.). So we need to find the a_kl's and e_k(b)'s which maximize: F(θ) = ∏_{k,l} a_kl^{A_kl} · ∏_k ∏_b e_k(b)^{E_k(b)}, subject to: for every state k, ∑_l a_kl = 1 and ∑_b e_k(b) = 1 (and all parameters nonnegative).
16 Case 1 (cont.). Rewriting, we need to maximize: F(θ) = ∏_k [∏_l a_kl^{A_kl}] · ∏_k [∏_b e_k(b)^{E_k(b)}], a product of independent factors, one for the outgoing transitions of each state k and one for the emissions of each state k.
17 Case 1 (cont.). If for each state k we separately maximize ∏_l a_kl^{A_kl} subject to ∑_l a_kl = 1, and ∏_b e_k(b)^{E_k(b)} subject to ∑_b e_k(b) = 1, then we will also maximize F. Each of the above is a simpler ML problem, which is treated next.
18 A simpler case: ML parameter estimation for a die. Let X be a random variable with 6 values x_1,…,x_6 denoting the six outcomes of a die. Here the parameters are θ = {θ_1, θ_2, θ_3, θ_4, θ_5, θ_6}, with ∑θ_i = 1. Assume that the data is one sequence: Data = (x_6, x_1, x_1, x_3, x_2, x_2, x_3, x_4, x_5, x_2, x_6). So we have to maximize P(Data | θ) = θ_6 θ_1 θ_1 θ_3 θ_2 θ_2 θ_3 θ_4 θ_5 θ_2 θ_6 = θ_1^2 θ_2^3 θ_3^2 θ_4 θ_5 θ_6^2, subject to: θ_1 + θ_2 + θ_3 + θ_4 + θ_5 + θ_6 = 1 [and θ_i ≥ 0].
19 Side comment: Sufficient Statistics.
- To compute the probability of the data in the die example we only need to record the number of times N_i the die fell on side i (namely N_1, N_2,…,N_6).
- We do not need to recall the entire sequence of outcomes.
- {N_i | i = 1,…,6} is called a sufficient statistic for multinomial sampling.
20 Sufficient Statistics.
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(Data) is a sufficient statistic if for any two datasets Data and Data': s(Data) = s(Data') implies P(Data | θ) = P(Data' | θ).
Exercise: Define a sufficient statistic for the HMM model.
21 Maximum Likelihood Estimate. By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). Usually one maximizes the log-likelihood function, which is easier to do and gives an identical answer: log P(Data | θ) = ∑_{i=1..6} N_i log θ_i, with θ_6 = 1 - ∑_{i=1..5} θ_i. A necessary condition for a maximum is: ∂/∂θ_j log P(Data | θ) = N_j/θ_j - N_6/(1 - ∑_{i=1..5} θ_i) = 0, for j = 1,…,5.
22 Finding the Maximum. Rearranging terms: N_j/θ_j = N_6/θ_6 (for j = 1,…,5, and trivially for j = 6). Divide the j-th equation by the i-th equation: θ_j = (N_j/N_i) θ_i. Sum from j = 1 to 6: 1 = ∑_j θ_j = (∑_j N_j/N_i) θ_i = (N/N_i) θ_i, where N = ∑_j N_j. Hence the MLE is given by: θ_i = N_i/N.
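As a quick worked check on the data sequence from the die slide above: the counts are N_1 = 2, N_2 = 3, N_3 = 2, N_4 = 1, N_5 = 1, N_6 = 2 and N = 11, so the MLE is θ_1 = 2/11, θ_2 = 3/11, θ_3 = 2/11, θ_4 = 1/11, θ_5 = 1/11, θ_6 = 2/11.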
23 Generalization to a distribution with any number k of outcomes. Let X be a random variable with k values x_1,…,x_k denoting the k outcomes of an i.i.d. experiment, with parameters θ = {θ_1, θ_2,…,θ_k} (θ_i is the probability of x_i). Again, the data is one sequence of length n: Data = (x_{i_1}, x_{i_2},…,x_{i_n}). Then we have to maximize P(Data | θ) = θ_1^{N_1} · θ_2^{N_2} · … · θ_k^{N_k}, where N_i is the number of occurrences of x_i in the data, subject to: θ_1 + θ_2 + … + θ_k = 1.
24 Generalization to k outcomes (cont.). By a treatment identical to the die case, the maximum is obtained when for all i: θ_i = (N_i/N_j) θ_j, i.e., the θ_i are proportional to the N_i. Hence the MLE is given by: θ_i = N_i/n.
25 Fractional Exponents. Some models allow N_i's which are not integers (e.g., when we are uncertain of a die outcome and consider it "6" with 20% confidence and "5" with 80%: such an outcome contributes 0.2 to N_6 and 0.8 to N_5). We can still write P(Data | θ) = ∏_i θ_i^{N_i}, and the same analysis yields: θ_i = N_i / ∑_j N_j.
26 Apply the ML method to HMM. Let A_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We need to: maximize ∏_{k,l} a_kl^{A_kl} · ∏_k ∏_b e_k(b)^{E_k(b)}, subject to ∑_l a_kl = 1 and ∑_b e_k(b) = 1 for every state k.
27 Apply to HMM (cont.). We apply the previous technique to get, for each k, the parameters {a_kl | l = 1,…,m} and {e_k(b) | b ∈ Σ}: a_kl = A_kl / ∑_{l'} A_{kl'} and e_k(b) = E_k(b) / ∑_{b'} E_k(b'), which gives the optimal ML parameters.
28 Adding pseudocounts in HMM. If the sample set is too small, we may get a biased result. In this case we modify the actual counts by our prior knowledge/belief: r_kl is our prior belief on transitions from k to l, and r_k(b) is our prior belief on emissions of b from state k. Then a_kl = (A_kl + r_kl) / ∑_{l'} (A_{kl'} + r_{kl'}) and e_k(b) = (E_k(b) + r_k(b)) / ∑_{b'} (E_k(b') + r_k(b')).
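A Python sketch of turning (count + pseudocount) tables into transition and emission probabilities; the toy counts and the add-one pseudocount value are arbitrary choices for illustration:

# Turn (count + pseudocount) tables into transition and emission probabilities.
from collections import Counter

states, alphabet = ["+", "-"], "ACGT"
A = Counter({("+", "+"): 5, ("+", "-"): 1, ("-", "+"): 2, ("-", "-"): 4})  # toy transition counts
E = Counter({("+", "C"): 4, ("+", "G"): 3, ("-", "A"): 4, ("-", "T"): 4})  # toy emission counts

def ml_parameters(A, E, r_trans=0.0, r_emit=0.0):
    a_hat, e_hat = {}, {}
    for k in states:
        row = {l: A[(k, l)] + r_trans for l in states}
        tot = sum(row.values())
        a_hat[k] = {l: row[l] / tot for l in states}           # a_kl = (A_kl + r_kl) / sum over l'
        em = {b: E[(k, b)] + r_emit for b in alphabet}
        tot = sum(em.values())
        e_hat[k] = {b: em[b] / tot for b in alphabet}          # e_k(b) = (E_k(b) + r_k(b)) / sum over b'
    return a_hat, e_hat

a_hat, e_hat = ml_parameters(A, E, r_trans=1.0, r_emit=1.0)    # add-one pseudocounts
print(a_hat["+"], e_hat["+"])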
29 Summary of Case 1: Sequences are fully known. We know the complete structure of each sequence in the training set {x^1,…,x^n}. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and all symbols b. We just showed a method which finds the (unique) parameters θ* that maximize p(x^1,…,x^n | θ*) = MAX_θ p(x^1,…,x^n | θ).
30 Case 2: State paths are unknown. In this case only the values of the x_i's of the input sequences are known. This is an ML problem with "missing data". We wish to find θ* so that p(x | θ*) = MAX_θ {p(x | θ)}. For each sequence x, p(x | θ) = ∑_s p(x, s | θ), taken over all state paths s.
31 Case 2: State paths are unknown. So we need to maximize p(x | θ) = ∑_s p(x, s | θ), where the summation is over all the state paths s which produce the output sequence x. Finding the θ* which maximizes ∑_s p(x, s | θ) is hard. [Unlike finding the θ* which maximizes p(x, s | θ) for a single sequence (x, s).]
32 ML Parameter Estimation for HMM. The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x^1,…,x^n | θ') > p(x^1,…,x^n | θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
33 Case 2: State paths are unknown. A general algorithm that solves such a problem is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
34 Baum-Welch training. We start with some values of a_kl and e_k(b), which define prior (initial) values of θ. Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ* s.t. p(x | θ*) > p(x | θ). Each iteration consists of a few steps:
35 Baum-Welch: step 1. Count the expected number of state transitions: for each sequence x^j, for each position i and for each pair of states k, l, compute the posterior state transition probabilities P(s_{i-1} = k, s_i = l | x^j, θ).
36 Recall the forward algorithm. The task: compute f(h_i) = P(x_1,…,x_i, h_i) for i = 1,…,L (namely, considering the evidence up to time slot i). Initial step: P(x_1, h_1) = P(h_1) P(x_1 | h_1). Step i: P(x_1,…,x_i, h_i) = ∑_{h_{i-1}} P(x_1,…,x_{i-1}, h_{i-1}) P(h_i | h_{i-1}) P(x_i | h_i). Recall that the backward algorithm is similar: multiplying matrices from the end to the start and summing over the hidden states.
37 Baum-Welch training. Claim: P(s_{i-1} = k, s_i = l | x^j, θ) = f_k(i-1) a_kl e_l(x_i) b_l(i) / P(x^j), where f and b are the forward and backward values (computed on the next slide).
38 Step 1: Computing P(s_{i-1} = k, s_i = l | x^j, θ).
P(x_1,…,x_L, s_{i-1} = k, s_i = l) = P(x_1,…,x_{i-1}, s_{i-1} = k) · a_kl · e_l(x_i) · P(x_{i+1},…,x_L | s_i = l) = f_k(i-1) a_kl e_l(x_i) b_l(i), where the first factor is computed via the forward algorithm and the last via the backward algorithm. Hence
p(s_{i-1} = k, s_i = l | x^j, θ) = f_k(i-1) a_kl e_l(x_i) b_l(i) / ∑_{k',l'} f_{k'}(i-1) a_{k'l'} e_{l'}(x_i) b_{l'}(i).
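In code this is a direct product of the forward value, transition, emission, and backward value, normalized by p(x). A sketch, assuming the f, b, a, e tables and p(x) computed as in the forward-backward sketch earlier (here i is the 0-based position of x_i):

# P(s_{i-1}=k, s_i=l | x, theta) = f_k(i-1) * a_kl * e_l(x_i) * b_l(i) / p(x)
def posterior_transition(f, b, a, e, x, i, px):
    return {(k, l): f[i - 1][k] * a[(k, l)] * e[l][x[i]] * b[i][l] / px
            for k in f[i - 1] for l in b[i]}

# example: posterior_transition(f, b, a, e, "CGCGATAT", 3, px)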
39 Step 1 (end). For each pair (k, l), compute the expected number of state transitions from k to l over the whole training set: A_kl = ∑_j ∑_i P(s_{i-1} = k, s_i = l | x^j, θ).
40 Baum-Welch: step 2. For each state k and each symbol b, compute the expected number of emissions of b from k: E_k(b) = ∑_j ∑_{i : x_i = b} P(s_i = k | x^j, θ) = ∑_j ∑_{i : x_i = b} f_k(i) b_k(i) / P(x^j).
41 Baum-Welch: step 3. Use the A_kl's and E_k(b)'s to compute the new values of a_kl and e_k(b) (as in Case 1). These values define θ*. It can be shown that p(x^1,…,x^n | θ*) > p(x^1,…,x^n | θ), i.e., θ* increases the probability of the data. This procedure is iterated until some convergence criterion is met.
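Putting steps 1-3 together, one Baum-Welch iteration can be sketched in Python as follows (a self-contained toy example, not the slides' code: the model, training sequences, small pseudocount, and fixed initial-state distribution are all illustrative choices):

import math

# Toy two-state HMM over the DNA alphabet; the initial-state distribution is held fixed.
states, alphabet = ["+", "-"], "ACGT"
start = {"+": 0.5, "-": 0.5}
a = {("+", "+"): 0.7, ("+", "-"): 0.3, ("-", "+"): 0.2, ("-", "-"): 0.8}
e = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
training = ["CGCGCGAT", "ATATCGCG", "TTACGCGT"]

def forward(x, a, e, start):
    f = [{k: start[k] * e[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({l: e[l][x[i]] * sum(f[i - 1][k] * a[(k, l)] for k in states)
                  for l in states})
    return f

def backward(x, a, e):
    b = [{} for _ in x]
    b[-1] = {k: 1.0 for k in states}
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(a[(k, l)] * e[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return b

def baum_welch_iteration(training, a, e, start, pseudo=1e-3):
    A = {(k, l): pseudo for k in states for l in states}      # expected transition counts A_kl
    E = {(k, c): pseudo for k in states for c in alphabet}    # expected emission counts E_k(b)
    log_lik = 0.0
    for x in training:
        f, b = forward(x, a, e, start), backward(x, a, e)
        px = sum(f[-1][k] for k in states)                    # p(x | theta)
        log_lik += math.log(px)
        for i in range(len(x)):
            for k in states:
                E[(k, x[i])] += f[i][k] * b[i][k] / px        # step 2: expected emissions
                if i > 0:
                    for l in states:                          # step 1: expected transitions
                        A[(k, l)] += f[i - 1][k] * a[(k, l)] * e[l][x[i]] * b[i][l] / px
    # step 3: normalize the expected counts into the new parameters theta*
    new_a = {(k, l): A[(k, l)] / sum(A[(k, m)] for m in states)
             for k in states for l in states}
    new_e = {k: {c: E[(k, c)] / sum(E[(k, d)] for d in alphabet) for c in alphabet}
             for k in states}
    return new_a, new_e, log_lik

for _ in range(10):   # iterate until some convergence criterion is met (fixed count here)
    a, e, ll = baum_welch_iteration(training, a, e, start)
    print(round(ll, 4))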
42 Case 2: State paths are unknown: Viterbi training. We also start from given values of a_kl and e_k(b), which define prior values of θ. Viterbi training attempts to maximize the probability of a most probable path, i.e., to maximize p((s(x^1),…,s(x^n)) | θ, x^1,…,x^n), where s(x^j) is the most probable (under θ) path for x^j.
43 Case 2: State paths are unknown: Viterbi training. Each iteration:
1. Find a set {s(x^j)} of most probable paths, which maximize p(s(x^1),…,s(x^n) | θ, x^1,…,x^n).
2. Find θ* which maximizes p(s(x^1),…,s(x^n) | θ*, x^1,…,x^n). (Note: in 1 the maximizing arguments are the paths; in 2 it is θ*.)
3. Set θ = θ* and repeat. Stop when the paths no longer change.
44 Case 2: State paths are unknown: Viterbi training. p(s(x^1),…,s(x^n) | θ*, x^1,…,x^n) can be expressed in closed form (since we are using a single path for each x^j), so this time convergence is achieved when the optimal paths no longer change.
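A sketch of the Viterbi-training loop described on these slides (it assumes a decoder viterbi(x, a, e, start) returning the most probable state path, a parameterized variant of the Viterbi sketch earlier; the add-one pseudocount is an illustrative choice):

# Viterbi training: alternate between decoding most probable paths and
# re-estimating the parameters from those paths as if they were known (Case 1).
from collections import Counter

def viterbi_training(training, a, e, start, states, alphabet, pseudo=1.0):
    prev_paths = None
    while True:
        # 1. most probable path s(x^j) for each sequence under the current theta
        paths = [viterbi(x, a, e, start) for x in training]
        if paths == prev_paths:          # stop when the optimal paths no longer change
            return a, e
        prev_paths = paths
        # 2. re-estimate theta* from the (x^j, s(x^j)) pairs exactly as in Case 1
        A, E = Counter(), Counter()
        for x, s in zip(training, paths):
            for i in range(len(x)):
                E[(s[i], x[i])] += 1
                if i > 0:
                    A[(s[i - 1], s[i])] += 1
        a = {(k, l): (A[(k, l)] + pseudo) / sum(A[(k, m)] + pseudo for m in states)
             for k in states for l in states}
        e = {k: {b: (E[(k, b)] + pseudo) / sum(E[(k, c)] + pseudo for c in alphabet)
                 for b in alphabet} for k in states}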