Computational Genomics, Lecture 8a: Hidden Markov Models (HMMs). © Ydo Wexler & Dan Geiger (Technion) and Nir Friedman (HU). Modified by Benny Chor (TAU).
2 Outline
- Finite, or discrete, Markov models
- Hidden Markov models
- Three major questions:
  - Q1: Compute the probability of a given sequence of observations. A1: The forward-backward dynamic programming algorithm (Baum).
  - Q2: Compute the most probable sequence of states, given a sequence of observations. A2: Viterbi's dynamic programming algorithm.
  - Q3: Learn the best model, given a sequence of observations. A3: The Expectation Maximization (EM) heuristic.
3 Markov Models
- A discrete (finite) system:
  - N distinct states.
  - Begins (at time t=1) in some initial state(s).
  - At each time step (t=1,2,…) the system moves from the current state to the next state (possibly the same as the current state) according to the transition probabilities associated with the current state.
- This kind of system is called a finite, or discrete, Markov model, also known as a probabilistic finite automaton.
- Named after Andrei Andreyevich Markov.
4 Example (reminder): The Friendly Gambler
The game starts with 10$ in the gambler's pocket.
- At each round, the gambler wins 1$ with probability p and loses 1$ with probability 1-p.
- The game ends when the gambler goes broke (no sister in bank), or accumulates a capital of 100$ (including the initial capital).
- Both 0$ and 100$ are absorbing states (or boundaries).
[Figure: chain of states 0, 1, 2, …, N-1, N; from each interior state the chain moves one step right with probability p and one step left with probability 1-p; the start state is 10$.]
5 Example (reminder): The Friendly Gambler
[Figure: chain of states 0, 1, 2, …, N-1, N with step probabilities p and 1-p; the start state is 10$.]
- Irreducible means that every state is accessible from every other state.
- Aperiodic means that every state has period 1. In an irreducible chain it is enough that at least one state has a transition to itself.
- Positive recurrent means that for every state, the expected return time is finite.
- If an irreducible Markov chain is positive recurrent, there exists a unique stationary distribution.
- Is the gambler's chain positive recurrent? Does it have stationary distribution(s), and are they independent of the initial distribution?
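One way to explore the questions on this slide is to build the gambler's transition matrix and test candidate distributions directly. The sketch below is only an illustration: the names `gambler_chain` and `step`, the small target `N = 5`, and `p = 1/2` are assumptions chosen for readability, not values from the lecture. It checks that the point mass on each absorbing state, and any mixture of the two, is left unchanged by one step of the chain.

```python
from fractions import Fraction

def gambler_chain(n_max, p):
    """Transition matrix of the gambler's chain on states 0..n_max."""
    q = 1 - p
    P = [[Fraction(0)] * (n_max + 1) for _ in range(n_max + 1)]
    P[0][0] = Fraction(1)            # broke: absorbing
    P[n_max][n_max] = Fraction(1)    # target capital reached: absorbing
    for s in range(1, n_max):
        P[s][s + 1] = p              # win 1$
        P[s][s - 1] = q              # lose 1$
    return P

def step(dist, P):
    """One step of the chain: the row vector dist multiplied by P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

N = 5                                # hypothetical small target, not the 100$ of the slide
P = gambler_chain(N, Fraction(1, 2))
broke = [Fraction(1)] + [Fraction(0)] * N
rich = [Fraction(0)] * N + [Fraction(1)]
mix = [(b + r) / 2 for b, r in zip(broke, rich)]

# Each of these is unchanged by a step, so each is stationary.
print(step(broke, P) == broke, step(rich, P) == rich, step(mix, P) == mix)
```

Since several different distributions are stationary, the long-run behaviour here depends on where the gambler starts, which is consistent with the chain not being irreducible.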
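6 Let Us Change Gear
- Enough with these simple Markov chains.
- Our next mission: hidden Markov chains.
[Figure: two-state coin model with states Fair and Loaded, each emitting head or tail; surviving labels: start probability 1/2, and edge labels 0.9, 1/2, 1/4, 3/4.]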
7 Hidden Markov Models (or probabilistic finite state transducers)
Often we face cases where states cannot be directly observed. We need an extension of Markov models: hidden Markov models.
[Figure: four hidden states with self-transitions $a_{11}, a_{22}, a_{33}, a_{44}$ and forward transitions $a_{12}, a_{23}, a_{34}$; each state emits the observed phenomenon with output probabilities $b_{ik}$.]
- $a_{ij}$ are state transition probabilities.
- $b_{ik}$ are observation (output) probabilities.
- $b_{11} + b_{12} + b_{13} + b_{14} = 1$, $b_{21} + b_{22} + b_{23} + b_{24} = 1$, etc.
8 Hidden Markov Models - HMM
[Figure: chain $H_1 \to H_2 \to \dots \to H_{L-1} \to H_L$ of hidden state variables $H_i$; each $H_i$ emits $X_i$, the observed data ("output").]
9 Example: Dishonest Casino Actually, what is hidden in this model?
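10 A Similar Example: Loaded Coin
[Figure: two-state coin model with states Fair and Loaded, each emitting head or tail; surviving labels: start probability 1/2, and edge labels 0.9, 1/2, 1/4, 3/4. The model is unrolled over L tosses as hidden variables $H_1,\dots,H_L$ (Fair/Loaded) emitting $X_1,\dots,X_L$ (Head/Tail).]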
11 Loaded Coin Example (cont.)
[Figure: the loaded coin model unrolled over L tosses, with hidden variables $H_1,\dots,H_L$ (Fair/Loaded) emitting $X_1,\dots,X_L$ (Head/Tail).]
Q1: What is the probability of the observed sequence of outcomes (e.g. HHHTHTTHHT), given the model?
12 HMMs – Question I
Given an observation sequence $O = (O_1 O_2 O_3 \dots O_L)$ and a model $M = \{A, B, \pi\}$ (transition probabilities A, output probabilities B, initial state probabilities $\pi$), how do we efficiently compute $P(O \mid M)$, the probability that the given model M produces the observation O in a run of length L?
- This probability can be viewed as a measure of the quality of the model M. Viewed this way, it enables discrimination/selection among alternative models $M_1, M_2, M_3, \dots$
13 Example: CpG islands
- In the human genome, CG dinucleotides are relatively rare.
  - CG pairs undergo a process called methylation, which modifies the C nucleotide.
  - A methylated C mutates (with relatively high probability) to a T.
- Promoter regions are CG rich.
  - These regions are not methylated, and thus mutate less often.
  - These are called CG (aka CpG) islands.
14 Biological Example: Methylation and CG Islands
- CG dinucleotides in nuclear DNA sequences often undergo a process of methylation, where a methyl group (CH3) "joins" the cytosine (C).
- The methylated cytosine may be converted to thymine (T) by accidental deamination.
- Over evolutionary time scales, the methylated CG sequence will often be converted to the TG sequence.
- Genes whose control regions are methylated are usually underexpressed. Indeed, unmethylated CGs are often found around active genes (this is condition and tissue dependent).
- A CG island is a short stretch of DNA in which the frequency of the CG sequence is higher than in other regions. Therefore, such islands are found with higher density around genes.
15 Biological Example: Methylation & CG Islands
Notice: the complement of a CG is a GC.
[Figure label: epigenetic phenomena.]
16 CpG Islands
- We construct one Markov chain for CpG-rich regions and another for CpG-poor regions.
- Using maximum likelihood estimates from 60K nucleotides, we get two models.
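17 Ratio Test for CpG islands
Given a sequence $X_1,\dots,X_n$ we compute the likelihood ratio $\frac{P(X_1,\dots,X_n \mid \text{CpG-rich model})}{P(X_1,\dots,X_n \mid \text{CpG-poor model})}$.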
18 Empirical Evaluation
19 Finding CpG islands
Simple-minded approach:
- Pick a window of size N (N = 100, for example).
- Compute the log-ratio for the sequence in the window, and classify based on that.
Problems:
- How do we select N?
- What do we do when the window intersects the boundary of a CpG island?
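The window approach can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function names are hypothetical, and the two transition tables `PLUS` and `MINUS` are made-up toy values (only the C row differs from uniform), not the maximum likelihood estimates mentioned earlier. A window is called an island when its log-ratio is positive; the threshold of 0 is also an assumption.

```python
import math

def log_ratio(seq, plus, minus):
    """Sum of log(plus[a][b] / minus[a][b]) over consecutive pairs (a, b)."""
    return sum(math.log(plus[a][b] / minus[a][b]) for a, b in zip(seq, seq[1:]))

def classify_windows(seq, plus, minus, window):
    """Return (start, score, is_island) for every full window, sliding by 1."""
    out = []
    for start in range(len(seq) - window + 1):
        score = log_ratio(seq[start:start + window], plus, minus)
        out.append((start, score, score > 0))
    return out

# Toy transition tables (made-up numbers for illustration only).
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
PLUS = {x: dict(UNIFORM) for x in "ACGT"}
MINUS = {x: dict(UNIFORM) for x in "ACGT"}
PLUS["C"] = {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2}    # CG frequent in islands
MINUS["C"] = {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3}   # CG rare elsewhere

for start, score, is_island in classify_windows("CGCGAAAA", PLUS, MINUS, 4):
    print(start, round(score, 3), is_island)
```

The sketch makes both problems on the slide visible: the result depends on the chosen window size, and windows that straddle an island boundary receive intermediate scores.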
20 Alternative Approach
- Build a model that includes "+" states and "-" states.
- A state "remembers" the last nucleotide and the type of region.
- A transition from a - state to a + state corresponds to the start of a CpG island.
21 C-G Islands: A Different HMM
Define C-G islands (aka CpG islands): DNA stretches which are very rich in CG.
[Figure: two groups of states, "Regular DNA" and "C-G island", each emitting A, C, G, T and connected by "change" transitions; edges are labeled with probabilities expressed in terms of p and q.]
22 A Different C-G Islands HMM
[Figure: the same model (states emitting A, C, G, T, with "change" transitions) unrolled as a chain of hidden variables $H_1,\dots,H_L$ (C-G island?) emitting $X_1,\dots,X_L$ (A/C/G/T).]
23 HMM Recognition (question I)
- For a given model $M = \{A, B, \pi\}$ and a given state sequence $Q_1 Q_2 Q_3 \dots Q_L$, the probability of an observation sequence $O_1 O_2 O_3 \dots O_L$ is
  $P(O \mid Q, M) = b_{Q_1 O_1} b_{Q_2 O_2} b_{Q_3 O_3} \cdots b_{Q_L O_L}$
- For a given hidden Markov model $M = \{A, B, \pi\}$, the probability of the state sequence $Q_1 Q_2 Q_3 \dots Q_L$ is (the initial probability of $Q_1$ is taken to be $\pi_{Q_1}$)
  $P(Q \mid M) = \pi_{Q_1} a_{Q_1 Q_2} a_{Q_2 Q_3} a_{Q_3 Q_4} \cdots a_{Q_{L-1} Q_L}$
- So, for a given HMM M, the probability of an observation sequence $O_1 O_2 O_3 \dots O_L$ is obtained by summing over all possible state sequences.
24 HMM – Recognition (cont.)
$P(O \mid M) = \sum_Q P(O \mid Q, M) P(Q \mid M) = \sum_Q \pi_{Q_1} b_{Q_1 O_1} a_{Q_1 Q_2} b_{Q_2 O_2} a_{Q_2 Q_3} b_{Q_3 O_3} \cdots$
- Requires summing over exponentially many paths.
- Can this be made more efficient?
25 HMM – Recognition (cont.)
- Why isn't it efficient? It takes $O(2L \cdot N^L)$ operations, where N is the number of states:
  - For a given state sequence of length L we have about 2L calculations:
    $P(Q \mid M) = \pi_{Q_1} a_{Q_1 Q_2} a_{Q_2 Q_3} a_{Q_3 Q_4} \cdots a_{Q_{L-1} Q_L}$
    $P(O \mid Q, M) = b_{Q_1 O_1} b_{Q_2 O_2} b_{Q_3 O_3} \cdots b_{Q_L O_L}$
  - There are $N^L$ possible state sequences.
  - So, if N = 5 and L = 100, then the algorithm requires about $200 \times 5^{100}$ computations.
  - Instead, we will use the forward-backward (F-B) algorithm of Baum (68) to do things more efficiently.
26 The Forward Backward Algorithm
- A white board presentation.
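Since the algorithm itself was presented on the white board, here is a minimal sketch of its forward half next to the exponential sum over paths, so the two can be compared on a short sequence. The parameter values are an assumed reading of the loaded coin figure (stay with probability 0.9, switch with 0.1, fair coin 1/2 and 1/2, loaded coin 3/4 heads and 1/4 tails, start 1/2 each); the names `forward` and `brute_force` are hypothetical, and the code assumes a non-empty observation string.

```python
from itertools import product

STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}                     # assumed values
TRANS = {"fair": {"fair": 0.9, "loaded": 0.1},
         "loaded": {"fair": 0.1, "loaded": 0.9}}         # assumed values
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "loaded": {"H": 0.75, "T": 0.25}}                # assumed values

def forward(obs):
    """P(obs | M) by dynamic programming; f[k] = P(o_1..o_i, Q_i = k)."""
    f = {k: START[k] * EMIT[k][obs[0]] for k in STATES}
    for o in obs[1:]:
        f = {l: EMIT[l][o] * sum(f[k] * TRANS[k][l] for k in STATES)
             for l in STATES}
    return sum(f.values())

def brute_force(obs):
    """Same quantity by summing over every state path (exponential time)."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = START[path[0]] * EMIT[path[0]][obs[0]]
        for i in range(1, len(obs)):
            p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][obs[i]]
        total += p
    return total

obs = "HHHTHTTHHT"
print(forward(obs), brute_force(obs))   # the two values should agree
```

The forward recursion does a number of operations proportional to L times the square of the number of states, instead of enumerating all paths. For long sequences the products underflow, so a practical version would rescale at each position or work with logarithms.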
27 The F-B Algorithm (cont.)
Two ways to score a model against an observation sequence:
- Option 1: The likelihood is measured by summing over all sequences of states of length L. This is known as the "Any Path" method.
- Option 2: We can choose an HMM by the probability generated using the best possible sequence of states. We'll refer to this method as the "Best Path" method.
28 HMM – Question II (Harder)
- Given an observation sequence $O = (O_1 O_2 \dots O_L)$ and a model $M = \{A, B, \pi\}$, how do we efficiently compute the most probable sequence(s) of states, Q?
- Namely, the sequence of states $Q = (Q_1 Q_2 \dots Q_L)$ which maximizes $P(Q \mid O, M)$ or, equivalently, the joint probability $P(O, Q \mid M) = P(O \mid Q, M) P(Q \mid M)$ that the given model M goes through the specific sequence of states Q and produces the given observation O.
- Recall that given a model M, a sequence of observations O, and a sequence of states Q, we can efficiently compute $P(O \mid Q, M)$ and $P(Q \mid M)$ (but should watch out for numeric underflows).
29 Most Probable States Sequence (Q. II)
Idea: If we know the identity of $Q_i$, then the most probable sequence on $i+1,\dots,L$ does not depend on observations before time i.
- A white board presentation of Viterbi's algorithm.
- Followed by a simple weather demo.
- An online demo of Viterbi's algorithm.
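The idea above leads to a dynamic program that keeps, for each state, only the best path ending there. The following is a minimal sketch of Viterbi's algorithm in log space (which sidesteps the underflow issue), again using an assumed reading of the loaded coin parameters; the function name and data layout are illustrative choices, and the input is assumed to be non-empty.

```python
import math

STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}                     # assumed values
TRANS = {"fair": {"fair": 0.9, "loaded": 0.1},
         "loaded": {"fair": 0.1, "loaded": 0.9}}         # assumed values
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "loaded": {"H": 0.75, "T": 0.25}}                # assumed values

def viterbi(obs):
    """Return (most probable state path, its log joint probability log P(O, Q | M))."""
    # v[k] = best log-probability of any path that ends in state k at the current position
    v = {k: math.log(START[k]) + math.log(EMIT[k][obs[0]]) for k in STATES}
    back = []                          # back[i][l] = best predecessor of l at position i+1
    for o in obs[1:]:
        ptr, nxt = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][l]))
            ptr[l] = best
            nxt[l] = v[best] + math.log(TRANS[best][l]) + math.log(EMIT[l][o])
        back.append(ptr)
        v = nxt
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):         # follow the back-pointers from the end
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]

path, logp = viterbi("HHHHHHTTTTTT")
print(path, logp)
```

When several paths tie for the maximum, this sketch returns one of them; which one depends on the order of `STATES`.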
30 Dishonest Casino (again)
- Computing posterior probabilities for "fair" at each point in a long sequence:
31 HMM – Question III (Hardest)
- Given an observation sequence $O = (O_1 O_2 \dots O_L)$ and a class of models, each of the form $M = \{A, B, \pi\}$, which specific model "best" explains the observations?
- A solution to question I enables the efficient computation of $P(O \mid M)$ (the probability that a specific model M produces the observation O).
- Question III can be viewed as a learning problem: we want to use the sequence of observations in order to "train" an HMM and learn the optimal underlying model parameters (transition and output probabilities).
32 Learning
Given the two sequences $O_1,\dots,O_T$ (observations) and $S_1,\dots,S_T$ (states), how do we learn the model parameters $a_{ij}$ and $b_i(a)$?
- We want to find parameters $\theta$ that maximize the likelihood $\Pr(O_1,\dots,O_T \mid \theta)$.
We simply count:
- $N_{kl}$: the number of positions i with $S_i = k$ and $S_{i+1} = l$.
- $N_{ka}$: the number of positions i with $S_i = k$ and $O_i = a$.
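When the states are observed, the counting recipe is short enough to write out. The sketch below assumes the usual normalisation of the counts (each $N_{kl}$ divided by the total count of transitions out of k, and each $N_{ka}$ by the total count of emissions from k); the function name `estimate` and the single-letter state labels are hypothetical.

```python
from collections import Counter, defaultdict

def estimate(states, obs):
    """Maximum likelihood A and B from a fully observed (states, obs) pair."""
    n_trans = Counter(zip(states, states[1:]))   # N_kl
    n_emit = Counter(zip(states, obs))           # N_ka
    A, B = defaultdict(dict), defaultdict(dict)
    for (k, l), c in n_trans.items():
        # divide by the number of transitions leaving state k
        A[k][l] = c / sum(v for (k2, _), v in n_trans.items() if k2 == k)
    for (k, a), c in n_emit.items():
        # divide by the number of symbols emitted from state k
        B[k][a] = c / sum(v for (k2, _), v in n_emit.items() if k2 == k)
    return dict(A), dict(B)

# F = fair, L = loaded (toy labelled data, made up for illustration)
A, B = estimate("FFFLLF", "HTHHHT")
print(A)
print(B)
```

Events that never occur in the training data get no entry here, which amounts to probability zero; adding pseudocounts is a common remedy, though that goes beyond what the slide describes.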
33 Learning the Model
Given only the observations $O_1,\dots,O_T$, how do we learn $A_{kl}$ and $B_{ka}$? We want to find parameters $\theta$ that maximize the likelihood $\Pr(O_1,\dots,O_T \mid \theta)$.
Problem:
- Counts are inaccessible, since we do not observe the $S_t$.
34 Learning the Model
Problem: Counts are inaccessible, since the $S_t$'s are hidden.
Solution (heuristic): The EM algorithm, next lecture.
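35 If we have $A_{kl}$ and $B_{ka}$ we can compute the posterior probability $P(h_i = k, h_{i+1} = l \mid x_1,\dots,x_n)$ of each transition, using the forward and backward messages.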
36 Expected Counts
- We can compute the expected number of times that $h_i = k$ and $h_{i+1} = l$.
- Similarly, we can compute the expected number of times that $h_i = k$ and $x_i = a$.
37 Expectation Maximization (EM)
- Choose initial $A_{kl}$ and $B_{ka}$.
- E-step: compute the expected counts $E[N_{kl}]$, $E[N_{ka}]$.
- M-step: re-estimate $A_{kl}$ and $B_{ka}$ from the expected counts.
- Reiterate.
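The loop above can be sketched as a single function that performs one E-step and one M-step. This is an illustrative sketch under several assumptions: states are indexed 0..K-1, `A` is a list of rows, `B` is a list of dictionaries from symbol to probability, the initial distribution `pi` is kept fixed, and the forward and backward values are left unscaled, so it is only suitable for short sequences. All names are hypothetical.

```python
def forward(obs, pi, A, B):
    """f[i][k] = P(x_1..x_i, h_i = k)."""
    K = len(pi)
    f = [[pi[k] * B[k][obs[0]] for k in range(K)]]
    for x in obs[1:]:
        prev = f[-1]
        f.append([B[l][x] * sum(prev[k] * A[k][l] for k in range(K))
                  for l in range(K)])
    return f

def backward(obs, A, B):
    """b[i][k] = P(x_{i+1}..x_n | h_i = k)."""
    K = len(A)
    b = [[1.0] * K]
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, [sum(A[k][l] * B[l][x] * nxt[l] for l in range(K))
                     for k in range(K)])
    return b

def em_step(obs, pi, A, B):
    """One E-step and M-step; returns (new_A, new_B, likelihood under the old parameters)."""
    K, n = len(pi), len(obs)
    f, b = forward(obs, pi, A, B), backward(obs, A, B)
    px = sum(f[-1])
    # E-step: expected transition counts E[N_kl] and emission counts E[N_ka]
    EA = [[sum(f[i][k] * A[k][l] * B[l][obs[i + 1]] * b[i + 1][l]
               for i in range(n - 1)) / px
           for l in range(K)] for k in range(K)]
    EB = [{} for _ in range(K)]
    for i, x in enumerate(obs):
        for k in range(K):
            EB[k][x] = EB[k].get(x, 0.0) + f[i][k] * b[i][k] / px
    # M-step: normalise the expected counts
    new_A = [[c / sum(row) for c in row] for row in EA]
    new_B = [{x: c / sum(row.values()) for x, c in row.items()} for row in EB]
    return new_A, new_B, px
```

Calling `em_step` repeatedly, feeding the returned `new_A` and `new_B` back in, gives the iteration on the slide; the returned likelihood values can be tracked to decide when to stop.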
38 EM - basic properties
- $P(x_1,\dots,x_n : A_{kl}, B_{ka}) \le P(x_1,\dots,x_n : A'_{kl}, B'_{ka})$
  - The likelihood does not decrease in any iteration.
- If $P(x_1,\dots,x_n : A_{kl}, B_{ka}) = P(x_1,\dots,x_n : A'_{kl}, B'_{ka})$ then $A_{kl}, B_{ka}$ is a stationary point of the likelihood:
  - either a local maximum, a local minimum, or a saddle point.
39 Complexity of E-step (here n is the sequence length and L is the number of hidden states)
- Compute forward and backward messages: time complexity $O(nL^2)$, space complexity $O(nL)$.
- Accumulate expected counts: time complexity $O(nL^2)$, space complexity $O(L^2)$.
40 EM - problems
Local maxima:
- Learning can get stuck in local maxima.
- Sensitive to initialization.
- Requires some method for escaping such maxima.
Choosing L (the number of hidden values):
- We often do not know how many hidden values we should have or can learn.
41 Communication Example
42 1. Compute the posterior belief in $H_i$ (for a specific i) given the evidence $\{x_1,\dots,x_L\}$, for each of $H_i$'s values $h_i$; namely, compute $p(h_i \mid x_1,\dots,x_L)$.
2. Do the same computation for every $H_i$, but without repeating the first task L times.
Coin-Tossing Example: seeing the set of outcomes $\{x_1,\dots,x_L\}$, compute $p(\text{loaded} \mid x_1,\dots,x_L)$ for each coin toss.
Q.: What is the most likely sequence of values in the H-nodes to generate the observed data?
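Task 2 can be done with one forward pass and one backward pass, after which every posterior is a normalised product of the two messages at that position. The sketch below assumes the same reading of the loaded coin parameters as an illustration (0.9 to stay, 0.1 to switch, fair 1/2 and 1/2, loaded 3/4 heads) and normalises the messages at every position so that long sequences do not underflow; the function name `posteriors` is hypothetical.

```python
STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}                     # assumed values
TRANS = {"fair": {"fair": 0.9, "loaded": 0.1},
         "loaded": {"fair": 0.1, "loaded": 0.9}}         # assumed values
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "loaded": {"H": 0.75, "T": 0.25}}                # assumed values

def posteriors(obs):
    """Return a list of dicts with post[i][k] = p(H_i = k | x_1..x_L)."""
    # Forward pass; each message is rescaled to sum to 1.
    f = []
    cur = {k: START[k] * EMIT[k][obs[0]] for k in STATES}
    for i, x in enumerate(obs):
        if i > 0:
            cur = {l: EMIT[l][x] * sum(f[-1][k] * TRANS[k][l] for k in STATES)
                   for l in STATES}
        z = sum(cur.values())
        f.append({k: cur[k] / z for k in STATES})
    # Backward pass, rescaled in the same way.
    b = [dict.fromkeys(STATES, 1.0)]
    for x in reversed(obs[1:]):
        cur = {k: sum(TRANS[k][l] * EMIT[l][x] * b[0][l] for l in STATES)
               for k in STATES}
        z = sum(cur.values())
        b.insert(0, {k: cur[k] / z for k in STATES})
    # Combine: the posterior is proportional to forward times backward.
    post = []
    for fi, bi in zip(f, b):
        z = sum(fi[k] * bi[k] for k in STATES)
        post.append({k: fi[k] * bi[k] / z for k in STATES})
    return post

for i, p in enumerate(posteriors("HHHHHHTTTTTT")):
    print(i + 1, round(p["loaded"], 3))
```

Note that picking the most probable state at each position separately answers a different question from the final one on the slide, which asks for the single most likely joint sequence of hidden values.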