CISC 667 Intro to Bioinformatics (Fall 2005) Hidden Markov Models (I)


CISC 667 Intro to Bioinformatics (Fall 2005) Hidden Markov Models (I)
- The model
- The decoding: Viterbi algorithm
CISC667, F05, Lec10, Liao

Hidden Markov models
A Markov chain of states; at each state, there is a set of possible observables (symbols). The states themselves are not directly observable, i.e., they are hidden. E.g., casino fraud.
Three major problems:
- Most probable state path
- The likelihood
- Parameter estimation for HMMs
The dishonest-casino example has two states, Fair and Loaded:
Fair die emissions: 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded die emissions: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.9, Loaded→Fair 0.1
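As a sketch (not part of the slides), the casino model above can be written out as plain data, plus a small sampler that generates a sequence of hidden states and observed rolls; the function name `sample` and the choice to start in the Fair state are our own assumptions.

```python
import random

# The dishonest-casino HMM from the slide, as plain data:
# two hidden states (Fair, Loaded), each emitting die faces 1..6.
STATES = ["Fair", "Loaded"]
EMIT = {
    "Fair":   {f: 1/6 for f in range(1, 7)},
    "Loaded": {**{f: 1/10 for f in range(1, 6)}, 6: 1/2},
}
TRANS = {
    "Fair":   {"Fair": 0.95, "Loaded": 0.05},
    "Loaded": {"Fair": 0.10, "Loaded": 0.90},
}

def sample(n, start="Fair", rng=random):
    """Generate n rolls; return (hidden state path, observed faces)."""
    states, rolls = [], []
    s = start
    for _ in range(n):
        states.append(s)
        faces, weights = zip(*EMIT[s].items())
        rolls.append(rng.choices(faces, weights)[0])
        s = rng.choices(STATES, weights=[TRANS[s][t] for t in STATES])[0]
    return states, rolls

states, rolls = sample(20)
```

Because the loaded die emits a 6 half the time, long runs of sixes in `rolls` hint at stretches where the hidden state was Loaded; recovering that path from the rolls alone is exactly the decoding problem treated below.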

A biological example: CpG islands
The higher rate of methyl-C mutating to T in CpG dinucleotides → generally lower CpG presence in the genome, except in some biologically important regions, e.g., promoters; such regions are called CpG islands.
The conditional probabilities P±(b|a), the probability that nucleotide b follows nucleotide a, were collected from ~60,000 bps of human genome sequence; "+" stands for CpG islands and "−" for non-CpG islands.

P+      A     C     G     T
A    .180  .274  .426  .120
C    .171  .368  .274  .188
G    .161  .339  .375  .125
T    .079  .355  .384  .182

P−      A     C     G     T
A    .300  .205  .285  .210
C    .322  .298  .078  .302
G    .248  .246  .298  .208
T    .177  .239  .292  .292

(Rows: current nucleotide; columns: next nucleotide.)

Task 1: given a sequence x, determine whether it is a CpG island.
One solution: compute the log-odds ratio scored by the two Markov chains:
S(x) = log [ P(x | model +) / P(x | model −) ]
where P(x | model +) = P+(x2|x1) P+(x3|x2) … P+(xL|xL−1), and similarly for model −.
[Figure: histogram of the length-normalized scores S(x) (number of sequences vs. score); CpG-island sequences are shown as dark shaded.]
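A minimal sketch of this scoring scheme, using the +/− transition tables from the previous slide; the function name `log_odds` and the use of the natural log are our assumptions.

```python
import math

# +/- transition probabilities from the slide;
# rows: current nucleotide, columns: next nucleotide.
P_PLUS = {
    'A': {'A': .180, 'C': .274, 'G': .426, 'T': .120},
    'C': {'A': .171, 'C': .368, 'G': .274, 'T': .188},
    'G': {'A': .161, 'C': .339, 'G': .375, 'T': .125},
    'T': {'A': .079, 'C': .355, 'G': .384, 'T': .182},
}
P_MINUS = {
    'A': {'A': .300, 'C': .205, 'G': .285, 'T': .210},
    'C': {'A': .322, 'C': .298, 'G': .078, 'T': .302},
    'G': {'A': .248, 'C': .246, 'G': .298, 'T': .208},
    'T': {'A': .177, 'C': .239, 'G': .292, 'T': .292},
}

def log_odds(x):
    """S(x) = log P(x | model +) - log P(x | model -), summed over transitions."""
    return sum(math.log(P_PLUS[a][b] / P_MINUS[a][b]) for a, b in zip(x, x[1:]))

print(log_odds("CGCGCG"))  # positive: looks like a CpG island
print(log_odds("ATATAT"))  # negative: looks like background
```

The log turns the product of per-transition ratios into a sum, so S(x) grows linearly with length; dividing by the number of transitions gives the length-normalized score plotted in the histogram.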

Task 2: for a long genomic sequence x, label the CpG islands, if there are any.
Approach 1: adapt the method for Task 1 by calculating the log-odds score for a window of, say, 100 bps around every nucleotide and plotting it.
Problems with this approach:
- It won't do well if CpG islands have sharp boundaries and variable lengths.
- There is no effective way to choose a good window size.
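Approach 1 can be sketched as a sliding-window scorer (the tables are repeated so the snippet is self-contained); the centered window and normalization by the number of transitions are our assumptions, not prescribed by the slide.

```python
import math

# +/- transition tables from the earlier slide.
P_PLUS = {
    'A': {'A': .180, 'C': .274, 'G': .426, 'T': .120},
    'C': {'A': .171, 'C': .368, 'G': .274, 'T': .188},
    'G': {'A': .161, 'C': .339, 'G': .375, 'T': .125},
    'T': {'A': .079, 'C': .355, 'G': .384, 'T': .182},
}
P_MINUS = {
    'A': {'A': .300, 'C': .205, 'G': .285, 'T': .210},
    'C': {'A': .322, 'C': .298, 'G': .078, 'T': .302},
    'G': {'A': .248, 'C': .246, 'G': .298, 'T': .208},
    'T': {'A': .177, 'C': .239, 'G': .292, 'T': .292},
}

def log_odds(x):
    return sum(math.log(P_PLUS[a][b] / P_MINUS[a][b]) for a, b in zip(x, x[1:]))

def window_scores(x, w=100):
    """Length-normalized log-odds of the w-wide window centered at each position."""
    half = w // 2
    scores = []
    for i in range(len(x)):
        win = x[max(0, i - half): i + half + 1]
        scores.append(log_odds(win) / max(1, len(win) - 1))
    return scores
```

Plotting `window_scores(x)` along the sequence shows the two problems listed above: island boundaries are smeared over roughly a window width, and the curve changes shape with `w`.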

Approach 2: using a hidden Markov model
The model has two states: "+" for CpG island and "−" for non-CpG island.
Transitions: a++ = 0.65, a+− = 0.35, a−+ = 0.30, a−− = 0.70
Emissions in "+": A: .170, C: .368, G: .274, T: .188
Emissions in "−": A: .372, C: .198, G: .112, T: .338
These numbers are made up here, and shall be fixed by learning from training examples. A reasonable assignment for the emission frequencies is the respective limiting distribution of the two Markov chains in Approach 1.
Notation: akl is the transition probability from state k to state l; ek(b) is the emission frequency, i.e., the probability that symbol b is seen when in state k.

The probability that sequence x is emitted by a state path π is:
P(x, π) = ∏i=1 to L eπi(xi) aπiπi+1
Example (with the two-state model above):
i: 1 2 3 4 5 6 7 8 9
x: T G C G C G T A C
π: − − + + + + − − −
P(x, π) = 0.338 × 0.70 × 0.112 × 0.30 × 0.368 × 0.65 × 0.274 × 0.65 × 0.368 × 0.65 × 0.274 × 0.35 × 0.338 × 0.70 × 0.372 × 0.70 × 0.198
Then, the probability of observing sequence x in the model is P(x) = Σπ P(x, π), which is also called the likelihood of the model.
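The product above can be checked with a few lines of code; `path_prob` and the dictionaries holding the model are our own names for the quantities defined on this slide (no start/end terms are included, matching the slide's product).

```python
# Direct implementation of P(x, pi) = prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}
# for the two-state CpG model, with the slide's numbers.
EMIT = {
    '+': {'A': .170, 'C': .368, 'G': .274, 'T': .188},
    '-': {'A': .372, 'C': .198, 'G': .112, 'T': .338},
}
TRANS = {
    '+': {'+': 0.65, '-': 0.35},
    '-': {'+': 0.30, '-': 0.70},
}

def path_prob(x, pi):
    """Joint probability of emitting x along state path pi."""
    p = 1.0
    for i, (sym, state) in enumerate(zip(x, pi)):
        p *= EMIT[state][sym]          # emission at position i
        if i + 1 < len(pi):            # transition to the next state
            p *= TRANS[state][pi[i + 1]]
    return p

print(path_prob("TGCGCGTAC", "--++++---"))
```

Even for this 9-symbol sequence the joint probability is tiny (on the order of 1e-7), which is why practical implementations work in log space.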

Decoding: given an observed sequence x, find the most probable state path, i.e., π* = argmaxπ P(x, π).
Q: Given a sequence x of length L, how many state paths are there?
A: N^L, where N stands for the number of states in the model. Being an exponential function of the input size, this precludes enumerating all possible state paths for computing P(x).
Viterbi algorithm
Initialization: v0(0) = 1; vk(0) = 0 for k > 0.
Recursion: vk(i) = ek(xi) maxj (vj(i−1) ajk); ptri(k) = argmaxj (vj(i−1) ajk)
Termination: P(x, π*) = maxk (vk(L) ak0); π*L = argmaxj (vj(L) aj0)
Traceback: π*i−1 = ptri(π*i)
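A sketch of the algorithm on the two-state CpG model, written in log space to avoid underflow; start probabilities of 0.5/0.5 are taken from the worked example on the next slide, and the end transitions ak0 are omitted (taken as 1), which are our simplifying assumptions.

```python
import math

STATES = ['+', '-']
START = {'+': 0.5, '-': 0.5}
EMIT = {
    '+': {'A': .170, 'C': .368, 'G': .274, 'T': .188},
    '-': {'A': .372, 'C': .198, 'G': .112, 'T': .338},
}
TRANS = {
    '+': {'+': 0.65, '-': 0.35},
    '-': {'+': 0.30, '-': 0.70},
}

def viterbi(x):
    """Return (max path probability, most probable state path) for sequence x."""
    # Initialization: absorb the start transition into position 1.
    v = {k: math.log(START[k]) + math.log(EMIT[k][x[0]]) for k in STATES}
    ptr = []  # ptr[i][k] = best predecessor of state k at position i+2
    for sym in x[1:]:
        back, v_new = {}, {}
        for k in STATES:
            # Recursion: v_k(i) = e_k(x_i) * max_j v_j(i-1) * a_jk
            j = max(STATES, key=lambda s: v[s] + math.log(TRANS[s][k]))
            back[k] = j
            v_new[k] = v[j] + math.log(TRANS[j][k]) + math.log(EMIT[k][sym])
        ptr.append(back)
        v = v_new
    # Termination and traceback.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return math.exp(v[last]), ''.join(reversed(path))

prob, path = viterbi("AAGCG")
print(path)  # "--+++", matching the worked example on the next slide
```

The loop does O(N^2) work per position, so decoding costs O(N^2 L) time instead of the O(N^L) of brute-force enumeration.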

Worked example of the recursion vk(i) = ek(xi) maxj (vj(i−1) ajk), with a start state ε that moves to "+" or "−" with probability 0.5 each, and the same model as before (emissions in "+": A .170, C .368, G .274, T .188; in "−": A .372, C .198, G .112, T .338; transitions a++ = 0.65, a+− = 0.35, a−+ = 0.30, a−− = 0.70).

vk(i)      ε      A      A      G      C        G
k = ε      1      0      0      0      0        0
k = −      0    .186   .048   .0038  .00053  .000042
k = +      0    .085   .0095  .0039  .00093  .00038

Traceback gives the hidden state path: − − + + +.

Casino Fraud: investigation results by Viterbi decoding