Hidden Markov Models CBB 231 / COMPSCI 261

What is an HMM?

An HMM is a stochastic machine M = (Q, Σ, P_t, P_e) consisting of the following:
a finite set of states, Q = {q_0, q_1, ..., q_m}
a finite alphabet Σ = {s_0, s_1, ..., s_n}
a transition distribution P_t : Q×Q → [0,1], i.e., P_t(q_j | q_i)
an emission distribution P_e : Q×Σ → [0,1], i.e., P_e(s_j | q_i)

An Example

M_1 = ({q_0, q_1, q_2}, {Y, R}, P_t, P_e)
P_t = {(q_0,q_1,1), (q_1,q_1,0.8), (q_1,q_2,0.15), (q_1,q_0,0.05), (q_2,q_2,0.7), (q_2,q_1,0.3)}
P_e = {(q_1,Y,1), (q_1,R,0), (q_2,Y,0), (q_2,R,1)}

(Figure: a three-state machine in which q_0 is the silent start/stop state, q_1 emits Y 100% of the time, and q_2 emits R 100% of the time; the transitions are q_0→q_1 100%, q_1→q_1 80%, q_1→q_2 15%, q_1→q_0 5%, q_2→q_2 70%, q_2→q_1 30%.)

Probability of a Sequence

P(YRYRY | M_1) = a_{0→1} · b_{1,Y} · a_{1→2} · b_{2,R} · a_{2→1} · b_{1,Y} · a_{1→2} · b_{2,R} · a_{2→1} · b_{1,Y} · a_{1→0}
= 1 × 1 × 0.15 × 1 × 0.3 × 1 × 0.15 × 1 × 0.3 × 1 × 0.05 = 1.0125 × 10^-4

(Since q_1 emits only Y and q_2 emits only R, this alternating path is the only path through M_1 that can generate YRYRY, so its probability is the probability of the sequence.)
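For concreteness, here is a minimal Python sketch (not from the original slides) that stores M_1's transition and emission distributions in dictionaries and multiplies out the terms above; the dictionary layout and the function name are illustrative choices only.

# The example machine M1: q0 is the silent start/stop state, q1 emits Y, q2 emits R.
P_t = {  # transition probabilities P_t(next | current)
    ("q0", "q1"): 1.00,
    ("q1", "q1"): 0.80, ("q1", "q2"): 0.15, ("q1", "q0"): 0.05,
    ("q2", "q2"): 0.70, ("q2", "q1"): 0.30,
}
P_e = {  # emission probabilities P_e(symbol | state)
    ("q1", "Y"): 1.0, ("q1", "R"): 0.0,
    ("q2", "Y"): 0.0, ("q2", "R"): 1.0,
}

def path_probability(states, symbols):
    """Probability that M1 follows `states` (excluding q0) while emitting `symbols`,
    starting from q0 and returning to q0 at the end."""
    p = 1.0
    prev = "q0"
    for q, s in zip(states, symbols):
        p *= P_t.get((prev, q), 0.0) * P_e.get((q, s), 0.0)
        prev = q
    return p * P_t.get((prev, "q0"), 0.0)   # final transition back to q0

# The only path that can emit YRYRY alternates q1 and q2:
print(path_probability(["q1", "q2", "q1", "q2", "q1"], "YRYRY"))  # 1.0125e-4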

Another Example

M_2 = (Q, Σ, P_t, P_e), with Q = {q_0, q_1, q_2, q_3, q_4} and Σ = {A, C, G, T}

(Figure: a five-state machine with silent start/stop state q_0 and four emitting states; the emission distributions over (A, T, C, G) for the four emitting states are (10%, 30%, 40%, 20%), (11%, 17%, 43%, 29%), (35%, 25%, 15%, 25%), and (27%, 14%, 22%, 37%), and the transition probabilities on the arrows are 100%, 50%, 65%, 35%, 20%, and 80%.)

Finding the Most Probable Path

Example: C A T T A A T A G

(Figure: two candidate state paths through M_2 for this sequence, with probabilities 7.0×10^-… for the top path and 2.8×10^-… for the bottom path.)

The most probable path assigns a state to each position of CATTAATAG, resulting in this parse:
feature 1: C
feature 2: ATTAATA
feature 3: G
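As a sketch of what "finding the most probable path" means, the following brute-force enumeration uses a small hypothetical two-state DNA model (its state names and probabilities are invented for illustration and are not the M_2 values above); the Viterbi algorithm introduced below computes the same answer without enumerating all |Q|^L paths.

from itertools import product

# Hypothetical two-state model, for illustration only.
states = ["E", "I"]
P_start = {"E": 0.5, "I": 0.5}
P_t = {("E", "E"): 0.9, ("E", "I"): 0.1,
       ("I", "I"): 0.8, ("I", "E"): 0.2}
P_e = {"E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
       "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def best_path(seq):
    """Enumerate every state path (O(|Q|^L), so only feasible for tiny L)
    and return the most probable one together with its probability."""
    best, best_p = None, 0.0
    for path in product(states, repeat=len(seq)):
        p = P_start[path[0]] * P_e[path[0]][seq[0]]
        for i in range(1, len(seq)):
            p *= P_t[(path[i - 1], path[i])] * P_e[path[i]][seq[i]]
        if p > best_p:
            best, best_p = path, p
    return best, best_p

print(best_path("CATTAATAG"))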

Decoding with an HMM

Decoding means finding the state path that best explains an observed sequence. The joint probability of a path φ = y_0 ... y_{L-1} and a sequence S = x_0 ... x_{L-1} factors into emission probabilities and transition probabilities:

P(φ, S) = P_t(q_0 | y_{L-1}) · ∏_{i=0}^{L-1} P_e(x_i | y_i) · P_t(y_i | y_{i-1})

(with y_{-1} = q_0), and the decoder seeks the path maximizing P(φ | S), which is the same as maximizing P(φ, S).

The Best Partial Parse

V(i,k) denotes the probability of the single most probable path that emits the prefix S_0..k-1 = x_0 ... x_{k-1} and ends in state q_i, i.e., the best partial parse of the first k symbols.

The Viterbi Algorithm

(Figure: the dynamic-programming matrix, with one row per state and one column per sequence position; cell (i,k) is computed from the cells in column k-1.)

Each cell is filled by maximizing over the possible predecessor states:
V(i,k) = P_e(x_{k-1} | q_i) · max_j V(j, k-1) · P_t(q_i | q_j)

Viterbi: Traceback

During the forward pass, T(i,k) records which predecessor state achieved the maximum for cell (i,k). Starting from the best state in the last column, the optimal path is recovered by following these pointers back to the start state:
T(T(...T(T(i, L-1), L-2)..., 2), 1), 0) = 0

Viterbi Algorithm in Pseudocode

trans[q_i] = {q_j | P_t(q_i | q_j) > 0}   (the states with a transition into q_i)
emit[s] = {q_i | P_e(s | q_i) > 0}   (the states that can emit symbol s)

The pseudocode proceeds in four steps: initialization; filling out the main part of the DP matrix; choosing the best state from the last column of the DP matrix; and traceback.
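A possible Python realization of this pseudocode, assuming the dictionary representation used in the earlier example (P_t keyed by (from-state, to-state), P_e keyed by (state, symbol)); an illustrative sketch, not the course's reference implementation.

import math

def viterbi(seq, states, P_t, P_e, start="q0"):
    """Most probable state path for seq; log-space is used to avoid underflow."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    L = len(seq)
    V = [{q: float("-inf") for q in states} for _ in range(L)]
    T = [{q: None for q in states} for _ in range(L)]   # traceback pointers

    # initialization: first symbol emitted after leaving the start state
    for q in states:
        V[0][q] = lg(P_t.get((start, q), 0)) + lg(P_e.get((q, seq[0]), 0))

    # fill out the main part of the DP matrix
    for k in range(1, L):
        for q in states:
            best_prev, best = None, float("-inf")
            for p in states:
                score = V[k - 1][p] + lg(P_t.get((p, q), 0))
                if score > best:
                    best_prev, best = p, score
            V[k][q] = best + lg(P_e.get((q, seq[k]), 0))
            T[k][q] = best_prev

    # choose the best state in the last column, then trace back
    end = max(states, key=lambda q: V[L - 1][q])
    path = [end]
    for k in range(L - 1, 0, -1):
        path.append(T[k][path[-1]])
    return list(reversed(path)), V[L - 1][end]

# e.g. with the M1 dictionaries defined earlier:
# print(viterbi("YRYRY", ["q1", "q2"], P_t, P_e))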

The Forward Algorithm: Probability of a Sequence

F(i,k) represents the probability P(S_0..k-1 | q_i) that the machine emits the subsequence x_0 ... x_{k-1} by any path ending in state q_i, i.e., so that symbol x_{k-1} is emitted by state q_i.

The Forward Algorithm: Probability of a Sequence

(Figure: the same DP matrix as for Viterbi, with states as rows and sequence positions as columns.)

Viterbi: the single most probable path (take the maximum over predecessor states).
Forward: sum over all paths, i.e.
F(i,k) = P_e(x_{k-1} | q_i) · Σ_j F(j, k-1) · P_t(q_i | q_j)

The Forward Algorithm in Pseudocode

The pseudocode has two steps: fill out the DP matrix, then sum over the final column to get P(S).
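A matching Python sketch under the same assumed dictionary representation; note that only the max-versus-sum step differs from Viterbi. (Real implementations work in log-space or rescale columns to avoid underflow; this sketch uses raw probabilities for clarity.)

def forward(seq, states, P_t, P_e, start="q0"):
    """Return P(S): the total probability of seq summed over all state paths."""
    L = len(seq)
    F = [{q: 0.0 for q in states} for _ in range(L)]

    # initialization
    for q in states:
        F[0][q] = P_t.get((start, q), 0.0) * P_e.get((q, seq[0]), 0.0)

    # fill out the DP matrix: sum (not max) over predecessor states
    for k in range(1, L):
        for q in states:
            total = sum(F[k - 1][p] * P_t.get((p, q), 0.0) for p in states)
            F[k][q] = total * P_e.get((q, seq[k]), 0.0)

    # sum over the final column to get P(S)
    return sum(F[L - 1].values())

# e.g. with the M1 dictionaries defined earlier:
# print(forward("YRYRY", ["q1", "q2"], P_t, P_e))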

Training an HMM from Labeled Sequences

Training sequence (each symbol labeled with the state that emitted it):
CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC

Estimated transition counts (and probabilities):
from state 0: to state 0: 0 (0%), to state 1: 1 (100%), to state 2: 0 (0%)
from state 1: to state 0: 1 (4%), to state 1: 21 (84%), to state 2: 3 (12%)
from state 2: to state 0: 0 (0%), to state 1: 3 (20%), to state 2: 12 (80%)

Estimated emission counts (and probabilities):
in state 1: A: 6 (24%), C: 7 (28%), G: 5 (20%), T: 7 (28%)
in state 2: A: 3 (20%), C: 3 (20%), G: 2 (13%), T: 7 (47%)
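A minimal sketch of labeled-sequence training by simple counting, assuming the labels are supplied as one state number per symbol (the function and variable names are illustrative).

from collections import Counter

def train_labeled(sequence, labels):
    """Estimate P_t and P_e by counting, given a sequence and one state label
    per symbol; state 0 plays the role of the silent start/stop state q0."""
    assert len(sequence) == len(labels)
    t_counts, e_counts = Counter(), Counter()
    prev = 0                                   # start in the silent state q0
    for symbol, state in zip(sequence, labels):
        t_counts[(prev, state)] += 1           # transition prev -> state
        e_counts[(state, symbol)] += 1         # emission of symbol in state
        prev = state
    t_counts[(prev, 0)] += 1                   # final transition back to q0

    def normalize(counts):
        totals = Counter()
        for (src, _), c in counts.items():
            totals[src] += c
        return {k: c / totals[k[0]] for k, c in counts.items()}

    return normalize(t_counts), normalize(e_counts)

# Hypothetical usage (the label string itself is not reproduced here):
# P_t, P_e = train_labeled("CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC", labels)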

Recall: Eukaryotic Gene Structure

(Figure: a complete mRNA contains a coding segment beginning at the start codon ATG and ending at a stop codon such as TGA; in the genomic sequence the coding segment is split into exons separated by introns, each intron beginning at a donor site (GT) and ending at an acceptor site (AG).)

Using an HMM for Gene Prediction

(Figure: the input sequence AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA is decoded with a Markov model whose states are Intergenic, Start codon, Stop codon, Exon, Donor, Acceptor, and Intron, plus the silent state q_0; the most probable path through the model yields the gene prediction, a parse of the sequence into exon 1, exon 2, and exon 3 with the intervening introns and flanking intergenic DNA.)

Higher Order Markovian Eukaryotic Recognizer (HOMER)

Versions: H3, H5, H17, H27, H77, H95

HOMER, version H3

States: I = intron, E = exon, N = intergenic (plus the silent state q_0), with Start codon, Stop codon, Donor, and Acceptor transitions between them. Tested on 500 Arabidopsis genes.

(Table: sensitivity (Sn), specificity (Sp), and F score at the nucleotide, splice-site, start/stop-codon, exon, and gene levels, comparing the baseline with H3.)

Recall: Sensitivity and Specificity

Sn = TP / (TP + FN), the fraction of true features that are predicted; Sp = TP / (TP + FP), the fraction of predictions that are correct.
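For reference, a trivial helper computing these measures; the F column in the tables below is assumed here to be the harmonic mean of Sn and Sp.

def sensitivity_specificity(tp, fp, fn):
    """Sensitivity, specificity, and F score as used in gene-finder evaluation
    (F taken as the harmonic mean of Sn and Sp, an assumed convention)."""
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f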

HOMER, version H5

Three exon states, one for each of the three codon positions.

(Table: Sn/Sp/F at the nucleotide, splice-site, start/stop-codon, exon, and gene levels, comparing H3 with H5.)

HOMER, version H17

Separate multi-state submodels for the acceptor site, donor site, start codon, and stop codon, in addition to the Intergenic, Exon, and Intron states.

(Table: Sn/Sp/F at the nucleotide, splice-site, start/stop-codon, exon, and gene levels, comparing H5 with H17.)

Maintaining Phase Across an Intron

(Figure: the sequence GTATGCGATAGTCAAGAGTGATCGCTAGACC annotated with coordinates and codon phase, illustrating that the reading-frame phase in which one exon ends must be the phase in which the next exon resumes after the intervening intron.)
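The bookkeeping itself is one line of modular arithmetic; a toy helper, purely for illustration.

def phase_after_exon(phase_before, exon_length):
    """Codon phase carried across an intron: an exon that begins in phase p and
    contributes n coding bases leaves the gene in phase (p + n) mod 3, and the
    next exon must resume in that phase."""
    return (phase_before + exon_length) % 3

# e.g. an exon of 7 coding bases starting in phase 0 ends in phase 1,
# so the exon after the intron must begin in phase 1:
assert phase_after_exon(0, 7) == 1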

HOMER, version H27

Three separate intron models (one per codon phase).

(Table: Sn/Sp/F at the nucleotide, splice-site, start/stop-codon, exon, and gene levels, comparing H17 with H27.)

Recall: Weight Matrices

(Figure: position weight matrices for the signal sensors: start codons (consensus ATG), stop codons (TGA, TAA, TAG), donor splice sites (consensus GT), and acceptor splice sites (consensus AG).)
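A small sketch of building and applying a weight matrix, with a few made-up donor-site windows standing in for real training data; names, window length, and pseudocount are illustrative assumptions.

import math
from collections import Counter

def build_pwm(aligned_sites, alphabet="ACGT", pseudocount=1.0):
    """Build a position weight matrix (log-probabilities per position and symbol)
    from a set of aligned example signal windows."""
    length = len(aligned_sites[0])
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        total = sum(counts[a] + pseudocount for a in alphabet)
        pwm.append({a: math.log((counts[a] + pseudocount) / total) for a in alphabet})
    return pwm

def score_window(pwm, window):
    """Log-probability of a candidate window under the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))

# Hypothetical usage with made-up donor-site windows:
donor_examples = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGG"]
pwm = build_pwm(donor_examples)
print(score_window(pwm, "CAGGTAAGT"))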

HOMER, version H77

Models positional biases near splice sites.

(Table: Sn/Sp/F at the nucleotide, splice-site, start/stop-codon, exon, and gene levels, comparing H27 with H77.)

HOMER, version H95

(Table: Sn/Sp/F at the nucleotide, splice-site, start/stop-codon, exon, and gene levels, comparing H77 with H95.)

Summary of HOMER Results

Higher-order Markov Models

For the example context ...ACGCTA..., the probability of emitting G can be conditioned on different amounts of preceding sequence:
0th order: P(G)
1st order: P(G | C)
2nd order: P(G | AC)
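Such conditional probabilities are estimated by counting (k+1)-mers in training data; a minimal sketch (illustrative names, with add-one pseudocounts as an assumed smoothing choice).

from collections import Counter

def train_order_k(sequences, k, alphabet="ACGT", pseudocount=1.0):
    """Estimate k-th order Markov probabilities P(x | preceding k bases)
    by counting (k+1)-mers in the training sequences."""
    context_counts, kmer_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            ctx, x = seq[i - k:i], seq[i]
            context_counts[ctx] += 1
            kmer_counts[(ctx, x)] += 1

    def prob(x, ctx):
        num = kmer_counts[(ctx, x)] + pseudocount
        den = context_counts[ctx] + pseudocount * len(alphabet)
        return num / den

    return prob

# e.g. a 2nd-order model estimates P(G | AC) from how often AC is followed by G:
p2 = train_order_k(["ACGCTACGGACGT"], k=2)
print(p2("G", "AC"))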

Higher-order Markov Models

(Table: Sn/Sp/F at the nucleotide, splice-site, start/stop-codon, exon, and gene levels as the order of the Markov model is increased.)

Variable-Order Markov Models

Higher orders require much more training data; when a particular context has been seen too rarely, a variable-order model falls back to (or interpolates with) lower-order estimates.
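The slides do not spell out a formula here, so the following is a generic sketch of one way a variable-order estimate can back off to lower orders when a context is rare; the linear weighting and the count threshold are assumptions made for illustration, not the course's particular interpolation scheme.

def interpolated_prob(x, context, models, counts, threshold=400):
    """Interpolated estimate of P(x | context).
    models[k] is a callable giving the order-k estimate P_k(x | last k bases);
    counts[k] maps each k-base context to how often it was seen in training.
    Well-supported contexts rely on the higher order; rare ones fall back."""
    k = len(context)
    if k == 0:
        return models[0](x, "")
    c = counts[k].get(context, 0)
    w = min(1.0, c / threshold)          # weight given to the order-k estimate
    higher = models[k](x, context)
    lower = interpolated_prob(x, context[1:], models, counts, threshold)
    return w * higher + (1 - w) * lower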

Interpolation Results

Summary

An HMM is a stochastic generative model which emits sequences.
Parsing with an HMM can be accomplished using a decoding algorithm (such as Viterbi) to find the most probable (MAP) path generating the input sequence.
Training of unambiguous HMMs can be accomplished using labeled sequence training.
Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm (next lesson...).
Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state (next lesson...).