Hidden Markov Models in Bioinformatics


Hidden Markov Models in Bioinformatics

[Figure: an HMM with observations O1-O10 and hidden states H1, H2, H3]

Definition
Three Key Algorithms: Summing over Unknown States, Most Probable Unknown States, Marginalizing Unknown States
Key Bioinformatic Applications: Pedigree Analysis, Profile HMM Alignment, Fast/Slowly Evolving States, Statistical Alignment

Hidden Markov Models

[Figure: hidden states H1, H2, H3 with observations O1-O10]

(O1,H1), (O2,H2), ..., (On,Hn) is a sequence of stochastic variables with two components: one that is observed (Oi) and one that is hidden (Hi). The hidden Hi's form a homogeneous Markov chain with transition probabilities p_{i,j} = P(H_{k+1} = j | H_k = i). Let π_i = P(H_1 = i); often π is the equilibrium distribution of the Markov chain. Conditional on the hidden states H_k (all k), the observations O_k are independent, and the distribution of O_k depends only on the value of H_k; it is called the emission (emit) function, written e_j(O) = P(O_k = O | H_k = j).
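
To make the definition concrete, here is a minimal Python sketch (not part of the original slides) that generates a hidden Markov chain together with its emissions. The parameter values are those of the two-state toy model used in the example slide below; the dictionary-based representation and the function names are illustrative choices, not anything prescribed by the presentation.

```python
import random

# Toy two-state HMM from the worked example below: hidden states {a, b},
# observables {0, 1}, start distribution (0.5, 0.5) = equilibrium of the chain.
START = {"a": 0.5, "b": 0.5}
TRANS = {"a": {"a": 0.9, "b": 0.1},
         "b": {"a": 0.1, "b": 0.9}}
EMIT = {"a": {0: 0.7, 1: 0.3},
        "b": {0: 0.3, 1: 0.7}}

def draw(dist):
    """Sample a key from a {value: probability} dictionary."""
    r, acc = random.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key  # guard against floating-point round-off

def sample_hmm(n):
    """Generate (hidden, observed) sequences of length n: the hidden states
    form a Markov chain, and each observation depends only on the hidden
    state at the same position (the emission / 'emit' function)."""
    hidden, observed = [], []
    h = draw(START)
    for _ in range(n):
        hidden.append(h)
        observed.append(draw(EMIT[h]))
        h = draw(TRANS[h])
    return hidden, observed

print(sample_hmm(10))
```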

What is the probability of the data?

The probability of the observed sequence is P(O) = Σ_H P(O, H), summed over all hidden paths H, which could be hard to calculate term by term. However, these calculations can be considerably accelerated. Let f_k(j) = P(O_1, ..., O_k, H_k = j) be the probability of the first k observations together with H_k = j. The following recursion will be obeyed:

f_1(j) = π_j e_j(O_1)
f_{k+1}(j) = e_j(O_{k+1}) Σ_i f_k(i) p_{i,j}
P(O) = Σ_j f_n(j)

Example - probability of the data

Observables {0, 1} at times 1, 2, 3; hidden states {a, b}.

Emission probabilities: P(0|a) = .7, P(1|a) = .3, P(0|b) = .3, P(1|b) = .7
Transition probabilities: p_aa = .9, p_ab = .1, p_ba = .1, p_bb = .9
Equilibrium distribution of (a, b): (.5, .5), so π_a = π_b = .5

Observation: 0 1 1

Direct calculation sums over all 8 hidden paths, e.g.
P(aaa) = .5 * .9 * .9, P(011 | aaa) = .7 * .3 * .3
P(aab) = .5 * .9 * .1, P(011 | aab) = .7 * .3 * .7
... and so on for the remaining hidden paths.

Forward recursion:
f_1(a) = π_a * P(0|a) = .5 * .7 = .35, f_1(b) = π_b * P(0|b) = .5 * .3 = .15
f_2(a) = .3 * (.35 * .9 + .15 * .1) = .099, f_2(b) = .7 * (.35 * .1 + .15 * .9) = .119
f_3(a) = .3 * (.099 * .9 + .119 * .1) = 0.0303, f_3(b) = .7 * (.099 * .1 + .119 * .9) = 0.0819

Hence P(O) = 0.0303 + 0.0819 = 0.1122
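
A short Python sketch of the forward recursion, with the model parameters hard-coded as reconstructed from the numbers on this slide; running it reproduces P(O) = 0.1122 for the observation sequence 0 1 1. This is only an illustration of the calculation, not code from the presentation.

```python
# Forward recursion for the worked example, observations 0, 1, 1.
# f[k][j] = P(O_1..O_{k+1}, H_{k+1} = j), with 0-based index k.
START = {"a": 0.5, "b": 0.5}
TRANS = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.1, "b": 0.9}}
EMIT  = {"a": {0: 0.7, 1: 0.3}, "b": {0: 0.3, 1: 0.7}}

def forward(obs):
    # Initialisation: f_1(j) = pi_j * e_j(O_1)
    f = [{j: START[j] * EMIT[j][obs[0]] for j in START}]
    # Recursion: f_{k+1}(j) = e_j(O_{k+1}) * sum_i f_k(i) * p_{i,j}
    for o in obs[1:]:
        prev = f[-1]
        f.append({j: EMIT[j][o] * sum(prev[i] * TRANS[i][j] for i in prev)
                  for j in START})
    return f

f = forward([0, 1, 1])
print(f[-1])                # {'a': 0.0303, 'b': 0.0819} up to round-off
print(sum(f[-1].values()))  # P(O) = 0.1122, as on the slide
```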

What is the most probable "hidden" configuration?

This algorithm is also called the Viterbi algorithm. Let H* = (H*_1, ..., H*_n) be the sequence of hidden states in the most probable hidden path, i.e. H* = ArgMax_H P(O, H). Let v_k(j) be the probability of the most probable path for the first k observations ending in hidden state j. Again recursions can be found:

v_1(j) = π_j e_j(O_1)
v_{k+1}(j) = e_j(O_{k+1}) max_i [v_k(i) p_{i,j}]

The actual sequence of hidden states can be found recursively by traceback: H*_n = ArgMax_j v_n(j), then H*_k = ArgMax_i [v_k(i) p_{i, H*_{k+1}}] for k = n-1, ..., 1.
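
A sketch of the Viterbi recursion and traceback for the same toy model; the names v (path probabilities) and ptr (back pointers) are illustrative, and ties between equally good states are broken arbitrarily by max().

```python
# Viterbi: v_k(j) = probability of the most probable path for the first k
# observations that ends in state j; back pointers give the path itself.
START = {"a": 0.5, "b": 0.5}
TRANS = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.1, "b": 0.9}}
EMIT  = {"a": {0: 0.7, 1: 0.3}, "b": {0: 0.3, 1: 0.7}}

def viterbi(obs):
    v = [{j: START[j] * EMIT[j][obs[0]] for j in START}]  # v_1(j)
    ptr = []
    for o in obs[1:]:
        prev, col, back = v[-1], {}, {}
        for j in START:
            # v_{k+1}(j) = e_j(O_{k+1}) * max_i v_k(i) * p_{i,j}
            best = max(prev, key=lambda i: prev[i] * TRANS[i][j])
            col[j] = EMIT[j][o] * prev[best] * TRANS[best][j]
            back[j] = best
        v.append(col)
        ptr.append(back)
    # Traceback from the best final state.
    state = max(v[-1], key=v[-1].get)
    path = [state]
    for back in reversed(ptr):
        state = back[state]
        path.append(state)
    return list(reversed(path)), max(v[-1].values())

print(viterbi([0, 1, 1]))   # most probable hidden path and its probability
```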

What is the probability of a specific "hidden" state?

Let b_k(j) = P(O_{k+1}, ..., O_n | H_k = j) be the probability of the observations from k+1 to n given H_k = j. These will also obey recursions:

b_n(j) = 1
b_k(j) = Σ_i p_{j,i} e_i(O_{k+1}) b_{k+1}(i)

The probability of the observations together with a specific hidden state can be found as P(O, H_k = j) = f_k(j) b_k(j), and the probability of a specific hidden state given the observations as P(H_k = j | O) = f_k(j) b_k(j) / P(O).
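
The backward recursion and the forward-backward posterior, again sketched for the toy model; forward() is repeated so the snippet stands alone, and the dictionary representation is an illustrative choice.

```python
# Backward recursion b_k(j) = P(O_{k+1}..O_n | H_k = j) and the posterior
# P(H_k = j | O) = f_k(j) * b_k(j) / P(O).
START = {"a": 0.5, "b": 0.5}
TRANS = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.1, "b": 0.9}}
EMIT  = {"a": {0: 0.7, 1: 0.3}, "b": {0: 0.3, 1: 0.7}}

def forward(obs):
    f = [{j: START[j] * EMIT[j][obs[0]] for j in START}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({j: EMIT[j][o] * sum(prev[i] * TRANS[i][j] for i in prev)
                  for j in START})
    return f

def backward(obs):
    b = [{j: 1.0 for j in START}]            # b_n(j) = 1
    for o in reversed(obs[1:]):
        nxt = b[0]
        # b_k(j) = sum_i p_{j,i} * e_i(O_{k+1}) * b_{k+1}(i)
        b.insert(0, {j: sum(TRANS[j][i] * EMIT[i][o] * nxt[i] for i in START)
                     for j in START})
    return b

obs = [0, 1, 1]
f, b = forward(obs), backward(obs)
p_data = sum(f[-1].values())                 # P(O) = 0.1122
posterior = [{j: f[k][j] * b[k][j] / p_data for j in START}
             for k in range(len(obs))]
print(posterior)                             # P(H_k = j | O) at each position k
```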

Example continued - best path, single hidden state

[Figures: for the same two-state model and observations 0 1 1 - the Viterbi trellis for the best path (e.g. the step .189 = max{.7 * .9 * .3, .3 * .1 * .3}), and the forward, backward and forward-backward computations for the probability of a single hidden state.]

Baum-Welch, Parameter Estimation or Training

Objective: estimate the transition and emission probabilities.

Set p_{i,j} and e_j(·) arbitrarily to non-zero values.
Use the forward-backward algorithm to re-evaluate p_{i,j} and e_j(·).
Repeat until there is no significant increase in the probability of the data.

To avoid zero probabilities, add pseudo-counts. Other numerical optimization algorithms can also be applied.
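
A rough, self-contained Python sketch of the Baum-Welch loop described above: forward-backward gives expected transition and emission counts, pseudo-counts keep probabilities away from zero, and the counts are normalised into new parameters. The initial guesses, the pseudo-count value and the training sequence are arbitrary illustrative values, not taken from the slides.

```python
# One Baum-Welch re-estimation step for a two-state, two-symbol HMM,
# iterated until the probability of the data stops improving noticeably.
STATES, SYMBOLS = ["a", "b"], [0, 1]

def forward(obs, start, trans, emit):
    f = [{j: start[j] * emit[j][obs[0]] for j in STATES}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({j: emit[j][o] * sum(prev[i] * trans[i][j] for i in STATES)
                  for j in STATES})
    return f

def backward(obs, trans, emit):
    b = [{j: 1.0 for j in STATES}]
    for o in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {j: sum(trans[j][i] * emit[i][o] * nxt[i] for i in STATES)
                     for j in STATES})
    return b

def baum_welch_step(obs, start, trans, emit, pseudo=0.1):
    f, b = forward(obs, start, trans, emit), backward(obs, trans, emit)
    p_data = sum(f[-1].values())
    # Expected counts, initialised with pseudo-counts to avoid zeros.
    t_count = {i: {j: pseudo for j in STATES} for i in STATES}
    e_count = {j: {s: pseudo for s in SYMBOLS} for j in STATES}
    for k in range(len(obs)):
        for j in STATES:
            # Expected number of times state j emits obs[k].
            e_count[j][obs[k]] += f[k][j] * b[k][j] / p_data
            if k + 1 < len(obs):
                for i in STATES:
                    # Expected number of i -> j transitions at position k.
                    t_count[i][j] += (f[k][i] * trans[i][j] *
                                      emit[j][obs[k + 1]] * b[k + 1][j] / p_data)
    trans = {i: {j: t_count[i][j] / sum(t_count[i].values()) for j in STATES}
             for i in STATES}
    emit = {j: {s: e_count[j][s] / sum(e_count[j].values()) for s in SYMBOLS}
            for j in STATES}
    return trans, emit, p_data

# Arbitrary non-zero starting guesses and a made-up training sequence.
start = {"a": 0.5, "b": 0.5}
trans = {"a": {"a": 0.6, "b": 0.4}, "b": {"a": 0.4, "b": 0.6}}
emit  = {"a": {0: 0.6, 1: 0.4}, "b": {0: 0.4, 1: 0.6}}
obs = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
for _ in range(30):
    trans, emit, p_data = baum_welch_step(obs, start, trans, emit)
print(p_data, trans, emit)
```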

Fast/Slowly Evolving States (Felsenstein & Churchill, 1996)

[Figure: an alignment (sequences by positions), with a hidden rate - slow (r_s) or fast (r_f) - assigned to each column k.]

HMM:
π_r - equilibrium distribution of the hidden states (rates) at the first position
p_{i,j} - transition probabilities between hidden states (rates)
L(j, r) - likelihood of the j'th column given rate r
F(j, r) - likelihood of the first j columns given that the j'th column has rate r

Likelihood initialisation: F(1, r) = π_r L(1, r)
Likelihood recursion: F(j+1, r) = L(j+1, r) Σ_q F(j, q) p_{q,r}
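
A sketch of the HMM layer of this model only: in the actual method the column likelihoods L(j, r) are computed by Felsenstein's pruning algorithm on the phylogenetic tree, so the column likelihoods, equilibrium and transition values below are made-up placeholders used purely to illustrate the recursion.

```python
# HMM over alignment columns with hidden rate classes (slow/fast).
# L[j][r] stands in for the tree likelihood of column j under rate r.
RATES = ["slow", "fast"]
EQUIL = {"slow": 0.7, "fast": 0.3}                # pi_r, assumed values
TRANS = {"slow": {"slow": 0.9, "fast": 0.1},      # p_{i,j}, assumed values
         "fast": {"slow": 0.2, "fast": 0.8}}

# Hypothetical per-column likelihoods L(j, r) for a 5-column alignment.
L = [{"slow": 0.02, "fast": 0.05},
     {"slow": 0.04, "fast": 0.01},
     {"slow": 0.03, "fast": 0.03},
     {"slow": 0.01, "fast": 0.06},
     {"slow": 0.05, "fast": 0.02}]

def alignment_likelihood(L):
    # Initialisation: F(1, r) = pi_r * L(1, r)
    F = {r: EQUIL[r] * L[0][r] for r in RATES}
    # Recursion over columns: F(j+1, r) = L(j+1, r) * sum_q F(j, q) * p_{q,r}
    for col in L[1:]:
        F = {r: col[r] * sum(F[q] * TRANS[q][r] for q in RATES) for r in RATES}
    return sum(F.values())   # likelihood summed over the rate of the last column

print(alignment_likelihood(L))
```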

Recombination HMMs

[Figure: sequence positions 1, 2, 3, ..., i-1, i, ..., L (the data), each with an associated hidden tree.]

Statistical Alignment (Steel and Hein, 2001; Holmes and Bruno, 2001)

An HMM generating alignments, with states # # (match), # - (delete), - # (insert) and End.

Emit functions: e(# #) = p(N1) f(N1, N2); e(# -) = p(N1); e(- #) = p(N2), where p(N1) is the equilibrium probability of nucleotide N1 and f(N1, N2) is the probability that N1 evolves into N2.

Transition probabilities (in terms of the birth rate λ, death rate μ and β):

from Start, # # or - #:  to - #: λβ;  to # #: (λ/μ)(1-λβ)e^(-μ);  to # -: (λ/μ)(1-λβ)(1-e^(-μ));  to End: (1-λ/μ)(1-λβ)
from # -:  to - #: λβ; ...

Probability of Data given a Pedigree

Elston-Stewart (1971) - temporal peeling algorithm: condition on parental states (mother, father); recombination and mutation are Markovian.

Lander-Green (1987) - genotype scanning algorithm: condition on paternal/maternal inheritance; recombination and mutation are Markovian.

Comment: there is an obvious parallel to the Wiuf-Hein (1999) reformulation of Hudson's (1983) algorithm.

Further Examples

Isochores (Churchill, 1989, 1992): a two-state HMM along the sequence with a GC-poor and a GC-rich state. Emission probabilities: L_p(C) = L_p(G) = 0.1, L_p(A) = L_p(T) = 0.4 for the poor state; L_r(C) = L_r(G) = 0.4, L_r(A) = L_r(T) = 0.1 for the rich state. The likelihood initialisations and recursions are those of the forward algorithm above.

Gene finding (Burge and Karlin, 1996): HMMs describing simple prokaryotic and simple eukaryotic gene structure.
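
A posterior-decoding sketch for the isochore model, reusing the forward-backward machinery with the emission probabilities quoted above. The slide gives no transition or start probabilities, so the values below are assumptions chosen only for illustration.

```python
# Two-state isochore HMM: GC-poor vs GC-rich, posterior decoded per position.
STATES = ["poor", "rich"]
START = {"poor": 0.5, "rich": 0.5}                # assumed
TRANS = {"poor": {"poor": 0.95, "rich": 0.05},    # assumed
         "rich": {"poor": 0.05, "rich": 0.95}}
EMIT = {"poor": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},   # L_p from the slide
        "rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}   # L_r from the slide

def posterior_rich(seq):
    """P(H_k = rich | sequence) at every position, via forward-backward."""
    f = [{j: START[j] * EMIT[j][seq[0]] for j in STATES}]
    for x in seq[1:]:
        prev = f[-1]
        f.append({j: EMIT[j][x] * sum(prev[i] * TRANS[i][j] for i in STATES)
                  for j in STATES})
    b = [{j: 1.0 for j in STATES}]
    for x in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {j: sum(TRANS[j][i] * EMIT[i][x] * nxt[i] for i in STATES)
                     for j in STATES})
    p = sum(f[-1].values())
    return [f[k]["rich"] * b[k]["rich"] / p for k in range(len(seq))]

print([round(x, 2) for x in posterior_rich("ATATAGCGCGCGATAT")])
```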

Further Examples

Secondary structure elements (Goldman, 1996): an HMM for SSEs with hidden states α (helix), β (strand) and L (loop). [Table: transition probabilities between the SSE states, with entries such as .909, .091, .881, .852.] Adding evolution gives SSE prediction.

Profile HMM alignment (Krogh et al., 1994).

Summary

[Figure: an HMM with observations O1-O10 and hidden states H1, H2, H3]

Definition
Three Key Algorithms: Summing over Unknown States, Most Probable Unknown States, Marginalizing Unknown States
Key Bioinformatic Applications: Pedigree Analysis, Isochores in Genomes (CG-rich regions), Profile HMM Alignment, Fast/Slowly Evolving States, Secondary Structure Elements in Proteins, Gene Finding, Statistical Alignment

Grammars: Finite Set of Rules for Generating Strings

i. A starting symbol, ordinary letters (terminals) and variables (non-terminals).
ii. A set of substitution rules applied to the variables in the present string; the derivation is finished when no variables remain.

Grammar classes: Regular, Context Free, Context Sensitive, General (also erasing).

Simple String Generators

Non-terminals/variables (capital letters) - terminals (lower-case letters).

i. Start with S. Rules: S --> aT | bS, T --> aS | bT | ε
One sentence (odd number of a's): S -> aT -> aaS -> aabS -> aabaT -> aaba

ii. S --> aSa | bSb | aa | bb
One sentence (even-length palindromes): S -> aSa -> abSba -> abaaba

Stochastic Grammars

The grammars above classify every string as either belonging to the language or not. Each variable has a finite set of substitution rules. Assigning probabilities to the use of each rule assigns probabilities to the strings in the language. If there is a 1-1 derivation (creation) of a string, the probability of the string can be obtained as the product of the probabilities of the applied rules.

i. Start with S. S --> (0.3) aT | (0.7) bS, T --> (0.2) aS | (0.4) bT | (0.2) ε
S -> aT -> aaS -> aabS -> aabaT -> aaba, with probability 0.3 * 0.2 * 0.7 * 0.3 * 0.2

ii. S --> (0.3) aSa | (0.5) bSb | (0.1) aa | (0.1) bb
S -> aSa -> abSba -> abaaba, with probability 0.3 * 0.5 * 0.1
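
A tiny Python sketch that multiplies rule probabilities along a given derivation, reproducing the first example above. The epsilon rule for T and its probability 0.2 are my reading of the multipliers shown in the slide's derivation.

```python
# Probability of a derivation in the stochastic regular grammar:
# S -> (0.3) aT | (0.7) bS,  T -> (0.2) aS | (0.4) bT | (0.2) epsilon.
RULES = {
    ("S", "aT"): 0.3, ("S", "bS"): 0.7,
    ("T", "aS"): 0.2, ("T", "bT"): 0.4, ("T", ""): 0.2,   # "" = epsilon
}

def derivation_probability(steps):
    """steps = list of (variable, replacement) rules applied, in order."""
    p = 1.0
    for rule in steps:
        p *= RULES[rule]
    return p

# The derivation on the slide: S -> aT -> aaS -> aabS -> aabaT -> aaba
steps = [("S", "aT"), ("T", "aS"), ("S", "bS"), ("S", "aT"), ("T", "")]
print(derivation_probability(steps))   # 0.3 * 0.2 * 0.7 * 0.3 * 0.2 = 0.00252
```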

Recommended Literature

Bafna, V. and Huson, D.H. (2000) The Conserved Exon Method for Gene Finding. ISMB 2000, pp. 3-12.
Batzoglou, S. et al. (2000) Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction. Genome Research 10, 950-58.
Blayo, Rouze & Sagot (2002) "Orphan Gene Finding - An Exon Assembly Approach". J. Comp. Biol.
Delcher, A.L. et al. (1998) Alignment of Whole Genomes. Nucleic Acids Research 27.11, 2369-76.
Graveley, B.R. (2001) Alternative Splicing: Increasing Diversity in the Proteomic World. TIGS 17.2, 100-.
Guigo, R. et al. (2000) An Assessment of Gene Prediction Accuracy in Large DNA Sequences. Genome Research 10, 1631-42.
Kan, Z. et al. (2001) Gene Structure Prediction and Alternative Splicing Using Genomically Aligned ESTs. Genome Research 11, 889-900.
Korf, I. et al. (2001) Integrating Genomic Homology into Gene Structure Prediction. Bioinformatics 17, Suppl. 1, 140-148.
Scharling, T. (2001) Gene Identification Using Sequence Comparison. Aarhus University.
Pedersen, J.S. (2001) Progress Report: Comparative Gene Finding. Aarhus University.
Reese, M.G. et al. (2000) Genome Annotation Assessment in Drosophila melanogaster. Genome Research 10, 483-501.
Stein, L. (2001) Genome Annotation: From Sequence to Biology. Nature Reviews Genetics 2, 493-.

Example continued - parameter optimisation