Hidden Markov Models in Bioinformatics

Definition
Three Key Algorithms
  Summing over Unknown States
  Most Probable Unknown States
  Marginalizing Unknown States
Key Bioinformatic Applications
  Pedigree Analysis
  Profile HMM Alignment
  Fast/Slowly Evolving States
  Statistical Alignment

Hidden Markov Models
(O1,H1), (O2,H2), ..., (On,Hn) is a sequence of stochastic variables with two components: one that is observed (Ok) and one that is hidden (Hk).
The hidden variables Hk form a homogeneous Markov chain with transition probabilities $p_{i,j} = P(H_{k+1}=j \mid H_k=i)$.
Let $\pi_i = P(H_1=i)$; often $\pi$ is the equilibrium distribution of the Markov chain.
Conditional on the Hk (all k), the Ok are independent.
The distribution of Ok depends only on the value of Hk and is called the emit function, $e(j,o) = P(O_k=o \mid H_k=j)$.

What is the probability of the data?
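The probability of the observed is $P(O) = \sum_{H} P(H)\,P(O \mid H)$, a sum over all hidden paths H = (H1,...,Hn), which could be hard to calculate because the number of paths grows exponentially with n. However, these calculations can be considerably accelerated. Let $F_k(j) = P(O_1,\dots,O_k,\,H_k=j)$ be the probability of the observations (O1,...,Ok) together with Hk=j. The following recursion will be obeyed:
$F_1(j) = \pi_j\, e(j,O_1)$
$F_k(j) = e(j,O_k) \sum_i F_{k-1}(i)\, p_{i,j}$
$P(O) = \sum_j F_n(j)$
Here $\pi_j = P(H_1=j)$, $p_{i,j}$ is the transition probability from hidden state i to j, and $e(j,O_k) = P(O_k \mid H_k=j)$ is the emit function.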

Example - probability of the data
Observables {0, 1} at times 1, 2, 3. Hidden states {a, b}.

Emission probabilities:
       0    1
  a   .7   .3
  b   .3   .7

Transition probabilities:
       a    b
  a   .9   .1
  b   .1   .9

Equilibrium distribution, π, of (a, b) is (.5, .5).

Example. Observation: 011.

Direct calculation (one term per hidden path):
P(aaa) = .5 * .9 * .9    P(011 | aaa) = .7 * .3 * .3
P(aab) = .5 * .9 * .1    P(011 | aab) = .7 * .3 * .7
...

Forward recursion:
time 1 (O = 0):  a: πa * P(0|a) = .5 * .7 = .35       b: πb * P(0|b) = .5 * .3 = .15
time 2 (O = 1):  a: .3(.35*.9 + .15*.1) = .099        b: .7(.35*.1 + .15*.9) = .119
time 3 (O = 1):  a: .3(.099*.9 + .119*.1) = .0303     b: .7(.099*.1 + .119*.9) = .0819
Hence P(O) up to the 3rd state is .0303 + .0819 = .1122
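The forward recursion of this example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the names `forward` and `brute_force` are made up for illustration, and the transition matrix (stay with .9, switch with .1) is the one read off the worked numbers. The brute-force sum over all 8 hidden paths is included only to show that both routes give the same probability.

```python
from itertools import product

# Parameters of the worked example: hidden states a, b; observables 0, 1.
STATES = ("a", "b")
INIT = {"a": 0.5, "b": 0.5}                    # equilibrium distribution
TRANS = {"a": {"a": 0.9, "b": 0.1},
         "b": {"a": 0.1, "b": 0.9}}            # TRANS[i][j] = p_{i,j}
EMIT = {"a": {0: 0.7, 1: 0.3},
        "b": {0: 0.3, 1: 0.7}}                 # EMIT[j][o] = P(o | j)


def forward(obs):
    """Return the forward tables F_k(j) = P(O_1..O_k, H_k = j), one dict per time."""
    table = [{j: INIT[j] * EMIT[j][obs[0]] for j in STATES}]
    for o in obs[1:]:
        prev = table[-1]
        table.append({j: EMIT[j][o] * sum(prev[i] * TRANS[i][j] for i in STATES)
                      for j in STATES})
    return table


def brute_force(obs):
    """Sum P(H) * P(O | H) over every hidden path (exponential in len(obs))."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = INIT[path[0]] * EMIT[path[0]][obs[0]]
        for k in range(1, len(obs)):
            p *= TRANS[path[k - 1]][path[k]] * EMIT[path[k]][obs[k]]
        total += p
    return total


obs = (0, 1, 1)
F = forward(obs)
p_forward = sum(F[-1].values())   # P(O) from the forward recursion
p_direct = brute_force(obs)       # P(O) from the direct calculation
print(F)
print(p_forward, p_direct)
```

The forward recursion needs a number of operations proportional to n times the square of the number of hidden states, whereas the direct sum visits every one of the (number of states)^n paths.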

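What is the most probable "hidden" configuration?
This algorithm is also called Viterbi. Let $H^* = \arg\max_H P(O, H)$ be the sequence of hidden states in the most probable hidden path. Let $V_k(j) = \max_{H_1,\dots,H_{k-1}} P(O_1,\dots,O_k,\,H_1,\dots,H_{k-1},\,H_k=j)$ be the probability of the most probable path up to k ending in hidden state j. Again recursions can be found:
$V_1(j) = \pi_j\, e(j,O_1)$
$V_k(j) = e(j,O_k)\, \max_i V_{k-1}(i)\, p_{i,j}$
The actual sequence of hidden states can be found recursively by
$H^*_n = \arg\max_j V_n(j)$
$H^*_{k-1} = \arg\max_i V_{k-1}(i)\, p_{i,H^*_k}$
Here $\pi_j = P(H_1=j)$, $p_{i,j}$ is the transition probability from i to j, and $e(j,O_k) = P(O_k \mid H_k=j)$ is the emit function.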
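A minimal Python sketch of the Viterbi recursion and its traceback, using the two-state example of this talk (observation 011, hidden states a and b). The function name `viterbi` and the dictionary layout are illustrative assumptions, and the transition matrix (stay with .9, switch with .1) is read off the worked numbers rather than stated explicitly in the slides.

```python
# Two-state example: hidden states a, b; observables 0, 1.
STATES = ("a", "b")
INIT = {"a": 0.5, "b": 0.5}
TRANS = {"a": {"a": 0.9, "b": 0.1},
         "b": {"a": 0.1, "b": 0.9}}            # TRANS[i][j] = p_{i,j}
EMIT = {"a": {0: 0.7, 1: 0.3},
        "b": {0: 0.3, 1: 0.7}}                 # EMIT[j][o] = P(o | j)


def viterbi(obs):
    """Return (most probable hidden path, joint probability of path and obs)."""
    V = [{j: INIT[j] * EMIT[j][obs[0]] for j in STATES}]
    back = []                                  # back[k][j] = best predecessor of j
    for o in obs[1:]:
        prev = V[-1]
        row, ptr = {}, {}
        for j in STATES:
            best_i = max(STATES, key=lambda i: prev[i] * TRANS[i][j])
            ptr[j] = best_i
            row[j] = EMIT[j][o] * prev[best_i] * TRANS[best_i][j]
        V.append(row)
        back.append(ptr)
    # Traceback: start from the best final state and follow the pointers.
    last = max(STATES, key=lambda j: V[-1][j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), V[-1][last]


best_path, best_prob = viterbi((0, 1, 1))
print(best_path, best_prob)
```

For long sequences the products underflow, so practical implementations usually work with sums of log probabilities instead; that change is left out here to keep the sketch close to the recursion as written.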

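What is the probability of a specific "hidden" state?
Let $B_k(j) = P(O_{k+1},\dots,O_n \mid H_k=j)$ be the probability of the observations from k+1 to n given Hk=j. These will also obey recursions:
$B_n(j) = 1$
$B_k(i) = \sum_j p_{i,j}\, e(j,O_{k+1})\, B_{k+1}(j)$
The probability of the observations and a specific hidden state can be found as:
$P(O,\,H_k=j) = F_k(j)\, B_k(j)$, where $F_k(j) = P(O_1,\dots,O_k,\,H_k=j)$ is the forward probability.
And the probability of a specific hidden state given the observations can be found as:
$P(H_k=j \mid O) = F_k(j)\, B_k(j) / P(O)$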
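The forward-backward calculation can be sketched in Python as follows, again on the two-state example with observation 011. The function names and data layout are illustrative assumptions, and the transition matrix (stay with .9, switch with .1) is read off the worked numbers of the example.

```python
# Two-state example: hidden states a, b; observables 0, 1.
STATES = ("a", "b")
INIT = {"a": 0.5, "b": 0.5}
TRANS = {"a": {"a": 0.9, "b": 0.1},
         "b": {"a": 0.1, "b": 0.9}}            # TRANS[i][j] = p_{i,j}
EMIT = {"a": {0: 0.7, 1: 0.3},
        "b": {0: 0.3, 1: 0.7}}                 # EMIT[j][o] = P(o | j)


def forward(obs):
    """F[k][j] = P(O_1..O_k, H_k = j) (k is 0-based in the list)."""
    F = [{j: INIT[j] * EMIT[j][obs[0]] for j in STATES}]
    for o in obs[1:]:
        prev = F[-1]
        F.append({j: EMIT[j][o] * sum(prev[i] * TRANS[i][j] for i in STATES)
                  for j in STATES})
    return F


def backward(obs):
    """B[k][i] = P(O_{k+1}..O_n | H_k = i), with B at the last position equal to 1."""
    B = [{j: 1.0 for j in STATES}]
    for o in reversed(obs[1:]):                # o is the observation one step ahead
        nxt = B[0]
        B.insert(0, {i: sum(TRANS[i][j] * EMIT[j][o] * nxt[j] for j in STATES)
                     for i in STATES})
    return B


def posteriors(obs):
    """P(H_k = j | O) for every position k, plus P(O)."""
    F, B = forward(obs), backward(obs)
    p_obs = sum(F[-1].values())
    post = [{j: F[k][j] * B[k][j] / p_obs for j in STATES}
            for k in range(len(obs))]
    return post, p_obs


obs = (0, 1, 1)
post, p_obs = posteriors(obs)
for k, row in enumerate(post, start=1):
    print(k, row)
```

A useful sanity check is that the products F_k(j) B_k(j), summed over j, give the same P(O) at every position k, so each row of posteriors sums to 1.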

Example continued - best path, single hidden state
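Best path for the observation 011 (the common factor πa = πb = .5 is left out, since it does not change which path is the most probable):
time 1 (O = 0):  a: .7       b: .3
time 2 (O = 1):  a: .189 = Max{.7 * .9 * .3, .3 * .1 * .3}       b: .189 = Max{.7 * .1 * .7, .3 * .9 * .7}
time 3 (O = 1):  a: .051     b: .1191
The most probable path ends in b; tracing back gives bbb.

Single hidden state:
[Figures: forward, backward and forward-backward calculations for the same observations and hidden states, with πa = .5 and πb = .5; the numerical values are not preserved in the transcript.]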

Baum-Welch, Parameter Estimation or Training
Objective: Evaluate Transition and Emission Probabilities
i. Set pij and e( ) arbitrarily to non-zero values.
ii. Use forward-backward to re-evaluate pij and e( ).
iii. Do this until there is no significant increase in the probability of the data.
To avoid zero probabilities, add pseudo-counts.
Other numerical optimization algorithms can be applied.
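The training loop above can be sketched in Python for a two-state, two-symbol HMM and a single observation sequence. Everything here is an illustrative assumption rather than the talk's own implementation: the function names, the starting values, the observation sequence (made up for the demonstration) and the choice to keep the initial distribution fixed. No rescaling is done, so the sketch is only suitable for short sequences.

```python
STATES = ("a", "b")
SYMBOLS = (0, 1)


def forward(obs, init, trans, emit):
    F = [{j: init[j] * emit[j][obs[0]] for j in STATES}]
    for o in obs[1:]:
        prev = F[-1]
        F.append({j: emit[j][o] * sum(prev[i] * trans[i][j] for i in STATES)
                  for j in STATES})
    return F


def backward(obs, trans, emit):
    B = [{j: 1.0 for j in STATES}]
    for o in reversed(obs[1:]):
        nxt = B[0]
        B.insert(0, {i: sum(trans[i][j] * emit[j][o] * nxt[j] for j in STATES)
                     for i in STATES})
    return B


def baum_welch_step(obs, init, trans, emit, pseudo=0.0):
    """One re-estimation step; returns (new_trans, new_emit, P(obs | old params))."""
    F, B = forward(obs, init, trans, emit), backward(obs, trans, emit)
    p_obs = sum(F[-1].values())
    A = {i: {j: pseudo for j in STATES} for i in STATES}   # expected transitions
    E = {j: {s: pseudo for s in SYMBOLS} for j in STATES}  # expected emissions
    for k, o in enumerate(obs):
        for i in STATES:
            E[i][o] += F[k][i] * B[k][i] / p_obs
            if k + 1 < len(obs):
                for j in STATES:
                    A[i][j] += (F[k][i] * trans[i][j] * emit[j][obs[k + 1]]
                                * B[k + 1][j] / p_obs)
    new_trans = {i: {j: A[i][j] / sum(A[i].values()) for j in STATES}
                 for i in STATES}
    new_emit = {j: {s: E[j][s] / sum(E[j].values()) for s in SYMBOLS}
                for j in STATES}
    return new_trans, new_emit, p_obs


def train(obs, init, trans, emit, max_iter=20, tol=1e-9, pseudo=0.0):
    """Iterate until the relative gain in P(obs) falls below tol."""
    history = []
    for _ in range(max_iter):
        trans, emit, p_obs = baum_welch_step(obs, init, trans, emit, pseudo)
        converged = bool(history) and p_obs - history[-1] < tol * history[-1]
        history.append(p_obs)
        if converged:
            break
    return trans, emit, history


# Hypothetical data and arbitrary non-zero starting values.
obs = (0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1)
init = {"a": 0.5, "b": 0.5}
trans0 = {"a": {"a": 0.6, "b": 0.4}, "b": {"a": 0.3, "b": 0.7}}
emit0 = {"a": {0: 0.6, 1: 0.4}, "b": {0: 0.4, 1: 0.6}}
trans, emit, history = train(obs, init, trans0, emit0)
print(history[0], history[-1])
```

With `pseudo=0.0` each step is a plain expectation-maximisation update, so the recorded probabilities of the data should never decrease; a positive `pseudo` trades that property for protection against zero probabilities.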

Fast/Slowly Evolving States Felsenstein & Churchill, 1996
[Figure: alignment with sequences as rows and positions as columns; position k evolves at either a slow rate rs or a fast rate rf.]
HMM:
$\pi_r$ - equilibrium distribution of hidden states (rates) at first position
$p_{i,j}$ - transition probabilities between hidden states
$L^{(j,r)}$ - likelihood for j'th column given rate r
$L_{(j,r)}$ - likelihood for first j columns, with the j'th column having rate r
Likelihood Recursions: $L_{(j,r)} = L^{(j,r)} \sum_{r'} L_{(j-1,r')}\, p_{r',r}$
Likelihood Initialisations: $L_{(1,r)} = \pi_r\, L^{(1,r)}$

Recombination HMMs
[Figure: data at positions 1, 2, 3, ..., i-1, i, ..., L, with the trees relating the sequences at those positions.]

Statistical Alignment Steel and Hein,2001 + Holmes and Bruno,2001
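Emit functions:
e(##) = π(N1) f(N1,N2)
e(#-) = π(N1), e(-#) = π(N2)
π(N1) - equilibrium prob. of N1
f(N1,N2) - prob. that N1 evolves into N2

An HMM Generating Alignments
States: ##, #-, -# and E (end). Transition probabilities:
  to -#:  $\lambda\beta$
  to ##:  $\frac{\lambda}{\mu}(1-\lambda\beta)\,e^{-\mu}$
  to #-:  $\frac{\lambda}{\mu}(1-\lambda\beta)(1-e^{-\mu})$
  to E:   $(1-\frac{\lambda}{\mu})(1-\lambda\beta)$
These four values are repeated in three rows of the transition table; a further row is cut off in the transcript after its first entry, $\lambda\beta$.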

Probability of Data given a pedigree.
Elston-Stewart (1971) - Temporal Peeling Algorithm:
  Condition on parental states.
  Recombination and mutation are Markovian.
Lander-Green (1987) - Genotype Scanning Algorithm:
  Condition on paternal/maternal inheritance.
  Recombination and mutation are Markovian.
[Figures: pedigrees with Mother and Father for each algorithm.]
Comment: Obvious parallel to the Wiuf-Hein (1999) reformulation of Hudson's 1983 algorithm.

Further Examples: Isochores and Gene Finding
Isochore: Churchill, 1989, 1992
HMM with two hidden states, CG-poor (p) and CG-rich (r), and emission probabilities
  Lp(C)=Lp(G)=0.1, Lp(A)=Lp(T)=0.4
  Lr(C)=Lr(G)=0.4, Lr(A)=Lr(T)=0.1
Likelihood Recursions and Likelihood Initialisations: [equations not preserved in the transcript]
Gene Finding: Burge and Karlin, 1996
[Figures: Simple Eukaryotic and Simple Prokaryotic gene models.]

Further Examples: Secondary Structure Elements and Profile HMM Alignment
Secondary Structure Elements: Goldman, 1996
HMM for SSEs, with hidden states α, β and L. Transition probabilities:
        α      β      L
  α   .909  .0005   .091
  β   .005   .881   .184
  L   .062   .086   .852
A fourth row, .325 .212 .462, follows in the transcript without a label.
Adding Evolution; SSE Prediction: [figures not preserved in the transcript]
Profile HMM Alignment: Krogh et al., 1994

Summary
Definition
Three Key Algorithms
  Summing over Unknown States
  Most Probable Unknown States
  Marginalizing Unknown States
Key Bioinformatic Applications
  Pedigree Analysis
  Isochores in Genomes (CG-rich regions)
  Profile HMM Alignment
  Fast/Slowly Evolving States
  Secondary Structure Elements in Proteins
  Gene Finding
  Statistical Alignment

Grammars: Finite Set of Rules for Generating Strings
i. A starting symbol, a set of ordinary letters, and a set of variables.
ii. A set of substitution rules applied to variables in the present string. A string is finished when it contains no variables.
Classes of grammars: Regular, Context Free, Context Sensitive, General (also erasing).

Simple String Generators
Terminals (small letters), Non-Terminals (capital letters).
i. Start with S.
   S --> aT | bS
   T --> aS | bT | ε
   One sentence (odd # of a's): S -> aT -> aaS -> aabS -> aabaT -> aaba
ii. S --> aSa | bSb | aa | bb
   One sentence (even length palindromes): S -> aSa -> abSba -> abaaba
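The two derivations can be replayed mechanically. The sketch below is illustrative only: the helper `derive` and the dictionary encoding of the rules are assumptions, and the empty right-hand side for T (written as the empty string) is inferred from the last step of the derivation of aaba.

```python
# Grammar i (regular) and grammar ii (context free), as right-hand-side lists.
RULES_I = {"S": ["aT", "bS"], "T": ["aS", "bT", ""]}   # "" is the empty string
RULES_II = {"S": ["aSa", "bSb", "aa", "bb"]}


def derive(start, rules, choices):
    """Apply the chosen right-hand sides, in order, to the leftmost variable.

    Returns the list of intermediate strings, starting with `start`.
    """
    string = start
    steps = [string]
    for rhs in choices:
        # Position of the leftmost variable (any symbol that has rules).
        pos = next(i for i, ch in enumerate(string) if ch in rules)
        if rhs not in rules[string[pos]]:
            raise ValueError(f"{string[pos]} --> {rhs!r} is not a rule")
        string = string[:pos] + rhs + string[pos + 1:]
        steps.append(string)
    return steps


steps_i = derive("S", RULES_I, ["aT", "aS", "bS", "aT", ""])
steps_ii = derive("S", RULES_II, ["aSa", "bSb", "aa"])
print(" -> ".join(steps_i))
print(" -> ".join(steps_ii))
```

A string is finished when no variable is left, which is why the last step of the first derivation uses the empty right-hand side to remove T.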

Stochastic Grammars
The grammars above classify every string as belonging to the language or not. Every variable has a finite set of substitution rules. Assigning probabilities to the use of each rule will assign probabilities to the strings in the language. If a string has a unique derivation (creation), the probability of the string can be obtained as the product of the probabilities of the applied rules.
i. Start with S.
   S --> (0.3)aT | (0.7)bS
   T --> (0.2)aS | (0.4)bT | (0.2)ε
   As transcribed, the three probabilities for T sum to 0.8 rather than 1.
   S -> aT -> aaS -> aabS -> aabaT -> aaba
   Probability: 0.3 * 0.2 * 0.7 * 0.3 * 0.2 = 0.00252
ii. S --> (0.3)aSa | (0.5)bSb | (0.1)aa | (0.1)bb
   S -> aSa -> abSba -> abaaba
   Probability: 0.3 * 0.5 * 0.1 = 0.015
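The product rule for derivation probabilities can be written out in a few lines of Python. This is a sketch under stated assumptions: the helper name `derivation_probability` and the dictionary encoding are made up for illustration, the rule probabilities are taken as transcribed, and the empty right-hand side of T is written as the empty string.

```python
# Rule probabilities as transcribed from the slide; "" is the empty string.
RULES_I = {"S": {"aT": 0.3, "bS": 0.7},
           "T": {"aS": 0.2, "bT": 0.4, "": 0.2}}
RULES_II = {"S": {"aSa": 0.3, "bSb": 0.5, "aa": 0.1, "bb": 0.1}}


def derivation_probability(start, rules, choices):
    """Apply the chosen rules to the leftmost variable and multiply their probabilities.

    Returns (final string, probability of this derivation).
    """
    string, prob = start, 1.0
    for rhs in choices:
        # Position of the leftmost variable (any symbol that has rules).
        pos = next(i for i, ch in enumerate(string) if ch in rules)
        prob *= rules[string[pos]][rhs]        # KeyError if the rule does not exist
        string = string[:pos] + rhs + string[pos + 1:]
    return string, prob


string_i, p_i = derivation_probability("S", RULES_I, ["aT", "aS", "bS", "aT", ""])
string_ii, p_ii = derivation_probability("S", RULES_II, ["aSa", "bSb", "aa"])
print(string_i, p_i)
print(string_ii, p_ii)
```

Both grammars give each string at most one derivation, so the derivation probability is also the probability of the string; for an ambiguous grammar the probabilities of all derivations of the string would have to be added.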

Recommended Literature
Bafna, V. and Huson, D.H. (2000) The Conserved Exon Method for Gene Finding. ISMB pp. 3-12
Batzoglou, S. et al. (2000) Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction. Genome Research
Blayo, Rouze and Sagot (2002) Orphan Gene Finding - An exon assembly approach. J.Comp.Biol.
Delcher, A.L. et al. (1998) Alignment of Whole Genomes. Nuc.Ac.Res.
Graveley, B.R. (2001) Alternative Splicing: increasing diversity in the proteomic world. TIGS
Guigo, R. et al. (2000) An Assessment of Gene Prediction Accuracy in Large DNA Sequences. Genome Research
Kan, Z. et al. (2001) Gene Structure Prediction and Alternative Splicing Using Genomically Aligned ESTs. Genome Research
Korf, I. et al. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics vol. 17, Suppl. 1
Scharling, T. (2001) Gene-identification using sequence comparison. Aarhus University
Pedersen, J.S. (2001) Progress Report: Comparative Gene Finding. Aarhus University
Reese, M.G. et al. (2000) Genome Annotation Assessment in Drosophila melanogaster. Genome Research
Stein, L. (2001) Genome Annotation: From Sequence to Biology. Nature Reviews Genetics

