Variants of HMMs. Higher-order HMMs How do we model “memory” larger than one time point? P(  i+1 = l |  i = k)a kl P(  i+1 = l |  i = k,  i -1 =

Slides:



Advertisements
Similar presentations
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Advertisements

Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Hidden Markov Models Fundamentals and applications to bioinformatics.
Hidden Markov Models 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Hidden Markov Models 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Lecture 6, Thursday April 17, 2003
Hidden Markov Models. Two learning scenarios 1.Estimation when the “right answer” is known Examples: GIVEN:a genomic region x = x 1 …x 1,000,000 where.
Hidden Markov Models. Decoding GIVEN x = x 1 x 2 ……x N We want to find  =  1, ……,  N, such that P[ x,  ] is maximized  * = argmax  P[ x,  ] We.
S. Maarschalkerweerd & A. Tjhang1 Probability Theory and Basic Alignment of String Sequences Chapter
Connection Between Alignment and HMMs. A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII.
Hidden Markov Models Pairwise Alignments. Hidden Markov Models Finite state automata with multiple states as a convenient description of complex dynamic.
Hidden Markov Models. Two learning scenarios 1.Estimation when the “right answer” is known Examples: GIVEN:a genomic region x = x 1 …x 1,000,000 where.
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
Hidden Markov Models Lecture 5, Tuesday April 15, 2003.
Hidden Markov Models 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Hidden Markov Models 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Hidden Markov Models 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Expected accuracy sequence alignment
Hidden Markov Models Lecture 5, Tuesday April 15, 2003.
Hidden Markov Models 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk.
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Hidden Markov Models—Variants Conditional Random Fields 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Hidden Markov Models 1 2 K … x1 x2 x3 xK.
Hidden Markov Models 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Elze de Groot1 Parameter estimation for HMMs, Baum-Welch algorithm, Model topology, Numerical stability Chapter
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC.
Similar Sequence Similar Function Charles Yan Spring 2006.
Hidden Markov Models.
Hidden Markov Models 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
Class 3: Estimating Scoring Rules for Sequence Alignment.
CISC667, F05, Lec16, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (III) Probabilistic methods.
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Sequence Alignment - III Chitta Baral. Scoring Model When comparing sequences –Looking for evidence that they have diverged from a common ancestor by.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Hidden Markov Models for Sequence Analysis 4
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
What if a new genome comes? We just sequenced the porcupine genome We know CpG islands play the same role in this genome However, we have no known CpG.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Hidden Markov Models Yves Moreau Katholieke Universiteit Leuven.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
PGM 2003/04 Tirgul 2 Hidden Markov Models. Introduction Hidden Markov Models (HMM) are one of the most common form of probabilistic graphical models,
Expected accuracy sequence alignment Usman Roshan.
Hidden Markov Models 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
Algorithms in Computational Biology11Department of Mathematics & Computer Science Algorithms in Computational Biology Markov Chains and Hidden Markov Model.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Sequence Similarity.
MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment (cont.)
Variants of HMMs.
Alignment IV BLOSUM Matrices
Presentation transcript:

Variants of HMMs

Higher-order HMMs How do we model “memory” larger than one time point? P(  i+1 = l |  i = k)a kl P(  i+1 = l |  i = k,  i -1 = j)a jkl … A second order HMM with K states is equivalent to a first order HMM with K 2 states state Hstate T a HT (prev = H) a HT (prev = T) a TH (prev = H) a TH (prev = T) state HHstate HT state THstate TT a HHT a TTH a HTT a THH a THT a HTH

Modeling the Duration of States Length distribution of region X: E[l X ] = 1/(1-p) Geometric distribution, with mean 1/(1-p) This is a significant disadvantage of HMMs Several solutions exist for modeling different length distributions XY 1-p 1-q pq

Sol’n 1: Chain several states XY 1-p 1-q p q X X Disadvantage: Still very inflexible l X = C + geometric with mean 1/(1-p)

Sol’n 2: Negative binomial distribution Duration in X: m turns, where  During first m – 1 turns, exactly n – 1 arrows to next state are followed  During m th turn, an arrow to next state is followed m – 1 P(l X = m) = n – 1 (1 – p) n-1+1 p (m-1)-(n-1) = n – 1 (1 – p) n p m-n X p X X p 1 – p p …… Y 1 – p

Example: genes in prokaryotes EasyGene: Prokaryotic gene-finder Larsen TS, Krogh A Negative binomial with n = 3

Solution 3:Duration modeling Upon entering a state: 1.Choose duration d, according to probability distribution 2.Generate d letters according to emission probs 3.Take a transition to next state according to transition probs Disadvantage: Increase in complexity: Time: O(D 2 ) -- Why? Space: O(D) where D = maximum duration of state X

Connection Between Alignment and HMMs

A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII M (+1,+1) I (+1, 0) J (0, +1) Alignments correspond 1-to-1 with sequences of states M, I, J

Let’s score the transitions -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII M (+1,+1) I (+1, 0) J (0, +1) Alignments correspond 1-to-1 with sequences of states M, I, J s(x i, y j ) -d -e

How do we find optimal alignment according to this model? Dynamic Programming: M(i, j):Optimal alignment of x 1 …x i to y 1 …y j ending in M I(i, j): Optimal alignment of x 1 …x i to y 1 …y j ending in I J(i, j): Optimal alignment of x 1 …x i to y 1 …y j ending in J The score is additive, therefore we can apply DP recurrence formulas

Needleman Wunsch with affine gaps – state version Initialization: M(0,0) = 0; M(i,0) = M(0,j) = - , for i, j > 0 I(i,0) = d + i  e;J(0,j) = d + j  e Iteration: M(i – 1, j – 1) M(i, j) = s(x i, y j ) + max I(i – 1, j – 1) J(i – 1, j – 1) e + I(i – 1, j) I(i, j) = maxe + J(i, j – 1) d + M(i – 1, j – 1) e + I(i – 1, j) J(i, j) = maxe + J(i, j – 1) d + M(i – 1, j – 1) Termination: Optimal alignment given by max { M(m, n), I(m, n), J(m, n) }

Probabilistic interpretation of an alignment An alignment is a hypothesis that the two sequences are related by evolution Goal: Produce the most likely alignment Assert the likelihood that the sequences are indeed related

A Pair HMM for alignments M P(x i, y j ) I P(x i ) J P(y j ) 1 – 2  –  1 – 2  –        BEGIN END M J I    1 – 2  – 

A Pair HMM for unaligned sequences BEGIN I P(x i ) END BEGIN J P(y j ) END 1 -    P(x, y | R) =  (1 –  ) m P(x 1 )…P(x m )  (1 –  ) n P(y 1 )…P(y n ) =  2 (1 –  ) m+n  i P(x i )  j P(y j ) Model R

To compare ALIGNMENT vs. RANDOM hypothesis Every pair of letters contributes: (1 – 2  –  ) P(x i, y j ) when matched  P(x i ) P(y j ) when gapped (1 –  ) 2 P(x i ) P(y j ) in random model Focus on comparison of P(x i, y j ) vs. P(x i ) P(y j ) BEGIN I P(x i ) END BEGIN J P(y j ) END 1 -    M P(x i, y j ) I P(x i ) J P(y j ) 1 – 2  –  1 – 2  –      

To compare ALIGNMENT vs. RANDOM hypothesis Idea: We will divide alignment score by the random score, and take logarithms Let P(x i, y j ) (1 – 2  –  ) s(x i, y j ) = log ––––––––––– + log ––––––––––– P(x i ) P(y j ) (1 –  ) 2  (1 – 2  –  ) P(x i ) d = – log –––––––––––––––––––– (1 –  ) (1 – 2  –  ) P(x i )  P(x i ) e = – log ––––––––––– (1 –  ) P(x i ) Every letter b in random model contributes (1 –  ) P(b)

The meaning of alignment scores Because , , are small, and ,  are very small, P(x i, y j ) (1 – 2  –  ) P(x i, y j ) s(x i, y j ) = log ––––––––– + log ––––––––––  log –––––––– + log(1 – 2  ) P(x i ) P(y j ) (1 –  ) 2 P(x i ) P(y j )  (1 –  –  ) 1 –  d = – log ––––––––––––––––––  – log  –––––– (1 –  ) (1 – 2  –  ) 1 – 2   e = – log –––––––  – log  (1 –  )

The meaning of alignment scores The Viterbi algorithm for Pair HMMs corresponds exactly to the Needleman-Wunsch algorithm with affine gaps However, now we need to score alignment with parameters that add up to probability distributions   1/mean length of next gap   1/mean arrival time of next gap  affine gaps decouple arrival time with length   1/mean length of aligned sequences(set to ~0)   1/mean length of unaligned sequences(set to ~0)

The meaning of alignment scores Match/mismatch scores: P(x i, y j ) s(a, b)  log ––––––––––– (let’s ignore log(1 – 2  ) for the moment – assume no gaps) P(x i ) P(y j ) Example: Say DNA regions between human and mouse have average conservation of 50% Then P(A,A) = P(C,C) = P(G,G) = P(T,T) = 1/8 (so they sum to ½) P(A,C) = P(A,G) =……= P(T,G) = 1/24 (24 mismatches, sum to ½) Say P(A) = P(C) = P(G) = P(T) = ¼ log [ (1/8) / (1/4 * 1/4) ] = log 2 = 1, for match Then, s(a, b) = log [ (1/24) / (1/4 * 1/4) ] = log 16/24 = Cutoff similarity that scores 0: s*1 – (1 – s)*0.585 = 0 According to this model, a 37.5%-conserved sequence with no gaps would score on average * 1 – * = 0 Why? 37.5% is between the 50% conservation model, and the random 25% conservation model !

Substitution matrices A more meaningful way to assign match/mismatch scores For protein sequences, different substitutions have dramatically different frequencies! PAM Matrices: 1.Start from a curated set of very similar protein sequences 2.Construct ancestral sequences (using parsimony) 3.Calculate A ab : frequency of letters a and b interchanging 4.Calculate B ab = P(b|a) = A ab /(  c≤d A cd ) 5.Adjust matrix B so that  a,b q a q b B ab = 0.01 PAM(1) 6.Let PAM(N) = [PAM(1)] N-- Common PAM(250)

Substitution Matrices BLOSUM matrices: 1.Start from BLOCKS database (curated, gap-free alignments) 2.Cluster sequences according to > X% identity 3.Calculate A ab : # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes 4.Estimate P(a) = (  b A ab )/(  c≤d A cd ); P(a, b) = A ab /(  c≤d A cd )

BLOSUM matrices BLOSUM 50 BLOSUM 62 (The two are scaled differently)