Variants of HMMs.

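Higher-order HMMs
How do we model "memory" longer than one time point?
First order: $P(\pi_{i+1} = l \mid \pi_i = k) = a_{kl}$
Second order: $P(\pi_{i+1} = l \mid \pi_i = k, \pi_{i-1} = j) = a_{jkl}$
…
A second-order HMM with K states is equivalent to a first-order HMM with $K^2$ states.
Example with states H and T: in the second-order model, the transition H → T has probability $a_{HT}(\text{prev} = H)$ or $a_{HT}(\text{prev} = T)$, and the transition T → H has probability $a_{TH}(\text{prev} = H)$ or $a_{TH}(\text{prev} = T)$. The equivalent first-order model has states HH, HT, TH, TT, with labelled transitions $a_{HHT}$, $a_{HTH}$, $a_{HTT}$, $a_{THH}$, $a_{THT}$, $a_{TTH}$.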
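The equivalence can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `second_to_first_order` and the coin probabilities are hypothetical. Each first-order state is a pair (previous, current), and a pair (j, k) may only move to a pair that starts with k.

```python
from itertools import product


def second_to_first_order(states, a2):
    """Expand second-order transitions into first-order ones over pair states.

    a2[(j, k, l)] = P(next = l | current = k, previous = j).
    Returns the K^2 pair states and a dict a1[((j, k), (k, l))].
    """
    pair_states = list(product(states, repeat=2))
    a1 = {}
    for (j, k) in pair_states:
        for l in states:
            # Pair state (j, k) can only be followed by a pair beginning with k.
            a1[((j, k), (k, l))] = a2[(j, k, l)]
    return pair_states, a1


# Hypothetical second-order coin; the numbers are made up for illustration.
states = ["H", "T"]
a2 = {}
for j, k in product(states, repeat=2):
    p_heads = 0.7 if j == k else 0.4
    a2[(j, k, "H")] = p_heads
    a2[(j, k, "T")] = 1.0 - p_heads

pair_states, a1 = second_to_first_order(states, a2)
```

With K = 2 this yields the four states HH, HT, TH, TT, and every pair state has exactly K outgoing transitions.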

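Modeling the Duration of States
Two states X and Y: X has self-loop probability p and moves to Y with probability 1 − p; Y has self-loop probability q and moves to X with probability 1 − q.
Length distribution of region X: geometric, with mean $E[l_X] = 1/(1-p)$.
Being restricted to a geometric length distribution is a significant disadvantage of HMMs.
Several solutions exist for modeling different length distributions.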
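As a quick numerical check of the mean 1/(1 − p), the following sketch sums the geometric length distribution directly. The function names and the truncation point `max_len` are illustrative choices, not something taken from the slides.

```python
def geometric_length_pmf(p, m):
    """P(l_X = m): stay in X for m - 1 self-loops, then leave."""
    return p ** (m - 1) * (1 - p)


def mean_length(p, max_len=10_000):
    """Approximate E[l_X] by truncating the infinite sum at max_len."""
    return sum(m * geometric_length_pmf(p, m) for m in range(1, max_len + 1))


# For p = 0.9 the sum comes out very close to 1 / (1 - 0.9) = 10.
approx_mean = mean_length(0.9)
```

Note that the most likely length is always 1, whatever the value of p, which is the inflexibility the slide refers to.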

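Sol'n 1: Chain several states
Chain several copies of X in a row; the last copy has self-loop probability p and moves to Y with probability 1 − p, and Y has self-loop probability q.
$l_X = C + \text{geometric with mean } 1/(1-p)$
Disadvantage: still very inflexible.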

Sol'n 2: Negative binomial distribution
Chain n copies of X, each with self-loop probability p and probability 1 − p of moving to the next state; the last copy moves on to Y.
Duration in X: m turns, where during the first m − 1 turns exactly n − 1 arrows to the next state are followed, and during the m-th turn an arrow to the next state is followed.
$P(l_X = m) = \binom{m-1}{n-1} (1-p)^{(n-1)+1} p^{(m-1)-(n-1)} = \binom{m-1}{n-1} (1-p)^n p^{m-n}$
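The duration formula can be checked numerically. The sketch below is illustrative only; `duration_pmf` is a hypothetical name, and the values n = 3 and p = 0.8 are arbitrary.

```python
from math import comb


def duration_pmf(m, n, p):
    """P(l_X = m) for a chain of n copies of X, each with self-loop probability p."""
    if m < n:
        return 0.0  # at least one turn is spent in each of the n copies
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)


n, p = 3, 0.8
lengths = range(1, 5001)
total = sum(duration_pmf(m, n, p) for m in lengths)     # close to 1
mean = sum(m * duration_pmf(m, n, p) for m in lengths)  # close to n / (1 - p)
```

With n = 1 the formula reduces to the geometric distribution of a single state; larger n moves the mode away from length 1.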

Example: genes in prokaryotes
EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A).
Negative binomial with n = 3.

Solution 3: Duration modeling
Upon entering a state:
1. Choose duration d, according to a probability distribution.
2. Generate d letters according to the emission probabilities.
3. Take a transition to the next state according to the transition probabilities.
Disadvantage: increase in complexity.
Time: O(D^2) (why?)
Space: O(D)
where D = maximum duration of state X.
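The three generative steps can be sketched as a small sampler. Everything here is an assumption made for illustration: the function name, the dictionary-based parameter layout, and the toy distributions are not from the slides.

```python
import random


def sample_duration_hmm(start, durations, emissions, transitions, n_segments, rng):
    """Sample a state path and letters from an explicit-duration HMM.

    durations[s], emissions[s], transitions[s] map outcomes to probabilities.
    """
    state, path, letters = start, [], []
    for _ in range(n_segments):
        # Step 1: choose the duration d for this visit to the state.
        d_values, d_weights = zip(*durations[state].items())
        d = rng.choices(d_values, weights=d_weights)[0]
        # Step 2: generate d letters from the state's emission distribution.
        e_values, e_weights = zip(*emissions[state].items())
        letters.extend(rng.choices(e_values, weights=e_weights, k=d))
        path.extend([state] * d)
        # Step 3: move to the next state.
        t_values, t_weights = zip(*transitions[state].items())
        state = rng.choices(t_values, weights=t_weights)[0]
    return "".join(path), "".join(letters)


# Hypothetical toy parameters.
durations = {"X": {2: 0.2, 3: 0.5, 4: 0.3}, "Y": {1: 0.5, 2: 0.5}}
emissions = {"X": {"A": 0.5, "C": 0.5}, "Y": {"G": 0.5, "T": 0.5}}
transitions = {"X": {"Y": 1.0}, "Y": {"X": 1.0}}
path, letters = sample_duration_hmm("X", durations, emissions, transitions, 4, random.Random(0))
```

Because the duration is drawn explicitly, any length distribution over 1…D can be used, at the cost of the extra decoding work noted above.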

Connection Between Alignment and HMMs

A state model for alignment
States: M (+1, +1), I (+1, 0), J (0, +1).
Alignments correspond 1-to-1 with sequences of states M, I, J.
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
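The 1-to-1 correspondence can be made concrete with the slide's own example. In this sketch the state labels are read off the example as printed (I where the top row has a gap, J where the bottom row has a gap); the function names are hypothetical.

```python
TOP = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
BOTTOM = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
STATES = "IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII"


def states_from_alignment(top, bottom):
    """Label each alignment column with M, I or J."""
    labels = []
    for a, b in zip(top, bottom):
        if a == "-":
            labels.append("I")  # gap in the top row
        elif b == "-":
            labels.append("J")  # gap in the bottom row
        else:
            labels.append("M")  # both rows have a letter
    return "".join(labels)


def alignment_from_states(top_seq, bottom_seq, states):
    """Rebuild the two gapped rows from ungapped sequences and a state path."""
    top, bottom, i, j = [], [], 0, 0
    for s in states:
        if s == "M":
            top.append(top_seq[i]); bottom.append(bottom_seq[j]); i += 1; j += 1
        elif s == "I":
            top.append("-"); bottom.append(bottom_seq[j]); j += 1
        else:  # "J"
            top.append(top_seq[i]); bottom.append("-"); i += 1
    return "".join(top), "".join(bottom)
```

Going from the alignment to the state path and back recovers the original rows, which is what the 1-to-1 claim amounts to.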

Let's score the transitions
States: M (+1, +1), I (+1, 0), J (0, +1).
Every transition into M (from M, I, or J) scores $s(x_i, y_j)$.
Transitions M → I and M → J score −d.
Self-transitions I → I and J → J score −e.
Alignments correspond 1-to-1 with sequences of states M, I, J.
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

How do we find the optimal alignment according to this model?
Dynamic programming:
M(i, j): optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in M
I(i, j): optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in I
J(i, j): optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in J
The score is additive, therefore we can apply DP recurrence formulas.

Needleman-Wunsch with affine gaps: state version
Here d and e are the positive gap-open and gap-extension penalties, matching the −d and −e transition scores.
Initialization:
$M(0, 0) = 0$; $M(i, 0) = M(0, j) = -\infty$ for $i, j > 0$
$I(i, 0) = -d - (i - 1)e$; $J(0, j) = -d - (j - 1)e$
Iteration:
$M(i, j) = s(x_i, y_j) + \max\{ M(i-1, j-1),\ I(i-1, j-1),\ J(i-1, j-1) \}$
$I(i, j) = \max\{ I(i-1, j) - e,\ M(i-1, j) - d \}$
$J(i, j) = \max\{ J(i, j-1) - e,\ M(i, j-1) - d \}$
Termination:
The optimal alignment score is given by $\max\{ M(m, n),\ I(m, n),\ J(m, n) \}$.
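A minimal sketch of the three-state recurrence follows. It assumes d and e are positive penalties (so transitions score −d and −e), that a gap of length g costs d + (g − 1)e, and that I consumes a letter of x while J consumes a letter of y. The function name and the toy scoring function are hypothetical, and only the optimal score is returned, not the traceback.

```python
NEG_INF = float("-inf")


def affine_nw_score(x, y, s, d, e):
    """Optimal global alignment score with affine gaps (three-state DP)."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        I[i][0] = -d - (i - 1) * e   # x_1..x_i aligned to gaps only
    for j in range(1, n + 1):
        J[0][j] = -d - (j - 1) * e   # y_1..y_j aligned to gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best_prev = max(M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1])
            M[i][j] = s(x[i - 1], y[j - 1]) + best_prev
            I[i][j] = max(M[i - 1][j] - d, I[i - 1][j] - e)  # open or extend
            J[i][j] = max(M[i][j - 1] - d, J[i][j - 1] - e)  # open or extend
    return max(M[m][n], I[m][n], J[m][n])


def toy_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


score = affine_nw_score("ACGT", "AGT", toy_score, d=3.0, e=1.0)
```

For "ACGT" against "AGT" the best alignment matches A, G, T and opens one gap, giving 3 − 3 = 0 under these toy parameters.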

Probabilistic interpretation of an alignment
An alignment is a hypothesis that the two sequences are related by evolution.
Goals:
Produce the most likely alignment.
Assess the likelihood that the sequences are indeed related.

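A Pair HMM for alignments
States: BEGIN, M, I, J, END.
Emissions: M emits the pair $(x_i, y_j)$ with probability $P(x_i, y_j)$; I emits $x_i$ with probability $P(x_i)$; J emits $y_j$ with probability $P(y_j)$.
Transitions from M: to M with probability $1 - 2\delta - \tau$, to I with probability $\delta$, to J with probability $\delta$, to END with probability $\tau$.
Transitions from I: to I with probability $\epsilon$, to M with probability $1 - \epsilon - \tau$, to END with probability $\tau$.
Transitions from J: to J with probability $\epsilon$, to M with probability $1 - \epsilon - \tau$, to END with probability $\tau$.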

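A Pair HMM for unaligned sequences
Model R: two independent chains, BEGIN → I → END and BEGIN → J → END. I emits $x_i$ with probability $P(x_i)$ and J emits $y_j$ with probability $P(y_j)$. Each of I and J has self-loop probability $1 - \eta$ and moves to END with probability $\eta$.
$P(x, y \mid R) = \eta (1 - \eta)^m P(x_1) \cdots P(x_m) \cdot \eta (1 - \eta)^n P(y_1) \cdots P(y_n) = \eta^2 (1 - \eta)^{m+n} \prod_i P(x_i) \prod_j P(y_j)$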
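The random-model probability is easiest to evaluate in log space. The sketch below is illustrative: `log_prob_random`, the uniform background `q`, and the value of `eta` are assumptions, not values from the slides.

```python
from math import log


def log_prob_random(x, y, q, eta):
    """log P(x, y | R) = 2 log(eta) + (m + n) log(1 - eta) + sum of log q over all letters."""
    total = 2 * log(eta) + (len(x) + len(y)) * log(1 - eta)
    total += sum(log(q[c]) for c in x)
    total += sum(log(q[c]) for c in y)
    return total


# Hypothetical uniform DNA background and termination probability.
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
log_p = log_prob_random("AC", "G", q, eta=0.1)
```

Working with logarithms avoids underflow for long sequences, where the product of per-letter probabilities becomes extremely small.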

To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
$(1 - 2\delta - \tau)\, P(x_i, y_j)$ when matched
$\epsilon\, P(x_i) P(y_j)$ when gapped
$(1 - \eta)^2\, P(x_i) P(y_j)$ in the random model
Focus on the comparison of $P(x_i, y_j)$ vs. $P(x_i) P(y_j)$.

To compare ALIGNMENT vs. RANDOM hypothesis
Idea: we will divide the alignment score by the random score, and take logarithms.
Every letter b in the random model contributes $(1 - \eta) P(b)$.
Let
$s(x_i, y_j) = \log \frac{P(x_i, y_j)}{P(x_i) P(y_j)} + \log \frac{1 - 2\delta - \tau}{(1 - \eta)^2}$
$d = -\log \frac{\delta (1 - \epsilon - \tau) P(x_i)}{(1 - \eta)(1 - 2\delta - \tau) P(x_i)}$
$e = -\log \frac{\epsilon P(x_i)}{(1 - \eta) P(x_i)}$

The meaning of alignment scores
Because $\delta$ and $\epsilon$ are small, and $\tau$ and $\eta$ are very small,
$s(x_i, y_j) = \log \frac{P(x_i, y_j)}{P(x_i) P(y_j)} + \log \frac{1 - 2\delta - \tau}{(1 - \eta)^2} \approx \log \frac{P(x_i, y_j)}{P(x_i) P(y_j)} + \log(1 - 2\delta)$
$d = -\log \frac{\delta (1 - \epsilon - \tau)}{(1 - \eta)(1 - 2\delta - \tau)} \approx -\log \frac{\delta (1 - \epsilon)}{1 - 2\delta}$
$e = -\log \frac{\epsilon}{1 - \eta} \approx -\log \epsilon$
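To see how close the approximations are, the exact and approximate expressions can be compared numerically. This is a sketch under assumed parameter values; the function names are hypothetical, and the Greek letters are spelled out as `delta`, `epsilon`, `tau`, `eta`.

```python
from math import log


def exact_scores(p_ab, q_a, q_b, delta, epsilon, tau, eta):
    """Match score s, gap-open penalty d, gap-extension penalty e (exact forms)."""
    s = log(p_ab / (q_a * q_b)) + log((1 - 2 * delta - tau) / (1 - eta) ** 2)
    d = -log(delta * (1 - epsilon - tau) / ((1 - eta) * (1 - 2 * delta - tau)))
    e = -log(epsilon / (1 - eta))
    return s, d, e


def approx_scores(p_ab, q_a, q_b, delta, epsilon):
    """The same quantities with tau and eta treated as negligible."""
    s = log(p_ab / (q_a * q_b)) + log(1 - 2 * delta)
    d = -log(delta * (1 - epsilon) / (1 - 2 * delta))
    e = -log(epsilon)
    return s, d, e


# Hypothetical parameter values chosen only to illustrate the comparison.
exact = exact_scores(1 / 8, 1 / 4, 1 / 4, delta=0.05, epsilon=0.3, tau=1e-6, eta=1e-6)
approx = approx_scores(1 / 8, 1 / 4, 1 / 4, delta=0.05, epsilon=0.3)
```

With these values the two sets of scores agree to several decimal places, and the gap-open penalty d comes out larger than the extension penalty e because delta is smaller than epsilon.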

The meaning of alignment scores
The Viterbi algorithm for Pair HMMs corresponds exactly to the Needleman-Wunsch algorithm with affine gaps.
However, now we need to score alignments with parameters that add up to probability distributions:
$1 - \epsilon$ ~ 1/mean length of next gap
$\delta$ ~ 1/mean arrival time of next gap
Affine gaps decouple arrival time from length.
$\tau$ ~ 1/mean length of aligned sequences (set to ~0)
$\eta$ ~ 1/mean length of unaligned sequences (set to ~0)

The meaning of alignment scores
Match/mismatch scores:
$s(a, b) \approx \log \frac{P(a, b)}{P(a) P(b)}$ (let's ignore $\log(1 - 2\delta)$ for the moment and assume no gaps)
Example: say DNA regions between human and mouse have an average conservation of 50%. Then
P(A,A) = P(C,C) = P(G,G) = P(T,T) = 1/8 (4 matches, so they sum to ½)
P(A,C) = P(A,G) = … = P(T,G) = 1/24 (12 mismatches, so they sum to ½)
Say P(A) = P(C) = P(G) = P(T) = ¼. Then, with logarithms to base 2,
$s(a, b) = \log_2 [ (1/8) / (1/4 \cdot 1/4) ] = \log_2 2 = 1$ for a match
$s(a, b) = \log_2 [ (1/24) / (1/4 \cdot 1/4) ] = \log_2 (16/24) \approx -0.585$ for a mismatch
Cutoff similarity that scores 0: $s \cdot 1 - (1 - s) \cdot 0.585 = 0$, which gives $s = 0.585 / 1.585 \approx 0.369$.
According to this model, a sequence with about 37.5% conservation and no gaps would score on average $0.375 \cdot 1 - 0.625 \cdot 0.585 \approx 0$.
Why? 37.5% lies between the 50% conservation model and the random 25% conservation model.
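The arithmetic of this example can be reproduced in a few lines. The sketch assumes base-2 logarithms, which is what makes the match score equal to 1; the function names are illustrative.

```python
from math import log2


def log_odds(p_joint, p_a, p_b):
    """s(a, b) = log2( P(a, b) / (P(a) P(b)) )."""
    return log2(p_joint / (p_a * p_b))


def zero_score_cutoff(match, mismatch):
    """Fraction of matches c at which c * match + (1 - c) * mismatch = 0."""
    return -mismatch / (match - mismatch)


match = log_odds(1 / 8, 1 / 4, 1 / 4)       # 1.0
mismatch = log_odds(1 / 24, 1 / 4, 1 / 4)   # about -0.585
cutoff = zero_score_cutoff(match, mismatch)  # about 0.369
```

The cutoff falls between 25% (random) and 50% (the conservation assumed by the model), as the slide argues.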

Substitution matrices
A more meaningful way to assign match/mismatch scores.
For protein sequences, different substitutions have dramatically different frequencies!
PAM matrices:
Start from a curated set of very similar protein sequences.
Construct ancestral sequences (using parsimony).
Calculate $A_{ab}$: the frequency of letters a and b interchanging.
Calculate $B_{ab} = P(b \mid a) = A_{ab} / \sum_c A_{ac}$.
Adjust matrix B so that $\sum_{a,b} q_a q_b B_{ab} = 0.01$; the result is PAM(1).
Let $\text{PAM}(N) = [\text{PAM}(1)]^N$. A common choice is PAM(250).
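The last step, PAM(N) = [PAM(1)]^N, is ordinary matrix exponentiation. The sketch below uses a made-up two-letter alphabet rather than the real 20 × 20 amino-acid matrix, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = matrix
    for _ in range(power - 1):
        result = mat_mul(result, matrix)
    return result


# Hypothetical "PAM(1)" over a two-letter alphabet; each row is P(b | a).
pam1 = [[0.99, 0.01],
        [0.02, 0.98]]
pam250 = mat_pow(pam1, 250)
```

Each row of the result is still a probability distribution, and after many steps the rows approach the background frequencies of the toy chain (here 2/3 and 1/3).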

Substitution Matrices
BLOSUM matrices:
Start from the BLOCKS database (curated, gap-free alignments).
Cluster sequences according to > X% identity.
Calculate $A_{ab}$: the number of aligned a-b pairs in distinct clusters, correcting by 1/(mn), where m and n are the two cluster sizes.
Estimate $P(a) = (\sum_b A_{ab}) / (\sum_{c \le d} A_{cd})$ and $P(a, b) = A_{ab} / (\sum_{c \le d} A_{cd})$.

BLOSUM matrices
[Figure: the BLOSUM 50 and BLOSUM 62 matrices. The two are scaled differently.]