Pair Hidden Markov Model

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

Hidden Markov Model in Biological Sequence Analysis – Part 2
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Ab initio gene prediction Genome 559, Winter 2011.
Measuring the degree of similarity: PAM and blosum Matrix
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Lecture 8 Alignment of pairs of sequence Local and global alignment
A Hidden Markov Model for Progressive Multiple Alignment Ari Löytynoja and Michel C. Milinkovitch Appeared in BioInformatics, Vol 19, no.12, 2003 Presented.
Lecture 6, Thursday April 17, 2003
Hidden Markov Models Pairwise Alignments. Hidden Markov Models Finite state automata with multiple states as a convenient description of complex dynamic.
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Find Gene Structures in DNA Intergene State First Exon State Intron State.
Hidden Markov Models Sasha Tkachev and Ed Anderson Presenter: Sasha Tkachev.
Heuristic alignment algorithms and cost matrices
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
. Sequence Alignment via HMM Background Readings: chapters 3.4, 3.5, 4, in the Durbin et al.
Sequence Alignment.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Comparative ab initio prediction of gene structures using pair HMMs
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
Introduction to Bioinformatics Algorithms Sequence Alignment.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Deepak Verghese CS 6890 Gene Finding With A Hidden Markov model Of Genomic Structure and Evolution. Jakob Skou Pedersen and Jotun Hein.
Variants of HMMs. Higher-order HMMs How do we model “memory” larger than one time point? P(  i+1 = l |  i = k)a kl P(  i+1 = l |  i = k,  i -1 =
Sequence alignment, E-value & Extreme value distribution
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Similarity-Based Approaches.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Sequence Alignment.
Introduction to Profile Hidden Markov Models
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Protein Sequence Alignment and Database Searching.
Hidden Markov Models for Sequence Analysis 4
. EM with Many Random Variables Another Example of EM Sequence Alignment via HMM Lecture # 10 This class has been edited from Nir Friedman’s lecture. Changes.
BINF6201/8201 Hidden Markov Models for Sequence Analysis
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Chapter 3 Computational Molecular Biology Michael Smith
. Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10 This class has been edited from Nir Friedman’s lecture. Changes made.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
Expected accuracy sequence alignment Usman Roshan.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Hidden Markov Models BMI/CS 576
Gil McVean Department of Statistics, Oxford
Sequence Alignment.
Hidden Markov Models - Training
Ab initio gene prediction
Pairwise sequence Alignment.
Intro to Alignment Algorithms: Global and Local
Hidden Markov Models (HMMs)
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Variants of HMMs.
Pairwise Alignment Global & local alignment
CSE 5290: Algorithms for Bioinformatics Fall 2009
Hidden Markov Model Lecture #6
Sequence alignment, E-value & Extreme value distribution
Pairwise Sequence Alignment (II)
Presentation transcript:

Pair Hidden Markov Model

Three kinds of pair HMMs (PHMMs) PHMM for pairwise sequence alignment BSA Chapter 4 PHMM for the analysis (e.g. gene prediction) on two aligned sequences (i.e. the pre-calculated pairwise alignments) Twinscan PHMM for simultaneously pairwise alignment and analysis SLAM

Pairwise sequence alignment Given two sequences over an alphabet (4 nucleotides or 20 amino acids): ATGTTAT and ATCGTAC A T - G T T A T A T C G T - A C By inserting ‘-’s and shifting two sequences, they can be aligned into a table of two rows with the same length:

Scoring a pairwise alignment Mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded with +1, the resulting score is: #matches – μ(#mismatches) – σ (#indels) A T - G T T A T A T C G T - A C 5- μ -2σ

Scoring Matrix: Example K 5 -2 -1 - 7 3 6 Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids will not greatly change function of protein. AKRANR KAAANK -1 + (-1) + (-2) + 5 + 7 + 3 = 11

Scoring matrices Amino acid substitution matrices PAM BLOSUM DNA substitution matrices DNA is less conserved than protein sequences Less effective to compare coding regions at nucleotide level

Affine Gap Penalties ATA__GC ATATTGC ATAG_GC AT_GTGC In nature, a series of k indels often come as a single event rather than a series of k single nucleotide events: ATA__GC ATATTGC ATAG_GC AT_GTGC This is more likely. This is less likely. Normal scoring would give the same score for both alignments

Accounting for Gaps Gaps- contiguous sequence of spaces in one of the rows Score for a gap of length x is: -(ρ + σx) where ρ >0 is the penalty for introducing a gap: gap opening penalty ρ will be large relative to σ: gap extension penalty because you do not want to add too much of a penalty for extending the gap.

Affine Gap Penalties Gap penalties: -ρ-σ when there is 1 indel -ρ-2σ when there are 2 indels -ρ-3σ when there are 3 indels, etc. -ρ- x·σ (-gap opening - x gap extensions) Somehow reduced penalties (as compared to naive scoring) are given to runs of horizontal and vertical edges

Alignment: a path in the Alignment Graph 1 2 3 4 5 6 7 G A T C w v 0 1 2 2 3 4 5 6 7 A T - G T T A T A T C G T - A C 0 1 2 3 4 5 5 6 7 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,7) - Corresponding path -

Alignment as a Path in the Edit Graph 1 2 3 4 5 6 7 G A T C w v Old Alignment 012234567 x= AT_GTTAT y= ATCGT_AC 012345567 New Alignment 012234567 x= AT_GTTAT y= ATCG_TAC 012344567

Representing sequence alignment using pair HMM HMM for sequence alignment, which incorporates affine gap scores. “Hidden” States Match (M) Insertion in x (X) insertion in y (Y) Observation Symbols Match (M): {(a,b)| a,b in ∑ }. Insertion in x (X): {(a,-)| a in ∑ }. Insertion in y (Y): {(-,a)| a in ∑ }.

Representing sequence alignment using pair HMM  Emission probabilities: M: Pxi,yj M: qxi Y: qyj 1- X  1-2 M   Y 1-

Alignment: a path  a hidden state sequence 1 2 3 4 5 6 7 G A T C w v A T - G T T A T A T C G T - A C M M Y M M X M M

Sequence alignment using pair HMM Based on the HMM, each alignment of two DNA/protein sequences can be assigned with a probability score; Each “observation symbol” of the HMM is an aligned pair of two letters, or of a letter and a gap. The Markov chain of hidden states should represent a scoring scheme reflecting an evolutionary model. Transition and emission probabilities define the probability of each aligned pair of sequences. Given two input sequences, we look for an alignment of these two sequences of maximum probability.

Transitions and Emission Probabilities ε 1- ε δ 1-2δ M X Y Transitions probabilities (note the forbidden ones). δ = probability for 1st gap ε = probability for extending gap. Emission Probabilities Match: (a,b) with pab – only from M states Insertion in x: (a,-) with qa – only from X state Insertion in y: (-,a).with qa - only from Y state.

Scoring alignments For each pair of sequences x (of length m) and y (of length n), there are many alignments of x and y, each corresponds to a different state sequence (with the length between max{m,n} and m+n). Given the transmission and emission probabilities, each alignment has a defined score – the product of the corresponding probabilities. An alignment is “most probable”, if it maximizes this score.

Finding the most probable alignment Let vM(i,j) be the probability of the most probable alignment of x(1..i) and y(1..j), which ends with a match (state M). Similarly, vX(i,j) and vY(i,j), the probabilities of the most probable alignment of x(1..i) and y(1..j), which ends with states X or Y, respectively.

Most probable alignment Similar argument for vX(i,j) and vY(i,j), the probabilities of the most probable alignment of x(1..i) and y(1..j), which ends with an insertion to x or y, are:

Adding termination probabilities Different alignments of x and y may have different lengths. To get a coherent probabilistic model we need to define a probability distribution over sequences of different lengths. For this, an END state is added, with transition probability τ from any other state to END. This assumes expected sequence length of 1/ τ. M X Y END 1-2δ -τ δ τ 1-ε -τ ε 1 The last transition in each alignment is to the END state, with probability τ

Representing sequence alignment using pair HMM    1- X 1--2  1-2 Begin End M   Y 1-

The log-odds scoring function We wish to know if the alignment score is above or below the score of random alignment of sequences with the same length. Model comparison We need to model random sequence alignment by HMM, with end state. This model assigns probability to each pair of sequences x and y of arbitrary lengths m and n.

HMM for a random sequence alignment The transition probabilities for the random model, with termination probability η: (x is the start state) X Y END 1- η η 1 The emission probability for a is qa. Thus the probability of x (of length n) and y (of length m) being random is: And the corresponding score is:

HMM for random sequence alignment

Markov Chains for “Random” and “Model” X Y END 1-2δ -τ δ τ 1-ε -τ ε 1 “Random” X Y END 1- η η 1 “Model”

Combining models in the log-odds scoring function In order to compare the M score to the R score of sequences x and y, we can find an optimal M score, and then subtract from it the R score. This is insufficient when we look for local alignments, where the optimal substrings in the alignment are not known in advance. A better way: Define a log-odds scoring function which keeps track of the difference Match-Random scores of the partial strings during the alignment. At the end add to the score (logτ – 2logη) to compensate for the end transitions in both models.

The log-odds scoring function (assuming that letters at insertions/deletions are selected by the random model) And at the end add to the score (logτ – 2logη).

A Pair HMM For Local Alignment

Full Probability Of The Two Sequences HMMs allow for calculating the probability that a given pair of sequences are related according to the HMM by any alignment This is achieved by summing over all alignments

Full Probability Of The Two Sequences The way to calculate the sum is by using the forward algorithm fk(i,j) : the combined probability of all alignments up to (i,j) that end in state k

Forward Algorithm For Pair HMMs P(x,y)

Full Probability Of The Two Sequences P(x,y) gives the likelihood that x and y are related by some unspecified alignment, as opposed to being unrelated If there is an unambiguous best alignment, P(x,y) will be “dominated” by the single hidden state seuence corresponding to that alignment

How correct is the alignment Define a posterior distribution P(s|x,y) over all alignments given a pair of sequences x and y Probability that the optimal scoring alignment is correct: Viterbi algorithm Forward algorithm

Usually the probability that the optimal scoring alignment is correct, is extremely small! Reason: there are many small variants of the best alignment that have nearly the same score.

The Posterior Probability That Two Residues Are Aligned If the probability of any single complete path being entirely correct is small, can we say something about the local accuracy of an alignment? It is useful to be able to give a reliability measure for each part of an alignment

The posterior probability that two residues are aligned The idea is: calculate the probability of all the alignments that pass through a specified matched pair of residues (xi,yj) Compare this value with the full probability of all alignments of the pair of sequences If the ratio is close to 1, then the match is highly reliable If the ratio is close to 0, then the match is unreliable

The posterior probability that two residues are aligned Notation: xiyj denotes that xi is aligned to yj We are interested in P(xiyj|x,y) We have P(x,y) is computed using the forward algorithm P(x,y,xiyj): the first term in computed by the forward algorithm, and the second is computed by the backward algorithm (=bM(i,j) in the backward algorithm)

Backward Algorithm For Pair HMMs

Pair HMM for gene finding (Twinscan) Twinscan is an augmented version of the GHMM used in Genscan.

Genscan Model Genscan considers the following: Promoter signals Polyadenylation signals Splice signals Probability of coding and non-coding DNA Gene, exon and intron length Chris Burge and Samuel Karlin, Prediction of Complete Gene Structures in Human Genomic DNA, JMB. (1997) 268, 78-94

Twinscan Algorithm Align the two sequences (eg. from human and mouse); The similar hidden states as Genscan; New “alphabet” for observation symbols: 4 x 3 = 12 symbols:  = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| } Mark each base as gap ( - ), mismatch ( : ), match ( | )

Twinscan Algorithm Run Viterbi using emissions ek(b), where b  { A-, A:, A|, …, T| } Note: Emission distributions ek(b) estimated from the alignment of real gene pairs from human/mouse eI(x|) < eE(x|): matches favored in exons eI(x-) > eE(x-): gaps (and mismatches) favored in introns

Example Human: ACGGCGACUGUGCACGU Mouse: ACUGUGAC GUGCACUU Align: ||:|:|||-||||||:| Input to Twinscan HMM: A| C| G: G| C: G| A| C| U- G| U| G| C| A| C| G: U| Recall, eE(A|) > eI(A|) eE(A-) < eI(A-) Likely exon

HMMs for simultaneous alignment and gene finding (SLAM) Exon1 Exon2 Exon3 Intron1 Intron2 [human] 5’ 3’ CNS [mouse] Exon = coding Intron = non-coding CNS = conserved non-coding

Generalized Pair HMMs This is the state space, or half the state space, for SLAM. As you can see it looks very similar to the single organism case. In fact it is identical. But the output of the states are now pairs of sequences, and moreover an _aligned_ pair of sequences, rather than just one sequence. There are a number of complications with this approach.

Generalized Pair HMMs (SLAM) Therefor we had to adjust or model a bit. I told you before that the observed sequence identity in a study of 1196 human and mouse genes was 36%. We found that in fact most of the intron and intergenic regions were even lower in identity, with stretches of highly similar, non-coding sequence, conserved non-coding sequence, or CNS. Thus we changed our intron and intergene model from being a PHMM, to model the two sequences completely independent, one at a time, interspersed by highly conserved stretches, CNSs, modelled with a regular PHMM. Thus our model now consists of a PHMM on the protein level in the coding regions, and a PHMM on the DNA level in the CNSs. We found this to bring down a whole lot of overprediction, and our program is now a powerful tool to distinguish between conserved coding and conserved non-coding. Something that is a completely novel feature in existing gene finding tools.

Gapped alignment This is an example of a fairly simply limitation of the search space, also know as a banded Smith-Waterman alignment. Since most sequence combination in the search space will have zero probability, and the true solution is likely to be found along the diagonal, one can limit the search space for each base pair in the first sequence to a subinterval, or a window of length h, in the second. This makes the problem grow linearly in the length of one of the input sequences.

Measuring Performance