Expected accuracy sequence alignment Usman Roshan.



Expected accuracy alignment
The dynamic programming formulation finds the optimal alignment under a given scoring matrix and gap penalties. That alignment is not necessarily the most “accurate” or biologically informative one. We now look at a different formulation of alignment that computes the alignment with the highest expected accuracy instead of the one with the highest score.

Posterior probability of $x_i$ aligned to $y_j$
Let $A$ be the set of all alignments of sequences $x$ and $y$, and define $P(a \mid x, y)$ to be the probability that alignment $a$ (of $x$ and $y$) is the true alignment $a^*$. We define the posterior probability of the $i$th residue of $x$ ($x_i$) aligning to the $j$th residue of $y$ ($y_j$) in the true alignment $a^*$ of $x$ and $y$ as
$$P(x_i \sim y_j \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$
(Do et al., Genome Research, 2005)

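Expected accuracy of alignment
We can define the expected accuracy of an alignment $a$ as
$$E[\mathrm{accuracy}(a, a^*) \mid x, y] = \frac{1}{\min(|x|, |y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \mid x, y)$$
The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm, with the posterior probabilities as match scores and no gap penalties (Do et al., Genome Research, 2005).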
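The dynamic program for the maximum expected accuracy alignment can be sketched as follows. This is a minimal illustration, not the PROBCONS or Probalign source: the function name `mea_alignment` and the input layout (a list of lists `post` with `post[i][j]` estimating the posterior of the (i+1)th residue of x pairing with the (j+1)th residue of y) are assumptions made for the example.

```python
def mea_alignment(post):
    """Maximum expected accuracy alignment from a posterior matrix.

    post[i][j] is an (assumed given) estimate of P(x_{i+1} ~ y_{j+1} | x, y).
    Returns (pairs, expected_accuracy), where pairs are 1-based (i, j) indices.
    """
    m, n = len(post), len(post[0])
    # A[i][j] = best sum of posteriors for aligning x_1..x_i with y_1..y_j.
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = max(A[i - 1][j - 1] + post[i - 1][j - 1],  # pair x_i ~ y_j
                          A[i - 1][j],                            # x_i unpaired
                          A[i][j - 1])                            # y_j unpaired
    # Traceback: recover one optimal set of aligned pairs.
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + post[i - 1][j - 1]:
            pairs.append((i, j))
            i, j = i - 1, j - 1
        elif A[i][j] == A[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs, A[m][n] / min(m, n)


# Hypothetical posterior matrix for |x| = 2, |y| = 3.
example_post = [[0.9, 0.05, 0.0],
                [0.1, 0.8, 0.1]]
example_pairs, example_acc = mea_alignment(example_post)
```

Gaps carry no penalty here because the objective only rewards pairs that are likely to be in the true alignment.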

Example for expected accuracy
True alignment:
AC_CG
ACCCA
Expected accuracy = (…)/4 = 1
Estimated alignment:
ACC_G
ACCCA
Expected accuracy = (…) ≈ 0.75

Estimating posterior probabilities
If the correct posterior probabilities can be computed, then we can compute the correct alignment. It remains to estimate these probabilities from the data:
– PROBCONS (Do et al., Genome Research, 2005): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al., 1998)
– Probalign (Roshan and Livesay, Bioinformatics, 2006): estimate probabilities using partition function dynamic programming matrices

Posterior probabilities from HMM
We need to sum the probabilities of all alignments where $x_i$ is aligned to $y_j$. In other words we want:
$$P(x_i \sim y_j \mid x, y) = \sum_{a \in A :\, x_i \sim y_j \in a} P(a \mid x, y)$$

Forward and backward probabilities
Define the forward probability $f_k(i)$ as the joint probability of emitting $x_1 x_2 \ldots x_i$ and being in hidden state $k$ at position $i$. Similarly, define the backward probability $b_k(i)$ as the probability of emitting $x_{i+1} x_{i+2} \ldots x_n$ given that the $i$th hidden state is $k$. Both $f_k(i)$ and $b_k(i)$ can be computed quickly by dynamic programming (see HMM lecture notes, pages 9 to 11).

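Once forward and backward are computed we can calculate the posterior probability that the $i$th hidden state is $k$:
$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}$$
For a pair HMM with match state $M$, the same identity gives $P(x_i \sim y_j \mid x, y) = f_M(i, j)\, b_M(i, j) / P(x, y)$.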
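To make the forward-backward calculation concrete, here is a minimal sketch for an ordinary (single-sequence) HMM without an explicit end state. The two-state model and all of its parameter values are hypothetical and chosen only for illustration. A pair HMM follows the same pattern with two sequence indices.

```python
def forward_backward(obs, states, start, trans, emit):
    """Return forward table f, backward table b, P(x), and state posteriors.

    f[i][k] = P(x_1..x_{i+1}, state at position i+1 is k)   (0-based i)
    b[i][k] = P(x_{i+2}..x_n | state at position i+1 is k)
    """
    n = len(obs)
    f = [{} for _ in range(n)]
    b = [{} for _ in range(n)]
    for k in states:
        f[0][k] = start[k] * emit[k][obs[0]]
    for i in range(1, n):
        for k in states:
            f[i][k] = emit[k][obs[i]] * sum(f[i - 1][l] * trans[l][k] for l in states)
    for k in states:
        b[n - 1][k] = 1.0  # nothing left to emit after the last position
    for i in range(n - 2, -1, -1):
        for k in states:
            b[i][k] = sum(trans[k][l] * emit[l][obs[i + 1]] * b[i + 1][l] for l in states)
    px = sum(f[n - 1][k] for k in states)
    post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
    return f, b, px, post


# Hypothetical two-state model over DNA (values are illustrative only).
hmm_states = ["H", "L"]
hmm_start = {"H": 0.5, "L": 0.5}
hmm_trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
hmm_emit = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
            "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
hmm_f, hmm_b, hmm_px, hmm_post = forward_backward("GGCA", hmm_states, hmm_start,
                                                  hmm_trans, hmm_emit)
```

A useful sanity check is that the products of forward and backward values summed over states give the same P(x) at every position.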

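Partition function posterior probabilities
Standard alignment score, where $s(x_i, y_j)$ is the substitution score:
$$S(a) = \sum_{x_i \sim y_j \in a} s(x_i, y_j) + \text{gap penalties}$$
Probability of alignment (Miyazawa, Prot. Eng. 1995), where $T$ is a temperature parameter:
$$P(a) \propto e^{S(a)/T}$$
If we knew the alignment partition function $Z(T)$ then
$$P(a) = \frac{e^{S(a)/T}}{Z(T)}$$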

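Partition function posterior probabilities
Alignment partition function (Miyazawa, Prot. Eng. 1995):
$$Z(T) = \sum_{a \in A} e^{S(a)/T}$$
Subsequently, writing $A_{i,j}$ for the set of all alignments of the prefixes $x_1 \ldots x_i$ and $y_1 \ldots y_j$, the partition function over alignments that end with $x_i$ paired to $y_j$ is
$$Z^M_{i,j} = \Big( \sum_{a \in A_{i-1,j-1}} e^{S(a)/T} \Big)\, e^{s(x_i, y_j)/T}$$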

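Partition function posterior probabilities
More generally the forward partition function matrices are calculated as
$$Z^M_{i,j} = \big( Z^M_{i-1,j-1} + Z^E_{i-1,j-1} + Z^F_{i-1,j-1} \big)\, e^{s(x_i, y_j)/T}$$
$$Z^E_{i,j} = Z^M_{i,j-1}\, e^{g/T} + Z^E_{i,j-1}\, e^{ext/T}$$
$$Z^F_{i,j} = Z^M_{i-1,j}\, e^{g/T} + Z^F_{i-1,j}\, e^{ext/T}$$
$$Z_{i,j} = Z^M_{i,j} + Z^E_{i,j} + Z^F_{i,j}$$
where $g$ is the gap-open penalty and $ext$ the gap-extension penalty. $Z^M$ covers alignments ending in a match, $Z^E$ those ending with $y_j$ aligned to a gap, and $Z^F$ those ending with $x_i$ aligned to a gap.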

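Partition function matrices vs. standard affine recursions
The partition function recursions have the same shape as the standard affine-gap recursions: each max becomes a sum, and each added score or penalty $v$ becomes a multiplication by $e^{v/T}$.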

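Posterior probability calculation
Let $Z_{i,j}$ be the forward partition function over alignments of the prefixes $x_1 \ldots x_i$ and $y_1 \ldots y_j$. If we define $Z'$ as the “backward” partition function matrices, so that $Z'_{i,j}$ sums over alignments of the suffixes starting at $x_i$ and $y_j$, then
$$P(x_i \sim y_j) = \frac{Z_{i-1,j-1}\; e^{s(x_i, y_j)/T}\; Z'_{i+1,j+1}}{Z(T)}$$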
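The forward and backward partition function calculation can be sketched as below. To keep the example short it assumes a single linear gap penalty `gap` instead of the affine open and extension penalties, so it uses one matrix per direction rather than three. The function name, the `score` callback and the parameter values are hypothetical, and this is not the Probalign implementation.

```python
import math


def partition_posteriors(x, y, score, gap, T):
    """Posterior matrix P(x_i ~ y_j) from forward/backward partition functions.

    Assumes a linear gap penalty `gap` (a negative number) for simplicity.
    Returns (post, Z) with post[i][j] for 0-based residues i of x and j of y.
    """
    m, n = len(x), len(y)
    w_gap = math.exp(gap / T)
    w = [[math.exp(score(x[i], y[j]) / T) for j in range(n)] for i in range(m)]

    # Forward: Z[i][j] sums e^{S(a)/T} over alignments of x[:i] and y[:j].
    Z = [[0.0] * (n + 1) for _ in range(m + 1)]
    Z[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += Z[i - 1][j - 1] * w[i - 1][j - 1]
            if i > 0:
                total += Z[i - 1][j] * w_gap
            if j > 0:
                total += Z[i][j - 1] * w_gap
            Z[i][j] = total

    # Backward: Zb[i][j] sums over alignments of the suffixes x[i:] and y[j:].
    Zb = [[0.0] * (n + 1) for _ in range(m + 1)]
    Zb[m][n] = 1.0
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m and j == n:
                continue
            total = 0.0
            if i < m and j < n:
                total += Zb[i + 1][j + 1] * w[i][j]
            if i < m:
                total += Zb[i + 1][j] * w_gap
            if j < n:
                total += Zb[i][j + 1] * w_gap
            Zb[i][j] = total

    z_total = Z[m][n]
    # Prefix weight * pair weight * suffix weight, normalised by Z(T).
    post = [[Z[i][j] * w[i][j] * Zb[i + 1][j + 1] / z_total for j in range(n)]
            for i in range(m)]
    return post, z_total


def toy_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


toy_post, toy_Z = partition_posteriors("ACG", "AG", toy_score, gap=-1.0, T=1.0)
```

As T decreases the distribution concentrates on the optimal alignment, and as T grows all alignments approach equal weight.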

Posterior probabilities using alignment ensembles
By generating an ensemble $A(n, x, y)$ of $n$ alignments of $x$ and $y$, we can estimate $P(x_i \sim y_j)$ by counting the number of times $x_i$ is aligned to $y_j$ and dividing by $n$. Note that this means we are assigning equal weights to all alignments in the ensemble.

Generating ensemble of alignments
We can use stochastic backtracking (Mückstein et al., Bioinformatics, 2002) to generate a given number of optimal and suboptimal alignments. At every step in the traceback we assign a probability to each of the three possible predecessor cells (diagonal, up, left). This allows us to “sample” alignments from their partition function probability distribution. Posterior probabilities estimated from such an ensemble turn out to be the same as those calculated using forward and backward partition function matrices.
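A minimal sketch of stochastic backtracking and the ensemble estimate follows. It again assumes a linear gap penalty so that a single forward matrix suffices. All function names, the scoring function and the parameter values are hypothetical.

```python
import math
import random


def forward_Z(x, y, score, gap, T):
    """Forward partition function matrix under a linear gap penalty (assumption)."""
    m, n = len(x), len(y)
    Z = [[0.0] * (n + 1) for _ in range(m + 1)]
    Z[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                Z[i][j] += Z[i - 1][j - 1] * math.exp(score(x[i - 1], y[j - 1]) / T)
            if i > 0:
                Z[i][j] += Z[i - 1][j] * math.exp(gap / T)
            if j > 0:
                Z[i][j] += Z[i][j - 1] * math.exp(gap / T)
    return Z


def sample_alignment(x, y, Z, score, gap, T, rng):
    """Sample one alignment (as 1-based aligned pairs) with probability e^{S(a)/T}/Z."""
    i, j = len(x), len(y)
    pairs = []
    while i > 0 or j > 0:
        # Weight of each predecessor cell; the weights sum to Z[i][j].
        moves = []
        if i > 0 and j > 0:
            moves.append(("M", Z[i - 1][j - 1] * math.exp(score(x[i - 1], y[j - 1]) / T)))
        if i > 0:
            moves.append(("X", Z[i - 1][j] * math.exp(gap / T)))
        if j > 0:
            moves.append(("Y", Z[i][j - 1] * math.exp(gap / T)))
        r = rng.random() * Z[i][j]
        for name, weight in moves:
            r -= weight
            if r < 0:
                break
        # If rounding prevents a break, the last listed move is used.
        if name == "M":
            pairs.append((i, j))
            i, j = i - 1, j - 1
        elif name == "X":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def ensemble_posteriors(x, y, score, gap, T, n_samples, seed=0):
    """Estimate P(x_i ~ y_j) as the fraction of sampled alignments containing the pair."""
    rng = random.Random(seed)
    Z = forward_Z(x, y, score, gap, T)
    counts = {}
    for _ in range(n_samples):
        for pair in sample_alignment(x, y, Z, score, gap, T, rng):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: c / n_samples for pair, c in counts.items()}


def match_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0
```

For example, `ensemble_posteriors("ACG", "AG", match_score, -1.0, 1.0, 1000)` returns a dictionary keyed by 1-based residue pairs. The estimate is noisy for small ensembles and approaches the exact posterior as the number of samples grows.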

Probalign
1. For each pair of sequences (x, y) in the input set
 – a. Compute partition function matrices Z(T)
 – b. Estimate posterior probability matrix $P(x_i \sim y_j)$ for (x, y) by
$$P(x_i \sim y_j) = \frac{Z_{i-1,j-1}\; e^{s(x_i, y_j)/T}\; Z'_{i+1,j+1}}{Z(T)}$$
2. Perform the probabilistic consistency transformation and compute a maximal expected accuracy multiple alignment: align sequence profiles along a guide-tree and follow by iterative refinement (Do et al.).
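The probabilistic consistency transformation in step 2 can be sketched as follows, based on its description in Do et al.: each pairwise posterior matrix is re-estimated by routing through every sequence z in the set S, $P'(x_i \sim y_j) = \frac{1}{|S|} \sum_{z \in S} \sum_k P(x_i \sim z_k)\, P(z_k \sim y_j)$, with the posterior of a sequence against itself taken as the identity. The data layout (a dictionary keyed by ordered pairs of sequence names) and the function names are assumptions for this example, and sparse-matrix optimisations are omitted.

```python
def transpose(A):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Plain dense matrix product of list-of-lists matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def consistency_transform(post, names):
    """One round of the consistency transformation.

    post[(a, b)] is the posterior matrix for sequences a and b (rows index a).
    Each unordered pair is stored once; the other orientation is its transpose.
    """
    def get(a, b):
        return post[(a, b)] if (a, b) in post else transpose(post[(b, a)])

    new_post = {}
    for (a, b), P in post.items():
        # z = a and z = b each contribute P itself (identity self-posteriors).
        acc = [[2.0 * v for v in row] for row in P]
        for z in names:
            if z == a or z == b:
                continue
            via_z = matmul(get(a, z), get(z, b))
            acc = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(acc, via_z)]
        new_post[(a, b)] = [[v / len(names) for v in row] for row in acc]
    return new_post


# Hypothetical posteriors for three one-residue sequences x, y, z.
toy = {("x", "y"): [[0.2]], ("x", "z"): [[0.9]], ("y", "z"): [[0.8]]}
toy_new = consistency_transform(toy, ["x", "y", "z"])
```

In the toy input the weak direct evidence for x ~ y is raised because both residues align confidently to the same residue of z.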

Multiple protein alignment
Protein sequence alignment: hard problem for multiple distantly related proteins.
Several standard protein alignment benchmarks are available: BAliBASE, HOMSTRAD, OXBENCH, PREFAB, and SABMARK.
Benchmark alignments are based on manual and computational structural alignment of proteins with known structure.

Measure of accuracy
Sum-of-pairs score: number of correctly aligned pairs divided by number of pairs in true alignment.
Column score: number of correctly aligned columns divided by number of columns in true alignment.
Statistical significance using Friedman rank test.
Example:
AACAGT
AAGT__

AACAGT
AA__GT
The first two aligned pairs agree between the two alignments and the last two do not: Acc = 2/4 = 50%
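The sum-of-pairs score for a pairwise alignment can be computed as in this short sketch. The helper names are hypothetical, the gap character is assumed to be `_` as on the slide, and benchmark tools such as the BAliBASE scoring program handle multiple alignments and core blocks, which this example does not.

```python
def aligned_pairs(row_x, row_y, gap="_"):
    """Return the set of 1-based residue index pairs (i, j) aligned in two gapped rows."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_x, row_y):
        if a != gap:
            i += 1
        if b != gap:
            j += 1
        if a != gap and b != gap:
            pairs.add((i, j))
    return pairs


def sum_of_pairs(estimated, true):
    """Fraction of residue pairs in the true alignment that the estimate recovers."""
    est_pairs = aligned_pairs(*estimated)
    true_pairs = aligned_pairs(*true)
    return len(est_pairs & true_pairs) / len(true_pairs)


# The two alignments from the slide; which one is the reference does not
# change the result here because both contain four aligned pairs.
sp = sum_of_pairs(("AACAGT", "AAGT__"), ("AACAGT", "AA__GT"))
```

Here `sp` evaluates to 0.5, matching the 2/4 = 50% on the slide.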

Experimental design
Methods compared:
– Probalign
– PROBCONS
– MUSCLE
– MAFFT
Probalign temperature parameter trained on RV11 subset of BAliBASE 3.0.
Default (optimized) parameters for remaining programs.

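BAliBASE 3.0
Sum-of-pairs and column score accuracies (“…” marks a value missing from the transcript)

Data   Probalign   MAFFT     Probcons   MUSCLE
RV11   … / …       … / …     … / …      … / 35.9
RV12   … / …       … / …     … / …      … / 80.4
RV20   … / …       … / …     … / …      … / 35.1
RV30   … / …       … / …     … / …      … / 38.3
RV40   … / …       … / …     … / …      … / 47.1
RV50   … / …       … / …     … / …      … / 48.7
All    87.6 / …    … / …     … / …      … / 48.5

Friedman rank test P-values (columns: RV11, RV12, RV20, RV30, RV40, RV50, All). Entries are listed in the order they survive in the transcript and cannot be assigned to columns reliably.
MAFFT: NS, < 0.005, NS, < 0.005, NS, < …
Probcons: NS, < 0.005, NS, < …
MUSCLE: < …, < …, NS, < …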

Heterogeneous length data I
Two tables indexed by Max length / Standard dev.: “BAliBASE datasets with maximum length and minimum deviation” and “BAliBASE datasets with long extensions”. Methods: Probalign, MAFFT, Probcons, MUSCLE (one of the two tables omits MUSCLE). Only fragments of the tables survive in the transcript: the row labels 500 / …, … / 100 (25) and 1000 / 200 (20), and the values 92.7, 42.5 and 47.6.

Heterogeneous length data II
BAliBASE 2.0 reference 6 datasets with max length and minimum deviation (“…” marks a value missing from the transcript)

Max length / Standard dev.   Probalign   MAFFT     Probcons
500 / 100 (40)               89.1 / …    … / …     … / …
… / 200 (21)                 88.3 / …    … / …     … / …
… / 300 (9)                  95.3 / …    … / …     … / …
… / 400 (5)                  94.6 / …    … / …     … / …
… / 100 (15)                 90.2 / …    … / …     … / …
… / 200 (12)                 89.2 / …    … / …     … / …
… / 300 (7)                  94.5 / …    … / …     … / …
… / 400 (5)                  94.6 / …    … / …     … / 38.0