Or, What is a correspondence set anyway?! Topic 12 Chapter 16, Du and Bourne “Structural Bioinformatics”


Alignment vs. superposition Structural alignment attempts to establish homology between two or more polymer structures based on their shape and 3D structure. It requires no a priori knowledge of equivalent positions, which makes it a valuable tool for comparing proteins with low sequence similarity, where evolutionary relationships cannot be easily detected by standard sequence alignment techniques. Conversely, simple structural superposition uses knowledge of at least some equivalent residues to guide a rigid body superposition. It is the most basic possible comparison between protein structures: it makes no attempt to align the input structures, and instead requires a precalculated alignment as input to determine which residues are to be considered in the RMSD calculation.

Structure alignment [Figure: first step + second step] Structure alignments are based on structure similarity, from which sequence alignments can be trivially extracted. Due to computational complexity, most structural alignments are pairwise, but multiple alignment methods do exist.

Dynamic programming and sequence alignment To really understand structure alignment, you need to understand sequence alignment...
• Dynamic programming (DP) is an algorithmic technique originally developed by Richard Bellman in the early 1950s for “multistage decision processes.” DP methods solve optimization problems and are very useful in bioinformatics applications such as sequence alignment, where there are a large number of possible solutions but only one (or a few) best solution(s).
• Foundation: Any partial sub-path ending at a point along the true optimal path must itself be an optimal path leading up to that point. So the optimal path can be found by incremental extensions of optimal sub-paths, leading to a recursive algorithm that is (typically) guaranteed to produce the best answer.
• There are two major types of optimal DP sequence alignments: global (Needleman-Wunsch) and local (Smith-Waterman) alignments.
• DP alignment is based on the assumption of independence, where the score of a residue (mis)match is unaffected by other pairs, so the probability of the whole alignment is a joint probability (a product over the aligned pairs). For example… ASCTVLATCAVI / ASCTVLATCAVI. Based on the magic of logarithms, that product becomes a sum of per-position scores.

Substitution (scoring) matrix Substitution matrices are composed of log-ratios that compare observed pair frequencies to the background expectation. S(i,j) > 0 indicates a ‘preferred’ match. For example, the BLOSUM-62 matrix…
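As a hedged reminder of what “log-ratio” usually means here, the standard log-odds form is sketched below. The symbols q_ij (observed frequency of the aligned pair i, j), p_i and p_j (background frequencies) and the scale factor λ are notation assumed for illustration and do not come from the slides. The second line shows why independence matters: the joint odds of an alignment of residues x_k against y_k is a product, so its logarithm is a plain sum of matrix entries.

```latex
S_{ij} = \frac{1}{\lambda}\,\log\frac{q_{ij}}{p_i\,p_j}
\qquad\text{and}\qquad
\log\prod_{k}\frac{q_{x_k y_k}}{p_{x_k}\,p_{y_k}}
  = \sum_{k}\log\frac{q_{x_k y_k}}{p_{x_k}\,p_{y_k}}
```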

Dynamic Programming (DP) Match: +5 Mismatch: -2 Insertion/deletion: -6 Sean Eddy, 2004, Nature Biotechnology
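To make the recursion concrete, here is a minimal sketch of global (Needleman-Wunsch) alignment using the scores quoted on this slide (match +5, mismatch -2, insertion/deletion -6). The function name and the simple linear gap model are assumptions for illustration; real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=5, mismatch=-2, gap=-6):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Extend the best of the three optimal sub-paths.
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append('-'); i -= 1
        else:
            top.append('-'); bottom.append(b[j - 1]); j -= 1
    return F[n][m], ''.join(reversed(top)), ''.join(reversed(bottom))
```

Calling `needleman_wunsch("TTCATA", "TGCTCGTA")` returns the optimal score together with one alignment that achieves it; when several alignments tie, the traceback order above prefers diagonal moves.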

Back to structure alignment Independence is not a valid assumption in structure because… Similarly, in RNA… That is, the probability of mutating the above lysine to X, p(K→X), is NOT independent of the aspartate. This is, of course, the reality in sequence alignment too, but we ignore it there because we treat the protein as a 1D sequence that doesn’t reveal those details.

Rigid body treatment ≠ independence of positions Structure alignment treats proteins as rigid bodies, leading to an even more serious violation of independence. That is, adjusting the position of the purple residue, for example, to maximize overlap with its target will also alter the position of the green residue because they are rigidly related. Rotation of purple by 90° also rotates the green.

Formalizing the structure alignment problem Given two sets of points A = (a_1, a_2, …, a_n) and B = (b_1, b_2, …, b_m) in Cartesian space, find the optimal subsets A(P) and B(Q) with |A(P)| = |B(Q)|, and find the optimal rigid body transformation G between the two subsets that minimizes a given distance metric D over all possible rigid body transformations, i.e. $\min_{G} D\big(A(P) - G[B(Q)]\big)$. The two subsets A(P) and B(Q) define a “correspondence”, and p = |A(P)| = |B(Q)| is called the correspondence length. Naturally, the correspondence length is maximal when A(P) and B(Q) are similar. Therefore there are essentially two problems in structure alignment: (i.) Find the correspondence set (which is NP-hard), and (ii.) Find the alignment transform (which is O(n) once the correspondence is known).
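Problem (ii.) is the easy half: once the correspondence is fixed, the best rigid-body transform follows from a single pass over the paired points. The sketch below illustrates this in 2D only, where the optimal rotation angle has a closed form; this is a deliberate simplification, since 3D superposition is normally done with an SVD- or quaternion-based method. The function name `superpose_2d` is hypothetical.

```python
import math

def superpose_2d(a, b):
    """Rigidly move b onto a (equal-length lists of (x, y) pairs, already in
    correspondence). Returns (moved_b, rmsd). 2D illustration only."""
    n = len(a)
    # Centroids: the optimal translation overlays the two centroids.
    ca = (sum(x for x, _ in a) / n, sum(y for _, y in a) / n)
    cb = (sum(x for x, _ in b) / n, sum(y for _, y in b) / n)
    a0 = [(x - ca[0], y - ca[1]) for x, y in a]
    b0 = [(x - cb[0], y - cb[1]) for x, y in b]
    # The sum of a_i . R(theta) b_i equals cos(theta)*dot + sin(theta)*cross,
    # which is maximal (squared deviation minimal) at theta = atan2(cross, dot).
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a0, b0))
    cross = sum(ay * bx - ax * by for (ax, ay), (bx, by) in zip(a0, b0))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    moved = [(c * x - s * y + ca[0], s * x + c * y + ca[1]) for x, y in b0]
    sq = sum((ax - mx) ** 2 + (ay - my) ** 2
             for (ax, ay), (mx, my) in zip(a, moved))
    return moved, math.sqrt(sq / n)
```

Every sum runs once over the p corresponding points, which is where the linear cost comes from. The hard part, choosing which points correspond, is not addressed here at all.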

Just to clarify… In the structure alignment literature, you will frequently encounter the coordinate root mean squared deviation (cRMSD), which is just like RMSD except that B describes a coordinate transformation of b.
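The equation image for this slide did not survive, so the following is a hedged reconstruction of the usual definition rather than the slide's own formula. It assumes p is the correspondence length and G the rigid body transformation from the previous slide, with B_i the transformed copy of b_i.

```latex
\mathrm{cRMSD} = \sqrt{\frac{1}{p}\sum_{i=1}^{p}\left\lVert a_i - B_i \right\rVert^{2}},
\qquad B_i = G(b_i)
```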

Common structure alignment methods
• DALI: Uses 2D distance matrices between CA atoms to represent each structure. Conceptually, the alignment problem is then straightforward: you must simply maximally overlay the matrices (as described in an earlier cartoon). Holm and Sander. Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233:
• CE (Combinatorial Extension): Uses characteristics of local geometry to seed structural alignments and then joins these regions of local similarity (called aligned fragment pairs, AFPs) into an “optimal” path for the full alignment. Bottom-up approach. Shindyalov and Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Prot Eng 1998, 11:
• SSAP (Sequential Structure Alignment Program): Uses a “double dynamic programming” algorithm with high-level and low-level matrices. Used in the CATH classification. Taylor WR, Orengo CA. 1989b. Protein structure alignment. J Mol Biol 208:1-22
• VAST (Vector Alignment Search Tool), TM-align and many more…

Dali: The Persistence of Time

Overview of the Dali Algorithm Starting with a contact map… Dali attempts to maximize the overlap of the contact maps; however, doing so globally is NP-hard, so the method focuses on local comparisons. Image from Amy Keating at MIT. Image from Mark Maciejewski at UConn.

Overview of the Dali Algorithm The DALI (Distance matrix alignment) algorithm is based on the matrix comparison methods that we have already introduced. Similarity score: $S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i, j)$, where i and j are equivalent residues in A and B, L is the number of such pairs (the size of the substructure), and φ is the similarity measure based on the CA distances $d^A_{ij}$ and $d^B_{ij}$. [Figure: residues i_A, j_A in Structure A and i_B, j_B in Structure B] Images and content modified from Mark Maciejewski at UConn

The Dali Algorithm (step by step)
1. Compute distance matrices for both protein A and protein B.
2. Extract the full set of overlapping hexapeptide (6x6) sub-matrices (also called contact patterns) from each matrix.
3. Compare each 6x6 distance matrix from protein A with each 6x6 distance matrix in protein B. (Really?)
[Figure: 6x6 CA distance matrices] For example: 6.2 – 12.7 = -6.5
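Steps 1 to 3 can be sketched in a few lines. This is an illustration under stated assumptions, not Dali's actual code: the helper names are hypothetical, coordinates are plain (x, y, z) tuples for CA atoms, and only segment pairs with i <= j are kept because a distance matrix is symmetric.

```python
import math

def distance_matrix(coords):
    """Step 1: all-against-all Euclidean distances between CA coordinates."""
    return [[math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v))) for v in coords]
            for u in coords]

def contact_patterns(dm, size=6):
    """Step 2: map (i, j) to the size x size sub-matrix relating the segment
    starting at residue i to the segment starting at residue j (i <= j)."""
    n_seg = len(dm) - size + 1
    return {(i, j): [row[j:j + size] for row in dm[i:i + size]]
            for i in range(n_seg) for j in range(i, n_seg)}

def pattern_difference(sub_a, sub_b):
    """Step 3: element-wise difference between two contact patterns,
    for example 6.2 - 12.7 = -6.5 for a single pair of entries."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(sub_a, sub_b)]
```

Keeping i <= j yields n(n+1)/2 patterns for n segments, slightly more than a rough n²/2 estimate because the diagonal is included.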

The Dali Algorithm (step by step) “Houston, … we have a problem!”
Consider protein A with 100 residues, meaning we have 100 − 5 = 95 hexapeptides. → (95^2)/2 ≈ 4,512 contact pattern matrices
Consider protein B with 150 residues, meaning 150 − 5 = 145 hexapeptides. → (145^2)/2 ≈ 10,512 contact pattern matrices
Even for these two relatively small proteins, there would be → 4,512 x 10,512 = 47,430,144 comparisons between A and B.
Step 1: For each hexapeptide, a distance matrix compares it to every other hexapeptide within its structure.
Step 2: Every distance matrix created in step 1 for one protein is compared to every distance matrix created for the other.
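The arithmetic on this slide is easy to reproduce. The helper names below are hypothetical, and the halving with integer division simply mirrors the slide's rounded-down estimate of a symmetric matrix.

```python
def hexapeptide_count(n_residues, size=6):
    """Number of overlapping windows of `size` residues in a chain."""
    return n_residues - size + 1

def pattern_count(n_hexapeptides):
    """The slide's estimate: half of the n x n grid, rounded down."""
    return n_hexapeptides ** 2 // 2

patterns_a = pattern_count(hexapeptide_count(100))  # protein A, 100 residues
patterns_b = pattern_count(hexapeptide_count(150))  # protein B, 150 residues
print(patterns_a, patterns_b, patterns_a * patterns_b)
# prints: 4512 10512 47430144
```

The count grows roughly with the product of the squared chain lengths, which is why the later steps prune and sample rather than enumerate.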

The Dali Algorithm (step by step)
4. Each contact pattern in protein A is paired with its most similar pattern in protein B, a process that generates a pair list.
5. The list is sorted based on the strength of pair similarity of the contact patterns.
A note about the similarity measure φ: We want to maximize the number of equivalent residues while minimizing structural variations; it is a tradeoff. That is, if the criteria are so tough that minor structure deviations are not allowed, then the number of matching contact patterns is likely to be very small. Note that unmatched residues do not contribute to the overall similarity score S. Image from Amy Keating at MIT

The Dali Algorithm (step by step) Q: How do you calculate φ(i,j)?
Method 1: Rigid residue-pair similarity score: a threshold (in Å) sets the zero level of similarity. --The only thing that matters is the absolute difference, meaning that a given difference is penalized the same at large distances as at short distances.
Method 2: Elastic similarity score (default): --Larger differences are tolerated for longer-range contact pairs.
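The two scores can be sketched as follows. The functional forms and the constants (1.5 Å, 0.2, and a 20 Å envelope) are the ones commonly quoted from Holm and Sander (1993); they are not recoverable from the slide itself, so treat them as assumptions, along with all function names.

```python
import math

THETA_R = 1.5   # assumed rigid threshold in Å (zero level of similarity)
THETA_E = 0.2   # assumed elastic threshold (relative deviation)
ALPHA = 20.0    # assumed envelope length scale in Å

def phi_rigid(d_a, d_b, theta=THETA_R):
    """Rigid score: threshold minus the absolute distance difference."""
    return theta - abs(d_a - d_b)

def phi_elastic(d_a, d_b, theta=THETA_E, alpha=ALPHA):
    """Elastic score: threshold minus the relative difference, damped by an
    envelope that down-weights long-range pairs."""
    mean = 0.5 * (d_a + d_b)
    if mean == 0.0:          # diagonal term, i == j
        return theta
    envelope = math.exp(-(mean / alpha) ** 2)
    return (theta - abs(d_a - d_b) / mean) * envelope

def dali_score(dm_a, dm_b, pairs, phi=phi_elastic):
    """S = sum of phi over all pairs of equivalences; `pairs` is a list of
    (residue index in A, residue index in B) tuples."""
    return sum(phi(dm_a[ia][ja], dm_b[ib][jb])
               for ia, ib in pairs for ja, jb in pairs)
```

With these assumed constants, a 1 Å discrepancy scores identically under the rigid form whether the two residues are 4 Å or 30 Å apart, while the elastic form penalizes it at short range and tolerates it at long range.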

6. Merging contact patterns to form chains and reduce complexity The search space is reduced because only the central contact pattern is retained (actually, the one that gives the smallest average intra-pattern distance).

The Dali Algorithm (step by step)
7. After removing the overlapping patterns, we are still left with way too many contact patterns to exhaustively compare all possible pairs. Start comparing pairs at random: -- Keep a list of positive scores (discard negative scores) -- Keep comparing until your list has 80,000 positive scores. Sort the list and keep the best 40,000 contact pattern matches.
8. End game: We need to find the optimal alignment of the 40,000 contact patterns such that the alignment occurs over as wide a range of the structural pair as possible. Using Markov Chain Monte Carlo (MCMC), start with a random contact pattern from the list of 40,000, and then “walk” to another overlapping pattern (which must extend the contact pattern by 4 residues) using the standard Metropolis criterion.

Metropolis Monte Carlo Optimization In Dali… The net result is that scores that improve are always kept, whereas scores that get worse are accepted with some probability.
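A minimal sketch of the standard Metropolis acceptance rule, written for a score that is being maximized. The parameter `beta` (an inverse temperature) and the function name are assumptions for illustration; the slide does not state which value Dali uses.

```python
import math
import random

def metropolis_accept(old_score, new_score, beta=1.0, rng=random):
    """Return True if a proposed move should be kept.

    Improvements are always accepted; a move that lowers the score by
    delta is accepted with probability exp(-beta * delta)."""
    if new_score >= old_score:
        return True
    return rng.random() < math.exp(beta * (new_score - old_score))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible. A larger `beta` makes the walk greedier, and a smaller one lets it escape local optima more readily.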

The Dali Algorithm (the reality)

Statistical significance of Dali alignments Dali uses a Z-score to show the significance of an alignment. A common and practical approach to the problem of assessing alignment significance is to determine whether the alignment score is better than one could expect by chance. Dali compares each alignment score against the distribution of scores from an all-to-all protein structure comparison (normalized by length), which defines the Z-score. -- Dali Z-scores > 2 are thought to be meaningful.
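The generic idea behind a Z-score is sketched below: express a score in units of standard deviations above the mean of a background distribution. This is not Dali's exact procedure, which rescales by protein length before comparing; the function name and the toy background values are assumptions.

```python
import statistics

def z_score(score, background_scores):
    """Standard deviations by which `score` exceeds the background mean."""
    mu = statistics.mean(background_scores)
    sigma = statistics.stdev(background_scores)  # sample standard deviation
    return (score - mu) / sigma

# Toy background of five unrelated-pair scores (made-up numbers).
background = [1.0, 2.0, 3.0, 4.0, 5.0]
is_meaningful = z_score(6.5, background) > 2
```

In practice the background would hold many scores from unrelated structure pairs of comparable size, since raw scores grow with chain length.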

Combinatorial Extension (a cursory look)

Combinatorial Extension (a cursory look)
• Similar to Dali in that it also breaks the structure down into a series of small fragments, which it then attempts to reassemble into a complete alignment.
• For a pair of proteins A and B, an aligned fragment pair (AFP) is defined as a continuous segment of A aligned against a continuous segment of B of the same size (without gaps).
• If n_1 and n_2 are the lengths of A and B, and the AFP length is set to m, then there are a total of (n_1 − m)·(n_2 − m) possible AFPs.
• Only AFPs that meet given criteria for local similarity are included in the matrix, as a means of restricting the search space.
• An alignment path is calculated as the optimal path through the similarity matrix by linearly progressing through the sequences and extending the alignment with the next possible high-scoring AFP.

Combinatorial Extension (a cursory look) Goal: Find a “good” local alignment for structures of proteins A and B.
1. Select some initial AFP.
2. Build an alignment path by incrementally adding AFPs in a way that satisfies the conditions (i.e., stitch AFPs together).
3. Repeat step (2) until the length of each protein is traversed, or until no “good” AFPs remain.
4. Optimize the alignment via dynamic programming.
5. Measure statistical significance.
Questions:
• How do we choose the starting AFP?
• What are the criteria for adding AFPs to our alignment path?
• What does the distance function look like?
• When to stop? Or at what point do we know that there are no “good” AFPs left?

• To assess how good the alignment produced by CE is, we can compare it to the alignment of a random pair of structures, and compute the Z-score based on the RMSD and the number of gaps in the final alignment.
• Since CE does not penalize gaps, we can perform additional optimization after CE is completed in order to remove excess gaps using dynamic programming.
• The CE method is highly configurable, which is at once its strength and its weakness. Adjusting multiple parameters, such as the AFP length m, the cutoff distances D_0 and D_1, and the definitions of AFP distances, can result in varying alignments and execution speeds.
• In general, CE does not outperform previously existing structural alignment methods, such as Dali and VAST: it does better for some pairs of structures, and worse for others.

VAST (a cursory look) VAST = Vector Alignment Search Tool

VAST (a cursory look)
1. Parse protein structures into SSEs (helices and strands).
2. Fit vectors to SSEs.
3. To compare a pair of proteins, attempt to superpose as many vectors as possible, subject to constraints.
4. Evaluate the vector alignment for statistical significance (compute an E-value).
5. If the vector alignment is significant, then proceed to a more detailed residue-to-residue alignment (“refined alignment”).
Modified from Tom Madej at GWU

VAST in pictures… [Figure] Modified from Tom Madej at GWU

Double Dynamic Programming (a cursory look)
• Use two levels of dynamic programming: a high-level scoring matrix, and a low-level matrix for each high-level matrix element.
• Each element F_ij of the high-level scoring matrix shows how likely it is that the residue pair (i, j) lies on an optimal alignment.
• For each F_ij, the likelihood is found by a (low-level) optimal alignment with the constraint that the pair (i, j) is part of the alignment.
• The scores along the low-level alignments are accumulated in the high-level scoring matrix.

DDP cont.
• Begin by constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein.
• A series of matrices is then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed.
• Dynamic programming applied to each resulting matrix determines a series of optimal local alignments, which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural similarity.

Stated in a slightly different manner
• First level: Represent each residue by its neighborhood vectors for Cα. Compare n versus m neighborhood vectors. Generate an optimal alignment based on vector differences and dynamic programming.
• Second level: Add matrix scores to a cumulative matrix if paths cross. Generate an optimal alignment based on the cumulative matrix.

SSAP = Sequential Structure Alignment Program
• SSAP originally produced only pairwise alignments but has since been extended to multiple alignments as well.
• It has been applied in an all-to-all fashion to produce CATH.
• Generally, SSAP scores above 80 are associated with highly similar structures. Scores between 70 and 80 indicate a similar fold with minor variations. Structures yielding a score between 60 and 70 do not generally contain the same fold, but usually belong to the same protein class with common structural motifs.

Multiple Structure Alignment
• Most multiple structure alignments are based on a pile-up combination of pairwise results; few algorithms do an all-to-all optimization.
• One example of a multiple alignment method is Combinatorial Extension Monte Carlo (CE-MC), which is based on a progressive CE multiple alignment strategy, followed by an iterative Metropolis MC refinement.