Pairwise alignment incorporating dipeptide covariation


Pairwise alignment incorporating dipeptide covariation. Gavin E. Crooks, Richard E. Green, Steven E. Brenner. Presented by Ramesh MuthuValliappan & Zeeshan Ali Khadim

Motivation: Standard alignment algorithms assume that amino acid substitutions at neighboring sites are not correlated. Advantage: this makes the algorithms run faster. Disadvantage: it ignores information that could increase the power of homolog detection. The alternative explored here is an algorithm that uses extended substitution matrices, which encapsulate the observed correlations between neighboring sites, with the aim of detecting more remote homologies.

Standard algorithms: Smith–Waterman, SSEARCH, BLAST and FASTA are, despite their low sensitivity, fast and universally applicable for pairwise alignment. They make two assumptions. First, amino acid substitutions are assumed to be homogeneous between protein families; BLOSUM and PAM are the most commonly used substitution matrices and thus serve as general models for protein sequence evolution. Second, substitution at a given site is assumed to be uncorrelated with its neighboring sites, i.e. the likelihood of substituting amino acid X for Y is independent of the sequence context of X.

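Sensitivity to sequence context: The algorithm the authors describe, called “doublet”, is similar to the Smith–Waterman algorithm but takes into account the sequence context and the correlations between neighboring sites. Standard terms: the log-odds ratio for aligning residues i and j is the logarithm of the target probability q(i; j) of observing the aligned pair in homologous sequences, divided by the product p(i) p(j) of the background probabilities of the two residues.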

We can perform a more controlled approximation by noting that a homogeneous multivariate probability can be expanded into a product of single-component distributions, pairwise correlations, triplet correlations and so on. The homology score is expanded in the same way: the first term of the expansion represents single-residue replacements (singlet scores) and the next term defines the doublet substitution scores. Here, i and i′ are residues separated by a distance l along one amino acid chain, whereas j and j′ are the corresponding aligned residues on the homologous sequence; q_l(i, i′; j, j′) is the target probability of observing this aligned quartet, and p_l(i, i′) is the background probability of this residue pair in protein sequences. These doublet scores represent the additional similarity owing to correlations between substitutions.
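The slide's equations did not survive in this transcript. One reconstruction that is consistent with the definitions in the text, though not necessarily the authors' exact notation, is the following. The symbols S (singlet score) and D_l (doublet score at separation l) are names assumed here for illustration.

```latex
% Singlet (log-odds) score for aligning residue i with residue j
S(i; j) = \log \frac{q(i; j)}{p(i)\, p(j)}

% Doublet score: the quartet log-odds minus what the two singlet scores already explain
D_l(i, i'; j, j') = \log \frac{q_l(i, i'; j, j')}{p_l(i, i')\, p_l(j, j')} - S(i; j) - S(i'; j')
```

Under this form, D_l is zero whenever the two substitutions are independent given the residue pairs, so only genuine correlation between neighboring substitutions contributes extra score.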

Interhomolog mutual information I: a measure of the intersequence correlations. A high mutual information value indicates strong correlation, whereas a value of zero indicates uncorrelated variables. The average singlet score is the interhomolog mutual information per residue, under the assumption that replacements are uncorrelated. This is frequently reported as the ‘relative entropy’ of the substitution matrix. The average doublet score is the first-order correction to the intersequence mutual information owing to intersite correlations.
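As a rough illustration of the relative entropy mentioned above, the sketch below averages the singlet log-odds score over the target distribution. The function name, the dictionary layout and the two-letter toy alphabet with made-up probabilities are all assumptions for this example, not data from the paper.

```python
import math

def relative_entropy(q, p):
    """Average singlet score in bits: sum over (i, j) of q(i,j) * log2(q(i,j) / (p(i) p(j)))."""
    total = 0.0
    for (i, j), qij in q.items():
        if qij > 0.0:  # terms with q = 0 contribute nothing
            total += qij * math.log2(qij / (p[i] * p[j]))
    return total

# Toy two-letter alphabet with made-up numbers (hypothetical, for illustration only).
p_toy = {"A": 0.5, "B": 0.5}
q_toy = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}

# Target probabilities equal to the background product: the aligned residues are uncorrelated.
q_indep = {(i, j): p_toy[i] * p_toy[j] for i in p_toy for j in p_toy}
```

With `q_indep` the function returns zero, matching the statement that zero mutual information means uncorrelated variables; with `q_toy`, where identical residues pair up more often than chance, the value is positive.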

A Smith–Waterman match table, with the optimal alignment highlighted. The value of each cell is the maximum of (1) the singlet match score (this is the start of an alignment), (2) the singlet score plus the match score from the previous cell along the diagonal (this extends an aligned region), or (3) the singlet score plus the optimal score from a gap score table (the previous residue was not aligned).
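The three-way maximum described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the affine gap penalties, the use of two gap tables (`X` and `Y`), and the caller-supplied `score` function are assumptions made here.

```python
NEG_INF = float("-inf")

def smith_waterman_score(a, b, score, gap_open=5.0, gap_extend=1.0):
    """Return the best local alignment score of sequences a and b.

    M is the match table; X and Y are gap score tables (gap in b, gap in a).
    """
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(a[i - 1], b[j - 1])
            gap_prev = max(X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = max(
                s,                        # (1) start of an alignment
                s + M[i - 1][j - 1],      # (2) extend an aligned region
                s + gap_prev,             # (3) previous residue was not aligned
            )
            # Open a new gap from the match table or extend an existing one.
            X[i][j] = max(M[i - 1][j] - gap_open, X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open, Y[i][j - 1] - gap_extend)
            best = max(best, M[i][j])
    return best

def toy_score(x, y):
    """Hypothetical singlet score: +2 for identity, -1 otherwise."""
    return 2.0 if x == y else -1.0
```

For example, `smith_waterman_score("AACC", "AAGCC", toy_score, gap_open=1.0, gap_extend=1.0)` aligns all four residues of the first sequence around a one-residue gap. In practice `score` would look up a substitution matrix such as BLOSUM rather than use a match/mismatch rule.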

Doublet match tables. The number of match tables is one plus the distance over which dipeptide correlation information is considered (in this example, two). Again, the optimal alignment is highlighted. The top table corresponds to the starts of aligned regions, the middle table corresponds to aligned regions of at least two consecutive residues and the bottom table corresponds to aligned regions of at least three consecutive residues. The alignment path through these tables falls through to lower tables in regions of consecutive aligned residues and begins again in the top table following gaps. Extending dipeptide context scoring over longer distances requires additional match tables.

Doublet alignment algorithm: The time complexity of Smith–Waterman is O(nm), where n and m are the lengths of the two sequences. Adding doublet scores increases the complexity to O(nmL), where L is the distance over which substitution correlations are scored. The score for aligning a residue pair combines the singlet substitution score with the doublet substitution scores for separations 1 to r, where r is the length of the preceding contiguous segment of aligned residues or the maximum sequence separation over which doublet correlations are scored, whichever is less.
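To make the role of r concrete, the sketch below scores a single ungapped aligned segment by adding, at each position, the singlet score and the doublet scores for separations 1 to r. It is not the full dynamic programme: the function name, the dictionary keyed by `(l, i_prev, i, j_prev, j)` and the toy score values are assumptions for illustration.

```python
def segment_score(a, b, singlet, doublet, max_sep):
    """Score an ungapped alignment of equal-length sequences a and b.

    singlet(x, y) returns the singlet substitution score.
    doublet maps (l, a_prev, a_cur, b_prev, b_cur) to a doublet score;
    quartets missing from the dictionary are treated as scoring zero.
    """
    if len(a) != len(b):
        raise ValueError("an ungapped segment needs sequences of equal length")
    total = 0.0
    for k in range(len(a)):
        total += singlet(a[k], b[k])
        # r: preceding contiguous aligned residues, capped at the maximum separation
        r = min(k, max_sep)
        for l in range(1, r + 1):
            total += doublet.get((l, a[k - l], a[k], b[k - l], b[k]), 0.0)
    return total

def toy_singlet(x, y):
    """Hypothetical singlet score: +2 for identity, -1 otherwise."""
    return 2.0 if x == y else -1.0

# Hypothetical doublet scores for two quartets.
toy_doublet = {
    (1, "A", "C", "A", "C"): 0.5,
    (2, "A", "G", "A", "G"): 0.25,
}
```

Each aligned position costs up to `max_sep` extra lookups, which is where the additional factor L in O(nmL) comes from. Setting `max_sep=0` recovers ordinary singlet scoring.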

Doublet algorithm: The largest score within the match tables marks the last aligned position of the optimal alignment. The full alignment can be found by backtracking through the tables, according to the choices previously made during the scoring step.

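A singlet substitution matrix has 20 × 20 entries; a doublet substitution matrix has 20^4 = 160,000 entries for each sequence separation. For example, the L = 3 column for ET↔AV (bottom right) indicates a score of zero for the alignment of ExxT in one sequence to AxxV in the other.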

Results: Doublet substitution correlations. The top, dark line is the total information at various sequence separations. The remaining lines illustrate the relative contributions of different substitution classes to this total information; these are exact conservation XY↔XY, partial conservation XY↔XZ, swaps XY↔YX, partial swaps XY↔ZX and unconserved (double substitutions) XY↔ZU.

Homology Detection: The collection of related sequences is derived from the Structural Classification of Proteins (SCOP) database. We partition the SCOP data into separate test and training subsets of approximately equal size, each containing 500 superfamilies, 2500 sequences and 50000 homologous sequence pairs. No two domains share more than 40% sequence identity. Results of an all-versus-all comparison of the test set are reported as a plot of coverage (the fraction of true relations found) versus errors per query (EPQ), the total number of false relations divided by the number of sequences. Since the number of relations within a superfamily scales as the square of the size of the superfamily, and because SCOP superfamilies vary greatly in size, this reported coverage is dominated by the ability to detect relations within the largest superfamilies. To compensate for this unwarranted dependence, we also report the average fraction of true relations per sequence (linear normalization) and the average fraction of true relations per superfamily (quadratic normalization).
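The sketch below shows one way the coverage and EPQ measures described here could be computed at a single score threshold. It is an illustration under stated assumptions, not the authors' evaluation code: the function name, the input layout (a list of `(query, target, score)` hits without duplicates and a dictionary mapping each sequence to its superfamily) and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def evaluate(hits, superfamily, threshold):
    """Return (coverage, epq, linear, quadratic) for hits scoring >= threshold.

    A true relation is an ordered pair of distinct sequences in one superfamily.
    """
    sizes = Counter(superfamily.values())
    true_per_seq = {s: sizes[f] - 1 for s, f in superfamily.items()}
    found_per_seq = defaultdict(int)
    false_found = 0
    for query, target, score in hits:
        if query == target or score < threshold:
            continue
        if superfamily[query] == superfamily[target]:
            found_per_seq[query] += 1
        else:
            false_found += 1

    # Unnormalized coverage and errors per query.
    coverage = sum(found_per_seq.values()) / sum(true_per_seq.values())
    epq = false_found / len(superfamily)

    # Linear normalization: average fraction of true relations per sequence.
    seqs = [s for s, k in true_per_seq.items() if k > 0]
    linear = sum(found_per_seq.get(s, 0) / true_per_seq[s] for s in seqs) / len(seqs)

    # Quadratic normalization: average fraction of true relations per superfamily.
    found_per_fam = defaultdict(int)
    for s, k in found_per_seq.items():
        found_per_fam[superfamily[s]] += k
    fams = [f for f, n in sizes.items() if n > 1]
    quadratic = sum(found_per_fam[f] / (sizes[f] * (sizes[f] - 1)) for f in fams) / len(fams)
    return coverage, epq, linear, quadratic

# Hypothetical toy data: superfamily A has three members, B has two.
toy_superfamily = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
toy_hits = [
    ("a1", "a2", 10.0), ("a2", "a1", 10.0),
    ("b1", "b2", 9.0), ("b2", "b1", 9.0),
    ("a1", "b1", 8.0),   # false relation above the threshold
    ("a3", "a1", 2.0),   # true relation below the threshold
]
```

Calling `evaluate(toy_hits, toy_superfamily, 5.0)` on this toy set gives a lower unnormalized coverage than linearly normalized coverage, which in turn is lower than the quadratically normalized value, because the larger superfamily is the one with undetected relations. Sweeping the threshold and recording each pair of coverage and EPQ would trace out the coverage-versus-EPQ curve.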

Homology Detection: In general, large superfamilies are more diverse, and the relationships within them are harder to discover. Thus, unnormalized coverage is typically less than linearly normalized coverage, which in turn is less than quadratically normalized coverage. One important point of comparison for search results is the linearly normalized coverage at an EPQ rate of 0.01, i.e. the average fraction of true relations found per database query at a false positive rate of 1 in 100 queries.

Conclusion: The assumption that neighboring sites along a protein sequence evolve independently appears to be generally appropriate, and it leads to fast, elegant and effective algorithms for protein sequence alignment and homology detection. The analysis indicates that local correlations between substitutions are not strong on average. Incorporating doublet substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. The standard assumption that individual residues within protein sequences evolve independently of neighboring positions is therefore an efficient and appropriate approximation.