Stephen Altschul National Center for Biotechnology Information

Compositionally Adjusted Substitution Matrices for Protein Database Searches
Stephen Altschul
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

Collaborators
Yi-Kuo Yu, Alejandro Schäffer, John Wootton, Richa Agarwala, Mike Gertz, Aleksandr Morgulis
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
See: Yu, Wootton & Altschul (2003) PNAS 100:15688-15693; Yu & Altschul (2005) Bioinformatics 21:902-911; Altschul et al. (2005) FEBS J. 272:5101-5109.

Log-odds scores
The scores of any local-alignment substitution matrix can be written in the form

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p_j} $$

where the $p_i$ are background amino acid frequencies, the $q_{ij}$ are target frequencies, and $\lambda$ is an arbitrary scale factor. (PNAS 87:2264-2268)
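A minimal sketch of the log-odds construction, assuming the natural-log form s_ij = ln(q_ij / (p_i p_j)) / λ. The two-letter alphabet, its frequencies, the half-bit scale and the name `log_odds_scores` are illustrative assumptions, not values from the talk.

```python
import math

def log_odds_scores(q, p, lam):
    """Return s[i][j] = ln(q[i][j] / (p[i] * p[j])) / lam."""
    n = len(p)
    return [[math.log(q[i][j] / (p[i] * p[j])) / lam for j in range(n)]
            for i in range(n)]

# Hypothetical two-letter alphabet; q is symmetric and sums to 1.
q = [[0.4, 0.1],
     [0.1, 0.4]]
p = [0.5, 0.5]            # background frequencies (marginal sums of q)
lam = math.log(2) / 2     # assumed scale: scores in half-bit units

scores = log_odds_scores(q, p, lam)
```

Pairs that occur more often than chance (here the diagonal) get positive scores, and pairs that occur less often get negative ones.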

The BLOSUM-62 matrix (PNAS 89:10915-10919)

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Amino acid compositional bias
Some sources of bias:
  Organismal bias
    AT-rich genomes tend to have more of the amino acids FLINKYM
    GC-rich genomes tend to have more of the amino acids PRAWG
  Protein family bias
    Transmembrane proteins: more hydrophobic residues
    Cysteine-rich proteins: more cysteines than usual

Construction of an asymmetric log-odds substitution matrix
Given a (not necessarily symmetric) set of target frequencies $q_{ij}$, define two sets of background frequencies $p_i$ and $p'_j$ as the marginal sums of the $q_{ij}$:

$$ p_i = \sum_j q_{ij}, \qquad p'_j = \sum_i q_{ij} $$

The substitution scores are then defined as

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p'_j} $$

We call this matrix valid in the context of the $p_i$ and $p'_j$.
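As a hedged illustration of this construction, the sketch below takes the row and column sums of a target-frequency table as the two background distributions and builds the scores from them. The 2x2 table, the default scale `lam=1.0` and the function name `asymmetric_log_odds` are assumptions made for the example.

```python
import math

def asymmetric_log_odds(q, lam=1.0):
    """Build scores from target frequencies q using q's own marginals."""
    p_row = [sum(row) for row in q]          # p_i  = sum over j of q_ij
    p_col = [sum(col) for col in zip(*q)]    # p'_j = sum over i of q_ij
    s = [[math.log(q[i][j] / (p_row[i] * p_col[j])) / lam
          for j in range(len(p_col))]
         for i in range(len(p_row))]
    return s, p_row, p_col

# Hypothetical asymmetric target frequencies (entries sum to 1).
q = [[0.30, 0.10],
     [0.20, 0.40]]
s, p_row, p_col = asymmetric_log_odds(q)

# Validity check: sum_ij p_i * p'_j * exp(lam * s_ij) should equal 1.
check = sum(p_row[i] * p_col[j] * math.exp(s[i][j])
            for i in range(2) for j in range(2))
```

Because the row and column marginals differ, the resulting score matrix is not symmetric: s[0][1] and s[1][0] take different values.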

Substitution matrix validity theorem A substitution matrix can be valid for only a unique set of target and background frequencies, except in certain degenerate cases. (Proof omitted) One can determine efficiently whether an arbitrary substitution matrix can be valid in some context and, if so, one can extract its unique target and background frequencies, and scale. (Proof and algorithms omitted)

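Choosing new target frequencies
Given new sets of background frequencies $P_i$ and $P'_j$, how should one choose appropriate target frequencies $Q_{ij}$?

Consistency constraints:

$$ \sum_j Q_{ij} = P_i, \qquad \sum_i Q_{ij} = P'_j $$

Close to original $q_{ij}$: minimize the relative entropy of the new target frequencies with respect to the original ones,

$$ D(Q \,\|\, q) = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{q_{ij}} $$

Sometimes, it is desirable to constrain the relative entropy H:

$$ H = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{P_i \, P'_j} $$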
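For the case with no constraint on H, the target frequencies closest to the original ones in relative entropy, subject to the new marginals, have the form Q_ij = a_i q_ij b_j. One standard way to find such a rescaling is iterative proportional fitting, sketched below. This is a swapped-in illustrative technique, not necessarily the numerical procedure used in the cited papers, and it does not handle the relative-entropy constraint. The function name `fit_targets` and the toy numbers are assumptions.

```python
def fit_targets(q, P_row, P_col, iters=500):
    """Iterative proportional fitting: rescale rows, then columns, repeatedly."""
    Q = [row[:] for row in q]
    for _ in range(iters):
        # Rescale each row so its sum matches the new row background.
        for i, row in enumerate(Q):
            r = P_row[i] / sum(row)
            Q[i] = [x * r for x in row]
        # Rescale each column so its sum matches the new column background.
        col_sums = [sum(col) for col in zip(*Q)]
        for row in Q:
            for j in range(len(row)):
                row[j] *= P_col[j] / col_sums[j]
    return Q

# Hypothetical original target frequencies and new backgrounds.
q = [[0.4, 0.1],
     [0.1, 0.4]]
Q = fit_targets(q, P_row=[0.7, 0.3], P_col=[0.6, 0.4])
```

Row and column rescaling leaves cross-ratios such as Q_00 Q_11 / (Q_01 Q_10) unchanged, which is one way to see that the substitution preferences of the original matrix are carried over to the new composition.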

Substitution matrices compared
Mode A: Standard BLOSUM-62 matrix.
Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).
Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).
Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.
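Modes C and D both constrain the relative entropy H. A minimal sketch of how H could be computed, assuming it is the relative entropy of the target frequencies with respect to the product of the two background distributions, measured in nats. The toy frequencies and the name `relative_entropy` are illustrative.

```python
import math

def relative_entropy(Q, P_row, P_col):
    """H = sum_ij Q_ij * ln(Q_ij / (P_i * P'_j)), in nats."""
    return sum(Q[i][j] * math.log(Q[i][j] / (P_row[i] * P_col[j]))
               for i in range(len(P_row))
               for j in range(len(P_col)))

# Hypothetical two-letter target frequencies with uniform backgrounds.
Q = [[0.4, 0.1],
     [0.1, 0.4]]
H_nats = relative_entropy(Q, [0.5, 0.5], [0.5, 0.5])
H_bits = H_nats / math.log(2)      # nats to bits

# The Mode C constant from the slide, converted to bits.
mode_c_bits = 0.44 / math.log(2)
```

H is zero when the target frequencies equal the product of the backgrounds (no information per aligned pair) and grows as the matrix becomes more selective.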

Performance evaluation (mode D vs. mode A)

BLOSUM-62 and sequence-specific background frequencies (in percent)

Amino               P. falciparum   M. tuberculosis
Acid    BLOSUM62    #16805184       #15607948
-----   --------    -------------   ---------------
A          7.4           4.8             13.9
R          5.2           4.1              7.4
N          4.5           8.9              2.8
D          5.3           5.6              5.9
C          2.5           2.1              1.9
Q          3.4           3.0              3.6
E          5.4           7.0              6.1
G          7.4           6.2              9.5
H          2.6           3.1              1.7
I          6.8           9.0              4.4
L          9.9           8.2              9.3
K          5.8           8.2              1.9
M          2.5           1.3              1.5
F          4.7           5.1              2.5
P          3.9           3.8              5.3
S          5.7           7.4              4.4
T          5.1           2.3              5.7
W          1.3           1.0              0.8
Y          3.2           4.6              2.8
V          7.3           4.4              8.7

Difference between a scaled, standard BLOSUM-62 and a compositionally adjusted BLOSUM-62

P. falciparum

A  -15  -55 -116  -76  -45  -60  -73  -23  -98  -54  -45  -77  -39  -92  -34  -52  -31  -79 -102  -34
R    4   -9  -83  -43  -20  -26  -40    2  -66  -27  -16  -40   -9  -61   -5  -25   -3  -49  -71   -8
N   48   22  -26    6   24   16    3   50  -20   15   25   -2   33  -19   38   21   44   -8  -28   34
D   19   -8  -62  -12   -5  -12  -20   20  -51  -11   -3  -30    4  -47   12   -8   13  -37  -58    6
C   21  -14  -74  -34   19  -20  -35   15  -57  -10   -1  -37    5  -47    6  -11   12  -35  -59    9
Q   23   -2  -65  -24   -3    1  -19   21  -46   -9    1  -24   11  -45   14   -6   15  -30  -54    8
E   22   -5  -66  -20   -6   -7  -14   19  -48  -12   -1  -27    6  -46   12   -8   14  -34  -56    7
G  -25  -59 -115  -77  -52  -64  -77  -14 -102  -61  -51  -80  -45  -94  -38  -57  -37  -82 -106  -42
H   54   27  -31    6   30   23    9   52    2   21   33    4   40   -9   44   25   46    1  -14   39
I   26   -6  -67  -25    5  -11  -26   21  -50    9   13  -28   17  -35   14   -7   20  -28  -50   23
L   23   -7  -70  -29    2  -13  -27   18  -51    1   15  -31   17  -35   12  -10   16  -29  -51   16
K   43   20  -45   -5   17   13   -2   41  -29   11   20    2   28  -25   33   14   36  -13  -34   29
M   30    1  -62  -23    8   -4  -20   25  -44    5   18  -23   31  -31   18   -2   23  -22  -46   22
F   62   34  -29   12   41   26   13   61   -7   39   51    9   55   17   52   32   56   19   -1   55
P  -31  -62 -123  -80  -56  -66  -80  -34 -106  -64  -54  -84  -48  -99  -23  -61  -39  -88 -110  -44
S   19  -14  -72  -32   -6  -18  -32   15  -57  -17   -8  -35    0  -51    7   -7   12  -41  -62    2
T   11  -21  -78  -41  -12  -26  -39    6  -65  -19  -11  -42   -3  -57    0  -17   12  -46  -67   -1
W   60   31  -32    7   39   26   10   59  -12   31   42    6   49    4   48   28   52   37   -5   47
Y   52   23  -38    0   29   17    2   48  -13   23   34   -1   39   -1   40   20   44    9   -5   41
V   13  -21  -83  -42  -10  -28  -41    6  -67  -11   -6  -44    0  -52   -1  -22    4  -46  -65    9
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V

Entries shown: score of the standard matrix subtracted from that of the adjusted one.

Optimal alignments implied by modes A and D
Mode A: 29.7 bits (H = 0.51 nats)
Mode D: 31.8 bits (H = 0.51 nats)
Mode C: 33.1 bits (H = 0.44 nats)

Performance of various matrices on 143 pairs of related sequences (FEBS J. 272:5101-5109)

Empirical rules for invoking compositional adjustment when comparing two sequences
1: The length ratio of the longer to the shorter sequence is less than 3.

One metric definition of distance between two composition vectors (IEEE Trans. Info. Theo. 49:1858-1860)
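The cited reference describes a metric on probability distributions based on the Jensen-Shannon divergence. A minimal sketch, assuming the square-root-of-Jensen-Shannon form with natural logarithms and a factor of 1/2; the exact normalization behind the thresholds used in this talk may differ, and the name `js_distance` is illustrative.

```python
import math

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (natural logs)."""
    total = 0.0
    for pi, qi in zip(p, q):
        m = 0.5 * (pi + qi)            # midpoint distribution
        if pi > 0:
            total += 0.5 * pi * math.log(pi / m)
        if qi > 0:
            total += 0.5 * qi * math.log(qi / m)
    return math.sqrt(max(total, 0.0))  # clamp tiny negative rounding errors

# Hypothetical three-letter compositions.
x = [0.5, 0.3, 0.2]
y = [0.2, 0.3, 0.5]
z = [0.1, 0.8, 0.1]
d_xy = js_distance(x, y)
```

Unlike the relative entropy itself, this quantity is symmetric and satisfies the triangle inequality, which is what allows compositions to be treated as points with distances between them.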

Empirical rules for invoking compositional adjustment when comparing two sequences
1: The length ratio of the longer to the shorter sequence is less than 3.
2: The distance d between the compositions of the two sequences is less than 0.16.

Law of cosines
In a triangle with sides of length a, b and c, the angle opposite the side of length c is

$$ \theta = \arccos\!\left( \frac{a^2 + b^2 - c^2}{2ab} \right) $$
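The law of cosines gives the angle opposite side c as arccos((a² + b² − c²) / (2ab)). A small sketch of that calculation, with an illustrative function name and a clamp added as a guard against floating-point rounding:

```python
import math

def angle_opposite_c(a, b, c):
    """Angle in degrees opposite side c, from the law of cosines."""
    cos_theta = (a * a + b * b - c * c) / (2 * a * b)
    cos_theta = max(-1.0, min(1.0, cos_theta))   # guard against rounding
    return math.degrees(math.acos(cos_theta))

right_angle = angle_opposite_c(3.0, 4.0, 5.0)    # 3-4-5 right triangle
equilateral = angle_opposite_c(1.0, 1.0, 1.0)    # all angles equal
```

With a metric on compositions, a and b can be taken as the distances from each sequence's composition to a reference composition and c as the distance between the two sequences' compositions, so that the result is the angle at the reference point.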

Empirical rules for invoking compositional adjustment when comparing two sequences
1: The length ratio of the longer to the shorter sequence is less than 3.
2: The distance d between the compositions of the two sequences is less than 0.16.
3: The angle θ made by the compositions of the two sequences with the standard composition is less than 70°.
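The three rules can be evaluated independently once the sequence lengths, the distance d and the angle θ are known. The sketch below reports each rule separately, because the slide does not state whether the rules are combined with "and" or "or"; the function name `check_rules` and the example inputs are assumptions.

```python
def check_rules(len_a, len_b, d, theta_deg):
    """Evaluate the three empirical rules; return one boolean per rule."""
    ratio = max(len_a, len_b) / min(len_a, len_b)
    return {
        "length_ratio": ratio < 3,     # rule 1: longer/shorter length < 3
        "distance": d < 0.16,          # rule 2: composition distance < 0.16
        "angle": theta_deg < 70,       # rule 3: angle below 70 degrees
    }

# Hypothetical sequence pair.
rules = check_rules(len_a=300, len_b=120, d=0.10, theta_deg=50.0)
any_rule = any(rules.values())
all_rules = all(rules.values())
```

Keeping the per-rule results separate leaves the choice of combination to the caller.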

ROCn curves for Aravind set (NAR 29:2994-3005)

ROCn curves for SCOP set (Proc IEEE 9: 1834-1847)

Future directions
Possible less extensive use of SEG when compositional adjustment is invoked.
Application to PSI-BLAST.