Comparing Database Search Methods & Improving the Performance of PSI-BLAST Stephen Altschul.

“Gold standards” for protein classification
Traditional curated sequence databases with family and superfamily classifications: PIR, SWISS-PROT
Structure-based protein domain classification: SCOP

Measuring retrieval accuracy

Sequence search result   Related               Unrelated             Total
Positive                 TP (true positive)    FP (false positive)   P = TP + FP
Negative                 FN (false negative)   TN (true negative)    N = FN + TN
Total                    R = TP + FN           U = FP + TN

Sensitivity: TP/R
Specificity: TP/P
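The two ratios on this slide can be computed directly from the four cells of the table. The following is a minimal Python sketch; the function name `retrieval_accuracy` and the example counts are hypothetical and chosen only for illustration. Note that "specificity" here follows the slide's definition, TP/P.

```python
def retrieval_accuracy(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity and specificity as defined on the slide.

    tn is accepted for completeness of the 2x2 table; neither ratio uses it.
    """
    related = tp + fn    # R = TP + FN
    positive = tp + fp   # P = TP + FP
    if related == 0 or positive == 0:
        raise ValueError("need at least one related record and one positive call")
    return {
        "sensitivity": tp / related,   # TP / R
        "specificity": tp / positive,  # TP / P (the slide's definition)
    }


# Hypothetical counts, for illustration only.
example = retrieval_accuracy(tp=45, fp=5, fn=15, tn=935)
```

With these made-up counts, R = 60 and P = 50, so the search recovers three quarters of the related sequences, and nine in ten of its positive calls are correct.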

Receiver Operating Characteristic curve [Figure labels: False +, True –, False –, True +]

Random retrieval on a ROC plot

Line of fixed sensitivity

Line of fixed specificity

Line of fixed crossover ratio

ROC score: area under the ROC curve

Region of interest in ROC analysis

Truncated ROC, or ROC_n, curve [Figure: x-axis “Fraction unrelated accepted”, ticks at 0 and 10^–3]

ROC_n score: area under the ROC_n curve

Questions concerning ROC analysis
What false-positive cutoff value should be used?
When does it make sense to pool the results of database searches?
When are the ROC scores for two different methods significantly different?

Marginal ratio of true to false positives

Definition of the ROC_n score
t: total number of related sequences
t_i: number of related sequences (true positives) returned before the i-th false positive
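The formula itself did not survive in this transcript. The form commonly used in the literature with these symbols is ROC_n = (1 / (n·t)) · Σ_{i=1..n} t_i, and the Python sketch below assumes that form. The function name `roc_n`, the boolean-list input, and the handling of rankings with fewer than n false positives (the missing t_i are set to the number of true positives retrieved) are assumptions made for illustration.

```python
from typing import Sequence


def roc_n(ranked_is_related: Sequence[bool], t: int, n: int) -> float:
    """ROC_n for a ranked hit list, best hit first.

    ranked_is_related[k] is True when the k-th ranked record is related
    to the query. t is the total number of related sequences in the
    database; n is the number of false positives at which to truncate.
    """
    if t <= 0 or n <= 0:
        raise ValueError("t and n must be positive")
    true_seen = 0
    t_values = []  # t_i for i = 1..n
    for is_related in ranked_is_related:
        if is_related:
            true_seen += 1
        else:
            t_values.append(true_seen)
            if len(t_values) == n:
                break
    # Assumption: if the list holds fewer than n false positives,
    # the remaining t_i equal the true positives retrieved so far.
    t_values.extend([true_seen] * (n - len(t_values)))
    return sum(t_values) / (n * t)


# Hypothetical ranking: T T F T F F, with 4 related sequences in total.
example_score = roc_n([True, True, False, True, False, False], t=4, n=2)
```

In the example, t_1 = 2 and t_2 = 3, so the score is (2 + 3) / (2 · 4). A ranking that places all related sequences ahead of the first false positive scores 1.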

“Random distribution” of ROC_n scores
Bootstrap resampling can be used to assign a statistical significance to differences in ROC_n scores.
Under reasonable assumptions, the distribution of bootstrapped ROC_n scores is approximately normal.
Resampling a small subset in a large database is equivalent to resampling the subset with independent Poisson distributions with mean 1.

Bootstrap resampling of false positives
Retrieval: ranking of the database.
The false records are the noise.
The true records are well characterized.
Only false records are resampled with replacement.
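One way to picture this resampling is to leave every true record in place and give each false record an independent Poisson(mean 1) number of copies, then recompute the truncated ROC score. The Python sketch below does that under stated assumptions: the score is taken to be the sum of t_i over the first n false positives divided by n·t, a ranking that runs out of false positives is padded with the final true-positive count, and all function names and the example ranking are hypothetical.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def poisson1(rng: random.Random) -> int:
    """Draw from a Poisson distribution with mean 1 (Knuth's method)."""
    limit = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def resampled_roc_n(ranked_is_related: Sequence[bool], t: int, n: int,
                    rng: random.Random) -> float:
    """ROC_n of one bootstrap replicate; only false records are resampled."""
    true_seen = false_seen = total = 0
    for is_related in ranked_is_related:
        if is_related:
            true_seen += 1  # true records are kept exactly once
            continue
        copies = min(poisson1(rng), n - false_seen)
        total += copies * true_seen  # each copy contributes one t_i
        false_seen += copies
        if false_seen == n:
            break
    total += (n - false_seen) * true_seen  # assumption: pad a short list
    return total / (n * t)


def bootstrap_roc_n(ranked_is_related: Sequence[bool], t: int, n: int,
                    replicates: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of ROC_n over bootstrap replicates."""
    rng = random.Random(seed)
    scores = [resampled_roc_n(ranked_is_related, t, n, rng)
              for _ in range(replicates)]
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical ranking: true and false records interleaved, then only false ones.
ranking = [True, True, False] * 10 + [False] * 20
mean_score, sd_score = bootstrap_roc_n(ranking, t=20, n=10)
```

The mean and standard deviation returned here are empirical estimates from the replicates. They are not the closed-form mean and variance referred to on the following slides, whose formulas are not reproduced in this transcript.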

Mean and variance for the normal distribution of ROC_n scores yielded by resampling only the false positives

Mean and variance for the normal distribution of the difference of two ROC_n scores, yielded by resampling only the false positives

PSI-BLAST in a nutshell
1. With a protein sequence as query, use BLAST to search a protein sequence database.
2. Collapse significant local alignments (those with E-value less than or equal to a set threshold h) into a multiple alignment, using the residues of the query sequence as alignment-column placeholders.
3. Abstract a position-specific score matrix from the multiple alignment.
4. Search the database with the score matrix as query.
5. Iterate a fixed number of times, or until convergence.
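The step of abstracting a position-specific score matrix from the multiple alignment can be illustrated with a deliberately simplified Python sketch. It is not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a flat pseudocount, no sequence weighting, and an alignment already trimmed to the query's columns (query in the first row, `-` for gaps). The names `build_pssm` and `score_segment` and the toy alignment are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Toy log-odds matrix with one column per query position.

    alignment[0] is the query; every row has the query's length, with '-'
    where a database sequence has no residue aligned to that position.
    """
    query = alignment[0]
    if any(len(row) != len(query) for row in alignment):
        raise ValueError("all rows must have the query's length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(query)):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm: List[Dict[str, float]], segment: str) -> float:
    """Ungapped score of a segment aligned to the matrix from column 0."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical three-column alignment; the first row is the query.
toy_alignment = ["MKV", "MRV", "M-V", "MKI"]
toy_pssm = build_pssm(toy_alignment)
```

Residues seen often in a column receive positive scores and unseen residues receive negative ones, so a segment resembling the aligned family scores higher than an unrelated one. That is the property the next search iteration exploits.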

Protocol for evaluating PSI-BLAST
1. For each query sequence, search a comprehensive protein sequence database (e.g. NCBI’s nr) through a fixed number of PSI-BLAST iterations, or until convergence.
2. Use the resulting position-specific score matrix to search the “gold standard” database.
3. Pool the search results for ROC analysis.
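The pooling step might look like the following Python sketch. It assumes that hits from all queries are merged into one list ranked by E-value, that t is the sum of the per-query counts of related sequences, and that the truncated ROC score is the sum of t_i over the first n false positives divided by n·t. The slides do not spell out these details, so treat the function `pooled_roc_n` and the example hits as hypothetical.

```python
from typing import List, Tuple

Hit = Tuple[float, bool]  # (E-value, is the hit related to its query?)


def pooled_roc_n(per_query_hits: List[List[Hit]],
                 related_totals: List[int], n: int) -> float:
    """Truncated ROC score over the hits of all queries, pooled by E-value."""
    pooled = sorted((hit for hits in per_query_hits for hit in hits),
                    key=lambda hit: hit[0])  # smallest E-value first
    t = sum(related_totals)  # related sequences summed over all queries
    if t <= 0 or n <= 0:
        raise ValueError("t and n must be positive")
    true_seen = false_seen = total = 0
    for _, is_related in pooled:
        if is_related:
            true_seen += 1
        else:
            total += true_seen  # t_i for this false positive
            false_seen += 1
            if false_seen == n:
                break
    total += (n - false_seen) * true_seen  # assumption: pad a short list
    return total / (n * t)


# Hypothetical results for two queries against the gold-standard database.
hits_a = [(1e-30, True), (1e-5, True), (0.5, False)]
hits_b = [(1e-10, True), (0.01, False), (2.0, True)]
pooled_score = pooled_roc_n([hits_a, hits_b], related_totals=[2, 3], n=2)
```

Pooling only makes sense if E-values are comparable across queries, which is one of the open questions about ROC analysis raised earlier in the talk.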

The effect of acceptance threshold h on PSI-BLAST accuracy

Some ideas for improving PSI-BLAST
1. New statistical parameters
2. Smith-Waterman alignment
3. Substitution matrix frequency ratios
4. Apply SEG to database sequences
5. Composition-based statistics
6. “Concentrated” accounting of gaps
7. “Dispersed” accounting of gaps
8. Exponentiate Henikoff weights
9. Reverse sequence normalization
10. Window for amino acid composition
11. Use pseudocounts with composition window
12. Vary gap costs
13. Generalized affine gap costs
14. Substitution score offset
15. Information-dependent pseudocount parameter
16. Database-sequence length-normalization
17. Restricted score rescaling
18. Adjust purging percentage
19. Adjust pseudocount parameter
20. Adjust acceptance threshold

The effect of composition-based statistics on PSI-BLAST accuracy

Composition-based statistics
Statistics based on “standard” amino acid frequencies can differ by orders of magnitude from those based upon the peculiar composition of two proteins.
Standard protein: 4.5% N
DNA pol III, β chain [M. genitalium]: 12.1% N
DNA pol III, β chain [C. jejuni]: 7.6% N
Depending upon the composition assumed, a search of nr with M. genitalium DNA pol III as query yields different E-values for C. jejuni DNA pol III, as well as for the highest-scoring false positive.
[Table of E-values under “standard” and composition-based statistics: values not recovered]
At a threshold of [value not recovered], “standard” statistics yield 54 true positives, while at 0.1, composition-based statistics yield 55 true positives.

The effect of dispersed accounting of gaps on PSI-BLAST accuracy

The effect of restricted score rescaling and parameter tuning on PSI-BLAST accuracy

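Accuracy of PSI-BLAST (ROC_100 score by program version; entries shown as [...] were not recovered)
Original: h = [...]; ROC_100 = [...] ± [...]
Composition-based statistics: h = [...]; ROC_100 = [...] ± [...]
“Dispersed” gap accounting: h = [...]; ROC_100 = [...] ± [...]
Restricted score rescaling: b = 9; p = [...]; ROC_100 = [...] ± 0.003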

PSI-BLAST accuracy as a function of the number of iterations


Acknowledgements
Analysis of ROC_n score distribution: John Spouge, Eva Czabarka
Improvements to PSI-BLAST: Alejandro Schäffer, L. Aravind, Thomas Madden, Sergei Shavirin, John Spouge, Yuri Wolf, Eugene Koonin