1-month Practical Course Genome Analysis Iterative homology searching

Presentation transcript:

1-month Practical Course Genome Analysis: Iterative homology searching. Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit Amsterdam, The Netherlands. www.ibivu.cs.vu.nl heringa@cs.vu.nl

PSI (Position Specific Iterated) BLAST, basic idea: use the results from a BLAST query to construct a profile matrix; search the database with the profile instead of the query sequence; iterate.

A Profile Matrix (Position Specific Scoring Matrix, PSSM): this is the same as a profile without position-specific gap penalties.

PSI BLAST: Searching with a Profile. Aligning a profile matrix to a simple sequence is like aligning two sequences, except that the score for aligning a character with a matrix position is given by the matrix itself, not by a substitution matrix.
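To make the difference from substitution-matrix scoring concrete, here is a minimal sketch of scoring a sequence against a profile without gaps. The three-letter alphabet, the numbers in `toy_pssm`, and the function names are all invented for illustration; PSI-BLAST itself uses gapped alignment and a 20-letter alphabet.

```python
# Toy PSSM: one dict per profile position, mapping residue -> score.
# All numbers are made up for illustration only.
toy_pssm = [
    {"A": 4, "C": -2, "D": -3},
    {"A": -1, "C": 6, "D": -4},
    {"A": -2, "C": -3, "D": 5},
]


def score_window(pssm, segment):
    """Ungapped score of a segment aligned to the profile, position by position.

    The score for each residue is looked up in the profile column it is
    aligned to, not in a general substitution matrix.
    """
    return sum(column[residue] for column, residue in zip(pssm, segment))


def best_ungapped_hit(pssm, sequence):
    """Slide the profile along the sequence; return (best score, start index)."""
    width = len(pssm)
    scores = [
        (score_window(pssm, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores)


print(best_ungapped_hit(toy_pssm, "DDACDA"))  # (15, 2): the window "ACD"
```

The same residue receives a different score depending on the profile position it is aligned to, which is what makes the matrix position-specific.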

PSI BLAST: Constructing the Profile Matrix Figure from: Altschul et al. Nucleic Acids Research 25, 1997

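PSI BLAST: Determining Profile Elements. The value for a given element of the profile matrix is given by $\log \frac{\Pr(a_i \mid \text{column } j)}{\Pr(a_i)}$, where the probability of seeing amino acid $a_i$ in column $j$ is estimated as $\Pr(a_i \mid \text{column } j) = \frac{\alpha f_{ij} + \beta g_{ij}}{\alpha + \beta}$, with $f_{ij}$ the observed frequency and $g_{ij}$ the pseudocount; e.g. $\alpha$ = number of sequences in profile, $\beta = 1$.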
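As a rough sketch of how one profile column could be computed, the code below mixes observed frequencies with pseudocounts using weights alpha and beta and then takes a log-odds ratio against the background probability. Several things here are assumptions rather than the authors' method: the pseudocount for each residue is simply its background probability (Altschul et al. derive it from substitution-matrix target frequencies), the logarithm is base 2, and the three-letter alphabet is a toy.

```python
import math


def pssm_column(counts, background, alpha=None, beta=1.0):
    """Log-odds scores for one alignment column.

    counts:     residue -> number of times it is observed in the column
    background: residue -> background probability Pr(a_i)
    alpha:      weight on observed frequencies (default: number of sequences)
    beta:       weight on pseudocounts
    """
    n_seqs = sum(counts.values())
    if alpha is None:
        alpha = float(n_seqs)
    scores = {}
    for residue, p_bg in background.items():
        f = counts.get(residue, 0) / n_seqs  # observed frequency
        g = p_bg                             # pseudocount (assumed: background)
        p = (alpha * f + beta * g) / (alpha + beta)
        scores[residue] = math.log2(p / p_bg)
    return scores


# Toy column: four sequences, all with residue A at this position.
toy_background = {"A": 0.5, "C": 0.25, "D": 0.25}
toy_column = pssm_column({"A": 4}, toy_background)
print(toy_column)
```

Because of the pseudocounts, residues never observed in the column still get a finite (negative) score instead of minus infinity.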

PSI-BLAST iteration [Figure: the query sequence Q is searched against the database with gapped BLAST; a PSSM (labelled A, C, D, ..., Y and Pi, Px) is built from the database hits; the PSSM is then used in the next gapped BLAST search, and the cycle is iterated.]

PSI-BLAST steps in words
1 - Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; these regions are excluded from alignment.
2 - The program then initially operates on a single query sequence by performing a gapped BLAST search.
3 - Then, the program takes the significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment.
4 - The database is rescanned in a subsequent round, using the PSSM, to find more homologous sequences.
5 - Iteration continues until the user decides to stop or the search has converged.
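The control flow of these steps can be sketched as a loop that stops when a round finds no new hits. This is only an illustration of the iteration and convergence logic: `iterate_search`, `toy_search`, `toy_build_profile` and the `neighbours` table are hypothetical stand-ins, not part of BLAST, and the "profile" here is just a set of sequence names.

```python
def iterate_search(query, search, build_profile, max_rounds=5):
    """Repeat profile searches until no new hits appear or max_rounds is hit."""
    hits = set(search(query))  # round 1: ordinary search with the query alone
    rounds = 1
    while rounds < max_rounds:
        profile = build_profile(query, hits)
        new_hits = set(search(profile)) | hits
        rounds += 1
        if new_hits == hits:   # converged: this round found nothing new
            break
        hits = new_hits
    return hits, rounds


# Toy stand-ins: each sequence can only "see" its direct neighbour.
neighbours = {"Q": {"A"}, "A": {"B"}, "B": {"C"}, "C": set()}


def toy_search(model):
    """Return everything reachable in one step from any member of the model."""
    members = {model} if isinstance(model, str) else model
    found = set()
    for member in members:
        found |= neighbours[member]
    return found


def toy_build_profile(query, hits):
    """A 'profile' here is simply the set of sequences found so far."""
    return frozenset(hits | {query})


hits, rounds = iterate_search("Q", toy_search, toy_build_profile)
print(sorted(hits), rounds)  # ['A', 'B', 'C'] 4
```

In the toy example the distant relative C is not detectable from Q directly; it is only reached after the profile has absorbed A and B, which mirrors why iterating can find remote homologues.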

PSI-BLAST entry page [Screenshot annotations: "Paste your query sequence"; "Switch this off for default run"]

1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).
3 - E Value (Expect Value) describes the number of hits with a similar or better score that are expected to occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117, meaning that a sequence with a similar score is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results to related entries in other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's Molecular Modeling DataBase.

‘X’ residues denote low-complexity sequence fragments that are ignored

Alignment Bit Score B = (S – ln K) / ln 2 S is the raw alignment score The bit score (‘bits’) B has a standard set of units The bit score B is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment  and K and are the statistical parameters of the scoring system (BLOSUM62 in Blast). See Altschul and Gish, 1996, for a collection of values for  and K over a set of widely used scoring matrices. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices)

Normalised sequence similarity. The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences. This probability follows the Poisson distribution (Waterman and Vingron, 1994): $P(x, n) = 1 - e^{-n \cdot P(S \ge x)}$, where n is the number of sequences in the database. The p-value thus depends on x and on the (fixed) database size n.

Normalised sequence similarity: statistical significance. The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences: $E(x, n) = n \cdot P(S \ge x)$. For example, if the E-value is 0.01, then the expected number of random hits with score $S \ge x$ is 0.01, which means that a chance hit scoring this well is expected only once in 100 independent searches over the database. If the E-value of a hit is 5, then five fortuitous hits with $S \ge x$ are expected within a single database search, which renders the hit not significant.
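The two definitions, E(x, n) = n * P(S >= x) and P(x, n) = 1 - exp(-n * P(S >= x)), can be compared numerically with a short sketch. The database size and per-sequence probabilities below are arbitrary illustrative numbers, not values from the course.

```python
import math


def e_value(n_sequences, p_single):
    """Expected number of chance hits: E(x, n) = n * P(S >= x)."""
    return n_sequences * p_single


def p_value(n_sequences, p_single):
    """Probability of at least one chance hit: P(x, n) = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-e_value(n_sequences, p_single))


n = 100_000  # illustrative database size
for p_single in (1e-7, 5e-5):
    print(e_value(n, p_single), p_value(n, p_single))
```

For the small E-value (0.01) the p-value is almost the same number, whereas for E = 5 the p-value saturates near 1: at least one chance hit is then practically certain.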

A model for database searching score probabilities. Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955). Using the EVD, the raw alignment scores are converted to a statistical score (E value) that takes into account the database amino acid composition and the scoring scheme (a.a. exchange matrix).

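Extreme Value Distribution: $y = 1 - \exp(-e^{-\lambda(x-\mu)})$. [Figure: probability density function for the extreme value distribution resulting from parameter values $\mu = 0$ and $\lambda = 1$, i.e. $y = 1 - \exp(-e^{-x})$], where $\mu$ is the characteristic value and $\lambda$ is the decay constant. Note that the expression for y is the tail probability $P(S \ge x)$; the density is obtained from it by differentiation.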
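A minimal sketch of the extreme value distribution with characteristic value mu and decay constant lambda may help here. `evd_tail` implements y = 1 - exp(-exp(-lambda * (x - mu))), and `evd_density` is the corresponding density obtained by differentiating; both function names are made up for this example.

```python
import math


def evd_tail(x, mu=0.0, lam=1.0):
    """Tail probability P(S >= x) = 1 - exp(-exp(-lam * (x - mu)))."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


def evd_density(x, mu=0.0, lam=1.0):
    """Density of the EVD: lam * exp(-z - exp(-z)) with z = lam * (x - mu)."""
    z = lam * (x - mu)
    return lam * math.exp(-z - math.exp(-z))


for x in (-2, 0, 2, 4):
    print(x, round(evd_density(x), 4), round(evd_tail(x), 4))
```

With mu = 0 and lambda = 1 the density peaks at x = 0 and is clearly asymmetric: it is much larger at x = 2 than at x = -2, i.e. the right tail falls off more slowly than the left one.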

Extreme Value Distribution (EVD) [Figure: EVD approximation compared with real data]. You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit.

Extreme Value Distribution. The probability of a score S being larger than or equal to a given value x can be calculated following the EVD (this is the basis of the E-value) as: $P(S \ge x) = 1 - \exp(-e^{-\lambda(x-\mu)})$, where $\mu = (\ln Kmn)/\lambda$, m and n are the lengths of the query and of the database, and K is a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for $\lambda$ and $K$ over a set of widely used scoring matrices).

Extreme Value Distribution. Using the equation for $\mu$ (preceding slide), the probability for the raw alignment score S becomes $P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$. In practice, the probability $P(S \ge x)$ is estimated using the approximation $1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid for large values of x. This leads to a simplification of the equation for $P(S \ge x)$: $P(S \ge x) \approx e^{-\lambda(x-\mu)} = Kmn\,e^{-\lambda x}$. The lower the probability (E value) for a given threshold value x, the more significant the score S.
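How good the approximation is can be checked numerically by comparing the exact expression 1 - exp(-K*m*n*exp(-lambda*x)) with the simplified K*m*n*exp(-lambda*x). In this sketch m and n are taken to be the query length and the total database length, and the values of lambda, K, m and n are assumptions chosen only for illustration.

```python
import math


def approx_p(score, m, n, lam, k):
    """Simplified form: K * m * n * exp(-lam * score)."""
    return k * m * n * math.exp(-lam * score)


def exact_p(score, m, n, lam, k):
    """Exact form: 1 - exp(-K * m * n * exp(-lam * score))."""
    return -math.expm1(-approx_p(score, m, n, lam, k))


# Assumed illustrative values: gapped BLOSUM62-like parameters, a query of
# 250 residues and a database of 1e8 residues.
LAM, K, M, N = 0.267, 0.041, 250, 1e8

for score in (60, 90, 120):
    print(score, approx_p(score, M, N, LAM, K), exact_p(score, M, N, LAM, K))
```

For high scores the two numbers agree to several digits, which is the regime that matters for significant hits. For low scores the simplified value exceeds 1 and can no longer be read as a probability, only as an expected number of chance hits.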

Normalised sequence similarity: statistical significance. Database searching is commonly performed using an E-value threshold between 0.1 and 0.001. Lower E-value thresholds decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.

Words of Encouragement “There are three kinds of lies: lies, damned lies, and statistics” – Benjamin Disraeli “Statistics in the hands of an engineer are like a lamppost to a drunk – they’re used more for support than illumination” “Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates

Conserved hypothetical proteins have putative homologues in the database, but these are of unknown function.