Rationale for searching sequence databases

Rationale for searching sequence databases May 11, 2004 Writing projects due May 25 Quiz #3 on Thurs., May 20 Learning objectives-Why do we search sequence databases? Understand the Smith-Waterman algorithm of local alignment and the concept of backtracing. FASTA and BLAST programs. Psi-Blast Workshop-Use of Psi-BLAST to determine sequence similarities. Homework-Due May 20

Why search sequence databases? 1. I have just sequenced a gene. What is known about the gene I sequenced? 2. I have a unique sequence. Is there similarity to another gene that has a known function? 3. I found a new gene in a lower organism. Is it similar to a gene from another species? 4. I have decided to work on a new gene. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform PCR.

Perfect Searches The first “hit” should be an exact match. The next “hits” should contain all of the genes that are related to your gene (homologs). The next “hits” should be similar but are not homologs.

How does one achieve the “perfect search”? Comparison matrices (PAM vs. BLOSUM). Database search algorithms. Databases. Search parameters: Expect value (change the threshold for score reporting); Translation (of DNA sequence into protein); Filtering (remove repeat sequences).

Smith-Waterman Algorithm Advances in Applied Mathematics, 2:482-489 (1981) The Smith-Waterman algorithm is a local alignment tool used to obtain sensitive pairwise similarity alignments. It uses dynamic programming: operating via a matrix, the algorithm uses backtracing and tests alternative paths to the highest scoring alignments, and selects the optimal path as the highest ranked alignment. The sensitivity of the Smith-Waterman algorithm makes it useful for finding local areas of similarity between sequences that are too dissimilar to align over their full length. The S-W algorithm uses a lot of computer memory because it fills a matrix with one cell for every pair of positions in the two sequences. BLAST and FASTA are other search algorithms that use some aspects of S-W.

Smith-Waterman (cont. 1)
a. It searches for both full and partial sequence matches.
b. Assigns a score to each pair of amino acids: uses similarity scores; uses positive scores for related residues; uses negative scores for unfavorable substitutions and gaps.
c. Initializes edges of the matrix with zeros.
d. As the scores are summed in the matrix, any sum below 0 is recorded as a zero.
e. Begins backtracing at the maximum value found anywhere in the matrix.
f. Continues the backtrace until the score falls to 0.

Smith-Waterman (cont. 2) Put zeros on borders. Assign initial scores based on a scoring matrix. Calculate new scores based on adjacent cell scores. If a sum is less than or equal to zero, record zero and begin new scoring with the next cell.

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
 P   0   0   0   0   0   0   0   0   0   0   0
 A   0   0   0   5   0   5   0   0   0   0   0
 W   0   0   0   0   3   0  20  12   4   0   0
 H   0  10   2   0   0   1  12  18  22  14   6
 E   0   2  16   8   0   0   4  10  18  28  20
 A   0   0   8  21  13   5   0   4  10  20  27
 E   0   0   6  13  19  12   4   0   4  16  26

Smith-Waterman (cont. 3) Using the same matrix, begin the backtrace at the maximum value found anywhere in the matrix (28). Continue the backtrace until the score falls to zero.

 AWGHE
 || ||
 AW-HE
 Score = 28
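The matrix fill and backtrace described in the Smith-Waterman slides can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `smith_waterman` is hypothetical, and it uses a flat match/mismatch score with a single linear gap penalty instead of a BLOSUM or PAM matrix, so its scores will not match the slide values.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Assumption: simple match/mismatch scoring and a linear gap penalty,
    not the substitution matrix used on the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Borders (row 0 and column 0) stay at zero.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Any sum below zero is recorded as zero.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Backtrace from the maximum cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXACGTXX", "YYACGTYY"))
    # The slide sequences, scored with the simplified scheme above:
    print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```

Swapping the `match`/`mismatch` lookup for a substitution-matrix lookup would bring the sketch closer to what the slides describe.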

Calculation of percent similarity

 A   W   G   H   E
 A   W   -   H   E
 5   15  -5  10  6    (BLOSUM45 scores; a further -3 gap extension penalty applies at the gap)

% similarity = number of positive scores divided by number of AAs in region x 100
% overall similarity = number of positive scores divided by number of total AAs in region x 100
% similarity = 4/5 x 100 = 80%
% overall similarity = 4/5 x 100 = 80%
Similarity score = 5 + 15 - 5 - 3 + 10 + 6 = 28

FASTA (Pearson and Lipman 1988) This is a combination of word search and the Smith-Waterman algorithm. The query sequence is divided into small words of a certain size. The initial comparison of the query sequence to the database is performed using these “words”. If these “words” are located on the same diagonal in an array, the region surrounding the diagonal is analyzed further. Search time is proportional only to the size of the database, not to (database x query sequence).

The FASTA program uses hash tables. These tables speed up the word search. Query sequence = TCTCTC (positions 1-6). Database sequence = TTCTCTC (positions 1-7). You choose to use word size = 4 for your table (the total number of possible words in your table is 4^4 = 256).

 Word (of 256 possible)   Position in query   Position in DB   Offset (Q minus DB)
 TCTC                     1, 3                2, 4             -1 or -3 or 1
 CTCT                     2                   3                -1
 TTCT                     -                   1                -
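The word table for this example can be reproduced with a short Python sketch. The helper names (`word_positions`, `shared_word_offsets`, `offset_counts`) are illustrative and not taken from the FASTA source; positions are 1-based to match the slide.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Map each k-letter word to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table


def shared_word_offsets(query, db_seq, k):
    """For every word found in both sequences, list the distinct offsets (Q minus DB)."""
    q_table = word_positions(query, k)
    d_table = word_positions(db_seq, k)
    return {
        word: sorted({q - d for q in q_pos for d in d_table[word]})
        for word, q_pos in q_table.items()
        if word in d_table
    }


def offset_counts(query, db_seq, k):
    """Count how many word matches fall on each offset (that is, each diagonal)."""
    q_table = word_positions(query, k)
    d_table = word_positions(db_seq, k)
    counts = Counter()
    for word, q_pos in q_table.items():
        for q in q_pos:
            for d in d_table.get(word, []):
                counts[q - d] += 1
    return counts


if __name__ == "__main__":
    print(shared_word_offsets("TCTCTC", "TTCTCTC", 4))
    print(offset_counts("TCTCTC", "TTCTCTC", 4).most_common())
```

Word matches that share an offset lie on the same diagonal, so the most frequent offset (here -1, with three word matches) marks the region worth examining further.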

FASTA Steps
1. Local regions of identity are found: identical offset values in a contiguous sequence are located and the diagonals are extended (matches with different offset values lie on different diagonals).
2. Rescore the local regions using a PAM or BLOSUM matrix.
3. Eliminate short diagonals below a cutoff score.
4. Create a gapped alignment in a narrow segment and then perform an S-W alignment.

Summary of FASTA steps
1. Analyzes the database for identical matches that are contiguous (between 5 and 10 amino acids in length, with the same offset values).
2. Longest diagonals are scored again using the PAM matrix (or another matrix). The best scores are saved as “init1” scores.
3. Short diagonals are removed.
4. Long diagonals that are neighbors are joined. The score for this joined region is “initn”. This score may be lower due to a penalty for a gap.
5. An S-W dynamic programming alignment is performed around the joined sequences to give an “opt” score. Thus, the time-consuming S-W step is performed only on top scoring sequences.

The ktup value The ktup (for k-tuple) value is the length of the word used to search for identity. For proteins, a ktup value of 3 gives a hash table of 20^3 elements (8000 entries). The higher the ktup value, the less likely you will get a match unless it is identical (remember the dot plots). The lower the ktup value, the more background you will have. The higher the ktup value, the faster the analysis (fewer diagonals). The following rules typically apply when using FASTA:
 ktup   analysis
 1      proteins, distantly related
 2      proteins, somewhat related (default)
 3      DNA, default

FASTA Versions
FASTA: nucleotide or protein sequence searching.
FASTx/FASTy: compare a translated DNA query sequence to a protein sequence database (forward or backward translation of the query).
tFASTx/tFASTy: compare a protein query sequence to a DNA sequence database that has been translated into three forward and three reverse reading frames.

FASTA Statistical Significance A way of measuring the significance of a score considers the mean of the random score distribution. The difference between the similarity score for your single alignment and the mean of the random score distribution is normalized by the standard deviation of that random score distribution. This is the Z-score. Higher Z-scores are better because the further the real score is from this mean (in standard deviation units) the more significant it is.

FASTA Statistical Significance Z-score for a single alignment:

$$ Z = \frac{\text{similarity score} - \text{mean score from database}}{\text{standard deviation from database}} $$

$$ \text{Stand. Dev.} = \sqrt{\frac{\sum \text{scores}^2 - \dfrac{\left(\sum \text{scores}\right)^2}{N}}{N}} $$

where N is the total number of sequences.

[Figure: distribution of similarity scores, marking the mean similarity score of the complete database and the mean similarity score of related records.]

FASTA statistics (cont.) Using the distribution of the z-scores in the database, the FastA program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() value. This value is the number of sequences you would expect to find with this score by searching a database of random sequences. Thus, when z increases, the E() value decreases.
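As a small worked illustration of the z-score definition, the following Python sketch computes it from a list of database similarity scores, using the population standard deviation (dividing by the total number of sequences). The function name `z_score` and the example scores are made up for illustration; the FASTA program itself applies further corrections that are not shown here.

```python
import math


def z_score(score, database_scores):
    """Z-score of one alignment score against the scores of the whole database."""
    n = len(database_scores)                      # total number of sequences
    total = sum(database_scores)                  # sum of scores
    total_sq = sum(s * s for s in database_scores)  # sum of squared scores
    mean = total / n
    # Population standard deviation: sqrt((sum(s^2) - (sum s)^2 / N) / N)
    sd = math.sqrt((total_sq - total * total / n) / n)
    return (score - mean) / sd


if __name__ == "__main__":
    scores = [2, 4, 4, 4, 5, 5, 7, 9]   # toy data: mean 5, standard deviation 2
    print(z_score(9, scores))           # two standard deviations above the mean
```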

Evaluating the Results of FASTA
Best scores: Init1: 2847 Initn: 2847 Opt: 2847 z-score: 2609.2 E(): 1.4e-138 Smith-Waterman score: 2847; 100.0% identity in 413 overlap
Good scores: Init1: 719 Initn: 748 Opt: 793 z-score: 734.0 E(): 3.8e-34 Smith-Waterman score: 796; 41.3% identity in 378 overlap
Mediocre scores: Init1: 249 Initn: 304 Opt: 260 z-score: 243.2 E(): 8.3e-07 Smith-Waterman score: 270; 35.0% identity in 183 overlap

BLAST Basic Local Alignment Search Tool Speed is achieved by: pre-indexing the database before the search; parallel processing; using a hash table that contains neighborhood words rather than just random words.

Neighborhood words The program declares a hit if the word taken from the query sequence has a score >= T when a scoring matrix is used. This allows the word size W (similar to the ktup value) to be kept high (for speed) without sacrificing sensitivity. If T is increased by the user, the number of background hits is reduced and the program will run faster.
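The effect of the threshold T can be illustrated with a toy Python sketch that enumerates every word scoring at least T against a query word. The names `neighborhood_words` and `toy_score`, the four-letter alphabet, and the +5/-1 scoring are all assumptions made for brevity; BLAST itself scores words with a substitution matrix such as BLOSUM62 over the full amino acid alphabet.

```python
from itertools import product


def toy_score(a, b):
    """Stand-in for a substitution matrix: +5 for identity, -1 otherwise."""
    return 5 if a == b else -1


def neighborhood_words(word, alphabet, score, threshold):
    """All words of the same length whose summed pair score against `word` is >= threshold."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


if __name__ == "__main__":
    for t in (9, 10):
        words = neighborhood_words("ACD", "ACDE", toy_score, t)
        print(t, len(words), words)
```

With this toy scoring, T = 9 admits the exact word plus every single-substitution neighbor, while T = 10 admits only the exact word, which shows how raising T shrinks the list of words that can trigger a hit.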

Comparison Matrices In general, the BLOSUM series is thought to be superior to the PAM series because it is derived from areas of conserved sequences. It is important to vary the parameters when performing a sequence comparison. Similarity scores for truly related sequences are usually not sensitive to changes in scoring matrix and gap penalty. Thus, if your “hits list” holds up after changing these parameters you can be more sure that you are detecting similar sequences.

Which Program should one use? Most researchers use methods for determining local similarities: Smith-Waterman (gold standard), FASTA, and BLAST. FASTA and BLAST do not find every possible alignment of the query with a database sequence; they are used because they run faster than S-W.

What are the different BLAST programs?
blastp compares an amino acid query sequence against a protein sequence database.
blastn compares a nucleotide query sequence against a nucleotide sequence database.
blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
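The "all reading frames" translation that blastx, tblastn, and tblastx rely on can be sketched as follows. This is a minimal illustration using the standard genetic code; the function names are hypothetical, and incomplete trailing codons are simply dropped.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

Stop codons are reported as `*`; a real search program would compare each of the six translated frames against the protein query or database.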

When to use the correct program (Problem / Program / Explanation)
Problem: Identify unknown protein
 BLASTP; FASTA3: General protein comparison. Use ktup=2 for speed; ktup=1 for a sensitive search.
 Smith-Waterman: Slower than FASTA3 and BLAST but provides maximum sensitivity.
 TFASTX3; TFASTY3; TBLASTN: Use if a homolog cannot be found in protein databases; approx. 33% slower.
 Psi-BLAST: Finds distantly related sequences. It replaces the query sequence with a position-specific score matrix after an initial BLASTP search. Then it uses the matrix to find distantly related sequences.

When to use the correct program (cont. 1) (Problem / Program / Explanation)
Problem: Identify new orthologs
 TFASTX3; TFASTY3; TBLASTN; TBLASTX: Use PAM matrix <=20 or BLOSUM90 to avoid detecting distant relationships. Search EST sequences within the same species.
Problem: Identify EST sequence
 FASTX3; FASTY3; BLASTX; TBLASTX: Always attempt to translate your sequence into protein prior to searching.
Problem: Identify DNA sequence
 FASTA; BLASTN: Nucleotide sequence comparison.

Choosing the database Remember that the E value increases approximately linearly with database size. When searching for distant relationships, always use the smallest database likely to contain the homolog of interest. Thought problem: If the E-value one obtains for a search is 12 in Swiss-PROT and the E-value one obtains for the same search is 74 in PIR, how large is PIR compared to Swiss-PROT? 74/12 = ~6, so PIR is roughly six times the size of Swiss-PROT.
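The linear dependence on database size can be seen from the Karlin-Altschul expression for the expected number of chance hits. The symbols below are not from the slides: K and λ are statistical parameters of the scoring system, m is the query length, n is the total database length, and S is the alignment score. For the same query and the same score, only n changes between databases:

$$ E = K\,m\,n\,e^{-\lambda S} \quad\Longrightarrow\quad \frac{E_{\text{PIR}}}{E_{\text{Swiss-PROT}}} \approx \frac{n_{\text{PIR}}}{n_{\text{Swiss-PROT}}} = \frac{74}{12} \approx 6.2 $$

This is an idealized reading of the thought problem; it assumes the score of the hit and the statistical parameters are identical in both searches.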

Filtering Repetitive Sequences Over 50% of genomic DNA is repetitive. This is due to: retrotransposons, ALU regions, microsatellites, centromeric sequences, telomeric sequences, and the 5’ untranslated regions of ESTs. Example of an EST with a simple low-complexity region: T27311 GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTC TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC

Filtering Repetitive Sequences (cont. 1) Programs like BLAST have the option of filtering out low-complexity regions. Repetitive sequences increase the chance of a spurious match during a database search.
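To make the idea of filtering concrete, here is a deliberately simple Python sketch that masks short tandem repeats with N before a search. It is not the algorithm BLAST uses for its low-complexity filter; the function name `mask_simple_repeats` and its parameters (`max_unit`, `min_copies`) are assumptions chosen for illustration.

```python
import re


def mask_simple_repeats(seq, max_unit=3, min_copies=6, mask_char="N"):
    """Replace tandem repeats of a 1..max_unit base unit (>= min_copies copies) with mask_char."""
    # ([ACGT]{1,3}?) captures the shortest repeat unit; \1{5,} requires at least
    # five further copies, i.e. six copies in total with the default settings.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


if __name__ == "__main__":
    est_like = "GGCACGAG" + "TC" * 10 + "GGA"
    print(mask_simple_repeats(est_like))
```

Masked positions no longer seed word matches, so a dinucleotide run like the TC repeat in the EST example cannot dominate the hit list.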

PSI-BLAST PSI stands for position-specific iterative. A position-specific scoring matrix (PSSM) is constructed automatically from multiple HSPs of an initial BLAST search; a normal E value is used for this search. This PSSM is then used as the new scoring matrix for a second BLAST search; a low E value is used (E = .001). Result: 1) obtain distantly related sequences; 2) find out the important residues that provide function or structure.
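The idea of a position-specific scoring matrix can be sketched in Python as a table of log-odds scores, one column per alignment position. This is a simplified illustration, not PSI-BLAST's actual procedure (which also weights sequences and uses more elaborate pseudocounts); the function names, the uniform background, and the four-letter toy alphabet are assumptions.

```python
import math


def build_pssm(aligned, alphabet, background=None, pseudocount=1.0):
    """Build log2-odds scores per column from equal-length, gap-free aligned segments."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    pssm = []
    for column in zip(*aligned):
        scores = {}
        for a in alphabet:
            # Observed count plus a pseudocount spread according to the background.
            count = column.count(a) + pseudocount * background[a]
            freq = count / (len(column) + pseudocount)
            scores[a] = math.log2(freq / background[a])
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the position-specific scores of a segment with the same length as the PSSM."""
    return sum(col[res] for col, res in zip(pssm, segment))


if __name__ == "__main__":
    hits = ["ACD", "ACE", "ACD"]          # toy stand-ins for aligned HSP segments
    pssm = build_pssm(hits, "ACDE")
    for candidate in ("ACD", "ACE", "EEE"):
        print(candidate, round(score_with_pssm(pssm, candidate), 2))
```

Columns where every hit agrees (the first two positions here) produce strongly positive scores for the conserved residue and negative scores for everything else, which is how the matrix highlights residues likely to matter for function or structure.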