Database Searching for Similar Sequences: search a sequence database for sequences that are similar to a query sequence.

Presentation transcript:

Database Searching for Similar Sequences
- Search a sequence database for sequences that are similar to a query sequence.
- Provide a list of database sequences with which the query sequence can be aligned well.
- Key issue: efficiency.

Database Searching for Similar Sequences: Methods
- Smith-Waterman requires on the order of N^2 L computations.
- Popular database searching methods (heuristic methods):
  - FASTA [Pearson & Lipman, 1988]
  - BLAST [Altschul et al., 1990]
- Trade-off of using a fast heuristic method: accuracy (sensitivity and selectivity).

FASTA
- Problem with the Smith-Waterman algorithm: too many calculations are "wasted" by comparing regions that have nothing in common.
- Initial insight: regions that are similar between two sequences are likely to share short stretches that are identical.
- Basic method: look for similar regions only near short stretches that match exactly ("hit and extend" sequence searching).

Diagonal Method Example
[Figure: the diagonal method applied to the sequences LVIQAAYFRAH and AIQAAMDV, showing the look-up table of residue positions and the resulting offset vector.]
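The diagonal method in this example can be sketched roughly as follows. This is a minimal illustration, not FASTA's actual implementation: the function names are hypothetical, positions are 1-based, and the choice of which sequence is indexed in the look-up table is an assumption.

```python
from collections import defaultdict


def lookup_table(seq):
    """Map each residue to the list of 1-based positions where it occurs."""
    table = defaultdict(list)
    for pos, residue in enumerate(seq, start=1):
        table[residue].append(pos)
    return dict(table)


def offset_counts(indexed, scanned):
    """Count identical residue pairs on each diagonal (offset = i - j)."""
    table = lookup_table(indexed)
    counts = defaultdict(int)
    for j, residue in enumerate(scanned, start=1):
        for i in table.get(residue, []):
            counts[i - j] += 1
    return dict(counts)


# The two sequences from the slide; the diagonal with the most identities
# is where a similar region is most likely to lie.
counts = offset_counts("LVIQAAYFRAH", "AIQAAMDV")
best_offset = max(counts, key=counts.get)
```

Only the diagonals with high counts would then be examined further, which is what saves work compared with filling in a full dynamic-programming matrix.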

Limitations of FASTA
- FASTA can miss significant similarity, for the following reasons.
- For nucleic acids, due to codon "wobble", DNA sequences may look like XXy, where the X's are conserved and the y's are not.
  - GGuUCuACgAAg and GGcUCcACaAAa both code for the same peptide sequence (Gly-Ser-Thr-Lys), but they do not match with a k-tuple size of 3 or higher.
- For proteins, similar sequences do not have to share identical residues.
  - Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly, but there is no match with a k-tuple size of 2.
  - Asp-Lys-Val is quite similar to Glu-Arg-Ile, yet it is missed even with a k-tuple size of 1.
- Score? Ala-Ala-Ala-Ala-Ala vs. Ala-Ala-Ala-Ala-Ala
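The examples above can be checked with a short sketch. The helper `shared_ktuples` is a hypothetical name, and the sequences are written in one-letter code (G = Gly, D = Asp, K = Lys, E = Glu, R = Arg, V = Val, I = Ile).

```python
def shared_ktuples(a, b, k):
    """Return the set of length-k substrings that occur in both sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return words_a & words_b


# Codon wobble: same peptide, but no exact 3-tuple in common.
rna_1 = "GGUUCUACGAAG"
rna_2 = "GGCUCCACAAAA"
no_triplet_hit = not shared_ktuples(rna_1, rna_2, 3)

# Similar proteins with no identical 2-tuple, and none even at k = 1.
no_pair_hit = not shared_ktuples("GDGKG", "GEGRG", 2)
no_single_hit = not shared_ktuples("DKV", "ERI", 1)
```

Because an exact-match seed requires at least one shared k-tuple, an empty set here means a FASTA-style search with that k never starts an extension for the pair.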

BLAST
- What does BLAST stand for?
- Basic Local Alignment Search Tool

BLAST
- BLAST is similar to FASTA, but it searches for words that score at least T rather than words that match exactly.
- It is also faster because its implementation has been optimized for parallel UNIX architectures from an early stage.
- Reference: S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215 (1990).

BLAST basics
BLAST is mainly a 3-step algorithm:
1. Compile a list of high-scoring strings (words).
2. Search for hits; each hit gives a seed.
3. Extend seeds to obtain segment pairs.

BLAST
- For protein sequences, the list of high-scoring words consists of all words of w characters that score at least T with some word in the query sequence (w = 3 or 4 for protein searches, 11 or 12 for nucleotide sequences).
- Search for "hits" using a hash table or a finite state machine.
- Key concept: searching for words that score at least T rather than words that match exactly.

BLAST method for proteins
1. Compile a list of words that score at least T when paired with a word of the query sequence.
- Example using PAM-120 for the query word ACDE (w = 4, T = 17):
    ACDE = 22
- Try all possibilities:
    AAAA = 0 (no good)
    AAAC = -7 (no good)
  ... too slow; try directed change instead.

Generating word list
- The query word ACDE scores 22 against itself.
- Change the 1st position to all acceptable substitutions:
    gCDE = 20 ok (= pCDE, sCDE, tCDE)
    nCDE = 19 ok (= dCDE, eCDE, vCDE)
    iCDE = 18 ok (= qCDE)
    kCDE = 17 ok (= mCDE)
- Change the 2nd position: not possible, because all alternatives score negative and the other three positions add up to only 13, so the total would fall below T = 17.
- Change the 3rd position in combination with the first position:
    gCnE = 17 ok
- Continue, using recursion.
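The directed, recursive search described above can be sketched as a branch-and-bound enumeration: a partial word is abandoned as soon as even the best possible choices at the remaining positions cannot reach T. This is an illustration under stated assumptions, not BLAST's own code. The function names, the four-letter alphabet, and `toy_score` (+5 for identity, -1 otherwise) are hypothetical stand-ins for a real matrix such as PAM-120.

```python
def neighborhood_words(query_word, threshold, alphabet, score):
    """List (word, score) for all words scoring >= threshold against query_word."""
    w = len(query_word)
    # best_rest[i] = highest score still obtainable from positions i..w-1
    best_rest = [0] * (w + 1)
    for i in range(w - 1, -1, -1):
        best_here = max(score(c, query_word[i]) for c in alphabet)
        best_rest[i] = best_rest[i + 1] + best_here

    results = []

    def extend(prefix, total):
        i = len(prefix)
        if i == w:
            results.append((prefix, total))
            return
        for c in alphabet:
            new_total = total + score(c, query_word[i])
            # Prune: the remaining positions cannot make up the shortfall.
            if new_total + best_rest[i + 1] >= threshold:
                extend(prefix + c, new_total)

    extend("", 0)
    return results


def toy_score(a, b):
    """Hypothetical scoring scheme used only for this illustration."""
    return 5 if a == b else -1


words = neighborhood_words("ACD", 9, "ACDE", toy_score)
```

With this toy scheme the exact word and every single-substitution variant pass the threshold, while any word with two substitutions is pruned before it is fully built, which mirrors the "can't change the 2nd position" reasoning on the slide.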

BLAST method for proteins
2. Scan the database for hits with the compiled list of words.
- A finite state machine is what is actually used.
- Calculate a state transition table that tells which state to go to based on the next character in the sequence.
3. Extend hits in both directions to form segment pairs (without allowing gaps).
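Step 3 can be sketched as an ungapped extension that walks outward from the hit and keeps the best-scoring endpoint on each side. The stopping rule used here (give up once the running score falls more than `x_drop` below the best seen) is an assumption added for the illustration, as are the function names and `toy_score`; the slides do not specify when extension stops.

```python
def extend_hit(query, subject, q_start, s_start, w, score, x_drop):
    """Extend a length-w hit in both directions without gaps.

    Returns (total_score, query_begin, subject_begin, length), 0-based.
    """
    def walk(q, s, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # assumed drop-off rule: stop extending this side
            q += step
            s += step
        return best, best_len

    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))
    right, right_len = walk(q_start + w, s_start + w, +1)
    left, left_len = walk(q_start - 1, s_start - 1, -1)
    return (seed + left + right,
            q_start - left_len,
            s_start - left_len,
            left_len + w + right_len)


def toy_score(a, b):
    """Hypothetical scoring scheme used only for this illustration."""
    return 5 if a == b else -4


# The seed word CDE sits at index 3 in both toy sequences.
segment = extend_hit("KLACDEFMN", "PQACDEFRS", 3, 3, 3, toy_score, 10)
```

In this toy case the seed CDE grows by one residue on each side to the segment ACDEF and then stops gaining score.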

BLAST method for proteins
- Example of a finite state machine for string matching (input alphabet: a, b, c).
- Word: ababaca
- Database sequence: bcabccaaababacababacabb
[Figure: state-transition diagram for the word ababaca.]
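A string-matching finite state machine of this kind can be sketched as below. The construction is the textbook one (state k means "the last k characters read equal the first k characters of the word"); the function names are hypothetical, and this single-word version only illustrates the idea, since BLAST's machine recognizes a whole word list at once.

```python
def build_transition_table(word, alphabet):
    """table[state][ch] = length of the longest prefix of `word` that is
    a suffix of the text read so far, after reading ch in `state`."""
    m = len(word)
    table = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            seen = word[:state] + ch
            k = min(m, state + 1)
            while k > 0 and not seen.endswith(word[:k]):
                k -= 1
            row[ch] = k
        table.append(row)
    return table


def find_hits(word, text, alphabet):
    """Return the 0-based start position of every occurrence of `word`."""
    table = build_transition_table(word, alphabet)
    state, hits = 0, []
    for i, ch in enumerate(text):
        state = table[state][ch]  # one table look-up per database character
        if state == len(word):
            hits.append(i - len(word) + 1)
    return hits


hits = find_hits("ababaca", "bcabccaaababacababacabb", "abc")
```

Each database character costs exactly one table look-up, and overlapping occurrences are still found because the accepting state has outgoing transitions like any other state.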

Exercise
- Construct a finite state machine that recognizes the word ATG, assuming the sequence is a nucleotide sequence.

BLAST Method for DNA
1. Make a list of all words of length w in the query sequence (often w = 11 or 12).
2. Compress the database by packing 4 nucleotides into a single byte, using an auxiliary table to record where sequences start and stop within the compressed database. This packing does not allow for unspecified bases (wildcards).

BLAST Method for DNA
3. Compress the words from the query sequence in the same way.
4. Search the compressed database for matches with the compressed words.
- Since all frames of the query sequence are considered separately, any match of length w >= 11 must contain a match of length 8 that lies on a byte boundary of one of the words from the query sequence.
- The search can therefore scan one packed byte at a time, improving speed 4-fold over comparing one nucleotide at a time.
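The packing step can be sketched as follows. The particular 2-bit code (A = 0, C = 1, G = 2, T = 3) and the name `pack` are assumptions for illustration; the slides only state that four nucleotides share one byte.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def pack(seq):
    """Pack nucleotides four per byte; a short final chunk is zero-padded."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]  # KeyError on wildcards such as N
        byte <<= 2 * (4 - len(chunk))  # pad on the right
        out.append(byte)
    return bytes(out)


packed = pack("ACGTACGTACG")  # an 11-mer occupies three bytes
```

Two bits per base leave no spare code for an unspecified base, which is why wildcards cannot be represented, and the zero padding is indistinguishable from trailing A's, which is why sequence boundaries have to be kept in a separate table.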

Low-Complexity Regions
- Low-complexity regions are segments that contain certain bases or amino acids more often than one would expect in "normal" nucleotide or protein sequences.
- Problem: if the query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., an Alu sequence), there will be many hits with "uninteresting" regions.

Low-Complexity Regions
Solution:
- Make a list of the words occurring very frequently (more frequently than expected by chance).
- Remove these words from the query list of words before searching the database. (The words are replaced by strings of Xs.)
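The masking idea can be illustrated with a deliberately simplified sketch. Here a fixed count threshold stands in for "more frequently than expected by chance", and counts are taken within the sequence itself; both choices, along with the function name and parameters, are assumptions rather than the filter BLAST actually uses.

```python
from collections import Counter


def mask_frequent_words(seq, k, max_count):
    """Replace every k-letter word seen more than max_count times with X's."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    masked = list(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] > max_count:
            masked[i:i + k] = "X" * k
    return "".join(masked)


# An A-T rich stretch is masked; the remainder is left untouched.
masked = mask_frequent_words("ATATATATGCCG", 2, 2)
```

Masked positions no longer generate query words, so they cannot seed hits against similarly repetitive regions in the database.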

BLAST Statistical significance
- A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of maximal segment pairs (MSPs), given w and T.
- This allows BLAST to rank matching sequences in order of "significance" and to cut off listings at a user-specified probability.
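The slides do not give the formula behind these probabilities. The sketch below assumes the standard Karlin-Altschul form, in which the expected number of segment pairs scoring at least S is E = K * m * n * exp(-lambda * S) for a query of length m and a database of length n, and the probability of seeing at least one is 1 - exp(-E). The parameters K and lambda depend on the scoring system and must be supplied by the caller; the function names and the values used in the example are placeholders, not real BLAST parameters.

```python
import math


def expected_hits(score, query_len, db_len, K, lam):
    """Expected number of chance segment pairs with score >= `score`."""
    return K * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such segment pair, given expectation e."""
    return -math.expm1(-e)


# Placeholder parameters, chosen only to show how the ranking behaves.
e_low_score = expected_hits(40, 250, 1_000_000, K=0.1, lam=0.3)
e_high_score = expected_hits(80, 250, 1_000_000, K=0.1, lam=0.3)
```

Higher scores give exponentially smaller expectations, which is what allows hits to be ranked by significance and the listing to be cut off at a chosen threshold.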

Choosing Values for w and T
- Trade-off: sensitivity vs. running time.
- Choosing a value for w:
  - Small w: many matches to expand.
  - Big w: many words to be generated.
  - w = 3 or 4 is a good compromise.
- Choosing a value for T:
  - Small T: greater sensitivity, more matches to expand.

BLAST Notes
- May fail to find optimal MSPs.
- May miss seeds if T is too stringent.
- Empirically, 10 to 50 times faster than Smith-Waterman.

Basic BLAST Family
- BLASTN: DNA query against a DNA database
- BLASTP: protein query against a protein database
- TBLASTN: protein query against a DNA database (translated)
- BLASTX: DNA query (translated) against a protein database
- TBLASTX: DNA query (translated) against a DNA database (translated)

BLAST Refinements
- Gapped alignments
- "Two-hit" method for extending word pairs
- Iteration with a position-specific matrix (PSI-BLAST)
- Pattern-hit initiated BLAST (PHI-BLAST)