Part 2- OUTLINE Introduction and motivation How does BLAST work?


Part 2 - OUTLINE: Introduction and motivation. How does BLAST work? Query sequence: DNA or protein? Query type. BLAST vs. PSI-BLAST.

Why BLAST? Finding homologues. Homology: similarity between sequences that results from a common ancestor. Sequences that look alike probably have the same function and structure. Use a sequence as a search query in order to find homologous sequences in a database. Save time! Exploit the knowledge you have about your homologues to draw conclusions about your query. Sequences with more than 25% identity (proteins) or 70% identity (nucleotides) will be considered homologous.

Why BLAST? Finding homologues. Answering basic questions such as: Which bacterial species have a protein that is related in lineage to a certain protein with a known amino-acid sequence? Where does a certain sequence of DNA originate? What other genes encode proteins that exhibit structures or motifs such as ones that have just been determined?

Why BLAST? Searching a sequence database. The idea: use your sequence as a query to find homologous sequences in a sequence database. (Figure, built up over three slides: a query sequence taken from Venter's trip is searched against the database, and a hit is found.)

Why BLAST? Why heuristics? Assuming 10 comparisons per second, a full comparison of the query to a database of 10^7 sequences requires 11.5 days. 11.5 days is OK if we are doing it once, but at least 150,000 searches are performed per day, and there are >82,000,000 sequence records in GenBank.
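The arithmetic behind this estimate can be checked with a short sketch. The function names are illustrative only; the inputs are the slide's own figures (10^7 sequences, 10 comparisons per second, 150,000 searches per day). The exact value is about 11.6 days, in line with the roughly 11.5 days quoted.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def exhaustive_search_days(n_sequences, comparisons_per_second):
    """Days needed to compare one query against every database record."""
    return n_sequences / comparisons_per_second / SECONDS_PER_DAY


def daily_compute_days(searches_per_day, days_per_search):
    """Total machine-days of work generated by one day's worth of searches."""
    return searches_per_day * days_per_search


one_search = exhaustive_search_days(10**7, 10)
print(f"one exhaustive search: {one_search:.2f} days")
print(f"one day of queries: {daily_compute_days(150_000, one_search):,.0f} machine-days")
```

Under these assumptions a single day of queries would need well over a million machine-days, which is why an exhaustive comparison is not practical.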

Why BLAST? Terminology. Query sequence: the sequence with which we are searching the database. Hit: a sequence found in the database, suspected as homologous to the query sequence.

How does BLAST work? BLAST (Basic Local Alignment Search Tool). Goal: a fast search for homologues in a huge database. One of the most widespread bioinformatics programs: it provides a solution to a fundamental need. It emphasizes speed over sensitivity, because the databases are enormous and will only grow larger and larger. It cannot guarantee an optimal alignment, so after finding the homologs via BLAST, an additional alignment program is needed. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215: 403-410.

How does BLAST work? BLAST (Basic Local Alignment Search Tool). The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them. The heuristic: discard irrelevant sequences, then perform exact local alignment only with the remaining sequences (Altschul et al., 1990).

How does BLAST work? Searching a sequence database. Idea: in order to find sequences homologous to a sequence of interest, one should compute its pairwise alignment against all known sequences in a database and detect the best-scoring significant homologs. Query sequence: the sequence with which we are searching. Hit: a sequence found in the database, suspected as homologous (HSP, high-scoring segment pair: the matched region).

BLAST main paradigm. For each database record and query: look for common words instead of trying all possible alignments between the two sequences. If many common words are found, the query and the record are possible homologues. Flow: retrieve the next record from the database. Are there common words between record and query? Yes: possible homologs, save the record for further analysis. No: probably not homologs, discard the record. Then retrieve the next record.

How does BLAST work? Searching a sequence database. Inputs: query sequence; database of sequences; word size (use default); substitution matrix (use default); gap penalty (use default).

How does BLAST work? The parameters. W: word size; find W-mers in target/query; 2-3 for amino acids, 6-11 for nucleotides. T: threshold; focus on pairs scoring > T; usually 11-13. X: drop-off; stop extending when the loss is > X. S: score; the final score of the segment pair.

How does BLAST work? How do we discard irrelevant sequences quickly? Divide the database into words of length w (default in NCBI BLAST: w = 3 for protein and w = 11 for DNA). Save the words in a look-up table that can be searched quickly. Example: the sequence WTDFGYPAILKGGTAC yields the words WTD, TDF, DFG, FGY, GYP, …
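The word look-up table described here can be sketched as a dictionary from each word to the places it occurs. This is a minimal illustration, not BLAST's actual data structure; the function names and the record identifier `rec1` are made up for the example.

```python
from collections import defaultdict


def words(seq, w=3):
    """All overlapping words of length w in seq, in order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_word_index(records, w=3):
    """Map each word to a list of (record_id, position) occurrences."""
    index = defaultdict(list)
    for record_id, seq in records.items():
        for pos, word in enumerate(words(seq, w)):
            index[word].append((record_id, pos))
    return index


index = build_word_index({"rec1": "WTDFGYPAILKGGTAC"})
print(words("WTDFGYP"))   # the first words of the slide's example sequence
print(index["FGY"])       # where the word FGY occurs
```

A dictionary gives constant-time look-up on average, which is the property the slide asks for when it says the table must be searched quickly.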

How does BLAST work? BLAST: discarding sequences. When the user enters a query sequence, it is also divided into words. For each word, neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a cutoff level (T). Example for the query word GFC, with BLOSUM62 scores: GFC (21), GFB, GPC (11), WAC (5).

How does BLAST work? BLAST: discarding sequences. A list of the possible neighboring words is compiled, and only exact matches of these words to words in the database are accepted. Words whose scores reach the threshold T remain in the list of possible matching words, while those with lower scores are discarded. Example for the query word GFC, with BLOSUM62 scores: GFC (21), GFB, GPC (11), WAC (5).
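The neighbor-word filter can be sketched as follows. To keep the example short, only the handful of BLOSUM62 entries needed for the words GFC, GPC and WAC are included (an assumption of this sketch; a real implementation would load the full matrix and enumerate all possible words). The cutoff is applied as "score at least T", matching the algorithm slide. With these entries the query word GFC scores 21 against itself.

```python
# Subset of BLOSUM62, limited to the residue pairs used in this example.
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("F", "F"): 6, ("C", "C"): 9,
    ("F", "P"): -4, ("G", "W"): -2, ("F", "A"): -2,
}


def pair_score(a, b):
    """Score of aligning residues a and b (the matrix is symmetric)."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # KeyError if the pair is not in the subset


def word_score(word_a, word_b):
    """Ungapped score of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word_a, word_b))


def neighbor_words(query_word, candidates, threshold):
    """Keep the candidate words that score at least `threshold` against the query word."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


print(neighbor_words("GFC", ["GFC", "GPC", "WAC"], threshold=11))
```

With T = 11, GFC and GPC stay in the list and WAC (score 5) is discarded.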

How does BLAST work? The algorithm: align a query sequence with the database. Find "hits": short word pairs of length W with an ungapped alignment score of at least T. Extend the alignments until the score drops more than X below the best score so far; this step consumes most of the processing time (>90%). (Figure: two sequences, s and t.)

How does BLAST work? Try to extend the alignment. Stop extending when the score of the alignment drops X below the maximal score obtained so far. Discard segments with score < S. Example (figure):
ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS
Figure labels: Score=15, Score=17, Score=14.
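The drop-off rule can be sketched for an extension to the right of a seed. This is a simplified illustration: it uses an assumed +1/-1 match/mismatch scoring instead of a substitution matrix, extends in one direction only, and the function name is hypothetical.

```python
def extend_right(query, subject, q_start, s_start, word_len, x_drop,
                 match=1, mismatch=-1):
    """Extend an ungapped seed to the right.

    Stops once the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, length_of_best_segment).
    """
    def score_at(i):
        return match if query[q_start + i] == subject[s_start + i] else mismatch

    score = sum(score_at(i) for i in range(word_len))
    best_score, best_len = score, word_len
    i = word_len
    while q_start + i < len(query) and s_start + i < len(subject):
        score += score_at(i)
        i += 1
        if score > best_score:
            best_score, best_len = score, i
        elif best_score - score > x_drop:
            break  # the score dropped too far below the maximum
    return best_score, best_len


# Seed of length 3 at position 0; the sequences diverge after five residues.
print(extend_right("AAAAACCCCC", "AAAAAGGGGG", 0, 0, word_len=3, x_drop=2))
```

The segment reported is the one ending at the maximal score, not the point where the extension stopped, so the trailing mismatches are trimmed off.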

How does BLAST work? Two-hit gapped BLAST. The goal: a faster algorithm that reduces the number of extensions. Observation: an HSP much longer than W often contains more than one word pair. Idea: focus on two or more words on the same diagonal.

How does BLAST work? Look for a seed: hits on the same diagonal which can be connected. This requires at least 2 hits on the same diagonal, at a distance smaller than a predetermined cutoff A. This is the filtering stage: many unrelated hits are filtered out, saving lots of time! (Figure: query plotted against a database record, with neighbor-word hits marked.)

How does BLAST work? Two-hit gapped BLAST. The new gapped BLAST algorithm: Start with the two-hit method: (a) find two hits of score higher than T, within a distance A; (b) invoke an ungapped extension on the second hit. If the HSP generated has a sufficiently high score: (a) trigger a gapped extension; (b) if the final score has a significant E-value, report the gapped alignment.
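The two-hit filter can be sketched by grouping word hits by diagonal (subject position minus query position) and keeping pairs of neighbouring hits that lie within the window A. The requirement that the two hits do not overlap is an assumption added here, following the usual description of the method; the function name and the input format are illustrative.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_len, window):
    """Find pairs of hits on the same diagonal within `window` of each other.

    hits: iterable of (query_pos, subject_pos) word hits.
    Returns a sorted list of (diagonal, first_query_pos, second_query_pos).
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        for first, second in zip(q_positions, q_positions[1:]):
            distance = second - first
            # Non-overlapping (distance >= word_len) and closer than the cutoff A.
            if word_len <= distance < window:
                seeds.append((diagonal, first, second))
    return sorted(seeds)


hits = [(0, 10), (5, 15), (20, 3), (50, 60)]
print(two_hit_seeds(hits, word_len=3, window=40))
```

Only the second hit of each returned pair would then be passed on to the ungapped extension; isolated hits such as (20, 3) are dropped without any extension work.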

How does BLAST work? The result: local alignment. The result of BLAST will be a series of local alignments between the query and the different hits found.

How does BLAST work? The scoring system. By default, BLAST uses BLOSUM62 as the scoring matrix to perform the alignment.

How does BLAST work? E-value. To assess the bit score we calculate the E-value: the expected number of HSPs with a score of at least S. For each score S there is a specific E-value. A smaller E-value means a better score.

How does BLAST work? E-value. Theoretically, we could trust any result with an E-value ≤ 1. In practice, BLAST uses estimations. E-values of 10^-4 and lower indicate a significant homology. E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous). E-values between 10^-2 and 1 do not indicate a good homology.
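The relation between a raw score and its E-value can be sketched with the Karlin-Altschul formula E = K·m·n·e^(-λS), where m and n are the query and database lengths. The formula is not given on the slides, and the λ and K values below are assumed, illustrative parameters (typical published values for BLOSUM62 with gap costs 11/1). The interpretation bands are the slide's rules of thumb.

```python
import math

# Illustrative statistical parameters (assumed, not taken from the slides).
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance HSPs scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def interpret(e):
    """Rule-of-thumb reading of an E-value, following the slide."""
    if e <= 1e-4:
        return "significant homology"
    if e <= 1e-2:
        return "should be checked"
    if e <= 1:
        return "not a good indication of homology"
    return "above the range discussed"


e = e_value(raw_score=120, query_len=300, db_len=10**9)
print(f"bits = {bit_score(120):.1f}, E = {e:.2e}, {interpret(e)}")
```

Because the search space m·n enters linearly, the same raw score becomes less significant as the database grows.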

How does BLAST work? Low-complexity regions: filter. A low-complexity region is a region of a sequence that is composed of few kinds of elements. These regions might give high scores that confuse the program when it looks for the actually significant sequences in the database, so they should be filtered out with specialized programs.
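The idea of a low-complexity filter can be illustrated with a sliding-window Shannon entropy mask. This is a simplified stand-in for the specialized programs mentioned above, not their actual algorithm; the window size, the entropy cutoff and the mask character are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits of the residue composition of a segment."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every window whose entropy is below min_entropy with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("SSSSSSSSSSSS"))   # a single repeated residue
print(mask_low_complexity("ACDEFGHIKLMN"))   # twelve distinct residues
```

Masked residues are then simply ignored when seeding, so a run of a single residue cannot generate spurious high-scoring hits.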

Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable if we want to learn about homology?

Query sequence: DNA or protein? Query type. Nucleotides: a four-letter alphabet. Amino acids: a twenty-letter alphabet. Two random DNA sequences will, on average, have 25% identity. Two random protein sequences will, on average, have 5% identity.
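These percentages follow if every letter is assumed to be equally likely: the chance that two random positions match is the sum of the squared letter frequencies. The sketch below shows that calculation; the uniform composition is an assumption, and real sequences have skewed compositions that raise the expected identity somewhat.

```python
def expected_identity(frequencies):
    """Probability that two independently drawn letters are identical."""
    return sum(p * p for p in frequencies)


dna_uniform = [1 / 4] * 4
protein_uniform = [1 / 20] * 20

print(f"DNA:     {expected_identity(dna_uniform):.0%}")
print(f"protein: {expected_identity(protein_uniform):.0%}")
```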

Query sequence: DNA or protein? Query types. Which search is preferable?
1. The genetic code is redundant. Some amino acids are coded by more than one codon. Therefore, the DNA sequence can change while the amino acid sequence remains the same.
2. Nucleotides: a four-letter alphabet. Amino acids: a twenty-letter alphabet.
3. Protein comparison matrices are much more sensitive than those for DNA, i.e., similarity relationships are defined between two amino acids (PAM/BLOSUM).
4. DNA databases are much larger, meaning more random hits.

Query sequence: DNA or protein? Amino acids are better! Selection (and hence conservation) works (mostly) at the protein level: CTTTCA = Leu-Ser, and TTGAGT = Leu-Ser.
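The Leu-Ser example can be verified with a short sketch that builds the standard genetic code and compares the two sequences at both levels. The helper names are illustrative; the codon table is the standard one, written in the conventional T, C, A, G order.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code: codon -> one-letter amino acid ('*' is a stop).
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string in frame 1, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of identical positions between two equal-length strings."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


dna_1, dna_2 = "CTTTCA", "TTGAGT"
print(translate(dna_1), translate(dna_2))
print(f"DNA identity:     {percent_identity(dna_1, dna_2):.0f}%")
print(f"protein identity: {percent_identity(translate(dna_1), translate(dna_2)):.0f}%")
```

The two DNA sequences share only one of six positions, yet both encode Leu-Ser, which is the point of the slide.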

Query sequence: DNA or protein? Amino acids are better! Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons. Evolutionarily distant proteins will exhibit a high similarity rather than a high identity. Hits can exhibit a long alignment (homology) or a short alignment (conserved domains). Why use a nucleotide sequence after all?

Query type. The sequence query can be a nucleotide sequence or an amino acid sequence. But we can translate the query sequence! The search is performed against a nucleotide or amino acid database. But we can use translated databases (e.g., TrEMBL)! All types of searches are possible: the query can be DNA or protein, and the database can be DNA or protein.

Query type. A nucleotide query can be translated and searched against protein databases: translate all reading frames (3 + 3) and find a long ORF. Can an amino acid query be back-translated and searched against nucleotide databases? During translation we lose information: a single amino acid sequence can be back-translated to many possible nucleotide sequences.
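Translating all six reading frames (three on each strand) can be sketched as below. The helper names are illustrative, and the "longest ORF" here is simply the longest stretch without a stop codon, a simplification that does not require a start codon.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code: codon -> one-letter amino acid ('*' is a stop).
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate in frame 1, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translations of the three forward (+) and three reverse (-) frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_stop_free_stretch(protein):
    """Longest run of residues between stop codons."""
    return max(protein.split("*"), key=len)


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

The reverse direction is not symmetric: because most amino acids have several codons, one protein sequence corresponds to many possible DNA sequences.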

Query type
1. Amino acid query against protein database (blastp): identifying a protein sequence; finding similar sequences in protein databases.
2. Nucleotide query against nucleotide database (blastn): in non-coding regions (no ORF found), identify the query sequence or find similar sequences; find primer binding sites or map short contiguous motifs.
3. Translated nucleotide query against protein database (blastx): useful when the query includes a coding region and we try to find homologous proteins. Used extensively in analyzing EST sequences. This search is more sensitive than nucleotide BLAST, since the comparison is performed at the protein level.
4. Protein query against translated nucleotide database (tblastn): useful for finding protein homologs in unannotated nucleotide data of coding regions (e.g., ESTs, draft genome records (HTG)).
5. Translated nucleotide query against translated nucleotide database (tblastx): useful for identifying novel genes in error-prone query sequences. Used for identifying potential proteins encoded by single-pass read ESTs.

BLAST vs. PSI-BLAST. PSI-BLAST: Position-Specific Iterated BLAST. Uses sequence information to build position-specific scoring matrices. More sensitive. After one regular BLAST iteration, the PSI-BLAST-specific search is invoked for a number of additional iterations.

BLAST vs. PSI-BLAST. PSI-BLAST. Step 1: run a standard protein-protein BLAST search (BLOSUM62). Build a position-specific scoring matrix (PSSM) from the multiple sequence alignment (MSA) of the results with low E-values. Step 2: run a BLAST search using the PSSM to evaluate the alignment (PSSM vs. DB instead of seq vs. DB). Update the PSSM according to the new results. Go back to the beginning of step 2, or stop.

BLAST vs. PSI-BLAST. PSI-BLAST: searching with a profile. Aligning a profile matrix to a simple sequence is like aligning two sequences, except that the score for aligning a character with a matrix position is given by the matrix itself, not by a substitution matrix.
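Scoring a sequence against a profile can be sketched as follows. This is a deliberately simplified PSSM: log-odds scores from column counts with a flat pseudocount and a uniform background, which is not the sequence weighting that PSI-BLAST itself uses. The toy alignment and all names are made up for illustration, and only ungapped windows are scored.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """One {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({res: math.log2((counts[res] + pseudocount) / total / background)
                     for res in ALPHABET})
    return pssm


def score_window(pssm, segment):
    """Score of aligning a segment to the profile, position by position."""
    return sum(column[res] for column, res in zip(pssm, segment))


def best_match(pssm, sequence):
    """(score, start) of the best ungapped placement of the profile."""
    width = len(pssm)
    return max((score_window(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


pssm = build_pssm(["GFC", "GFC", "GPC", "GFC"])
print(best_match(pssm, "AAGFCAA"))
```

Each position carries its own scores: in this toy profile a P in the second column still scores positively, while the same P anywhere else is penalized, something a single substitution matrix cannot express.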

BLAST vs. PSI-BLAST PSI-BLAST Figure from: Altschul et al. Nucleic Acids Research 25, 1997

BLAST vs. PSI-BLAST Testing PSI-BLAST Compare sensitivity and speed of: • Smith-Waterman • Original BLAST • Gapped BLAST • PSI-BLAST

BLAST vs. PSI-BLAST. Testing PSI-BLAST. All but one are true homologs. PSI-BLAST is faster and more sensitive. The other BLAST algorithms are good as well.

BLAST vs. PSI-BLAST. The power of PSI-BLAST: a much more sensitive scoring system. Each position has its own pattern of probabilities. Conserved positions get a different weight. Important motifs are bounded. Lowers the level of random noise. Finds distant relatives.

BLAST vs. PSI-BLAST. Let's sum up: BLAST is a fast way to find homologues. There is no analytic theory that estimates the statistical significance of gapped alignments. Gap scores have been selected by trial and error, so when a different scoring matrix is applied there is no guarantee for the gap scores. PSI-BLAST finds weak homologues fast.

Finding & selecting homologues Where? (to find homologues) Structural templates- search against the PDB Sequence homologues- search against SwissProt or Uniprot or UniRef90 (recommended!) How many? As many as possible, as long as the MSA looks good (examples in the next hour…)

Finding & selecting homologues. How long? (length of homologues) Fragments: short homologues (less than 50-60% of the query's length) = bad alignment. Ensure your sequences exhibit the wanted domain(s). N/C termini tend to vary in length between homologues. You can use HSPs or full sequences, depending on the case you are working on. How close? (distance from query sequence) All too close: no information. Too many too far: bad alignment. Ensure that you have a balanced collection!

Finding & selecting homologues. From whom? (which species the sequence belongs to) Don't care, all homologues are welcome. Orthologues/paralogues may be helpful. Sequences from distant/close species provide different types of information. Which method? (BLAST/PSI-BLAST) Depends on the protein, the available homologues, and the goal in mind.