BLAST Workshop Maya Schushan June 2009.

Workshop outline
Part 1: Introduction and motivation. How does BLAST work?
Part 2: BLAST programs. Sequence databases. Work steps. Extracting and analyzing results.

Why BLAST? Finding homologues
Homology: similarity between sequences that results from a common ancestor.
Sequences that look alike probably have the same function and structure.
Use a sequence as a search query in order to find homologous sequences in a database.
Save time: exploit the knowledge you have about the homologues to draw conclusions about your query.
Rule of thumb: sequences sharing more than 25% identity (proteins) or 70% identity (nucleotides) will be considered homologous.

Why BLAST? Finding homologues
Identify sequence motifs.

Why BLAST? Finding homologues
Find out which regions are evolutionarily conserved, and therefore important for function and/or structure.

Why BLAST? Finding homologues
Construct phylogenetic trees to understand the evolution of the sequence's family.

Why BLAST? Finding homologues
Infer the function of a novel sequence by learning from the data already available for homologous sequences.

Why BLAST? Finding homologues
Find out whether your protein sequence has a solved structure (or whether a close homologue has one).

How does BLAST work? What Is An Alignment?
Before we can understand how BLAST works, we first have to understand the principles of sequence alignment.

How does BLAST work? What Is An Alignment?
Comparing two (pairwise) or more (multiple) sequences.
Searching for a series of identical or similar characters in the sequences.
VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG

How does BLAST work? What Is An Alignment?
A process of lining up two or more sequences to achieve the maximum level of identity, in order to find homologies.
(Slide figure: alternative ways of lining up T C A T G against C A T T G.)

How does BLAST work? What Is An Alignment?
S = ACTG, T = AGT. Three possible alignments:
S' = AC_TG    S' = ACTG    S' = ACTG
T' = A_GT_    T' = AGT_    T' = _AGT
Good: identical characters (match).
Bad: different characters (mismatch); gap (InDel).
Each pair of characters gets a value, depending on its identity.
The similarity score of the alignment is the sum of the pair values.

How does BLAST work? What Is An Alignment? General alignment methodology
Example: aligning two globins
Human Hemoglobin (HH):       VLSPADKTNVKAAWGKVGAHAGYEG
Sperm Whale Myoglobin (SWM): VLSEGEWQLVLHVWAKVEADVAGHG

How does BLAST work? What Is An Alignment?
Example: aligning two globins, no gaps
Percent identity: 36. Percent similarity: 40.
(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG

How does BLAST work? What Is An Alignment?
Example: aligning two globins, with gaps
Gaps: 2. Percent identity: 45.833 (instead of 36 without gaps). Percent similarity: 54.167 (instead of 40 without gaps).
(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
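The identity figures for the two globin alignments can be checked with a few lines of Python. This is only a sketch: the function name `percent_identity` is made up for illustration, and it assumes that percent identity is counted over the columns in which neither sequence has a gap, which is the convention that reproduces the numbers on the slides.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns in which both sequences carry a residue.
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    identical = sum(1 for x, y in compared if x == y)
    return 100.0 * identical / len(compared)


ungapped = percent_identity("VLSPADKTNVKAAWGKVGAHAGYEG",
                            "VLSEGEWQLVLHVWAKVEADVAGHG")
gapped = percent_identity("VLSPADKTNVKAAWGKVGAH-AGYEG",
                          "VLSEGEWQLVLHVWAKVEADVAGH-G")
print(round(ungapped, 3), round(gapped, 3))  # 36.0 45.833
```

Percent similarity would additionally need a rule for which residue pairs count as "similar" (for example a positive substitution score), so it is left out here.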

How does BLAST work? What Is An Alignment? Alignment scoring
1. Assume an independent mutation model: each position is considered separately.
2. Score at each position: positive if the characters are the same or similar, negative if they are different or there is a gap.
3. The score of an alignment is the sum of the position scores.
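As a small illustration of "the score of an alignment is the sum of the position scores", the sketch below scores the three alignments of ACTG and AGT shown earlier. The values match = +1, mismatch = -1 and gap = -2 are assumptions chosen for the example, not values taken from the workshop.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, gap_char="_"):
    """Sum independent per-column scores of two aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == gap_char or y == gap_char:
            total += gap        # a gap in either sequence
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


for s, t in [("AC_TG", "A_GT_"), ("ACTG", "AGT_"), ("ACTG", "_AGT")]:
    print(s, t, alignment_score(s, t))
```

Under these assumed values the second alignment (ACTG over AGT_) scores best, because it pays for only one gap and one mismatch.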

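How does BLAST work? What Is An Alignment? Scoring matrix
A matrix of size n × n: n = 4 for DNA, n = 20 for proteins.
Each matrix entry defines the score for observing the two letters aligned with each other.
Positive if the two letters are identical or likely to replace one another, negative otherwise.
(Slide table: a DNA matrix over T, C, G, A with the entries 1 and -5.)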

How does BLAST work? What Is An Alignment? DNA scoring matrices
Transitions: purine to purine or pyrimidine to pyrimidine (4 possibilities).
Transversions: purine to pyrimidine or pyrimidine to purine (8 possibilities).
By chance alone, transversions should occur twice as often as transitions.
In practice, transitions are more frequent than transversions.

How does BLAST work? What Is An Alignment? DNA scoring matrices
(Slide table: a From/To substitution matrix with three score levels: match 2, transition -4, transversion -6.)
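A matrix of this kind can be generated from the purine/pyrimidine classes. The sketch below assumes that the three scores on the slide map to match = 2, transition = -4 and transversion = -6; the function and variable names are made up for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
# Assumed reading of the slide: match 2, transition -4, transversion -6.
SCORES = {"match": 2, "transition": -4, "transversion": -6}


def substitution_class(x: str, y: str) -> str:
    """Classify a pair of bases as match, transition or transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def dna_matrix(scores=SCORES):
    """Build a 4 x 4 scoring matrix as a dict keyed by (from, to)."""
    bases = "ACGT"
    return {(x, y): scores[substitution_class(x, y)]
            for x in bases for y in bases}


matrix = dna_matrix()
print(matrix[("A", "G")], matrix[("A", "C")], matrix[("T", "T")])  # -4 -6 2
```

Counting the off-diagonal entries gives 4 transitions and 8 transversions, in line with the counts quoted for DNA substitutions.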

How does BLAST work? What Is An Alignment? Protein scoring matrices
Observation: some substitutions are more frequent than others, e.g., between chemically similar amino acids.
As for DNA, protein matrices define the probabilities of change between the different amino acids.
Popular matrices are based on empirical data: PAM and BLOSUM.

How does BLAST work? What Is An Alignment? PAM matrices
PAM matrices are based on sequences with at least 85% identity.
The changes are "accepted" by natural selection.
1 PAM unit: the probability of 1 point mutation per 100 residues.
Multiplying PAM1 by itself gives higher PAM matrices that are suitable for larger evolutionary distances.
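The idea of "multiplying PAM1 by itself" can be sketched with a toy example. The 3 × 3 matrix below is invented for illustration and is not the real PAM1 (which is 20 × 20 and estimated from data); each row holds the probabilities that one letter stays the same or changes into another after one PAM unit.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical one-PAM-unit mutation probabilities for a 3-letter alphabet.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1, while the probability of a letter remaining unchanged drops well below 0.99, which is why higher PAM numbers suit more distant sequences. The published PAM scoring matrices are log-odds scores derived from probabilities of this kind.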

How does BLAST work? What Is An Alignment? BLOSUM matrices
Based on the BLOCKS database.
Low BLOSUM numbers for distant sequences, high BLOSUM numbers for similar sequences.
BLOSUMn is based on sequences that share at least n percent identity. Generally:
BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations

How does BLAST work? What Is An Alignment? Protein scoring matrices
Roughly equivalent matrices, from closer to more distant sequences:
PAM100 ≈ BLOSUM90 (closer sequences)
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45 (distant sequences)

How does BLAST work? How do we calculate gap scores?
The same substitution scores are applied to gapped and ungapped local alignments.
Appropriate gap scores have been selected over the years by trial and error; these are the default gap scores.
If you wish to apply a different scoring matrix, there is no guarantee that the gap scores will remain appropriate.
A large penalty for opening a gap and a much smaller one for extending it are most effective.

How does BLAST work? What Is An Alignment? Scoring
The final score of the alignment is the sum of the positive scores and the penalty scores:
Alignment score = (identities + similarities, scored by the scoring matrix) - (gap insertions + gap extensions, scored by the gap penalties)
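The combination of substitution scores and gap penalties can be sketched as follows. The pair scores (+5 for a match, -4 for a mismatch) are a toy stand-in for a real scoring matrix, and the gap model assumes that a gap of length k costs gap_open + k × gap_extend; all names and values here are illustrative assumptions.

```python
def toy_pair_score(x: str, y: str) -> int:
    """Hypothetical stand-in for a lookup in a real scoring matrix."""
    return 5 if x == y else -4


def affine_alignment_score(a, b, pair_score=toy_pair_score,
                           gap_open=11, gap_extend=1, gap_char="-"):
    """Score an alignment with substitution scores and affine gap penalties."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    in_gap_a = in_gap_b = False  # are we inside a gap run in a or in b?
    for x, y in zip(a, b):
        if x == gap_char:
            total -= gap_extend + (0 if in_gap_a else gap_open)
            in_gap_a, in_gap_b = True, False
        elif y == gap_char:
            total -= gap_extend + (0 if in_gap_b else gap_open)
            in_gap_a, in_gap_b = False, True
        else:
            total += pair_score(x, y)
            in_gap_a = in_gap_b = False
    return total


print(affine_alignment_score("ACGTAC", "A--TAC"))  # one gap of length 2
print(affine_alignment_score("ACGTAC", "A-G-AC"))  # two gaps of length 1
```

Both alignments contain two gap characters, but the single long gap is penalized once for opening while the two short gaps pay the opening cost twice. This is the effect of a large opening penalty combined with a small extension penalty.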

How does BLAST work? BLAST (Basic Local Alignment Search Tool)
Goal: a fast search for homologues in a huge database.
The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them.
The heuristic: discard irrelevant sequences, then perform exact local alignment only with the remaining sequences.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215: 403-410.

How does BLAST work? Searching a sequence database
Idea: in order to find sequences homologous to a sequence of interest, one should compute its pairwise alignment against all known sequences in a database and detect the best-scoring significant homologs.
Query sequence: the sequence with which we are searching.
Hit: a sequence found in the database, suspected to be homologous.

How does BLAST work? The parameters
W: word size. Find W-mers in target/query; 2-3 for amino acids, 6-11 for nucleotides.
T: threshold. Focus on word pairs scoring at least T; usually 11-13.
X: drop-off. Stop extending when the score falls more than X below the best score so far.
S: score. The final score of the segment pair.

How does BLAST work? The algorithm
Align a query sequence with the database.
Find "hits": short word pairs of length W with an ungapped alignment score of at least T.
Extend the alignments until the score drops more than X below the best score found so far. This step consumes most of the processing time (>90%).

How does BLAST work? How do we discard irrelevant sequences quickly?
Divide the database into words of length w (default: w = 3 for protein and w = 11 for DNA).
Save the words in a look-up table that can be searched quickly.
Example: WTDFGYPAILKGGTAC is divided into WTD, TDF, DFG, FGY, GYP, …
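A look-up table of overlapping words can be sketched with a dictionary that maps each word to the places where it occurs. The function name `build_word_index` and the record identifier `rec1` are made up for illustration; real BLAST implementations use more compact data structures.

```python
from collections import defaultdict


def build_word_index(sequences, w=3):
    """Map every overlapping word of length w to its (record id, position) list."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


index = build_word_index({"rec1": "WTDFGYPAILKGGTAC"})
print(list(index)[:5])  # ['WTD', 'TDF', 'DFG', 'FGY', 'GYP']
print(index["GYP"])     # [('rec1', 4)]
```

Looking up a query word is then a single dictionary access, which is what makes the filtering stage fast.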

How does BLAST work? BLAST: discarding sequences
When the user enters a query sequence, it is also divided into words.
Search the database for consecutive neighboring words.
Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level.
Example (slide figure): candidate words with their scores: GFC (20), GFB, GPC (11), WAC (5).
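Neighbor words can be enumerated by scoring every possible word against the query word and keeping those that reach the cutoff. The sketch below uses a toy scoring function (+5 for identical residues, -2 otherwise) as a hypothetical stand-in for BLOSUM62, so its scores differ from the ones on the slide; the cutoff of 8 is likewise an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(x: str, y: str) -> int:
    """Hypothetical stand-in for a BLOSUM62 lookup."""
    return 5 if x == y else -2


def neighbor_words(word, threshold, score=toy_score):
    """Return every word of the same length scoring at least `threshold`."""
    neighbors = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            neighbors["".join(candidate)] = total
    return neighbors


neighbors = neighbor_words("GFC", threshold=8)
print(len(neighbors), neighbors["GFC"], neighbors["GPC"])  # 58 15 8
```

With this toy scoring, the word itself and all words with a single substitution pass the cutoff, while words with two substitutions (such as WAC) do not. With a real matrix the neighborhood depends on which substitutions are chemically plausible.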

How does BLAST work? Look for a seed
A seed: hits (neighbor words) on the same diagonal which can be connected.
At least 2 hits on the same diagonal, at a distance from each other that is smaller than a predetermined cutoff A.
This is the filtering stage: many unrelated hits are filtered out, saving lots of time!
(Slide figure: dot plot of the query against a database record, showing hits on a shared diagonal.)
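The seed search can be sketched by grouping word hits by diagonal. Each hit is written as a (query position, database position) pair, and two hits lie on the same diagonal when the difference between the two positions is equal. The function name, the window of 40 and the requirement that the two hits do not overlap are assumptions made for this illustration.

```python
from collections import defaultdict


def two_hit_seeds(hits, w=3, max_distance=40):
    """Find pairs of non-overlapping hits on one diagonal within max_distance."""
    by_diagonal = defaultdict(list)
    for q_pos, db_pos in hits:
        by_diagonal[db_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Compare each hit with the next one along the same diagonal.
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            if w <= distance < max_distance:
                seeds.append((diagonal, first, second))
    return seeds


hits = [(0, 10), (5, 15), (20, 3), (50, 60), (7, 30)]
print(two_hit_seeds(hits))  # [(10, 0, 5)]
```

Only the first two hits form a seed: the hit at query position 50 shares their diagonal but is too far away, and the remaining hits sit alone on their diagonals.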

How does BLAST work? Try to extend the alignment
Stop extending when the score of the alignment drops X below the maximal score obtained so far.
Discard segments with score < S.
(Slide figure: extension of a seed between the two sequences below, with intermediate scores 15, 17 and 14.)
ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS
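The drop-off rule can be sketched as an ungapped extension to the right of a seed. The toy scores (+5 for a match, -4 for a mismatch) and the drop-off X = 10 are assumptions for illustration, so the resulting numbers differ from the scores on the slide; a real search would use a scoring matrix and would also extend to the left.

```python
def toy_score(x: str, y: str) -> int:
    """Hypothetical stand-in for a scoring-matrix lookup."""
    return 5 if x == y else -4


def extend_right(query, subject, q_start, s_start, x_drop, score=toy_score):
    """Extend without gaps; stop once the score falls more than x_drop below the best."""
    best = running = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        running += score(query[q_start + i], subject[s_start + i])
        i += 1
        if running > best:
            best, best_len = running, i   # new maximum: remember where it ends
        elif best - running > x_drop:
            break                         # dropped too far below the maximum
    return best, best_len


query = "ASKIOPLLWLAASFLHNEQAPALSDAN"
subject = "JWQEOPLWPLAASOIHLFACNSIFYAS"
print(extend_right(query, subject, 9, 9, x_drop=10))  # (20, 4)
```

Starting at the shared LAAS, the extension reaches its best score after four residues, tolerates a few mismatches, and stops when the running score has fallen more than X below that maximum. The reported segment is trimmed back to the best-scoring point.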

How does BLAST work? Two-hit gapped BLAST
The new gapped BLAST algorithm:
1. Start with the two-hit method: (a) find two hits scoring higher than T, within a distance A of each other; (b) invoke an ungapped extension on the second hit.
2. If the HSP generated has a high enough score: (a) trigger a gapped extension; (b) if the final score has a significant E-value, report the gapped alignment.

How does BLAST work? The result: local alignment
The result of BLAST will be a series of local alignments between the query and the different hits found.

How does BLAST work? The scoring system
By default, BLAST uses BLOSUM62 as the scoring matrix to perform the alignment.

How does BLAST work? E-value
To assess the bit score we calculate the E-value.
E-value = the expected number of HSPs with a score of at least S.
For each score S there is a specific E-value.
Small E-value → better score.
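The relation between score and E-value can be sketched with the Karlin-Altschul formula E = K·m·n·e^(-λS), where m is the query length and n the database length. The values of λ and K below, as well as the lengths, are illustrative assumptions; real searches use parameters that depend on the scoring system and on the gap costs.

```python
import math


def e_value(raw_score, m, n, lam, k):
    """Expected number of HSPs scoring at least raw_score: K*m*n*exp(-lambda*S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """The same E-value expressed through the bit score: m*n*2^(-S')."""
    return m * n * 2.0 ** (-bits)


LAM, K = 0.318, 0.134         # assumed statistical parameters
M, N = 250, 50_000_000        # assumed query and database lengths

for raw in (40, 80, 120):
    bits = bit_score(raw, LAM, K)
    print(raw, round(bits, 1), e_value(raw, M, N, LAM, K))
```

The E-value shrinks exponentially as the score grows, and it scales with the size of the search space, so the same alignment is less significant in a larger database.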

How does BLAST work? E-value
Theoretically, we could trust any result with an E-value ≤ 1.
In practice, BLAST uses estimations:
E-values of 10^-4 and lower indicate a significant homology.
E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
E-values between 10^-2 and 1 do not indicate a good homology.

How does BLAST work? PSI-BLAST
Step 1: Run a standard protein-protein BLAST search (BLOSUM62). Build a position-specific scoring matrix (PSSM) from a multiple sequence alignment (MSA) of the results with low E-values.
Step 2: Run a BLAST search using the PSSM to evaluate the alignments (PSSM vs. database instead of sequence vs. database). Update the PSSM according to the new results. Go back to the beginning of step 2, or stop.

How does BLAST work? PSI-BLAST: the difference
The score for aligning a letter with a pattern position is given by the matrix itself.
The matrix has the length of the original sequence (L × 20).
There is no theory for deriving gap costs, so the gap scores stay the same as in the first iteration.
(Slide table: an example PSSM with columns for positions 1-9 and rows for residues such as A, D and L, holding values like .1, .3, .2, .8, .6, .4, .7, .9.)
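A position-specific matrix can be sketched from the columns of an ungapped alignment. This is a strong simplification of what PSI-BLAST does: it assumes a uniform background, a flat pseudocount and no sequence weighting, and the function names and example sequences are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def pssm_score(pssm, segment):
    """Score a segment of the same length as the PSSM, column by column."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


aligned = ["ADL", "ADL", "AEL", "GDL"]
pssm = build_pssm(aligned)
print(round(pssm_score(pssm, "ADL"), 2), round(pssm_score(pssm, "WWW"), 2))
```

The fully conserved third column gives L a higher score than the partly conserved first column gives A, which is the sense in which conserved positions receive a different weight.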

How does BLAST work? The power of PSI-BLAST
A much more sensitive scoring system: each position has its own pattern of probabilities.
Different weights are given to conserved positions.
Important motifs are bounded.
Lowers the level of random noise.
Finds distant relatives.

How does BLAST work? Let's sum up
BLAST is a fast way to find homologues.
There is no analytic theory that estimates the statistical significance of gapped alignments.
Gap scores have been selected by trial and error; applying a different scoring matrix gives no guarantee that the gap scores remain appropriate.
PSI-BLAST finds weak homologues fast.