Lecture 3.1 BLAST: Basic Local Alignment Search Tool


1 Lecture 3.1 BLAST

2 BLAST
Basic Local Alignment Search Tool
Developed in 1990 and 1997 (S. Altschul)
A heuristic method for performing local alignments through searches of high-scoring segment pairs (HSPs)
First to use statistics to predict significance of initial matches, which saves on false leads
Offers both sensitivity and speed

3 BLAST
Looks for clusters of nearby or locally dense “similar or homologous” k-tuples
Uses “look-up” tables to shorten search time
Uses larger “word size” than FASTA to accelerate the search process
Performs local alignment only (it does not compute global alignments)
Fastest and most frequently used sequence alignment tool: THE STANDARD
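The look-up table idea on this slide can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: it indexes exact words only, whereas BLAST also considers similar "neighbourhood" words scoring above a threshold. The function names `build_word_index` and `find_seed_hits` are hypothetical.

```python
from collections import defaultdict


def build_word_index(sequence, word_size=3):
    """Map every overlapping word of length word_size to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        index[sequence[start:start + word_size]].append(start)
    return index


def find_seed_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where an identical word occurs in both.

    Assumption: exact word matches only; real BLAST also seeds on similar words.
    """
    index = build_word_index(query, word_size)
    hits = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        for q_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits
```

Each seed hit would then be a starting point for extension into a high-scoring segment pair; the index makes finding seeds a dictionary look-up rather than a full comparison.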

4 BLAST Access
NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
Canadian Bioinformatics Resource BLAST: http://cbr-rbc.nrc-cnrc.gc.ca/blast/
European Bioinformatics Institute BLAST: http://www.ebi.ac.uk/blastall/ and http://www.ebi.ac.uk/blast2/


8 Different Flavours of BLAST
BLASTP - protein query against protein DB
BLASTN - DNA/RNA query against GenBank (DNA)
BLASTX - 6-frame translated DNA query against protein DB
TBLASTN - protein query against 6-frame GenBank translation
TBLASTX - 6-frame translated DNA query against 6-frame GenBank translation
PSI-BLAST - protein ‘profile’ query against protein DB
PHI-BLAST - protein pattern against protein DB
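The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX can be illustrated as follows. This is a minimal sketch assuming the standard genetic code and unambiguous bases; the helper names are hypothetical and it is not how BLAST itself is implemented.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

A translated search compares a protein (or another six-frame translation) against each of these six protein strings, which is why such searches cost more than BLASTP or BLASTN.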

9 Other BLAST Services
MEGABLAST - for comparison of large sets of long DNA sequences
RPS-BLAST - Conserved Domain Detection
BLAST 2 Sequences - for performing pairwise alignments for 2 chosen sequences
Genomic BLAST - for alignments against select human, microbial or malarial genomes
VecScreen - for detecting cloning vector contamination in sequenced data

10 Running NCBI BLAST

11 MT0895
MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIMGRVASKEEIKKILS

12 Running NCBI BLAST
Paste in sequence (FASTA format, raw sequence, or type in GI or accession number)
>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
>
KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS
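The three accepted input styles on this slide (FASTA with a description line, FASTA with a bare ">", and raw sequence) can all be read by one small routine. This is an illustrative sketch with a hypothetical function name, not the parser NCBI uses.

```python
def parse_query(text):
    """Return (identifier, sequence) from FASTA-formatted or raw sequence text.

    The identifier is an empty string when there is no header line or the
    header is a bare '>'.
    """
    identifier = ""
    residues = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            identifier = line[1:].strip()
        else:
            # Drop internal whitespace and normalise case.
            residues.append("".join(line.split()).upper())
    return identifier, "".join(residues)
```

For instance, `parse_query(">Mysequence MT0895\nKIQIYG\nTGCAN")` gives the header text as the identifier and the two sequence lines joined into one string.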

13 Running NCBI BLAST
Choose a range of interest in the sequence with “set subsequences” (not usually used)
Select the database from the pull-down menu (usually choose nr = non-redundant)
Keep the CD Search “check box” on
Leave “Options” unchanged (use defaults)
Go to the “Format” menu and adjust the number of descriptions and alignments as desired

14 Running NCBI BLAST: Select Database

15 Conserved Domain Database
Contains a collection of pre-identified functional or structural domains
Derived from Pfam and Smart databases as well as other sources
Uses Reverse Position Specific BLAST (RPS-BLAST) to perform search
Query sequence is compared to a PSSM derived from each of the aligned domains

16 Running NCBI BLAST: Click BLAST!

17 Formatting Results

18 BLAST Format Options

19 BLAST Output

20 BLAST Output

21 BLAST Output

22 BLAST Output

23 BLAST Output

24 BLAST Output

25 BLAST Parameters
Identities - number and % of exact residue matches
Positives - number and % of similar and identical matches
Gaps - number and % of gaps introduced
Score - summed HSP score (S)
Bit Score - a normalized score (S’)
Expect (E) - expected number of chance HSP alignments
P - probability of getting a score > X
T - minimum word or k-tuple score (threshold)
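The relationship between Score, Bit Score, Expect and P can be sketched with the standard Karlin-Altschul formulas: S' = (λS − ln K) / ln 2, E = m·n·2^(−S'), and P = 1 − e^(−E). The slides do not give these formulas, so treat this as a supplementary sketch; the statistical parameters `lam` and `k` depend on the scoring matrix and gap costs and must be supplied, and the lengths here are plain lengths rather than the edge-corrected effective lengths BLAST uses.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw HSP score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance HSPs with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bits)


def p_value(expect):
    """Probability of seeing at least one chance HSP, given the Expect value."""
    return -math.expm1(-expect)
```

For small E the two quantities are nearly equal (P ≈ E), which is why BLAST reports only E. A call such as `bit_score(100, lam=0.267, k=0.041)` uses illustrative parameter values, not values taken from the slides.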

26 BLAST - Rules of Thumb
Expect (E-value) is equal to the number of BLAST alignments with a given Score that are expected to be seen simply due to chance
Don’t trust a BLAST alignment with an Expect value > 0.01 (grey zone is between 0.01 and 1)
Expect and Score are related, but Expect contains more information. Note that % Identities is more useful than the bit Score
Recall Doolittle’s Curve (%ID vs. Length, next slide): %ID > 30 - numres/50
If uncertain about a hit, perform a PSI-BLAST search
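These rules of thumb translate directly into two small checks. The sketch below assumes that "numres" in the slide's formula means the number of aligned residues, and the function names and category labels are hypothetical.

```python
def classify_expect(e_value):
    """Apply the slide's E-value rule: trusted, grey zone (0.01 to 1), or untrusted."""
    if e_value <= 0.01:
        return "trust"
    if e_value <= 1.0:
        return "grey zone"
    return "do not trust"


def passes_identity_rule(percent_identity, aligned_residues):
    """Check %ID > 30 - numres/50, taking numres as the aligned length (assumption)."""
    return percent_identity > 30.0 - aligned_residues / 50.0
```

Under this reading, a 100-residue alignment needs more than 28% identity, while a 500-residue alignment needs more than 20%.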

27 Doolittle’s Curve (Twilight Zone)

28 Getting the Most from BLAST

29 BLAST Options

30 BLAST Options
Composition-based statistics (Yes)
Sequence Complexity Filter (Yes)
Expect (E) value (10)
Word Size (3)
Substitution or Scoring Matrix (Blosum62)
Gap Insertion Penalty (11)
Gap Extension Penalty (1)
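The two gap penalties combine into an affine gap cost. A minimal sketch, assuming the NCBI convention that a gap of length k costs the insertion (existence) penalty plus k times the extension penalty; `gap_cost` is a hypothetical helper name.

```python
def gap_cost(length, existence=11, extension=1):
    """Penalty for one gap of the given length under affine gap costs.

    Defaults mirror the slide's values (11 and 1). Assumes every gapped
    position, including the first, pays the extension penalty.
    """
    if length <= 0:
        return 0
    return existence + extension * length
```

With the defaults, opening a gap is expensive but lengthening it is cheap, so one long gap costs far less than several short ones covering the same number of positions.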

31 Composition Statistics
Recent addition to BLAST algorithm
Permits calculated E (Expect) values to account for amino acid composition of queries and database hits
Improves accuracy and reduces false positives
Effectively conducts a different scoring procedure for each sequence in database

32 LCRs (low-complexity regions)
Watch out for…
  – transmembrane or signal peptide regions
  – coiled-coil regions
  – short amino acid repeats (collagen, elastin)
  – homopolymeric repeats
BLAST uses SEG to mask amino acids
BLAST uses DUST to mask bases
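The idea behind low-complexity masking can be shown with a sliding-window entropy filter. This is not the SEG or DUST algorithm; it is a simplified stand-in, and the window size, threshold and function names are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residues in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue covered by a low-entropy window with mask_char.

    Assumption: window and threshold are illustrative, not SEG's parameters.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)
```

Masked residues are ignored when seeding, so a homopolymeric run in the query cannot generate large numbers of spurious hits against unrelated repeat-rich proteins.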

33 Scoring Matrices
BLOSUM Matrices
  – Developed by Henikoff & Henikoff (1992)
  – BLOcks SUbstitution Matrix
  – Derived from the BLOCKS database
PAM Matrices
  – Developed by Schwartz and Dayhoff (1978)
  – Point Accepted Mutation
  – Derived from manual alignments of closely related proteins

34 How to Make Your Own Matrix
Steps: perform alignment, calculate frequencies, fill substitution matrix
Example alignment: ACDEFGH.., ACDEFGK.., AADEFGH.., GCDEFGH.., ACAEYGK.., ACAEFAH..
[Figure: partially filled substitution matrix over A, C, D, E, with frequency formulas f(A,A) and f(C,A) given as ratios of observed (obs) to expected (exp) counts]
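The align, count, fill procedure can be sketched as a log-odds calculation. The version below follows the BLOSUM-style recipe (observed pair frequency divided by the frequency expected from background composition, in log base 2); the slide's exact formulas are not fully recoverable, so this is an assumed reconstruction of the general idea rather than the slide's own method, and `log_odds_scores` is a hypothetical name.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(alignment):
    """Return {(a, b): log2(observed / expected)} for residue pairs seen in columns.

    alignment: equal-length, gap-free sequences. Pairs never observed are omitted.
    """
    pair_counts = Counter()
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Background frequency of each residue, derived from the pair counts.
    residue_freq = Counter()
    for (a, b), n in pair_counts.items():
        residue_freq[a] += n / (2 * total_pairs)
        residue_freq[b] += n / (2 * total_pairs)

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = math.log2(observed / expected)
    return scores
```

Applied to the six example sequences with the trailing dots removed, a fully conserved column such as the E column produces a strongly positive self-score, while pairs seen less often than chance predicts score below zero.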

35 PAM versus BLOSUM
PAM
  – First useful scoring matrix for protein
  – Assumed a Markov Model of evolution (i.e. all sites equally mutable and independent)
  – Derived from small, closely related proteins with ~15% divergence
BLOSUM
  – Much later entry to matrix “sweepstakes”
  – No evolutionary model is assumed
  – Built from PROSITE derived sequence blocks
  – Uses much larger, more diverse set of protein sequences (30% - 90% ID)

36 PAM versus BLOSUM
PAM
  – Higher PAM numbers to detect more remote sequence similarities
  – Lower PAM numbers to detect high similarities
  – 1 PAM ~ 1 million years of divergence
  – Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM
  – Lower BLOSUM numbers to detect more remote sequence similarities
  – Higher BLOSUM numbers to detect high similarities
  – Sensitive to structural and functional substitution
  – Errors in BLOSUM arise from errors in alignment

37 PAM Matrices
PAM 40 - prepared by multiplying PAM 1 by itself a total of 40 times; best for short alignments with high similarity
PAM 120 - prepared by multiplying PAM 1 by itself a total of 120 times; best for general alignment
PAM 250 - prepared by multiplying PAM 1 by itself a total of 250 times; best for detecting distant sequence similarity
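"Multiplying PAM 1 by itself" is ordinary matrix exponentiation of a mutation probability matrix. The sketch below uses a toy two-letter alphabet with a 1% change per PAM unit, purely as an assumption for illustration; the real PAM 1 is a 20 × 20 matrix estimated from Dayhoff's data, and the result is converted to log-odds scores before use.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_power(matrix, n):
    """Return matrix multiplied by itself n times (identity when n == 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM 1": each residue has a 1% chance of changing per step (assumption).
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
```

Calling `mat_power(TOY_PAM1, 250)` shows why PAM 250 suits distant relationships: after 250 steps the probability of a residue still being itself has fallen close to the background level, so the matrix rewards plausible substitutions almost as much as identities. It also illustrates the slide's point that any error in PAM 1 is compounded by the repeated multiplication.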

38 BLOSUM Matrices
BLOSUM 90 - prepared from BLOCKS sequences clustered at 90% identity (sequences at least 90% identical are merged into one cluster before substitutions are counted); best for short alignments with high similarity
BLOSUM 62 - prepared from BLOCKS sequences clustered at 62% identity; best for general alignment (default)
BLOSUM 30 - prepared from BLOCKS sequences clustered at 30% identity; best for detecting weak local alignments

39 Scraping the Bottom of the Barrel with PSI-BLAST

40 PSI-BLAST Algorithm
1. Perform initial alignment with BLAST using BLOSUM 62 substitution matrix
2. Construct a multiple alignment from matches
3. Prepare position-specific scoring matrix (PSSM)
4. Use PSSM profile as the scoring matrix for a second BLAST run against database
5. Repeat steps 2-4 until convergence
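The step that turns a multiple alignment into a position-specific scoring matrix can be sketched as per-column log-odds scores. This is a simplified illustration assuming a gap-free alignment, a uniform background and a fixed pseudocount; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment against the profile, position by position."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

Unlike BLOSUM 62, which gives the same score to a given residue pair everywhere, the profile scores each residue differently at each position, which is what lets later iterations pick up more distant relatives.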

41 PSI-BLAST

42 PSI-BLAST: Press Iterate!

43 PSI-BLAST: Press Iterate!

44 PSI-BLAST

45 PSI-BLAST
For Protein Sequences ONLY
Much more sensitive than BLAST
Slower (iterative process)
Often yields results that are as good as many common threading methods
SHOULD BE YOUR FIRST CHOICE IN ANALYZING A NEW SEQUENCE

46 BLAST against PDB

47 Still Confused?
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

48 Conclusions
BLAST is the most important program in bioinformatics (maybe all of biology)
BLAST is based on sound statistical principles (key to its speed and sensitivity)
A basic understanding of its principles is key for using/interpreting BLAST output
Use BLASTN or MEGABLAST for DNA
Use PSI-BLAST for protein searches

