David Wishart David Wishart University of Alberta


1 BLAST
David Wishart, University of Alberta
February 18th, 2004
Lecture 3.1 (c) 2004 CGDN

2 Objectives
Gain familiarity with sequence searches and comparisons via web-based BLAST
Understand the BLAST algorithm
Understand the principles of BLAST scoring and BLAST statistics
Understand scoring matrices
Become aware of other BLAST services and applications

3 Let’s try an experiment...
Align:
ACDEAGHNKLM...
KKDEFGHPKLM...
SCDEFCHLKLM...
MCDEFGHNKLV...
ACDEFGHIKLM...
QCDEFGHAKLM...
AQQQFGHIKLPI...
WCDEFGHLKLM...
SMDEFAHVKLM...
ACDEFGFKKLM...

4 What kind of score distribution do you get?

5 What kind of distribution?
Gaussian? Poisson? Other?

6 Extreme Value Distribution
$P(x) = e^{-x}\,e^{-e^{-x}}$
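As a rough numerical check of the curve's shape, here is a minimal Python sketch. It assumes the standard (Gumbel) form of the extreme value density, P(x) = exp(-x) * exp(-exp(-x)), with location 0 and scale 1. The function name is illustrative and is not part of BLAST.

```python
import math

def extreme_value_pdf(x):
    """Standard extreme value (Gumbel) density, location 0 and scale 1 (assumed form)."""
    return math.exp(-x) * math.exp(-math.exp(-x))

# The curve is skewed: it peaks at x = 0 and has a longer tail on the high side.
for x in (-2, -1, 0, 1, 2, 4):
    print(f"x = {x:>2}  P(x) = {extreme_value_pdf(x):.4f}")
```

The longer right-hand tail is what separates this curve from a Gaussian: unusually high chance scores are more common than a symmetric bell curve would suggest.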

7 Why is this important?
If you can predict the usual score distribution before performing an alignment search, you can predict which alignments and which sequences will be worth aligning
Saves time!
Gives a significance value (not just a raw score) to sequence alignments

8 BLAST
Basic Local Alignment Search Tool
Developed in 1990 and 1997 (S. Altschul)
A heuristic method for performing local alignments through searches of high-scoring segment pairs (HSPs)
First to use statistics to predict the significance of initial matches, which saves on false leads
Offers both sensitivity and speed

9 BLAST
Looks for clusters of nearby or locally dense “similar or homologous” k-tuples
Uses “look-up” tables to shorten search time
Uses larger “word size” than FASTA to accelerate the search process
Performs local alignment only (gapped or ungapped), not global alignment
Fastest and most frequently used sequence alignment tool: THE STANDARD

10 Key References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:

11 BLAST Access
NCBI BLAST
Canadian Bioinformatics Resource BLAST
European Bioinformatics Institute BLAST


15 Different Flavours of BLAST
BLASTP - protein query against protein DB
BLASTN - DNA/RNA query against GenBank (DNA)
BLASTX - 6-frame translated DNA query against protein DB
TBLASTN - protein query against 6-frame GenBank translation
TBLASTX - 6-frame translated DNA query against 6-frame GenBank translation
PSI-BLAST - protein ‘profile’ query against protein DB
PHI-BLAST - protein pattern against protein DB
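The "6 frame" translation used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. This is an illustration only, not BLAST's own code: it assumes the standard genetic code, translates incomplete or ambiguous codons to X, and all function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate one reading frame; unknown codons become X, a trailing partial codon is dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def reverse_complement(dna):
    """Reverse-complement a DNA string (A<->T, C<->G)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame_translations(dna):
    """Three forward frames followed by three frames of the reverse complement."""
    rc = reverse_complement(dna)
    return [translate(strand[frame:]) for strand in (dna, rc) for frame in range(3)]

print(six_frame_translations("ATGGCCATTGTAATGGGCCGC"))
```

Each of the six protein strings is then searched like an ordinary protein sequence, which is why the translated flavours take longer than BLASTP or BLASTN.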

16 Other BLAST Services
MEGABLAST - for comparison of large sets of long DNA sequences
RPS-BLAST - Conserved Domain Detection
BLAST 2 Sequences - for performing pairwise alignments for 2 chosen sequences
Genomic BLAST - for alignments against select human, microbial or malarial genomes
VecScreen - for detecting cloning vector contamination in sequenced data

17 Running NCBI BLAST

18 MT0895
MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIMGRVASKEEIKKILS

19 Running NCBI BLAST
Paste in sequence (FASTA format, raw sequence or type in GI or accession number)
>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
>
KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS
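The three accepted input forms (named FASTA header, bare ">" header, raw sequence) can all be reduced to a header and a sequence. The sketch below is a hypothetical helper showing one way to do that in Python. It is not how the NCBI server parses input, and it does not handle GI or accession numbers.

```python
def parse_query(text):
    """Split pasted input into (header, sequence); the header is empty for raw sequence."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = ""
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()   # text after ">" (may be empty)
        lines = lines[1:]
    return header, "".join(lines).upper()

sequence = "KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS"
for pasted in (">Mysequence MT0895\n" + sequence, ">\n" + sequence, sequence):
    print(parse_query(pasted))
```

All three inputs yield the same sequence string; only the header differs.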

20 Running NCBI BLAST
Choose a range of interest in the sequence with “set subsequences” (not usually used)
Select the database from the pull-down menu (usually choose nr = non-redundant)
Keep the CD Search “check box” on
Leave “Options” unchanged (use defaults)
Go to the “Format” menu and adjust the number of descriptions and alignments as desired

21 Running NCBI BLAST
Select Database

22 Conserved Domain Database
Contains a collection of pre-identified functional or structural domains
Derived from Pfam and SMART databases as well as other sources
Uses Reverse Position Specific BLAST (RPS-BLAST) to perform the search
Query sequence is compared to a PSSM derived from each of the aligned domains

23 Running NCBI BLAST
Click BLAST!

24 Formatting Results

25 BLAST Format Options

26 BLAST Output

27 BLAST Output

28 BLAST Output

29 BLAST Output

30 BLAST Output

31 BLAST Output

32 BLAST Parameters
Identities - No. & % of exact residue matches
Positives - No. & % of similar plus identical matches
Gaps - No. & % of gaps introduced
Score - Summed HSP score (S)
Bit Score - a normalized score (S’)
Expect (E) - Expected number of chance HSP alignments
P - Probability of getting a score > X
T - Minimum word or k-tuple score (Threshold)

33 BLAST - Rules of Thumb
Expect (E-value) is equal to the number of BLAST alignments with a given Score that are expected to be seen simply due to chance
Don’t trust a BLAST alignment with an Expect score > 0.01 (Grey zone is between )
Expect and Score are related, but Expect contains more information. Note that %Identities is more useful than the bit Score
Recall Doolittle’s Curve (%ID vs. Length, next slide): %ID > 30 - numres/50
If uncertain about a hit, perform a PSI-BLAST search

34 Doolittle’s Curve
Twilight Zone

35 BLAST Statistics

36 Extreme Value Distribution

37 Extreme Value Distribution
$Kmn\,e^{-\lambda S}$ is called Expect or E-value
In BLAST E = 10 so P =
If E is small (<0.01) then P is small
If Matches = 1 and Mismatches = -1 then: $\lambda = \ln(q/p)$ and $K = (q-p)^2/q$
p = probability of match = 0.05
q = probability of not match = 0.95
Then $\lambda = 2.94$ and $K = 0.85$
m = length of sequence & n = length of database
S = score for given HSP
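The arithmetic on this slide can be reproduced with a short Python sketch. It uses the slide's formulas for lambda and K (valid for the +1/-1 scoring case stated there) and assumes the usual relation between the P-value and E, P = 1 - exp(-E), which the slide does not spell out. Function names are illustrative.

```python
import math

def karlin_altschul_params(p_match):
    """lambda = ln(q/p) and K = (q-p)^2/q for match = +1, mismatch = -1 scoring."""
    q = 1.0 - p_match
    lam = math.log(q / p_match)
    k = (q - p_match) ** 2 / q
    return lam, k

def expect_value(k, lam, m, n, score):
    """E = K * m * n * exp(-lambda * S): chance HSPs expected with score >= S."""
    return k * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance HSP, assuming P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e)

lam, k = karlin_altschul_params(0.05)
print(f"lambda = {lam:.2f}, K = {k:.2f}")
print(f"P at E = 10:   {p_value(10):.5f}")
print(f"P at E = 0.01: {p_value(0.01):.5f}")
```

For small E the two quantities are nearly equal (P is about E), which is why "E is small" and "P is small" can be used interchangeably below 0.01.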

38 How Does BLAST Really Work?

39 BLAST Algorithm
Query: TPQGQRQGQ…..
Query words: TPQ PQG QGQ GQR QRQ RQG …
Look-up table of words, each with its similar words:
AAA: AGA AAG GAA GAG
AAC: AGC GAC
AAD: AAN AAE AAQ
…
PQG: PEG PRG PMG
QGQ: QGM MGQ QAQ QGN
…
YYY
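The word-splitting and look-up steps can be sketched in Python as below. This is a simplified illustration using exact word matches only; a real protein BLAST search also adds the similar (neighbourhood) words that score above the threshold T. The function names are hypothetical, and the database string is the example sequence from the following slides.

```python
from collections import defaultdict

def words(sequence, size=3):
    """All overlapping words (k-tuples) of the given size."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

def build_lookup(query, size=3):
    """Map each query word to the list of query positions where it occurs."""
    table = defaultdict(list)
    for position, word in enumerate(words(query, size)):
        table[word].append(position)
    return table

def find_hits(query, subject, size=3):
    """Scan the subject once, returning (query_pos, subject_pos) for each shared word."""
    table = build_lookup(query, size)
    hits = []
    for s_pos, word in enumerate(words(subject, size)):
        for q_pos in table.get(word, []):
            hits.append((q_pos, s_pos))
    return hits

query = "TPQGQRQGQ"
subject = "CTVTPMGQREAE"
print(words(query))
print(find_hits(query, subject))
```

Because the table is keyed by word, the database is scanned only once no matter how many words the query contains, which is where the look-up table saves time.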

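40 BLAST Algorithm
Query: TPQGQRQGQ…..
Look-up table words: AAA AAC AAD ... PQG QGQ ... YYY
Similar words: AGA AGC AAN … PEG QGM, AAG GAC AAE … PRG MGQ, GAA AAQ … PMG QAQ
Database: CTVTPMGQREAE…
[Figure: look-up words matched against the database sequence to locate an HSP]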

41 High-scoring Segment Pairs
Word scores: PQG 18, PEG 15, PRG 14, PMG 14, PNG 13, PDG 13, PQG 12, etc.
Query: LNKCKTPQGQRQGQQWIKQPLMDKN
       L    TP+GQR++++W+  P+ D
Sbjct: LDCTVTPMGQREAERWLHMPVRDTR

42 Extending HSP’s
$E = kN e^{-\lambda S}$: number of HSP’s found purely by chance
[Figure: cumulative score versus extension (# aa), with X, T and S marked]
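The extension step, where the cumulative score is tracked and extension stops once it falls too far below the best score seen (the drop-off X in the plot), can be sketched as follows. This is a simplified one-directional version with a hypothetical +1/-1 scoring scheme and an arbitrary drop-off value; BLAST itself scores with a substitution matrix and extends in both directions.

```python
def extend_right(query, subject, q_start, s_start, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension to the right; returns (best_score, length_at_best_score)."""
    score = best = best_length = 0
    offset = 0
    while q_start + offset < len(query) and s_start + offset < len(subject):
        same = query[q_start + offset] == subject[s_start + offset]
        score += match if same else mismatch
        offset += 1
        if score > best:
            best, best_length = score, offset      # new high-water mark
        elif best - score > x_drop:
            break                                  # fell more than X below the best
    return best, best_length

# Seed found by the word look-up: GQR at query position 3, database position 6.
print(extend_right("TPQGQRQGQ", "CTVTPMGQREAE", 3, 6))
```

Reporting the length at the best score, not the length at the stopping point, trims the trailing mismatches that triggered the stop.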

43 Visualizing HSP’s
[Figure: dot plot with the query T P Q G Q R Q G Q on one axis and the database sequence C T V P M G Q R on the other]

44 Connecting HSP’s

45 The Final Result

46 Getting the Most from BLAST

47 BLAST Options

48 BLAST Options
Composition-based statistics (Yes)
Sequence Complexity Filter (Yes)
Expect (E) value (10)
Word Size (3)
Substitution or Scoring Matrix (Blosum62)
Gap Insertion Penalty (11)
Gap Extension Penalty (1)

49 Composition Statistics
Recent addition to BLAST algorithm
Permits calculated E (Expect) values to account for amino acid composition of queries and database hits
Improves accuracy and reduces false positives
Effectively conducts a different scoring procedure for each sequence in database

50 LCR’s (low complexity)
Watch out for…
transmembrane or signal peptide regions
coiled-coil regions
short amino acid repeats (collagen, elastin)
homopolymeric repeats
BLAST uses SEG to mask amino acids
BLAST uses DUST to mask bases
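To give a feel for what a complexity filter does, here is a toy Python sketch that masks windows with low Shannon entropy. It is not the SEG or DUST algorithm; the window length, the entropy cut-off and the function names are assumptions chosen only for illustration.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in Counter(window).values())

def mask_low_complexity(sequence, window=12, min_entropy=2.2):
    """Replace every residue that falls in a low-entropy window with X (toy filter)."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

# A homopolymeric glutamine run embedded in otherwise varied sequence.
print(mask_low_complexity("MKTAYIAKQRQQQQQQQQQQQQQQLSDEVWHGCNP"))
```

Masked residues cannot seed word hits, so repeats like the glutamine run above stop producing large numbers of biologically meaningless matches.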

51 Scoring Matrices
BLOSUM Matrices
Developed by Henikoff & Henikoff (1992)
BLOcks SUbstitution Matrix
Derived from the BLOCKS database
PAM Matrices
Developed by Schwartz and Dayhoff (1978)
Point Accepted Mutation
Derived from manual alignments of closely related proteins

52 How to Make Your Own Matrix
Perform alignment -> calculate frequencies -> fill substitution matrix
ACDEFGH..
ACDEFGK..
AADEFGH..
GCDEFGH..
ACAEYGK..
ACAEFAH..
f(A,A) = #A_obs / #A_exp
f(C,A) = #C/A_obs / (#A_exp + #C_exp)
[Figure: substitution matrix with rows and columns A, C, D, E, ...]

53 How to Make a PAM Matrix
Multiply the PAM 1 matrix by itself N times to make PAM “N”
Take the log
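The two steps (repeated matrix multiplication, then a logarithm) can be sketched in Python on a toy example. The two-letter alphabet, the PAM 1 values and the background frequencies below are made up for illustration, and the log-odds scaling (10 x log10 of probability over background frequency) is one common convention assumed here, not something the slide specifies.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(matrix, power):
    """Multiply a matrix by itself `power` times, starting from the identity."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result

def log_odds(matrix, freqs):
    """Score = 10 * log10(mutation probability / background frequency) (assumed scaling)."""
    n = len(matrix)
    return [[10 * math.log10(matrix[i][j] / freqs[j]) for j in range(n)] for i in range(n)]

toy_pam1 = [[0.99, 0.01], [0.01, 0.99]]   # hypothetical 2-letter alphabet, 1% change
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250)
print(log_odds(toy_pam250, [0.5, 0.5]))
```

After 250 multiplications the rows approach the background frequencies, so the scores shrink toward zero: a high-numbered PAM matrix penalizes substitutions less, which suits distant comparisons.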

54 PAM versus BLOSUM
PAM
First useful scoring matrix for protein
Assumed a Markov model of evolution (i.e. all sites equally mutable and independent)
Derived from small, closely related proteins with ~15% divergence
BLOSUM
Much later entry to matrix “sweepstakes”
No evolutionary model is assumed
Built from PROSITE-derived sequence blocks
Uses much larger, more diverse set of protein sequences (30% - 90% ID)

55 PAM versus BLOSUM
PAM
Higher PAM numbers to detect more remote sequence similarities
Lower PAM numbers to detect high similarities
1 PAM ~ 1 million years of divergence
Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM
Lower BLOSUM numbers to detect more remote sequence similarities
Higher BLOSUM numbers to detect high similarities
Sensitive to structural and functional substitution
Errors in BLOSUM arise from errors in alignment

56 PAM Matrices
PAM 40 - prepared by multiplying PAM 1 by itself a total of 40 times; best for short alignments with high similarity
PAM 120 - prepared by multiplying PAM 1 by itself a total of 120 times; best for general alignment
PAM 250 - prepared by multiplying PAM 1 by itself a total of 250 times; best for detecting distant sequence similarity

57 PAM250

58 BLOSUM Matrices
BLOSUM 90 - prepared from BLOCKS sequences with >90% sequence ID; best for short alignments with high similarity
BLOSUM 62 - prepared from BLOCKS sequences with >62% sequence ID; best for general alignment (default)
BLOSUM 30 - prepared from BLOCKS sequences with >30% sequence ID; best for detecting weak local alignments

59 BLOSUM62

60 Scraping the Bottom of the Barrel with PSI-BLAST

61 PSI-BLAST Algorithm
1. Perform initial alignment with BLAST using BLOSUM 62 substitution matrix
2. Construct a multiple alignment from matches
3. Prepare position specific scoring matrix (PSSM)
4. Use PSSM profile as the scoring matrix for a second BLAST run against database
5. Repeat steps 2-4 until convergence
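The PSSM preparation step can be sketched in Python as a per-column log-odds calculation, log2(q/p), where q is the observed frequency of a residue in an alignment column and p is its background frequency. This is a minimal illustration and not PSI-BLAST's actual procedure: it assumes an ungapped alignment, a uniform background of 1/20 per residue, and a simple pseudocount of 1 to avoid taking the log of zero. The example alignment is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2(q/p)} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)            # assumed uniform background p
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

alignment = ["ACDEFGH", "ACDEFGK", "AADEFGH", "GCDEFGH", "ACAEYGK", "ACAEFAH"]
pssm = build_pssm(alignment)
for position, scores in enumerate(pssm, start=1):
    best = max(scores, key=scores.get)
    print(position, best, round(scores[best], 2))
```

Unlike BLOSUM 62, which gives a residue pair one score everywhere, the PSSM scores the same residue differently at each position, which is the source of PSI-BLAST's extra sensitivity.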

62 Position Specific Scoring Matrix (PSSM)
$\langle e \rangle_i = \log_2(q_i/p_i)$

63 PSI-BLAST

64 PSI-BLAST
Press Iterate!

65 PSI-BLAST
Press Iterate!

66 PSI-BLAST

67 PSI-BLAST
For protein sequences ONLY
Much more sensitive than BLAST
Slower (iterative process)
Often yields results that are as good as many common threading methods
SHOULD BE YOUR FIRST CHOICE IN ANALYZING A NEW SEQUENCE

68 BLAST against PDB

69 Still Confused?

70 Conclusions
BLAST is the most important program in bioinformatics (maybe all of biology)
BLAST is based on sound statistical principles (key to its speed and sensitivity)
A basic understanding of its principles is key for using/interpreting BLAST output
Use BLASTN or MEGABLAST for DNA
Use PSI-BLAST for protein searches

