1
Slide 1 EE3J2 Data Mining Lecture 20 Sequence Analysis 2: BLAST Algorithm Ali Al-Shahib
2
Slide 2 Protein Structures
3
Slide 3 Protein Structures
4
Slide 4 Background One method of finding the function of an unknown protein (X) is to compare its sequence with that of a known protein (Y). If the sequences match to a sufficient degree, we can infer that X is likely to have a similar function to Y. Protein sequences are written with 20 different types of ‘letters’. In biology these are known as amino acids.
5
Slide 5 BLAST One method of performing this sequence comparison is the Basic Local Alignment Search Tool (BLAST). First published in 1990, with a gapped version following in 1997 (S. Altschul et al.). A heuristic method for performing local alignments by searching for high-scoring segment pairs (HSPs). An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and whose alignment score meets or exceeds a cutoff score. This cutoff is separate from the neighbourhood score threshold T, which applies to individual words.
6
Slide 6 Why Is BLAST Useful? hereditary non-polyposis colon cancer gene sequence A Lindblom et al (1993) Nat Genet 5:279
7
Slide 7 BLASTing a Sequence http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
8
Slide 8 BLAST Results hereditary non-polyposis colon cancer DNA mismatch repair protein
9
Slide 9 Another Example
Database: swissprot: 86,593 sequences; 31,411,157 total letters
                                                                  Score     E
Sequences producing significant alignments:                       (bits)  Value
SW:HBB_HUMAN   P02023  HEMOGLOBIN BETA CHAIN. (human)              306    2e-83
SW:HBB_GORGO   P02024  HEMOGLOBIN BETA CHAIN. (gorilla)            305    4e-83
SW:HBB2_PANLE  P18988  HEMOGLOBIN BETA-2 CHAIN. (lion)             302    3e-82
SW:HBB_HYLLA   P02025  HEMOGLOBIN BETA CHAIN. (gibbon)             300    8e-82
SW:HBB_PREEN   P02032  HEMOGLOBIN BETA CHAIN. (Hanuman langur)     298    5e-81
SW:HBB_COLPO   P19885  HEMOGLOBIN BETA CHAIN. (Colobus)            295    3e-80
SW:HBB_CERAE   P02028  HEMOGLOBIN BETA CHAIN. (Green monkey)       295    3e-80
SW:HBB_MACFU   P02027  HEMOGLOBIN BETA CHAIN. (Japanese macaque)   293    2e-79
SW:HBB_CALAR   P18985  HEMOGLOBIN BETA CHAIN. (Marmoset)           292    2e-79
SW:HBB_ATEGE   P02034  HEMOGLOBIN BETA CHAIN. (Spider monkey)      292    2e-79
SW:HBB_MANSP   P08259  HEMOGLOBIN BETA CHAIN. (Mandrill)           291    4e-79
…
SW:HBB1_RAT    P02091  HEMOGLOBIN BETA CHAIN, (Rat)                255    4e-68
SW:HBB_ERIEU   P02059  HEMOGLOBIN BETA CHAIN. (Hedgehog)           252    2e-67
SW:HBB_PANPO   P04244  HEMOGLOBIN BETA CHAIN. (Leopard)            251    5e-67
SW:HBB_BISBO   P09422  HEMOGLOBIN BETA CHAIN. (Bison)              251    5e-67
10
Slide 10 BLAST Parameters
Identities - number and percentage of exact residue matches
Positives - number and percentage of similar or identical residue matches
Gaps - number and percentage of gaps introduced
Score - summed HSP score (S)
Bit Score - a normalized score (S’)
Expect (E) - expected number of HSP alignments arising by chance
P - probability of getting a score > X by chance
T - minimum word or k-tuple score (threshold)
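The raw score S, the bit score S’ and the E value are usually related through the Karlin-Altschul forms S’ = (λS − ln K) / ln 2 and E = m·n·2^(−S’), where m is the query length and n the database length. The sketch below assumes those forms; the function and parameter names are hypothetical, and λ and K depend on the scoring system, so they must be supplied by the caller.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw HSP score S into a bit score S'.

    lam and k are the statistical parameters of the scoring system
    (assumed to be known; they are not given on the slides).
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance HSPs scoring at least `bits` bits."""
    return query_len * db_len * 2.0 ** (-bits)


# Each extra bit of score halves the expected number of chance hits.
print(expect_value(10, query_len=1024, db_len=1))
print(expect_value(11, query_len=1024, db_len=1))
```

BLAST itself applies length corrections before computing E, so values reported by the tool will differ somewhat from this simplified calculation.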
11
Slide 11 Different Flavours of BLAST
BLASTP - protein query against protein database
BLASTN - DNA/RNA query against GenBank (DNA)
BLASTX - DNA query translated in all six frames against protein database
TBLASTN - protein query against GenBank translated in all six frames
TBLASTX - DNA query translated in all six frames against GenBank translated in all six frames
PSI-BLAST - protein ‘profile’ query against protein database
PHI-BLAST - protein pattern against protein database
12
Slide 12 BLAST Algorithm Source: NCBI
13
Slide 13 A Question Question: Given the protein sequence SLAALLNKCKTPQGQRLVNQW and the word length L = 3, explain how the BLAST algorithm is used to find the highest-scoring alignments between this query and the sequences in a database.
14
Slide 14 Answer: Explaining the BLAST Algorithm
1. The query sequence is split into words of a defined length. A list of all words of length L = 3 in the query protein sequence is made, starting with positions 1, 2 and 3; then 2, 3 and 4; and so on.
Our query sequence: SLAALLNKCKTPQGQRLVNQW
SLA, LAA, AAL, ALL, LLN, LNK, NKC, KCK, CKT, KTP, TPQ, PQG, QGQ, GQR, QRL, RLV, LVN, VNQ, NQW
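Step 1 is a sliding window over the query. A minimal sketch follows; the function name is hypothetical. A sequence of length n yields n − L + 1 overlapping words, so the 21-residue query gives 19 words, including KTP and TPQ.

```python
def split_into_words(sequence, word_len=3):
    """Return every overlapping word of length word_len, left to right."""
    return [sequence[i:i + word_len]
            for i in range(len(sequence) - word_len + 1)]


words = split_into_words("SLAALLNKCKTPQGQRLVNQW")
print(len(words))
print(", ".join(words))
```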
15
Slide 15 BLAST Algorithm (continued)
2. Define a threshold alignment score T (the neighbourhood score threshold).
3. Find all word pairs of length L with score ≥ T, e.g. find all w such that S(w, PQG) ≥ T. In other words, each word in the query sequence is scored against every possible combination of three amino acids. This is done using a scoring matrix (e.g. BLOSUM62).
Note: There are a total of 20 x 20 x 20 = 8,000 possible match scores for each word.
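As a worked example of the word score, assume S(w, q) is the sum of the per-position substitution scores s(a, b) taken from the scoring matrix. With the BLOSUM62 entries s(P,P) = 7, s(E,Q) = 2 and s(G,G) = 6, the candidate word PEG scores 15 against the query word PQG:

```latex
S(w, q) = \sum_{i=1}^{L} s(w_i, q_i),
\qquad
S(\mathrm{PEG}, \mathrm{PQG}) = s(P,P) + s(E,Q) + s(G,G) = 7 + 2 + 6 = 15
```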
16
Slide 16 BLAST Algorithm (continued)
Neighbourhood words to PQG:
PQG 18
PEG 15
PRG 14
PKG 14
PDG 13
PHG 13
PMG 13
PSG 13
--- Neighbourhood score threshold (T = 13) ---
PQA 12
PQN 12
Words scoring at or above the threshold are the neighbourhood words.
Note: This procedure is repeated for each three-letter word in the query sequence.
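The neighbourhood of PQG can be enumerated by scoring all 8,000 three-letter words and keeping those at or above T. The sketch below assumes the BLOSUM62 rows for P, Q and G as transcribed here (an excerpt, not the full matrix, and worth checking against an authoritative copy); the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the query residues P, Q and G only (excerpt).
ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3,
          -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3,
          -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4,
          -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}
SCORE = {(q, a): s
         for q, row in ROWS.items()
         for a, s in zip(AMINO_ACIDS, row)}


def word_score(query_word, candidate):
    """Sum the substitution scores position by position."""
    return sum(SCORE[q, c] for q, c in zip(query_word, candidate))


def neighbourhood(query_word, threshold):
    """All candidate words scoring >= threshold, best first."""
    scored = []
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            scored.append((candidate, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


for word, score in neighbourhood("PQG", 13):
    print(word, score)
```

With the rows as excerpted here, PNG also reaches 13, although it is not on the slide's list. Raising the threshold passed to neighbourhood() shortens the list, which is the trade-off between speed and sensitivity.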
17
Slide 17 BLAST Algorithm (continued)
4. Now search the database for all ‘hits’: sequences with exact matches to each w.
5. Extend the alignment of each ‘hit’ in both directions while the score increases, producing high-scoring segment pairs (HSPs), which are locally optimal ungapped alignments.
6. Return sequences with HSPs whose scores are statistically significantly higher than a threshold Smax. Smax is obtained empirically from random sequences.
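Steps 4 and 5 can be sketched as a seed-and-extend loop. This is a simplified illustration, not the NCBI implementation: it uses a toy match/mismatch score instead of BLOSUM62, and it stops extending once the running score falls more than a hypothetical x_drop below the best score seen, which tolerates short dips (x_drop=0 stops at the first pair that lowers the score). All names are assumptions.

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Toy scoring (an assumption); a real search looks up a matrix."""
    return match if a == b else mismatch


def find_hits(neighbourhood, subject, word_len=3):
    """Step 4: (query_offset, subject_offset) for each subject word that
    exactly matches a neighbourhood word of some query position.

    neighbourhood maps a query offset to its set of neighbourhood words.
    """
    hits = []
    for s_off in range(len(subject) - word_len + 1):
        word = subject[s_off:s_off + word_len]
        for q_off, words in neighbourhood.items():
            if word in words:
                hits.append((q_off, s_off))
    return hits


def _extend(query, subject, q, s, step, x_drop):
    """Walk from (q, s) in direction step; return (best_gain, best_len)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += pair_score(query[q], subject[s])
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break
        q += step
        s += step
    return best_gain, best_len


def extend_hit(query, subject, q_off, s_off, word_len=3, x_drop=8):
    """Step 5: ungapped extension of one hit in both directions."""
    seed = sum(pair_score(query[q_off + i], subject[s_off + i])
               for i in range(word_len))
    right_gain, right_len = _extend(
        query, subject, q_off + word_len, s_off + word_len, 1, x_drop)
    left_gain, left_len = _extend(
        query, subject, q_off - 1, s_off - 1, -1, x_drop)
    q_start, s_start = q_off - left_len, s_off - left_len
    span = left_len + word_len + right_len
    return (seed + left_gain + right_gain,
            query[q_start:q_start + span],
            subject[s_start:s_start + span])


query, subject = "KKCDEFGKK", "WWCDEFGYY"
for q_off, s_off in find_hits({3: {"DEF"}}, subject):
    print(extend_hit(query, subject, q_off, s_off))
```

Step 6, the significance test against Smax, is left out because it depends on statistics that the slides do not specify.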
18
Slide 18 BLAST Algorithm (continued) So…
SLAALLNKCKTPQGQRLVNQW
+LA++L+   TP G R++ +W
TLASVLDCTVTPMGSRMLKRW
High-scoring segment pairs
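The middle line of the alignment follows the usual BLAST convention: the letter itself for an identical pair, ‘+’ for a non-identical pair with a positive matrix score, and a space otherwise. The sketch below assumes that convention and the BLOSUM62 values as transcribed here (worth checking against an authoritative copy before reuse); the function names are hypothetical.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62, one row per residue, columns in AMINO_ACIDS order.
BLOSUM62_ROWS = """
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

BLOSUM62 = {}
for line in BLOSUM62_ROWS.strip().splitlines():
    row_aa, *values = line.split()
    for col_aa, value in zip(AMINO_ACIDS, values):
        BLOSUM62[row_aa, col_aa] = int(value)


def midline(query, subject):
    """Build the match line for two ungapped, equal-length segments."""
    if len(query) != len(subject):
        raise ValueError("segments must have equal length")
    marks = []
    for a, b in zip(query, subject):
        if a == b:
            marks.append(a)            # identity
        elif BLOSUM62[a, b] > 0:
            marks.append("+")          # positive (similar) substitution
        else:
            marks.append(" ")          # zero or negative score
    return "".join(marks)


def alignment_score(query, subject):
    """Sum of per-position matrix scores for an ungapped alignment."""
    return sum(BLOSUM62[a, b] for a, b in zip(query, subject))


q = "SLAALLNKCKTPQGQRLVNQW"
s = "TLASVLDCTVTPMGSRMLKRW"
print(q)
print(midline(q, s))
print(s)
print(alignment_score(q, s))
```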
19
Slide 19 BLAST Algorithm (continued)
7. Varying the threshold alignment score T:
Search time decreases as T is increased, because fewer word pairs are found.
Sensitivity of the search also decreases as T is increased, because word pairs are overlooked and homologous (or similar) sequences may be discarded.
Note: Both the alignment score Smax and its associated statistical significance are required to assess whether homology is suggested.
20
Slide 20 Conclusions Protein Sequences BLAST Algorithm