Slide 1 EE3J2 Data Mining Lecture 20 Sequence Analysis 2: BLAST Algorithm Ali Al-Shahib.

Protein Structures

Background
• One method of finding the function of an unknown protein (X) is to compare its sequence with the sequence of a known protein (Y).
• If the sequences match to a sufficient degree, we can infer that X is likely to have a similar function to Y.
• Protein sequences are written with 20 different types of ‘letters’. In biology these are known as amino acids.

BLAST
• One method of performing this sequence comparison is the Basic Local Alignment Search Tool (BLAST).
• Developed in 1990 and 1997 (S. Altschul).
• A heuristic method for performing local alignments by searching for high-scoring segment pairs (HSPs).
• An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and whose alignment score meets or exceeds a cutoff score. (This cutoff applies to whole HSPs; the neighbourhood score threshold T is a separate parameter that applies to individual words.)

Why Is BLAST Useful?
hereditary non-polyposis colon cancer gene sequence
A Lindblom et al (1993) Nat Genet 5:279

BLASTing a Sequence
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi

BLAST Results
hereditary non-polyposis colon cancer DNA mismatch repair protein

Another Example
Database: swissprot: 86,593 sequences; 31,411,157 total letters

                                                                  Score  E
Sequences producing significant alignments:                       (bits) Value
SW:HBB_HUMAN   P02023  HEMOGLOBIN BETA CHAIN. (human)             306    2e-83
SW:HBB_GORGO   P02024  HEMOGLOBIN BETA CHAIN. (gorilla)           305    4e-83
SW:HBB2_PANLE  P18988  HEMOGLOBIN BETA-2 CHAIN. (lion)            302    3e-82
SW:HBB_HYLLA   P02025  HEMOGLOBIN BETA CHAIN. (gibbon)            300    8e-82
SW:HBB_PREEN   P02032  HEMOGLOBIN BETA CHAIN. (Hanuman langur)    298    5e-81
SW:HBB_COLPO   P19885  HEMOGLOBIN BETA CHAIN. (Colobus)           295    3e-80
SW:HBB_CERAE   P02028  HEMOGLOBIN BETA CHAIN. (Green monkey)      295    3e-80
SW:HBB_MACFU   P02027  HEMOGLOBIN BETA CHAIN. (Japanese macaque)  293    2e-79
SW:HBB_CALAR   P18985  HEMOGLOBIN BETA CHAIN. (Marmoset)          292    2e-79
SW:HBB_ATEGE   P02034  HEMOGLOBIN BETA CHAIN. (Spider monkey)     292    2e-79
SW:HBB_MANSP   P08259  HEMOGLOBIN BETA CHAIN. (Mandrill)          291    4e-79
…
SW:HBB1_RAT    P02091  HEMOGLOBIN BETA CHAIN. (Rat)               255    4e-68
SW:HBB_ERIEU   P02059  HEMOGLOBIN BETA CHAIN. (Hedgehog)          252    2e-67
SW:HBB_PANPO   P04244  HEMOGLOBIN BETA CHAIN. (Bison)             251    5e-67
SW:HBB_BISBO   P09422  HEMOGLOBIN BETA CHAIN. (Leopard)           251    5e-67

BLAST Parameters
• Identities - number and percentage of exact residue matches
• Positives - number and percentage of similar and identical residue matches
• Gaps - number and percentage of gaps introduced
• Score - summed HSP score (S)
• Bit Score - a normalized score (S’)
• Expect (E) - expected number of HSP alignments arising by chance
• P - probability of getting a score > X
• T - minimum word or k-tuple score (threshold)
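To make the link between Score (S), Bit Score (S’) and Expect (E) concrete, here is a minimal sketch of the standard Karlin-Altschul relations, S’ = (λS − ln K) / ln 2 and E = m · n · 2^(−S’). The slides do not give these formulas; the function names are hypothetical, and λ and K are statistical parameters that depend on the scoring matrix and must be supplied by the caller.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw HSP score S into a bit score S'.

    lam and k are the Karlin-Altschul parameters for the scoring
    system in use (assumed to be supplied by the caller).
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance HSPs with at least this bit score.

    query_len (m) and db_len (n) are lengths in residues.
    """
    return query_len * db_len * 2.0 ** (-bits)
```

Because the bit score already absorbs λ and K, two searches that use different scoring matrices can be compared on bit score, while E also depends on the size of the query and database.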

Different Flavours of BLAST
• BLASTP - protein query against protein DB
• BLASTN - DNA/RNA query against GenBank (DNA)
• BLASTX - 6-frame translated DNA query against protein DB
• TBLASTN - protein query against 6-frame translation of GenBank
• TBLASTX - 6-frame translated DNA query against 6-frame translation of GenBank
• PSI-BLAST - protein ‘profile’ query against protein DB
• PHI-BLAST - protein pattern against protein DB

BLAST Algorithm Source: NCBI

A Question
Question: Given the protein sequence SLAALLNKCKTPQGQRLVNQW and the word length L = 3, explain how the BLAST algorithm is used to find the highest-scoring alignments between this sequence and the sequences in a database.

Answer: Explaining the BLAST Algorithm
1. The query sequence must be split into words of a defined length. A list of words of length 3 (L) in the query protein sequence is made, starting with positions 1, 2 and 3; then 2, 3 and 4; etc.
Our query sequence: SLAALLNKCKTPQGQRLVNQW
SLA, LAA, AAL, ALL, LLN, LNK, NKC, KCK, CKT, KTP, TPQ, PQG, QGQ, GQR, QRL, RLV, LVN, VNQ, NQW
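Step 1 is a sliding window over the query. A minimal sketch, assuming the sequence is held as a plain string (the function name is hypothetical):

```python
def split_into_words(sequence, word_len=3):
    """Return every overlapping word of length word_len, in order."""
    if word_len <= 0:
        raise ValueError("word_len must be positive")
    # A sequence of length n contains n - word_len + 1 overlapping words.
    return [sequence[i:i + word_len]
            for i in range(len(sequence) - word_len + 1)]


words = split_into_words("SLAALLNKCKTPQGQRLVNQW", 3)
```

For the 21-residue query this gives 21 - 3 + 1 = 19 words, from SLA through to NQW.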

BLAST Algorithm (continued)
2. Define a threshold alignment score T (neighbourhood score threshold).
3. Find all word pairs of length L with score ≥ T, e.g. find all w such that S(w, PQG) ≥ T. In other words, each word in the query sequence is scored against every possible combination of three amino acids. This is done using a scoring matrix (e.g. BLOSUM 62).
Note: There are 20 x 20 x 20 = 8,000 possible three-letter words, and so 8,000 possible match scores for each query word.

BLAST Algorithm (continued)
Neighbourhood words to PQG

Word  Score
PQG   18
PEG   15
PRG   14
PKG   14
PDG   13
PHG   13
PMG   13
PSG   13
----- Neighbourhood score threshold (T = 13) -----
PQA   12
PQN   12

The words scoring at or above the threshold (PQG to PSG) are the neighbourhood words.
Note: This procedure is repeated for each three-letter word in the query sequence.
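The scores for the PQG example can be reproduced by summing per-residue substitution scores. The sketch below is illustrative only: it hard-codes just the handful of BLOSUM62 entries this example needs rather than the full matrix, and it scores only the candidate words listed on the slide rather than all 8,000. The names `PAIR_SCORES`, `word_score` and `neighbourhood` are hypothetical.

```python
# Only the BLOSUM62 entries needed for the PQG example (not the full matrix).
PAIR_SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1,
    ("Q", "D"): 0, ("Q", "H"): 0, ("Q", "M"): 0, ("Q", "S"): 0,
    ("G", "A"): 0, ("G", "N"): 0,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]  # raises KeyError for pairs not listed above


def word_score(word1, word2):
    """Score two equal-length words position by position."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))


def neighbourhood(query_word, candidates, threshold):
    """Keep the candidate words whose score against query_word is >= threshold."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]


CANDIDATES = ["PQG", "PEG", "PRG", "PKG", "PDG",
              "PHG", "PMG", "PSG", "PQA", "PQN"]
neighbours = neighbourhood("PQG", CANDIDATES, threshold=13)
```

With T = 13 the two 12-point words, PQA and PQN, are dropped. A full implementation would score every three-letter word against a complete substitution matrix.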

BLAST Algorithm (continued)
4. Now search the database for all ‘hits’: sequences with exact matches to each w.
5. Extend the alignment of each hit in both directions while the score increases, producing high-scoring segment pairs (HSPs), which are locally optimal ungapped alignments.
6. Return sequences with HSPs whose scores are statistically significantly higher than a threshold Smax. Smax is obtained empirically from random sequences.
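The extension in step 5 can be sketched as follows. This is a simplified illustration, not the NCBI implementation. It assumes a toy identity scoring function (+1 match, -1 mismatch) in place of a substitution matrix. It also adds a drop-off allowance `x_drop`, which the slides do not mention, so that a single poor position does not end the extension at once. All names and the example sequences are hypothetical.

```python
def identity_score(a, b):
    """Toy scoring (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def extend_one_direction(query, subject, q_pos, s_pos, step, score_fn, x_drop):
    """Extend without gaps from (q_pos, s_pos) in direction step (+1 or -1).

    Returns (best_gain, best_length): the best score gained and the number
    of positions needed to reach it.
    """
    running = best = 0
    length = best_len = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += score_fn(query[q_pos], subject[s_pos])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has fallen too far below the best seen so far
        q_pos += step
        s_pos += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, word_len, score_fn, x_drop):
    """Grow a word hit into an ungapped HSP.

    Returns (score, q_lo, q_hi, s_lo), where query[q_lo:q_hi] aligns with
    the subject segment of the same length starting at s_lo.
    """
    seed = sum(score_fn(query[q_start + i], subject[s_start + i])
               for i in range(word_len))
    right_gain, right_len = extend_one_direction(
        query, subject, q_start + word_len, s_start + word_len, 1,
        score_fn, x_drop)
    left_gain, left_len = extend_one_direction(
        query, subject, q_start - 1, s_start - 1, -1, score_fn, x_drop)
    return (seed + left_gain + right_gain,
            q_start - left_len,
            q_start + word_len + right_len,
            s_start - left_len)


# Hypothetical toy sequences sharing the seed word GGG at position 3.
hsp = extend_seed("AAAGGGCCC", "TTTGGGCCA", 3, 3, 3, identity_score, x_drop=1)
```

Setting `x_drop=0` is closest to the slide's wording, since the extension then stops as soon as the running score falls below its best value.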

BLAST Algorithm (continued)
So…

SLAALLNKCKTPQGQRLVNQW
+LA++L+   TP G R++ +W
TLASVLDCTVTPMGSRMLKRW

High-scoring segment pairs
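The Identities and Positives figures that BLAST reports can be read straight off an alignment like this one. In the sketch below the middle line is written with its spacing restored so that it lines up with the 21 residues; that spacing is a reconstruction, and the function name is hypothetical. A letter in the middle line marks an identical pair and ‘+’ marks a similar (positively scoring) pair.

```python
QUERY   = "SLAALLNKCKTPQGQRLVNQW"
MIDLINE = "+LA++L+   TP G R++ +W"  # spacing reconstructed (assumption)
SUBJECT = "TLASVLDCTVTPMGSRMLKRW"


def alignment_stats(query, midline, subject):
    """Return (identities, positives, length) for an ungapped alignment."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("all three lines must have the same length")
    # Identities: positions where the two residues are the same.
    identities = sum(1 for q, s in zip(query, subject) if q == s)
    # Positives: identities plus the similar pairs marked '+'.
    positives = identities + midline.count("+")
    return identities, positives, len(query)


identities, positives, length = alignment_stats(QUERY, MIDLINE, SUBJECT)
```

Here 8 of the 21 positions are identical, and a further 7 are marked as similar, giving 15 positives.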

BLAST Algorithm (continued)
7. Varying the threshold alignment score T
• Search time decreases as T is increased, because fewer word pairs are found.
• Sensitivity of the search decreases as T is increased, because word pairs are overlooked (homologous or similar sequences may be discarded).
Note: Both the alignment score Smax and the associated statistical significance are required to assess whether homology is suggested.

Conclusions
• Protein Sequences
• BLAST Algorithm
