Basic Local Alignment Search Tool (BLAST) Katie Moreland
Overview
  Sequence Alignment
  Dynamic Programming
  BLAST tutorial
  Example execution of BLAST
  References
Sequence Alignment
  In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. (http://wikipedia.org)
  Example Alignment:
    G A A T T C A G T T A
    G G A - T C - G - - A
Sequence Alignment Cont…
  Motivations:
    Similar primary structure in proteins implies similar form and function
    Similar short sequences can lead to motif finding (e.g., promoter regions)
    Similarities between gene regions can be used for phylogenetic classification
Sequence Similarity
  Alignments are not unique, so we need a way to compare alignments and find the optimal one
  The optimal alignment is the alignment that maximizes the overall score (it may not be unique)
  Three possibilities when aligning a character from each string (perfect match, mismatch, indel):
    Align the two characters
      Perfect match: C over C
      Mismatch: C over G
    Insertion/deletion (indel)
      Gap in 1st string (S): - over C
      Gap in 2nd string (T): C over -
Sequence Similarity Cont…
  Simple metric:
    σ(x,x) = 1 (match)
    σ(x,y) = -1 (mismatch)
    σ(x,-) = σ(-,x) = -1 (indel)
  In practice it is useful to define a substitution matrix such as PAM250 to take the probabilities of certain mutations into account, e.g., a mutation to a chemically similar amino acid costs less than a mutation to a dissimilar amino acid
  The cost of indels depends on the application
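As a minimal sketch of how the simple metric scores a finished alignment, the snippet below sums σ over the columns of the example alignment from the Sequence Alignment slide. The function names `sigma` and `alignment_score` are illustrative and do not come from the slides.

```python
def sigma(x, y, match=1, mismatch=-1, indel=-1):
    """Simple metric from the slide: +1 match, -1 mismatch, -1 indel."""
    if x == "-" or y == "-":
        return indel
    return match if x == y else mismatch


def alignment_score(row_s, row_t):
    """Sum the column scores of two equal-length aligned rows."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    return sum(sigma(x, y) for x, y in zip(row_s, row_t))


# Example alignment from the Sequence Alignment slide:
# 6 matches, 1 mismatch, 4 indels.
print(alignment_score("GAATTCAGTTA", "GGA-TC-G--A"))  # prints 1
```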
Intro to Dynamic Programming
  Used to reduce the time complexity of algorithms with certain properties
  Characteristics of dynamic programming:
    Overlapping subproblems (otherwise plain recursion/divide and conquer suffices)
    Optimality of subproblems (e.g., shortest path)
Intro to Dynamic Programming
  Two types of alignment
    Global (Needleman-Wunsch)
      Attempts to align every residue in the sequences
      Most useful when sequences are similar in size and sequence
    Local (Smith-Waterman)
      Finds an alignment for parts of the two strings
      Most useful for dissimilar sequences that share regions of similarity or contain similar motifs
Needleman-Wunsch Algorithm
  Input: two strings, S and T
  Construct a matrix with |S|+1 rows and |T|+1 columns
  Label each row with a symbol from S and each column with a symbol from T, except for the first position in each, which represents an initial gap
  Beginning at the upper left corner:
    Move diagonally to represent aligning the two characters from the strings
    Move right to represent inserting a space in S
    Move down to represent inserting a space in T
    Update a cell when newScore > oldScore (include an arrow to show which cell we came from)
  The optimal alignment score is in the bottom right corner of the matrix
  Backtrack along the arrows to find the optimal alignment
Needleman-Wunsch Algorithm
  Sequences to align:
    S : GCTC
    T : CGTTC
  Simple scoring function:
    σ(x,x) = 2 (match)
    σ(x,y) = -1 (mismatch)
    σ(x,-) = σ(-,x) = -1 (indel)
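The matrix fill and backtracking steps can be sketched in Python for the example sequences and scoring function above. This is one possible implementation, not the presenter's code. The name `needleman_wunsch` and the tie-breaking rule (prefer the diagonal move when scores are equal) are assumptions made for illustration.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, indel=-1):
    """Global alignment; rows follow S, columns follow T."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]  # arrow back to the source cell

    # First column and first row: alignments against an initial run of gaps.
    for i in range(1, rows):
        score[i][0], move[i][0] = i * indel, "down"
    for j in range(1, cols):
        score[0][j], move[0][j] = j * indel, "right"

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "diag"),   # align s[i-1] with t[j-1]
                (score[i][j - 1] + indel, "right"),     # space in S
                (score[i - 1][j] + indel, "down"),      # space in T
            ]
            # max() keeps the first best candidate, so ties favor "diag".
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Backtrack from the bottom right corner to the top left corner.
    i, j = len(s), len(t)
    out_s, out_t = [], []
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif step == "right":
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
        else:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
    return score[-1][-1], "".join(reversed(out_s)), "".join(reversed(out_t))


best, aligned_s, aligned_t = needleman_wunsch("GCTC", "CGTTC")
print(best)       # 4
print(aligned_s)  # -GCTC
print(aligned_t)  # CGTTC
```

The sketch stores one arrow per cell, so it recovers a single optimal alignment. Other alignments with the same score may exist.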
Tracing Needleman-Wunsch [Figure: step-by-step fill of the scoring matrix for S = GCTC and T = CGTTC. The first row is initialized to -1, -2, -3, -4, -5, and the final score in the bottom right cell is +4.]
Modifications for Local Alignment
  Allow the algorithm to restart whenever it is advantageous to do so (start the algorithm from any position in S or T)
    If newScore < 0, set the score for cell (i, j) to 0
  The optimal score is now the maximum value over all cells of the matrix (stop at any position in S or T)
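The two modifications can be sketched as follows, reusing the scoring function from the Needleman-Wunsch example (match 2, mismatch -1, indel -1). The function name `smith_waterman_score` is illustrative. The sketch returns only the best score and the cell where it occurs, leaving out the traceback.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, indel=-1):
    """Local alignment score: floor every cell at 0, take the matrix maximum."""
    rows, cols = len(s) + 1, len(t) + 1
    # First row and column stay 0: an alignment may start at any position.
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            new_score = max(
                score[i - 1][j - 1] + pair,  # align the two characters
                score[i][j - 1] + indel,     # space in S
                score[i - 1][j] + indel,     # space in T
            )
            score[i][j] = max(0, new_score)  # restart when the score drops below 0
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    return best, best_cell


# GCTC against the GTTC suffix of CGTTC: 2 - 1 + 2 + 2 = 5
print(smith_waterman_score("GCTC", "CGTTC"))  # (5, (4, 5))
```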
Other Modifications
  Use a gap penalty function so that one large gap can be scored differently from many gaps of size 1
  Biological motivations (e.g., mutations, cDNA matching)
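One common choice of gap penalty function is an affine penalty, which charges once for opening a gap and then a smaller amount per gapped position. The slide does not name a specific function, so the sketch below is an assumption, as is the convention that a gap of length k costs open + extend × k. The default values 11 and 1 are the gap open and gap extend costs listed on the Advanced BLAST Options slide.

```python
def linear_gap_cost(length, per_position=1):
    """Every gapped position costs the same, as in the simple metric."""
    return per_position * length


def affine_gap_cost(length, gap_open=11, gap_extend=1):
    """Assumed convention: a gap of length k costs gap_open + gap_extend * k."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of length 3 versus three separate gaps of length 1.
print(linear_gap_cost(3), 3 * linear_gap_cost(1))  # 3 3   (no difference)
print(affine_gap_cost(3), 3 * affine_gap_cost(1))  # 14 36 (one long gap is cheaper)
```

Under the linear cost the two cases are indistinguishable. Under the affine cost a single long gap, such as the one expected when matching cDNA against genomic sequence, is penalized far less than several scattered gaps.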
BLAST
  Basic Local Alignment Search Tool
  http://www.ncbi.nlm.nih.gov/BLAST/
  Features:
    Finds regions of local similarity between sequences
    Heuristic approach achieves efficiency (important when searching entire databases of sequences)
    Computes statistical significance of matches
  Uses:
    Infer evolutionary/functional relationships
    Identify members of gene families
BLAST Algorithm
  Three stages
    1. Find hotspots: exact matches of word length W in the two sequences being considered (idea: good alignments will share regions of similarity, so find those first)
    2. Extend hotspots in both directions using ungapped alignment to increase the alignment score, and pass high-scoring sequences to stage 3
    3. Perform gapped alignment between the two sequences using a variation of the Smith-Waterman algorithm. Only statistically significant alignments are displayed to the user.
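Stages 1 and 2 can be sketched roughly as below. This is a simplified illustration under stated assumptions, not BLAST's actual implementation. The function names are hypothetical, the match/mismatch scores are borrowed from the Needleman-Wunsch example, and the extension scans to the ends of both sequences and keeps the best-scoring extent, whereas BLAST stops extending early once the score falls too far below the best seen so far.

```python
from collections import defaultdict


def find_hotspots(query, subject, w):
    """Stage 1: all (query_pos, subject_pos) pairs sharing an exact word of length w."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


def extend_ungapped(query, subject, i, j, w, match=2, mismatch=-1):
    """Stage 2: extend a hotspot left and right without gaps.

    Returns (best_score, (query_start, query_end)) with query_end exclusive.
    """
    def pair(a, b):
        return match if a == b else mismatch

    best = running = w * match  # the seed word itself is an exact match
    best_right = 0
    k = 0
    while i + w + k < len(query) and j + w + k < len(subject):
        running += pair(query[i + w + k], subject[j + w + k])
        k += 1
        if running > best:
            best, best_right = running, k

    running = best  # continue leftwards from the best right extension
    best_left = 0
    k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        running += pair(query[i - 1 - k], subject[j - 1 - k])
        k += 1
        if running > best:
            best, best_left = running, k

    return best, (i - best_left, i + w + best_right)


hits = find_hotspots("GCTC", "CGTTC", w=2)
print(hits)                                            # [(2, 3)]
print(extend_ungapped("GCTC", "CGTTC", 2, 3, w=2))     # (5, (0, 4))
```

On this small example the single hotspot "TC" extends leftwards to cover all of GCTC, giving the same score as the Smith-Waterman local alignment of the two strings.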
BLAST Input
  FASTA format
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLM
NTTVTTGLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNL
TVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWCHFPS
NWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPE
TANLWFNCHGEFFYCKMDWFLNYLNNLTVDADHNECKNTS
GTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKTYAPPRE
GHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRY
KLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXXXXX
XXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
  Accession/GI number
    Found using GenBank
    In the FASTA example, the gi number is 532319
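To make the FASTA layout concrete, here is a small parsing sketch: a header line starting with `>` followed by one or more sequence lines. The helper names `parse_fasta` and `gi_number` are hypothetical, and the header handling assumes the `gi|number|...` layout shown in the example above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gi_number(header):
    """Extract the GI number from a 'gi|<number>|...' header, or None."""
    fields = header.split("|")
    if len(fields) >= 2 and fields[0] == "gi":
        return fields[1]
    return None


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLM
NTTVTTGLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNL
"""
header, sequence = parse_fasta(example)[0]
print(gi_number(header))  # 532319
print(len(sequence))      # 83
```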
BLAST Options
  Select program: blastp, blastn, etc.
  Select database(s) to search
    nr is the default; it contains GenBank, PDB, SwissProt, and others
  Gapped/ungapped alignment
  Search within a certain organism
BLAST Options Cont…
  Filtering on/off
    On by default; locates low-complexity regions in a sequence and removes them before performing an alignment
    Low-complexity region: a region with highly biased amino acid composition
  E-value threshold
    Default = 10; represents the number of hits one can expect to find by chance when searching the database
  Substitution matrix
    Default: BLOSUM62
    Assigns each aligned position a score reflecting how likely the given substitution is to occur
    Other matrices are supported, including PAM matrices
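To give a feel for the E value, the sketch below uses the standard relation between a normalized bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m is the query length and n is the total database length. The function names are illustrative, and the sketch ignores the effective-length corrections that BLAST applies, so it should be read as an approximation.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E value: m * n * 2^(-S') for a normalized bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_threshold(bit_score, query_length, database_length, threshold=10.0):
    """True if the hit would be kept under the given E-value threshold."""
    return expected_hits(bit_score, query_length, database_length) <= threshold


# A 100-residue query against a database of one million residues.
print(expected_hits(30, 100, 1_000_000))     # about 0.093
print(passes_threshold(20, 100, 1_000_000))  # False: about 95 chance hits expected
```

Because E scales with database size, the same bit score is less significant in a larger database. Lowering the threshold from the default of 10 makes the search report fewer chance matches.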
Advanced BLAST Options
  -G  Cost to open a gap [Integer], default = 11
  -E  Cost to extend a gap [Integer], default = 1
  -e  Expectation value (E) [Real], default = 10.0
  -W  Word size, default = 11 for blastn, 3 for other programs
  -v  Number of one-line descriptions (V) [Integer], default = 100
  -b  Number of alignments to show (B) [Integer]
BLAST Output
  Request ID
  Query information
  Database information
  Taxonomy Reports link
  Graphical display of alignments
  Description of significant alignments
  Pairwise alignments
Taxonomy Reports
  Lineage Report
    Hierarchical tree structure representing how many hits occurred in each group, 'focused' on the organism which yielded the strongest BLAST hit
  Organism Report
    Groups hits by species
  Taxonomy Report
    Summary of relationships between organisms in the BLAST hit list
Graphical Display of Alignments
  Displays the top 100 sequence alignments for a search by default
  Thick red bar at the top represents the query sequence; the numbers correspond to amino acid residues
  Hits are represented by colored bars; mouse over a bar to view the definition and score in the text box, click to go to the pairwise alignment
  Bar color represents the alignment similarity score
  The color key above the query sequence gives the range of similarity scores for each color
Description of Significant Alignments
  Listed in order of decreasing significance
  Default number displayed = 100
Pairwise Alignments
BLAST Demonstration
>gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAGLNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS
  http://www.ncbi.nlm.nih.gov/BLAST/
References
  1. Altschul, SF, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local alignment search tool. J Mol Biol 215(3):403-10, 1990.
  2. BLAST Tutorials
     http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
     http://www.ornl.gov/sci/techresources/Human_Genome/posters/chromosome/blast.shtml
  3. http://wikipedia.org
  4. Hatzivassiloglou, V. http://www.hlt.utdallas.edu/%7Evh/Courses/Fall06/Lectures/Alignment%20part%203.ppt