CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment
Roadmap Last lecture review –Sequence alignment statistics Today –Heuristic alignment algorithms Basic Local Alignment Search Tools –Multiple sequence alignment algorithms
Sequence Alignment Statistics Substitution matrices –How is the BLOSUM-N matrix made –How to make your own substitution matrix –What’s the meaning of an arbitrary substitution matrix Significance of sequence alignments –P-value estimation for Global alignment scores Local alignment scores –Critical score for local alignment to guarantee significance
Heuristic Local Aligners -- BLAST and alike
State of biological databases
Sequenced genomes (approximate size in bp):
Human        3 × 10^9
Yeast        1.2 × 10^7
Mouse        2.7 × 10^9
Rat          2.6 × 10^9
Neurospora   4 × 10^7
Fugu fish    3.3 × 10^8
Tetraodon    3 × 10^8
Mosquito     2.8 × 10^8
Drosophila   1.2 × 10^8
Worm         1.0 × 10^8
Rice         1.0 × 10^9
Arabidopsis  1.2 × 10^8
Sea squirts  1.6 × 10^8
Current rate of sequencing: 4 big labs at 3 × 10^9 bp/year/lab, 10s of small labs, private sector
State of biological databases Number of genes in these genomes: Vertebrate: ~30,000 Insects: ~14,000 Worm: ~17,000 Fungi: ~6,000-10,000 Small organisms: 100s-1,000s Each known or predicted gene has an associated protein sequence >1,000,000 known / predicted protein sequences
Some useful applications of alignments Given a newly discovered gene, - Does it occur in other species? Assume we try Smith-Waterman: The entire genomic database Our new gene May take several weeks!
Some useful applications of alignments Given a newly sequenced organism, - Which subregions align with other organisms? -Potential genes - Other functional units Assume we try Smith-Waterman: The entire genomic database Our newly sequenced mammal 3 > 1000 years ???
BLAST Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990 –The most widely used bioinformatics tool –One of the most cited papers in the history of science Which is better: one long mediocre match, or a few nearby, short, strong matches with the same total score? –Score-wise, exactly equivalent –Biologically, the latter may be more interesting, and is common –At least, if we must miss some, we would rather miss the former BLAST is a heuristic algorithm emphasizing the latter –speed/sensitivity tradeoff: BLAST may miss the former, but gains greatly in speed
BLAST Available at NCBI (National Center for Biotechnology Information) for download and online use, along with many sequence databases Main idea: 1.Construct a dictionary of all the words in the query 2.Initiate a local alignment for each word match between query and DB Running Time: O(MN) in the worst case; in practice, orders of magnitude faster than Smith-Waterman
BLAST Original Version Dictionary: All words of length k (~11 for DNA, 3 for proteins) Alignment initiated between words of alignment score ≥ T (typically T = k) Alignment: Ungapped extensions until score falls below a statistical threshold Output: All local alignments with score > statistical threshold [Figure: the DB is scanned for words from the query dictionary]
BLAST Original Version A C G A A G T A A G G T C C A G T C C C T T C C T G G A T T G C G A Example: k = 4, T = 4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC
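The seed-and-extend idea of the original BLAST can be sketched in a few lines of Python. This is a minimal illustration, not the NCBI implementation: the function names, the `db` string, and the X-drop stopping rule (`drop`) are assumptions made for the example, and only exact word matches are used as seeds.

```python
def build_word_index(query, k):
    """Dictionary: every length-k word of the query -> list of start positions."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return index


def extend(query, db, qi, di, step, match, mismatch, drop):
    """Ungapped extension in one direction (step = -1 left, +1 right).

    Stops when the running score falls more than `drop` below the best
    score seen so far (an assumed stopping rule for this sketch).
    Returns (number of columns kept, score of those columns).
    """
    score = best_score = best_len = length = 0
    qi, di = qi + step, di + step
    while 0 <= qi < len(query) and 0 <= di < len(db):
        score += match if query[qi] == db[di] else mismatch
        length += 1
        if score > best_score:
            best_score, best_len = score, length
        elif best_score - score > drop:
            break
        qi, di = qi + step, di + step
    return best_len, best_score


def seed_and_extend(query, db, k=4, match=1, mismatch=-1, drop=2):
    """Scan db for words in the query dictionary and extend each hit."""
    index = build_word_index(query, k)
    hits = []
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], []):
            left, left_score = extend(query, db, q, d, -1, match, mismatch, drop)
            right, right_score = extend(query, db, q + k - 1, d + k - 1, +1,
                                        match, mismatch, drop)
            start_q, start_d = q - left, d - left
            length = left + k + right
            hits.append((k * match + left_score + right_score,
                         query[start_q:start_q + length],
                         db[start_d:start_d + length]))
    return hits


query = "ACGAAGTAAGGTCCAGT"
db = "CTGTTAGGTCCTC"          # hypothetical database fragment
hits = seed_and_extend(query, db)
```

With these inputs the word GGTC seeds an alignment that extends to GTAAGGTCC against GTTAGGTCC. A real implementation would also merge overlapping hits on the same diagonal and apply the statistical threshold before reporting.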
Gapped BLAST A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A Added features: Pairs of words can initiate alignment Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT
Example
Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
1,726,556 sequences; 8,074,398,388 total letters

>gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
Length =
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4 tacaccccgattacaccccga 24
         ||||||| |||||||||||||
Sbjct:   tacacccagattacaccccga

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4 tacaccccgattacaccccga 24
         ||||||| |||||||||||||
Sbjct:   tacacccagattacaccccga

>gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
Length =
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga 3911
Example
Query: Human atoh enhancer, 179 letters [1.5 min]
Result: 57 blast hits
 #  Sequence                                              Score (bits)  E-value
 1. gi| |gb|AF |AF   Homo sapiens ATOH1 enhanc                          e-95
 2. gi| |gb|AC |     Mus musculus Strain C57BL6/J ch      264           e-68
 3. gi| |gb|AF |AF   Mus musculus Atoh1 enhanc                          e-66
 4. gi| |gb|AF |     Gallus gallus CATH1 (CATH1) gene     78            e-12
 5. gi| |emb|AL |    Zebrafish DNA sequence from clo      54            e-05
 6. gi| |gb|AC |     Oryza sativa chromosome 10 BAC O     44
 7. gi| |ref|NM_ |   Mus musculus suppressor of Ty        42
 8. gi| |gb|BC |     Mus musculus, Similar to suppres     42

>gi| |gb|AF |AF218258 Mus musculus Atoh1 enhancer sequence
Length = 1517
Score = 256 bits (129), Expect = 9e-66
Identities = 167/177 (94%), Gaps = 2/177 (1%)
Strand = Plus / Plus
Query: 3    tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
            ||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203
Query: 63   cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
            ||||||||||||||||||||||||||  ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262
Query: 123  cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179
            ||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318
BLAST Score: bit score vs raw score
Bit score $S'$ is converted from raw score $S$ by taking into account $K$ and $\lambda$: $S' = (\lambda S - \ln K) / \ln 2$
To compute E-value from bit score: $E = K M' N' e^{-\lambda S} = M' N' \, 2^{-S'}$
Critical score is now: $S^* = \log_2 (M' N')$
If $S' \gg S^*$: significant
If $S' \ll S^*$: not significant
($M' \approx M$, $N' \approx N$)
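The conversions on this slide are easy to check numerically. The sketch below assumes illustrative parameter values λ = 1.374 and K = 0.711 (they are not given in the slides); with these values a raw score of 17 comes out at about 34.2 bits, consistent with the "34.2 bits (17)" reported in the example output. The function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m_eff, n_eff):
    """E = M' N' 2^(-S'), with M', N' the effective query and database lengths."""
    return m_eff * n_eff * 2.0 ** (-bits)


def critical_bit_score(m_eff, n_eff):
    """S* = log2(M' N'): the bit score at which E = 1."""
    return math.log2(m_eff * n_eff)


LAMBDA, K = 1.374, 0.711      # assumed values, for illustration only
bits = bit_score(17, LAMBDA, K)
print(round(bits, 1))         # 34.2
```

Because the bit score already absorbs λ and K, two searches run with different scoring systems can be compared on bit scores directly, which is the reason BLAST reports them.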
Different types of BLAST blastn: search nucleic acid databases blastp: search protein databases blastx: you give a nucleic acid sequence, search protein databases tblastn: you give a protein sequence, search nucleic acid databases tblastx: you give a nucleic acid sequence, search nucleic acid databases, implicitly translating both into protein sequences
BLAST cons and pros Advantages –Fast!!!! –A few minutes to search a database of bases Disadvantages –Sensitivity may be low –Often misses weak homologies New improvement –Make it even faster Mainly for aligning very similar sequences or really long sequences –E.g. whole genome vs whole genome –Make it more sensitive PSI-BLAST: iteratively add more homologous sequences PatternHunter: discontinuous seeds
Variants of BLAST NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version Optimized, added features MEGABLAST: Optimized to align very similar sequences. Linear gap penalty BLAT: Blast-Like Alignment Tool BlastZ: Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower
Pattern hunter Instead of requiring exact matches of k consecutive letters (a k-mer), we can look for discontinuous matches –My query sequence looks like: ACGTAGACTAGCAGTTAAG –Search for sequences in database that match AXGXAGXCTAXC X stands for don’t care Seed:
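A discontinuous (spaced) seed can be represented as a 0/1 mask: 1 means the position must match, 0 means don't care. The sketch below is an illustration only; the mask is derived from the pattern AXGXAGXCTAXC on the slide, while the function names and the `db` string are assumptions made for the example.

```python
def seed_from_pattern(pattern, wildcard="X"):
    """Turn a pattern such as AXGXAG into a 0/1 mask (1 = must match)."""
    return "".join("0" if c == wildcard else "1" for c in pattern)


def spaced_key(seq, start, seed):
    """Letters of seq under the 1-positions of the seed, starting at `start`."""
    return "".join(seq[start + i] for i, bit in enumerate(seed) if bit == "1")


def find_spaced_hits(query, db, seed):
    """Return (query_pos, db_pos) pairs whose letters agree at every 1-position."""
    span = len(seed)
    index = {}
    for q in range(len(query) - span + 1):
        index.setdefault(spaced_key(query, q, seed), []).append(q)
    hits = []
    for d in range(len(db) - span + 1):
        for q in index.get(spaced_key(db, d, seed), []):
            hits.append((q, d))
    return hits


seed = seed_from_pattern("AXGXAGXCTAXC")
query = "ACGTAGACTAGCAGTTAAG"
db = "TTATGAAGCCTATCGG"        # hypothetical; differs from the query at the X positions
hits = find_spaced_hits(query, db, seed)
```

Here the database fragment differs from the query at all four don't-care positions, so no contiguous 12-mer seed would fire, yet the spaced seed still reports the hit.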
Pattern hunter A good seed may give you both a higher sensitivity and higher specificity You may think is the best seed –Because mutation in the third position of a codon often doesn’t change the amino acid –Best seed is actually Empirically determined How to design such seed is an open problem May combine multiple random seeds
Things we’ve covered so far Global alignment –Needleman-Wunsch and variants Local Alignment –Smith-Waterman Improvement on space and time More accurate gap penalty models Heuristic algorithms –BLAST families Statistics for sequence alignment
Commonality: They all deal with aligning two sequences –Pair-wise sequence alignment Next: Aligning multiple sequences all together –Multiple sequence alignment Motivation: –A faint similarity between two sequences becomes very significant if present in many sequences Protein domains Motifs responsible for gene regulation
Definition Given N sequences $x_1, x_2, \dots, x_N$: –Insert gaps (-) in each sequence $x_i$, such that All sequences have the same length L Score of the global mapping is maximum Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences Same for a multiple alignment!
Scoring Function Ideally: –Find alignment that maximizes probability that sequences evolved from common ancestor x y z w v ? Phylogenetic tree or evolution tree
Scoring Function (cont’d) Unfortunately: too many parameters Compromises: –Ignore phylogenetic tree Compute from pair-wise scores –Based on sum of all pair-wise scores –Based on scores with a consensus sequence
First assumption Columns are independent –As in pair-wise alignment Therefore, the score of an alignment is the sum of the scores of its columns Need to decide how to score a single column
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment: keep two rows and drop every column in which both rows have a gap
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
(x,y)           (x,z)            (y,z)
x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
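The induced pairwise alignment is a one-line projection: take two rows and drop the columns where both have a gap. A minimal sketch, with a hypothetical function name, applied to the x, y, z example:

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project a multiple alignment onto two rows, dropping gap-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


x = "AC-GCGG-C"
y = "AC-GC-GAG"
z = "GCCGC-GAG"

print(induced_pairwise(x, y))   # ('ACGCGG-C', 'ACGC-GAG')
print(induced_pairwise(x, z))   # ('AC-GCGG-C', 'GCCGC-GAG')
print(induced_pairwise(y, z))   # ('AC-GCGAG', 'GCCGCGAG')
```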
Sum Of Pairs (cont’d)
The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
$S(m) = \sum_{k<l} s(m^k, m^l)$
$s(m^k, m^l)$: score of induced alignment (k,l)
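Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Scoring matrix (match 1, mismatch -1, gap against a letter -1, gap against gap 0):
     A   C   G   T   -
A    1  -1  -1  -1  -1
C   -1   1  -1  -1  -1
G   -1  -1   1  -1  -1
T   -1  -1  -1   1  -1
-   -1  -1  -1  -1   0
Column 1: (A,A) + (A,G) × 2 = -1
Column 2: (C,C) × 3 = 3
Column 8: (-,A) × 2 + (A,A) = -1
Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5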
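The column-by-column computation in this example can be reproduced with a short script. It assumes the simple scoring implied by the worked columns (match 1, mismatch -1, gap against a letter -1, gap against gap 0); the function names are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols; a gap aligned with a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_scores(msa):
    """Sum-of-pairs score of every column of the alignment."""
    return [sum(pair_score(a, b) for a, b in combinations(col, 2))
            for col in zip(*msa)]


msa = ["AC-GCGG-C",
       "AC-GC-GAG",
       "GCCGC-GAG"]
scores = column_scores(msa)
print(scores)        # [-1, 3, -2, 3, 3, -2, 3, -1, -1]
print(sum(scores))   # 5
```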
Sum Of Pairs (cont’d)
Drawback: no evolutionary characterization –Every sequence derived from all others
Heuristic way to incorporate evolution tree –Weighted Sum of Pairs:
$S(m) = \sum_{k<l} w_{kl} \, s(m^k, m^l)$
$w_{kl}$: weight decreasing with distance
[Figure: evolution tree with leaves Human, Mouse, Chicken, Duck]
Consensus score
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Find optimal consensus string $m^*$ to maximize $S(m) = \sum_i s(m^*, m^i)$
$s(m^k, m^l)$: score of pairwise alignment (k,l)
Consensus sequence:
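One simple reading of the consensus score, under the column-independence assumption, is to pick for each column the symbol that maximizes the summed pair score against that column. The sketch below takes that reading as an assumption, uses an illustrative scoring scheme (match 1, mismatch -1, gap -1, gap against gap 0), and breaks ties by alphabet order; the function names and the small test alignment are hypothetical.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols; a gap aligned with a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def consensus(msa, alphabet="ACGT-"):
    """Column-wise consensus string and its total score against all rows.

    A '-' in the result means a gap was the best-scoring choice for that column.
    """
    letters, total = [], 0
    for col in zip(*msa):
        best = max(alphabet, key=lambda c: sum(pair_score(c, a) for a in col))
        letters.append(best)
        total += sum(pair_score(best, a) for a in col)
    return "".join(letters), total


toy = ["ACG-",
       "ACGT",
       "GCGT",
       "ACTT"]
print(consensus(toy))   # ('ACGT', 10)
```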
Multiple Sequence Alignments Algorithms
Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsch: Find the longest path in a high-dimensional cube –As opposed to a two-dimensional grid
Uses an N-dimensional matrix –As opposed to a two-dimensional array
Entry $F(i_1, \dots, i_N)$ represents score of optimal alignment for $s_1[1..i_1], \dots, s_N[1..i_N]$
$F(i_1, i_2, \dots, i_N) = \max_{\text{all neighbors of a cell}} \big( F(\text{nbr}) + S(\text{current}) \big)$
Multidimensional Dynamic Programming (MDP)
Example: in 3D (three sequences): $2^3 - 1 = 7$ neighbors/cell
$$F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k), \\
F(i-1,j-1,k) + S(x_i, x_j, -), \\
F(i-1,j,k-1) + S(x_i, -, x_k), \\
F(i,j-1,k-1) + S(-, x_j, x_k), \\
F(i-1,j,k) + S(x_i, -, -), \\
F(i,j-1,k) + S(-, x_j, -), \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}$$
[Figure: cube with corner (i,j,k) and its seven predecessor cells (i,j,k-1), (i-1,j,k-1), (i-1,j-1,k-1), (i-1,j-1,k), (i,j-1,k), (i-1,j,k), (i,j-1,k-1)]
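The three-sequence recurrence can be written out directly. The following is a minimal sketch, assuming sum-of-pairs column scores with match 1, mismatch -1, gap -1 and gap against gap 0; the function names are hypothetical, and the cubic table makes it practical only for short sequences.

```python
from itertools import combinations, product


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(col):
    """Sum-of-pairs score of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def align3(x, y, z):
    """Optimal sum-of-pairs alignment of three sequences by 3D dynamic programming."""
    seqs = (x, y, z)
    dims = [len(s) + 1 for s in seqs]
    # The 2^3 - 1 = 7 moves: each sequence either consumes a letter (1) or gets a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    back = {}
    # product() enumerates cells in lexicographic order, so predecessors come first.
    for cell in product(*(range(d) for d in dims)):
        if cell == (0, 0, 0):
            continue
        best = None
        for m in moves:
            prev = tuple(c - step for c, step in zip(cell, m))
            if min(prev) < 0:
                continue
            col = tuple(seqs[n][cell[n] - 1] if m[n] else "-" for n in range(3))
            cand = F[prev] + column_score(col)
            if best is None or cand > best:
                best, back[cell] = cand, (prev, col)
        F[cell] = best
    # Trace back from the far corner to recover the columns.
    cell = tuple(d - 1 for d in dims)
    score, cols = F[cell], []
    while cell != (0, 0, 0):
        cell, col = back[cell]
        cols.append(col)
    cols.reverse()
    return score, ["".join(c[n] for c in cols) for n in range(3)]


score, rows = align3("ACGCGGC", "ACGCGAG", "GCCGCGAG")
```

For these three sequences the optimal score is at least 5, the score of the hand-made alignment in the sum-of-pairs example, since that alignment is one of the candidates the DP considers.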
Multidimensional Dynamic Programming (MDP)
Running Time:
1.Size of matrix: $L^N$, where L = length of each sequence, N = number of sequences
2.Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
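To get a feel for this bound, here is a worked instance assuming sequences of length L = 100 (a value chosen only for illustration). Since $2^N L^N = (2L)^N$:

```latex
\begin{align*}
N = 3 &: \quad (2 \cdot 100)^3 = 200^3 = 8 \times 10^{6} \\
N = 6 &: \quad 200^6 = 6.4 \times 10^{13} \\
N = 10 &: \quad 200^{10} \approx 1.0 \times 10^{23}
\end{align*}
```

Each additional sequence multiplies the work by 2L = 200, which is why exact MDP stops being feasible after a handful of sequences.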
Faster MDP Carrillo & Lipman, 1988 –Branch and bound –Other heuristics Practical for about 6 sequences of length about
Progressive Alignment
Multiple Alignment is NP-hard
Most used heuristic: Progressive Alignment
Algorithm:
1.Align two of the sequences $x_i$, $x_j$
2.Fix that alignment
3.Align a third sequence $x_k$ to the alignment of $x_i$, $x_j$
4.Repeat until all sequences are aligned
Running Time: $O(N L^2)$: each alignment takes $O(L^2)$, repeated N times
Progressive Alignment When evolutionary tree is known: –Align closest first, in the order of the tree Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) x w y z
Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment
Algorithm:
1.Find all $d_{ij}$: alignment distance between $x_i$ and $x_j$ (high alignment score => short distance)
2.Construct a tree (Neighbor-joining hierarchical clustering. Will discuss in future)
3.Align nodes in order of decreasing similarity
+ a large number of heuristics
CLUSTALW example
Sequences: S1 = ALSK, S2 = TNSD, S3 = NASK, S4 = NTSD
Distance matrix (upper triangle; the off-diagonal s1 entries did not survive and are shown as "."):
      s1  s2  s3  s4
s1     0   .   .   .
s2         0   8   3
s3             0   7
s4                 0
Guide tree: ((s1, s3), (s2, s4))
Step 1, align s1 with s3:
-ALSK
NA-SK
Step 2, align s2 with s4:
-TNSD
NT-SD
Step 3, align the two pairwise alignments with each other:
-ALSK
-TNSD
NA-SK
NT-SD
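The merge steps of this example can be sketched as a dynamic program over alignment columns, where a gap inserted into an existing alignment is inserted into all of its rows at once. This is a simplified illustration, not CLUSTALW itself: it assumes match 1, mismatch -1, gap -1 scoring with sum-of-pairs between columns, whereas the real program uses substitution matrices, affine gap penalties, sequence weights and many other heuristics. The function names are hypothetical, and the guide-tree order is taken from the slides.

```python
from itertools import product


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def align_alignments(A, B):
    """Align two alignments (lists of equal-length rows), keeping each one frozen."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)

    def s(ca, cb):
        # Sum of pair scores between a column of A and a column of B.
        return sum(pair_score(a, b) for a, b in product(ca, cb))

    n, m = len(cols_a), len(cols_b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + s(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + s(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(cols_a[i - 1], cols_b[j - 1]),
                          F[i - 1][j] + s(cols_a[i - 1], gap_b),
                          F[i][j - 1] + s(gap_a, cols_b[j - 1]))
    # Trace back, emitting merged columns.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(cols_a[i - 1], cols_b[j - 1]):
            out.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + s(cols_a[i - 1], gap_b):
            out.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            out.append(gap_a + cols_b[j - 1])
            j -= 1
    out.reverse()
    return ["".join(col[r] for col in out) for r in range(len(A) + len(B))]


s1, s2, s3, s4 = "ALSK", "TNSD", "NASK", "NTSD"
left = align_alignments([s1], [s3])      # step 1: (s1, s3)
right = align_alignments([s2], [s4])     # step 2: (s2, s4)
msa = align_alignments(left, right)      # step 3: merge; rows are s1, s3, s2, s4
```

With a real substitution matrix the gap placement may differ from this toy scoring, but the structure is the same: earlier alignments are never reopened, which is exactly the "frozen" behaviour that iterative refinement tries to repair.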
Problems with progressive alignment:
Depends on pair-wise alignments –If sequences are very distantly related, much higher likelihood of errors
Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   (frozen!)
z: GAACTG
w: GTACTG
Now clear: correct y should be GA-CTT
Iterative Refinement
Algorithm (Barton-Sternberg):
1.Align most similar $x_i$, $x_j$
2.Align $x_k$ most similar to ($x_i$ $x_j$)
3.Repeat 2 until ($x_1 \dots x_N$) are aligned
4.For j = 1 to N, remove $x_j$ and realign it to $x_1 \dots x_{j-1} x_{j+1} \dots x_N$
5.Repeat 4 until convergence
(Steps 1-3 are progressive alignment)
Iterative Refinement (cont’d)
For each sequence y
1.Remove y
2.Realign y (while rest fixed)
[Figure: x, z fixed projection; allow y to vary]
Note: Guaranteed to converge (why?)
Running time: $O(k N L^2)$, k: number of iterations
Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x:GAAGTTA y:GAC-TTA z:GAACTGA w:GTACTGA After realigning y: x:GAAGTTA y:G-ACTTA + 3 matches z:GAACTGA w:GTACTGA
Iterative Refinement
Example not handled well:
x:   GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z:   GAACTGA
w:   GTACTGA
Realigning any single y_i changes nothing
Restricted MDP
Similar to bounded DP in pair-wise alignment
1.Construct progressive multiple alignment m
2.Run MDP, restricted to radius R from m
Running Time: $O(2^N R^{N-1} L)$
Restricted MDP
x:   GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z:   GAACTGA
w:   GTACTGA
Within radius 1 of the optimal, so restricted MDP will fix it.
Other approaches Profile Hidden Markov Models –Statistical learning methods –Will discuss in future
Multiple alignment tools Clustal W (Thompson, 1994) –Most popular PRRP (Gotoh, 1993) HMMT (Eddy, 1995) DIALIGN (Morgenstern, 1998) T-Coffee (Notredame, 2000) MUSCLE (Edgar, 2004) Align-m (Walle, 2004) PROBCONS (Do, 2004)
In summary Multiple alignment algorithms: –MDP (too slow) Branch & Bound doesn’t solve the problem entirely –Progressive alignment: clustalW –Iterative refinement –Restricted MDP