CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment.


Roadmap Last lecture review –Sequence alignment statistics Today –Heuristic alignment algorithms Basic Local Alignment Search Tools –Multiple sequence alignment algorithms

Sequence Alignment Statistics Substitution matrices –How is the BLOSUM-N matrix made –How to make your own substitution matrix –What’s the meaning of an arbitrary substitution matrix Significance of sequence alignments –P-value estimation for Global alignment scores Local alignment scores –Critical score for local alignment to guarantee significance

Heuristic Local Aligners -- BLAST and alike

State of biological databases
Sequenced genomes (size in bp):
  Human        3 × 10^9
  Yeast        1.2 × 10^7
  Mouse        2.7 × 10^9
  Rat          2.6 × 10^9
  Neurospora   4 × 10^7
  Fugu fish    3.3 × 10^8
  Tetraodon    3 × 10^8
  Mosquito     2.8 × 10^8
  Drosophila   1.2 × 10^8
  Worm         1.0 × 10^8
  Rice         1.0 × 10^9
  Arabidopsis  1.2 × 10^8
  Sea squirts  1.6 × 10^8
Current rate of sequencing:
  4 big labs × 3 × 10^9 bp/year/lab
  10s of small labs
  Private sector

State of biological databases Number of genes in these genomes: Vertebrate: ~30,000 Insects: ~14,000 Worm: ~17,000 Fungi: ~6,000-10,000 Small organisms: 100s-1,000s Each known or predicted gene has an associated protein sequence >1,000,000 known / predicted protein sequences

Some useful applications of alignments
Given a newly discovered gene,
- Does it occur in other species?
Assume we try Smith-Waterman: our new gene (10^4 letters) against the entire genomic database (10^10 - 10^11 letters)
May take several weeks!

Some useful applications of alignments
Given a newly sequenced organism,
- Which subregions align with other organisms?
  - Potential genes
  - Other functional units
Assume we try Smith-Waterman: our newly sequenced mammal (3 × 10^9 letters) against the entire genomic database (10^10 - 10^11 letters)
> 1000 years ???
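The two estimates above follow from the size of the Smith-Waterman table, which has one cell per (query letter, database letter) pair. The sketch below redoes that arithmetic. The throughput of 10^8 cell updates per second is an assumption chosen only for illustration (the slides do not state one), and `sw_seconds` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def sw_seconds(query_len, db_len, cells_per_second=1e8):
    """Time to fill a query_len x db_len dynamic-programming table.

    cells_per_second is an assumed throughput, not a measured one.
    """
    return query_len * db_len / cells_per_second


# A single gene (10^4 letters) against databases of 10^10 and 10^11 letters.
gene_days = [sw_seconds(1e4, db) / SECONDS_PER_DAY for db in (1e10, 1e11)]

# A whole mammalian genome (3 x 10^9 letters) against the same databases.
genome_years = [sw_seconds(3e9, db) / SECONDS_PER_YEAR for db in (1e10, 1e11)]

print("gene:   %.0f to %.0f days" % tuple(gene_days))
print("genome: %.0f to %.0f years" % tuple(genome_years))
```

Under that assumed throughput the gene search lands in the range of days to months and the genome search in the thousands of years, which is the order of magnitude quoted on the slides.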

BLAST
Basic Local Alignment Search Tool
–Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990
–The most widely used bioinformatics tool
–One of the most cited papers in the history of science
Which is better: a long mediocre match, or a few nearby, short, strong matches with the same total score?
–Score-wise, exactly equivalent
–Biologically, the latter may be more interesting, and is common
–At least, if we must miss some, we would rather miss the former
BLAST is a heuristic algorithm emphasizing the latter
–Speed/sensitivity tradeoff: BLAST may miss the former, but gains greatly in speed

BLAST
Available at NCBI (National Center for Biotechnology Information) for download and online use: http://blast.ncbi.nlm.nih.gov/
Along with many sequence databases
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN)
However, orders of magnitude faster than Smith-Waterman

BLAST: Original Version
Dictionary: All words of length k (~11 for DNA, 3 for proteins)
Alignment initiated between words of alignment score ≥ T (typically T = k)
Alignment: Ungapped extensions until score below statistical threshold
Output: All local alignments with score > statistical threshold
Figure: the query dictionary is scanned against the DB

BLAST: Original Version
Example: k = 4, T = 4
Figure (sequence letters as extracted): A C G A A G T A A G G T C C A G T C C C T T C C T G G A T T G C G A
The matching word GGTC initiates an alignment
Extension to the left and right with no gaps until alignment falls < 50%
Output:
  GTAAGGTCC
  GTTAGGTCC
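The dictionary-then-extend idea of the original BLAST can be sketched in a few lines. This is a minimal illustration, not NCBI's implementation: it seeds only on exact word matches (the T = k case), and it stops each ungapped extension with a simple X-drop rule rather than the slide's 50% criterion. The function names, the scoring values, `x_drop`, and the toy database string are all assumptions made for the example; only the query prefix and the expected output come from the slide.

```python
from collections import defaultdict


def build_word_index(query, k):
    """Map every length-k word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def extend_one_side(query, db, q, d, step, match, mismatch, x_drop):
    """Extend without gaps from (q, d) in direction step (+1 or -1).

    Returns (best_gain, best_length): the best score gain seen and the
    number of letters consumed to reach it.
    """
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= q < len(query) and 0 <= d < len(db):
        gain += match if query[q] == db[d] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > x_drop:
            break  # score fell too far below the best seen: stop extending
        q += step
        d += step
    return best_gain, best_length


def seed_and_extend(query, db, k=4, match=1, mismatch=-1, x_drop=2):
    """Return the distinct ungapped local alignments found from word hits."""
    index = build_word_index(query, k)
    hits = set()
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], ()):
            r_gain, r_len = extend_one_side(query, db, q + k, d + k, +1,
                                            match, mismatch, x_drop)
            l_gain, l_len = extend_one_side(query, db, q - 1, d - 1, -1,
                                            match, mismatch, x_drop)
            score = k * match + l_gain + r_gain
            q_start, q_end = q - l_len, q + k + r_len
            d_start = d - l_len
            hits.add((score, query[q_start:q_end],
                      db[d_start:d_start + (q_end - q_start)]))
    return sorted(hits, reverse=True)


query = "ACGAAGTAAGGTCCAGT"   # query letters from the slide
toy_db = "TTGTTAGGTCCTTTT"    # hypothetical database containing the hit
print(seed_and_extend(query, toy_db))
```

On this toy input the three overlapping word hits (AGGT, GGTC, GTCC) lie on the same diagonal and collapse into one alignment, GTAAGGTCC against GTTAGGTCC, the pair shown on the slide.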

Gapped BLAST
Figure (sequence letters as extracted): A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A
Added features:
–Pairs of words can initiate alignment
–Extensions with gaps in a band around anchor
Output:
  GTAAGGTCCAGT
  GTTAGGTC-AGT

Example
Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences); 1,726,556 sequences; 8,074,398,388 total letters
>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
Length = 144487
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125138 tacacccagattacaccccga 125158
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125104 tacacccagattacaccccga 125124
>gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
Length = 139823
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga 3911

Example
Query: Human atoh enhancer, 179 letters [1.5 min]
Result: 57 blast hits (columns: hit, bit score, E-value)
1. gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc...    355  1e-95
2. gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch...    264  4e-68
3. gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc...    256  9e-66
4. gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene...     78  5e-12
5. gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo...     54  7e-05
6. gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O...     44  0.068
7. gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty...      42  0.27
8. gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres...     42  0.27
>gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence
Length = 1517
Score = 256 bits (129), Expect = 9e-66
Identities = 167/177 (94%), Gaps = 2/177 (1%)
Strand = Plus / Plus
Query: 3    tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
            ||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203
Query: 63   cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
            ||||||||||||||||||||||||||  ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262
Query: 123  cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179
            ||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318

BLAST Score: bit score vs raw score
Bit score is converted from raw score by taking into account K and λ:
  $S' = (\lambda S - \ln K) / \ln 2$
To compute E-value from bit score:
  $E = K M' N' e^{-\lambda S} = M' N' \, 2^{-S'}$
Critical score is now:
  $S^* = \log_2 (M' N')$
  If $S' \gg S^*$: significant
  If $S' \ll S^*$: not significant
(M' ~ M, N' ~ N)
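The bit-score relations above are easy to check numerically. The sketch below implements them directly; the function names are hypothetical, and the values of λ, K, M' and N' used at the bottom are made-up inputs for illustration, not parameters of any real scoring system.

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_raw(raw_score, lam, K, m, n):
    """E = K * M' * N' * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """E = M' * N' * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def critical_bit_score(m, n):
    """S* = log2(M' * N'): the bit score at which E equals 1."""
    return math.log2(m * n)


# Hypothetical parameters, for illustration only.
lam, K, m, n = 0.3, 0.1, 100, 1_000_000
S = 50
print(e_value_from_raw(S, lam, K, m, n))
print(e_value_from_bits(bit_score(S, lam, K), m, n))  # same value, via bits
```

Both routes give the same E-value because $2^{-S'} = K e^{-\lambda S}$, which is why a bit score can be interpreted without knowing K and λ.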

Different types of BLAST blastn: search nucleic acid databases blastp: search protein databases blastx: you give a nucleic acid sequence, search protein databases tblastn: you give a protein sequence, search nucleic acid databases tblastx: you give a nucleic sequence, search nucleic acid database, implicitly translate both into protein sequences

BLAST cons and pros
Advantages
–Fast!!!!
–A few minutes to search a database of 10^11 bases
Disadvantages
–Sensitivity may be low
–Often misses weak homologies
New improvement
–Make it even faster
  Mainly for aligning very similar sequences or really long sequences
  E.g. whole genome vs whole genome
–Make it more sensitive
  PSI-BLAST: iteratively add more homologous sequences
  PatternHunter: discontinuous seeds

Variants of BLAST NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version Optimized, added features MEGABLAST: Optimized to align very similar sequences. Linear gap penalty BLAT: Blast-Like Alignment Tool BlastZ: Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower

Pattern hunter
Instead of exact matches of k consecutive letters (a k-mer), we can look for discontinuous matches
–My query sequence looks like: ACGTAGACTAGCAGTTAAG
–Search for sequences in database that match AXGXAGXCTAXC, where X stands for don’t care
–Seed: 101011011101 (1 = position must match, 0 = don’t care)
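A spaced seed changes only how dictionary keys are built: letters under a 0 are skipped, so two windows match whenever they agree on the 1 positions. The sketch below illustrates that with the seed from the slide. The function names and the toy database string are assumptions for the example and are not taken from PatternHunter itself.

```python
def spaced_word(window, seed):
    """Show a window with its don't-care positions masked as X."""
    return "".join(c if s == "1" else "X" for c, s in zip(window, seed))


def spaced_key(seq, start, care_positions):
    """Letters of seq at the seed's 'must match' positions, from start."""
    return "".join(seq[start + j] for j in care_positions)


def spaced_seed_hits(query, db, seed):
    """Return (query_pos, db_pos) pairs whose windows agree on every 1."""
    span = len(seed)
    care = [j for j, s in enumerate(seed) if s == "1"]
    index = {}
    for q in range(len(query) - span + 1):
        index.setdefault(spaced_key(query, q, care), []).append(q)
    hits = []
    for d in range(len(db) - span + 1):
        for q in index.get(spaced_key(db, d, care), ()):
            hits.append((q, d))
    return hits


seed = "101011011101"
query = "ACGTAGACTAGCAGTTAAG"
toy_db = "TTATGCAGGCTATCGG"  # hypothetical; differs from the query at the 0s

print(spaced_word(query[:len(seed)], seed))
print(spaced_seed_hits(query, toy_db, seed))
```

The toy database window ATGCAGGCTATC shares no exact 12-mer with the query, yet it is still reported because every mismatch falls on a don't-care position.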

Pattern hunter A good seed may give you both a higher sensitivity and higher specificity You may think 110110110110 is the best seed –Because mutation in the third position of a codon often doesn’t change the amino acid –Best seed is actually 110100110010101111 Empirically determined How to design such seed is an open problem May combine multiple random seeds

Things we’ve covered so far Global alignment –Needleman-Wunsch and variants Local Alignment –Smith-Waterman Improvement on space and time More accurate gap penalty models Heuristic algorithms –BLAST families Statistics for sequence alignment

Commonality: They all deal with aligning two sequences –Pair-wise sequence alignment Next: Aligning multiple sequences all together –Multiple sequence alignment Motivation: –A faint similarity between two sequences becomes very significant if present in many sequences Protein domains Motifs responsible for gene regulation

Definition
Given N sequences $x_1, x_2, \ldots, x_N$:
–Insert gaps (-) in each sequence $x_i$, such that
  All sequences have the same length L
  Score of the global mapping is maximum
Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences
Same for a multiple alignment!

Scoring Function
Ideally:
–Find alignment that maximizes probability that sequences evolved from common ancestor
Figure: phylogenetic tree (evolution tree) with leaves x, y, z, w, v and an unknown ancestor (?)

Scoring Function (cont’d) Unfortunately: too many parameters Compromises: –Ignore phylogenetic tree Compute from pair-wise scores –Based on sum of all pair-wise scores –Based on scores with a consensus sequence

First assumption Columns are independent –Similar in pair-wise alignment Therefore, the score of an alignment is the sum of all columns Need to decide how to score a single column

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment (columns in which both rows have a gap are dropped)
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Induces:
  x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
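Inducing a pairwise alignment amounts to taking two rows of the multiple alignment and deleting the columns where both rows hold a gap. A minimal sketch, with `induced_pairwise` as a hypothetical helper name:

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project a multiple alignment onto two rows.

    Columns where both rows are gaps carry no information about the pair,
    so they are removed.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == gap and b == gap)]
    return ("".join(a for a, _ in kept), "".join(b for _, b in kept))


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))
print(induced_pairwise(x, z))
print(induced_pairwise(y, z))
```

For the slide's example, the (x, y) pair loses column 3, the (y, z) pair loses column 6, and the (x, z) pair keeps all nine columns.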

Sum Of Pairs (cont’d)
The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
  $S(m) = \sum_{k<l} s(m^k, m^l)$
  $s(m^k, m^l)$: score of induced alignment (k, l)

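Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Scoring matrix over A, C, G, T, -: each matching letter pair scores 1 and (-,-) scores 0; the column scores below imply that each mismatch and each letter-gap pair scores -1
  Column 1: (A,A) + (A,G) x 2 = -1
  Column 2: (C,C) x 3 = 3
  Column 8: (-,A) x 2 + (A,A) = -1
Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5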
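The worked example can be reproduced column by column. In the sketch below the match score (1) and the gap-gap score (0) are read off the slide's matrix, while the mismatch and letter-gap scores (-1 each) are inferred from the slide's arithmetic rather than stated outright. The function names are hypothetical.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b):
    """Score one pair of symbols from the same column."""
    if a == GAP and b == GAP:
        return 0            # gap against gap, from the slide's matrix
    if a == GAP or b == GAP:
        return -1           # letter against gap (inferred from the example)
    return 1 if a == b else -1


def column_scores(rows):
    """Sum-of-pairs score of each column of a multiple alignment."""
    return [sum(pair_score(a, b) for a, b in combinations(col, 2))
            for col in zip(*rows)]


def sum_of_pairs(rows):
    """Columns are scored independently, so S(m) is the sum over columns."""
    return sum(column_scores(rows))


alignment = ["AC-GCGG-C",
             "AC-GC-GAG",
             "GCCGC-GAG"]
print(column_scores(alignment))
print(sum_of_pairs(alignment))
```

Summing over columns and summing over induced pairwise alignments give the same total, because each induced alignment is scored column by column with the same pair scores.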

Sum Of Pairs (cont’d)
Drawback: no evolutionary characterization
–Every sequence derived from all others
Heuristic way to incorporate evolution tree
–Weighted Sum of Pairs: $S(m) = \sum_{k<l} w_{kl} \, s(m^k, m^l)$
–$w_{kl}$: weight decreasing with distance
Figure: tree over Human, Mouse, Chicken, Duck

Consensus score
  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
  TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
  CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
  CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Find optimal consensus string $m^*$ to maximize
  $S(m) = \sum_i s(m^*, m^i)$
  $s(m^k, m^l)$: score of pairwise alignment (k, l)

Multiple Sequence Alignments Algorithms

Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsch:
Find the longest path in a high-dimensional cube
–As opposed to a two-dimensional grid
Uses an N-dimensional matrix
–As opposed to a two-dimensional array
Entry $F(i_1, \ldots, i_N)$ represents the score of the optimal alignment for $s_1[1..i_1], \ldots, s_N[1..i_N]$
  $F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{current}) \big)$, where the max runs over all neighbors of the cell

Multidimensional Dynamic Programming (MDP)
Example: in 3D (three sequences): $2^3 - 1 = 7$ neighbors/cell
$$F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k), \\
F(i-1,j-1,k) + S(x_i, x_j, -), \\
F(i-1,j,k-1) + S(x_i, -, x_k), \\
F(i,j-1,k-1) + S(-, x_j, x_k), \\
F(i-1,j,k) + S(x_i, -, -), \\
F(i,j-1,k) + S(-, x_j, -), \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}$$
Figure: cell (i,j,k) and its seven neighbors (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)

Multidimensional Dynamic Programming (MDP)
Running Time:
1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
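For three sequences the recurrence is small enough to write out in full. The sketch below fills the 3D table with a sum-of-pairs column score; the pair scores (match 1, mismatch -1, letter-gap -1, gap-gap 0) are the values assumed from the earlier sum-of-pairs example, and `align3` is a hypothetical name. It returns only the optimal score, without traceback.

```python
from itertools import combinations, product

GAP = "-"


def pair_score(a, b):
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else -1


def column_score(col):
    """Sum-of-pairs score S(...) of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


# The 2^3 - 1 = 7 ways to step into a cell: each sequence either
# contributes a letter (1) or a gap (0), excluding the all-gap column.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]


def align3(x, y, z):
    """Optimal sum-of-pairs score of a global alignment of x, y and z."""
    F = {(0, 0, 0): 0}
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1),
                           range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        for di, dj, dk in MOVES:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # neighbor lies outside the cube
            col = (x[pi] if di else GAP,
                   y[pj] if dj else GAP,
                   z[pk] if dk else GAP)
            best = max(best, F[(pi, pj, pk)] + column_score(col))
        F[(i, j, k)] = best
    return F[(len(x), len(y), len(z))]


print(align3("ACGCGGC", "ACGCGAG", "GCCGCGAG"))
```

`product` visits cells in lexicographic order, so every neighbor is filled before the cell that needs it. The table has (L+1)^3 cells with 7 candidates each, which is the 3D instance of the O(2^N L^N) bound.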

Faster MDP Carrillo & Lipman, 1988 –Branch and bound –Other heuristics Practical for about 6 sequences of length about 200-300.

Progressive Alignment
Multiple Alignment is NP-hard
Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences $x_i$, $x_j$
2. Fix that alignment
3. Align a third sequence $x_k$ to the alignment $x_i, x_j$
4. Repeat until all sequences are aligned
Running Time: $O(N L^2)$ (each alignment takes $O(L^2)$; repeat N times)

Progressive Alignment
When evolutionary tree is known:
–Align closest first, in the order of the tree
Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw)
Figure: tree with leaves x, y, z, w

Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment
Algorithm:
1. Find all $d_{ij}$: alignment dist($x_i$, $x_j$)
   High alignment score => short distance
2. Construct a tree (Neighbor-joining hierarchical clustering. Will discuss in future)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics

CLUSTALW example
Sequences:
  S1 ALSK
  S2 TNSD
  S3 NASK
  S4 NTSD
Distance matrix:
       s1  s2  s3  s4
  s1    0   9   4   7
  s2        0   8   3
  s3            0   7
  s4                0
Guide tree: s1 joins s3, s2 joins s4
Align s1 with s3:
  -ALSK
  NA-SK
Align s2 with s4:
  -TNSD
  NT-SD
Align the two alignments:
  -ALSK
  -TNSD
  NA-SK
  NT-SD
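The order in which the sequences of this example get merged can be derived from the distance matrix alone. The sketch below uses plain average-linkage agglomerative clustering as a stand-in; that is an assumption made to keep the example short, since the slides name neighbor-joining as the tree-building method. `merge_order` is a hypothetical helper, and it produces only the merge order, not the alignments themselves.

```python
from itertools import combinations


def merge_order(dist):
    """Greedy average-linkage clustering.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    Returns the list of cluster pairs in the order they are merged.
    """
    names = sorted({name for pair in dist for name in pair})
    clusters = [(name,) for name in names]

    def cluster_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Distance matrix from the CLUSTALW example.
dist = {
    frozenset(("s1", "s2")): 9, frozenset(("s1", "s3")): 4,
    frozenset(("s1", "s4")): 7, frozenset(("s2", "s3")): 8,
    frozenset(("s2", "s4")): 3, frozenset(("s3", "s4")): 7,
}
for step in merge_order(dist):
    print(step)
```

On this matrix the closest pair (s2, s4) is merged first, then (s1, s3), and finally the two pairs are joined, which matches the grouping of the guide tree in the example.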

Problems with progressive alignment:
Depend on pair-wise alignments
–If sequences are very distantly related, much higher likelihood of errors
Initial alignments are “frozen” even when new evidence comes
Example:
  x: GAAGTT
  y: GAC-TT   (frozen!)
  z: GAACTG
  w: GTACTG
Now clear: correct y should be GA-CTT

Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar $x_i$, $x_j$
2. Align $x_k$ most similar to ($x_i x_j$)
3. Repeat 2 until ($x_1 \ldots x_N$) are aligned
4. For j = 1 to N, remove $x_j$, and realign it to $x_1 \ldots x_{j-1} x_{j+1} \ldots x_N$
5. Repeat 4 until convergence
Steps 1 to 3 are the progressive alignment

Iterative Refinement (cont’d)
For each sequence y
1. Remove y
2. Realign y (while rest fixed)
Figure: x, z fixed projection; allow y to vary
Note: Guaranteed to converge (why?)
Running time: $O(k N L^2)$, k: number of iterations

Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA   (+ 3 matches)
  z: GAACTGA
  w: GTACTGA

Iterative Refinement
Example not handled well:
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
Realigning any single $y_i$ changes nothing

Restricted MDP
Similar to bounded DP in pair-wise alignment
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m
Running Time: $O(2^N R^{N-1} L)$
Figure: band of radius R around m in the (x, y, z) cube
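To get a feel for how much the radius restriction saves, the two running-time expressions can be compared directly. The sketch below evaluates them as plain operation counts with constants ignored; the values N = 6, L = 300 and R = 10 are illustrative assumptions, and the function names are hypothetical.

```python
def full_mdp_work(L, N):
    """(2^N - 1) neighbors for each of the L^N cells of the full cube."""
    return (2 ** N - 1) * L ** N


def restricted_mdp_work(L, N, R):
    """(2^N - 1) neighbors for roughly R^(N-1) * L cells near alignment m."""
    return (2 ** N - 1) * R ** (N - 1) * L


N, L, R = 6, 300, 10  # illustrative values only
print(f"full MDP:       {full_mdp_work(L, N):.2e}")
print(f"restricted MDP: {restricted_mdp_work(L, N, R):.2e}")
print(f"ratio:          {full_mdp_work(L, N) / restricted_mdp_work(L, N, R):.2e}")
```

The ratio is (L/R)^(N-1), so the saving grows quickly with the number of sequences as long as R stays small relative to L.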

Restricted MDP
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
Within radius 1 of the optimal, so Restricted MDP will fix it.

Other approaches Profile Hidden Markov Models –Statistical learning methods –Will discuss in future

Multiple alignment tools Clustal W (Thompson, 1994) –Most popular PRRP (Gotoh, 1993) HMMT (Eddy, 1995) DIALIGN (Morgenstern, 1998) T-Coffee (Notredame, 2000) MUSCLE (Edgar, 2004) Align-m (Walle, 2004) PROBCONS (Do, 2004)

In summary Multiple alignment algorithms: –MDP (too slow) Branch & Bound doesn’t solve the problem entirely –Progressive alignment: clustalW –Iterative refinement –Restricted MDP
