CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment.

Presentation transcript:

CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment

Roadmap Last lecture review –Sequence alignment statistics Today –Heuristic alignment algorithms Basic Local Alignment Search Tools –Multiple sequence alignment algorithms

Sequence Alignment Statistics Substitution matrices –How is the BLOSUM-N matrix made –How to make your own substitution matrix –What’s the meaning of an arbitrary substitution matrix Significance of sequence alignments –P-value estimation for Global alignment scores Local alignment scores –Critical score for local alignment to guarantee significance

Heuristic Local Aligners: BLAST and the like

State of biological databases
Sequenced genomes (size in bp):
Human        3 × 10^9
Yeast        1.2 × 10^7
Mouse        2.7 × 10^9
Rat          2.6 × 10^9
Neurospora   4 × 10^7
Fugu fish    3.3 × 10^8
Tetraodon    3 × 10^8
Mosquito     2.8 × 10^8
Drosophila   1.2 × 10^8
Worm         1.0 × 10^8
Rice         1.0 × 10^9
Arabidopsis  1.2 × 10^8
Sea squirts  1.6 × 10^8
Current rate of sequencing:
–4 big labs × 3 × 10^9 bp/year/lab
–10s of small labs
–Private sector

State of biological databases Number of genes in these genomes: Vertebrate: ~30,000 Insects: ~14,000 Worm: ~17,000 Fungi: ~6,000-10,000 Small organisms: 100s-1,000s Each known or predicted gene has an associated protein sequence >1,000,000 known / predicted protein sequences

Some useful applications of alignments
Given a newly discovered gene,
- Does it occur in other species?
Assume we try Smith-Waterman with our new gene against the entire genomic database: may take several weeks!

Some useful applications of alignments
Given a newly sequenced organism,
- Which subregions align with other organisms?
  - Potential genes
  - Other functional units
Assume we try Smith-Waterman with our newly sequenced mammal (about 3 × 10^9 bp, like the human and mouse genomes listed earlier) against the entire genomic database: > 1000 years???

BLAST
Basic Local Alignment Search Tool
–Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990
–The most widely used bioinformatics tool
–One of the most cited papers in the history of science
Which is better: one long mediocre match, or a few nearby, short, strong matches with the same total score?
–Score-wise, exactly equivalent
–Biologically, the latter may be more interesting, and is common
–At least, if we must miss some, we would rather miss the former
BLAST is a heuristic algorithm emphasizing the latter
–Speed/sensitivity tradeoff: BLAST may miss the former, but gains greatly in speed

BLAST
Available at NCBI (National Center for Biotechnology Information) for download and online use, along with many sequence databases
Main idea:
1.Construct a dictionary of all the words in the query
2.Initiate a local alignment for each word match between query and DB
Running time: O(MN); however, orders of magnitude faster than Smith-Waterman
(Figure: query scanned against DB)

BLAST: Original Version
Dictionary: all words of length k (~11 for DNA, 3 for proteins)
Alignment initiated between words of alignment score ≥ T (typically T = k)
Alignment: ungapped extensions until score falls below statistical threshold
Output: all local alignments with score > statistical threshold
(Figure: query words scanned against DB)

BLAST: Original Version
(Figure: sequences A C G A A G T A A G G T C C A G T and C C C T T C C T G G A T T G C G A)
Example: k = 4, T = 4
The matching word GGTC initiates an alignment
Extension to the left and right with no gaps until alignment falls < 50%
Output:
GTAAGGTCC
GTTAGGTCC
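
The seed-and-extend idea above can be sketched in a few lines of Python. This is only an illustration, not NCBI's implementation: the function names, the +1/-1 scoring, the `xdrop` stopping rule (stop once the running score falls more than `xdrop` below the best seen) and the `min_score` cutoff are all assumptions standing in for the statistical thresholds mentioned on the slide.

```python
from collections import defaultdict

def build_word_index(query, k):
    """Dictionary: every length-k word of the query -> its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

def best_extension(pairs, match, mismatch, xdrop):
    """Walk outward from a seed; return (best score gain, length achieving it)."""
    score = best = best_len = 0
    for n, (a, b) in enumerate(pairs, start=1):
        score += match if a == b else mismatch
        if score > best:
            best, best_len = score, n
        elif best - score > xdrop:  # assumed stopping rule, not the slide's
            break
    return best, best_len

def seed_and_extend(query, db, k=4, match=1, mismatch=-1, xdrop=2, min_score=5):
    """Exact k-word seeds (T = k), then ungapped extension in both directions."""
    index = build_word_index(query, k)
    hits = set()
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], []):
            right, r_len = best_extension(
                zip(query[q + k:], db[d + k:]), match, mismatch, xdrop)
            left, l_len = best_extension(
                zip(reversed(query[:q]), reversed(db[:d])), match, mismatch, xdrop)
            score = k * match + left + right
            if score >= min_score:
                hits.add((score,
                          query[q - l_len:q + k + r_len],
                          db[d - l_len:d + k + r_len]))
    return hits

# Hypothetical database string built to contain the slide's output segment.
print(seed_and_extend("ACGAAGTAAGGTCCAGT", "TTCGTTAGGTCCTGA"))
```

Several seeds on the same diagonal extend to the same segment pair, which is why the hits are collected in a set.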

Gapped BLAST
(Figure: sequences A C G A A G T A A G G T C C A G T and T C T G A T C C T G G A T T G C G A)
Added features:
–Pairs of words can initiate alignment
–Extensions with gaps in a band around anchor
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT

Example
Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
1,726,556 sequences; 8,074,398,388 total letters

>gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
Length =
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct:      tacacccagattacaccccga

>gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
Length =
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga 3911

Example
Query: Human atoh enhancer, 179 letters [1.5 min]
Result: 57 blast hits
1. gi| |gb|AF |AF Homo sapiens ATOH1 enhanc e-95
2. gi| |gb|AC | Mus musculus Strain C57BL6/J ch (264 bits) e-68
3. gi| |gb|AF |AF Mus musculus Atoh1 enhanc e-66
4. gi| |gb|AF | Gallus gallus CATH1 (CATH1) gene (78 bits) e-12
5. gi| |emb|AL | Zebrafish DNA sequence from clo (54 bits) e-05
6. gi| |gb|AC | Oryza sativa chromosome 10 BAC O (44 bits)
7. gi| |ref|NM_ | Mus musculus suppressor of Ty (42 bits)
8. gi| |gb|BC | Mus musculus, Similar to suppres (42 bits)

gi| |gb|AF |AF218258 Mus musculus Atoh1 enhancer sequence
Length = 1517
Score = 256 bits (129), Expect = 9e-66
Identities = 167/177 (94%), Gaps = 2/177 (1%)
Strand = Plus / Plus

Query: 3    tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
            ||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203

Query: 63   cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
            ||||||||||||||||||||||||||  ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262

Query: 123  cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179
            ||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318

BLAST Score: bit score vs raw score
Bit score is converted from raw score by taking into account K and λ:
$S' = (\lambda S - \ln K) / \ln 2$
To compute E-value from bit score:
$E = K M' N' e^{-\lambda S} = M' N' 2^{-S'}$
Critical score is now:
$S^* = \log_2(M' N')$
–If $S' \gg S^*$: significant
–If $S' \ll S^*$: not significant
($M' \approx M$, $N' \approx N$)
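
As a small illustration of the conversions on this slide, the following Python sketch computes a bit score from a raw score, an E-value from a bit score, and the critical bit score. The function and parameter names are made up for this example; `lam` and `K` stand for the Karlin-Altschul parameters λ and K, and `m`, `n` stand for the effective lengths M' and N', whose real values depend on the scoring system and on BLAST's length corrections.

```python
import math

def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value_from_bits(bits, m, n):
    """E = M' * N' * 2^(-S')"""
    return m * n * 2.0 ** (-bits)

def e_value_from_raw(raw_score, lam, K, m, n):
    """E = K * M' * N' * exp(-lambda * S)"""
    return K * m * n * math.exp(-lam * raw_score)

def critical_bits(m, n):
    """S* = log2(M' * N'): the bit score at which E = 1."""
    return math.log2(m * n)

# Illustrative parameter values only (assumed, not taken from the lecture).
lam, K, m, n = 1.37, 0.711, 29, 8_074_398_388
s_bits = bit_score(17, lam, K)
print(s_bits, e_value_from_bits(s_bits, m, n), critical_bits(m, n))
```

Both E-value routes give the same number, which is the point of the bit score: once S' is known, λ and K are no longer needed.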

Different types of BLAST blastn: search nucleic acid databases blastp: search protein databases blastx: you give a nucleic acid sequence, search protein databases tblastn: you give a protein sequence, search nucleic acid databases tblastx: you give a nucleic sequence, search nucleic acid database, implicitly translate both into protein sequences

BLAST cons and pros Advantages –Fast!!!! –A few minutes to search a database of bases Disadvantages –Sensitivity may be low –Often misses weak homologies New improvement –Make it even faster Mainly for aligning very similar sequences or really long sequences –E.g. whole genome vs whole genome –Make it more sensitive PSI-BLAST: iteratively add more homologous sequences PatternHunter: discontinuous seeds

Variants of BLAST NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version Optimized, added features MEGABLAST: Optimized to align very similar sequences. Linear gap penalty BLAT: Blast-Like Alignment Tool BlastZ: Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower

Pattern hunter
Instead of requiring exact matches of k consecutive letters (a k-mer), we can look for discontinuous matches
–My query sequence looks like: ACGTAGACTAGCAGTTAAG
–Search for sequences in database that match AXGXAGXCTAXC
X stands for don’t care
Seed:

Pattern hunter A good seed may give you both a higher sensitivity and higher specificity You may think is the best seed –Because mutation in the third position of a codon often doesn’t change the amino acid –Best seed is actually Empirically determined How to design such seed is an open problem May combine multiple random seeds
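
To make the idea of discontinuous seeds concrete, here is a minimal Python sketch of spaced-seed matching. The seed is written as a string of `1` (position must match) and `0` (don't care); the seed `1101` used below is a toy value chosen for illustration and is not one of the seeds from the slides. Function names are hypothetical.

```python
from collections import defaultdict

def spaced_key(seq, pos, seed):
    """Letters of seq under the '1' positions of the seed, starting at pos."""
    return "".join(seq[pos + i] for i, c in enumerate(seed) if c == "1")

def spaced_seed_hits(query, db, seed):
    """All (query_pos, db_pos) pairs whose windows agree at every '1' position."""
    span = len(seed)
    index = defaultdict(list)
    for i in range(len(query) - span + 1):
        index[spaced_key(query, i, seed)].append(i)
    hits = []
    for j in range(len(db) - span + 1):
        for i in index.get(spaced_key(db, j, seed), []):
            hits.append((i, j))
    return hits

query, db = "ACGTAGAC", "ACCTAGAC"   # differ only at the third letter
print(spaced_seed_hits(query, db, "1101"))  # tolerant of that mismatch
print(spaced_seed_hits(query, db, "1111"))  # consecutive seed of the same span
```

With the spaced seed, the windows starting at position 0 still produce a hit because the mismatching third letter falls on a don't-care position; the consecutive seed of the same span misses it.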

Things we’ve covered so far Global alignment –Needleman-Wunsch and variants Local Alignment –Smith-Waterman Improvement on space and time More accurate gap penalty models Heuristic algorithms –BLAST families Statistics for sequence alignment

Commonality: They all deal with aligning two sequences –Pair-wise sequence alignment Next: Aligning multiple sequences all together –Multiple sequence alignment Motivation: –A faint similarity between two sequences becomes very significant if present in many sequences Protein domains Motifs responsible for gene regulation

Definition
Given N sequences $x_1, x_2, \dots, x_N$:
–Insert gaps (-) in each sequence $x_i$, such that
  All sequences have the same length L
  Score of the global mapping is maximum
Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences
Same for a multiple alignment!

Scoring Function
Ideally:
–Find alignment that maximizes probability that sequences evolved from common ancestor
(Figure: phylogenetic tree, or evolution tree, with leaves x, y, z, w, v and unknown ancestors marked "?")

Scoring Function (cont’d) Unfortunately: too many parameters Compromises: –Ignore phylogenetic tree Compute from pair-wise scores –Based on sum of all pair-wise scores –Based on scores with a consensus sequence

First assumption Columns are independent –Similar in pair-wise alignment Therefore, the score of an alignment is the sum of all columns Need to decide how to score a single column

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment (columns in which both sequences have a gap are dropped)
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

Sum Of Pairs (cont’d)
The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
$S(m) = \sum_{k<l} s(m^k, m^l)$
$s(m^k, m^l)$: score of induced alignment (k,l)

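Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Scoring matrix over A, C, G, T and the gap symbol (the column sums below are consistent with +1 for a match, -1 for a mismatch or a letter against a gap, and 0 for gap against gap)
Column 1: (A,A) + (A,G) x 2 = -1
Column 2: (C,C) x 3 = 3
Column 8: (-,A) x 2 + (A,A) = -1
Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5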
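
The sum-of-pairs computation can be checked with a short Python sketch. The scoring values (+1 match, -1 mismatch, -1 letter against gap, 0 gap against gap) are an assumption inferred from the column totals in the example, since the scoring table itself is not legible in the transcript; the function names are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of one aligned pair of symbols (assumed scoring scheme)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def column_scores(alignment):
    """Sum-of-pairs score of every column of a multiple alignment."""
    return [sum(pair_score(a, b) for a, b in combinations(col, 2))
            for col in zip(*alignment)]

def sp_score(alignment):
    """Total sum-of-pairs score: columns are scored independently."""
    return sum(column_scores(alignment))

def induced(alignment, k, l):
    """Pairwise alignment of rows k and l, dropping gap-gap columns."""
    kept = [(a, b) for a, b in zip(alignment[k], alignment[l])
            if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(column_scores(msa))   # [-1, 3, -2, 3, 3, -2, 3, -1, -1]
print(sp_score(msa))        # 5
print(induced(msa, 0, 1))   # ('ACGCGG-C', 'ACGC-GAG')
```

Because gap-against-gap scores 0, summing the column scores gives the same total as summing the scores of the three induced pairwise alignments.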

Sum Of Pairs (cont’d)
Drawback: no evolutionary characterization
–Every sequence derived from all others
Heuristic way to incorporate evolution tree
–Weighted Sum of Pairs:
$S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$
$w_{kl}$: weight decreasing with distance
(Figure: tree with leaves Human, Mouse, Chicken, Duck)

Consensus score
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Find optimal consensus string $m^*$ to maximize
$S(m) = \sum_i s(m^*, m^i)$
$s(m^k, m^l)$: score of pairwise alignment (k,l)
Consensus sequence:
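
For a fixed multiple alignment with independently scored columns, a consensus string maximizing the consensus score can be chosen one column at a time. The Python sketch below assumes the same illustrative scoring as the sum-of-pairs example (+1 match, -1 mismatch, -1 letter against gap, 0 gap against gap) and a DNA alphabet plus the gap symbol; ties are broken by alphabet order. The names are hypothetical.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of one aligned pair of symbols (assumed scoring scheme)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def consensus(alignment, alphabet="ACGT-"):
    """Per column, pick the symbol with the best total score against the column."""
    chosen = []
    for col in zip(*alignment):
        chosen.append(max(alphabet,
                          key=lambda c: sum(pair_score(c, r) for r in col)))
    return "".join(chosen)

def consensus_score(cons, alignment):
    """S(m) = sum over rows i of s(m*, m^i), scored column by column."""
    return sum(pair_score(c, r)
               for row in alignment for c, r in zip(cons, row))

toy = ["ACGT", "ACGT", "AAG-"]          # toy alignment for illustration
best = consensus(toy)
print(best, consensus_score(best, toy))  # ACGT 8
```

This only optimizes the consensus for a given alignment; finding the alignment and the consensus jointly is the hard part of the problem.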

Multiple Sequence Alignment Algorithms

Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsch:
Find the longest path in a high-dimensional cube
–As opposed to a two-dimensional grid
Uses an N-dimensional matrix
–As opposed to a two-dimensional array
Entry $F(i_1, \dots, i_N)$ represents the score of the optimal alignment for $s_1[1..i_1], \dots, s_N[1..i_N]$
$F(i_1, i_2, \dots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{current}) \big)$, where the maximum is taken over all neighbors of the cell

Multidimensional Dynamic Programming (MDP)
Example: in 3D (three sequences): $2^3 - 1 = 7$ neighbors/cell
$$F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k), \\
F(i-1,j-1,k) + S(x_i, x_j, -), \\
F(i-1,j,k-1) + S(x_i, -, x_k), \\
F(i,j-1,k-1) + S(-, x_j, x_k), \\
F(i-1,j,k) + S(x_i, -, -), \\
F(i,j-1,k) + S(-, x_j, -), \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}$$
Here $x_i$, $x_j$, $x_k$ denote the current letters of the first, second and third sequence.
(Figure: cube cell (i,j,k) with its seven neighbors (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1))
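
The three-sequence recurrence can be written out directly. The Python sketch below fills the cube cell by cell and returns only the optimal score (no traceback). It assumes a sum-of-pairs column score with +1 match, -1 mismatch, -1 letter against gap and 0 gap against gap, which is an illustrative choice; the function names are hypothetical.

```python
from itertools import combinations, product

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of one aligned pair of symbols (assumed scoring scheme)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def column_score(col):
    """Sum-of-pairs score S(...) of a single alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))

# The 2^3 - 1 = 7 moves: each sequence either consumes a letter (1) or gaps (0).
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]

def align3_score(x, y, z):
    """Optimal sum-of-pairs score of a global alignment of three sequences."""
    F = {(0, 0, 0): 0}
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        for di, dj, dk in MOVES:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # neighbor lies outside the cube
            col = (x[pi] if di else "-", y[pj] if dj else "-", z[pk] if dk else "-")
            best = max(best, F[pi, pj, pk] + column_score(col))
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]

print(align3_score("AC", "AC", "AC"))  # 6: two columns of three matching pairs
```

`product` visits cells in lexicographic order, so every neighbor with smaller or equal indices is already filled when a cell is computed.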

Multidimensional Dynamic Programming (MDP)
Running Time:
1.Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2.Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
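
A quick back-of-the-envelope calculation shows how fast this bound grows. The Python sketch below simply multiplies the number of cells by the number of neighbors per cell, following the slide's counts; the function name and the sample values of N and L are arbitrary.

```python
def mdp_cell_updates(n_seqs, length):
    """Cells (L^N) times neighbors examined per cell (2^N - 1)."""
    return (2 ** n_seqs - 1) * length ** n_seqs

for n in (2, 3, 4, 6):
    print(n, mdp_cell_updates(n, 100))
```

Going from two sequences to three at L = 100 already raises the work from 3 × 10^4 to 7 × 10^6 neighbor evaluations.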

Faster MDP Carrillo & Lipman, 1988 –Branch and bound –Other heuristics Practical for about 6 sequences of length about

Progressive Alignment
Multiple Alignment is NP-hard
Most used heuristic: Progressive Alignment
Algorithm:
1.Align two of the sequences $x_i$, $x_j$
2.Fix that alignment
3.Align a third sequence $x_k$ to the alignment $x_i, x_j$
4.Repeat until all sequences are aligned
Running Time: $O(N L^2)$
–Each alignment takes $O(L^2)$
–Repeat N times

Progressive Alignment
When evolutionary tree is known:
–Align closest first, in the order of the tree
Example: order of alignments:
1. (x,y)
2. (z,w)
3. (xy, zw)
(Figure: tree with leaves x, y, z, w)

Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment
Algorithm:
1.Find all $d_{ij}$: alignment dist($x_i$, $x_j$)
  High alignment score => short distance
2.Construct a tree (neighbor-joining hierarchical clustering; will discuss in future)
3.Align nodes in order of decreasing similarity
  + a large number of heuristics

CLUSTALW example
Sequences:
S1: ALSK
S2: TNSD
S3: NASK
S4: NTSD

Distance matrix:
      s1   s2   s3   s4
s1    0    …    …    …
s2         0    8    3
s3              0    7
s4                   0

Guide tree: s1 joins s3, s2 joins s4

Align s1 with s3:
-ALSK
NA-SK

Align s2 with s4:
-TNSD
NT-SD

Align the two alignments:
-ALSK
-TNSD
NA-SK
NT-SD

Iterative Refinement
Problems with progressive alignment:
–Depends on pair-wise alignments
–If sequences are very distantly related, the likelihood of errors is much higher
–Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   (frozen!)
z: GAACTG
w: GTACTG
Once z and w are added it is clear that the correct y should be GA-CTT

Iterative Refinement
Algorithm (Barton-Sternberg):
1.Align most similar $x_i$, $x_j$
2.Align $x_k$ most similar to ($x_i x_j$)
3.Repeat 2 until ($x_1 \dots x_N$) are aligned
4.For j = 1 to N, remove $x_j$, and realign to $x_1 \dots x_{j-1} x_{j+1} \dots x_N$
5.Repeat 4 until convergence
Steps 1 to 3 are a progressive alignment

Iterative Refinement (cont’d)
For each sequence y
1.Remove y
2.Realign y (while rest fixed)
(Figure: alignment cube with axes x, y, z; the projection onto x,z is fixed, y is allowed to vary)
Note: Guaranteed to converge (why?)
Running time: $O(kNL^2)$, k: number of iterations

Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA

Iterative Refinement
Example not handled well:
x:   GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z:   GAACTGA
w:   GTACTGA
Realigning any single $y_i$ changes nothing

Restricted MDP
Similar to bounded DP in pair-wise alignment
1.Construct progressive multiple alignment m
2.Run MDP, restricted to radius R from m
Running Time: $O(2^N R^{N-1} L)$
(Figure: alignment cube with axes x, y, z)

Restricted MDP
x:   GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z:   GAACTGA
w:   GTACTGA
Within radius 1 of the optimal ⇒ Restricted MDP will fix it.

Other approaches Profile Hidden Markov Models –Statistical learning methods –Will discuss in future

Multiple alignment tools Clustal W (Thompson, 1994) –Most popular PRRP (Gotoh, 1993) HMMT (Eddy, 1995) DIALIGN (Morgenstern, 1998) T-Coffee (Notredame, 2000) MUSCLE (Edgar, 2004) Align-m (Walle, 2004) PROBCONS (Do, 2004)

In summary Multiple alignment algorithms: –MDP (too slow) Branch & Bound doesn’t solve the problem entirely –Progressive alignment: clustalW –Iterative refinement –Restricted MDP