Multiple Sequence Alignment by Iterative Tree-Neighbor Alignments
Susan Bibeault
June 9, 2000
2 / 29 Outline
Problem Statement and Importance
Terminology
Current Approaches
Our Alignment Heuristic
Performance Results
Conclusions
Future Work
4 / 29 Multiple Sequence Alignment Problem
Given Sequence Set:
    VLSPADNVKAAWGKVGAHAGEYGAEALERMF
    VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVY
    GLSDGEWQLVLNVWGKVEADIPGHVLIRLFK
–Insert gaps into sequences so that evolutionarily conserved regions are aligned
One alignment:
    V-LSPADN--VKAAWGKVGAHAGEYGAEALERM---F-
    VHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYP
    G-LSDGEWQLVLNVWGKVEA---DIPGHVLIRL---FK
Another alignment of the same sequences:
    -VLSPADN--VKAAWGKVGAHAGEYGAEALERMF----
    VHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYP
    -GLSDGEWQLVLNVWGKVEA---DIPGHVLIRLFK---
Fragments shown alongside it (they match its first two and last five columns):
    -VF----
    VHLVVYP
    -GFK---
Important tool
–Relate Homologous Proteins
–Discover Conserved Regions
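6 / 29 Scoring Multiple Alignments
Tree based: $\sum_{\text{edges}} \mathrm{cost}(\text{edge})$
Sum of Pairs: $\sum_{i<j} \mathrm{cost}(i,j)$
Example values shown in the figure: cost(i,j) = 6, cost(edge) = 1
Figure: tree over gorilla, orangutan, gibbon, chimpanzee, human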
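The difference between the two scoring schemes can be sketched on a single alignment column. This is a minimal illustration, assuming a unit cost for any mismatch; the function names, the tree topology, the interior labels `n1` to `n3`, and the residues are all hypothetical and were chosen only so that the totals come out as 6 and 1, the example values on the slide. The data behind the actual figure is not recoverable.

```python
from itertools import combinations

def pair_cost(a, b):
    """Unit-cost distance between two aligned rows (assumed cost model)."""
    return sum(1 for x, y in zip(a, b) if x != y)

def sum_of_pairs(rows):
    """Sum pair_cost over every unordered pair of rows."""
    return sum(pair_cost(a, b) for a, b in combinations(rows, 2))

def tree_cost(labels, edges):
    """Sum pair_cost over the edges of a tree whose nodes are all labeled."""
    return sum(pair_cost(labels[u], labels[v]) for u, v in edges)

# One alignment column; leaf residues are made up for illustration.
labels = {
    "human": "A", "chimpanzee": "A", "gorilla": "A",
    "orangutan": "G", "gibbon": "G",
    # hypothetical interior nodes with assumed labels
    "n1": "A", "n2": "A", "n3": "G",
}
leaves = ["human", "chimpanzee", "gorilla", "orangutan", "gibbon"]
edges = [
    ("human", "n1"), ("chimpanzee", "n1"), ("n1", "n2"), ("gorilla", "n2"),
    ("n2", "n3"), ("orangutan", "n3"), ("gibbon", "n3"),
]

sp = sum_of_pairs([labels[name] for name in leaves])
tc = tree_cost(labels, edges)
print(sp, tc)
```

A single substitution on one tree edge is counted once by the tree-based score but once per separated pair by sum of pairs, which is why the two totals diverge.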
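7 / 29 Alignment Scoring
Cost Matrix: C(aa1, aa2)
Gap Penalties:
–Simple: C(aa, -)
–Affine: C(-) + Len * C(aa, -)
Recurrence (simple gap penalty g):
$$\mathrm{Cost}(s[1..i], t[1..j]) = \min \begin{cases} \mathrm{Cost}(s[1..i], t[1..j-1]) + g \\ \mathrm{Cost}(s[1..i-1], t[1..j-1]) + C(s[i], t[j]) \\ \mathrm{Cost}(s[1..i-1], t[1..j]) + g \end{cases}$$
Figure: alignment matrix with VLSPADNVKA on one axis and GLSDGEWQLVL on the other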
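The pairwise recurrence can be sketched as a standard dynamic program. This is a minimal version, assuming the simple (non-affine) gap penalty and treating both the gap penalty and the substitution values as non-negative costs that are added along the path; `global_alignment_cost` and `unit_cost` are illustrative names, not from the talk.

```python
def global_alignment_cost(s, t, sub_cost, gap):
    """Minimum cost of aligning all of s with all of t (simple gap penalty)."""
    n, m = len(s), len(t)
    # cost[i][j] holds the best cost of aligning s[:i] with t[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap          # s[:i] against nothing: i gaps
    for j in range(1, m + 1):
        cost[0][j] = j * gap          # t[:j] against nothing: j gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i][j - 1] + gap,                               # gap in s
                cost[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # (mis)match
                cost[i - 1][j] + gap,                               # gap in t
            )
    return cost[n][m]

def unit_cost(a, b):
    """Assumed stand-in for a real cost matrix: 0 for a match, 1 otherwise."""
    return 0 if a == b else 1

result = global_alignment_cost("ABCDEFGHI", "ABCDFGHI", unit_cost, gap=2)
print(result)
```

With these assumed costs the cheapest alignment of the two strings uses exactly one gap, so the total equals the gap penalty.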
9 / 29 Current Approaches
Global Methods
–Optimal Algorithms (MSA, MWT, MUSEQAL)
–Progressive (MULTALIGN, PILEUP, CLUSTAL, MULTAL, AMULT, DFALIGN, MAP, PRRP, AMPS)
Local Methods
–PIMA, DIALIGN, PRALIGN, MACAW, BlockMaker, Iteralign
Combined (GENALIGN, ASSEMBLE, DCA)
Statistical (HMMT, SAGA, SAM, Match Box)
Parsimony (MALIGN, TreeAlign)
Global Alignment
    ABCDEFGHI
    ::::
    ABCD-FGHI
Local Alignment
    XXXABCDYYY
       ::::
    ZZZABCDEEEE
11 / 29 Our Heuristic
Distance Estimation
Tree Construction
Node Initialization
Tree Partitioning
Iteration
12 / 29 Estimation of Protein Distance
Aligned Sequences
    PEAAALYGRFT---IKSDVW
    PESAALYGRFT---IKSDVW
    PESLALYNKF---SIKSDVW
    PEALNYGRY----SSESDVW
    PEALNYGWY----SSESDVW
    PEVIRMQDDNPFSFSQSDVY
Estimated Pair Distances
    PEALNYGWY----SSESDVW
    PEVIRMQDDNPFSFSQSDVY
Issue: Implied vs. Optimal Pair Alignments
Pair as it appears in the multiple alignment:
    PESLALYNKF---SIKSDVW
    PEALNYGRY----SSESDVW
Same pair with the gap-only columns removed:
    PESLALYNKFSIKSDVW
    PEALNYGRY-SSESDVW
Same pair with the gap placed differently:
    PESLALYNKFSIKSDVW
    PEAL-NYGRYSSESDVW
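An implied pair alignment is obtained by taking two rows of the multiple alignment and dropping the columns where both rows have a gap. The sketch below shows that projection on two rows from the slide. The distance estimator (`p_distance`, the fraction of differing residues among gap-free columns) is an assumption for illustration only; the slide does not state which estimator the method uses.

```python
def implied_pair(row_a, row_b):
    """Project two rows out of a multiple alignment, dropping gap-only columns."""
    kept = [(x, y) for x, y in zip(row_a, row_b) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)

def p_distance(row_a, row_b):
    """Fraction of mismatches among columns with no gap (assumed estimator)."""
    cols = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not cols:
        return 1.0
    return sum(1 for x, y in cols if x != y) / len(cols)

# Rows 3 and 4 of the aligned block on the slide.
a, b = implied_pair("PESLALYNKF---SIKSDVW", "PEALNYGRY----SSESDVW")
print(a)
print(b)
print(p_distance(a, b))
```

Because the gap positions are inherited from the multiple alignment, the implied pair can score worse than an alignment computed for the two sequences alone, which is the issue the slide raises.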
13 / 29 Optimal Pair vs. Implied Pair
14 / 29 Interior Node Classification
Interior Nodes Classified by Percent Identity
–PID = (# matched residues) / (# total residues)
–User Specified Tiers
–User Specified Cost Criterion
Example:
–PID > 60%: PAM 40, High Gap Penalties
–PID > 40%: PAM 120, Medium Gap Penalties
–PID < 40%: PAM 200, Low Gap Penalties
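The tier lookup can be sketched as a small table scan. This is a minimal version using the example tiers from the slide; it assumes that "total residues" means the number of columns that are not gap-only, and that a PID of exactly 40% falls into the lowest tier, since the slide leaves both points open. The names `percent_identity`, `TIERS`, `LOWEST_TIER`, and `classify` are illustrative.

```python
def percent_identity(row_a, row_b):
    """PID over columns that are not gap-only (assumed denominator)."""
    cols = [(x, y) for x, y in zip(row_a, row_b) if x != "-" or y != "-"]
    if not cols:
        return 0.0
    matches = sum(1 for x, y in cols if x == y)
    return 100.0 * matches / len(cols)

# (PID must exceed this, matrix name, gap penalty level), highest tier first.
TIERS = [
    (60.0, "PAM 40", "high"),
    (40.0, "PAM 120", "medium"),
]
LOWEST_TIER = ("PAM 200", "low")

def classify(pid):
    """Return the (matrix, gap penalty level) tier for a node's PID."""
    for floor, matrix, gaps in TIERS:
        if pid > floor:
            return matrix, gaps
    return LOWEST_TIER  # assumption: exactly 40% also lands here

print(classify(percent_identity("ABCD", "ABCE")))
```

Keeping the tiers in a list mirrors the "user specified tiers" idea: changing the thresholds or matrices only means editing the table.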
15 / 29 Ordering Alignments
Isolate Sub Trees
–Threshold PID
Order Alignments
1. Sub Tree
2. Border Nodes
3. Integrate All
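One way to read the partitioning step is as cutting the tree wherever similarity drops below the threshold. The sketch below is only a guess at that idea, not the method from the talk: it assumes each tree edge carries a PID value, removes edges below the threshold, and reports the remaining connected pieces as sub trees and the endpoints of removed edges as border nodes. The function name, the edge representation, and all PID values are made up for illustration.

```python
def partition(edges, threshold):
    """Split a tree at low-PID edges.

    edges: list of (node_u, node_v, pid) tuples.
    Returns (subtrees, border_nodes): a list of node sets and a set of nodes.
    """
    adjacency = {}
    border = set()
    for u, v, pid in edges:
        adjacency.setdefault(u, set())
        adjacency.setdefault(v, set())
        if pid >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)
        else:
            border.update((u, v))     # edge is cut; both ends become border nodes
    subtrees, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                  # depth-first walk over kept edges
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        subtrees.append(component)
    return subtrees, border

# Hypothetical tree; n1 and n2 are interior nodes, PIDs are invented.
edges = [
    ("human", "n1", 90), ("chimpanzee", "n1", 88), ("n1", "n2", 35),
    ("gorilla", "n2", 70), ("orangutan", "n2", 65),
]
subtrees, border = partition(edges, threshold=40)
print(subtrees, border)
```

Under this reading, the ordering on the slide would align within each sub tree first, then across the border nodes, and finally integrate everything.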
16 / 29 Interior Alignments
Sum of Pairs
Bounded Search
Implementation
–Modular
–Reentrant
–Flexible Cost Criterion
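17 / 29 Generating Consensus
Alignment (A1, A2, A3)
Consensus X: $\min \sum_{k} D_k(A_k, X)$
For Each Position i:
$$X_i = \arg\min_{a} \left[ \mathrm{cost}(a, A1_i) + \mathrm{cost}(a, A2_i) + \mathrm{cost}(a, A3_i) \right]$$
Figure: X at the center, joined to A1, A2, A3 by distances D1, D2, D3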
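The per-position rule can be sketched as picking, for each column, the symbol with the smallest summed cost to the three neighbors. This is a minimal version, assuming a unit mismatch cost in place of a real cost matrix and treating the gap character as an ordinary symbol; `consensus` and `unit_cost` are illustrative names, and ties are broken by alphabet order, which is an arbitrary choice.

```python
def unit_cost(a, b):
    """Assumed stand-in for a real cost matrix: 0 for a match, 1 otherwise."""
    return 0 if a == b else 1

def consensus(rows, alphabet, cost):
    """For each column, choose the symbol minimizing the summed cost to all rows."""
    result = []
    for column in zip(*rows):
        def total(symbol, column=column):
            return sum(cost(symbol, x) for x in column)
        # min() returns the first minimal symbol, so ties follow alphabet order
        result.append(min(alphabet, key=total))
    return "".join(result)

x = consensus(["ACGT", "AC-T", "AGGT"], "ACGT-", unit_cost)
print(x)
```

Each column is decided independently here, which is what makes the rule cheap to evaluate.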
19 / 29 Testing the Method
BAliBASE benchmark
–“Correct” Alignments
–Core Blocks of Conserved Motifs
–Typical “Hard Problem” Sets
Protein Parsimony
–Measures “Evolutionary Steps” of Alignment
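20 / 29 Baseline: BAliBASE SP (chart; axis arrow labeled "better")
21 / 29 Baseline: BAliBASE TC (chart; axis arrow labeled "better")
22 / 29 Baseline: ProtPars (chart; axis arrow labeled "better")
23 / 29 Orphans/Families: BAliBASE SP (chart; axis arrow labeled "better")
24 / 29 Orphans/Families: ProtPars (chart; axis arrow labeled "better")
25 / 29 Larger Families (chart; axis arrow labeled "better")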
27 / 29 Conclusions
Solution Quality
Captures Evolutionary Information
Iterations Converge Quickly
Useful Tool
29 / 29 Future Work
Improved Alignment Consensus
Multiple Partitioning Thresholds
Multiple Solutions
Integrated Phylogeny Modifications
Parallel Implementation