Multiple Sequence Composition Alignment Name: Yip Chi Kin Date: 21-12-2006
Studied Papers [B03] Composition Alignment [S98] Divide-and-conquer Alignment [M99] DIALIGN Algorithm [SMS03] DCA + Segment-based
Main Aspects ․Dynamic Programming ․Composition Alignment ․Meta-code MSA ․Simultaneous MSA Pairwise Library (Global & Local) Consistency & Ungapped Divide-and-conquer Segment-based (Optimal scores)
Dynamic Programming [Figure: Dot Matrix, Edit Graph, and DP Matrix for sequences over C, T, G, A; edges are labelled as matches with score s(ai,bi), and deletions and insertions with score -d]
Needleman-Wunsch Algorithm: Global Alignment [Figure: panel "Scoring" with the filled DP score matrix for two short DNA sequences; panel "GA Results" with the resulting global alignment]
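The global alignment recurrence can be sketched in a few lines. This is a minimal illustration, not the scoring used on the slide: the `match`, `mismatch` and gap penalty `d` values are assumptions, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, d=2):
    """Global alignment; scoring values are illustrative assumptions."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i          # leading gaps in b
    for j in range(1, m + 1):
        F[0][j] = -d * j          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match / mismatch
                          F[i - 1][j] - d,       # gap in b
                          F[i][j - 1] - d)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bot = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bot.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(a[i - 1]); bot.append("-"); i -= 1
        else:
            top.append("-"); bot.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bot))
```

Calling `needleman_wunsch("CTTCT", "GCAT")` returns the score and the two gapped rows; other scoring schemes only change the three penalty arguments.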
Smith-Waterman Algorithm: Local Alignment [Figure: panel "Scoring" with the filled DP score matrix for two short DNA sequences; panel "GA Results" with the resulting alignment]
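Local alignment differs from the global case in two places: scores are clamped at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. A minimal sketch follows; the scoring values and the example strings are assumptions, not taken from the slide.

```python
def smith_waterman(a, b, match=2, mismatch=-1, d=2):
    """Local alignment; scoring values are illustrative assumptions."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # match / mismatch
                          H[i - 1][j] - d,      # gap in b
                          H[i][j - 1] - d)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    top, bot = [], []
    i, j = best_pos
    while H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bot.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - d:
            top.append(a[i - 1]); bot.append("-"); i -= 1
        else:
            top.append("-"); bot.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bot))
```

For the hypothetical inputs `"AAACGTAAA"` and `"TTACGTTT"`, the shared region `ACGT` is the only part that survives the zero clamp.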
MSA Methods ․Consistency-based ․Exact method ․Progressive method ․Iterative method ․Stochastic method ․Hidden Markov method
MSA Concepts: Consistency-based method [Figure: three panels over short DNA sequences: pairwise sequence alignments (PSAs), Trace formulation, and Latter formulation]
Results of MSA [Figure: aligned regions of short DNA sequences, with labels Unrealized, Realized, and Consistent]
Divide-and-conquer [Figure: each sequence S1, S2, S3 is divided at its cut position C1, C2, C3 into a prefix and a suffix; the division is repeated, the resulting pieces are aligned optimally, and the partial alignments are concatenated]
DP Distance
[Figure: DP distance matrices Wopt (prefix) and Wopt (suffix) for a pair of short DNA sequences, with the prefix and suffix regions marked; sequence shown: GTTCATGCCAGGTGTAAATC]
CS1,S2[C1,C2] = Wopt (prefix) + Wopt (suffix) – Wopt (total)
Additional-cost
[Figure: additional-cost matrix CS1,S2]
Cost of Diagonal: CS1,S2[1,1] = 0, CS1,S2[2,2] = 0, CS1,S2[3,3] = 0, CS1,S2[4,4] = 0, CS1,S2[5,4] = 0, CS1,S2[6,5] = 0
CS1,S2[2,2] = Wopt [CT,GT] + Wopt [ATAC,ATAC] – Wopt [CTATAC,GTATAC] = 1 + 2 – 3 = 0
CS1,S2[4,3] = Wopt [CTAT,GTA] + Wopt [AC,TAC] – Wopt [CTATAC,GTATAC] = 3 + 1 – 3 = 1
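The additional-cost definition, C[c1,c2] = Wopt(prefix) + Wopt(suffix) – Wopt(total), can be computed from two DP distance matrices: one over the sequences and one over their reversals. The sketch below assumes a simple distance (match 0, mismatch 1, gap 2); the slides do not state their scoring, so the values it produces need not agree with the numbers shown on the slide. The function names are hypothetical.

```python
def dist_matrix(a, b, mismatch=1, gap=2):
    """D[i][j] = optimal alignment distance of a[:i] and b[:j] (assumed costs)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    return D

def additional_cost(a, b, mismatch=1, gap=2):
    """C[i][j] = w(a[:i], b[:j]) + w(a[i:], b[j:]) - w(a, b)."""
    n, m = len(a), len(b)
    pre = dist_matrix(a, b, mismatch, gap)
    # Suffix distances equal prefix distances of the reversed sequences.
    suf = dist_matrix(a[::-1], b[::-1], mismatch, gap)
    total = pre[n][m]
    return [[pre[i][j] + suf[n - i][m - j] - total for j in range(m + 1)]
            for i in range(n + 1)]

def best_cut(C, i):
    """For a fixed cut i in the first sequence, the cut j of lowest extra cost."""
    return min(range(len(C[i])), key=lambda j: C[i][j])
```

A cut pair with additional cost 0 lies on some optimal alignment path, which is why the divide step looks for the minimum of C in the row of the chosen cut position.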
Space & Time [Figure: full-sequence searching compared with a 'chain' of boxes along the diagonal, which restricts the search area in order to reduce search time]
DIALIGN [Figure: non-consistent diagonals (simultaneous and cross-over cases) compared with consistent diagonals between two short protein sequences; panel "GA Results" with the resulting alignment]
Weighting
Diagonal weights: w(D) = – log P(lD, SD), where lD is the length of diagonal D and SD is the sum of the similarity values along D.
Overlap weighting [Figure: diagonals D1 to D5 among the sequences YIAVLFAYDD, LACVIFGS, and SWDDVMFYAE]
w(D1) = 1.9, w(D2) = 1.7, w(D3) = 1.5, w(D4) = 2.6, w(D5) = 0.2
Diagonals D1, D4 and D5: Score = 1.9 + 2.6 + 0.2 = 4.7
Diagonals D1, D2, D3 and D5: Score = 1.9 + 1.7 + 1.5 + 0.2 = 5.3
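The two scores on this slide can be reproduced by summing the weights of each consistent set of diagonals. The sketch below searches all subsets exhaustively. The weights are the ones on the slide; the conflict pairs (D4 against D2 and D3) are an assumption inferred from the two sets shown, since the figure itself is not available.

```python
from itertools import combinations

# Weights as given on the slide.
WEIGHTS = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}
# ASSUMPTION: D4 cannot be combined with D2 or D3 (inferred, not stated).
CONFLICTS = {frozenset(("D2", "D4")), frozenset(("D3", "D4"))}

def is_consistent(chosen, conflicts=CONFLICTS):
    """A set is consistent if no pair of its diagonals is in conflict."""
    return not any(frozenset(p) in conflicts for p in combinations(chosen, 2))

def score(chosen, weights=WEIGHTS):
    """Score of a set of diagonals = sum of their weights."""
    return sum(weights[d] for d in chosen)

def best_consistent_set(weights=WEIGHTS, conflicts=CONFLICTS):
    """Exhaustive search; only feasible for a handful of diagonals."""
    names = sorted(weights)
    best, best_score = (), 0.0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if is_consistent(subset, conflicts):
                s = score(subset, weights)
                if s > best_score:
                    best, best_score = subset, s
    return best, best_score
```

Under the assumed conflicts, the search prefers D1, D2, D3, D5 over the set containing the single heaviest diagonal D4.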
Consistency check [Figure: overlap weights and fragment checking for fragments f1, f2, f3 across sequences S1 (positions 1-18), S2 (positions 1-18), and S3 (positions 1-16); transitivity frontier [1,9]]
Greedy Strategy [Figure: greedy approach on sequences S1, S2, S3 with motifs M1 (1), M1 (2), M2, M3; panels "Tandem duplications" and "Consistency conflicts"]
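A greedy strategy of this kind can be sketched as: sort the diagonals by weight, then accept each one only if it is consistent with those already accepted. The weights below are the ones from the Weighting slide; the conflict pairs are an assumption for illustration, and the function name is hypothetical.

```python
def greedy_select(weights, conflicts):
    """Accept diagonals in order of decreasing weight, skipping conflicts."""
    chosen = []
    for d in sorted(weights, key=weights.get, reverse=True):
        if all(frozenset((d, c)) not in conflicts for c in chosen):
            chosen.append(d)
    return chosen

weights = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}
# ASSUMPTION: D4 is inconsistent with D2 and with D3.
conflicts = {frozenset(("D2", "D4")), frozenset(("D3", "D4"))}

selected = greedy_select(weights, conflicts)
total = sum(weights[d] for d in selected)
```

With these inputs the greedy pass commits to D4 first and ends at a total of 4.7, below the 5.3 of the set D1, D2, D3, D5, which illustrates how an early high-weight choice can block a better combination.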
Composition Alignment [Figure: composition matches compared with single character matches over the alphabet A, C, G, T, -; CM of prefix length for Sequence #1 and Sequence #2, plotted against matching prefix length]
Match Length [Figure: the strings 111010001 and 001101110 are replaced by a composition match of length 7 (1110100 with 0011011) followed by 01 with 10; panel "Composition Matching" with match values by prefix length]
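Two equal-length strings form a composition match when they contain the same characters with the same multiplicities, regardless of order. A minimal sketch, using the two binary strings from this slide, checks a match and lists the prefix lengths at which the two strings are in composition match; the function names are hypothetical.

```python
from collections import Counter

def is_composition_match(x, y):
    """True if x and y have equal length and equal character counts."""
    return len(x) == len(y) and Counter(x) == Counter(y)

def composition_prefix_lengths(s1, s2):
    """Prefix lengths k for which s1[:k] and s2[:k] are a composition match."""
    diff = Counter()           # running count difference between the prefixes
    lengths = []
    for k, (a, b) in enumerate(zip(s1, s2), start=1):
        diff[a] += 1
        diff[b] -= 1
        if not any(diff.values()):   # all differences are zero
            lengths.append(k)
    return lengths

print(composition_prefix_lengths("111010001", "001101110"))
```

The running difference avoids recounting each prefix, so the scan is linear in the sequence length for a fixed alphabet.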
Composition Matching [Figure: CM of prefix length (total = 9) for Sequence #1 and Sequence #2, with example panels giving CM = 2, CM = -1, CM = 1, and CM = 0]
Meta-Code: code about code [Figure: diagram with the elements Input code, Mismatch code, Code Reservoir (code for testing), Original Code, Control Rule, and Meta-Code]
Reservoir Codes
[Figure: codes from S1 (e.g. 'A', 'G', 'AG') are stored in Reservoir S1 and codes from S2 (e.g. 'T', 'C', 'CT') are stored in Reservoir S2]
If both codes are found in Reservoir S1 and Reservoir S2, delete these two codes.
Reservoir code (e.g. AGRCT)
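One possible reading of the reservoir mechanism is sketched below: at each position the code from S1 goes into Reservoir S1 and the code from S2 into Reservoir S2, a code present in both reservoirs is cancelled, and the reservoir code is written as the contents of Reservoir S1, the letter R, and the contents of Reservoir S2. This interpretation, the function name, and the example sequences are assumptions; the slide does not spell out the procedure.

```python
def reservoir_codes(s1, s2):
    """Reservoir code after each position, under the assumed cancelling rule."""
    r1, r2 = [], []            # unmatched codes seen so far in S1 and in S2
    codes = []
    for c1, c2 in zip(s1, s2):
        if c1 in r2:
            r2.remove(c1)      # code found in both reservoirs: cancel it
        else:
            r1.append(c1)
        if c2 in r1:
            r1.remove(c2)      # same cancellation for the code from S2
        else:
            r2.append(c2)
        codes.append("".join(r1) + "R" + "".join(r2))
    return codes

print(reservoir_codes("AGCT", "CTAG"))
```

Under this reading, a bare `R` means both reservoirs are empty, so the prefixes read so far have the same composition.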
Meta-Code Rule
․If CM length is valid, reservoir code = r, Position = p (values of r and p).
․Looping for creating meta-code: copy the codes from S1 and S2, p = p – 1, output meta-code (e.g. AMT).
․If reservoir code = r, then stop the looping.
CM (Lengths & Codes): Composition Matching of S1 and S2 in prefix length [Figure: table of S1, S2, meta-code, and length over the alphabet T, A, C, G; example reservoir codes R, ART, AGRCT, GRC, GARTC, AGRTT, AGRCC, ARC; reservoir codes in S1: A, G; reservoir codes in S2: T, C]
CM of Metacode [Figure: composition matching of the reservoir codes ART, AGRCT, AGRTT, GARTC, ARC by prefix length, with one match marked "Invalid length"]
Composition MSA [Figure: composition matching of S1 and S2 over the alphabet T, C, G, A, followed by Meta-code MSA, in which meta-codes such as TMG, GMT, CMT, GMC, AMG, TMA form a new S2]
Fixed Segment
․Semi-global alignment ․Least overlap problem ․Simple segmentation ․Composition alignment ․Weekly behaviour
Code catalogue: A = Currency / Cards, B = Stock / Structured P., C = Unit Trusts / Bonds, D = Insurance / Finance, E = Mortgages / Loans
Segment Length LS = 5
[Figure: code sequences (e.g. A C B E D) for Branch bank #1, Branch bank #2, Branch bank #3, divided into Week #1, Week #2, … at time granularities 1t1 … 1t5, 2t1 … 2t5, 3t1, 3t2, …]
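Fixed segmentation with LS = 5 followed by a per-week composition comparison can be sketched as below. The two branch sequences are made-up examples over the code catalogue A to E, and the function names are hypothetical.

```python
from collections import Counter

def fixed_segments(seq, ls=5):
    """Split a code sequence into consecutive segments of length ls."""
    return [seq[i:i + ls] for i in range(0, len(seq), ls)]

def weekly_composition_matches(seq_a, seq_b, ls=5):
    """For each week, do the two branches use the same codes equally often?"""
    return [Counter(x) == Counter(y)
            for x, y in zip(fixed_segments(seq_a, ls), fixed_segments(seq_b, ls))]

# Hypothetical two-week code sequences for two branches.
branch_1 = "ACBEDABCDE"
branch_2 = "EDCBAAABCD"
matches = weekly_composition_matches(branch_1, branch_2)
```

Here the first week matches in composition even though the order of the codes differs, while the second week does not.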
Family Classifications [Figure: Fixed-Segment Composition MSA of Branch bank #1, #2, #3 over codes A, B, C, D, combining PSA, composition alignment, and meta-code, and grouping the branches into family groups]
Further Problems: Meta-Code Composition MSA ․Fixed-segment length ․Prior sequence choice ․Speed-up PSAs ․Nos. of Segments/Codes
Conclusions ․Fixed-segment Composition ․Meta-code Approach (Least Overlap Problems) ․Meta-code Approach (Easier Transform Applications) ․Widespread use of MSA (Simultaneous Multiple Sequences)