1
Multiple Sequence Composition Alignment
Name: Yip Chi Kin Date:
2
Studied Papers
[B03] Composition Alignment
[S98] Divide-and-conquer Alignment
[M99] DIALIGN Algorithm
[SMS03] DCA + Segment-based
3
Main Aspects
․Dynamic Programming
․Composition Alignment
․Meta-code MSA
․Simultaneous MSA
Pairwise Library (Global & Local) Consistency & Ungapped Divide-and-conquer Segment-based (Optimal scores)
4
Dynamic Programming
[Figure: Dot Matrix, Edit Graph and DP Matrix for sequences over C, T, G, A]
Edge scores in the edit graph: matches s(ai,bj), deletions -d, insertions -d
5
Needleman-Wunsch Algorithm
Global Alignment
Scoring: [Figure: DP matrix labelled - C T T C T and - G C A T; extracted cell values -2 -3 -4 -7 -10 -1 -5 -8 -6 -9]
GA Results: G C A T C - - C T T C T
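The global-alignment recurrence behind a scoring matrix like the one on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the slide's exact scoring: the match/mismatch values (+1/-1), the linear gap penalty d = 2, and the function name `needleman_wunsch` are all hypothetical choices made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, d=2):
    """Global alignment with a linear gap penalty d (assumed scoring values)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # match / mismatch
                          F[i - 1][j] - d,      # gap in b
                          F[i][j - 1] - d)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

Calling `needleman_wunsch("CTTCT", "GCAT")` returns the optimal score together with two gapped strings of equal length; the actual values depend on the assumed scoring parameters.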
6
Smith-Waterman Algorithm
Local Alignment
Scoring: [Figure: DP matrix labelled - T T T A C A G G C A G and - G A C T; extracted cell values 1 3 5 4 2 6]
GA Results:
- G A A C - G G T - -
T T T A C A G G C A G
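The local-alignment variant differs from the global one in two ways: cell scores are floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. A minimal Python sketch follows; the scoring values (+1/-1, gap penalty d = 2) and the function name `smith_waterman` are assumptions for illustration, not values taken from the slide.

```python
def smith_waterman(a, b, match=1, mismatch=-1, d=2):
    """Local alignment with a linear gap penalty d (assumed scoring values)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # match / mismatch
                          H[i - 1][j] - d,      # gap in b
                          H[i][j - 1] - d)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - d:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# The shared region GCT is the only stretch that scores above 2.
print(smith_waterman("AAAGCTAAA", "TTGCTTT"))  # (3, 'GCT', 'GCT')
```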
7
MSA Methods
․Consistency-based
․Exact method
․Progressive method
․Iterative method
․Stochastic method
․Hidden Markov method
8
Consistency-based method
MSA Concepts
PSAs: C - G T C - T G A C G - T A
Trace formulation: C T G T G A C T A C G
Latter formulation: C T G T G A C T G A C
9
MSA Results
Results of MSA
Aligned regions: C T G A G T C - A T C G A
Unrealized, Realized, Consistent
10
Divide-and-conquer
[Figure: each sequence S1, S2, S3 is divided at cut positions C1, C2, C3 into a Prefix and a Suffix; the Divide step is repeated, the resulting pieces are aligned optimally, and the partial alignments are concatenated]
11
DP Distance
Sequence: GTTCATGCCAGGTGTAAATC
[Figure: Prefix and Suffix of the sequence, with DP matrices for Wopt (prefix) and Wopt (suffix) labelled - C T A T A C and - G T A C]
C_{S1,S2}[C1,C2] = Wopt (prefix) + Wopt (suffix) – Wopt (total)
12
Additional-cost
[Figure: additional-cost matrix with sequence labels C T A T A C G T A C]
Cost of Diagonal:
C_{S1,S2}[1,1] = 0
C_{S1,S2}[2,2] = 0
C_{S1,S2}[3,3] = 0
C_{S1,S2}[4,4] = 0
C_{S1,S2}[5,4] = 0
C_{S1,S2}[6,5] = 0
C_{S1,S2}[2,2] = Wopt [CT,GT] + Wopt [ATAC,ATAC] – Wopt [CTATAC,GTATAC] = 3 – 3 = 0
C_{S1,S2}[4,3] = Wopt [CTAT,GTA] + Wopt [AC,TAC] – Wopt [CTATAC,GTATAC] = 4 – 3 = 1
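The additional cost of a cut pair, Wopt (prefix) + Wopt (suffix) – Wopt (total), can be computed directly from three pairwise DP runs. The sketch below assumes unit costs (mismatch = 1, gap = 1), so its numbers differ from the slide's, whose cost scheme is not stated; the names `w_opt` and `additional_cost` are hypothetical.

```python
def w_opt(a, b, mismatch=1, gap=1):
    """Optimal pairwise alignment cost (a distance), assuming unit costs."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[len(b)]


def additional_cost(s1, s2, c1, c2):
    """Extra cost incurred by forcing the alignment through cut (c1, c2)."""
    prefix = w_opt(s1[:c1], s2[:c2])
    suffix = w_opt(s1[c1:], s2[c2:])
    return prefix + suffix - w_opt(s1, s2)


s1, s2 = "CTATAC", "GTATAC"
print(additional_cost(s1, s2, 2, 2))  # 0: the cut lies on an optimal path
print(additional_cost(s1, s2, 4, 3))  # 2 under the assumed unit costs
```

A cut with additional cost 0 lies on some optimal alignment path, which is why low-cost cut positions are the natural places to divide the sequences.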
13
Space & Time
Full sequence searching versus a ‘chain’ of boxes along the diagonal, which reduces searching time
14
DIALIGN
[Figure: diagonals between the sequences I A F E D C G S P W T Y and I A V L F E D C G S P W T Y in three cases: Non-Consistent (Simultaneous), Non-Consistent (Cross over) and Consistent diagonals]
GA Results: I A V L F E D C G S P W T Y y I A - V L F E d c G s p w T
15
Weighting
Diagonal Weights: w(D) = – log P(lD, SD), where lD is the length of diagonal D and SD is the sum of similarity values of the same diagonal
Overlap weighting
Sequences: Y I A V L F A Y D D; L A C V I F G S; S W D D V M F Y A E
w(D1) = 1.9, w(D2) = 1.7, w(D3) = 1.5, w(D4) = 2.6, w(D5) = 0.2
Diagonals D1, D4 and D5: Score = 1.9 + 2.6 + 0.2 = 4.7
Diagonals D1, D2, D3 and D5: Score = 1.9 + 1.7 + 1.5 + 0.2 = 5.3
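The score of an alignment built from diagonals is the sum of the weights of the diagonals it contains, so choosing between candidate sets is a simple comparison. The short Python sketch below uses the weights and the two candidate sets from this slide; the function name `alignment_score` is hypothetical, and the consistency of each candidate set is taken as given rather than checked.

```python
# Diagonal weights as listed on the slide.
weights = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}


def alignment_score(diagonals, weights):
    """Sum of diagonal weights, rounded to the slide's one-decimal precision."""
    return round(sum(weights[d] for d in diagonals), 1)


# The two sets of diagonals compared on the slide.
candidates = [("D1", "D4", "D5"), ("D1", "D2", "D3", "D5")]
for diagonals in candidates:
    print(diagonals, alignment_score(diagonals, weights))

best = max(candidates, key=lambda ds: alignment_score(ds, weights))
print("best:", best)  # the set D1, D2, D3, D5 with score 5.3
```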
16
Consistency check
[Figure: Overlap weights and Fragments checking for fragments f1, f2, f3 on sequences S1, S2, S3; Transitivity frontier [1,9]]
17
Greedy Approach
Tandem duplications, Greedy Strategy, Consistency conflicts
[Figure: two panels showing the labels M1 (1), M1 (2), M2, M3 on sequences S1, S2, S3]
18
Composition Alignment
[Figure: Composition matches versus Single character match over the alphabet A C G T -; CM of Prefix Length for Sequence #1 and Sequence #2 against Matching Prefix length; extracted values 1 + – 2 -1 -2 -3]
19
Match Length
111010001 and 001101110 are replaced by a match of length 7: 1110100 | 01 and 0011011 | 10
Composition Matching
[Figure: composition matching against Prefix length 1, 4, 9, 15; extracted values 3 2 2 2 and –1 2 2 –2, replaced by 2 –3]
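A composition match pairs two equal-length substrings that contain the same characters with the same counts, in any order. The Python sketch below checks that property and splits two equal-length sequences at every prefix length where their compositions agree. It is a simplified illustration of the idea only, not the scored composition alignment of [B03]; the function names are hypothetical.

```python
from collections import Counter


def is_composition_match(u, v):
    """True if u and v have equal length and identical character counts."""
    return len(u) == len(v) and Counter(u) == Counter(v)


def composition_segments(s1, s2):
    """Split s1 and s2 wherever their prefixes have equal composition."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    segments, start = [], 0
    balance = Counter()  # running difference of character counts
    for pos, (a, b) in enumerate(zip(s1, s2), start=1):
        balance[a] += 1
        balance[b] -= 1
        if not any(balance.values()):  # all differences are zero
            segments.append((s1[start:pos], s2[start:pos]))
            start = pos
    return segments


# The binary example from the slide: a match of length 7, then one of length 2.
print(composition_segments("111010001", "001101110"))
# [('1110100', '0011011'), ('01', '10')]
```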
20
Composition Matching
[Figure: CM of Prefix Length (Total=9) for Sequence #1 and Sequence #2, in panels with CM = 2, CM = -1, CM = 1 and CM = 0]
21
Meta-Code
Code about code
[Diagram labels: Input code, Mismatch code, Code Reservoir, Code for Testing, Original Code, Control Rule, Meta-Code]
22
Reservoir Codes
Code ‘A’ in S1, Code ‘T’ in S2
Code from S1: store code in Reservoir S1
Code from S2: store code in Reservoir S2
If a code is found in both Reservoir S1 and Reservoir S2, delete these two codes
Reservoir Code (e.g. AGRCT)
[Diagram labels: Code ‘G’, Code ‘C’, Code ‘AG’, Code ‘CT’]
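One possible reading of the reservoir rule is sketched below in Python. It is an interpretation rather than the presenter's implementation: it assumes the two sequences are scanned position by position, that a code already waiting in the other reservoir cancels instead of being stored, and that the reservoir code is written as the contents of Reservoir S1, the letter R, then the contents of Reservoir S2. The function name `reservoir_codes` is hypothetical.

```python
def reservoir_codes(s1, s2):
    """Return the reservoir code after each position (assumed reading of the rule)."""
    res1, res2 = [], []  # unmatched codes seen so far in S1 and in S2
    codes = []
    for a, b in zip(s1, s2):
        # Code from S1: cancel against Reservoir S2 if possible, else store it.
        if a in res2:
            res2.remove(a)
        else:
            res1.append(a)
        # Code from S2: cancel against Reservoir S1 if possible, else store it.
        if b in res1:
            res1.remove(b)
        else:
            res2.append(b)
        codes.append("".join(res1) + "R" + "".join(res2))
    return codes


print(reservoir_codes("AG", "CT"))  # ['ARC', 'AGRCT']
print(reservoir_codes("AG", "GA"))  # ['ARG', 'R']
```

Under this reading, a bare `R` means both reservoirs are empty, which is exactly the point where the two prefixes have the same composition.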
23
Meta-Code Rule
Looping for creating meta-code:
If CM length is valid, reservoir code = r, position = p.
Copy the codes from S1 and S2, p = p – 1, output meta-code.
If reservoir code = r, then stop the looping.
Meta-code (e.g. AMT)
[Diagram labels: Value of r; Values of r and p; Codes from S1 and S2]
24
CM (Lengths & Codes)
Composition Matching of S1 and S2 in prefix length
[Table: rows S1, S2, Meta Code and Length over the codes T A C G; meta codes R, ART, AGRCT, GRC, GARTC, AGRTT, AGRCC, ARC; lengths 1, 2]
Reservoir codes in S1: A G
Reservoir codes in S2: T C
25
CM of Metacode
[Figure: Composition Matching of the meta-codes ART, AGRCT, GARTC, AGRTT and ARC against Prefix length 2, 6, 10, 12; one match is marked Invalid length; extracted values 2 4, 2 1 and 2 4 –1]
26
Composition MSA
[Figure: Composition matching and Meta-code MSA of S1 and S2 over the codes T C G A; meta-codes TMG, GMT, CMT, GMC, AMG, TMA; the result is labelled New S2]
27
Fixed Segment
Code catalogue:
A = Currency / Cards
B = Stock / Structured P.
C = Unit Trusts / Bonds
D = Insurance / Finance
E = Mortgages / Loans
․Semi-global alignment
․Least overlap problem
․Simple segmentation
․Composition alignment
․Weekly behaviour
Segment Length LS = 5
[Figure: Week #1, Week #2, … for Branch bank #1, Branch bank #2 and Branch bank #3, with codes A C B E D; Time Granularities 1t1 1t2 1t3 1t4 1t5 2t1 2t2 2t3 2t4 2t5 3t1 3t2]
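Fixed segmentation with LS = 5 cuts each branch's code sequence into weekly blocks, which can then be compared by composition. The Python sketch below illustrates that idea with made-up code sequences for two branches; the data and the function names `fixed_segments` and `matching_weeks` are hypothetical and not taken from the presentation.

```python
from collections import Counter


def fixed_segments(codes, ls=5):
    """Cut a code sequence into consecutive segments of length ls."""
    return [codes[i:i + ls] for i in range(0, len(codes), ls)]


def matching_weeks(branch_a, branch_b, ls=5):
    """For each week, report whether the two branches have the same composition."""
    weeks_a = fixed_segments(branch_a, ls)
    weeks_b = fixed_segments(branch_b, ls)
    return [Counter(wa) == Counter(wb) for wa, wb in zip(weeks_a, weeks_b)]


# Hypothetical two-week code sequences for two branch banks.
branch_1 = "ACBEDAACBE"
branch_2 = "DEBCAABCDE"
print(fixed_segments(branch_1))            # ['ACBED', 'AACBE']
print(matching_weeks(branch_1, branch_2))  # [True, False]
```

Because every segment has the same length, segments never overlap, which is one way to read the "least overlap problem" bullet.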
28
Family Classifications
[Figure: Meta-Code and PSA of Branch bank #1, Branch bank #2 and Branch bank #3 over codes C B A D; Composition alignment yields a Family Group, and Fixed-Segment Composition MSA yields a Family Group]
29
Meta-Code Composition MSA
Further Problems
․Fixed-segment length
․Prior sequence choice
․Speed-up PSAs
․Nos. of Segments/Codes
30
Conclusions
․Fixed-segment Composition (Least Overlap Problems)
․Meta-code Approach (Easier Transform Applications)
․Widespread use of MSA (Simultaneous Multiple Sequences)