CrossWA: A new approach of combining pairwise and three-sequence alignments to improve the accuracy for highly divergent sequence alignment
Che-Lun Hung, Chun-Yuan Lin, Yeh-Ching Chung, and Chuan Yi Tang
National Tsing Hua University, Taiwan
Sixth International Conference on Bioinformatics (InCoB 2007), HKUST, Hong Kong

Outline
Introduction
Motivation
Algorithm
Experiments
Conclusions
2016/6/1 SSLAB, Department of Computer Science, National Tsing Hua University

Introduction
Multiple sequence alignment (MSA)
  – NP-hard problem
The heuristic methods for MSA
  – Progressive methods
      ClustalW, T-Coffee, POA, etc.
  – Iterative methods
      Muscle, DIALIGN, etc.
  – Probabilistic methods
      Probcons, HMMT, Muscle, etc.
  – Anchor-based methods
      MAFFT, Align-m, etc.

Introduction (cont'd)
Pairwise alignment
  – Uses dynamic programming to find the optimal alignment. [Needleman, J. Mol. Biol 1970; Smith, J. Mol. Biol 1981]
Three-sequence alignment
  – More accurate than pairwise alignment. [Murata, PNAS 1985]
  – Linear gap penalty introduced. [Gotoh, J. Theor. Biol 1986]
  – Space reduced from O(N^3) to O(N^2) with affine gap penalty. [Huang, ACM 1993]
  – Useful for MSA. [Makoto, Bioinformatics 1993; CY Lin, CMCT 2006, ICPP 2007]

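The dynamic-programming idea behind pairwise alignment can be sketched as follows. This is a minimal Needleman-Wunsch-style global alignment with a linear gap penalty; the function name and the match, mismatch and gap values are illustrative assumptions, not parameters from the talk.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append('-')
            i -= 1
        else:
            row_a.append('-')
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], ''.join(reversed(row_a)), ''.join(reversed(row_b))
```

The table has (N+1) x (M+1) cells for two sequences; the three-sequence analogue fills a cube, which is where the O(N^3) cost mentioned above comes from.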
Introduction (cont'd)
Progressive multiple sequence alignment (progressive pairwise MSA)
  – Aligns pairs of sequences following the branching order of the guide tree until all sequences are aligned.
  – The resulting alignment is affected by the initial branching order.
  – Gap problems
      Once introduced, a gap is never removed.
      An insertion gap may be counted multiple times. [Loytynoja, PNAS 2005]

Introduction (cont'd)
Progressive triple MSA: aln3nn
  – Published in [Matthias, BMC Bioinformatics July, 2007].
  – Every alignment step is a three-sequence alignment.
  – The three-sequence alignment uses the same affine gap penalty as [Huang, ACM 1993].
  – Uses Huang's three-sequence alignment algorithm.

Motivation
CrossWA: combine three-sequence and pairwise alignments
  – Reduce the problems of progressive pairwise MSA
      Use three-sequence alignment to reduce the effect of the initial branching order.
  – Increase the accuracy of the alignment
      Three-sequence alignment may obtain more accurate alignments.
      Keep pairwise alignment, because three-sequence alignment is not always better than pairwise alignment.
      For pairwise alignment, a position-specific gap penalty is more accurate than an affine gap penalty. [Thompson, Bioinformatics 1995]
      Introduce a position-specific gap penalty into three-sequence alignment, which differs from the algorithm "aln3nn".
  – Avoid increasing the computing time

Motivation (cont'd)
Comparison of three protein sequences among different methods

Motivation (cont'd)
Three-sequence alignment vs. progressive pairwise MSA, with three sequences (430 test sets, randomly selected from BAliBASE 2.0 Ref1-5)
  – Three-sequence alignment with position-specific gap penalty and sequence weighting

Motivation (cont'd)
Progressive pairwise MSA (ClustalW) vs. progressive triple MSA (aln3nn): reference set 1, BAliBASE 2.0 [Matthias, BMC Bioinformatics 2007, 7]

General Process of Progressive Multiple Sequence Alignment
Step 1. Calculating the distance matrix
Step 2. Constructing the guide tree
Step 3. Alignment: aligning pairs of sequences or groups along the branching order
[Figure: unaligned sequences are turned into aligned sequences through Steps 1 to 3]

Algorithm
Process of CrossWA
  – Step 1. Construct the distance matrix.
  – Step 2. Build the guide tree (Neighbour-Joining).
      Sequence weights are calculated.
  – Step 3. Build a new guide tree modified from the guide tree.
      Branches are changed for three-sequence and pairwise alignments.
      Sequence weights are recalculated.
  – Step 4. Alignment.
      Pairwise alignment
      Three-sequence alignment
        – Compare with the alignment produced by progressive pairwise alignment of the same three sequences and select the better one.

Algorithm (cont'd)
Step 1. Calculating the distance matrix
Step 2. Constructing the guide tree
Step 3. Constructing a new tree modified from the guide tree in Step 2
Step 4. Alignment: aligning pairs or triples of sequences (or groups) along the branching order of the new tree
[Figure: unaligned sequences are turned into aligned sequences; at each triple, progressive pairwise MSA vs. three-sequence alignment]

Algorithm (cont'd)
The branch changing rule
[Figure: Type I, Type II, Type III]

Algorithm (cont'd)
The evaluation of three-sequence alignment
Progressive pairwise alignment of A, B, C:
  S' = Align(B, C)
  S'' = Align(A, S')
Three-sequence alignment of A, B, C:
  T' = Align(A, B, C)
1. If SP(S'') > SP(T') then keep S''
2. If SP(T') > SP(S'') then keep T'

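The selection rule above can be sketched as follows. The slide does not say how SP is computed at this step or what happens on a tie, so this sketch assumes a plain sum-of-pairs objective with made-up match, mismatch and gap values, and keeps the progressive pairwise result on a tie. The names sp_score and keep_better are hypothetical.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length gapped strings."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == '-' and y == '-':
                continue              # gap against gap scores nothing (assumption)
            if x == '-' or y == '-':
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def keep_better(progressive, triple):
    """Return (label, alignment) for whichever alignment has the higher SP score."""
    if sp_score(triple) > sp_score(progressive):
        return "three-sequence", triple
    return "progressive pairwise", progressive  # also taken on ties (assumption)


# S'' from Align(A, Align(B, C)) and T' from Align(A, B, C), both hand-written here.
s2 = ["ACGT", "A-GT", "AG-T"]
t1 = ["ACGT", "A-GT", "A-GT"]
label, chosen = keep_better(s2, t1)
```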
Algorithm (cont'd)
Modification of sequence weights
  – The sequence weight is calculated in the same way as in ClustalW.
  [Figure: weight of Hba_Human on the original tree and on the modified tree (nodes A, B, C, D), using the length between node A and node C; the numeric values are missing here]
The strategy of gap penalty
  – Introduce a position-specific gap penalty into three-sequence alignment (modified from ClustalW).

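A minimal sketch of tree-based sequence weighting, assuming the ClustalW scheme in which each branch length is shared equally among the sequences below that branch and a sequence's weight is the sum of its shares from leaf to root. The tree, its branch lengths and the function names are hypothetical, and any normalization of the weights is omitted.

```python
def leaves(node):
    """List the leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [name for child, _length in node for name in leaves(child)]


def tree_weights(node, inherited=0.0, weights=None):
    """Weight per leaf: sum over its path of branch_length / leaves_below_branch."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights[node] = inherited
        return weights
    for child, length in node:
        share = length / len(leaves(child))  # branch shared by all leaves below it
        tree_weights(child, inherited + share, weights)
    return weights


# Hypothetical rooted tree: an internal node is a list of (child, branch_length).
tree = [([("A", 0.2), ("B", 0.4)], 0.6), ("C", 1.0)]
w = tree_weights(tree)
```

Because the weights depend on branch lengths and on how many sequences share each branch, changing branches in the guide tree changes the weights, which is why they are recalculated after the tree is modified.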
Experiments
System environment
  – Linux (AMD Opteron G with 512MB of memory)
Data source
  – BAliBASE 2.0
      Reference sets 1-5. [T-Coffee, Muscle, Probcons, aln3nn, etc.]
      Reference sets 6-8 contain repeats, inversions and transmembrane helices, for which none of the tested algorithms is designed. [Muscle]

Experiments (cont'd)
Scoring functions
  – Sum-of-pairs (SP)
  – Total Column score (TC)
Proportion probability (%)
  – No. of best alignments of the method / No. of total test sets
Compared algorithms
  – CrossWA fast, CrossWA full, ClustalW 1.83, T-Coffee 5.05, Muscle 3.6.
  – CrossWA fast: uses only Type I of the branch changing rule.
  – CrossWA full: uses all types of the branch changing rule.

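The measures above can be sketched as follows, assuming the usual benchmark reading: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. This is a simplified illustration rather than the BAliBASE scoring program; all function names are hypothetical, and ties in proportion_best count for every tied method (an assumption).

```python
from itertools import combinations


def columns(alignment):
    """Each column as a tuple of residue indices, None where a row has a gap."""
    counters = [0] * len(alignment)
    cols = []
    for column in zip(*alignment):
        entry = []
        for s, ch in enumerate(column):
            if ch == '-':
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        cols.append(tuple(entry))
    return cols


def residue_pairs(cols):
    """Set of (seq_i, residue_i, seq_j, residue_j) aligned in the same column."""
    pairs = set()
    for col in cols:
        for (i, x), (j, y) in combinations(enumerate(col), 2):
            if x is not None and y is not None:
                pairs.add((i, x, j, y))
    return pairs


def sp_and_tc(test, reference):
    """Reference-based SP and TC scores, each in the range 0..1."""
    ref_cols, test_cols = columns(reference), columns(test)
    ref_pairs = residue_pairs(ref_cols)
    sp = len(ref_pairs & residue_pairs(test_cols)) / len(ref_pairs)
    test_set = set(test_cols)
    tc = sum(col in test_set for col in ref_cols) / len(ref_cols)
    return sp, tc


def proportion_best(scores):
    """scores: {method: [score per test set]} -> {method: % of sets where best}."""
    n = len(next(iter(scores.values())))
    best = [max(s[k] for s in scores.values()) for k in range(n)]
    return {m: 100.0 * sum(s[k] == best[k] for k in range(n)) / n
            for m, s in scores.items()}
```

For example, sp_and_tc(test, reference) would be called once per test set and method, and the resulting per-set scores passed to proportion_best.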
Experiments (cont'd)
The comparison of SP scores among different alignment methods
[Table: columns Ref1 (81) %, Ref2 (19) %, Ref3 (12) %, Ref4 (7) %, Ref5 (12) %; rows CrossWA fast, CrossWA full, ClustalW, T-Coffee, Muscle; cell values missing]

Experiments (cont'd)
The comparison of TC scores among different alignment methods
[Table: columns Ref1 (81) %, Ref2 (19) %, Ref3 (12) %, Ref4 (7) %, Ref5 (11) %; rows CrossWA fast, CrossWA full, ClustalW, T-Coffee, Muscle; cell values missing]

Experiments (cont'd)
The SP scores of each method for different average identities in the Reference 1 data set
[Table: columns <= 25%, 20% - 40%, > 35%; rows CrossWA fast, CrossWA full, ClustalW, T-Coffee, Muscle; cell values missing]

Experiments (cont'd)
The TC scores of each method for different average identities in the Reference 1 data set
[Table: columns <= 25%, 20% - 40%, > 35%; rows CrossWA fast, CrossWA full, ClustalW, T-Coffee, Muscle; cell values missing]

Experiments (cont'd)
The performance of CrossWA with 20 sequences

Experiments (cont'd)
The performance of CrossWA with 40 sequences

Experiments (cont'd)
Comparison of performance among different methods with 20 sequences

Experiments (cont'd)
Comparison of performance among different methods with 40 sequences

Conclusions
Three-sequence alignment can produce a better alignment than pairwise alignment, but not for all data sets.
Combining three-sequence alignment and pairwise alignment keeps the better alignment at each alignment step in progressive MSA.
The experimental results suggest that CrossWA can be another useful tool for aligning multiple sequences.
CrossWA can be used to align DNA sequences.
For aligning genome data, computing time is a problem. It can be addressed by parallel programming. [CY Lin, ICPP 2007]

Web service

Reference
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: [Needleman, J Mol Biol 1970]
Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: [Smith, J Mol Biol 1981]
Murata M, Richardson JS, Sussman JL: Simultaneous comparison of three protein sequences. Proc Natl Acad Sci U S A 1985, 82: [Murata, PNAS 1985]
Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J Theor Biol 1986, [Gotoh, J Theor Biol 1986]
Huang X: Alignment of three sequences in quadratic space. Applied Computing Review 1993, 1:7-11. [Huang, ACM 1993]
Makoto H, Maski H, Masato I, Tomoyuki T: MASCOT: multiple alignment system for protein sequences based on three-way dynamic programming. J Mol Biol 1993, 2: [Makoto, Bioinformatics 1993]

Reference (cont'd)
CY Lin, CT Huang, YC Chung, CY Tang: Parallel Three-sequence Alignment with Space-efficient. Proceedings of the 23rd Workshop on Combinatorial Mathematics and Computation Theory, Chang-Hua, Taiwan, April 2006, [CY Lin, CMCT 2006]
CY Lin, CT Huang, YC Chung, CY Tang: Efficient Parallel Algorithm for Optimal Three-Sequences Alignment. International Conference on Parallel Processing [CY Lin, ICPP 2007]
Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 2005, 102(30): [Loytynoja, PNAS 2005]
Matthias K, Peter FS: Progressive multiple sequence alignments from triplets. BMC Bioinformatics [Matthias, BMC Bioinformatics July, 2007]
Thompson JD: Introducing variable gap penalties to sequence alignment in linear space. Bioinformatics 1995, 11: [Thompson, Bioinformatics 1995]

Che-Lun Hung:
Chun-Yuan Lin:
Yeh-Ching Chung:
Chuan Yi Tang:
Thank you for your attention