1
CrossWA: A new approach of combining pairwise and three-sequence alignments to improve the accuracy for highly divergent sequence alignment
Che-Lun Hung, Chun-Yuan Lin, Yeh-Ching Chung, and Chuan Yi Tang
National Tsing Hua University, Taiwan
Sixth International Conference on Bioinformatics (InCoB2007), HKUST, Hong Kong
2
Outline
Introduction
Motivation
Algorithm
Experiments
Conclusions
2016/6/1 SSLAB, Department of Computer Science, National Tsing Hua University
3
Introduction
Multiple sequence alignment (MSA)
– NP-hard problem
The heuristic methods for MSA
– Progressive method
    ClustalW, T-Coffee, POA, etc.
– Iterative method
    Muscle, DIALIGN, etc.
– Probabilistic method
    Probcons, HMMT, Muscle, etc.
– Anchor-based method
    MAFFT, Align-m, etc.
4
Introduction (cont’)
Pairwise alignment
– Uses dynamic programming to find the optimal alignment. [Needleman, J Mol Biol 1970; Smith, J Mol Biol 1981]
Three-sequence alignment
– More accurate than pairwise alignment. [Murata, PNAS 1985]
– Linear gap penalty introduced. [Gotoh, J Theor Biol 1986]
– Space reduced from O(N^3) to O(N^2) with affine gap penalty. [Huang, ACM 1993]
– Useful for MSA. [Makoto, Bioinformatics 1993; CY Lin, CMCT 2006, ICPP 2007]
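The dynamic programming idea behind pairwise alignment can be sketched as follows. This is a minimal illustration of global alignment in the style of Needleman and Wunsch with a linear gap penalty; the function name and the scoring values (match, mismatch, gap) are assumptions chosen for the example and are not taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not the talk's parameters.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append('-')
            i -= 1
        else:
            out_a.append('-')
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has (N+1) x (M+1) cells, which is why the pairwise case needs quadratic time and space, while the direct three-sequence extension needs a cubic table.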
5
Introduction (cont’)
Progressive multiple sequence alignment (progressive pairwise MSA)
– Aligns pairs of sequences following the branching order of the guide tree until all sequences are aligned.
– The resulting alignment is affected by the initial branching order.
– Gap problems
    Gaps, once introduced, are never removed.
    An insertion gap may be counted multiple times. [Loytynoja, PNAS 2005]
6
Introduction (cont’)
Progressive triple MSA: aln3nn
– Published in [Matthias, BMC Bioinformatics July, 2007].
– Every alignment step is a three-sequence alignment.
– The three-sequence alignment uses the same affine gap penalty as [Huang, ACM 1993].
– Uses Huang’s three-sequence alignment algorithm.
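To make the notion of a three-sequence alignment step concrete, here is a minimal sketch of the direct three-dimensional dynamic program. It is not Huang's quadratic-space algorithm and not the aln3nn code: it uses a plain linear gap penalty, a full O(N^3) table, and sum-of-pairs column scoring, and all names and scoring values are assumptions made for illustration.

```python
from itertools import product


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one residue pair inside a column (illustrative values)."""
    if x == '-' and y == '-':
        return 0            # two gaps facing each other cost nothing
    if x == '-' or y == '-':
        return gap
    return match if x == y else mismatch


def column_score(x, y, z):
    """Sum-of-pairs score of one three-row column."""
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)


def three_way_score(a, b, c):
    """Optimal sum-of-pairs score for aligning a, b and c simultaneously."""
    na, nb, nc = len(a), len(b), len(c)
    neg_inf = float('-inf')
    best = [[[neg_inf] * (nc + 1) for _ in range(nb + 1)] for _ in range(na + 1)]
    best[0][0][0] = 0
    # Each column consumes a residue (1) or places a gap (0) in every sequence;
    # the all-gap column is excluded, leaving 7 possible moves.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i in range(na + 1):
        for j in range(nb + 1):
            for k in range(nc + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    x = a[pi] if di else '-'
                    y = b[pj] if dj else '-'
                    z = c[pk] if dk else '-'
                    cand = best[pi][pj][pk] + column_score(x, y, z)
                    if cand > best[i][j][k]:
                        best[i][j][k] = cand
    return best[na][nb][nc]


print(three_way_score("ACG", "ACG", "ACG"))
```

The cubic table in this sketch is the cost that the quadratic-space algorithm cited above reduces.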
7
Motivation
CrossWA: combine three-sequence and pairwise alignments
– Minimize the problems of progressive pairwise MSA
    Use three-sequence alignment to reduce the effect of the initial branching order.
– Increase the accuracy of alignment
    Three-sequence alignment may obtain more accurate alignments.
    Keep pairwise alignment, because three-sequence alignment is not always better than pairwise alignment.
    For pairwise alignment, a position-specific gap penalty is more accurate than an affine gap penalty. [Thompson, Bioinformatics 1995]
    Introduce a position-specific gap penalty into three-sequence alignment, unlike the algorithm “aln3nn”.
– Avoid increasing the computing time
8
Motivation (cont’)
[Figure: Comparison of three protein sequences among different methods]
9
Motivation (cont’)
Three-sequence alignment vs. progressive pairwise MSA on three sequences (430 test sets, randomly selected from BAliBASE 2.0 Ref1-5)
– Three-sequence alignment with position-specific gap penalty and sequence weighting
10
Motivation (cont’)
Progressive pairwise MSA (ClustalW) vs. progressive triple MSA (aln3nn): reference set 1, BAliBASE 2.0 [Matthias, BMC Bioinformatics 2007, 7]
11
General process of progressive multiple sequence alignment
Step 1. Calculating distance matrix
Step 2. Constructing guide tree
Step 3. Alignment: aligning pairs of sequences or groups along the branching order
Unaligned sequences → Aligned sequences
12
Algorithm
Process of CrossWA
– Step 1. Construct the distance matrix.
– Step 2. Build the guide tree by Neighbour-Joining.
    Sequence weights are calculated.
– Step 3. Build a new guide tree modified from the guide tree of Step 2.
    Branches are changed to allow three-sequence and pairwise alignments.
    Sequence weights are recalculated.
– Step 4. Alignment.
    Pairwise alignment
    Three-sequence alignment
    – Compare with the alignment produced by progressive pairwise alignment of the same three sequences and select the better one.
13
Algorithm (cont’)
Step 1. Calculating distance matrix
Step 2. Constructing guide tree
Step 3. Constructing new tree modified from the guide tree in Step 2
Step 4. Alignment: aligning pairs or triples of sequences (or groups) along the branching order of the new tree, comparing progressive pairwise MSA vs. three-sequence alignment
Unaligned sequences → Aligned sequences
14
Algorithm (cont’)
The branch changing rule
[Figure: branch changes of Type I, Type II and Type III]
15
Algorithm (cont’)
The evaluation of three-sequence alignment
Progressive pairwise alignment of A, B, C: S’ = Align(B, C), S’’ = Align(A, S’)
Three-sequence alignment of A, B, C: T’ = Align(A, B, C)
1. If SP(S’’) > SP(T’) then keep S’’
2. If SP(T’) > SP(S’’) then keep T’
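The selection rule above can be sketched as a comparison of sum-of-pairs scores. This is a simplified illustration: the scoring values are arbitrary, the score is unweighted (the talk uses ClustalW-style sequence weights), and keeping the progressive result on a tie is an assumption, since the slide does not say what happens when the two scores are equal. The function names are hypothetical.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Unweighted sum-of-pairs score of equal-length gapped rows."""
    total = 0
    for row_x, row_y in combinations(alignment, 2):
        for x, y in zip(row_x, row_y):
            if x == '-' and y == '-':
                continue                      # gap facing gap is ignored
            if x == '-' or y == '-':
                total += gap
            else:
                total += match if x == y else mismatch
    return total


def keep_better(progressive, three_way):
    """Return the alignment with the higher SP score.

    Assumption: on a tie the progressive pairwise result S'' is kept.
    """
    if sp_score(three_way) > sp_score(progressive):
        return three_way
    return progressive


# S'' from Align(A, Align(B, C)) and T' from Align(A, B, C), both made up here.
s2 = ["A-CG", "AC-G", "ACCG"]
t1 = ["A-CG", "A-CG", "ACCG"]
print(sp_score(s2), sp_score(t1), keep_better(s2, t1))
```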
16
Algorithm (cont’)
Modification of sequence weights
– The sequence weights are calculated in the same way as in ClustalW.
    Weight of Hba_Human = 0.055 + 0.219/2 + 0.061/4 + 0.015/5 + 0.062/6 = 0.194
    Length between node A and node C = 0.219 + 0.061 = 0.280
    Weight of Hba_Human = 0.055 + 0.280/2 + 0.077/5 = 0.210
    [Figure: guide trees with nodes A, B, C, D]
The strategy of gap penalty
– Introduce position-specific gap penalty into three-sequence alignment (modified from ClustalW).
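The weight arithmetic above follows the ClustalW scheme in which each branch on the path from a leaf to the root contributes its length divided by the number of sequences sharing that branch. The sketch below reproduces the two sums from the slide; the function name and the list-of-pairs representation are assumptions for illustration. With the rounded branch lengths shown, the first sum evaluates to about 0.193 rather than the 0.194 on the slide, which presumably reflects rounding of the branch lengths.

```python
def sequence_weight(path):
    """Sum of branch_length / number_of_sequences_sharing_the_branch.

    `path` lists (branch_length, shared_by) pairs from the leaf to the root.
    """
    return sum(length / shared for length, shared in path)


# Hba_Human in the original guide tree (values from the slide).
original = [(0.055, 1), (0.219, 2), (0.061, 4), (0.015, 5), (0.062, 6)]

# After the branch change: the A-C branches are merged (0.219 + 0.061 = 0.280)
# and the two upper branches are merged (0.015 + 0.062 = 0.077).
modified = [(0.055, 1), (0.219 + 0.061, 2), (0.077, 5)]

print(round(sequence_weight(original), 3))  # 0.193
print(round(sequence_weight(modified), 3))  # 0.21
```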
17
Experiments
System environment
– Linux (AMD Opteron 250, 2.4 GHz, with 512 MB of memory)
Data source
– BAliBASE 2.0
    Reference sets 1-5. [T-Coffee, Muscle, Probcons, aln3nn, etc.]
    Reference sets 6-8 contain repeats, inversions and transmembrane helices, for which none of the tested algorithms is designed. [Muscle]
18
Experiments (cont’)
Scoring functions
– Sum-of-pairs score (SP)
– Total column score (TC)
Proportion (%)
– Number of test sets on which the method gives the best alignment / total number of test sets
Comparing algorithms
– CrossWA fast, CrossWA full, ClustalW 1.83, T-Coffee 5.05, Muscle 3.6.
– CrossWA fast: uses only Type I of the branch changing rule.
– CrossWA full: uses all types of the branch changing rule.
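The three measures can be sketched as follows, assuming the usual benchmark definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. This is a simplified illustration, not the official BAliBASE scoring program (which, for example, may restrict scoring to core blocks); all function names are hypothetical, and counting ties as a win for every tied method is an assumption.

```python
from itertools import combinations


def columns(alignment):
    """Columns as tuples of residue indices per sequence (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        entry = []
        for s, ch in enumerate(col):
            if ch == '-':
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        cols.append(tuple(entry))
    return cols


def aligned_pairs(alignment):
    """Set of (seq_s, residue_i, seq_t, residue_j) placed in the same column."""
    pairs = set()
    for col in columns(alignment):
        for (s, i), (t, j) in combinations(enumerate(col), 2):
            if i is not None and j is not None:
                pairs.add((s, i, t, j))
    return pairs


def sp_accuracy(test, reference):
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref)


def tc_accuracy(test, reference):
    test_cols = set(columns(test))
    ref_cols = columns(reference)
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


def best_proportion(scores):
    """scores: {method: [score per test set]} -> {method: % of sets won}."""
    n_sets = len(next(iter(scores.values())))
    result = {}
    for method, values in scores.items():
        wins = sum(values[t] == max(v[t] for v in scores.values())
                   for t in range(n_sets))
        result[method] = 100 * wins / n_sets
    return result


reference = ["ACGT", "A-GT", "ACGT"]
test = ["ACGT", "AG-T", "ACGT"]
print(sp_accuracy(test, reference), tc_accuracy(test, reference))
```

Because ties count for every tied method in this sketch, the proportions of the methods on one reference set can add up to more than 100%.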
19
Experiments (cont’)
The comparison of SP scores among different alignment methods
Method        Ref1 (81)   %   Ref2 (19)   %   Ref3 (12)   %   Ref4 (7)    %   Ref5 (12)   %
CrossWA fast  0.774      22   0.872       5   0.669       8   0.657       0   0.741       0
CrossWA full  0.777      30   0.877      15   0.685      32   0.658       0   0.762      17
ClustalW      0.773      11   0.876      16   0.656       0   0.674       0   0.762       0
T-Coffee      0.787      34   0.884      37   0.692       8   0.718      57   0.825      50
Muscle        0.776      28   0.891      37   0.713      67   0.728      43   0.822      33
20
Experiments (cont’)
The comparison of TC scores among different alignment methods
Method        Ref1 (81)   %   Ref2 (19)   %   Ref3 (12)   %   Ref4 (7)    %   Ref5 (11)   %
CrossWA fast  0.665      25   0.498      16   0.368       8   0.333      29   0.529      18
CrossWA full  0.671      33   0.515      32   0.390      42   0.301      29   0.546      18
ClustalW      0.673      33   0.489      42   0.358      17   0.320      14   0.543       0
T-Coffee      0.676      30   0.434      21   0.323      17   0.409      29   0.625      45
Muscle        0.679      22   0.475      32   0.408      42   0.396      29   0.658      36
21
Experiments (cont’)
The SP scores of each method for different average identities in the Reference 1 data set
Method        <= 25%      %   20% - 40%   %   > 35%       %
CrossWA fast  0.491      35   0.825      19   0.941      18
CrossWA full  0.500      40   0.827      31   0.942      25
ClustalW      0.493      10   0.817      12   0.938      11
T-Coffee      0.477      32   0.840      35   0.954      57
Muscle        0.487      25   0.824      27   0.948      18
22
Experiments (cont’)
The TC scores of each method for different average identities in the Reference 1 data set
Method        <= 25%      %   20% - 40%   %   > 35%       %
CrossWA fast  0.208      28   0.714      20   0.891      21
CrossWA full  0.303      33   0.712      24   0.893      46
ClustalW      0.324      38   0.713      12   0.887      29
T-Coffee      0.274      33   0.744      36   0.905      33
Muscle        0.299      19   0.725      36   0.907      46
23
Experiments (cont’)
[Figure: The performance of CrossWA with 20 sequences]
24
Experiments (cont’)
[Figure: The performance of CrossWA with 40 sequences]
25
Experiments (cont’)
[Figure: Comparison of performance among different methods with 20 sequences]
26
Experiments (cont’)
[Figure: Comparison of performance among different methods with 40 sequences]
27
Conclusions
Three-sequence alignment can produce a better alignment than pairwise alignment, but not for all data sets.
Combining three-sequence and pairwise alignment keeps the better alignment at every alignment step of progressive MSA.
The experimental results suggest that CrossWA can be another useful tool for aligning multiple sequences.
CrossWA can also be used to align DNA sequences.
For genome data, computing time is a problem; it can be addressed by parallel programming. [CY Lin, ICPP 2007]
28
Web service
Http://140.114.91.10/Genome
29
Reference
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48:443-453. [Needleman, J Mol Biol 1970]
Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197. [Smith, J Mol Biol 1981]
Murata M, Richardson JS, Sussman JL: Simultaneous comparison of three protein sequences. Proc Natl Acad Sci U S A 1985, 82:3073-3077. [Murata, PNAS 1985]
Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J Theor Biol 1986, 327-337. [Gotoh, J Theor Biol 1986]
Huang X: Alignment of three sequences in quadratic space. Applied Computing Review 1993, 1:7-11. [Huang, ACM 1993]
Makoto H, Maski H, Masato I, Tomoyuki T: MASCOT: multiple alignment system for protein sequences based on three-way dynamic programming. J Mol Biol 1993, 2:161-167. [Makoto, Bioinformatics 1993]
30
Reference (cont’)
CY Lin, CT Huang, YC Chung, Chuan YT: Parallel Three-sequence Alignment with Space-efficient. Proceedings of the 23rd Workshop on Combinatorial Mathematics and Computation Theory, Chang-Hua, Taiwan, April 2006, 160-165. [CY Lin, CMCT 2006]
CY Lin, CT Huang, YC Chung, Chuan YT: Efficient Parallel Algorithm for Optimal Three-Sequences Alignment. International Conference on Parallel Processing 2007. [CY Lin, ICPP 2007]
Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 2005, 102(30):10557-10562. [Loytynoja, PNAS 2005]
Matthias K, Peter FS: Progressive multiple sequence alignments from triplets. BMC Bioinformatics 2007. [Matthias, BMC Bioinformatics July, 2007]
Thompson JD: Introducing variable gap penalties to sequence alignment in linear space. Bioinformatics 1995, 11:181-186. [Thompson, Bioinformatics 1995]
31
Che-Lun Hung: allen@sslab.cs.nthu.edu.tw
Chun-Yuan Lin: cylin@sslab.cs.nthu.edu.tw
Yeh-Ching Chung: ychung@cs.nthu.edu.tw
Chuan Yi Tang: cytang@cs.nthu.edu.tw
Thank you for your attention