
CrossWA: A new approach of combining pairwise and three-sequence alignments to improve the accuracy for highly divergent sequence alignment
Che-Lun Hung, Chun-Yuan Lin, Yeh-Ching Chung, and Chuan Yi Tang
National Tsing Hua University, Taiwan
Sixth International Conference on Bioinformatics (InCoB2007), HKUST, Hong Kong

Outline
- Introduction
- Motivation
- Algorithm
- Experiments
- Conclusions

2016/6/1, SSLAB, Department of Computer Science, National Tsing Hua University

Introduction
- Multiple sequence alignment (MSA)
  - NP-hard problem
- Heuristic methods for MSA
  - Progressive methods: ClustalW, T-Coffee, POA, etc.
  - Iterative methods: Muscle, DIALIGN, etc.
  - Probabilistic methods: Probcons, Hmmt, Muscle, etc.
  - Anchor-based methods: MAFFT, Align-m, etc.

Introduction (cont’)
- Pairwise alignment
  - Uses dynamic programming to find the optimal alignment. [Needleman, J Mol Biol 1970; Smith, J Mol Biol 1981]
- Three-sequence alignment
  - More accurate than pairwise alignment. [Murata, PNAS 1985]
  - Linear gap penalty introduced. [Gotoh, J Theor Biol 1986]
  - Space reduced from O(N^3) to O(N^2) with an affine gap penalty. [Huang, ACM 1993]
  - Useful for MSA. [Makoto, Bioinformatics 1993; CY Lin, CMCT 2006, ICPP 2007]
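The dynamic-programming idea behind pairwise alignment can be sketched as follows. This is a minimal illustration of Needleman-Wunsch global alignment, not code from the talk: the function name `needleman_wunsch`, the match/mismatch scores and the linear gap cost are all assumptions chosen for readability (real tools use substitution matrices and affine or position-specific gap penalties).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two strings with a linear gap cost (illustrative scores)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has (N+1) x (M+1) cells, which is why the pairwise case is quadratic in time and space; the three-sequence case adds a third index.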

Introduction (cont’)
- Progressive multiple sequence alignment (progressive pairwise MSA)
  - Aligns pairs of sequences following the branching order of the guide tree until all sequences are aligned.
  - The resulting alignment is affected by the initial branching order.
  - Problems with gaps:
    - Once introduced, a gap is never removed.
    - An insertion gap may be counted multiple times. [Loytynoja, PNAS 2005]

Introduction (cont’)
- Progressive triple MSA: aln3nn
  - Published in [Matthias, BMC Bioinformatics July, 2007].
  - Every alignment step is a three-sequence alignment.
  - The three-sequence alignment uses the same affine gap penalty as [Huang, ACM 1993].
  - Uses Huang's three-sequence alignment algorithm.
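To make the notion of a three-sequence alignment step concrete, here is a minimal sketch of simultaneous alignment of three sequences by dynamic programming over a three-dimensional table. It is deliberately simpler than the methods cited above: it uses a linear gap cost and cubic space, whereas Huang's algorithm uses affine gaps in quadratic space. The names `pair_score` and `align3` and the scoring values are assumptions for illustration only.

```python
from itertools import product


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols in a column (illustrative values)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def align3(a, b, c):
    """Sum-of-pairs optimal alignment of three strings, linear gap cost."""
    seqs = (a, b, c)
    # Seven possible moves: each sequence either consumes a residue (1) or gets a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    score = {(0, 0, 0): 0}
    back = {}
    ranges = [range(len(s) + 1) for s in seqs]
    for cell in product(*ranges):  # lexicographic order: predecessors come first
        if cell == (0, 0, 0):
            continue
        best, best_step = float("-inf"), None
        for move in moves:
            prev = tuple(p - s for p, s in zip(cell, move))
            if min(prev) < 0:
                continue
            col = tuple(s[p] if step else "-" for s, p, step in zip(seqs, prev, move))
            cand = (score[prev] + pair_score(col[0], col[1])
                    + pair_score(col[0], col[2]) + pair_score(col[1], col[2]))
            if cand > best:
                best, best_step = cand, (move, col)
        score[cell] = best
        back[cell] = best_step
    # Trace back to recover the aligned rows.
    cols = []
    pos = (len(a), len(b), len(c))
    while pos != (0, 0, 0):
        move, col = back[pos]
        cols.append(col)
        pos = tuple(p - s for p, s in zip(pos, move))
    cols.reverse()
    rows = tuple("".join(col[r] for col in cols) for r in range(3))
    return score[(len(a), len(b), len(c))], rows


print(align3("ACGT", "AGT", "ACGT"))
```

Because all three sequences are considered at once, the gap in the second sequence is placed where it agrees with both of the others, rather than being fixed by whichever pair happened to be aligned first.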

Motivation
- CrossWA: combine three-sequence and pairwise alignments
  - Reduce the problems of progressive pairwise MSA
    - Use three-sequence alignment to reduce the effect of the initial branching order.
  - Increase the accuracy of the alignment
    - Three-sequence alignment may produce more accurate alignments.
    - Keep pairwise alignment, because three-sequence alignment is not always better than pairwise alignment.
    - For pairwise alignment, a position-specific gap penalty is more accurate than an affine gap penalty. [Thompson, Bioinformatics 1995]
    - Introduce a position-specific gap penalty into three-sequence alignment, which differs from the algorithm "aln3nn".
  - Avoid increasing the computing time

Motivation (cont’)
[Figure: Comparison of three protein sequences among different methods]

Motivation (cont’)
- Three-sequence alignment vs. progressive pairwise MSA with three sequences (430 test sets, randomly selected from BAliBASE 2.0 Ref 1-5)
  - Three-sequence alignment with position-specific gap penalty and sequence weighting

Motivation (cont’)
- Progressive pairwise MSA (ClustalW) vs. progressive triple MSA (aln3nn) on reference set 1 of BAliBASE 2.0 [Matthias, BMC Bioinformatics 2007, 7]

General process of progressive multiple sequence alignment
- Step 1. Calculating the distance matrix
- Step 2. Constructing the guide tree
- Step 3. Alignment: aligning pairs of sequences or groups along the branching order
[Figure: unaligned sequences are turned into aligned sequences by Steps 1-3.]
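Step 1 can be sketched with a toy distance measure. The talk does not say how distances are computed, so the k-mer distance below is an assumption made purely for illustration, as are the names `kmer_distance`, `distance_matrix` and `closest_pair`. The closest pair is the one a guide-tree builder would typically join first.

```python
from collections import Counter
from itertools import combinations


def kmer_distance(a, b, k=2):
    """1 - (shared k-mers / k-mers in the shorter sequence); an illustrative distance."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())  # multiset intersection
    denom = min(sum(ca.values()), sum(cb.values()))
    return 1.0 - shared / denom if denom else 1.0


def distance_matrix(seqs, k=2):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = kmer_distance(seqs[i], seqs[j], k)
    return d


def closest_pair(d):
    """Indices of the two sequences with the smallest distance."""
    return min(combinations(range(len(d)), 2), key=lambda p: d[p[0]][p[1]])


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
d = distance_matrix(seqs)
print(closest_pair(d))
```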

Algorithm
- Process of CrossWA
  - Step 1. Construct the distance matrix.
  - Step 2. Build the guide tree with Neighbour-Joining.
    - Sequence weights are calculated.
  - Step 3. Build a new guide tree modified from the guide tree.
    - Branches are changed for three-sequence and pairwise alignments.
    - Sequence weights are recalculated.
  - Step 4. Alignment.
    - Pairwise alignment
    - Three-sequence alignment
      - Compare with the alignment produced by progressive pairwise alignment of the same three sequences and select the better one.

Algorithm (cont’)
- Step 1. Calculating the distance matrix
- Step 2. Constructing the guide tree
- Step 3. Constructing a new tree modified from the guide tree in Step 2
- Step 4. Alignment: aligning pairs or triples of sequences (or groups) along the branching order of the new tree; progressive pairwise MSA vs. three-sequence alignment
[Figure: unaligned sequences are turned into aligned sequences by Steps 1-4.]

Algorithm (cont’)
- The branch changing rule
[Figure: Type I, Type II, Type III]

Algorithm (cont’)
- The evaluation of three-sequence alignment
  - Progressive pairwise alignment of A, B, C: S’ = Align(B, C), S’’ = Align(A, S’)
  - Three-sequence alignment of A, B, C: T’ = Align(A, B, C)
  1. If SP(S’’) > SP(T’), keep S’’.
  2. If SP(T’) > SP(S’’), keep T’.
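The selection rule above can be sketched as a comparison of sum-of-pairs scores. The scoring values, the function names `sp_score` and `keep_better`, and the handling of ties (the slide does not cover equal scores; this sketch keeps the progressive result) are assumptions for illustration.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length gapped strings."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for col in zip(*rows):
        for x, y in combinations(col, 2):
            if x == "-" and y == "-":
                continue  # gap against gap scores nothing
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total


def keep_better(progressive, triple):
    """Keep T' if SP(T') > SP(S''), otherwise keep S'' (tie handling is assumed)."""
    return triple if sp_score(triple) > sp_score(progressive) else progressive


s2 = ("ACGT", "AGT-", "ACGT")  # hypothetical progressive pairwise result S''
t1 = ("ACGT", "A-GT", "ACGT")  # hypothetical three-sequence result T'
print(sp_score(s2), sp_score(t1), keep_better(s2, t1))
```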

Algorithm (cont’)
- Modification of sequence weights
  - The sequence weights are calculated in the same way as in ClustalW.
  [Figure: worked example on a guide tree with nodes A, B, C, D, showing the weight of Hba_Human, the length between node A and node C, and the resulting weight of Hba_Human; the numeric values are missing from this transcript.]
- The strategy of gap penalty
  - Introduce a position-specific gap penalty into three-sequence alignment (modified from ClustalW).
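ClustalW-style weighting gives each sequence the sum of the branch lengths on its path to the root, with every branch divided by the number of sequences that share it. A minimal sketch follows; the tree encoding (a leaf is a name, an internal node is a tuple of `(child, branch_length)` pairs), the function names and the toy branch lengths are assumptions, and the values are not those of the slide's Hba_Human example.

```python
def leaves(node):
    """Names of all sequences below a node."""
    if isinstance(node, str):
        return [node]
    return [name for child, _ in node for name in leaves(child)]


def sequence_weights(tree):
    """Weight = sum over the leaf-to-root path of branch_length / sequences sharing the branch."""
    weights = {}

    def walk(node, acc):
        if isinstance(node, str):
            weights[node] = acc
            return
        for child, length in node:
            share = length / len(leaves(child))  # branch split among the leaves below it
            walk(child, acc + share)

    walk(tree, 0.0)
    return weights


# Toy rooted tree with made-up branch lengths: ((A:0.1, B:0.2):0.4, C:0.3)
toy_tree = (((("A", 0.1), ("B", 0.2)), 0.4), ("C", 0.3))
print(sequence_weights(toy_tree))
```

In this toy tree, A and B share the 0.4 branch, so each receives 0.2 from it. Changing the branching, as CrossWA does in Step 3, changes which branches are shared, which is why the weights have to be recalculated.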

Experiments
- System environment
  - Linux (AMD Opteron G with 512MB of memory)
- Data source
  - BAliBASE 2.0
    - Reference sets 1-5. [T-Coffee, Muscle, Probcons, aln3nn, etc.]
    - Reference sets 6-8 contain repeats, inversions and transmembrane helices, for which none of the tested algorithms is designed. [Muscle]

Experiments (cont’)
- Scoring functions
  - Sum-of-pairs (SP)
  - Total column score (TC)
- Proportion (%)
  - Number of test sets on which the method gives the best alignment / total number of test sets
- Compared algorithms
  - CrossWA fast, CrossWA full, ClustalW 1.83, T-Coffee 5.05, Muscle 3.6.
  - CrossWA fast: uses only Type I of the branch changing rule.
  - CrossWA full: uses all types of the branch changing rule.
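The three measures can be sketched as follows, assuming the usual benchmark definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. This is a simplified illustration (for instance, it scores every column rather than only annotated core regions), and the function names and the choice to count ties as "best" for every tied method are assumptions.

```python
from itertools import combinations


def columns(rows):
    """Each column as a tuple of residue indices (None where a row has a gap)."""
    counters = [0] * len(rows)
    cols = []
    for col in zip(*rows):
        entry = []
        for s, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        cols.append(tuple(entry))
    return cols


def residue_pairs(rows):
    """Set of (seq s, residue i, seq t, residue j) pairs placed in the same column."""
    pairs = set()
    for col in columns(rows):
        for s, t in combinations(range(len(col)), 2):
            if col[s] is not None and col[t] is not None:
                pairs.add((s, col[s], t, col[t]))
    return pairs


def sp_accuracy(test, ref):
    ref_pairs = residue_pairs(ref)
    return len(residue_pairs(test) & ref_pairs) / len(ref_pairs)


def tc_accuracy(test, ref):
    test_cols = set(columns(test))
    ref_cols = columns(ref)
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)


def proportion_best(scores, method):
    """Percentage of test sets on which `method` reaches the top score (ties count)."""
    wins = sum(1 for s in scores if s[method] == max(s.values()))
    return 100.0 * wins / len(scores)


ref = ("ACGT", "A-GT", "ACGT")
test = ("ACGT", "AG-T", "ACGT")
print(sp_accuracy(test, ref), tc_accuracy(test, ref))
```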

Experiments (cont’)
- The comparison of SP scores among different alignment methods
[Table: columns Ref1 (81), Ref2 (19), Ref3 (12), Ref4 (7), Ref5 (12), each followed by a % column; rows CrossWA fast, CrossWA full, ClustalW, T-Coffee, Muscle. The cell values are missing from this transcript.]

Experiments (cont’)
- The comparison of TC scores among different alignment methods
[Table: columns Ref1 (81), Ref2 (19), Ref3 (12), Ref4 (7), Ref5 (11), each followed by a % column; rows CrossWA fast, CrossWA full, ClustalW, T-Coffee, Muscle. The cell values are missing from this transcript.]

Experiments (cont’)
- The SP scores of each method for different average identities in the Reference 1 data set
[Table: columns <= 25%, 20% - 40%, > 35%; rows CrossWA fast, CrossWA full, ClustalW, T-Coffee, Muscle. The cell values are missing from this transcript.]

Experiments (cont’)
- The TC scores of each method for different average identities in the Reference 1 data set
[Table: columns <= 25%, 20% - 40%, > 35%; rows CrossWA fast, CrossWA full, ClustalW, T-Coffee, Muscle. The cell values are missing from this transcript.]

Experiments (cont’)
[Figure: The performance of CrossWA with 20 sequences]

Experiments (cont’)
[Figure: The performance of CrossWA with 40 sequences]

Experiments (cont’)
[Figure: Comparison of performance among different methods with 20 sequences]

Experiments (cont’)
[Figure: Comparison of performance among different methods with 40 sequences]

Conclusions
- Three-sequence alignment can produce a better alignment than pairwise alignment, but not for all data sets.
- Combining three-sequence and pairwise alignment keeps the better alignment at every alignment step of progressive MSA.
- The experimental results suggest that CrossWA can be another useful tool for aligning multiple sequences.
- CrossWA can also be used to align DNA sequences.
- For genome data, computing time is a problem; it can be addressed by parallel programming. [CY Lin, ICPP 2007]

Web service

Reference
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: [Needleman, J Mol Biol 1970]
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: [Smith, J Mol Biol 1981]
- Murata M, Richardson JS, Sussman JL: Simultaneous comparison of three protein sequences. Proc Natl Acad Sci U S A 1985, 82: [Murata, PNAS 1985]
- Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J Theor Biol 1986, [Gotoh, J Theor Biol 1986]
- Huang X: Alignment of three sequences in quadratic space. Applied Computing Review 1993, 1:7-11. [Huang, ACM 1993]
- Makoto H, Maski H, Masato I, Tomoyuki T: MASCOT: multiple alignment system for protein sequences based on three-way dynamic programming. J Mol Biol 1993, 2: [Makoto, Bioinformatics 1993]

Reference (cont’)
- CY Lin, CT Huang, YC Chung, Chuan YT: Parallel Three-sequence Alignment with Space-efficient. Proceedings of the 23rd Workshop on Combinatorial Mathematics and Computation Theory, Chang-Hua, Taiwan, April 2006, [CY Lin, CMCT 2006]
- CY Lin, CT Huang, YC Chung, Chuan YT: Efficient Parallel Algorithm for Optimal Three-Sequences Alignment. International Conference on Parallel Processing [CY Lin, ICPP 2007]
- Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 2005, 102(30): [Loytynoja, PNAS 2005]
- Matthias K, Peter FS: Progressive multiple sequence alignments from triplets. BMC Bioinformatics [Matthias, BMC Bioinformatics July, 2007]
- Thompson JD: Introducing variable gap penalties to sequence alignment in linear space. Bioinformatics 1995, 11: [Thompson, Bioinformatics 1995]

Che-Lun Hung, Chun-Yuan Lin, Yeh-Ching Chung, Chuan Yi Tang
Thank you for your attention