A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.


A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou University December 2006

Biological Review • DNA can be represented as a sequence of four letters (A, C, G, T). • New biosequences are produced every day. • Sequence comparison: find similarities between a new sequence and sequences that are already known. • Longest Common Subsequence (LCS).

Longest Common Subsequence (LCS) • Find a longest subsequence that is common to two or more given strings; unlike a substring, a subsequence need not be contiguous. • Searching for the LCS of multiple biosequences is a fundamental task in bioinformatics. • LCS is a special case of global sequence alignment. • All algorithms for global sequence alignment can be used to solve LCS.

Longest Common Subsequence (LCS): algorithms and complexity • Dynamic programming. • Smith-Waterman algorithm: time complexity O(mn). • Myers and Miller: space complexity reduced to O(m+n). • Parallel algorithms. • CREW-PRAM model.

Longest Common Subsequence (LCS): algorithms and complexity for multiple sequences • What is the problem with those algorithms? • Smith-Waterman: the complexity becomes exponential in the number of sequences. • Carrillo and Lipman algorithm. • Divide-and-conquer alignment, DCA (Stoye). • Clustal-W (Feng and Doolittle's algorithm).

FAST_LCS • Seeks the successors of the initial identical character pairs, using successor tables. • Applies pruning operations. • Can be extended to find the LCS of multiple biosequences. • The authors also implemented it on a parallel computing model.

Initial identical character pair and its successor table • Let X and Y be two biosequences whose characters X_i and Y_j belong to {A, C, G, T}. • Define an array CH of four characters: CH(1) = A, CH(2) = C, CH(3) = G, CH(4) = T.

Initial identical character pair and its successor table • Build the successor table of identical characters for each of the two strings; call the tables TX for sequence X and TY for sequence Y. The tables are defined as follows: TX(i,j) = min { k | k ∈ SX(i,j) } if SX(i,j) ≠ ∅, and TX(i,j) = - otherwise, where SX(i,j) = { k | X_k = CH(i), k > j }, i = 1,2,3,4 and j = 0,1,…,n. For example, i = 1 refers to the character A. TY is defined in the same way from Y.

Example Let X = "TGCATA" and Y = "ATCTGAT". If X_i = Y_j = CH(k), the pair is called an identical character pair of CH(k) and is denoted (i,j). Let (i,j) and (k,l) be two identical character pairs of X and Y. If i < k and j < l, then (i,j) is called a "predecessor" of (k,l), and (k,l) a "successor" of (i,j); this is written (i,j) < (k,l).

Successor table for the example Recall the sequence X = "TGCATA". SX(1,0) = { k | X_k = CH(1) = A, k > 0 } = {4,6}, so TX(1,0) = min {4,6} = 4. SX(1,4) = { k | X_k = CH(1) = A, k > 4 } = {6}, so TX(1,4) = 6.

Example The final table entries for the character A (i = 1) of X = "TGCATA" are: TX(1,0) = 4, TX(1,1) = 4, TX(1,2) = 4, TX(1,3) = 4, TX(1,4) = 6, TX(1,5) = 6, TX(1,6) = -.
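The table construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `successor_table` is hypothetical, the character index i is stored 0-based (row 0 is A), positions j and k stay 1-based as on the slides, and `None` stands for the "-" entry.

```python
CH = "ACGT"  # CH(1)..CH(4) of the slides, stored 0-based here (assumption)

def successor_table(seq):
    """Return T with T[i][j] = smallest 1-based k > j such that seq[k] == CH[i],
    or None where the slides write '-'."""
    n = len(seq)
    table = [[None] * (n + 1) for _ in CH]
    for i, ch in enumerate(CH):
        nearest = None                 # nearest occurrence of ch to the right of j
        for j in range(n, -1, -1):     # fill columns j = n, n-1, ..., 0
            table[i][j] = nearest
            if j >= 1 and seq[j - 1] == ch:
                nearest = j            # position j is the successor seen from j-1
    return table

TX = successor_table("TGCATA")
print(TX[0])  # row for A: [4, 4, 4, 4, 6, 6, None]
```

Scanning each row from right to left fills the whole table in time proportional to 4(n+1), instead of searching forward from every column.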

Some definitions • If X_i = Y_j = CH(k), the pair is called an "identical pair of CH(k)" and is denoted (i,j). • The set of all identical pairs of X and Y is denoted S(X,Y). • If an identical pair (i,j) ∈ S(X,Y) has no (k,l) ∈ S(X,Y) such that (k,l) < (i,j), then (i,j) is called an "initial identical pair".

Some definitions The level of each pair is defined as follows:
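The formula itself did not survive in this transcript. One definition that is consistent with theorem T1 (the LCS length equals the maximum level) is sketched below; it is a reconstruction offered as an assumption, not the text of the slide.

```latex
\[
\mathrm{level}(i,j) =
\begin{cases}
1 & \text{if } (i,j) \text{ is an initial identical pair},\\[4pt]
\max\{\, \mathrm{level}(k,l) + 1 \;\mid\; (k,l) \in S(X,Y),\ (k,l) < (i,j) \,\} & \text{otherwise.}
\end{cases}
\]
```

Under this reading, level(i,j) is the length of the longest chain of identical pairs that ends at (i,j), which is why its maximum over S(X,Y) gives |LCS(X,Y)|.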

Theorems T1: If the length of the LCS of X and Y is denoted |LCS(X,Y)|, then |LCS(X,Y)| = max { level(i,j) | (i,j) ∈ S(X,Y) }. T2: For an identical character pair (i,j) ∈ S(X,Y), all its direct successors are produced by looking up the successor tables: (i,j) → { (TX(k,i), TY(k,j)) | k = 1,2,3,4 }.

Back to the Example The successors of the pair (2,5) are: (4,6), (3,-), (-,-), (5,7)

Example Discarding the pairs that contain "-", the successors of the pair (2,5) are: (4,6), (5,7).
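The lookup behind this example can be reproduced with a short Python sketch. It assumes the direct successors of (i,j) are the four pairs (TX(k,i), TY(k,j)), which is what the example values imply; the helper names are hypothetical, the character index is 0-based, and `None` stands for "-".

```python
CH = "ACGT"  # 0-based character index (assumption); positions stay 1-based

def successor_table(seq):
    """T[i][j] = smallest 1-based k > j with seq[k] == CH[i], else None."""
    n = len(seq)
    table = [[None] * (n + 1) for _ in CH]
    for i, ch in enumerate(CH):
        nearest = None
        for j in range(n, -1, -1):
            table[i][j] = nearest
            if j >= 1 and seq[j - 1] == ch:
                nearest = j
    return table

def direct_successors(tx, ty, i, j):
    """Candidate pairs (TX(k,i), TY(k,j)) for k = A, C, G, T."""
    return [(tx[k][i], ty[k][j]) for k in range(len(CH))]

def drop_incomplete(pairs):
    """Keep only pairs in which both coordinates exist."""
    return [p for p in pairs if None not in p]

TX = successor_table("TGCATA")
TY = successor_table("ATCTGAT")
candidates = direct_successors(TX, TY, 2, 5)
print(candidates)                   # [(4, 6), (3, None), (None, None), (5, 7)]
print(drop_incomplete(candidates))  # [(4, 6), (5, 7)]
```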

Pruning operations Operation 1 • If, on the same level, there are two identical character pairs (i,j) and (k,l) with (k,l) > (i,j), then (k,l) can be pruned without affecting the correctness of the algorithm. • This operation removes all the redundant identical pairs.

Back to the Example The successors of the pair (2,5) were (4,6) and (5,7). They are on the same level and (5,7) > (4,6), so (5,7) is pruned. The only remaining successor of the pair (2,5) is (4,6).

Pruning operations Operation 2 • If, on the same level, there are two identical character pairs (i_1,j) and (i_2,j) with i_1 < i_2, then (i_2,j) can be pruned without affecting the correctness of the algorithm.

Pruning operations Operation 3 • If, on the same level, there are several identical character pairs (i_1,j), (i_2,j), …, (i_r,j) with i_1 < i_2 < … < i_r, then (i_2,j), …, (i_r,j) can all be pruned.
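The three pruning operations can be combined into one filter over the pairs of a level. The sketch below is a simple quadratic illustration of the rules as stated on the slides; the function name `prune` is hypothetical and the authors' implementation may organise this step differently.

```python
def prune(pairs):
    """Remove pairs made redundant by operations 1 to 3.

    A pair q is dropped when some other pair p on the same level satisfies
      - operation 1: p < q, i.e. p[0] < q[0] and p[1] < q[1], or
      - operations 2 and 3: same second coordinate and p[0] < q[0].
    """
    level = set(pairs)
    kept = []
    for q in level:
        redundant = any(
            (p[0] < q[0] and p[1] < q[1]) or (p[1] == q[1] and p[0] < q[0])
            for p in level
        )
        if not redundant:
            kept.append(q)
    return sorted(kept)

print(prune([(4, 6), (5, 7)]))          # [(4, 6)]
print(prune([(2, 3), (4, 3), (6, 3)]))  # [(2, 3)]
```

Pruning is safe because every pair that can follow a dropped pair can also follow the pair that caused it to be dropped.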

FAST_LCS complexity • The authors claim that the time complexity of the algorithm is O(L), where L is the number of identical character pairs of X and Y. • When the algorithm is implemented on a parallel computing model, the claimed time complexity is O(|LCS(X,Y)|).
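Putting the pieces together, a sequential level-by-level search can be sketched as follows. This is an illustrative reading of the slides, not the authors' implementation: all function names are hypothetical, the search starts from the virtual pair (0,0), and the number of non-empty levels is returned as the LCS length, in line with theorem T1.

```python
CH = "ACGT"

def successor_table(seq):
    """T[i][j] = smallest 1-based k > j with seq[k] == CH[i], else None."""
    n = len(seq)
    table = [[None] * (n + 1) for _ in CH]
    for i, ch in enumerate(CH):
        nearest = None
        for j in range(n, -1, -1):
            table[i][j] = nearest
            if j >= 1 and seq[j - 1] == ch:
                nearest = j
    return table

def prune(pairs):
    """Drop pairs made redundant by pruning operations 1 to 3."""
    level = set(pairs)
    return {
        q for q in level
        if not any(
            (p[0] < q[0] and p[1] < q[1]) or (p[1] == q[1] and p[0] < q[0])
            for p in level
        )
    }

def fast_lcs_length(x, y):
    """Count the levels reachable from the virtual start pair (0, 0)."""
    tx, ty = successor_table(x), successor_table(y)
    frontier = {(0, 0)}
    length = 0
    while True:
        candidates = set()
        for i, j in frontier:
            for k in range(len(CH)):
                a, b = tx[k][i], ty[k][j]
                if a is not None and b is not None:
                    candidates.add((a, b))
        frontier = prune(candidates)
        if not frontier:
            return length
        length += 1

print(fast_lcs_length("TGCATA", "ATCTGAT"))  # 4, e.g. "TGAT" or "TCAT"
```

Each pass of the loop handles one level, so the pairs of a level are independent of one another, which is the property the parallel version relies on.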

FAST_LCS and multiple sequences • FAST_LCS can be easily extended to the LCS problem for multiple sequences. • From the biological point of view, it is more important to find the LCS of multiple sequences.

FAST_LCS and multiple sequences • Suppose there are n sequences X_1, X_2, …, X_n, where X_i = (x_i1, x_i2, …, x_i,ni), n_i is the length of X_i, and x_ij ∈ {A, C, G, T} for j = 1,2,…,n_i. • The successor tables are denoted TX_1, TX_2, …, TX_n, where TX_s is a two-dimensional array for the sequence X_s = (x_s1, x_s2, …, x_s,ns), s = 1,2,…,n.

FAST_LCS and multiple sequences Following the same procedure used for two sequences, start by building the successor tables of all the sequences, as follows:

FAST_LCS and multiple sequences • Identical character tuple for the LCS of multiple sequences. • The level of each tuple is defined as follows:

FAST_LCS and multiple sequences • For an identical character tuple, the direct successors are produced as follows: • The authors state two more theorems, which are essentially the extensions of T1 and T2 to multiple sequences.

Example • Let n = 3, X_1 = "TGCATA", X_2 = "ATCTGAT" and X_3 = "CTGATTC". • The successor tables are:

Example The direct successors of the identical character triple (1,2,2) can be obtained by:

Example • The successors are: (4,6,4), (3,3,7), (2,5,3), (5,4,5). • Then, following the algorithm and applying the pruning operations, (4,6,4) is removed because (4,6,4) > (2,5,3), which leaves (3,3,7), (2,5,3), (5,4,5), etc.
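The triple example can be checked with a short Python sketch that generalises the table lookup to tuples. It assumes the direct successors of a tuple are obtained by reading one entry per sequence from its own successor table, and it prunes only by strict dominance in every coordinate (the analogue of operation 1); helper names are hypothetical. With these assumptions, the successor computed for the character T is (5,4,5).

```python
CH = "ACGT"

def successor_table(seq):
    """T[i][j] = smallest 1-based k > j with seq[k] == CH[i], else None."""
    n = len(seq)
    table = [[None] * (n + 1) for _ in CH]
    for i, ch in enumerate(CH):
        nearest = None
        for j in range(n, -1, -1):
            table[i][j] = nearest
            if j >= 1 and seq[j - 1] == ch:
                nearest = j
    return table

def tuple_successors(tables, point):
    """One candidate tuple per character; tuples containing '-' are dropped."""
    out = []
    for k in range(len(CH)):
        nxt = tuple(t[k][p] for t, p in zip(tables, point))
        if None not in nxt:
            out.append(nxt)
    return out

def prune_tuples(points):
    """Drop q when some p is strictly smaller in every coordinate."""
    pts = set(points)
    return sorted(
        q for q in pts
        if not any(all(a < b for a, b in zip(p, q)) for p in pts)
    )

tables = [successor_table(s) for s in ("TGCATA", "ATCTGAT", "CTGATTC")]
succ = tuple_successors(tables, (1, 2, 2))
print(succ)                # [(4, 6, 4), (3, 3, 7), (2, 5, 3), (5, 4, 5)]
print(prune_tuples(succ))  # [(2, 5, 3), (3, 3, 7), (5, 4, 5)]
```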

FAST_LCS for multiple sequences: complexity • The time complexity of most algorithms for the multiple-sequence LCS depends on the number of sequences. • They are not practical when the number of sequences is large.

FAST_LCS for multiple sequences: complexity • When the FAST_LCS algorithm is implemented on a parallel computing model, the claimed time complexity is O(|LCS(X_1, X_2, …, X_n)|). • Its complexity is "independent of the number of sequences n".
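The structure behind the parallel claim, one parallel step per level, can be illustrated with Python's standard `concurrent.futures` module. This is only a sketch of the idea: the slides do not describe the parallel implementation in detail, the function names are hypothetical, pruning uses strict dominance only, and Python threads show the shape of the computation rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

CH = "ACGT"

def successor_table(seq):
    """T[i][j] = smallest 1-based k > j with seq[k] == CH[i], else None."""
    n = len(seq)
    table = [[None] * (n + 1) for _ in CH]
    for i, ch in enumerate(CH):
        nearest = None
        for j in range(n, -1, -1):
            table[i][j] = nearest
            if j >= 1 and seq[j - 1] == ch:
                nearest = j
    return table

def tuple_successors(tables, point):
    """Direct successors of one tuple, skipping tuples that contain '-'."""
    out = []
    for k in range(len(CH)):
        nxt = tuple(t[k][p] for t, p in zip(tables, point))
        if None not in nxt:
            out.append(nxt)
    return out

def prune_tuples(points):
    """Drop q when some p is strictly smaller in every coordinate."""
    pts = set(points)
    return {q for q in pts
            if not any(all(a < b for a, b in zip(p, q)) for p in pts)}

def mlcs_length(seqs, workers=4):
    """LCS length of several sequences; each level is expanded concurrently."""
    tables = [successor_table(s) for s in seqs]
    frontier = {tuple(0 for _ in seqs)}   # virtual start tuple (0, ..., 0)
    length = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            groups = pool.map(lambda p: tuple_successors(tables, p), frontier)
            frontier = prune_tuples(pt for g in groups for pt in g)
            if not frontier:
                return length
            length += 1

print(mlcs_length(["TGCATA", "ATCTGAT", "CTGATTC"]))  # 4, e.g. "TGAT"
```

The loop runs once per level, so with enough workers the number of sequential steps is tied to the LCS length rather than to the number of sequences.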

Sequential computation on two sequences

Sequential computation on multiple sequences

Computation using parallel computing

Conclusion • For computation on two sequences, the precision of FAST_LCS is higher than that of FASTA, and FAST_LCS is faster than the Smith-Waterman algorithm. • FAST_LCS is faster than other algorithms that compute the LCS of multiple sequences, such as CLUSTAL-W. • FAST_LCS can be implemented using parallel computation.