Sparse Normalized Local Alignment
Nadav Efraty and Gad M. Landau


Goal
To find the substrings X' and Y' whose normalized alignment value LCS(X',Y')/(|X'|+|Y'|) is the highest, or exceeds a predefined similarity level.

Outline
- Introduction
- The O(rL log log n) normalized local LCS algorithm
- The O(rM log log n) normalized local LCS algorithm
- Conclusions and open problems

Introduction

Background - Global similarity
LCS: compute a dynamic programming table of size (n+1) x (m+1):
T(i,0) = T(0,j) = 0 for all i, j (1 ≤ i ≤ m; 1 ≤ j ≤ n)
If X_j = Y_i then T(i,j) = T(i-1,j-1) + 1;
otherwise T(i,j) = max{T(i-1,j), T(i,j-1)}
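
A minimal sketch of this recurrence in Python (the function name is mine; following the slide, X is indexed by column j and Y by row i):

```python
def lcs_length(X, Y):
    """Length of the longest common subsequence via the classic DP table.

    T[i][j] holds LCS(Y[:i], X[:j]); row 0 and column 0 form the zero boundary.
    """
    m, n = len(Y), len(X)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[j - 1] == Y[i - 1]:
                # Match: extend the best chain ending at (i-1, j-1).
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                # Mismatch: inherit the best of the neighboring prefixes.
                T[i][j] = max(T[i - 1][j], T[i][j - 1])
    return T[m][n]
```

Running it on the classic example, lcs_length("ABCBDAB", "BDCABA") returns 4 (one LCS is "BCBA").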

Background - Global similarity
[Figure: the naive LCS algorithm. The DP table for X = CABADEB and Y = CBADACBD, filled with T(i,j) = T(i-1,j-1)+1 when X_j = Y_i, and T(i,j) = max{T(i,j-1), T(i-1,j)} otherwise.]

[Figure: the typical staircase shape of the layers in the matrix, shown for X = CABADEB, Y = CBADACBD.]

Background - Global similarity
Edit distance measures the minimal number of operations required to transform one string into another. Operations: substitution, deletion, insertion.

Background - Local similarity
The Smith-Waterman algorithm (1981):
T(i,0) = T(0,j) = 0 for all i, j (1 ≤ i ≤ m; 1 ≤ j ≤ n)
T(i,j) = max{T(i-1,j-1) + S(Y_i, X_j), T(i-1,j) + D(Y_i), T(i,j-1) + I(X_j), 0}
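
A sketch of the recurrence in Python. The slide leaves the scoring functions S, D, I abstract; the constant scores below (match +1, mismatch and gap -1) are illustrative defaults, not part of the original:

```python
def smith_waterman(X, Y, match=1, mismatch=-1, gap=-1):
    """Best local alignment score under simple (illustrative) scoring."""
    m, n = len(Y), len(X)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if X[j - 1] == Y[i - 1] else mismatch
            # The 0 option lets a local alignment restart anywhere.
            T[i][j] = max(T[i - 1][j - 1] + s,
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap,
                          0)
            best = max(best, T[i][j])
    return best
```

For example, smith_waterman("ABCD", "XBCY") returns 2, from the exact local match "BC".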

The weaknesses of the Smith-Waterman algorithm:
- Mosaic effect: inability to discard poorly conserved intermediate segments.
- Shadow effect: short but biologically more important alignments may not be detected because they are overlapped by longer (and less important) alignments.
- The sparsity of the essential data is not exploited.

The solution: normalization
The statistical significance of a local alignment depends on both its score and its length. Instead of searching for an alignment that maximizes the score S(X,Y), search for the alignment that maximizes S(X,Y)/(|X|+|Y|).

Arslan, Egecioglu, and Pevzner (2001) use a mathematical technique that converges to the optimal normalized alignment value through iterations of the Smith-Waterman algorithm. They maximize SCORE(X',Y')/(|X'|+|Y'|+L), where L is a constant that controls the amount of normalization. Running time: O(n^2 log n).

Our approach
The degree of similarity is defined as LCS(X',Y')/(|X'|+|Y'|).
M: a minimal length constraint.
Optionally, a predefined similarity level.
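
For small inputs, this objective can be checked against a brute-force reference that tries every substring pair (purely illustrative and far slower than the paper's algorithms; all names are mine):

```python
def best_normalized_lcs(X, Y, M=1):
    """Brute-force maximum of LCS(X', Y') / (|X'| + |Y'|) over all
    substring pairs (X', Y') with at least M matches."""
    def lcs(a, b):
        T = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    T[i][j] = T[i - 1][j - 1] + 1
                else:
                    T[i][j] = max(T[i - 1][j], T[i][j - 1])
        return T[-1][-1]

    best = 0.0
    for a in range(len(X)):
        for b in range(a + 1, len(X) + 1):
            for c in range(len(Y)):
                for d in range(c + 1, len(Y) + 1):
                    k = lcs(X[a:b], Y[c:d])
                    if k >= M:  # enforce the minimal length constraint M
                        best = max(best, k / ((b - a) + (d - c)))
    return best
```

Two identical substrings always yield the maximal possible value 1/2, e.g. best_normalized_lcs("AB", "AB", M=2) returns 0.5.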

The O(rL log log n) normalized local LCS algorithm

Definitions
A chain is a sequence of matches that is strictly increasing in both components.
The length of a chain from match (i,j) to match (i',j') is i'-i+j'-j, that is, the total length of the substrings that create the chain.
A k-chain(i,j) is the shortest chain of k matches starting from (i,j).
The normalized value of k-chain(i,j) is k divided by its length.

The algorithm
For each match (a,b), construct k-chain(a,b) for 1 ≤ k ≤ L (L = LCS(X,Y)).
Examine all the k-chains with k ≥ M starting from each match, and report either:
- the k-chains with the highest normalized value, or
- the k-chains whose normalized value exceeds a predefined threshold.

Problem: k-chain (a,b) is not the prefix of (k+1)-chain (a,b).

Solution: to construct (k+1)-chain(a,b), concatenate (a,b) to a k-chain(i',j') that starts below and to the right of (a,b).

Question: how can we find the proper match (i',j'), the head of the k-chain that should be concatenated to (a,b) in order to construct (k+1)-chain(a,b)?

Definitions:
Range: the range of a match (i,j) is (0…i-1, 0…j-1).
Mutual range: an area of the table that is overlapped by at least two ranges of distinct matches.
Owner: (i',j') is the owner of a range if k-chain(i',j') is the suffix of (k+1)-chain(a,b) for every match (a,b) in the range.
The algorithm maintains L separate lists of ranges and their owners.

If (a,b) is in the range of a single match (i',j') (it is not in a mutual range), k-chain(i',j') is the suffix of (k+1)-chain(a,b). If (a,b) is in the mutual range of two matches, how can we determine which of them should be concatenated to (a,b)?
Lemma: a mutual range of two matches is owned completely by one of them.

Lemma: a mutual range of two matches, p = (i,j) and q = (i',j'), is owned completely by one of them.
Proof: there are two distinct cases.
Case 1: i ≤ i' and j ≤ j'.

Case 2: i < i' and j > j'. The mutual range of p and q is (0…i-1, 0…j'-1). Entry (i-1, j'-1) is the mutual point (MP) of p and q. p is the owner of the mutual range if Lp + (j-j') ≤ Lq + (i'-i), where Lp and Lq are the lengths of the k-chains headed by p and q.
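
The Case 2 ownership test can be sketched as follows, assuming Lp and Lq denote the lengths of the k-chains headed by p and q (the function name and tuple convention are mine):

```python
def owner(p, Lp, q, Lq):
    """Return the owner of the mutual range of p = (i,j) and q = (i',j')
    in Case 2 of the lemma (i < i' and j > j').

    From the mutual point (i-1, j'-1), reaching p adds j - j' to a chain's
    length while reaching q adds i' - i; the match giving the shorter total
    chain owns the whole mutual range (ties go to p, matching Lp + (j-j')
    <= Lq + (i'-i)).
    """
    (i, j), (i2, j2) = p, q
    assert i < i2 and j > j2, "expects Case 2 geometry"
    return p if Lp + (j - j2) <= Lq + (i2 - i) else q
```

For instance, with equal chain lengths Lp = Lq = 4, owner((2, 5), 4, (6, 3), 4) returns (2, 5), since detouring to p costs 2 extra positions versus 4 for q.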

The algorithm
Preprocessing.
Process the matches row by row, from the bottom up. For the matches of row i:
- Stage 1: construct the k-chains, 1 ≤ k ≤ L, for all the matches of row i, using the L lists of ranges and owners.
- Stage 2: update the lists of ranges and owners with the matches of row i and their k-chains.
Finally, examine the k-chains of all matches and report the ones with the highest normalized value.

Stage 2
Let LRO_k be the list of ranges and owners whose entries are heads of k-chains. Insert each match (i,j) of row i that is the head of a k-chain into LRO_k. If the list already contains another match with column coordinate j, extract it from LRO_k.

Stage 2, cont.
While (i',j'), the left neighbor of (i,j) in LRO_k, satisfies (length of k-chain(i',j') + i'-i) ≥ (length of k-chain(i,j) + j-j'), extract (i',j') from LRO_k.

Stage 1
Construct (k+1)-chain(i,j) by concatenating (i,j) to the match in LRO_k that owns the range containing (i,j). Record the value of (k+1)-chain(i,j) with the match (i,j).

Reporting the best alignments
The best alignment is either the alignment with the highest normalized value or an alignment whose similarity exceeds a predefined value. Check all the k-chains, k ≥ M, starting from each match and report the best alignments.

Complexity analysis
Preprocessing: O(n log |Σ_Y|).
Stage 1: for each of the r matches we construct at most L k-chains; using a Johnson tree, stage 1 is computed in O(rL log log n) time.
Stage 2: each of the r matches is inserted into and extracted from each LRO_k at most once. Total: O(rL log log n) time.

Complexity analysis, cont.
Reporting the best alignments is done in O(rL) time.
The total time complexity of this algorithm is O(n log |Σ_Y| + rL log log n). The space complexity is O(rL + nL).

The O(rM log log n) normalized local LCS algorithm

Reports: the normalized alignment value of the best possible local alignment (both the value and the substrings).

Computing the highest normalized value
Definition: a sub-chain of a k-chain is a path that contains a sequence of x ≤ k consecutive matches of the k-chain.
Claim: when a k-chain is split into a number of non-overlapping consecutive sub-chains, the normalized value of the k-chain is smaller than or equal to that of its best sub-chain.
Result: the normalized value of any k-chain (k ≥ M) is smaller than or equal to the value of its best sub-chain with M to 2M-1 matches.
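
The claim is an instance of the mediant inequality, (a+c)/(b+d) ≤ max(a/b, c/d): the whole chain's value is the mediant of its parts' values. A small numeric check (the function name and (matches, length) encoding are mine):

```python
from fractions import Fraction

def split_value_bound(subchains):
    """Given consecutive, non-overlapping sub-chains as (matches, length)
    pairs, return (whole chain's normalized value, best sub-chain's value).

    By the mediant inequality, whole <= best always holds.
    """
    total_k = sum(k for k, _ in subchains)      # matches in the whole chain
    total_len = sum(l for _, l in subchains)    # length of the whole chain
    whole = Fraction(total_k, total_len)
    best = max(Fraction(k, l) for k, l in subchains)
    return whole, best
```

For example, splitting a 20-match chain of length 70 into parts of value 10/30 and 10/40 shows the whole (2/7) cannot beat its best part (1/3).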

Computing the highest normalized value
A sub-chain of fewer than M matches may not be reported. Sub-chains of 2M matches or more can be split into shorter sub-chains of M to 2M-1 matches. Is it sufficient to construct all the sub-chains of exactly M matches? No: sub-chains of M+1 to 2M-1 matches cannot be split into sub-chains of exactly M matches.

Computing the highest normalized value
The algorithm: for each match, construct all the k-chains for k ≤ 2M-1. These chains are, in fact, the sub-chains of all the longer k-chains, and a longer chain cannot be better than its best sub-chain. The algorithm can therefore report the highest normalized value of a sub-chain of at least M matches, which equals the highest normalized value of a chain of at least M matches.

Constructing the longest optimal alignment
Definition: a perfect alignment is an alignment of two identical strings; its normalized value is ½.
Unless the optimal alignment is perfect, the longest optimal alignment has no more than 2M-1 matches.

Constructing the longest optimal alignment
Assume there is a chain with more than 2M-1 matches whose normalized value is optimal; denote it LB. LB may be split into a number of sub-chains of M matches, followed by a single sub-chain of between M and 2M-1 matches. The normalized value of each such sub-chain must be equal to that of LB; otherwise, LB is not optimal. Each such sub-chain must start and end at a match; otherwise, the normalized value of the chain comprised of the same matches would be higher than that of LB (e.g., padding a 10/30 chain yields 10/35 < 10/30).

Constructing the longest optimal alignment
Note that if we concatenate two optimal sub-chains where the head of the second is next to the tail of the first, the concatenated chain is optimal (e.g., 10/30 and 10/30 give 20/60). When the head of the second is not next to the tail of the first, the concatenated chain is not optimal (e.g., 10/30, a gap of 0/2, and 10/30 give 20/62). Hence the tails and heads of the sub-chains from which LB is comprised must be next to each other.

Constructing the longest optimal alignment
If the tails and heads of the optimal sub-chains from which LB is comprised are next to each other, then their concatenation (i.e., LB) is optimal. Consider the first two sub-chains, each of value M/L, whose concatenation has value 2M/2L. But what happens if we examine the sub-chain comprised of the first sub-chain together with the first match of the second? Its number of matches is M+1 and its length is L+2. Since M/L < ½, we have L > 2M, and therefore (M+1)/(L+2) > M/L. Thus we have found a chain of M+1 matches whose normalized value is higher than that of LB, in contradiction to the optimality of LB.

Closing remarks

The advantages of the new algorithm
- The first algorithm to combine the "normalized local" and "sparse" approaches.
- Ideal for textual local comparison (where the sparsity is typically dramatic) as well as for screening biological sequences.
- As a normalized alignment algorithm, it does not suffer from the weaknesses of non-normalized algorithms.
- A straightforward approach to the minimal constraint, which is easy to control and understand and, at the same time, does not require reformulation of the original problem. The minimal constraint is problem-related rather than input-related.