1 New tabulation and dynamic programming based techniques for sequence similarity problems
Szymon Grabowski, Sept. 2014
Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland

2 Agenda
1. (Naïve) dynamic programming.
2. Four Russians.
3. Main LCS results.
4. Bille & Farach-Colton technique.
5. Our improvement of the BFC alg.
6. Our LCS result with sparse DP.
7. Algorithmic applications (Levenshtein distance, LCTS, MerLCS).
8. Conclusions & open problems.

3 Dynamic Programming (DP)
Everybody knows… Quadratic cost for 2 sequences (a cell "in the middle" cannot be computed before the previous rows/columns are known).
Speedup ideas:
tabulation (aka Four Russians),
bit-parallelism,
sparse dynamic programming,
compressing the input sequences.
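For concreteness, here is a minimal Python sketch (not from the slides) of the textbook quadratic DP for LCS; the function name and interface are illustrative choices:

```python
def lcs(a, b):
    """Textbook quadratic DP: M[i][j] = LCS length of a[:i] and b[:j].
    Each cell needs its left, upper and upper-left neighbours, hence the
    strict row-by-row order (the "can't start in the middle" issue)."""
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    return M[m][n]
```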

4 DP made (slightly) faster
If we can process blocks of b × b symbols in O(1) time, we immediately obtain O(mn / b²) time. We can do it (Masek & Paterson, 1980), e.g., for a binary alphabet and b = log n / 4, which gives O(mn / log² n) time. The idea is to precompute all possible block inputs (short enough strings are guaranteed to repeat) and to represent the DP values in a differential manner.
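A hedged sketch of the block primitive, with LCS as the target problem: the function below computes one b × b block from differentially encoded borders, and memoization (lru_cache) stands in for the exhaustive precomputation of all possible inputs. All names are my own choices, not the Masek & Paterson implementation.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def block_lcs(snip_a, snip_b, top_diffs, left_diffs):
    """One b x b LCS block, relative to the (unknown) absolute value v0
    of its top-left corner.  snip_a, snip_b: the b symbols of A and B
    covered by the block (strings); top_diffs, left_diffs: length-b
    tuples of 0/1 increments along the top row and the left column
    (differential encoding).  Returns the 0/1 increments along the
    bottom row and the right column."""
    b = len(snip_a)
    top = [0]                      # border values relative to v0 = 0
    for d in top_diffs:
        top.append(top[-1] + d)
    left = [0]
    for d in left_diffs:
        left.append(left[-1] + d)
    M = [[0] * (b + 1) for _ in range(b + 1)]
    M[0] = list(top)
    for i in range(b + 1):
        M[i][0] = left[i]
    for i in range(1, b + 1):       # standard DP inside the block
        for j in range(1, b + 1):
            M[i][j] = max(M[i - 1][j], M[i][j - 1],
                          M[i - 1][j - 1] + (snip_a[i - 1] == snip_b[j - 1]))
    bottom = tuple(M[b][j + 1] - M[b][j] for j in range(b))
    right = tuple(M[i + 1][b] - M[i][b] for i in range(b))
    return bottom, right
```

Because the borders are 0/1 difference sequences and the snippets come from a small alphabet, only few distinct inputs exist for a small enough b, so every block input repeats and is effectively computed once.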

5 LCS, selected results (time complexities)
Standard DP: O(mn).
Tabulation (Masek & Paterson, 1980): O(mn / log² n) for a constant alphabet.
Tabulation (Bille & Farach-Colton, 2008): O(mn (log log n)² / log² n) for an integer alphabet.
Bit-parallelism (Allison & Dix, 1986, …): O(mn / w), where w ≥ log n is the machine word size (in bits).
Sparse DP:
Hunt & Szymanski, 1977: O(r log log n), where r is the number of matches;
Eppstein, Galil, Giancarlo & Italiano, 1992: O(D log log(min{D, mn / D})), where D ≤ r is the number of dominant matches.

6 LCS, selected results, cont'd
Sparse DP:
Sakai, 2012: O(mσ + min{Dσ, p(m − q)} + n), where p = LCS(A, B), q = LCS(A[1…m], B).
LZ78-compressed input:
Crochemore, Landau & Ziv-Ukelson, 2003: O(hmn / log n) for a constant alphabet, where h ≤ 1 is the entropy of the inputs (for a binary alphabet).
RLE-compressed input: several results, incl. Liu, Wang & Lee, 2008: O(min{nl, km}), where k and l are the RLE-compressed sequence lengths.
SLP-compressed input:
Gawrychowski, 2012: O(kn √(log(n / k))), where k is the total length of the SLP-compressed sequences.

7 The technique of Bille & Farach-Colton
For an integer alphabet of size σ, the Masek & Paterson result can easily be modified to run in O(mn log² σ / log² n) time. This is fine for small σ, but not if σ = n^c, c > 0.
Bille & Farach-Colton use alphabet mapping in superblocks: use superblocks of size, e.g., log³ n × log³ n and divide each superblock into blocks of size Θ(log n / log log n) × Θ(log n / log log n).

8 BFC, cont'd
That is, for the current text snippet of A, of length log³ n, extract its up to log³ n distinct symbols and encode the current snippets of A and B accordingly (one extra symbol for "something else" is needed in the snippet of B).
Easily, O(log log n) bits per encoded symbol are enough, the mapping time is negligible overall (a BST can be used, at a log(superblock)-factor per symbol), and the total time is O(mn (log log n)² / log² n).
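A possible rendering of the superblock alphabet mapping in Python (the function name and details are assumptions, not taken from the paper):

```python
def remap_superblock(snippet_a, snippet_b):
    """Encode both snippets against the distinct symbols of the A-snippet.
    A superblock snippet has at most log^3 n distinct symbols, so each
    code fits in O(log log n) bits; symbols of B not occurring in A all
    map to one reserved "something else" code, since they cannot
    participate in any match anyway."""
    codes = {}
    for c in snippet_a:
        codes.setdefault(c, len(codes))   # next free small code
    other = len(codes)                    # the extra symbol for B
    a_enc = [codes[c] for c in snippet_a]
    b_enc = [codes.get(c, other) for c in snippet_b]
    return a_enc, b_enc
```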

9 BFC, alphabet mapping example
Blocks of size 3 × 3, superblocks of size 9 × 9.

10 Our technique (Alg 1)
Use the BFC alphabet mapping in superblocks, but use many LUTs (instead of one), with modified input: one LUT per horizontal stripe (of length n).
The LUT input: a snippet of A, the left block border (1 bit per cell), the upper block border (1 bit per cell). No snippet of B is part of the input, as it is fixed for a given LUT! (Re-use LUTs for repeating snippets of B.)
Thanks to this, we work on rectangular (not square), "portrait"-oriented blocks of size Θ(log n / log log n) × Θ(log n).
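A sketch of the one-LUT-per-stripe idea, reusing the block_lcs sketch from above; make_stripe_solver is a hypothetical name and the dict cache stands in for a real precomputed LUT:

```python
def make_stripe_solver(snip_b):
    """Conceptually one LUT per horizontal stripe: the stripe's B-snippet
    is fixed, so it is baked in here rather than being part of the key."""
    cache = {}
    def solve(snip_a, top_diffs, left_diffs):
        key = (snip_a, top_diffs, left_diffs)   # note: no B in the key
        if key not in cache:
            cache[key] = block_lcs(snip_a, snip_b, top_diffs, left_diffs)
        return cache[key]
    return solve
```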

11 One horizontal stripe (4 blocks of 5 × 5)
[Figure: one stripe of the DP matrix, seq A horizontal, seq B vertical. Red arrows: explicitly stored LCS values; black arrows: diff-encoded LCS values; labels such as 34023: text snippets encoded with reference to a superblock (not shown). The diagonally shaded cells are the block output cells.]

12 LCS, first result (Alg 1)
Processing blocks of size Θ(log n / log log n) × Θ(log n) in O(1) time each yields O(mn log log n / log² n) total time for LCS over an integer alphabet, shaving a log log n factor off the BFC bound. (Theorem given as a formula on the slide.)

13 Output-dependent algorithm
We work in blocks of (b+1) × (b+1), but divide them into sparse ones, which have ≤ K matches, and dense ones, with > K matches.
Key observation: knowing the top row and the leftmost column of a block, plus the locations of all matches in it, is enough to compute the block. That is, the text snippets are not needed!
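A sketch of the key observation, under the same conventions as block_lcs above: the block DP consults only the match-position set and never touches the text snippets (names and the 1-based convention are my assumptions):

```python
def block_lcs_sparse(matches, top_diffs, left_diffs, b):
    """Same DP as block_lcs, but the match test looks up the set of
    match positions -- the text snippets are never needed.
    matches: frozenset of (i, j), 1-based cells within the block."""
    top = [0]
    for d in top_diffs:
        top.append(top[-1] + d)
    left = [0]
    for d in left_diffs:
        left.append(left[-1] + d)
    M = [[0] * (b + 1) for _ in range(b + 1)]
    M[0] = list(top)
    for i in range(b + 1):
        M[i][0] = left[i]
    for i in range(1, b + 1):
        for j in range(1, b + 1):
            M[i][j] = max(M[i - 1][j], M[i][j - 1],
                          M[i - 1][j - 1] + ((i, j) in matches))
    return (tuple(M[b][j + 1] - M[b][j] for j in range(b)),
            tuple(M[i + 1][b] - M[i][b] for i in range(b)))
```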

14 Where sparse DP meets tabulation
A sparse block input:
top row: b bits (differential encoding),
leftmost column: b bits (differential encoding),
match locations: each in log(b²) bits, totalling O(K log b) bits.
(The output: even less.)
Hence, if K log b + b = O(log n) (with a small enough constant), we can use a LUT for all sparse blocks and compute each of them in constant time.
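To illustrate the bit budget, a hypothetical packing of a sparse block's whole input into a single integer LUT key (the field layout is my own choice):

```python
def sparse_block_key(top_diffs, left_diffs, matches, b):
    """Pack a sparse block's input into one integer LUT key: b + b
    border bits, a match count, and ceil(log2(b*b)) bits per match --
    O(K log b + b) bits in total, matching the bound on the slide."""
    key = 0
    for d in top_diffs:                  # b bits: differential top row
        key = (key << 1) | d
    for d in left_diffs:                 # b bits: differential left column
        key = (key << 1) | d
    mbits = max(1, (b * b - 1).bit_length())  # bits for one cell index
    key = (key << mbits) | len(matches)       # count keeps keys unambiguous
    for (i, j) in sorted(matches):            # 1-based cells, as in the DP
        key = (key << mbits) | ((i - 1) * b + (j - 1))
    return key
```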

15 Dense blocks
Dense blocks are partitioned into smaller blocks, which are then processed with our technique from Alg 1. The smaller block sizes are Θ(log n / log log n) × Θ(b).

16 Choosing the parameters
b = O(log n) (otherwise the LUT build costs would dominate), but also b = Ω(log n / sqrt(log log n)) (otherwise this algorithm would never beat Alg 1). This implies K = Θ(log n / log log n), with an appropriate constant.
If the fraction of dense blocks in the matrix is 0 < f_d ≤ 1, then the total time complexity (without preprocessing!) is O(f_d mn log log n / (b log n) + mn / b²): the dense area goes through the Alg 1 sub-blocks, and each sparse block costs O(1).
For a small enough r (= the total number of matches in the matrix) we may get O(mn / log² n) from the above formula; alas, in the preprocessing we have to find and encode all matches in all sparse blocks, in O(n + r) time.

17 LCS, second result (Alg 2)

18 Alg 2 niche
Considering the results of Eppstein et al., 1992, Sakai, 2012, and Alg 1, we obtain the niche in which Alg 2 is the winner (the two defining conditions are given as formulas on the slide).

19 Simple generalization of Th. 1 and 2

20 Longest common transposition-invariant subsequence (LCTS)
LCTS = LCS under the best key transposition. (In music, a transposition shifts a sequence of notes (pitches) up or down by a constant interval.)
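For concreteness, a naive LCTS sketch built on the earlier lcs function, assuming integer pitches in [0, σ); the fast algorithm on the next slide replaces the per-transposition LCS computation:

```python
def lcts(a, b, sigma):
    """Naive LCTS for integer pitches in [0, sigma): try every key
    transposition t of B and keep the best plain LCS -- O(sigma * mn)
    overall with the quadratic lcs from the first sketch."""
    return max(lcs(a, [x + t for x in b])
               for t in range(-(sigma - 1), sigma))
```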

21 LCTS, known results and a new one
Navarro, Grabowski, Mäkinen, Deorowicz, 2005; Deorowicz, 2006: apply the BFC technique for each transposition.
New algorithm: let us call the transpositions with at least mn log log n / σ matches dense, and the others sparse. Apply Alg 1 to the dense transpositions and Alg 2 to the sparse ones. Overall time: (formula and its validity condition given on the slide).

22 Merged LCS (MerLCS)
A bioinformatics problem on 3 sequences: given sequences A, B and P, return a longest sequence T that is a subsequence of P and can be split into two subsequences T′ and T′′ such that T′ is a subsequence of A and T′′ is a subsequence of B. |A| = n, |B| = m, |P| = u.
Known results:
Peng, Yang, Huang, Tseng & Hor, 2010: O(lmn) time, where l ≤ n is the result length.
Deorowicz & Danek, 2013: O(⌈u / w⌉ mn log w) time.

23 Our result for MerLCS
DP matrix property: Deorowicz and Danek noticed that M(i, j, k) is equal to, or larger by 1 than, each of the three neighbors M(i − 1, j, k), M(i, j − 1, k), M(i, j, k − 1).
We generalize our result from 2 sequences to 3 sequences (input: 3 text snippets plus 3 two-dimensional walls instead of 1-dimensional borders!) to obtain O(mnu / log^{3/2} n) time for MerLCS, if u = Ω(n^c) for some c > 0.

24 Conclusions
Tabulation (= Four Russians) is a classic DP-boosting technique. Interestingly, we managed to (slightly) improve its application to the LCS / edit distance problem.
Applying tabulation may be even more profitable for a sparse matrix.
These techniques also work for a few problems other than LCS and edit distance.

25 Open problems
Can we improve the tabulation-based result for compressible sequences?
Can we adapt our technique(s) to problems in which the conditions from Lemma 3 (or Lemma 7, involving 3 sequences) are relaxed, that is, where consecutive DP cells may (sometimes) differ by more than a constant? An example problem: SEQ-EC-LCS (Chen & Chao, 2011; Deorowicz & Grabowski, 2014).