1 New tabulation and dynamic programming based techniques for sequence similarity problems Szymon Grabowski Sept Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland
2 Agenda 1. (Naïve) dynamic programming. 2. Four Russians. 3. Main LCS results. 4. Bille & Farach-Colton technique. 5. Our improvement of the BFC alg. 6. Our LCS result with sparse DP. 7. Algorithmic apps (Lev distance, LCTS, MerLCS). 8. Concl & open problems.
3 Dynamic Programming (DP) Everybody knows… Quadratic cost for 2 sequences (we can’t compute a cell "in the middle" before knowing the previous rows/cols). Speedup ideas: tabulation (aka Four Russians), bit-parallelism, sparse dynamic programming, compressing the input sequences.
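For reference, the quadratic baseline can be sketched as follows. This is a minimal illustration of the textbook LCS recurrence, not code from the talk; the function name `lcs_length` is hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b in O(mn) time."""
    n = len(b)
    prev = [0] * (n + 1)          # DP row for the previous symbol of a
    for x in a:
        cur = [0] * (n + 1)       # DP row for the current symbol of a
        for j in range(1, n + 1):
            if x == b[j - 1]:
                # A match extends the diagonal predecessor by one.
                cur[j] = prev[j - 1] + 1
            else:
                # Otherwise take the better of the upper and left neighbors.
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[n]


print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```

Each cell depends on its upper, left and upper-left neighbors, which is why a cell cannot be computed before the preceding rows and columns.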
4 DP made (slightly) faster If we can process blocks of $b \times b$ symbols in $O(1)$ time, we immediately obtain $O(mn / b^2)$ time. We can do it (Masek & Paterson, 1980) e.g. for a binary alphabet and $b = \log n / 4$, which gives $O(mn / \log^2 n)$ time. The idea is to precompute the answers for all possible block inputs (short enough strings are guaranteed to repeat) and to represent the DP values in a differential manner.
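The block idea can be illustrated with a small sketch. It is not the Masek & Paterson algorithm itself: instead of precomputing a lookup table for all inputs, it memoizes block results on demand with `functools.lru_cache`, and a Python dictionary does not give the word-RAM $O(1)$ guarantee. The names `solve_block` and `lcs_blocked` are hypothetical. What it does show is that a block is fully determined by its two text snippets and its differentially encoded top and left borders (adjacent LCS cells differ by 0 or 1).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def solve_block(a, b, top, left):
    """Map a block input to its output borders.

    a, b: text snippets (rows, columns); top, left: tuples of 0/1 differences
    along the upper and left border. Returns (bottom, right) difference tuples.
    """
    rows, cols = len(a), len(b)
    M = [[0] * (cols + 1) for _ in range(rows + 1)]   # corner value fixed to 0
    for j in range(1, cols + 1):
        M[0][j] = M[0][j - 1] + top[j - 1]
    for i in range(1, rows + 1):
        M[i][0] = M[i - 1][0] + left[i - 1]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    bottom = tuple(M[rows][j] - M[rows][j - 1] for j in range(1, cols + 1))
    right = tuple(M[i][cols] - M[i - 1][cols] for i in range(1, rows + 1))
    return bottom, right


def lcs_blocked(A, B, b):
    """LCS length of strings A and B, computed block by block (block side b)."""
    col_starts = range(0, len(B), b)
    # tops[c]: differences along the lower border of the block above column c.
    tops = [(0,) * len(B[j:j + b]) for j in col_starts]
    for i in range(0, len(A), b):
        a = A[i:i + b]
        left = (0,) * len(a)                  # column 0 of the matrix is all zeros
        for c, j in enumerate(col_starts):
            tops[c], left = solve_block(a, B[j:j + b], tops[c], left)
    # The last matrix row starts at 0, so summing its differences gives LCS(A, B).
    return sum(sum(t) for t in tops)


print(lcs_blocked("ABCBDAB", "BDCABA", 3))  # 4
```

Repeated block inputs hit the cache, which is the effect the precomputed table achieves for every input at once.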
5 LCS, selected results (time compl.) Standard DP: $O(mn)$. Tabulation (Masek & Paterson, 1980): $O(mn / \log^2 n)$ for a constant alphabet. Tabulation (Bille & Farach-Colton, 2008): $O(mn (\log\log n)^2 / \log^2 n)$ for an integer alphabet. Bit-parallelism (Allison & Dix, 1986, …): $O(mn / w)$, where $w \ge \log n$ is the machine word size (in bits). Sparse DP: Hunt & Szymanski, 1977: $O(r \log\log n)$, where $r$ is the number of matches; Eppstein, Galil, Giancarlo & Italiano, 1992: $O(D \log\log(\min\{D, mn / D\}))$, where $D \le r$ is the number of dominant matches.
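To make the bit-parallel entry concrete, here is a small sketch. It uses the compact update rule $V \leftarrow (V + U) \mid (V - U)$ with $U = V \wedge M_c$, known from later formulations of bit-parallel LCS, which is not necessarily the exact recurrence of Allison & Dix. Python's unbounded integers stand in for a vector of $\lceil m / w \rceil$ machine words, so this shows the logic rather than the $O(mn / w)$ running time. The function name is hypothetical.

```python
def lcs_length_bitparallel(a, b):
    """LCS length of a and b; one DP column is packed into the bits of V."""
    m = len(a)
    mask = (1 << m) - 1
    # match[c] has bit i set iff a[i] == c.
    match = {}
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)
    V = mask                       # all ones: no increments in the column yet
    for c in b:
        U = V & match.get(c, 0)
        V = ((V + U) | (V - U)) & mask
    # A zero bit at position i marks a +1 step between rows i and i + 1.
    return m - bin(V).count("1")


print(lcs_length_bitparallel("ABCBDAB", "BDCABA"))  # 4
```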
6 LCS, selected results, cont’d Sparse DP: Sakai, 2012: $O(m + \min\{D\sigma, p(m-q)\} + n)$, where $p = LCS(A, B)$, $q = LCS(A[1 \ldots m], B)$. LZ78-compressed input: Crochemore, Landau & Ziv-Ukelson, 2003: $O(hmn / \log n)$, for a constant alphabet, where $h \le 1$ is the entropy of the inputs (for a binary alph.). RLE-compressed input: several results, incl. Liu, Wang & Lee, 2008: $O(\min\{nl, km\})$, where $k$, $l$ are the RLE-compressed sequence lengths. SLP-compressed input: Gawrychowski, 2012: $O(kn \sqrt{\log(n / k)})$, where $k$ is the total length of the SLP-compressed sequences.
7 The technique of Bille & Farach-Colton For an integer alphabet of size $\sigma$, the Masek & Paterson result can easily be modified to have $O(mn \log^2 \sigma / \log^2 n)$ time. This is fine for small $\sigma$, but not if $\sigma = n^c$, $c > 0$. Bille & Farach-Colton use alphabet mapping in superblocks. Use superblocks of size e.g. $\log^3 n \times \log^3 n$ and divide each superblock into blocks of size $(\log n / \log\log n) \times (\log n / \log\log n)$.
8 BFC, cont’d That is, for the current text snippet from A of length $\log^3 n$, extract up to $\log^3 n$ distinct symbols and encode the current snippets of A and B accordingly (one extra symbol for "something else" is needed in the snippet of B). Clearly, $O(\log\log n)$ bits per encoded symbol are enough, the mapping times are negligible overall (a BST can be used, with a log(superblock)-factor per symbol), and the total time is $O(mn (\log\log n)^2 / \log^2 n)$.
9 BFC, alphabet mapping example Blocks of size $3 \times 3$, superblocks of size $9 \times 9$.
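The alphabet mapping within one superblock can be sketched as below. This is an illustration under assumptions: a Python dictionary replaces the BST mentioned on the previous slide, code 0 is chosen arbitrarily as the "something else" symbol, and the function name `remap_superblock` is hypothetical.

```python
def remap_superblock(snip_a, snip_b):
    """Re-encode two superblock snippets over a small local alphabet.

    Symbols of snip_a get codes 1..d in order of first occurrence; any symbol
    of snip_b that does not occur in snip_a gets the extra code 0.
    """
    OTHER = 0
    codes = {}
    for s in snip_a:
        codes.setdefault(s, len(codes) + 1)
    enc_a = [codes[s] for s in snip_a]
    enc_b = [codes.get(s, OTHER) for s in snip_b]
    return enc_a, enc_b


enc_a, enc_b = remap_superblock([1000, 7, 1000, 42], [42, 5, 7, 99])
print(enc_a)  # [1, 2, 1, 3]
print(enc_b)  # [3, 0, 2, 0]
```

Since code 0 never occurs in the encoded snippet of A, two encoded symbols are equal exactly when the original symbols are equal, so the matches inside the superblock are preserved.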
10 Our technique (Alg 1) Use the BFC alphabet mapping in superblocks, but use many LUTs (instead of one), each with a modified input. One LUT per horizontal stripe (of length n). The LUT input: a snippet of A, the left block border (1 bit per cell), the upper block border (1 bit per cell). No snippet of B is part of the input, as it is fixed for a given LUT! (LUTs are re-used for repeating snippets of B.) Thanks to this, we work on rectangular (not square), "portrait"-oriented blocks of size $(\log n / \log\log n) \times \log n$.
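As a rough sanity check on these block dimensions, the number of block lookups can be counted as follows. This is a back-of-the-envelope estimate, not a statement copied from the slides: it assumes $O(1)$ time per LUT lookup and ignores the costs of LUT construction and alphabet mapping.

```latex
\[
\#\text{blocks}
  = \frac{mn}{\dfrac{\log n}{\log\log n} \cdot \log n}
  = \frac{mn \log\log n}{\log^2 n}
\quad\Longrightarrow\quad
T_{\text{lookups}} = O\!\left(\frac{mn \log\log n}{\log^2 n}\right).
\]
```

Compared with square blocks of side $\log n / \log\log n$, the taller blocks save one $\log\log n$ factor in the block count.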
11 One horizontal stripe (4 blocks of $5 \times 5$) [Figure: one horizontal stripe of the DP matrix; axes: seq A, seq B.] Red arrows: explicitly stored LCS values; black arrows: diff-encoded LCS values; digit strings such as 34023: text snippets encoded with reference to a superblock (not shown). The diagonally shaded cells are the block output cells.
12 LCS, first result (Alg 1)
13 Output-dependent algorithm We work in blocks of $(b+1) \times (b+1)$ cells, but divide them into sparse ones, which have $\le K$ matches, and dense ones, with $> K$ matches. Key observation: knowing the top row and the leftmost column of a block, plus the locations of all matches in it, is enough to compute this block. That is, the text snippets are not needed!
14 Where sparse DP meets tabulation A sparse block input: top row: $b$ bits (diff encoding), leftmost column: $b$ bits (diff encoding), match locations: each in $\log(b^2)$ bits, totalling $O(K \log b)$ bits. (Output: even less.) Hence, if $K \log b + b = O(\log n)$ (with a small enough constant), we can use a LUT for all sparse blocks and compute each of them in constant time.
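The sparse block input described above can be sketched as follows. This is an illustration under assumptions: the precomputed LUT is replaced by on-demand memoization with `functools.lru_cache`, match positions are kept as a tuple of pairs rather than packed into bits, and all three function names are hypothetical. Note that `solve_sparse_block` never sees the text snippets.

```python
from functools import lru_cache


def match_positions(a, b):
    """Preprocessing step: 1-based (row, col) positions where a[row] == b[col]."""
    return tuple((i + 1, j + 1)
                 for i, x in enumerate(a)
                 for j, y in enumerate(b) if x == y)


def sparse_input_bits(b, k):
    """Bits in a sparse block input: two diff-encoded borders plus k matches."""
    bits_per_match = (b * b - 1).bit_length()   # enough to index b*b cells
    return 2 * b + k * bits_per_match


@lru_cache(maxsize=None)
def solve_sparse_block(top, left, matches):
    """Output borders of a block from its diff-encoded borders and match list."""
    rows, cols = len(left), len(top)
    match_set = set(matches)
    M = [[0] * (cols + 1) for _ in range(rows + 1)]   # corner value fixed to 0
    for j in range(1, cols + 1):
        M[0][j] = M[0][j - 1] + top[j - 1]
    for i in range(1, rows + 1):
        M[i][0] = M[i - 1][0] + left[i - 1]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if (i, j) in match_set:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    bottom = tuple(M[rows][j] - M[rows][j - 1] for j in range(1, cols + 1))
    right = tuple(M[i][cols] - M[i - 1][cols] for i in range(1, rows + 1))
    return bottom, right


matches = match_positions("AB", "BA")
print(matches)                                      # ((1, 2), (2, 1))
print(solve_sparse_block((0, 0), (0, 0), matches))  # ((1, 0), (1, 0))
print(sparse_input_bits(8, 3))                      # 34
```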
15 Dense blocks Dense blocks are partitioned into smaller blocks, which are then processed with our technique from Alg 1. The smaller block sizes are: $(\log n / \log\log n) \times b$.
16 Choosing the parameters We need $b = O(\log n)$ (otherwise the LUT build costs will dominate), but also $b = \omega(\log n / \sqrt{\log\log n})$ (otherwise this algorithm will never beat Alg 1). This implies $K = \Theta(\log n / \log\log n)$, with an appropriate constant. If the fraction of dense blocks in the matrix is $0 < f_d \le 1$, then the total time complexity (w/o preprocessing!) is: [formula missing]. For a small enough $r$ (= total number of matches in the matrix) we may have $O(mn / \log^2 n)$ from the above formula; alas, in the preprocessing we have to find and encode all matches in all sparse blocks, in $O(n + r)$ time.
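The total time without preprocessing can be estimated from the block sizes given on the previous slides. The following is a hedged reconstruction, not the formula from the original slide: it assumes $O(1)$ time per LUT lookup, a matrix of about $mn / b^2$ blocks, and dense blocks split into pieces of size $(\log n / \log\log n) \times b$, i.e. $b \log\log n / \log n$ pieces per dense block.

```latex
\[
T = O\!\left( (1 - f_d)\,\frac{mn}{b^2}
      \;+\; f_d\,\frac{mn}{b^2} \cdot \frac{b \log\log n}{\log n} \right)
  = O\!\left( (1 - f_d)\,\frac{mn}{b^2}
      \;+\; f_d\,\frac{mn \log\log n}{b \log n} \right).
\]
```

Under these assumptions, $b = \Theta(\log n)$ with a vanishing $f_d$ gives $O(mn / \log^2 n)$, while $f_d = 1$ falls back to the $O(mn \log\log n / \log^2 n)$ lookup count of Alg 1.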
17 LCS, second result (Alg 2)
18 Alg 2 niche Considering the results of Eppstein et al. (1992), Sakai (2012) and Alg 1, we obtain the following niche in which Alg 2 is the winner: [formula missing] and [formula missing].
19 Simple generalization of Th. 1 and 2
20 Longest common transposition-invariant subsequence (LCTS) LCTS = LCS in the best key transposition (in music, transposition is shifting a sequence of notes (pitches) up or down by a constant interval).
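The definition can be made concrete with a brute-force sketch: try every transposition that produces at least one match and keep the best LCS. This is a naive baseline for illustration only, not one of the algorithms discussed in the talk; both function names are hypothetical, and sequences are assumed to be lists of integers (e.g. MIDI pitches).

```python
def lcs_length(a, b):
    """Textbook O(mn) LCS length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if x == b[j - 1]:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[len(b)]


def lcts_length(a, b):
    """LCTS length: the best LCS over all transpositions t with a[i] + t == b[j]."""
    best = 0
    # Only differences b[j] - a[i] can yield a match, so other shifts are skipped.
    for t in {y - x for x in a for y in b}:
        shifted = [x + t for x in a]
        best = max(best, lcs_length(shifted, b))
    return best


print(lcts_length([60, 62, 64, 65], [67, 69, 71, 72]))  # 4 (transposition +7)
```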
21 LCTS, known results and a new one Navarro, Grabowski, Mäkinen, Deorowicz, 2005; Deorowicz, 2006 apply BFC technique for each transposition. New algorithm: let us call the transpositions with at least $mn \log\log n /$ [denominator missing] matches dense, and the others sparse. Apply Alg 1 to the dense transpositions and Alg 2 to the sparse ones. Overall time: [formula missing] for [condition missing].
22 Merged LCS (MerLCS) A bioinformatics problem on 3 sequences: given sequences A, B and P, return a longest sequence T that is a subsequence of P and can be split into two subsequences T’ and T’’ such that T’ is a subsequence of A and T’’ is a subsequence of B. $|A| = n$, $|B| = m$, $|P| = u$. Known results: Peng, Yang, Huang, Tseng & Hor, 2010: $O(lmn)$ time, where $l \le n$ is the result length. Deorowicz & Danek, 2013: $O(\lceil u / w \rceil mn \log w)$ time.
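The problem definition translates into a cubic-time dynamic program over prefixes of A, B and P. The sketch below is a plain $O(mnu)$ baseline written from the definition above, not any of the cited algorithms; the function name is hypothetical.

```python
def merlcs_length(a, b, p):
    """Length of the merged LCS of a and b with respect to p.

    M[i][j][k] is the answer for the prefixes a[:i], b[:j] and p[:k].
    """
    n, m, u = len(a), len(b), len(p)
    M = [[[0] * (u + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(1, u + 1):
                best = M[i][j][k - 1]                 # skip p[k-1]
                if i > 0:
                    best = max(best, M[i - 1][j][k])  # skip a[i-1]
                    if a[i - 1] == p[k - 1]:
                        # p[k-1] is matched by a symbol taken from a.
                        best = max(best, M[i - 1][j][k - 1] + 1)
                if j > 0:
                    best = max(best, M[i][j - 1][k])  # skip b[j-1]
                    if b[j - 1] == p[k - 1]:
                        # p[k-1] is matched by a symbol taken from b.
                        best = max(best, M[i][j - 1][k - 1] + 1)
                M[i][j][k] = best
    return M[n][m][u]


print(merlcs_length("ab", "cd", "acbd"))  # 4
```

In the example, T = "acbd" splits into T’ = "ab" (a subsequence of A) and T’’ = "cd" (a subsequence of B).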
23 Our result for MerLCS DP matrix property: Deorowicz and Danek noticed that $M(i, j, k)$ is equal to or larger by 1 than any of the three neighbors: $M(i-1, j, k)$, $M(i, j-1, k)$, $M(i, j, k-1)$. We generalize our result on 2 sequences to 3 sequences (input: 3 text snippets plus three 2-dim walls instead of 1-dim borders!) to obtain $O(mnu / \log^{3/2} n)$ for MerLCS, if $u = \Omega(n^c)$ for some $c > 0$.
24 Conclusions Tabulation (= Four Russians) is a classic DP-boosting technique. Interestingly, we managed to (slightly) improve its application to the LCS / edit distance problem. Applying tabulation may be even better for a sparse matrix. These techniques also work for a few problems other than LCS and edit distance.
25 Open problems Can we improve the tabulation-based result on compressible sequences? Can we adapt our technique(s) to problems in which the conditions from Lemma 3 (or Lemma 7, involving 3 sequences) are relaxed, that is, consecutive DP cells may (sometimes) differ by more than a constant? Exemplary problem: SEQ-EC-LCS (Chen & Chao, 2011; Deorowicz & Grabowski, 2014).