1
Fast and Practical Algorithms for Computing Runs
Gang Chen – McMaster, Ontario, CAN
Simon J. Puglisi – RMIT, Melbourne, AUS
Bill Smyth – McMaster, Ontario, CAN
CPM, UWO, July 11, 2007
2
Overview
I won’t talk much about runs!
Lempel-Ziv (LZ) Factorization
How to compute LZ with SA & LCP
– Suffix Array & LCP Array Basics (again!)
– Two different methods for LZ factorization
– CPS1 and CPS2
– Various space time trade-offs
Experimental comparison to other approaches
3
LZ Factorization (Defn)
The LZ-factorization LZ_x of a string x[1..n] is a factorization x = w_1 w_2 ... w_k such that each w_j, j ∈ 1..k, is either:
1. a letter that does not occur in w_1 w_2 ... w_{j-1}; or
2. the longest substring that occurs at least twice in w_1 w_2 ... w_j.
This is the LZ-77 parsing of the input string
Also known as the S-Factorization (Crochemore)
4
LZ Factorization (Ex)
i = 1 2 3 4 5 6 7 8
x = a b a a b a b a
w_j       = a      b      a      aba    ba
(POS,LEN) = (1,0)  (2,0)  (1,1)  (1,3)  (2,2) … or (5,2)
POS = Position of some previous occurrence
LEN = Factor length
Convention: LEN = 0 if factor is a new letter
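The example can be reproduced with a deliberately naive sketch of the definition. This is not one of the algorithms from the talk: it is a quadratic-or-worse brute force, the name `lz_factorize` is hypothetical, and it reports the leftmost previous occurrence with 1-based POS to match the slide.

```python
def lz_factorize(x):
    """Brute-force LZ factorization; returns (POS, LEN) pairs, POS 1-based."""
    n = len(x)
    factors = []
    i = 0
    while i < n:
        best_len, best_pos = 0, i      # a new letter points at itself
        for p in range(i):             # every earlier starting position
            l = 0
            # Matches may overlap position i (self-referential LZ-77).
            while i + l < n and x[p + l] == x[i + l]:
                l += 1
            if l > best_len:           # strict '>' keeps the leftmost match
                best_len, best_pos = l, p
        factors.append((best_pos + 1, best_len))
        i += max(best_len, 1)          # a new letter still consumes one symbol
    return factors


print(lz_factorize("abaababa"))
# [(1, 0), (2, 0), (1, 1), (1, 3), (2, 2)]
```

Because ties go to the leftmost occurrence, the last factor comes out as (2,2) rather than the equally valid (5,2).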
5
Applications of LZ Factorization
Computing all runs (Kolpakov & Kucherov)
Repeats with fixed gap (Kolpakov & Kucherov… again)
Branching repeats (Gusfield & Stoye)
Sequence Alignment (Crochemore et al.)
Local periods (Duval et al.)
Data Compression (Lempel & Ziv, many others)
Etcetera…
LZ Factorization is the computational bottleneck in numerous string processing algorithms
6
Computing LZ
“Traditional” method is to use a suffix tree
– Can be computed as a by-product of Ukkonen’s online suffix tree construction algorithm, OR
– During a bottom-up traversal of a whole tree
SA/LCP interval tree (Abouelhoda et al. 2004)
– Essentially simulating a bottom-up traversal of the suffix tree on the SA/LCP combination
Both these approaches use lots of space.
7
The ubiquitous Suffix Array
Sort the n suffixes of x[1..n] into lexorder
Store the offsets in an array
i = 1 2 3 4 5 6 7 8
x = a b a a b a b a
Before SORT         After SORT (suffix array)
1 abaababa          8 a
2 baababa           3 aababa
3 aababa            6 aba
4 ababa             1 abaababa
5 baba              4 ababa
6 aba               7 ba
7 ba                2 baababa
8 a                 5 baba
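As a minimal illustration of the definition, the suffix array can be built by sorting suffix start positions directly. This is only a sketch for small inputs (the hypothetical `suffix_array` below costs far more than a practical SA construction algorithm), and it uses 0-based offsets, so adding 1 gives the values on the slide.

```python
def suffix_array(x):
    """Start positions (0-based) of the suffixes of x in lexicographic order."""
    # Comparing the suffix strings themselves is simple but slow and
    # memory-hungry; real implementations avoid materializing suffixes.
    return sorted(range(len(x)), key=lambda i: x[i:])


sa = suffix_array("abaababa")
print([p + 1 for p in sa])   # 1-based, as on the slide
# [8, 3, 6, 1, 4, 7, 2, 5]
```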
8
LCP Array
Many SA algorithms rely on an additional table: the LCP (longest common prefix) array
Can be computed in O(n) time (Kasai et al. 1999)
Several practical improvements: space consumption reduced from 13n to 9n (Manzini 2004)
LCP Array stores length of Longest Common Prefix between suffixes SA[i] and SA[i-1]
SA  LCP  Suffix
8   0    a
3   1    aababa
6   1    aba
1   3    abaababa
4   3    ababa
7   0    ba
2   2    baababa
5   2    baba
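The linear-time idea credited to Kasai et al. can be sketched as follows. This is a plain textbook-style version, not the space-reduced variant mentioned on the slide; the name `lcp_array` and the 0-based indexing are choices made for this sketch.

```python
def lcp_array(x, sa):
    """lcp[r] = longest common prefix of suffixes sa[r-1] and sa[r]; lcp[0] = 0."""
    n = len(x)
    rank = [0] * n
    for r, s in enumerate(sa):
        rank[s] = r                    # inverse suffix array
    lcp = [0] * n
    h = 0                              # match length carried between suffixes
    for s in range(n):                 # visit suffixes in string order
        r = rank[s]
        if r == 0:
            h = 0
            continue
        t = sa[r - 1]                  # lexicographic predecessor of suffix s
        while s + h < n and t + h < n and x[s + h] == x[t + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                     # suffix s+1 shares at least h-1 symbols
    return lcp


x = "abaababa"
sa = [7, 2, 5, 0, 3, 6, 1, 4]          # 0-based suffix array of x
print(lcp_array(x, sa))
# [0, 1, 1, 3, 3, 0, 2, 2]
```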
9
Computing LZ with the SA
First “family” of LZ algorithms we call CPS1
CPS1 algorithms compute arrays POS and LEN
These arrays give us the factor information for every position (which is more than we require)
Also, LEN is a permutation of LCP
i   = 1 2 3 4 5 6 7 8
x   = a b a a b a b a
POS = 1 2 1 1 2 1 2 1
LEN = 0 0 1 3 2 3 2 1
LCP = 0 1 1 3 3 0 2 2
10
CPS1: LZ from SA & LCP
POS and LEN are computed in a straight left-to-right traversal of the SA/LCP arrays
We “ascend” the LCP array, saving indexes on the stack until LCP values decrease
Backtrack using the stack to locate the rightmost i1 < i2 with LCP[i1] < LCP[i2]
As we go set the larger position with equal LCP to point leftwards to the smaller one
14 lines of C code!
x, SA, LCP, POS, LEN arrays → 17n + stack
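To make the stack idea concrete, here is a hedged Python sketch that fills POS and LEN from SA and LCP. It is not the authors' 14 lines of C and it uses two passes rather than one: for each suffix it finds the nearest neighbour in SA order, on either side, that starts further left in the text, and keeps the longer match. The name `lpf_from_sa_lcp` and the 0-based indexing are assumptions of this sketch.

```python
def lpf_from_sa_lcp(sa, lcp):
    """Return (pos, length), both indexed by 0-based text position.

    sa[r] is the start of the r-th smallest suffix; lcp[r] is the longest
    common prefix of suffixes sa[r-1] and sa[r], with lcp[0] == 0.
    """
    n = len(sa)
    pos = list(range(n))     # convention: a new letter points at itself
    length = [0] * n

    def scan(ranks, lcp_with_previous):
        stack = []           # entries: (text position, lcp with entry beneath)
        for r in ranks:
            c = lcp_with_previous(r)
            # Discard neighbours that start to the right of sa[r]; the lcp
            # with what remains is the minimum over the discarded chain.
            while stack and stack[-1][0] > sa[r]:
                c = min(c, stack.pop()[1])
            if not stack:
                c = 0        # nothing to the left in the text on this side
            elif c > length[sa[r]]:
                length[sa[r]] = c
                pos[sa[r]] = stack[-1][0]
            stack.append((sa[r], c))

    scan(range(n), lambda r: lcp[r])                               # look upwards in SA
    scan(range(n - 1, -1, -1),
         lambda r: lcp[r + 1] if r + 1 < n else 0)                 # look downwards in SA
    return pos, length


pos, length = lpf_from_sa_lcp([7, 2, 5, 0, 3, 6, 1, 4],
                              [0, 1, 1, 3, 3, 0, 2, 2])
print(length)
# [0, 0, 1, 3, 2, 3, 2, 1]
```

Where a position has several earlier occurrences of the same length, the POS chosen here may differ from the one a single-pass implementation would report; any of them is a valid "previous occurrence".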
11
Overwrite LCP with POS
Once POS[SA[i]] has been assigned
– SA[i] and LCP[i] are no longer accessed…
Reuse the space
– Leave SA[i] as is
– Assign LCP[i] = POS[SA[i]]
– Store LEN separately as before
After the traversal of SA/LCP is complete, permute the SA and “LCP” arrays in place into string order by following all cycles
POS array no longer needed → 13n + stack
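The in-place permutation step can be sketched like this. It assumes the overwritten “LCP” array holds `val[i] = POS[SA[i]]` and moves each value to index `SA[i]` by swapping along cycles; the function name and 0-based indexing are hypothetical, and SA itself ends up as the identity permutation.

```python
def permute_to_string_order(sa, val):
    """In place: afterwards val[p] is the value that was paired with sa[i] == p."""
    for i in range(len(sa)):
        # Keep swapping until slot i holds the pair that belongs there.
        while sa[i] != i:
            j = sa[i]                          # home slot of the pair now at i
            sa[i], sa[j] = sa[j], sa[i]        # after this, sa[j] == j
            val[i], val[j] = val[j], val[i]    # value travels with its SA entry


pos = [0, 1, 0, 0, 1, 0, 1, 0]                 # POS in string order (0-based)
sa = [7, 2, 5, 0, 3, 6, 1, 4]
val = [pos[s] for s in sa]                     # what the traversal leaves behind
permute_to_string_order(sa, val)
print(val == pos, sa == list(range(8)))
# True True
```

Each swap puts one entry into its final slot, so the total work is linear and no extra array is needed.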
12
Eliminate the LEN Array
Given POS[i] = p
– LEN[i] = longestmatch(x[POS[i]…n], x[i…n])
Compute only the POS values
– Permute them into the POS array (as last slide)
Compute LEN values only for factors in the parsing
Sum of the factor lengths required for the parsing is n, so still O(n) time
LEN array no longer needed → 9n + stack
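A short sketch of this step, assuming POS is already in string order and each POS[i] points at a longest previous occurrence (or at i itself for a new letter). The name `factors_from_pos` is hypothetical and indices are 0-based.

```python
def factors_from_pos(x, pos):
    """Recover the LZ factors as (POS, LEN) pairs, computing LEN only at factor starts."""
    n = len(x)
    factors = []
    i = 0
    while i < n:
        p = pos[i]
        l = 0
        if p < i:                      # p == i marks a new letter, LEN = 0
            while i + l < n and x[p + l] == x[i + l]:
                l += 1                 # the "longestmatch" of the slide
        factors.append((p, l))
        i += max(l, 1)                 # total matching work sums to about n
    return factors


print(factors_from_pos("abaababa", [0, 1, 0, 0, 1, 0, 1, 0]))
# [(0, 0), (1, 0), (0, 1), (0, 3), (1, 2)]
```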
13
CPS2: LZ without LCP
LCP computation is slow (though linear)
– requires extra space: can we drop it?
Use SA to search for the longest previous match at each position in the factorization
– Problem is: we don’t want any match - we want a match to the left.
– When do we stop the search?
14
LZ without LCP (cont…)
i = 1 2 3 4 5 6 7 8
x = a b a a b a b a
SA  Suffix
8   a
3   aababa
6   aba
1   abaababa
4   ababa
7   ba
2   baababa
5   baba
RangeMin_SA(1,5) = 1   Length = 1
RangeMin_SA(3,5) = 1   Length = 2
RangeMin_SA(3,5) = 1   Length = 3
RangeMin_SA(5,5) = 4   (no occurrence to the left of position 4: stop)
RangeMin_SA(3,5) = 1   Length = 3
15
LZ without LCP (cont…)
Use two binary searches to refine range
– Incremental use of Manber and Myers search
– Could use other search algs (like FM)
Preprocess SA for fast RMQ queries
– RMQ_SA(i,j) returns minimum value in SA[i..j]
– Fast implementation of RMQ requires n bytes
O(n log n) time, ~6n bytes space
– n single character searches
– Each search takes O(log n) time
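The search loop can be sketched as below, under loud assumptions: the function names are hypothetical, indices are 0-based with half-open SA ranges, and a plain `min()` over the slice stands in for the constant-time RMQ structure, so this sketch does not achieve the O(n log n) bound. It only shows how the range is narrowed one character at a time and when the search stops.

```python
def refine(x, sa, lo, hi, d, c):
    """Narrow SA range [lo, hi), whose suffixes agree on d symbols, to those
    whose next symbol is c. Uses two binary searches."""
    def key(r):
        p = sa[r] + d
        return x[p] if p < len(x) else ""   # end of string sorts first

    a, b = lo, hi
    while a < b:                            # first rank with key >= c
        m = (a + b) // 2
        if key(m) < c:
            a = m + 1
        else:
            b = m
    left, b = a, hi
    while a < b:                            # first rank with key > c
        m = (a + b) // 2
        if key(m) <= c:
            a = m + 1
        else:
            b = m
    return left, a


def lz_by_sa_search(x, sa):
    """LZ factors as (POS, LEN) pairs using only x and SA (no LCP array)."""
    n = len(x)
    factors = []
    i = 0
    while i < n:
        lo, hi = 0, n
        pos, length = i, 0
        while i + length < n:
            lo2, hi2 = refine(x, sa, lo, hi, length, x[i + length])
            leftmost = min(sa[lo2:hi2])     # stand-in for RMQ_SA(lo2, hi2)
            if leftmost >= i:               # only suffix i itself matches: stop
                break
            lo, hi, pos = lo2, hi2, leftmost
            length += 1
        factors.append((pos, length))
        i += max(length, 1)
    return factors


print(lz_by_sa_search("abaababa", [7, 2, 5, 0, 3, 6, 1, 4]))
# [(0, 0), (1, 0), (0, 1), (0, 3), (1, 2)]
```

The range always contains the rank of suffix i, so the slice passed to `min()` is never empty, and the stopping test answers the "when do we stop?" question: as soon as the range minimum is no longer to the left of i.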
16
Experiments
Implemented CPS algorithms and raced with:
1. Kolpakov and Kucherov’s implementation (KK)
– Computes factors during online construction of the suffix tree (Ukkonen’s algorithm)
– Tuned specifically for DNA strings
2. Abouelhoda et al.’s approach (AKO)
– Uses SA and LCP, computes the POS, LEN
17
Results - Runtimes
18
Peak Memory Usage
19
Conclusions
KK remains fastest algorithm on DNA
CPS1 (13n) is consistently fastest on larger alphabets (notably faster than AKO)
CPS1 (9n) provides a nice space time tradeoff
CPS2 most suitable if memory is tight
20
Future Work
Computing the LCP array is a burden
– Can we speed it up?
– Compute it during SA construction?
How easily do these algorithms map to compressed SAs?
– Overwriting SA/LCP difficult in that setting
Can LZ be computed efficiently without using SA/LCP or STree?
Can we compute the rightmost previous POS instead of the leftmost? (Veli Makinen 7-9-2007)