Compressed Suffix Arrays for Massive Data Jouni Sirén SPIRE 2009
Constructing Really Big Compressed Suffix Arrays Jouni Sirén SPIRE 2009
Outline
  1) Introduction
  2) Merging Compressed Suffix Arrays
  3) Incremental CSA Construction for a Single Sequence
  4) Conclusions
Compressed Suffix Arrays
  Suffix arrays (SA) are full-text indexes for arbitrary texts.
    They consist of pointers to the suffixes of the text in lexicographic order.
    Quite large: 4 to 8 times the size of the text.
  Compressed suffix arrays (CSA) can be much smaller on highly repetitive texts.
    Finnish-language Wikipedia with version history (fiwiki): 42 GB for the text and 210 or 336 GB for the SA.
    A CSA requires just 2.1 GB, including the text.
  How to build a CSA for such a huge collection?
The Standard Approach
  Build a regular SA and compress it.
  Quite fast: O(n) or O(n log n) in theory.
    About 1 MB/s for the best algorithms on current hardware.
    About 12 hours for fiwiki.
  Memory intensive: need space for the text and the SA.
    At least 5 to 9 times the size of the text.
    At least 250 GB for fiwiki.
Other Approaches
  Secondary memory, dynamic indexes, direct CSA construction, ...
  Only a few implementations are available.
  Most are quite slow in practice.
    Often just 50 to 100 kB/s.
    5 to 10 days for fiwiki.
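The time and space figures quoted for fiwiki can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 10^9 bytes) and suffix array pointers of 5 or 8 bytes, which is an inference from the 210 GB and 336 GB figures and is not stated on the slides.

```python
GB = 10**9  # assumption: decimal gigabytes


def build_time_seconds(text_bytes, bytes_per_second):
    """Time needed to index a text at a constant throughput."""
    return text_bytes / bytes_per_second


fiwiki = 42 * GB

# Standard approach at about 1 MB/s
standard_hours = build_time_seconds(fiwiki, 10**6) / 3600

# Slower approaches at 100 kB/s and 50 kB/s
slow_days = [build_time_seconds(fiwiki, rate) / 86400 for rate in (100_000, 50_000)]

# Suffix array size with 5-byte and 8-byte pointers (assumed widths)
sa_gb = [fiwiki * width // GB for width in (5, 8)]

print(f"standard approach: {standard_hours:.1f} hours")
print("slow approaches:", [f"{d:.1f} days" for d in slow_days])
print("SA size:", sa_gb, "GB; text plus smaller SA:", 42 + sa_gb[0], "GB")
```

Under these assumptions the results land close to the quoted "about 12 hours", "5 to 10 days", and "at least 250 GB".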
The Best Algorithms So Far
  Space-efficient BWT construction (Kärkkäinen 2007).
    Keeps the text in memory.
  Distributed SA construction (Kulla and Sanders 2007).
    Keeps the text, the SA, and a lot of temporary data in distributed memory.
  Merge suffix arrays in secondary memory (Gonnet et al. 1992).
    I/O volume is O(n^2 / M) words: quickly becomes slow when the input grows past memory size.
Our Approach
  Build several smaller CSAs and merge them.
  Allows efficient parallel and distributed implementations.
  Easy to get O(n log n) time using |CSA| + O(n) bits of space.
  Results for the 42 GB fiwiki example:
    9.6 hours and 32 GB on 8 cores.
    25 hours and 8 GB on 2 cores.
  Supports bulk insertions and deletions.
Burrows-Wheeler Transform
  CSAs are based on a permutation of the text called the Burrows-Wheeler Transform (BWT).
  The BWT is related to the SA: BWT[i] = T[SA[i] - 1].
  A CSA is essentially a compressed representation of the BWT supporting rank_c(i) and select_c(i).
  We generalize the BWT for multiple texts.
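The relation BWT[i] = T[SA[i] - 1] and the two queries can be illustrated with a minimal sketch. The suffix array is built by naive sorting and rank/select are computed by scanning, so nothing here is compressed. The function names, the example text, and the conventions (rank counts occurrences in BWT[0..i-1], select is 1-based in j and returns a 0-based position) are assumptions made for illustration.

```python
def suffix_array(text):
    """Naive suffix array: sort the suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    # BWT[i] = T[SA[i] - 1]; index -1 wraps around to the end marker
    return "".join(text[i - 1] for i in sa)


def rank(bwt, c, i):
    """Number of occurrences of c in bwt[0:i]."""
    return bwt[:i].count(c)


def select(bwt, c, j):
    """0-based position of the j-th occurrence (j >= 1) of c in bwt."""
    pos = -1
    for _ in range(j):
        pos = bwt.index(c, pos + 1)
    return pos


text = "banana$"  # '$' is a unique end marker that sorts before the letters
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
print(sa, bwt)                # [6, 5, 3, 1, 0, 4, 2] annb$aa
print(rank(bwt, "a", 6))      # 2
print(select(bwt, "n", 2))    # 2
```

A real CSA answers the same queries on a compressed representation instead of scanning the string.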
BWT of a Collection
  Let $i < $(i+1) < c for every c ∈ Σ, T1 = ababbaa$1, T2 = abbaa$2, T3 = babba$3.
  The first three and last three sorted rotations of T1T2T3:
    $1abbaa$2babba$3ababbaa
    $2babba$3ababbaa$1abbaa
    $3ababbaa$1abbaa$2babba
    ...
    bba$3ababbaa$1abbaa$2ba
    bbaa$1abbaa$2babba$3aba
    bbaa$2babba$3ababbaa$1a
  The BWT of T1T2T3 is aaaaabbb$3bb$1bbb$2aaaa.
Generalized BWT
  Each character had a unique sort key:
    the suffix starting at the next position for regular characters;
    the next text in cyclical order for end markers.
  What if we used the same sort keys as in the BWTs of individual texts?
  What about {T1T2, T3}?
    T1T2T3         aaaaabbb$3bb$1bbb$2aaaa
    {T1, T2, T3}   aaaaabbb$1bb$2bbb$3aaaa
    {T1T2, T3}     aaaaabbb$2bb$1bbb$3aaaa
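The three variants above can be reproduced with a naive sketch that sorts all suffixes explicitly. Each sequence in a collection is treated as cyclic on its own, so the symbol preceding its first character is its own last end marker. The encoding of symbols as pairs and the names `encode` and `generalized_bwt` are assumptions made for this illustration, not part of the original algorithm.

```python
def encode(texts, first_marker=1):
    """Concatenate texts into one sequence of symbols.

    Characters become (1, c) and the k-th end marker becomes (0, k),
    so every end marker sorts before every character and $k < $(k+1).
    """
    seq = []
    for k, text in enumerate(texts, start=first_marker):
        seq.extend((1, c) for c in text)
        seq.append((0, k))
    return seq


def generalized_bwt(sequences):
    """BWT of a collection of sequences, each cyclic on its own."""
    rows = []
    for seq in sequences:
        for p in range(len(seq)):
            # sort key: the suffix starting at p; output: the preceding symbol
            rows.append((seq[p:], seq[p - 1]))
    rows.sort(key=lambda row: row[0])
    return "".join(f"${v}" if kind == 0 else v for _, (kind, v) in rows)


T1, T2, T3 = "ababbaa", "abbaa", "babba"
concatenated = generalized_bwt([encode([T1, T2, T3])])
separate = generalized_bwt([encode([T1], 1), encode([T2], 2), encode([T3], 3)])
mixed = generalized_bwt([encode([T1, T2], 1), encode([T3], 3)])

print(concatenated)  # aaaaabbb$3bb$1bbb$2aaaa
print(separate)      # aaaaabbb$1bb$2bbb$3aaaa
print(mixed)         # aaaaabbb$2bb$1bbb$3aaaa
```

Only the end markers differ between the three results, because the order of the suffixes is always decided at or before the first end marker.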
Some Bit Vectors
  A CSA is a compressed representation of the bit vectors Ψ_c.
  [Figure: bit vectors Ψ_a and Ψ_b aligned with BWT(T1T2, T3); their values are not preserved.]
    BWT(T1T2, T3)   a a a a a b b b $ b b $ b b b $ a a a a
    I               0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0
    BWT(T3)         . . a . . b . . . b . . b . . $ . a . .
  BWT(T1T2) and BWT(T3) are subsequences of BWT(T1T2, T3).
Merging BWTs
  If we have BWT(T1T2) and BWT(T3), we can use I to merge them to get BWT(T1T2, T3).
  Hon et al. (2007): the rank of a suffix of T3 among the suffixes of {T1T2, T3} is the sum of its ranks among the suffixes of each part.
    We can use backward searching to get the ranks among the suffixes of T1T2.
    We get the ranks among the suffixes of T3 implicitly from BWT(T3).
The Algorithm
  Merge(BWT(A), BWT(B), B):
      # The ranks among the suffixes of A
      I := BackwardSearch(BWT(A), B)
      Sort(I)
      # Add the ranks among the suffixes of B
      for i := 1 to |B|:
          I[i] := I[i] + i
      BWT(A, B) := Interleave(BWT(A), BWT(B), I)
      return BWT(A, B)
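The merging procedure can be sketched on uncompressed strings, using the running example with A = T1T2 and B = T3. This is a simplified illustration under stated assumptions, not the actual implementation: B is a single text, all end markers are written as a plain '$' that sorts before the letters, the end marker of B is larger than those of A, ranks are computed by scanning BWT(A) rather than with a CSA, and positions are 0-based. The helpers `encode` and `naive_bwt` are hypothetical and only build the inputs and a reference result.

```python
from collections import Counter


def encode(texts, first_marker):
    """Characters become (1, c); the k-th end marker becomes (0, k)."""
    seq = []
    for k, text in enumerate(texts, start=first_marker):
        seq += [(1, c) for c in text] + [(0, k)]
    return seq


def naive_bwt(sequences):
    """Reference BWT of a collection by sorting suffixes; end markers print as '$'."""
    rows = [(seq[p:], seq[p - 1]) for seq in sequences for p in range(len(seq))]
    rows.sort(key=lambda row: row[0])
    return "".join("$" if kind == 0 else v for _, (kind, v) in rows)


def merge(bwt_a, bwt_b, text_b):
    """Merge BWT(A) and BWT(B); text_b is the single text B without its end marker."""
    counts = Counter(bwt_a)

    def smaller(c):
        # number of suffixes of A that start with a symbol smaller than c
        return sum(n for s, n in counts.items() if s < c)

    # Backward search: ranks of the suffixes of B among the suffixes of A.
    # The end marker of B is assumed to be larger than every end marker of A.
    r = counts["$"]
    ranks = [r]
    for c in reversed(text_b):
        r = smaller(c) + bwt_a[:r].count(c)
        ranks.append(r)
    ranks.sort()

    # Add the ranks among the suffixes of B: the j-th smallest suffix of B
    # ends up at position ranks[j] + j of the merged BWT.
    from_b = {rank + j for j, rank in enumerate(ranks)}

    # Interleave the two BWTs according to the marked positions
    it_a, it_b = iter(bwt_a), iter(bwt_b)
    return "".join(next(it_b) if i in from_b else next(it_a)
                   for i in range(len(bwt_a) + len(bwt_b)))


T1, T2, T3 = "ababbaa", "abbaa", "babba"
bwt_a = naive_bwt([encode([T1, T2], 1)])
bwt_b = naive_bwt([encode([T3], 3)])
merged = merge(bwt_a, bwt_b, T3)
assert merged == naive_bwt([encode([T1, T2], 1), encode([T3], 3)])
print(merged)  # aaaaabbb$bb$bbb$aaaa
```

Sorting the ranks is enough to align them with BWT(B), because a smaller suffix of B never has a larger rank among the suffixes of A.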
A Related Algorithm
  Hon et al.: A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays. Algorithmica 48, pp. 23-36, 2007.
  Incremental construction of CSAs for a single sequence.
  The algorithm can be improved with some of our ideas.
A Comparison between the Algorithms
  Our Algorithm                               Hon et al.
  Merge CSAs                                  Incremental construction
  Multiple sequences                          A single sequence
  Split between sequences                     Split anywhere
  CSA(T1), CSA(T2), array of |T2| integers    CSA(T1), 4 arrays of |T2| integers
  Merge bit vectors                           Merge values of Ψ
Improving the Algorithm of Hon et al.
  We already have CSA(Y), and want to build CSA(XY).
  We call the first l = |X| suffixes of T the long suffixes of T.
    Let SA_l(T) be the subsequence of SA(T) containing them.
  SA_l(XY) can be built by using X, SA_l(Y), and the length-l prefix of Y (Hon et al. 2007).
  We could build a CSA_l(XY) instead.
  We could also build all the CSAs before merging them.
  Hence the peak memory usage becomes CSA(Y), CSA_l(XY), and an array of l integers.
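The definition of SA_l can be made concrete with a small sketch. It only illustrates what SA_l(XY) contains, by filtering a naively built suffix array; it does not implement the space-efficient construction of Hon et al. The function names and the example strings are assumptions for illustration.

```python
def suffix_array(text):
    """Naive suffix array: sort the suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def long_suffix_array(text, l):
    """SA_l(T): the entries of SA(T) that point to the first l suffixes."""
    return [i for i in suffix_array(text) if i < l]


X, Y = "abra", "cadabra$"
l = len(X)

sa_xy = suffix_array(X + Y)
sa_l = long_suffix_array(X + Y, l)
print(sa_xy)  # [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(sa_l)   # [0, 3, 1, 2]

# The remaining entries of SA(XY) are exactly SA(Y) shifted by l, so
# SA(XY) is an interleaving of SA_l(XY) and the shifted SA(Y).
rest = [i for i in sa_xy if i >= l]
assert rest == [i + l for i in suffix_array(Y)]
```

This is why an index for the long suffixes can be merged with the existing index for Y to obtain the index for XY.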
The Improved Algorithm
  Our Algorithm                               Hon et al.
  Merge CSAs                                  Incremental construction
  Multiple sequences                          A single sequence
  Split between sequences                     Split anywhere
  CSA(T1), CSA(T2), array of |T2| integers    CSA(T1), 4 arrays of |T2| integers
  Merge bit vectors                           Merge values of Ψ
Conclusions
  We have presented a practical space-efficient algorithm for merging two CSAs.
  Improved space-efficient CSA construction:
    A parallel implementation can index tens of gigabytes in a reasonable time.
    A distributed implementation should be able to handle multiple terabytes.
  Also possible:
    Bulk insertions and deletions on a static CSA.
    Merge wavelet trees and build FM-indexes directly.