1
Compressed Suffix Arrays based on Run-Length Encoding
Veli Mäkinen, Bielefeld University
Gonzalo Navarro, University of Chile
2
20.6.2005 Compressed suffix arrays based on run-length encoding
Abstract
We introduce a new full-text index that occupies $O(H_k |T|)$ bits and supports counting queries in $O(|P|)$ time.
- optimal space / search time on a constant alphabet
- works on any alphabet size $\sigma$, adding a $\log \sigma$ factor to the space/time bounds.
3
Introduction
We consider exact string matching on a static text. The task is to construct an index for the text such that the occurrences of a given pattern can be found efficiently.
A well-known optimal solution exists: build a suffix tree over the text.
4
Introduction...
The suffix-tree-based solution takes $O(|T| \log |T|)$ bits of space.
The text itself can be represented in $O(|T| \log \sigma)$ bits for an alphabet of size $\sigma$,
- or in even less space if the text is compressible.
In many applications the space usage is the real bottleneck, not the search efficiency.
5
Introduction...
During the last 15 years, many practical and theoretical solutions with reduced space complexity have been proposed. The work can roughly be divided into three categories:
(1) Reducing constant factors
(2) Concrete optimization
(3) Abstract optimization
6
Reducing constant factors
- Suffix arrays (Manber & Myers 1990)
- Suffix cactuses (Kärkkäinen 1995)
- Sparse suffix trees (Kärkkäinen & Ukkonen 1996)
- Space-efficient suffix trees (Kurtz 1998)
- Enhanced suffix arrays (Abouelhoda & Ohlebusch & Kurtz 2002)
7
Concrete optimization
"Minimizing automata"
- DAWGs (Blumer & Blumer & Haussler & McConnell & Ehrenfeucht 1983)
- Compact DAWGs (Crochemore & Vérin 1997)
- Compact suffix arrays (Mäkinen 2000)
8
Abstract optimization
Objective: Use as little space as possible to support the functionality of a given abstract definition of a data structure.
Space is measured in bits and is usually given in proportion to the entropy of the text.
9
Abstract optimization: Example
A full-text index for a given text T supports the following operations:
- Exists(P): is P a substring of T?
- Count(P): how many times does P occur in T?
- Report(P): list the occurrences of P in T.
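As a point of reference, the three operations can be pinned down with a brute-force scan over T. This is only a sketch of the abstract interface, not of any index from the talk; the function names `naive_report`, `naive_count` and `naive_exists` are made up for illustration, and positions are 1-based as on the slides.

```python
def naive_report(T, P):
    """Return all 1-based starting positions of P in T (overlaps included)."""
    return [i + 1 for i in range(len(T) - len(P) + 1) if T.startswith(P, i)]


def naive_count(T, P):
    """Return how many times P occurs in T."""
    return len(naive_report(T, P))


def naive_exists(T, P):
    """Return True if P is a substring of T."""
    return naive_count(T, P) > 0
```

A compressed index has to answer the same queries without scanning T, which is what the rest of the talk is about.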
10
Abstract optimization...
- Seminal work by Jacobson 1989: rank-select queries on bit-vectors.
- Rank-select-type structures for suffix trees (Munro & Raman & Rao & Clark 1996-)
- Lempel-Ziv index (Kärkkäinen & Ukkonen 1996)
11
Abstract optimization...
- Compressed suffix arrays (Grossi & Vitter 2000, Sadakane 2000, 2002)
- FM-index (Ferragina & Manzini 2000)
- LZ-self-index (Navarro 2002)
- Space-optimal full-text indexes (Grossi & Gupta & Vitter 2003, 2004)
- Alphabet-friendly FM-index (Ferragina & Manzini & Mäkinen & Navarro)
- See also ISAAC'04, SODA'05,...
12
This talk
We show that combining the FM-index with the compact suffix array gives a practical full-text index with a good space / search time tradeoff.
Our structure, the Run-Length FM-index, uses $O(\min(|T|(H_k \log \sigma + 1), |T| \log \sigma))$ bits and supports Count(P) in $O(|P| \log \sigma)$ time.
13
This talk...
$H_k = H_k(T)$ is the order-$k$ empirical entropy of $T$, i.e., "the average number of bits needed to encode a symbol using a fixed codebook for each possible combination of $k$ previous symbols".
There holds $0 \le H_k \le H_{k-1} \le \dots \le H_0 \le \log \sigma$.
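The definition can be made concrete with a small sketch. It assumes one common convention: for each length-k context, take the zeroth-order entropy of the symbols that follow it, weight by how many there are, and divide by |T|. The first k symbols have no full context and contribute nothing. The function names `h0` and `hk` are illustrative only.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Zeroth-order empirical entropy of s in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum(cnt / n * log2(n / cnt) for cnt in Counter(s).values())


def hk(text, k):
    """Order-k empirical entropy of text under the convention stated above."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        # Group each symbol by the k symbols that precede it.
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)
```

Evaluating `hk("kalevala#", k)` for k = 0, 1, 2, ... gives a non-increasing sequence bounded by the base-2 logarithm of the alphabet size, in line with the chain of inequalities on the slide.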
14
FM-index
Let us first describe a simple variant of the FM-index that
- occupies $O(|T| \log \sigma)$ bits, and
- supports counting queries in $O(|P| \log \sigma)$ time.
15
Simple FM-index
Construct the Burrows-Wheeler-transformed text bwt(T) [BW94].
From bwt(T) it is possible to construct the suffix array sa(T) of T in linear time.
Instead of constructing the whole sa(T), one can add small data structures alongside bwt(T) to simulate a search on sa(T).
16
Burrows-Wheeler transformation
Construct a matrix M that contains as rows all rotations of T.
Sort the rows in lexicographic order.
Let L be the last column and F be the first column.
bwt(T) = L, associated with the row number of T in the sorted M.
17
Example
pos 123456789
T = kalevala#

row  sa  M
1    9   #kalevala
2    8   a#kaleval
3    6   ala#kalev
4    2   alevala#k
5    4   evala#kal
6    1   kalevala#
7    7   la#kaleva
8    3   levala#ka
9    5   vala#kale

F is the first column of M and L is the last column.
==> L = alvkl#aae, row 6
Exercise: Given L and the row number, how to compute T and sa(T)?
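The construction can be reproduced with a direct, quadratic-space sketch that sorts all rotations explicitly. It assumes that T ends with a unique terminator that sorts before every other symbol, as `#` does here. The function name `bwt_by_rotations` is illustrative, and practical constructions avoid materializing the rotations.

```python
def bwt_by_rotations(text):
    """Return (L, row, sa) for text, using 1-based rows and positions.

    L   : last column of the sorted rotation matrix M
    row : row of M that holds text itself
    sa  : starting position of the rotation in each row of M
    """
    n = len(text)
    # Sort rotation start offsets by the rotation they denote.
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The last symbol of the rotation starting at i is text[i - 1] (cyclically).
    L = "".join(text[i - 1] for i in order)
    row = order.index(0) + 1
    sa = [i + 1 for i in order]
    return L, row, sa
```

For `kalevala#` this yields L = `alvkl#aae`, row 6 and sa = 9 8 6 2 4 1 7 3 5, matching the example.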
18
[Figure: LF mapping for T = kalevala#. F is obtained by sorting L.]
i   L[i]  F[i]  LF[i]
1   a     #     2
2   l     a     7
3   v     a     9
4   k     a     6
5   l     e     8
6   #     k     1
7   a     l     3
8   a     l     4
9   e     v     5
Following LF from row 6 reads T backwards through L: # (pos 9), a (8), l (7), a (6), v (5), e (4), l (3), a (2), k (1).
19
Implicit LF[i]
Ferragina and Manzini (2000) noticed the following connection:
$LF[i] = C_T[L[i]] + \mathrm{rank}_{L[i]}(L, i)$
Here
- $C_T[c]$: the number of occurrences of the letters $0, 1, \dots, c-1$ (those smaller than $c$) in $L = \mathrm{bwt}(T)$
- $\mathrm{rank}_c(L, i)$: the number of occurrences of the letter $c$ in the prefix $L[1, i]$
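One way to see the formula at work is to tabulate LF from L alone and then walk it to recover T, which also answers the earlier exercise for T. The sketch below is illustrative and assumes 1-based rows; the names `lf_mapping` and `invert_bwt` are not from the talk, and rank is computed with a running counter instead of a succinct structure.

```python
from collections import Counter


def lf_mapping(L):
    """Return LF as a list where LF[i] = C[L[i]] + rank_{L[i]}(L, i); LF[0] is unused."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total          # number of symbols in L smaller than c
        total += counts[c]
    seen = Counter()
    LF = [0]
    for ch in L:
        seen[ch] += 1         # seen[ch] is rank_ch(L, i) at the current i
        LF.append(C[ch] + seen[ch])
    return LF


def invert_bwt(L, row):
    """Recover T from L and the row of T in M by following LF, right to left."""
    LF = lf_mapping(L)
    out, i = [], row
    for _ in range(len(L)):
        out.append(L[i - 1])
        i = LF[i]
    return "".join(reversed(out))
```

With L = `alvkl#aae` and row 6, `invert_bwt` returns `kalevala#`. Recording the visited rows during the same walk gives sa(T), since the rows are visited in the order of text positions 1, 9, 8, ..., 2.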
20
Rank/Select
i              1 2 3 4 5 6 7 8 9 10 11 12
L              0 0 1 0 0 1 0 0 1 1  0  1
rank_1(L,i)    0 0 1 1 1 2 2 2 3 4  4  5

j              1 2 3 4  5
select_1(L,j)  3 6 9 10 12
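The two queries can be sketched with plain precomputed tables. This is a deliberately non-succinct illustration that uses O(n) machine words; the structures cited in the talk instead answer both queries in constant time with o(n) extra bits. The class name `BitVector` is an assumption for the example.

```python
class BitVector:
    """Bit vector with 1-based rank_1 and select_1 backed by lookup tables."""

    def __init__(self, bits):
        self.prefix = [0]     # prefix[i] = number of 1s in bits[1..i]
        self.ones = []        # ones[j - 1] = position of the j-th 1
        for pos, b in enumerate(bits, start=1):
            self.prefix.append(self.prefix[-1] + (b == "1"))
            if b == "1":
                self.ones.append(pos)

    def rank1(self, i):
        """Number of 1s among the first i bits."""
        return self.prefix[i]

    def select1(self, j):
        """Position of the j-th 1."""
        return self.ones[j - 1]
```

On the bit vector `001001001101` this reproduces the rank and select rows of the example.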
21
[Figure: the LF mapping for L = alvkl#aae, with LF[i] = 2 7 9 6 8 1 3 4 5 for i = 1, ..., 9.]
$LF[7] = C_T[a] + \mathrm{rank}_a(L, 7) = 1 + 2 = 3$
22
Backward search on bwt(T)
Observation: If $[i, j]$ is the range of rows of M that start with string X, then the range $[i', j']$ of rows that start with cX can be computed as
$i' := C_T[c] + \mathrm{rank}_c(L, i-1) + 1$,
$j' := C_T[c] + \mathrm{rank}_c(L, j)$.
23
Backward search on bwt(T) …
[Figure: M and L = alvkl#aae, with the rows starting with X = a marked as the range [i, j] = [2, 4].]
Does vX = va occur?
rank_v(L, i-1) = 0, rank_v(L, j) = 1, C['v'] = 8
i' := 8 + 0 + 1 = 9
j' := 8 + 1 = 9
24
Backward search on bwt(T) …
Algorithm Count(P[1,m], L[1,n], C_T[1,σ])
    c = P[m]; k = m;
    i = C_T[c]+1; j = C_T[c+1];        (taking C_T[σ+1] = n)
    while (i ≤ j and k > 1) do begin
        c = P[k-1]; k = k-1;
        i = C_T[c]+rank_c(L,i-1)+1;
        j = C_T[c]+rank_c(L,j);
    end;
    if (j < i) then return 0 else return (j-i+1);
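The counting algorithm translates almost line by line into runnable code. The sketch below makes two simplifying assumptions that are not part of the talk: rank is answered by a naive scan of L, so the running time is far from the stated bound, and symbols are looked up in a dictionary instead of being integers 1..σ. The names `build_c`, `rank` and `count` are illustrative.

```python
from collections import Counter


def build_c(L):
    """Return (C, counts): C[c] = number of symbols in L smaller than c."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return C, counts


def rank(L, c, i):
    """Occurrences of c in L[1..i]; a naive O(i) scan standing in for a real rank structure."""
    return L[:i].count(c)


def count(P, L):
    """Number of occurrences of the non-empty pattern P in the text whose BWT is L."""
    C, counts = build_c(L)
    k = len(P)
    c = P[k - 1]
    if c not in C:
        return 0
    i, j = C[c] + 1, C[c] + counts[c]   # rows of M starting with the last symbol of P
    while i <= j and k > 1:
        k -= 1
        c = P[k - 1]                    # next symbol, moving right to left
        if c not in C:
            return 0
        i = C[c] + rank(L, c, i - 1) + 1
        j = C[c] + rank(L, c, j)
    return max(0, j - i + 1)
```

For L = `alvkl#aae`, the BWT of `kalevala#`, `count("va", L)` narrows the range from [2, 4] to [9, 9] and returns 1, as in the example for X = a and c = v.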
25
Backward search on bwt(T)...
Array $C_T[1, \sigma]$ takes $O(\sigma \log |T|)$ bits.
$L = \mathrm{bwt}(T)$ takes $O(|T| \log \sigma)$ bits.
Assuming $\mathrm{rank}_c(L, i)$ can be computed in constant time for each $(c, i)$, the algorithm takes $O(|P|)$ time to count the occurrences of P in T.
26
Answering rank_c(L,i)
The wavelet tree (GGV 2003) is a data structure replacing $L = \mathrm{bwt}(T)$ that
- supports $\mathrm{rank}_c(L, i)$ in $O(\log \sigma)$ time, and
- occupies $|T| H_0(T) + o(|T|)$ bits.
The generalized wavelet tree (FMMN 2004) improves the query time to constant when $\sigma = O(\mathrm{polylog}(|T|))$.
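The idea behind the wavelet tree can be sketched with a pointer-based, balanced version: each node splits its alphabet in half, stores one bit per symbol saying which half it belongs to, and a rank query follows a single root-to-leaf path. This illustration stores plain prefix-count lists instead of compressed bit vectors, so it shows the query logic and not the space bound. The class name `WaveletTree` and its layout are assumptions for the example.

```python
class WaveletTree:
    """Balanced wavelet tree over a sequence, supporting rank(c, i)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return                                  # leaf: at most one distinct symbol
        mid = len(self.alphabet) // 2
        self.left_set = set(self.alphabet[:mid])
        # prefix[i] = how many of the first i symbols go to the right child
        self.prefix = [0]
        for ch in seq:
            self.prefix.append(self.prefix[-1] + (ch not in self.left_set))
        self.left = WaveletTree(
            [ch for ch in seq if ch in self.left_set], self.alphabet[:mid])
        self.right = WaveletTree(
            [ch for ch in seq if ch not in self.left_set], self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        if c not in self.alphabet:
            return 0
        if self.left is None:
            return i                                # every symbol in this leaf equals c
        ones = self.prefix[i]
        if c in self.left_set:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)
```

A query descends one level per halving of the alphabet, which is where the logarithmic factor in the alphabet size comes from. `WaveletTree("alvkl#aae").rank("a", 7)` returns 2, the value used when computing LF[7].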
27
Simple FM-index...
We obtained a structure that
- occupies $O(|T| H_0(T))$ bits, and
- supports counting queries in $O(|P| \log \sigma)$ time.
The original FM-index takes $O(H_k |T|)$ bits, but only on a constant alphabet.
Compression boosting can be applied to improve the simple FM-index to take only $O(|T| H_k(T))$ bits (FMMN 2004).
28
To partition or not...
All alphabet-friendly solutions obtaining $O(|T| H_k(T))$ space for compressed suffix arrays use optimal partitioning of the BWT text, and explicitly store the symbol distribution for each piece.
- always an overhead term in $\sigma^{k+1}$.
MTF + zeroth-order coding takes $O(|T| H_k(T))$ bits plus a term in $\sigma^k$, but supporting queries on larger alphabets is non-trivial.
29
Run-Length FM-index
We make the following changes to the previous FM-index variant:
- $L = \mathrm{bwt}(T)$ is replaced by a sequence $S[1, n']$ and two bit-vectors $B[1, |T|]$ and $B'[1, |T|]$,
- the cumulative array $C_T[1, \sigma]$ is replaced by $C_S[1, \sigma]$,
- the wavelet tree is built on S, and
- some formulas are changed.
30
Run-Length FM-index...
L  = c c c a a g g a t t
B  = 1 0 0 1 0 1 0 1 1 0
S  = c     a   g   a t
B' = 1 0 1 1 0 0 1 0 1 0
F  = a a a c c c g g t t
In this example S keeps one symbol per run of L, B marks where each run starts in L, and B' marks the run starts after the runs are stably sorted by symbol, so that B' lines up with F.
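The three components can be derived from L in one pass plus a stable sort of the runs. The sketch below is a reading of the example, not the authors' construction code; the function name `run_length_structures` is an assumption. It returns the bit vectors as strings for easy comparison with the slide.

```python
def run_length_structures(L):
    """Return (S, B, Bp) for L: run symbols, run starts in L, run starts in F order."""
    S, lengths = [], []
    for ch in L:
        if S and S[-1] == ch:
            lengths[-1] += 1          # extend the current run
        else:
            S.append(ch)              # start a new run
            lengths.append(1)
    B = "".join("1" + "0" * (length - 1) for length in lengths)
    # sorted() is stable, so runs of the same symbol keep their order from L.
    order = sorted(range(len(S)), key=lambda r: S[r])
    Bp = "".join("1" + "0" * (lengths[r] - 1) for r in order)
    return "".join(S), B, Bp
```

For L = `cccaaggatt` this gives S = `cagat`, B = `1001010110` and B' = `1011001010`, as in the example.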
31
Changes to formulas
Recall that we need to compute $C_T[c] + \mathrm{rank}_c(L, i)$ in the backward search.
Theorem: $C_T[c] + \mathrm{rank}_c(L, i)$ is equivalent to
$\mathrm{select}_1(B', C_S[c] + 1 + \mathrm{rank}_c(S, \mathrm{rank}_1(B, i))) - 1$, when $L[i] \neq c$, and otherwise to
$\mathrm{select}_1(B', C_S[c] + \mathrm{rank}_c(S, \mathrm{rank}_1(B, i))) + i - \mathrm{select}_1(B, \mathrm{rank}_1(B, i))$.
32
Example, L[i] = c
L  = cccaaggatt
F  = aaacccggtt
B  = 1001010110
S  = cagat
B' = 1011001010
$LF[8] = \mathrm{select}_1(B', C_S[a] + \mathrm{rank}_a(S, \mathrm{rank}_1(B, 8))) + 8 - \mathrm{select}_1(B, \mathrm{rank}_1(B, 8))$
$= \mathrm{select}_1(B', 0 + \mathrm{rank}_a(S, 4)) + 8 - \mathrm{select}_1(B, 4)$
$= \mathrm{select}_1(B', 0 + 2) + 8 - 8$
$= 3$
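The run-length formula for $C_T[c] + \mathrm{rank}_c(L, i)$ can be checked mechanically against the direct definition. The sketch below stores the positions of the 1-bits of B and B' in plain lists and uses naive rank, so it only verifies the arithmetic. Two details are assumptions of this sketch: $C_S[c]$ is taken to be the number of runs whose symbol is smaller than c, and a sentinel makes select on B' one past its last 1-bit return |L| + 1. The names `RunLengthRank` and `direct_value` are illustrative.

```python
class RunLengthRank:
    """Evaluate C_T[c] + rank_c(L, i) through S, B, B' and C_S only."""

    def __init__(self, L):
        self.L = L
        S, lengths = [], []
        for ch in L:
            if S and S[-1] == ch:
                lengths[-1] += 1
            else:
                S.append(ch)
                lengths.append(1)
        self.S = S
        order = sorted(range(len(S)), key=lambda r: S[r])    # stable sort by symbol
        self.B_ones = self._starts(lengths)                   # 1-bit positions of B
        self.Bp_ones = self._starts([lengths[r] for r in order])
        self.Bp_ones.append(len(L) + 1)                       # sentinel (assumption)
        self.C_S = {c: sum(1 for s in S if s < c) for c in set(S)}

    @staticmethod
    def _starts(lengths):
        starts, pos = [], 1
        for length in lengths:
            starts.append(pos)
            pos += length
        return starts

    def value(self, c, i):
        """Formula value for 1 <= i <= |L| and a symbol c that occurs in L."""
        r = sum(1 for p in self.B_ones if p <= i)             # rank_1(B, i)
        k = self.S[:r].count(c)                               # rank_c(S, r)
        if self.L[i - 1] != c:
            # select_1(B', C_S[c] + 1 + k) - 1
            return self.Bp_ones[self.C_S[c] + k] - 1
        # select_1(B', C_S[c] + k) + i - select_1(B, r)
        return self.Bp_ones[self.C_S[c] + k - 1] + i - self.B_ones[r - 1]


def direct_value(L, c, i):
    """C_T[c] + rank_c(L, i) computed straight from L."""
    return sum(1 for ch in L if ch < c) + L[:i].count(c)
```

For L = `cccaaggatt`, `RunLengthRank(L).value("a", 8)` returns 3, the value derived in the example, and the two functions agree for every symbol and position of that string.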
33
Space requirement
$C_S[1, \sigma]$ takes $O(\sigma \log |T|)$ bits.
B and B' with rank/select dictionaries take $2|T| + o(|T|)$ bits.
S represented using a wavelet tree occupies $|S| H_0(S) + o(|S|)$ bits.
In CPM 2004, we have shown that $|S| \le H_k |T| + \sigma^k$.
34
Comparison