Compressed Suffix Arrays based on Run-Length Encoding
Veli Mäkinen, Bielefeld University
Gonzalo Navarro, University of Chile

20.6.2005 Compressed suffix arrays based on run-length encoding
Abstract
• We introduce a new full-text index that occupies $O(H_k |T|)$ bits and supports counting queries in $O(|P|)$ time.
  - Optimal space and search time on a constant-size alphabet.
  - Works on any alphabet size $\sigma$, adding a $\log \sigma$ factor to the space and time bounds.

Introduction
• We consider exact string matching on static text.
• The task is to construct an index for the text such that the occurrences of a given pattern can be found efficiently.
• A well-known optimal solution exists: build a suffix tree over the text.

Introduction...
• The suffix-tree-based solution takes $O(|T| \log |T|)$ bits of space.
• The text itself can be represented in $O(|T| \log \sigma)$ bits.
  - Or even less space if the text is compressible.
• In many applications the space usage is the real bottleneck, not the search efficiency.

Introduction...
• During the last 15 years, many practical and theoretical solutions with reduced space complexities have been proposed.
• The work can roughly be divided into three categories:
  (1) Reducing constant factors
  (2) Concrete optimization
  (3) Abstract optimization

Reducing constant factors
• Suffix arrays (Manber & Myers 1990)
• Suffix cactuses (Kärkkäinen 1995)
• Sparse suffix trees (Kärkkäinen & Ukkonen 1996)
• Space-efficient suffix trees (Kurtz 1998)
• Enhanced suffix arrays (Abouelhoda & Ohlebusch & Kurtz 2002)

Concrete optimization
• “Minimizing automata”
• DAWGs (Blumer & Blumer & Haussler & McConnell & Ehrenfeucht 1983)
• Compact DAWGs (Crochemore & Vérin 1997)
• Compact suffix arrays (Mäkinen 2000)

Abstract optimization
• Objective: use as little space as possible to support the functionality of a given abstract definition of a data structure.
• Space is measured in bits and is usually stated in terms of the entropy of the text.

Abstract optimization: Example
• A full-text index for a given text T supports the following operations:
  - Exists(P): is P a substring of T?
  - Count(P): how many times does P occur in T?
  - Report(P): list the occurrences of P in T.

Abstract optimization...
• Seminal work by Jacobson 1989: rank-select queries on bit-vectors.
• Rank-select-type structures for suffix trees (Munro & Raman & Rao & Clark 1996-)
• Lempel-Ziv index (Kärkkäinen & Ukkonen 1996)

Abstract optimization...
• Compressed suffix arrays (Grossi & Vitter 2000, Sadakane 2000, 2002)
• FM-index (Ferragina & Manzini 2000)
• LZ-self-index (Navarro 2002)
• Space-optimal full-text indexes (Grossi & Gupta & Vitter 2003, 2004)
• Alphabet-friendly FM-index (Ferragina & Manzini & Mäkinen & Navarro)
• See also ISAAC'04, SODA'05,...

This talk
• We show that combining the FM-index with the compact suffix array gives a practical full-text index with a good space / search time tradeoff.
• Our structure, the Run-Length FM-index, uses $O(\min(|T|(H_k \log \sigma + 1), |T| \log \sigma))$ bits and supports Count(P) in $O(|P| \log \sigma)$ time.

This talk...
• $H_k = H_k(T)$ is the order-k empirical entropy of T, i.e., “the average number of bits needed to encode a symbol using a fixed codebook for each possible combination of k previous symbols”.
• There holds $0 \le H_k \le H_{k-1} \le \dots \le H_0 \le \log \sigma$.
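The entropy chain above can be checked numerically. The following is a minimal sketch, assuming the usual definition in which each symbol is charged the zeroth-order entropy of the symbols that share its k preceding symbols; the function names `h0` and `hk` and the sample text are illustrative choices, not part of the talk.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum(cnt / n * log2(n / cnt) for cnt in Counter(s).values())

def hk(text, k):
    """Order-k empirical entropy: every symbol is coded with the
    zeroth-order model of the symbols sharing its k-symbol context."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)          # context -> symbols that follow it
    for pos in range(k, len(text)):
        followers[text[pos - k:pos]].append(text[pos])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)

text = "kalevala" * 4                      # hypothetical sample text
sigma = len(set(text))
values = [hk(text, k) for k in range(4)]   # H_0, H_1, H_2, H_3
```

On any input, `values` should be non-increasing and bounded above by `log2(sigma)`, which mirrors the inequality on the slide.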

FM-index
• Let us first describe a simple variant of the FM-index that:
  - occupies $O(|T| \log \sigma)$ bits, and
  - supports counting queries in $O(|P| \log \sigma)$ time.

Simple FM-index
• Construct the Burrows-Wheeler-transformed text bwt(T) [BW94].
• From bwt(T) it is possible to construct the suffix array sa(T) of T in linear time.
• Instead of constructing the whole sa(T), one can add small data structures besides bwt(T) to simulate a search from sa(T).

Burrows-Wheeler transformation
• Construct a matrix M that contains as rows all rotations of T.
• Sort the rows in lexicographic order.
• Let L be the last column and F be the first column.
• bwt(T) = L, together with the row number of T in the sorted M.
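The four steps above translate directly into a few lines of code. This is a naive sketch that materialises every rotation (quadratic space), meant only to make the definition concrete; the function name `bwt` and its return shape are assumptions, and it relies on the terminator `#` sorting before the letters, as it does in ASCII.

```python
def bwt(text):
    """Return (L, F, row): the last and first columns of the sorted
    rotation matrix M, and the 1-based row of M that holds text itself."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last = "".join(rot[-1] for rot in rotations)
    first = "".join(rot[0] for rot in rotations)
    return last, first, rotations.index(text) + 1

L, F, row = bwt("kalevala#")
```

Practical constructions derive L from a suffix array instead of sorting explicit rotations.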

Example
pos     123456789
T =     kalevala#
row:sa  M (first column F, last column L)
1:9     #kalevala
2:8     a#kaleval
3:6     ala#kalev
4:2     alevala#k
5:4     evala#kal
6:1     kalevala#
7:7     la#kaleva
8:3     levala#ka
9:5     vala#kale
==> L = alvkl#aae, row 6
Exercise: Given L and the row number, how to compute T and sa(T)?

Example...
i      1 2 3 4 5 6 7 8 9
L[i]   a l v k l # a a e
F[i]   # a a a e k l l v   (F = L sorted)
LF[i]  2 7 9 6 8 1 3 4 5
T^-1   # a l a v e l a k   (T read backwards, following LF from row 6)
pos    9 8 7 6 5 4 3 2 1   (text position of each symbol of T^-1)

Implicit LF[i]
• Ferragina and Manzini (2000) noticed the following connection:
  $LF[i] = C_T[L[i]] + \mathrm{rank}_{L[i]}(L,i)$
• Here
  - $C_T[c]$: the number of letters $0,1,\ldots,c-1$ in $L = \mathrm{bwt}(T)$,
  - $\mathrm{rank}_c(L,i)$: the number of letters $c$ in the prefix $L[1,i]$.
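The LF formula is enough to recover both T and sa(T) from L and the row number of T. The sketch below is one possible way to do that under the slide's 1-based conventions; `lf_mapping` and `invert_bwt` are hypothetical helper names, and the counters are plain Python dictionaries rather than succinct structures.

```python
def lf_mapping(L):
    """Return (C, LF) with 1-based LF values:
    LF[i] = C[L[i]] + rank_{L[i]}(L, i)."""
    C, total = {}, 0
    for ch in sorted(set(L)):
        C[ch] = total                  # symbols of L smaller than ch
        total += L.count(ch)
    seen, LF = {}, []
    for ch in L:
        seen[ch] = seen.get(ch, 0) + 1 # rank of this occurrence of ch
        LF.append(C[ch] + seen[ch])
    return C, LF

def invert_bwt(L, row):
    """Recover T and sa(T) from L and the 1-based row of T in M."""
    _, LF = lf_mapping(L)
    n = len(L)
    text, sa = [""] * n, [0] * n
    i, start = row, 1                  # row i holds the rotation starting at `start`
    for _ in range(n):
        sa[i - 1] = start
        prev = start - 1 if start > 1 else n
        text[prev - 1] = L[i - 1]      # L[i] is the symbol preceding that rotation
        i, start = LF[i - 1], prev
    return "".join(text), sa

C, LF = lf_mapping("alvkl#aae")
T, sa = invert_bwt("alvkl#aae", 6)
```

Each step moves from a row to the row of the rotation that starts one text position earlier, so T is produced right to left.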

Rank/Select
L              0 0 1 0 0 1 0 0 1 1 0 1
rank_1(L,i)    0 0 1 1 1 2 2 2 3 4 4 5
select_1(L,j)  3 6 9 10 12
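To make the two queries concrete, here is a small sketch that answers them from precomputed tables. It is not a succinct structure: the tables take O(n log n) bits, whereas Jacobson-style dictionaries need only o(n) bits on top of the bit-vector. The class name `BitVector` and its method names are illustrative.

```python
class BitVector:
    """Bit-vector with table-based rank/select (1-based positions)."""

    def __init__(self, bits):
        self.bits = [int(b) for b in bits]
        self.prefix = [0]                      # prefix[i] = ones in bits[1..i]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        self.ones = [i + 1 for i, b in enumerate(self.bits) if b]

    def rank1(self, i):
        """Number of 1-bits in the prefix of length i."""
        return self.prefix[i]

    def select1(self, j):
        """Position of the j-th 1-bit (j >= 1)."""
        return self.ones[j - 1]

bv = BitVector("001001001101")
ranks = [bv.rank1(i) for i in range(1, 13)]
selects = [bv.select1(j) for j in range(1, 6)]
```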

Example...
i      1 2 3 4 5 6 7 8 9
L[i]   a l v k l # a a e
F[i]   # a a a e k l l v
LF[i]  2 7 9 6 8 1 3 4 5
$LF[7] = C_T[a] + \mathrm{rank}_a(L,7) = 1 + 2 = 3$

Backward search on bwt(T)
• Observation: If $[i,j]$ is the range of rows of M that start with string X, then the range $[i',j']$ of rows that start with cX can be computed as
  $i' := C_T[c] + \mathrm{rank}_c(L,i-1) + 1$,
  $j' := C_T[c] + \mathrm{rank}_c(L,j)$.

Backward search on bwt(T)...
T = kalevala#, L = alvkl#aae
Rows of M start with: #k, a#, al, al, ev, ka, la, le, va
X = a: the rows starting with X are [i,j] = [2,4]
vX = va?
rank_v(L,i-1) = 0, rank_v(L,j) = 1, C['v'] = 8
i' := 8 + 0 + 1 = 9
j' := 8 + 1 = 9

Backward search on bwt(T)...
Algorithm Count(P[1,m], L[1,n], C_T[1,σ])
  c = P[m]; k = m;
  i = C_T[c]+1; j = C_T[c+1];
  while (i ≤ j and k > 1) do begin
    c = P[k-1]; k = k-1;
    i = C_T[c]+rank_c(L,i-1)+1;
    j = C_T[c]+rank_c(L,j);
  end;
  if (j < i) then return 0 else return (j-i+1);
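The Count algorithm can be tried out with a few lines of code. This sketch makes two simplifying assumptions that are not in the talk: rank is answered by a linear scan of L, and the search starts from the full range [1,n] so that the first pattern symbol is handled by the same update as the others (which yields the same initial range C_T[c]+1 .. C_T[c+1]). The names `build_index`, `rank` and `count` are illustrative.

```python
def build_index(text):
    """Return (L, C) for a text that ends with a unique smallest terminator."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    L = "".join(rot[-1] for rot in rotations)
    C, total = {}, 0
    for ch in sorted(set(text)):
        C[ch] = total                      # symbols of the text smaller than ch
        total += text.count(ch)
    return L, C

def rank(L, c, i):
    """Occurrences of c in L[1..i]; a scan stands in for a real rank structure."""
    return L[:i].count(c)

def count(P, L, C):
    """Number of occurrences of the non-empty pattern P, by backward search."""
    i, j = 1, len(L)                       # range of all rows (empty suffix of P)
    for c in reversed(P):
        if c not in C:
            return 0
        i = C[c] + rank(L, c, i - 1) + 1
        j = C[c] + rank(L, c, j)
        if i > j:
            return 0
    return j - i + 1

L, C = build_index("kalevala#")
```

For instance, `count("va", L, C)` narrows the range for `a`, which is [2,4], to the single row 9. Replacing `rank` with a wavelet tree query gives the time bound discussed in the talk.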

Backward search on bwt(T)...
• Array $C_T[1,\sigma]$ takes $O(\sigma \log |T|)$ bits.
• L = bwt(T) takes $O(|T| \log \sigma)$ bits.
• Assuming $\mathrm{rank}_c(L,i)$ can be computed in constant time for each $(c,i)$, the algorithm takes $O(|P|)$ time to count the occurrences of P in T.

Answering rank_c(L,i)
• The wavelet tree (GGV 2003) is a data structure replacing L = bwt(T):
  - it supports $\mathrm{rank}_c(L,i)$ in $O(\log \sigma)$ time, and
  - it occupies $|T| H_0(T) + o(|T|)$ bits.
• The generalized wavelet tree (FMMN 2004) improves the query time to constant when $\sigma = O(\mathrm{polylog}(|T|))$.
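The idea behind the wavelet tree can be sketched as follows: each node splits its alphabet in half and stores one bit per symbol saying which half it belongs to, so a rank query walks one root-to-leaf path of about log σ nodes. The version below is a pointer-based, uncompressed illustration only; it stores prefix counts instead of succinct bit-vectors, so it does not achieve the stated space bound. The class name `WaveletTree` is an assumption.

```python
class WaveletTree:
    """Balanced wavelet tree over the sorted alphabet of seq."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            return                                   # leaf: nothing to store
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # prefix[i] = how many of the first i symbols go to the right child
        self.prefix = [0]
        for ch in seq:
            self.prefix.append(self.prefix[-1] + (ch not in self.left_syms))
        self.left = WaveletTree(
            [ch for ch in seq if ch in self.left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [ch for ch in seq if ch not in self.left_syms], self.alphabet[mid:])

    def rank(self, c, i):
        """Occurrences of c among the first i symbols of the sequence."""
        node = self
        while len(node.alphabet) > 1:
            ones = node.prefix[i]
            if c in node.left_syms:
                i, node = i - ones, node.left        # keep the 0-bits
            else:
                i, node = ones, node.right           # keep the 1-bits
        return i if node.alphabet == [c] else 0

wt = WaveletTree("alvkl#aae")
```

At a leaf, the remaining prefix length is exactly the number of occurrences of that leaf's symbol, which is why the loop can simply return `i`.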

Simple FM-index...
• We obtained a structure that
  - occupies $O(|T| H_0(T))$ bits, and
  - supports counting queries in $O(|P| \log \sigma)$ time.
• The original FM-index takes $O(H_k |T|)$ bits, but only on a constant-size alphabet.
• Compression boosting can be applied to improve the simple FM-index to take only $O(|T| H_k(T))$ bits (FMMN 2004).

To partition or not...
• All alphabet-friendly solutions obtaining $O(|T| H_k(T))$ space for compressed suffix arrays use optimal partitioning of the BWT text, and explicitly store the distribution for each piece.
  - This always incurs an overhead term in $\sigma^{k+1}$.
• MTF + zeroth-order coding takes $O(|T| H_k(T))$ bits plus a term in $\sigma^k$, but supporting queries on larger alphabets is non-trivial.

Run-Length FM-index
• We make the following changes to the previous FM-index variant:
  - L = bwt(T) is replaced by a sequence $S[1,n']$ and two bit-vectors $B[1,|T|]$ and $B'[1,|T|]$,
  - the cumulative array $C_T[1,\sigma]$ is replaced by $C_S[1,\sigma]$,
  - the wavelet tree is built on S, and
  - some formulas are changed.

Run-Length FM-index...
L   c c c a a g g a t t
B   1 0 0 1 0 1 0 1 1 0
S   c a g a t
B'  1 0 1 1 0 0 1 0 1 0
F   a a a c c c g g t t

Changes to formulas
• Recall that we need to compute $C_T[c] + \mathrm{rank}_c(L,i)$ in the backward search.
• Theorem: $C_T[c] + \mathrm{rank}_c(L,i)$ is equivalent to
  $\mathrm{select}_1(B', C_S[c] + 1 + \mathrm{rank}_c(S, \mathrm{rank}_1(B,i))) - 1$
  when $L[i] \neq c$, and otherwise to
  $\mathrm{select}_1(B', C_S[c] + \mathrm{rank}_c(S, \mathrm{rank}_1(B,i))) + i - \mathrm{select}_1(B, \mathrm{rank}_1(B,i))$.

Example, L[i]=c
L   c c c a a g g a t t
B   1 0 0 1 0 1 0 1 1 0
S   c a g a t
B'  1 0 1 1 0 0 1 0 1 0
F   a a a c c c g g t t
$LF[8] = \mathrm{select}_1(B', C_S[a] + \mathrm{rank}_a(S, \mathrm{rank}_1(B,8))) + 8 - \mathrm{select}_1(B, \mathrm{rank}_1(B,8))$
$= \mathrm{select}_1(B', 0 + \mathrm{rank}_a(S,4)) + 8 - \mathrm{select}_1(B,4)$
$= \mathrm{select}_1(B', 0 + 2) + 8 - 8 = 3$
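The run-length representation and the theorem can be checked by brute force on the example. The sketch below assumes that B marks the first position of each run of L, that S keeps one symbol per run, and that B' marks run starts after the runs are stably sorted by symbol (so it lines up with F); this reading is consistent with the example values. The helper names, the scan-based rank/select, and the convention that select returns |B'|+1 when there are too few ones are all illustrative choices.

```python
def build_rlfm(L):
    """Return (S, B, B2) for L, where B2 plays the role of B'."""
    runs = []                                   # [symbol, run length]
    for ch in L:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    S = "".join(ch for ch, _ in runs)
    B = "".join("1" + "0" * (length - 1) for _, length in runs)
    by_symbol = sorted(runs, key=lambda r: r[0])    # stable sort of the runs
    B2 = "".join("1" + "0" * (length - 1) for _, length in by_symbol)
    return S, B, B2

def rank(seq, c, i):
    """Occurrences of c in seq[1..i]."""
    return seq[:i].count(c)

def select1(bits, j):
    """Position of the j-th 1 (j >= 1); len(bits)+1 if there is none."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        seen += b == "1"
        if seen == j:
            return pos
    return len(bits) + 1

def c_plus_rank(S, B, B2, c, i):
    """C_T[c] + rank_c(L, i), computed from S, B and B2 only."""
    cs = sum(1 for ch in S if ch < c)           # C_S[c]
    run = rank(B, "1", i)                       # index of the run containing i
    k = cs + rank(S, c, run)
    if S[run - 1] != c:                         # L[i] != c
        return select1(B2, k + 1) - 1
    return select1(B2, k) + i - select1(B, run)

S, B, B2 = build_rlfm("cccaaggatt")
```

Note that L[i] never has to be read directly: it equals S[rank_1(B,i)], so the query touches only the run-length structures.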

Space requirement
• $C_S[1,\sigma]$ takes $O(\sigma \log |T|)$ bits.
• B and B' with rank/select dictionaries take $2|T| + o(|T|)$ bits.
• S represented using a wavelet tree occupies $|S| H_0(S) + o(|S|)$ bits.
• In CPM 2004, we showed that $|S| \le H_k |T| + \sigma^k$.

Comparison 560
