Compressed Suffix Arrays and Suffix Trees Roberto Grossi, Jeffery Scott Vitter
2 Outline Reminders Motivation Compression results Time & Space bounds Compressed Suffix Tree Compressed Suffix Array Proof of bounds
3 Reminder - Symbols T = t 1 t 2...t n-1 text of length n-1 eof symbol # at the n th position T[i,n] is suffix i of text T i=1,…,n
4 Reminder - Symbols P = p 1 p 2...p m pattern of length m 0<ε≤1
5 Reminder - Main Goal Search string pattern P within text T Support fast queries Text T being fully scanned only once
6 Reminder – Suffix Trees Leaf with value i represents suffix [i,n] Build time O(n) Search time O(m) Structure space O(n)
7 Reminder – Suffix Arrays Lexicographically ordered SA[i] = the starting position in T of the i-th suffix Σ={a,b} a<#<b T = bbba# a##ba#bba#bbba#
8 Reminder – Suffix Arrays Build time O(nlogn) Search time O(m+logn) Structure space O(n)
9 Motivation So Far Greedy in space Fast searching Need for space-efficient text indexing Reduce both space and query time
10 Compressed Suffix Tree Build time O(n) Search time O(m/logn+(logn) ε ) Structure space (ε -1 +O(1)) n
11 Compressed Suffix Tree Build Suffix Array Build Compressed Suffix Tree Patricia Tries Compress Suffix Array
12 CSA Basic Operations Compress(T,SA) Return succinct representation of SA Retain T Discard SA Lookup(i) Return SA[i] Use compressed SA
13 CSA Primary measures Compress Preprocessing compressed SA Space of compressed SA lookup Query time
14 Compressed Suffix Array Build time O(n) Structure space ½nloglogn + O(n) lookup time O(loglogn)
15 Suffix Arrays Optimization Main idea Decomposition scheme Recursive structure of permutations
16 Decomposition Scheme K levels, K=0,….,l SA 0 = SA ( Original SA) n 0 =n n = |T| assumption - n is a power of 2 n k =n/2 k SA k ={1,2,…,n k )
17 SA k Succinct Representation 4 main steps: 1. Produce bit vector B k 2. Map B k 0’s to 1’s 3. Compute 1’s for each prefix in B k Using function rank k (j) 4. ‘Pack’ SA k
18 Step #1: Produce bit vector B k |B k | = n k B k [i]=1 if SA k [i] is even B k [i]=0 if SA k [i] is odd T = bba# 2431 SA 0 BoBo 1100
19 Step #2 : Map B k 0’s to 1’s New Fuction Ψ k (i), i=1,…,n k Ψ k (i) = jSA k [i] is odd and SA k [j]= SA k [i]+1 iotherwise (SA k [i] is even) T = bba# 2431 BoBo ΨoΨo SA 0
20 Step #3 : Compute 1’s for B k Recall fuction rank k (j), j=1,…,l k rank k (j) = number of 1’s on first j bits of B k T = bba# 2431 BoBo 1100 SA rank o
21 Step #4 : ‘Pack’ SA k Pack even values of SA k Divide by 2 New permutation {1,2,..,n k+1 } n k+1 =n k /2=n/2 k+1 Store new permutation into SA k+1 Remove SA k |SA k+1 | = |SA k |/2
22 Example: level 0, steps 1-3
23 Example: level 0, step 4
24 Lemma : Reconstruct SA k Results of phase k B k, Ψ k, rank k,SA k+1 Reconstruct SA k SA k [i] = 2*SA k+1 [rank k (Ψ k (i))] + (B k [i]-1) i = 1,….n k
25 Proof, case 1, B k [i] = 1 SA k [i] = 2*SA k+1 [rank k (Ψ k (i))] + (B k [i]-1) Step #4 : SA k [i]/2 stored in rank k (i) th entry of SA k+1 SA k [i] = 2 * SA k+1 [rank k (i)] Step #2 : Ψ k (i) = i
26 Proof, case 2, B k [i] = 0 SA k [i] = 2*SA k+1 [rank k (Ψ k (i))] + (B k [i]-1) Ψ k (i) = j Step #2 : SA k [i] = SA k [j]-1 B k [j] = 1 Apply case 1 on j SA k [j] = 2 * SA k+1 [rank k (j)]
27 Example, case 1, B k [i] = 1 SA 0 [2] = ? B 0 [2]=1, Ψ 0 (2)=2, rank 0 (2) = 1 SA 0 [2]/2 stored in 1 st entry of SA 1 SA 0 [2] = 2 * SA 1 [1] = 2 * 8 = 16
28 Example, case 2, B k [i] = 0 SA 0 [3] = ? B 0 [3]=0, Ψ 0 (3) = 14, rank 0 (14) = 6 SA 0 [14] = 2 * SA 1 [6] = 2 * 16 = 32 SA 0 [3] = SA 0 [14] - 1 = = 31
29 Example - Decomposition
30 Determining l n 0 = n = 32 n 3 = 4 ~ n/logn can be stored in ≤ n bits Conclusion l = loglogn
31 CSA Structure K levels, k = 0,1,….,l-1 Store B k, Ψ k, rank k Final Level k = l Store only SA l
32 CSA Structure & Build B k n k bits per vector O(n k ) build rank k O(n k (loglogn k )/logn k ) bits As shown before O(n k ) build Sa l (n/2 l )logn bits
33 CSA Structure space - Ψ k List method 2 K lists possibilities for ‘prefixes’ of suffixes Number of lists increases L k = concatenation of all 2 K lists |L k | = n k /2 |L k | decreases
34 CSA Structure space - Ψ k For i = 1,…,n k /2 j = i th 1 in B k Pattern in 2 K (SA k [j]-1),…, 2 K *SA k [j]-1 matched to a list
35 Level 0 a list = {2,14,15,18,23,28,30,31} b list = {7,8,10,13,16,17,21,27}
36 Levels 1,2 Level 1 aa = {} //empty list ab = {9} ba = {1,6,12,14} bb = {2,4,5} Level 2 abba = {5,8} baba = {1} aabb = {4}
37 Reconstruct Ψ k B k [i] =1 Ψ k (i) = i B k [i] =0 h = number of 0’s in B k Ψ k (i) = L k [h]
38 example : Reconstruct Ψ k Ψ 0 (25) = ? B 0 [25] =0 h = = 13 Ψ 0 (25) = L 0 [13] = 16
39 example : Reconstruct Ψ k rank 0 (16) = 8 SA 1 [8] = ? Ψ 1 [8] = ? B 1 [8] =0 h = = 3 Ψ 1 (8) = L 1 [3] = 6
40 Lemma S sorted integers w bits per number S < 2 w Store integers S(2+w-logs)+O(s/loglogs) Retrieve h th integer O(1)
41 Store L k Store integers n(1/2+3/2 K+1 )+O(n/2 k loglogn) Retrieve h th integer O(1) Preprocess time O(n/2 k +2 2 k )
42 CSA Structure - Summary B k n k rank k O(n k (loglogn k )/logn k ) Sa l (n/2 l )logn Ψ k n(½+3/2 K+1 )+O(n/2 k loglogn)
43 Summing it up… nlogn/2 l + ½l*n + 5n + O(n/loglogn) ≤½nloglogn+n ½nloglogn + O(n) bits of storage
44 Preprocess - summary B k O(n k ) rank k O(n k ) Ψ k O(n/2 k +2 2 k ) Summing up 0,..,l-1 levels Preprocess time O(n)
45 lookup(i) lookup(i) refers to SA 0 [i] Need to reconstruct SA 0 [i] New procedure - rlookup(i,k) Recursive Based on lemma of reconstructing SA k
46 rlookup(i,k) If k = l Return Sa l [i] else Return 2*rlookup(rank k (Ψ k (i)),k+1)+(B k [i]-1)
47 Reconstruct SA k Lemma 2*SA k+1 [rank k (Ψ k (i))] + (B k [i]-1) lookup(i) = rlookup(i,0)
48 Example - lookup(i) lookup(5) = rlookup(5,0), l=3 2*rlookup(rank 0 (Ψ 0 (5)),1)+(B 0 [5]-1) 2*rlookup(10,1)+(-1)
49 Example – cont. rlookup(10,1) = 2*rlookup(rank 1 (Ψ 1 (10)),2)+(B 1 [10]-1) 2*rlookup(7,2)+(-1)
50 Example - cont. rlookup(7,2) = 2*rlookup(rank 2 (Ψ 2 (7)),3)+(B 2 [7]-1) 2*rlookup(2,3)+(-1)
51 Example - cont. rlookup(2,3) = lookup(5) = 2*(2*(2*3+(-1))+(-1))+(-1) = 2*(2*(5)+(-1))+(-1) = 2*(9)+(-1) = 17
52 lookup(i) lookup(i) = rlookup(i,0) l+1 levels O(1) per level O(loglogn) lookup time
53 Compressed Suffix Array Build time O(n) Structure space ½nloglogn + O(n) lookup time O(loglogn)
54 Compressed Suffix Tree Build time O(n) Search time O(m/logn+(logn) ε ) Structure space (ε -1 +O(1)) n