Presentation is loading. Please wait.

Presentation is loading. Please wait.

Compressed Suffix Arrays and Suffix Trees Roberto Grossi, Jeffery Scott Vitter.

Similar presentations


Presentation on theme: "Compressed Suffix Arrays and Suffix Trees Roberto Grossi, Jeffery Scott Vitter."— Presentation transcript:

1 Compressed Suffix Arrays and Suffix Trees Roberto Grossi, Jeffery Scott Vitter

2 2 Outline Reminders Motivation Compression results Time & Space bounds Compressed Suffix Tree Compressed Suffix Array Proof of bounds

3 3 Reminder - Symbols T = t 1 t 2...t n-1 text of length n-1 eof symbol # at the n th position T[i,n] is suffix i of text T i=1,…,n

4 4 Reminder - Symbols P = p 1 p 2...p m pattern of length m 0<ε≤1

5 5 Reminder - Main Goal Search string pattern P within text T Support fast queries Text T being fully scanned only once

6 6 Reminder – Suffix Trees Leaf with value i represents suffix [i,n] Build time O(n) Search time O(m) Structure space O(n)

7 7 Reminder – Suffix Arrays Lexicographically ordered SA[i] = the starting position in T of the i-th suffix Σ={a,b} a<#<b T = bbba# a##ba#bba#bbba# 1 2 3 4 5 4 5 3 2 1

8 8 Reminder – Suffix Arrays Build time O(nlogn) Search time O(m+logn) Structure space O(n)

9 9 Motivation So Far Greedy in space Fast searching Need for space-efficient text indexing Reduce both space and query time

10 10 Compressed Suffix Tree Build time O(n) Search time O(m/logn+(logn) ε ) Structure space (ε -1 +O(1)) n

11 11 Compressed Suffix Tree Build Suffix Array Build Compressed Suffix Tree Patricia Tries Compress Suffix Array

12 12 CSA Basic Operations Compress(T,SA) Return succinct representation of SA Retain T Discard SA Lookup(i) Return SA[i] Use compressed SA

13 13 CSA Primary measures Compress Preprocessing compressed SA Space of compressed SA lookup Query time

14 14 Compressed Suffix Array Build time O(n) Structure space ½nloglogn + O(n) lookup time O(loglogn)

15 15 Suffix Arrays Optimization Main idea Decomposition scheme Recursive structure of permutations

16 16 Decomposition Scheme K levels, K=0,….,l SA 0 = SA ( Original SA) n 0 =n n = |T| assumption - n is a power of 2 n k =n/2 k SA k ={1,2,…,n k )

17 17 SA k Succinct Representation 4 main steps: 1. Produce bit vector B k 2. Map B k 0’s to 1’s 3. Compute 1’s for each prefix in B k Using function rank k (j) 4. ‘Pack’ SA k

18 18 Step #1: Produce bit vector B k |B k | = n k B k [i]=1 if SA k [i] is even B k [i]=0 if SA k [i] is odd T = bba# 2431 SA 0 BoBo 1100

19 19 Step #2 : Map B k 0’s to 1’s New Fuction Ψ k (i), i=1,…,n k Ψ k (i) = jSA k [i] is odd and SA k [j]= SA k [i]+1 iotherwise (SA k [i] is even) T = bba# 2431 BoBo 11003223 ΨoΨo SA 0

20 20 Step #3 : Compute 1’s for B k Recall fuction rank k (j), j=1,…,l k rank k (j) = number of 1’s on first j bits of B k T = bba# 2431 BoBo 1100 SA 0 2102 rank o

21 21 Step #4 : ‘Pack’ SA k Pack even values of SA k Divide by 2 New permutation {1,2,..,n k+1 } n k+1 =n k /2=n/2 k+1 Store new permutation into SA k+1 Remove SA k |SA k+1 | = |SA k |/2

22 22 Example: level 0, steps 1-3

23 23 Example: level 0, step 4

24 24 Lemma : Reconstruct SA k Results of phase k B k, Ψ k, rank k,SA k+1 Reconstruct SA k SA k [i] = 2*SA k+1 [rank k (Ψ k (i))] + (B k [i]-1) i = 1,….n k

25 25 Proof, case 1, B k [i] = 1 SA k [i] = 2*SA k+1 [rank k (Ψ k (i))] + (B k [i]-1) Step #4 : SA k [i]/2 stored in rank k (i) th entry of SA k+1 SA k [i] = 2 * SA k+1 [rank k (i)] Step #2 : Ψ k (i) = i

26 26 Proof, case 2, B k [i] = 0 SA k [i] = 2*SA k+1 [rank k (Ψ k (i))] + (B k [i]-1) Ψ k (i) = j Step #2 : SA k [i] = SA k [j]-1 B k [j] = 1 Apply case 1 on j SA k [j] = 2 * SA k+1 [rank k (j)]

27 27 Example, case 1, B k [i] = 1 SA 0 [2] = ? B 0 [2]=1, Ψ 0 (2)=2, rank 0 (2) = 1 SA 0 [2]/2 stored in 1 st entry of SA 1 SA 0 [2] = 2 * SA 1 [1] = 2 * 8 = 16

28 28 Example, case 2, B k [i] = 0 SA 0 [3] = ? B 0 [3]=0, Ψ 0 (3) = 14, rank 0 (14) = 6 SA 0 [14] = 2 * SA 1 [6] = 2 * 16 = 32 SA 0 [3] = SA 0 [14] - 1 = 32 - 1 = 31

29 29 Example - Decomposition

30 30 Determining l n 0 = n = 32 n 3 = 4 ~ n/logn can be stored in ≤ n bits Conclusion l = loglogn

31 31 CSA Structure K levels, k = 0,1,….,l-1 Store B k, Ψ k, rank k Final Level k = l Store only SA l

32 32 CSA Structure & Build B k n k bits per vector O(n k ) build rank k O(n k (loglogn k )/logn k ) bits As shown before O(n k ) build Sa l (n/2 l )logn bits

33 33 CSA Structure space - Ψ k List method 2 K lists possibilities for ‘prefixes’ of suffixes Number of lists increases L k = concatenation of all 2 K lists |L k | = n k /2 |L k | decreases

34 34 CSA Structure space - Ψ k For i = 1,…,n k /2 j = i th 1 in B k Pattern in 2 K (SA k [j]-1),…, 2 K *SA k [j]-1 matched to a list

35 35 Level 0 a list = {2,14,15,18,23,28,30,31} b list = {7,8,10,13,16,17,21,27}

36 36 Levels 1,2 Level 1 aa = {} //empty list ab = {9} ba = {1,6,12,14} bb = {2,4,5} Level 2 abba = {5,8} baba = {1} aabb = {4}

37 37 Reconstruct Ψ k B k [i] =1 Ψ k (i) = i B k [i] =0 h = number of 0’s in B k Ψ k (i) = L k [h]

38 38 example : Reconstruct Ψ k Ψ 0 (25) = ? B 0 [25] =0 h = 25 - 12 = 13 Ψ 0 (25) = L 0 [13] = 16

39 39 example : Reconstruct Ψ k rank 0 (16) = 8 SA 1 [8] = ? Ψ 1 [8] = ? B 1 [8] =0 h = 8 - 5 = 3 Ψ 1 (8) = L 1 [3] = 6

40 40 Lemma S sorted integers w bits per number S < 2 w Store integers S(2+w-logs)+O(s/loglogs) Retrieve h th integer O(1)

41 41 Store L k Store integers n(1/2+3/2 K+1 )+O(n/2 k loglogn) Retrieve h th integer O(1) Preprocess time O(n/2 k +2 2 k )

42 42 CSA Structure - Summary B k n k rank k O(n k (loglogn k )/logn k ) Sa l (n/2 l )logn Ψ k n(½+3/2 K+1 )+O(n/2 k loglogn)

43 43 Summing it up… nlogn/2 l + ½l*n + 5n + O(n/loglogn) ≤½nloglogn+n ½nloglogn + O(n) bits of storage

44 44 Preprocess - summary B k O(n k ) rank k O(n k ) Ψ k O(n/2 k +2 2 k ) Summing up 0,..,l-1 levels Preprocess time O(n)

45 45 lookup(i) lookup(i) refers to SA 0 [i] Need to reconstruct SA 0 [i] New procedure - rlookup(i,k) Recursive Based on lemma of reconstructing SA k

46 46 rlookup(i,k) If k = l Return Sa l [i] else Return 2*rlookup(rank k (Ψ k (i)),k+1)+(B k [i]-1)

47 47 Reconstruct SA k Lemma 2*SA k+1 [rank k (Ψ k (i))] + (B k [i]-1) lookup(i) = rlookup(i,0)

48 48 Example - lookup(i) lookup(5) = rlookup(5,0), l=3 2*rlookup(rank 0 (Ψ 0 (5)),1)+(B 0 [5]-1) 2*rlookup(10,1)+(-1)

49 49 Example – cont. rlookup(10,1) = 2*rlookup(rank 1 (Ψ 1 (10)),2)+(B 1 [10]-1) 2*rlookup(7,2)+(-1)

50 50 Example - cont. rlookup(7,2) = 2*rlookup(rank 2 (Ψ 2 (7)),3)+(B 2 [7]-1) 2*rlookup(2,3)+(-1)

51 51 Example - cont. rlookup(2,3) = lookup(5) = 2*(2*(2*3+(-1))+(-1))+(-1) = 2*(2*(5)+(-1))+(-1) = 2*(9)+(-1) = 17

52 52 lookup(i) lookup(i) = rlookup(i,0) l+1 levels O(1) per level O(loglogn) lookup time

53 53 Compressed Suffix Array Build time O(n) Structure space ½nloglogn + O(n) lookup time O(loglogn)

54 54 Compressed Suffix Tree Build time O(n) Search time O(m/logn+(logn) ε ) Structure space (ε -1 +O(1)) n


Download ppt "Compressed Suffix Arrays and Suffix Trees Roberto Grossi, Jeffery Scott Vitter."

Similar presentations


Ads by Google