Published by Christopher Nicholson. Modified over 9 years ago.
1
Joint Advanced Student School 2004
Compressed Suffix Arrays
Compression of Suffix Arrays to linear size
Fabian Pache
2
Motivation
Problem: Suffix arrays are data structures that support fast pattern searches in long texts, but they require large amounts of memory.
Goal: Find a compression scheme that reduces the size of the suffix array while still allowing fast access.
3
Outline
1. Trivial Compression
2. Grossi & Vitter
– Outline
– Algorithms and Analysis
3. Sadakane
– Outline
– Algorithms and Analysis (Sketch)
4
Conventions used
Text T[1…n]
– Binary text in {a,b}^n, terminated with #
Pattern P[1…m]
– Binary text in {a,b}^m
Suffix Array SA[1…n]
– Each entry SA[i] points to a position in T, the start of the i-th smallest suffix
– Uses n log n bits
5
Trivial Compression
Construct and store SA for T = baab#, using the order a < # < b.
Suffixes by position: 1. baab#  2. aab#  3. ab#  4. b#  5. #
Suffixes in sorted order: 2. aab#  3. ab#  5. #  1. baab#  4. b#
SA = [2, 3, 5, 1, 4]
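The construction on this slide can be reproduced with a naive sort of all suffixes. The following is only a sketch: the names ORDER and suffix_array are illustrative and not from the slides, and sorting whole suffixes is far slower than a real suffix array construction algorithm.

```python
# Character order used on the slide: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}

def suffix_array(text):
    """Return the 1-based suffix array of text by naively sorting all suffixes."""
    def key(start):
        # Translate the suffix starting at 1-based position `start` into sortable ranks.
        return [ORDER[c] for c in text[start - 1:]]
    return sorted(range(1, len(text) + 1), key=key)

sa = suffix_array("baab#")
print(sa)  # [2, 3, 5, 1, 4]
```

Storing this list explicitly costs n log n bits, which is the overhead the rest of the talk tries to avoid.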
6
Trivial Compression
Recover T from SA
SA = [2, 3, 5, 1, 4], T = _ _ _ _ #
SA = [2, 3, 5, 1, 4], T = b _ _ b #   (# < b)
SA = [2, 3, 5, 1, 4], T = b a a b #   (a < #)
7
Trivial Compression
Therefore each suffix array can be compressed to [formula missing]
Drawback: decompression takes [formula missing]
8
Grossi & Vitter
Outline:
– Recursive "divide and conquer" type algorithm
– Stores SA implicitly (for all but the last level)
Supported operations:
– lookup(i)
– compress
9
G&V – compress
Structural outline (figure): SA_0, SA_1, SA_2, ..., SA_l
10
G&V – compress
Given a text T; create SA
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T         A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA       15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30

i        17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T         A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA       12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
11
G&V – compress
Create array B[1...n] with n = |SA|
– B[i] = 1, if SA[i] is even
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA       15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B         .  1  .  .  .  .  1  1  .  1  .  .  1  1  1  1

i        17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA       12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B         1  1  .  .  1  .  1  .  .  .  1  1  .  1  1  .
12
G&V – compress
Create array B[1...n] with n = |SA|
– B[i] = 1, if SA[i] is even
– B[i] = 0, if SA[i] is odd
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA       15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B         0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1

i        17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA       12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B         1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
13
G&V – compress
Create an array rank[1...n] where rank[i] contains the number of 1s in B[1...i]
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA       15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B         0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank      0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8

i        17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA       12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B         1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank      9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
14
G&V – compress
Define a mapping Ψ[1..n] so that
– If B[i] = 0: Ψ[i] = j such that SA[j] = SA[i] + 1
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA       15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B         0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
Ψ         2  . 14 15 18 23  .  . 28  . 30 31  .  .  .  .

i        17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA       12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B         1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
Ψ         .  .  7  8  . 10  . 13 16 17  .  . 21  .  . 27
15
G&V – compress
Define a mapping Ψ[1..n] so that
– If B[i] = 0: Ψ[i] = j such that SA[j] = SA[i] + 1
– If B[i] = 1: Ψ[i] = i
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA       15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B         0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
Ψ         2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16

i        17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA       12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B         1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
Ψ        17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27
16
G&V – compress
Compressing from SA_k to SA_{k+1}
– Store only the even values of SA_k in SA_{k+1}
– Divide each entry in SA_{k+1} by 2
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T         A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA_k     15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B_k       0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank_k    0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
Ψ_k       2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16

i        17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T         A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA_k     12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B_k       1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank_k    9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
Ψ_k      17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27

i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_{k+1}  8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
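The compression step from SA_k to SA_{k+1} can be sketched as follows. This is a minimal illustration that keeps B, rank and Ψ as plain Python lists (so nothing is actually compressed). The function name compress_level is hypothetical, all indices reported are 1-based as on the slides, and the sketch assumes that the length of SA_k is even so that SA_k[i] + 1 always exists for odd entries.

```python
def compress_level(sa_k):
    """One step SA_k -> SA_{k+1}; returns (B, rank, psi, SA_{k+1}) as lists."""
    n = len(sa_k)
    # index_of[v] = j such that SA_k[j] = v (1-based j)
    index_of = {v: j for j, v in enumerate(sa_k, start=1)}
    b = [1 if v % 2 == 0 else 0 for v in sa_k]
    # rank[i-1] = number of 1s in B[1..i]
    rank, ones = [], 0
    for bit in b:
        ones += bit
        rank.append(ones)
    # psi[i-1] = i for even entries, otherwise the index of the value SA_k[i] + 1
    psi = [i if b[i - 1] else index_of[sa_k[i - 1] + 1] for i in range(1, n + 1)]
    # keep the even values in their original order and halve them
    sa_next = [v // 2 for v in sa_k if v % 2 == 0]
    return b, rank, psi, sa_next

SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
b0, rank0, psi0, sa1 = compress_level(SA0)
print(sa1)  # [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
```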
17
G&V – lookup
Reconstruction of SA_k[i] using B_k, rank_k, Ψ_k and SA_{k+1}:
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1)
18
G&V – lookup
Proof / Example part 1: B_k[i] = 1
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1)
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))]      (since B_k[i] − 1 = 0)
SA_k[i] = 2 · SA_{k+1}[rank_k(i)]           (since Ψ_k(i) = i)
Examples for k = 0, the highlighted entries 10 and 18:
– i = 8: rank_0(8) = 3 and SA_1[3] = 5, so SA_0[8] = 2 · 5 = 10
– i = 18: rank_0(18) = 10 and SA_1[10] = 9, so SA_0[18] = 2 · 9 = 18
19
G&V – lookup
Proof / Example part 2: B_k[i] = 0
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1)
SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] − 1  (since B_k[i] − 1 = −1)
Examples for k = 0, the highlighted entries 17 and 3:
– i = 5: Ψ_0(5) = 18, rank_0(18) = 10 and SA_1[10] = 9, so SA_0[5] = 2 · 9 − 1 = 17
– i = 22: Ψ_0(22) = 10, rank_0(10) = 4 and SA_1[4] = 2, so SA_0[22] = 2 · 2 − 1 = 3
20
G&V – lookup
Stored information
For each level k = 0...l−1:
– B_k is stored explicitly
– rank_k and Ψ_k are stored implicitly
– SA_k is reconstructible by recursion
For level l, SA_l is stored explicitly
– No further information needed
21
G&V – lookup
Pseudocode for the lookup function:
lookup(i) = rlookup(i, 0)
rlookup(i, k)
    if (k == l)
        return SA_l[i];
    else
        return 2 * rlookup(rank_k[psi_k[i]], k+1) + (B_k[i] - 1);
end
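A runnable version of this recursion, as a sketch only: it stores B_k, rank_k and Ψ_k as explicit lists instead of the succinct encodings discussed in the talk, and the helper names compress_level and build are made up for this illustration. It assumes that every compressed level has even length, which holds for the 32-character example with l = 3.

```python
from itertools import accumulate

def compress_level(sa_k):
    """Return (B, rank, psi, SA_{k+1}) for a 1-based suffix array given as a list."""
    index_of = {v: j for j, v in enumerate(sa_k, start=1)}
    b = [1 - v % 2 for v in sa_k]
    rank = list(accumulate(b))
    psi = [i if b[i - 1] else index_of[sa_k[i - 1] + 1]
           for i in range(1, len(sa_k) + 1)]
    return b, rank, psi, [v // 2 for v in sa_k if v % 2 == 0]

def build(sa, levels):
    """Compress `levels` times; keep (B, rank, psi) per level and only the last SA."""
    structure = []
    for _ in range(levels):
        b, rank, psi, sa = compress_level(sa)
        structure.append((b, rank, psi))
    return structure, sa

def rlookup(i, k, structure, sa_last):
    if k == len(structure):            # last level: SA_l is stored explicitly
        return sa_last[i - 1]
    b, rank, psi = structure[k]
    j = rank[psi[i - 1] - 1]           # rank_k(psi_k(i))
    return 2 * rlookup(j, k + 1, structure, sa_last) + (b[i - 1] - 1)

def lookup(i, structure, sa_last):
    return rlookup(i, 0, structure, sa_last)

SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
structure, sa_last = build(SA0, 3)
print(sa_last)                                               # [2, 3, 4, 1]
print([lookup(i, structure, sa_last) for i in (1, 5, 8)])    # [15, 17, 10]
```

The recursion depth equals the number of levels, so each lookup touches one B, rank and Ψ entry per level before reading a single explicit entry of SA_l.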
22
G&V – details
Space versus Time
                  Quick and Large     Small and Slow
Space (in bits)   O(n log log n)      O(n)
Time              O(log log n)        O(log^ε n), ε > 0
23
G&V – Quick and Large
Storing rank space-efficiently while keeping it quickly accessible:
– Explicit storage of rank takes n log n bits
– Jacobson's method uses o(n) bits
– Both allow for constant-time access
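The idea behind sampled rank structures can be illustrated with a simplified, single-level sketch. This is not Jacobson's actual construction (which uses a two-level directory plus lookup tables to reach o(n) extra bits and constant time); the class name SampledRank and the block size are assumptions chosen for illustration.

```python
class SampledRank:
    """Rank over a bit list: store a running count per block, scan inside a block."""

    def __init__(self, bits, block=4):
        self.bits = bits
        self.block = block
        # samples[q] = number of 1s in the first q complete blocks
        self.samples = [0]
        for start in range(0, len(bits), block):
            self.samples.append(self.samples[-1] + sum(bits[start:start + block]))

    def rank(self, i):
        """Number of 1s in bits[1..i] (1-based, inclusive)."""
        full, rest = divmod(i, self.block)
        start = full * self.block
        return self.samples[full] + sum(self.bits[start:start + rest])

B0 = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
r = SampledRank(B0)
print(r.rank(7), r.rank(8), r.rank(16))  # 2 3 8
```

Only one counter per block is stored, and a query scans at most one block, which is the trade-off that the full method refines further.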
24
G&V – Quick and Large
Storing Ψ_k efficiently (outline):
Create 2^(2^k) arrays, one for each possible substring in {a,b}^(2^k), using the substring as label
Example using k = 1, labels: aa, ab, ba, bb
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1      8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1       1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1    1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8
T[1...32] = ABBABBABBABBABAAABABABBABBBABBA#
25
G&V – Quick and Large
For each j with B_k[j] = 1, find the 2^k literals preceding the suffix referenced by SA_k[j] in T
Store j in the array labelled with these literals
Example using k = 1, arrays so far:
aa:
ab:
ba: 1
bb:
26
G&V – Quick and Large
In other words: for each i with B_k[i] = 0, let t be the first 2^k literals of the suffix referenced by SA_k[i], and insert Ψ_k[i] in array t
Example using k = 1, arrays so far:
aa:
ab: 9
ba: 1
bb:
27
G&V – Quick and Large
For each j with B_k[j] = 1:
– t = T[2^k · SA_k[j] − 2^k, ..., 2^k · SA_k[j] − 1]
– add j to the array with label t
Example using k = 1, completed arrays:
aa:
ab: 9
ba: 1, 6, 12, 14
bb: 2, 4, 5
28
G&V – Quick and Large
To calculate Ψ_k(i) for an entry with B_k[i] = 0, use i − rank_k(i) as index into the concatenated arrays
Example using k = 1:
aa:
ab: 9
ba: 1, 6, 12, 14
bb: 2, 4, 5
Concatenated: 9, 1, 6, 12, 14, 2, 4, 5
Ψ_1(8) = 6, because 8 − rank_1(8) = 8 − 5 = 3 and the third entry is 6
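The list construction and the index computation i − rank(i) can be checked on the example data with a short sketch. It treats the 16-entry array from the slides as level 1 (labels of length 2), writes T in lowercase, and uses illustrative names (lists, concatenated, psi1) that do not appear in the talk. The lists are kept as plain Python lists, so the sketch shows the indexing scheme only, not the compressed encoding of the lists.

```python
from itertools import accumulate

T = "abbabbabbabbabaaabababbabbbabba#"     # positions 1..32 on the slides
SA1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
B1 = [1 if v % 2 == 0 else 0 for v in SA1]
RANK1 = list(accumulate(B1))

# One list per 2-character label, in the order a < b (no label contains # here).
lists = {"aa": [], "ab": [], "ba": [], "bb": []}
for j, v in enumerate(SA1, start=1):
    if B1[j - 1] == 1:
        p = 2 * v                          # start of this suffix in T (1-based)
        label = T[p - 3:p - 1]             # the two characters T[p-2], T[p-1]
        lists[label].append(j)

concatenated = lists["aa"] + lists["ab"] + lists["ba"] + lists["bb"]

def psi1(i):
    """Psi at level 1: identity on even entries, list lookup on odd entries."""
    if B1[i - 1] == 1:
        return i
    return concatenated[i - RANK1[i - 1] - 1]

print(lists)     # {'aa': [], 'ab': [9], 'ba': [1, 6, 12, 14], 'bb': [2, 4, 5]}
print(psi1(8))   # 6
```

The lookup works because the entries with B = 0 appear in suffix order, so the k-th such entry (k = i − rank(i)) has its Ψ value at the k-th position of the concatenation.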
29
G&V – Quick and Large
l = log log n levels of compression
Space occupation:
– l levels, each occupying O(n) bits
– => O(n log log n) bits
Time requirement for lookup(i):
– l levels, each requiring O(1)
– => O(log log n) steps
30
G&V – Small and Slow
Reduction of size by allowing for higher time usage
                  Quick and Large     Small and Slow
Space (in bits)   O(n log log n)      O(n)
Time              O(log log n)        O(log^ε n), ε > 0
31
G&V – Small and Slow
Instead of storing all l = log log n levels, store only levels 0, l´ and l
Example n = 32: store levels 0, 2, 3
32
G&V – Small and Slow
Example using |T| = 32 (figure): SA_0, SA_1, SA_2, SA_3
33
G&V – Small and Slow
Keep only 3 levels (figure): SA_0, SA_1, SA_3
34
G&V – Small and Slow
On levels 0 and l´, mark entries that are still present in the next level (figure): SA_0, SA_1, SA_3
35
G&V – Small and Slow
Before the modification:
– B_k[i] = 1: SA_k[i] is stored in SA_{k+1}
– Ψ_k is used for each B_k[i] = 0 to find SA_k[Ψ_k[i]] = SA_k[i] + 1
Modifications added:
– B_0´[i] = 1: SA_0[i] is stored in SA_l´
– B_l´´[i] = 1: SA_l´[i] is stored in SA_l
– Ψ_k´ is used for each B_k[i] = 1 to find SA_k[Ψ_k´[i]] = SA_k[i] + 1
36
G&V – Small and Slow
Construction of Ψ´ and B´ markings of indices
i         1  2  3  4
SA_3      2  3  4  1

i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1      8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1´      1  0  0  0  1  1  0  0  0  0  0  0  0  1  0  0
Ψ_1       .  .  9  .  .  .  1  6  . 12 14  .  2  .  4  5
Ψ_1´     10  8  . 11 13 15  .  .  7  .  . 16  .  3  .  .
37
G&V – Small and Slow
Ψ´ and Ψ in combination can be used to traverse the entire SA (ascending)
i         1  2  3  4
SA_3      2  3  4  1

i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1      8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1´      1  0  0  0  1  1  0  0  0  0  0  0  0  1  0  0
Ψ_1       .  .  9  .  .  .  1  6  . 12 14  .  2  .  4  5
Ψ_1´     10  8  . 11 13 15  .  .  7  .  . 16  .  3  .  .
38
G&V – Small and Slow
Length of the traversal determines the time required for lookup
– Level 0 contains n entries
– Level l´ was divided ½ log log n times
39
G&V – Small and Slow
Length of the traversal determines the time required for lookup
– 0s in B´ are evenly spaced
– Longest sequence of 0s
40
G&V – Small and Slow
Generalized for more than 2 additional levels (the number must be constant!):
– Let L be the number of levels, ε = L^-1
– The longest sequence of 0s has length log^ε n
41
G&V – Small and Slow
Reconstruction of levels < l requires
– a vector B´ describing which entries of level k´ can be found in level k´+1 => O(n) bits
– a function Ψ´ that, combined with Ψ, allows for complete traversal of SA => O(n) bits
42
Sadakane
Improvements on the data structure and algorithms proposed by Grossi & Vitter
More operations:
– inverse(j): return i so that SA[i] = j
– search(P): return l, r, the interval of SA where P matches T
– decompress(s, e): return T[s...e]
Allow for alphabets with |Σ| > 2
43
Sadakane – inverse(j)
Goal: For a suffix starting at position j, find its index i in the lexicographic order of all suffixes
Assuming: j = SA[i]
Create SA^-1 so that: i = SA^-1[j]
44
Sadakane – inverse(j)
Proposition: inverse(j) can be computed in O(log n) time with explicit storage of SA^-1 at the last level and a recursion for all levels above.
45
Sadakane – search(P)
Goal: Find the interval [l...r] in SA so that P matches each of the suffixes pointed to by SA[l...r]. Do so without using T.
46
Sadakane – search(P)
Proposition: By augmenting the data structure with a function C^-1 (the "inverse of the array of cumulative frequencies") it is possible to obtain the substring in O(|P|) time.
47
Sadakane – decompress(I)
Goal: Using only SA and its functions, return the substring of T pointed to by I = [s, e].
48
Sadakane – decompress(I)
Proposition: A substring of length l = e − s + 1 can be decompressed using only SA, SA^-1 and C^-1 in O(l + log n) time, where n is the length of the original text.
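How inverse, Ψ and C^-1 cooperate during decompression can be sketched on the example text. This is an illustration under several assumptions: SA^-1 and Ψ are stored as explicit arrays (so the O(l + log n) bound and the space savings are not reproduced), Ψ is taken to be defined for every index with a wrap-around at the last suffix, and the names INV, PSI, c_inv and decompress are chosen here, not taken from the talk. T is only used to build the structures, not inside decompress.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

ORDER = {"a": 0, "#": 1, "b": 2}           # a < # < b, as on the slides
T = "abbabbabbabbabaaabababbabbbabba#"
n = len(T)

SA = sorted(range(1, n + 1), key=lambda s: [ORDER[c] for c in T[s - 1:]])
INV = [0] * n                               # INV[j-1] = i  <=>  SA[i] = j
for i, s in enumerate(SA, start=1):
    INV[s - 1] = i
# PSI[i-1] = index of the suffix starting one position later (wraps at the end)
PSI = [INV[s % n] for s in SA]

# C^-1: map a suffix-array index to the first character of that suffix,
# using only cumulative character counts.
counts = Counter(T)
alphabet = sorted(counts, key=ORDER.get)
cumulative = list(accumulate(counts[c] for c in alphabet))

def c_inv(i):
    return alphabet[bisect_left(cumulative, i)]

def inverse(j):
    return INV[j - 1]

def decompress(s, e):
    """Return T[s..e] (1-based, inclusive) without reading T."""
    i = inverse(s)
    out = []
    for _ in range(e - s + 1):
        out.append(c_inv(i))                # first character of the current suffix
        i = PSI[i - 1]                      # move to the suffix one position later
    return "".join(out)

print(decompress(10, 14))  # abbab
```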
49
Sadakane – Complexity
Using inverse, search and decompress it is possible to store T implicitly. Therefore O(n) words are no longer required.
The space complexity of the Sadakane-improved suffix array is only 37% of that of a Grossi & Vitter suffix array including the text.