Published by Hester Robinson. Modified over 9 years ago.
1
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
2
Jeffrey Scott Vitter, Roberto Grossi
3
Agenda
A (very) short review of suffix arrays
Introduction
Problem definition
Information theory reasoning
– Simple solution, round 2
Compressed suffix arrays in ½ n loglog n + O(n) bits and O(loglog n) access time
Rank and select
– Problem definitions
– Rank data structure
Compressed suffix arrays in ε^-1 n + O(n) bits and O(log^ε n) access time
Select data structure (if time permits)
4
Short review on suffix arrays
A suffix array is the sorted array of the suffixes of a string S, represented by an array of pointers to the suffixes of S.
For example, the string telaviv and its corresponding suffix array:
S0 = telaviv
S1 = elaviv
S2 = laviv
S3 = aviv
S4 = viv
S5 = iv
S6 = v
Suffix array: 3 1 5 2 0 6 4 (aviv, elaviv, iv, laviv, telaviv, v, viv)
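As a minimal sketch of the definition above, the suffix array can be built by sorting the suffix start positions by the suffixes they point to. The function name `suffix_array` is an illustrative choice, and the quadratic-time sort is only meant for tiny inputs, not as a practical construction.

```python
def suffix_array(text: str) -> list[int]:
    """Return the 0-based start positions of the suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "telaviv"
for rank, start in enumerate(suffix_array(text)):
    # Each row: rank in sorted order, pointer into the text, the suffix itself.
    print(rank, start, text[start:])
```

Only the pointers are stored; the suffixes themselves are read from the text when needed.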
5
Introduction
Part of the succinct data structures branch.
Example application: DNA genome strings (small alphabet, large strings).
Mainly a theoretical article.
6
Problem Definition
The algorithm is composed of two phases:
– compression
– lookup
Compress: given a suffix array SA, compress it to get its succinct representation.
lookup(i): given the compressed representation, return SA[i].
7
Some Definitions
We will deal (at first) with a binary alphabet
– Σ = {a, b}
We will add a special end-of-string symbol #
and set the order between the characters to be
– a < # < b (*)
Basic RAM model
– word size of log n bits
– word lookup and arithmetic in constant time
8
Information theory reasoning
All 16 binary strings of length 4 (followed by #) and their suffix arrays:
abbb# 15432   abba# 41532   abab# 13524   abaa# 34152
aabb# 12543   aaba# 14253   aaab# 12354   aaaa# 12345
bbbb# 54321   bbba# 45321   bbab# 35241   bbaa# 34521
babb# 25143   baba# 42531   baab# 23514   baaa# 23451
9
Information theory reasoning (2)
Suffix array size: n log n bits
There is a one-to-one correspondence between the suffix array and the string
– construction details
Number of possible suffix arrays: 2^(n-1)
– Perfect compression: n bits (the string itself)
– The cost of a lookup is then Ω(n), see the previous lecture
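The counting argument can be checked by brute force on small inputs. The sketch below assumes the deck's order a < # < b, uses 1-based positions, and enumerates every binary string of a given length followed by #; the helper names are illustrative.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # the deck's convention: a < # < b


def suffix_array(text):
    """1-based suffix array of text under the order a < # < b."""
    def key(i):
        return [ORDER[c] for c in text[i:]]
    return tuple(i + 1 for i in sorted(range(len(text)), key=key))


def distinct_suffix_arrays(m):
    """Set of suffix arrays over all binary strings of length m followed by #."""
    return {suffix_array("".join(p) + "#") for p in product("ab", repeat=m)}


arrays = distinct_suffix_arrays(4)
print(len(arrays))  # how many different suffix arrays occur for n = 5
```

Each string yields a different suffix array, so only 2^(n-1) of the n! permutations can occur, which is why n log n bits per suffix array is far from the information-theoretic minimum.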
10
“Simple” solution round 2: a different approach
Let's pack each group of log n bits together to create a new alphabet.
So the text length will be n/log n and the pattern length will be m/log n.
The suffix array will take O(n) bits.
Searching becomes hard (alignment)
– the text is aligned but the pattern isn't: log n cases
11
“Simple” solution round 2: the text isn't aligned
The pattern occurs k bits to the right of a word boundary.
We need to pad the pattern with k bits and check it,
so we need to check 2^k cases.
k ~ log n => n different cases to check,
assuming we know how much to pad!
12
General framework
Abstract data type optimization [Jacobson '89]
Number of distinct data structures = C(n) => each data structure occupies O(log C(n)) bits.
This doesn't guarantee the time complexity of the supported operations.
13
Compressed suffix arrays in ½ n loglog n + O(n) bits and O(loglog n) access time
Recursive in nature
– takes advantage of the structure of the suffixes
Let Sa_0 be the uncompressed suffix array
and N_0 its size (assume a power of 2).
In phase k of the compression we start with Sa_k of size N_k = N_0/2^k and create Sa_{k+1} of size N_{k+1} = N_k/2.
Sa_{k+1} holds a permutation of {1..N_{k+1}}.
14
Sa_{k+1} construction
Create the bit vector B_k: B_k[i] = 1 iff Sa_k[i] is even.
Create the rank vector: rank_k(j) counts the number of 1 bits in the first j bits of B_k.
Create the Ψ_k vector
– stores the 0-to-1 companion relation: if Sa_k[i] is odd, Ψ_k(i) = j where Sa_k[j] = Sa_k[i] + 1; otherwise Ψ_k(i) = i
Store the even values of Sa_k, each divided by 2, in Sa_{k+1}.
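One phase of the construction can be sketched as follows. This is an illustration under stated assumptions, not the space-efficient representation: every vector is stored as a plain Python list with an unused entry at index 0 so that indices match the 1-based slides, and the name `compress_phase` is hypothetical.

```python
def compress_phase(sa):
    """One compression phase.

    sa[1..N] is a permutation of 1..N with N even; sa[0] is unused padding.
    Returns (B, rank, psi, sa_next), all 1-based with padding at index 0.
    """
    n = len(sa) - 1
    pos = [0] * (n + 2)                # pos[v] = index i with sa[i] == v
    for i in range(1, n + 1):
        pos[sa[i]] = i
    b = [0] * (n + 1)
    rank = [0] * (n + 1)
    psi = [0] * (n + 1)
    sa_next = [0]
    for i in range(1, n + 1):
        b[i] = 1 if sa[i] % 2 == 0 else 0
        rank[i] = rank[i - 1] + b[i]   # number of 1 bits in B[1..i]
        # Even entries map to themselves, odd entries to their even companion.
        psi[i] = i if b[i] else pos[sa[i] + 1]
        if b[i]:
            sa_next.append(sa[i] // 2)
    return b, rank, psi, sa_next


# Sa_2 of the running example: 4 7 1 6 8 3 5 2
b2, rank2, psi2, sa3 = compress_phase([0, 4, 7, 1, 6, 8, 3, 5, 2])
print(sa3[1:])
```

Applying the function repeatedly, starting from Sa_0, yields the vectors of every level.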
15
An Example
The 32-character string T = abbabbabbabbabaaabababbabbbabba#
16
An Example
i:       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T:       a  b  b  a  b  b  a  b  b  a  b  b  a  b  a  a
Sa_0:   15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B_0:     0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
Rank_0:  0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
Ψ_0:     2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16
17
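Example …
i:      17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T:       a  b  a  b  a  b  b  a  b  b  b  a  b  b  a  #
Sa_0:   12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B_0:     1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
Rank_0:  9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
Ψ_0:    17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27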
18
How to compute Sa_k from Sa_{k+1}
Lemma 1
– Given the suffix array Sa_k, let B_k, rank_k, Ψ_k and Sa_{k+1} be the result of the transformation performed by phase k. We can reconstruct Sa_k from Sa_{k+1} by the following formula:
  Sa_k[i] = 2 · Sa_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1)
– Let's split into 2 cases:
  Sa_k[i] is even (B_k[i] = 1)
  Sa_k[i] is odd (B_k[i] = 0)
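The formula of Lemma 1 can be checked directly on the level-2 vectors of the running example. In this sketch the vectors are plain 1-based lists with padding at index 0 (an assumption made for readability), and `expand` is an illustrative name.

```python
def expand(b, rank, psi, sa_next):
    """Rebuild Sa_k from Sa_{k+1} via Sa_k[i] = 2*Sa_{k+1}[rank(psi(i))] + (B[i] - 1)."""
    n = len(b) - 1
    sa = [0] * (n + 1)
    for i in range(1, n + 1):
        # Even entry: psi(i) = i and B[i] - 1 = 0, so we just double the stored value.
        # Odd entry: psi(i) points at the even companion, and B[i] - 1 = -1.
        sa[i] = 2 * sa_next[rank[psi[i]]] + (b[i] - 1)
    return sa


# Level 2 of the running example (index 0 is padding).
B2 = [0, 1, 0, 0, 1, 1, 0, 0, 1]
RANK2 = [0, 1, 1, 1, 2, 3, 3, 3, 4]
PSI2 = [0, 1, 5, 8, 4, 5, 1, 4, 8]
SA3 = [0, 2, 3, 4, 1]

print(expand(B2, RANK2, PSI2, SA3)[1:])
```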
19
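Example continued
i:       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Sa_1:    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1:     1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
Rank_1:  1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8
Ψ_1:     1  2  9  4  5  6  1  6  9 12 14 12  2 14  4  5

i:       1  2  3  4  5  6  7  8
Sa_2:    4  7  1  6  8  3  5  2
B_2:     1  0  0  1  1  0  0  1
Rank_2:  1  1  1  2  3  3  3  4
Ψ_2:     1  5  8  4  5  1  4  8

Sa_3:    2  3  4  1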
20
Compress
– We keep l = O(loglog n) levels
– All levels except Sa_l are stored implicitly
– For each level j = 0..l−1 we store B_j, rank_j and Ψ_j
– rank_j and Ψ_j are stored implicitly
– Sa_l is stored explicitly; with l = ⌈loglog n⌉ its size is N_l = n/2^l ≤ n/log n entries, i.e. at most n bits
21
lookup
Just compute Sa_k[i] recursively from Sa_{k+1}, using Lemma 1.
Recursion depth: loglog n.
All the data structures used have O(1) access time,
so the lookup cost is O(loglog n).
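The recursive lookup can be sketched end to end as below. The assumptions are loud ones: B, rank and Ψ are kept as plain lists (so this shows the control flow, not the space bound), only the last level is stored as an explicit suffix array, and the names `build_levels` and `lookup` are illustrative.

```python
def build_levels(sa0, depth):
    """Return ([(B, rank, psi) for each level], explicit last-level array)."""
    levels = []
    sa = sa0
    for _ in range(depth):
        n = len(sa) - 1
        pos = {sa[i]: i for i in range(1, n + 1)}
        b = [0] + [1 - sa[i] % 2 for i in range(1, n + 1)]
        rank = [0] * (n + 1)
        for i in range(1, n + 1):
            rank[i] = rank[i - 1] + b[i]
        psi = [0] + [i if b[i] else pos[sa[i] + 1] for i in range(1, n + 1)]
        levels.append((b, rank, psi))
        sa = [0] + [sa[i] // 2 for i in range(1, n + 1) if b[i]]
    return levels, sa


def lookup(levels, sa_last, i, k=0):
    """Return Sa_k[i] using only the per-level vectors and the last level."""
    if k == len(levels):
        return sa_last[i]
    b, rank, psi = levels[k]
    return 2 * lookup(levels, sa_last, rank[psi[i]], k + 1) + (b[i] - 1)


# Sa_1 of the running example, compressed by two further phases.
SA1 = [0, 8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
LEVELS, SA_LAST = build_levels(SA1, 2)
print([lookup(LEVELS, SA_LAST, i) for i in range(1, 17)])
```

Each recursive call does a constant number of vector accesses, so the cost is proportional to the number of levels.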
22
How the data is stored
The B_k bit vector is stored explicitly
– O(N_k) space
– O(1) lookup
– O(N_k) preprocessing time
The rank_k vector is stored implicitly using Jacobson's rank data structure
– O(N_k (loglog N_k)/log N_k) space
– O(1) lookup
– O(N_k) preprocessing time
The Ψ_k vector is stored implicitly (using rank and select)
23
Ψ k vector representation
24
Let’s Take a look
25
An Example (same table as on slide 16: T, Sa_0, B_0, Rank_0 and Ψ_0 for positions 1 to 16)
26
Example … (same table as on slide 17: T, Sa_0, B_0, Rank_0 and Ψ_0 for positions 17 to 32)
27
So what can we do with all the lists?
Concatenate them in lexicographic order to form the list L_k, e.g. L_1 = {9,1,6,12,14,2,4,5}.
Let's see how we can compute Ψ_k(i):
– If B_k[i] = 1 (Sa_k[i] is even), it is simply i.
– Otherwise, because the prefix patterns are stored in sorted order, L_k holds, up to the point i, one entry for every odd suffix up to i, so h = i − rank_k(i).
– So we can look up the h-th entry of L_k and it gives us the answer.
28
Simple example
L_2 = {5,8,1,4}
Rank_2 = {1,1,1,2,3,3,3,4}
B_2 = {1,0,0,1,1,0,0,1}
Ψ_2 = {1,5,8,4,5,1,4,8}
Ψ_2(3) = ?
Rank_2(3) = 1, h = 3 − 1 = 2, L_2[2] = 8
Ψ_2(3) = 8
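The lookup rule for Ψ_k can be written in a few lines. The sketch uses level 2 of the running example with L_2 = {5, 8, 1, 4}, i.e. the Ψ_2 values at the positions where B_2 is 0, and stores everything as 1-based lists with padding at index 0; `psi_from_list` is an illustrative name.

```python
def psi_from_list(i, b, rank, l_k):
    """Compute psi_k(i) from B_k, rank_k and the concatenated list L_k."""
    if b[i] == 1:
        return i                # even entry: psi is the identity
    h = i - rank[i]             # number of 0 bits in B[1..i]
    return l_k[h]               # the h-th stored companion


# Level 2 of the running example (index 0 is padding).
B2 = [0, 1, 0, 0, 1, 1, 0, 0, 1]
RANK2 = [0, 1, 1, 1, 2, 3, 3, 3, 4]
L2 = [0, 5, 8, 1, 4]

print([psi_from_list(i, B2, RANK2, L2) for i in range(1, 9)])
```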
29
Rank and select
Given a bit vector of length n:
rank(i) is the number of 1 bits up to and including position i
select(i) returns the index of the i-th 1
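The two operations can be pinned down with a naive reference implementation. This sketch takes linear time per query, unlike the constant-time structures discussed in the deck, and assumes 1-based positions over a Python list of 0/1 values.

```python
def rank(bits, i):
    """Number of 1 bits among positions 1..i (1-based, inclusive)."""
    return sum(bits[:i])


def select(bits, i):
    """1-based position of the i-th 1 bit; raises IndexError if there is none."""
    count = 0
    for pos, bit in enumerate(bits, start=1):
        count += bit
        if bit and count == i:
            return pos
    raise IndexError(f"fewer than {i} one bits")


B2 = [1, 0, 0, 1, 1, 0, 0, 1]   # B_2 from the running example
print(rank(B2, 3), select(B2, 2))
```

The two are inverse in the sense that rank(select(i)) = i for every valid i.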
30
Ψ_k vector representation
Lemma 2
Given s integers in sorted order, each containing w bits, where s < 2^w, we can store them with at most s(2 + w − ⌊log s⌋) + O(s/loglog s) bits so that retrieving the h-th integer takes constant time.
31
Ψ_k vector representation
Take the first z = ⌊log s⌋ bits of each integer, creating the integers q_1..q_s.
It's easy to see that q_1 ≤ … ≤ q_i ≤ q_{i+1} ≤ … ≤ q_s < s (we take the most significant bits, after all).
The remaining w − z bits of each integer will be r_i.
[Figure: an integer S_i split into its leading bits q_i and its remaining bits r_i]
32
Ψ_k vector representation
Store the r_i in a simple array, (w − z)·s bits.
Store q_1..q_s in a table supporting select and rank in constant time.
The table Q is implemented in the following way:
instead of saving the numbers themselves, we store the gaps q_1, q_2 − q_1, q_3 − q_2, …, q_s − q_{s−1} in unary representation (a gap i is written as 0^i 1)
and add a select data structure.
33
Ψ_k vector representation
In order to get q_i we simply do select(i) and count the number of zeros before the i-th 1:
q_i = select(i) − rank(select(i)) = select(i) − i
34
Ψ_k vector representation
The size of the Q table is the size of the unary string, s + 2^z ≤ 2s bits, plus the select overhead of O(s/loglog s) bits.
So we can output S_i easily:
S_i = q_i · 2^(w−z) + r_i
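The encoding of Lemma 2 can be sketched as below. Assumptions: the unary string and the r_i array are plain Python lists, select is a linear scan standing in for a constant-time select structure, and the names `encode`, `select1` and `retrieve` are illustrative.

```python
def encode(values, w):
    """Split sorted w-bit integers into unary-coded high parts and raw low parts."""
    s = len(values)
    z = s.bit_length() - 1              # z = floor(log2 s)
    low_bits = w - z
    lows = [v & ((1 << low_bits) - 1) for v in values]   # the r_i
    unary = []
    prev = 0
    for v in values:
        q = v >> low_bits               # the leading z bits, q_i
        unary.extend([0] * (q - prev) + [1])   # gap written as 0^gap 1
        prev = q
    return unary, lows, low_bits


def select1(bits, i):
    """1-based position of the i-th 1 bit (linear scan stand-in)."""
    count = 0
    for pos, bit in enumerate(bits, start=1):
        count += bit
        if bit and count == i:
            return pos
    raise IndexError(i)


def retrieve(unary, lows, low_bits, h):
    """Return the h-th integer (1-based): S_h = q_h * 2^(w-z) + r_h."""
    q = select1(unary, h) - h           # zeros before the h-th 1
    return (q << low_bits) | lows[h - 1]


VALUES = [3, 9, 10, 27, 44, 45, 61]     # s = 7 sorted integers of w = 6 bits
UNARY, LOWS, LOW_BITS = encode(VALUES, 6)
print([retrieve(UNARY, LOWS, LOW_BITS, h) for h in range(1, 8)])
```

The unary string has one 1 per integer and one 0 per unit of q_s, which is where the s + 2^z bound comes from.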
35
Ψ_k vector representation
Lemma 3
We can store the concatenated list L_k used for Ψ_k in n(1/2 + 3/2^(k+1)) + O(n/(2^k loglog n)) bits, so that accessing the h-th element takes constant time, with preprocessing time O(n/2^k + 2^(2^k)).
There are 2^(2^k) lists; number them (even the empty ones).
36
Ψ_k vector representation
Lemma 3 (continued)
Each integer x_i in the lists, 1 ≤ x_i ≤ N_k, is transformed into a new integer x'_i by prefixing it with the binary representation of its list number.
The bit size of x'_i is 2^k + log N_k.
37
Ψ_k vector representation
Lemma 3 (continued)
After concatenating all the lists, we have N_k/2 sorted numbers of 2^k + log N_k bits each.
Using Lemma 2 we get
O(1) access time
and a space bound of n(1/2 + 3/2^(k+1)) + O(n/(2^k loglog n)) bits.
38
Sum it up (space complexity)
39
Rank data structure
Due to Jacobson.
Given a bit vector of length n, rank[i] is the number of 1 bits up to position i.
Multilevel approach:
We slice the bit string into chunks of log² n bits. At each chunk boundary we keep a rank counter.
Each chunk is divided into sub-chunks of ½ log n bits, and a counter is kept at each sub-chunk boundary.
At the bottom level a simple lookup table is used.
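The multilevel idea can be sketched as a small class. This is a simplified illustration rather than Jacobson's exact layout: sub-chunks hold about ½ log n bits, a chunk is 2 log n sub-chunks (about log² n bits), counters are ordinary Python integers, and the class name `RankDirectory` is hypothetical. Here rank(i) counts the 1 bits among the first i bits.

```python
import math
from itertools import product


class RankDirectory:
    """Two levels of counters plus a lookup table over all sub-chunk patterns."""

    def __init__(self, bits):
        log_n = max(2, math.ceil(math.log2(max(len(bits), 2))))
        self.sub = log_n // 2                  # sub-chunk length, about (1/2) log n
        self.per_chunk = 2 * log_n             # sub-chunks per chunk
        pad = -len(bits) % self.sub
        # Pad to whole sub-chunks and add one all-zero sub-chunk so rank(n) is in range.
        padded = list(bits) + [0] * (pad + self.sub)
        self.patterns = [tuple(padded[k:k + self.sub])
                         for k in range(0, len(padded), self.sub)]
        self.chunk_count = []                  # ones before each chunk
        self.sub_count = []                    # ones from chunk start to each sub-chunk
        total = in_chunk = 0
        for t, pattern in enumerate(self.patterns):
            if t % self.per_chunk == 0:
                self.chunk_count.append(total)
                in_chunk = 0
            self.sub_count.append(in_chunk)
            ones = sum(pattern)
            total += ones
            in_chunk += ones
        # table[pattern][j] = number of ones in the first j bits of the pattern.
        self.table = {p: [sum(p[:j]) for j in range(self.sub + 1)]
                      for p in product((0, 1), repeat=self.sub)}

    def rank(self, i):
        """Number of 1 bits among the first i bits, using three lookups."""
        t, j = divmod(i, self.sub)
        return (self.chunk_count[t // self.per_chunk]
                + self.sub_count[t]
                + self.table[self.patterns[t]][j])


# B_0 of the running example (32 bits).
B0 = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,
      1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0]
directory = RankDirectory(B0)
print(directory.rank(16), directory.rank(32))
```

A query adds one chunk counter, one sub-chunk counter and one table entry, which is why the access time is constant.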
40
Rank
[Figure: a worked rank query combining a log² n chunk counter, a ½ log n sub-chunk counter and a lookup-table entry; the output is 14 + 3 + 1]
41
Rank Analysis
42
Compressed suffix arrays in ε^-1 n + O(n) bits and O(log^ε n) access time
In order to break the space barrier we need to store fewer levels => longer lookups.
Let's store only 3 levels: Sa_0, Sa_l' and Sa_l,
with l = ⌈loglog n⌉ and l' = ⌈½ loglog n⌉,
using a dictionary data structure which can tell whether an element is a member of the dictionary and supports a rank query, in O(1) time for both queries.
The space complexity of the dictionary is [formula missing in the source].
We keep in 2 dictionaries, D_0 and D_l', which items we have in the next level (from Sa_0 -> Sa_l' and from Sa_l' -> Sa_l).
43
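The Ψ'_k function
We define the Ψ'_k function, which maps each 1 to its companion 0.
Let's define the φ_k function to be the combination of the two: φ_k(i) = j where Sa_k[j] = Sa_k[i] + 1, i.e. Ψ_k(i) when B_k[i] = 0 and Ψ'_k(i) when B_k[i] = 1.
We just need to merge the indexes in L_k and L'_k.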
44
Example
45
The φ_k function implementation
Lemma 4: We can store the concatenated list used for φ_k
– for k = 0 in n + O(n/loglog n) bits
– for k > 0 in n(1 + 1/2^(k−1)) + O(n/(2^k loglog n)) bits, with preprocessing time O(n/2^k + 2^(2^k))
– If k > 0, simply use Lemma 3.
– If k = 0, encode a and # as 0, and b as 1. Create an n-bit vector L with L[f] = 0 iff the list for φ_0 is that of a or # at position f.
We add select and select0 data structures on top of it, O(n/loglog n) bits.
We also keep the number of 0s in L as c0.
A query φ_0(j) is answered in the following way:
– if j = c0, return select0(c0)
– if j < c0, return select0(j)
– if j > c0, return select(j − c0)
46
The lookup algorithm
To compute Sa[i] we start walking the φ_0 function: i, i', i'', i''', …, where Sa_0[i] + 1 = Sa_0[i'], and so on,
until reaching an entry found in the dictionary D_0.
– Let s be the walk length
– and r the rank of the entry in the dictionary (how many items were already passed to the next level?)
Using r we start walking the next level.
– Let s' be the walk length
– and r' the rank of the entry in the dictionary
We return the following result: Sa_0[i] = 2^l' · (2^(l−l') · Sa_l[r'] − s') − s
The walk length is max(s, s') < 2^l' = O(√log n),
so the query time is O(√log n).
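The two-stage walk can be sketched on the deck's 32-character example, where l = 3 and l' = 2. Several things here are assumptions made for illustration: φ is precomputed as a plain list (with the last suffix mapped to itself), each dictionary is an ordinary Python dict from member index to rank rather than a succinct structure, and the result is reassembled by undoing both walks, Sa_0[i] = 2^l' · (2^(l−l') · Sa_l[r'] − s') − s.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # the deck's convention: a < # < b


def suffix_array(text):
    """1-based suffix array with an unused entry at index 0."""
    def key(i):
        return [ORDER[c] for c in text[i:]]
    return [0] + [i + 1 for i in sorted(range(len(text)), key=key)]


def phi(sa):
    """phi[i] = j with sa[j] == sa[i] + 1; the largest value maps to itself."""
    pos = {sa[i]: i for i in range(1, len(sa))}
    return [0] + [pos.get(sa[i] + 1, i) for i in range(1, len(sa))]


def sparsify(sa, step):
    """Keep the entries divisible by step; return (index -> rank dict, next level)."""
    members = [i for i in range(1, len(sa)) if sa[i] % step == 0]
    dictionary = {i: r for r, i in enumerate(members, start=1)}
    return dictionary, [0] + [sa[i] // step for i in members]


def walk(i, phi_k, dictionary):
    """Follow phi until a dictionary member is hit; return (walk length, rank)."""
    steps = 0
    while i not in dictionary:
        i = phi_k[i]
        steps += 1
    return steps, dictionary[i]


TEXT = "abbabbabbabbabaaabababbabbbabba#"
SA0 = suffix_array(TEXT)
D0, SA_MID = sparsify(SA0, 4)         # level l' = 2 keeps multiples of 2^2
D_MID, SA_TOP = sparsify(SA_MID, 2)   # level l = 3 keeps multiples of 2^(3-2)
PHI0, PHI_MID = phi(SA0), phi(SA_MID)


def lookup(i):
    """Recover Sa_0[i] from the walks and the explicit top level only."""
    s, r = walk(i, PHI0, D0)
    s2, r2 = walk(r, PHI_MID, D_MID)
    return 4 * (2 * SA_TOP[r2] - s2) - s


print([lookup(i) for i in range(1, 9)])
```

Each walk advances the suffix position by one per step, so it stops after fewer than 2^l' steps on the first level and fewer than 2^(l−l') on the second.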
47
The general multilevel build
For every 0 < ε < 1, assume εl is an integer, so 2^(εl) < 2 log^ε n.
Create the levels 0, εl, 2εl, …, l.
The number of levels is ε^-1 + 1 => lookup in O(log^ε n).
48
The General multilevel Build
49
Select data structure
select(i) returns the position of the i-th 1 bit in the string.
Same idea as rank, with a somewhat more complicated multilevel approach.
At the first level we record the position of every (log n loglog n)-th 1 bit
– total space O(N/loglog n)
Between each two recorded bits we keep the following data:
– if the distance between them is r > (log n loglog n)^2, we keep the absolute positions of all the 1 bits between them, log² n loglog n bits
– otherwise we keep the relative position of every (log r loglog n)-th 1 bit
In both cases the space per range is at most r/loglog n bits, and the ranges sum to at most N.
Then we keep one more level (the same notions)
– the block size comes to (log n)^4
50
Select data structure
After that, we keep a lookup table.
For every bit pattern of length (log n)/d, with d ≥ 2, we store
– the number of 1 bits
– the location of the i-th 1 bit in the pattern
As before, the space is O(n^(1/d) log n loglog n).
The lookup is then very simple: walk the levels, get a block, and answer the query on it using the lookup table.
Space complexity: O(n/loglog n).
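The sampling idea behind the first level can be sketched as follows. This is a deliberately simplified stand-in: it keeps one level of samples and then scans, so a query is not constant time, and it omits the lower levels and the lookup table. The class name `SampledSelect` and the parameter `sample_every` are illustrative.

```python
class SampledSelect:
    """Record the position of every t-th 1 bit, then scan forward from the sample."""

    def __init__(self, bits, sample_every):
        self.bits = list(bits)
        self.t = sample_every
        self.samples = []       # samples[k] = 1-based position of the (k*t + 1)-th 1 bit
        ones = 0
        for pos, bit in enumerate(self.bits, start=1):
            if bit:
                if ones % self.t == 0:
                    self.samples.append(pos)
                ones += 1
        self.ones = ones

    def select(self, i):
        """1-based position of the i-th 1 bit."""
        if not 1 <= i <= self.ones:
            raise IndexError(i)
        k, remaining = divmod(i - 1, self.t)
        pos = self.samples[k]   # jump to the nearest sampled 1 bit at or before the target
        while remaining:        # then scan past the remaining 1 bits
            pos += 1
            remaining -= self.bits[pos - 1]
        return pos


B2 = [1, 0, 0, 1, 1, 0, 0, 1]   # B_2 from the running example
index = SampledSelect(B2, sample_every=2)
print([index.select(i) for i in range(1, 5)])
```

The full structure replaces the scan with further levels of sampling and a table lookup so that the work after the jump is bounded by a constant.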