Published by Meagan Barker. Modified over 8 years ago.
1
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo http://researchmap.jp/sada/resources/
2
Standard Data Structures for Strings
Operations
  – Number of occurrences and locations of a pattern
  – Common substrings, maximal patterns
  – Alignment of two strings
Standard data structures
  – Suffix trees [1,2]
  – Suffix arrays [3]
Size: string size + O(n log n) bits
  – DNA sequence of a human: 3 billion letters (750 MB)
  – Its suffix tree: 40 GB
3
Suffixes of a String
Strings made by omitting letters from the beginning of a string T.
There are n suffixes of a string of length n.
Any substring of T is a prefix of a suffix of T.
T = ababac$
  1  ababac$
  2  babac$
  3  abac$
  4  bac$
  5  ac$
  6  c$
  7  $
4
Suffix Arrays [3]
An array storing pointers to the suffixes of T, sorted lexicographically.
Size: n log n + n log |A| bits
  – A: the alphabet
  – |A|: alphabet size
Time for searching a pattern P: O(|P| log n)
  i  SA[i]  suffix
  1  7      $
  2  1      ababac$
  3  3      abac$
  4  5      ac$
  5  2      babac$
  6  4      bac$
  7  6      c$
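The table and the O(|P| log n) search can be reproduced with a minimal sketch. It assumes a naive construction (sorting the suffixes directly, which is not the linear-time method used in practice), and the names `suffix_array` and `find_range` are illustrative only.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by sorting its suffixes directly."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_range(text, sa, pattern):
    """Return the half-open index range [lo, hi) of sa whose suffixes start with pattern.

    Two binary searches, each comparing at most |P| characters per step.
    """
    m = len(pattern)

    def prefix(k):
        start = sa[k] - 1          # convert the 1-based position to a Python index
        return text[start:start + m]

    lo, hi = 0, len(sa)
    while lo < hi:                 # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(sa)
    while lo < hi:                 # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


sa = suffix_array("ababac$")
lo, hi = find_range("ababac$", sa, "ab")
print(sa, sorted(sa[lo:hi]))
```

The sort relies on '$' being smaller than every letter, which holds for ASCII.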
5
Compressed Suffix Arrays (CSA) [4,5]
Instead of storing SA, store Ψ[i] = SA^-1[SA[i]+1]
Size: O(n log |A|) bits
Time for searching a pattern P: O(|P| log n)
  i  Ψ[i]  SA[i]  suffix
  1  0     7      $
  2  5     1      ababac$
  3  6     3      abac$
  4  7     5      ac$
  5  3     2      babac$
  6  4     4      bac$
  7  1     6      c$
6
How to Compute Ψ
1. Construct the suffix array SA
2. Radix-sort i w.r.t. (T[SA[i]-1], i)
   1. Count the number of occurrences of each character in T
   2. For i = 1, 2, ..., n, let c = T[SA[i]-1]
   3. Write i into the next free position of the range of Ψ corresponding to c
Time complexity: O(n)
Example (T = ababac$):
  i  = 1 2 3 4 5 6 7
  SA = 7 1 3 5 2 4 6
  Ψ  = 0 5 6 7 3 4 1
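The counting procedure above can be sketched as follows. This is a minimal illustration, assuming 1-based SA values passed as a Python list and Ψ = 0 for the suffix that has no successor; `compute_psi` is a hypothetical name.

```python
def compute_psi(text, sa):
    """Compute Psi[i] = SA^-1[SA[i]+1] from a 1-based suffix array in O(n) time."""
    n = len(text)

    # Step 1: count the occurrences of each character in T.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # First free slot (1-based) of the range of Psi belonging to each character.
    next_slot = {}
    pos = 1
    for ch in sorted(counts):
        next_slot[ch] = pos
        pos += counts[ch]

    psi = [0] * (n + 1)            # psi[0] is unused padding for 1-based access
    for i in range(1, n + 1):      # Steps 2 and 3
        if sa[i - 1] == 1:
            continue               # T[0] does not exist, nothing to write
        c = text[sa[i - 1] - 2]    # T[SA[i]-1] in 1-based terms
        psi[next_slot[c]] = i
        next_slot[c] += 1
    return psi[1:]


print(compute_psi("ababac$", [7, 1, 3, 5, 2, 4, 6]))
```

Because i is visited in increasing order, each character's range is filled with increasing values, which is exactly the property the compression relies on.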
7
Why can Ψ be compressed?
Suffixes are stored in lexicographic order.
Lexicographic order does not change if the first letter is removed from suffixes sharing the same first letter.
Example (T = いるかいないかいないかいるかいるいるいるか, a Japanese hiragana string kept as example data):
  i   SA[i]  Ψ[i]  suffix
  1   6      12    いかいないかいるかいるいるいるか
  2   10     14    いかいるかいるいるいるか
  3   4      15    いないかいないかいるかいるいるいるか
  4   8      16    いないかいるかいるいるいるか
  5   15     17    いるいるいるか
  6   17     18    いるいるか
  7   19     19    いるか
  8   1      20    いるかいないかいないかいるかいるいるいるか
  9   12     21    いるかいるいるいるか
  10  21     0     か
  11  3      3     かいないかいないかいるかいるいるいるか
  12  7      4     かいないかいるかいるいるいるか
  13  14     5     かいるいるいるか
  14  11     9     かいるかいるいるいるか
  15  5      1     ないかいないかいるかいるいるいるか
  16  9      2     ないかいるかいるいるいるか
  17  16     6     るいるいるか
  18  18     7     るいるか
  19  20     10    るか
  20  2      11    るかいないかいないかいるかいるいるいるか
  21  13     13    るかいるいるいるか
8
Properties of CSA
If i < j, then T[SA[i]] ≤ T[SA[j]]
If i < j and T[SA[i]] = T[SA[j]], then Ψ[i] < Ψ[j]
Proof: If T[SA[i]] = T[SA[j]], the lexicographic order of the two suffixes is determined by the letters at position 2 or later. Since i < j, T[SA[i]+1..n] < T[SA[j]+1..n].
Let SA[i'] = SA[i]+1 and SA[j'] = SA[j]+1; then i' < j'.
That is, i' = SA^-1[SA[i]+1] = Ψ[i] < Ψ[j] = j'.
Example (T = ababac$):
  SA = 7 1 3 5 2 4 6
  Ψ  = 0 5 6 7 3 4 1
9
Succinct Data Structure for Bit Vectors
B: 0,1 vector of length n, B[0]B[1]…B[n-1]
  – Lower bound of size = log 2^n = n bits
Queries
  – rank(B, x): number of ones in B[0..x] = B[0]B[1]…B[x]
  – select(B, i): position of the i-th 1 from the head (i ≥ 1)
Theorem: rank and select on a bit vector of length n can be computed in constant time on the word RAM with word length Ω(log n) bits, using n + O(n log log n / log n) bits.
Example (n = 16): B = 1001010001000000, with ones at positions 0, 3, 5, 9
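The two queries can be illustrated with a small sketch. It is not succinct: it stores full prefix counts and a list of one-positions (O(n log n) bits) instead of the n + o(n) bit structure of the theorem, and only shows what rank and select return. The class name `BitVector` is an assumption for illustration.

```python
class BitVector:
    """Plain (non-succinct) rank/select over a 0/1 string, positions starting at 0."""

    def __init__(self, bits):
        self.bits = [int(b) for b in bits]
        # prefix[x + 1] = number of ones in B[0..x]
        self.prefix = [0]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        # positions of the ones, in increasing order
        self.ones = [pos for pos, b in enumerate(self.bits) if b == 1]

    def rank(self, x):
        """Number of ones in B[0..x] (inclusive)."""
        return self.prefix[x + 1]

    def select(self, i):
        """Position of the i-th one, for i >= 1."""
        return self.ones[i - 1]


bv = BitVector("1001010001000000")
print(bv.rank(5), bv.select(3))
```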
10
How to Encode Ψ
Ψ'[i] = T[SA[i]]·n + Ψ[i] is used
  – Ψ[i] = Ψ'[i] mod n
  – T[SA[i]] = Ψ'[i] div n
Ψ'[i] (i = 1, 2, ..., n) forms an increasing sequence
  – n(2 + log σ) bits (σ: alphabet size)
Example (Ψ values grouped by first letter; Ψ' in binary is the 2-bit letter code followed by Ψ in 4 bits):
  $: 2         000010
  a: 5, 8, 9   010101, 011000, 011001
  c: 3, 4      100011, 100100
  g: 1, 6, 7   110001, 110110, 110111
11
How to Encode Ψ'
Upper log n bits of the binary encoding of Ψ'[i]
  – Encode the difference from the preceding value in unary code
  – At most 2n bits (#ones = n, #zeros ≤ n)
Lower log σ bits of Ψ'[i] are stored as they are
  – n log σ bits
Example:
  Ψ' = 000010, 010101, 011000, 011001, 100011, 100100, 110001, 110110, 110111
  Upper bits (unary differences): 1, 000001, 01, 1, 001, 01, 0001, 01, 1
  Lower bits:                     10, 01, 00, 01, 11, 00, 01, 10, 11
12
Decoding of Ψ'
Upper digits: x = select(H, i) - i
Lower digits: y = L[i]
Ψ'[i] = x·σ + y
Time: O(1)
Space: n(2 + log σ) + O(n log log n / log n) bits
Example:
  H: 1, 000001, 01, 1, 001, 01, 0001, 01, 1
  L: 10, 01, 00, 01, 11, 00, 01, 10, 11
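The encoding and decoding steps can be checked on the slide's example with a minimal sketch. It assumes values split into upper bits and 2 lower bits, positions in H counted from 1, and a linear-scan select in place of the constant-time structure; the function names are illustrative.

```python
def encode(values, low_bits):
    """Split an increasing sequence into unary-coded upper parts (H) and raw lower parts (L)."""
    high, low = [], []
    previous_upper = 0
    for v in values:
        upper = v >> low_bits
        high.extend([0] * (upper - previous_upper))  # difference in unary: zeros...
        high.append(1)                               # ...terminated by a one
        low.append(v & ((1 << low_bits) - 1))
        previous_upper = upper
    return high, low


def select1(bits, i):
    """1-based position of the i-th one (linear scan, for illustration only)."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        seen += b
        if seen == i:
            return pos
    raise IndexError("fewer than i ones")


def decode(high, low, low_bits, i):
    """Recover the i-th value (i >= 1): the zeros before the i-th one give the upper part."""
    x = select1(high, i) - i
    return (x << low_bits) | low[i - 1]


values = [0b000010, 0b010101, 0b011000, 0b011001, 0b100011,
          0b100100, 0b110001, 0b110110, 0b110111]
H, L = encode(values, 2)
print("".join(map(str, H)), L)
```

Shifting by `low_bits` corresponds to multiplying x by σ when σ is a power of two, as in the example with a 2-bit letter code.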
13
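Compressing Ψ
Divide the Ψ[i]'s according to T[SA[i]]: S(c) = { Ψ[i] | T[SA[i]] = c }
Encode each S(c) separately: n_c (2 + log(n/n_c)) bits, where n_c = |S(c)|
In total: n(2 + H_0) bits, with H_0 ≤ log σ (equality holds if p_1 = p_2 = …)
(p_c = n_c/n: probability of letter c)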
14
How to Access SA[i]
For each i that is a multiple of log n, store SA[i] in SA_2
  k = 0; w = log n;
  while (i % w != 0) { i = Ψ[i]; k++; }
  return SA_2[i / w] - k;
Access time: O(log n) on average
Example (T = ababac$, n = 8, w = 3; row i = 0 holds SA[0] = 8):
  i    = 0 1 2 3 4 5 6 7
  SA   = 8 7 1 3 5 2 4 6
  Ψ    =   0 5 6 7 3 4 1
  SA_2 = 8 3 4   (SA[0], SA[3], SA[6])
15
B[i] = 1 ⇔ SA[i] is a multiple of log n
If SA[i] is a multiple of w = log n, store SA[i] / w in SA_2
  k = 0; w = log n;
  while (B[i] != 1) { i = Ψ[i]; k++; }
  return SA_2[rank(B, i)]·w - k;
B: n + o(n) bits
Access time: O(log n) (worst case)
Example (w = 4):
  i    = 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
  T    = E  B  D  E  B  D  D  A  D  D  E  B  E  B  D  C
  SA   = 8  14 5  2  12 16 7  15 6  9  3  10 13 4  1  11
  Ψ    = 10 8  9  11 13 15 1  6  7  12 14 16 2  3  4  5
  B    = 1  0  0  0  1  1  0  0  0  0  0  0  0  1  0  0
  SA_2 = 2 3 4 1
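The lookup loop can be exercised on the example text with a short sketch. It assumes a naive suffix sort, a cyclic Ψ (the suffix at position n maps to the suffix at position 1, as in the example row Ψ[6] = 15), a text length that is a multiple of w, and a linear-time rank; `build` and `lookup` are hypothetical names.

```python
def build(text, w):
    """Build SA, Psi, the mark vector B and the samples SA_2 (all conceptually 1-based)."""
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda i: text[i - 1:])
    inverse = {pos: i for i, pos in enumerate(sa, start=1)}
    psi = [inverse[pos % n + 1] for pos in sa]        # position n wraps to 1
    marks = [1 if pos % w == 0 else 0 for pos in sa]  # B[i] = 1 iff SA[i] is a multiple of w
    samples = [pos // w for pos in sa if pos % w == 0]
    return sa, psi, marks, samples


def lookup(i, psi, marks, samples, w):
    """Return SA[i] using only Psi, B and SA_2."""
    k = 0
    while marks[i - 1] != 1:      # each step moves to the suffix one position later in T
        i = psi[i - 1]
        k += 1
    r = sum(marks[:i])            # rank(B, i); a real index answers this in O(1)
    return samples[r - 1] * w - k


text = "EBDEBDDADDEBEBDC"
sa, psi, marks, samples = build(text, 4)
print([lookup(i, psi, marks, samples, 4) for i in range(1, 17)])
```

At most w - 1 steps are needed before a marked entry is reached, which gives the worst-case bound.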
16
Hierarchical Representation of Ψ
At level i
  – Consecutive 2^i letters of T are regarded as one letter
  – Entropy of the string does not increase
Example:
  T         = E B D E B D D A D D E B E B D C
  SA        = 8 14 5 2 12 16 7 15 6 9 3 10 13 4 1 11
  B         = 1 1 0 1 1 1 0 0 1 0 0 1 0 1 0 0
  Level 1   = BD EB DD AD DE BE BD C$
  SA_1      = 4 7 1 6 8 3 5 2
  SA_2      = 2 3 4 1
17
Size of the Data Structure
If the number of levels is 1/ε:
  – Ψ: (1/ε) n (3 + H_0) bits
  – SA_{1/ε}: (n / log n) · log n = n bits
  – B: n + n/2 + n/4 + ... ≤ 2n bits
Total: at most (1/ε) n (3 + H_0) + 3n bits
Time to compute SA[i]: [formula missing]
18
Searching Substrings
Binary search can be done without using the actual values of SA
  i            = 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
  T            = E  B  D  E  B  D  D  A  D  D  E  B  E  B  D  C
  SA           = 8  14 5  2  12 16 7  15 6  9  3  10 13 4  1  11
  first letter = A  B  B  B  B  C  D  D  D  D  D  D  E  E  E  E
  Ψ            = 10 8  9  11 13 15 1  6  7  12 14 16 2  3  4  5
  r            = 1  2  2  2  2  3  4  4  4  4  4  4  5  5  5  5
  D            = 1  1  0  0  0  1  1  0  0  0  0  0  1  0  0  0
  C            = A B C D E (the letter for r = 1, 2, 3, 4, 5)
19
Backward Search
To search for P = P[1..p]:
  for (k = p; k >= 1; k--)
    restrict C[P[k]] to those i whose Ψ[i] lies in the range found for P[k+1..p]
C[$] = [1,1], C[a] = [2,4], C[b] = [5,6], C[c] = [7,7]
O(p log n) time
Example (T = ababac$):
  SA = 7 1 3 5 2 4 6
  Ψ  = 0 5 6 7 3 4 1
20
Binary search w.r.t. Ψ: O(log n) time
Search time for P: O(|P| log n) time
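Backward search over Ψ can be sketched as follows for T = ababac$. This is a minimal illustration, assuming Ψ is held as a plain Python list and the ranges C[c] as a dictionary; `backward_search` is an illustrative name. Each step is one pair of binary searches inside the range of the current letter, where Ψ is increasing.

```python
from bisect import bisect_left, bisect_right


def backward_search(pattern, psi, char_range):
    """Return the 1-based SA range [l, r] of pattern, or None if it does not occur.

    psi[j] holds Psi[j+1]; char_range[c] is the inclusive 1-based range C[c].
    """
    l, r = 1, len(psi)                       # range of the empty pattern
    for c in reversed(pattern):              # k = p, p-1, ..., 1
        if c not in char_range:
            return None
        a, b = char_range[c]
        # Psi is increasing on [a, b], so the i with Psi[i] in [l, r] are contiguous.
        new_l = bisect_left(psi, l, a - 1, b) + 1
        new_r = bisect_right(psi, r, a - 1, b)
        if new_l > new_r:
            return None
        l, r = new_l, new_r
    return l, r


psi = [0, 5, 6, 7, 3, 4, 1]
C = {"$": (1, 1), "a": (2, 4), "b": (5, 6), "c": (7, 7)}
print(backward_search("aba", psi, C))
```

With SA = 7 1 3 5 2 4 6, the range returned for "aba" covers SA[2] = 1 and SA[3] = 3, the two occurrences in the text.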
21
Partial Decoding of String
To decode T[9..13] = DDEBE:
  1. Compute i = SA^-1[9] = 10
  2. Find the first letter of the suffix with lex. order i
  3. Traverse Ψ from i = 10
  i            = 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
  T            = E  B  D  E  B  D  D  A  D  D  E  B  E  B  D  C
  SA           = 8  14 5  2  12 16 7  15 6  9  3  10 13 4  1  11
  first letter = A  B  B  B  B  C  D  D  D  D  D  D  E  E  E  E
  Ψ            = 10 8  9  11 13 15 1  6  7  12 14 16 2  3  4  5
  C            = A B C D E
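The three steps can be traced with a small sketch that uses only Ψ, the bit vector D marking where the first letter changes, and the letter table C from the earlier search slide. The linear-time rank and the name `decode_substring` are assumptions for illustration.

```python
def decode_substring(i, length, psi, d_bits, letters):
    """Decode `length` letters of T starting at the suffix with lexicographic order i (1-based)."""
    out = []
    for _ in range(length):
        r = sum(d_bits[:i])          # rank(D, i): index of the first letter of suffix i
        out.append(letters[r - 1])
        i = psi[i - 1]               # move to the suffix that starts one position later
    return "".join(out)


psi = [10, 8, 9, 11, 13, 15, 1, 6, 7, 12, 14, 16, 2, 3, 4, 5]
d_bits = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(decode_substring(10, 5, psi, d_bits, "ABCDE"))
```

Starting from i = 10 the traversal visits 10, 12, 16, 5, 13, whose first letters spell DDEBE.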
22
Functions of Compressed Suffix Arrays
lookup(i): returns SA[i] (O(log n) time)
inverse(i): returns SA^-1[i] (O(log n) time)
Ψ[i]: returns SA^-1[SA[i]+1] (O(1) time)
substring(i, l): returns T[SA[i]..SA[i]+l-1]
  – O(l) time
  – (T[SA[i]] is computed by rank on a 0,1 vector of length n)
23
Problems of CSA
Size is n(H_0(S) + O(1)) bits
Want to compress into nH_k(S) + o(n) bits
24
References
[1] P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pages 1–11, 1973.
[2] E. M. McCreight. A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262–272, 1976.
[3] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. Proc. SODA, 1990.
[4] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378–407, 2005.
[5] K. Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294–313, 2003.
[6] M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
25
[7] J. G. Cleary and I. H. Witten. A comparison of enumerative and adaptive codes. IEEE Transactions on Information Theory, 30(2):306–315, 1984.
[8] K. Sadakane. A Modified Burrows-Wheeler Transformation for Case-Insensitive Search with Application to Suffix Array Compression. Data Compression Conference, page 548, 1999.
[9] P. Ferragina and G. Manzini. Indexing compressed texts. Journal of the ACM, 52(4):552–581, 2005.
26
Block-Sort Compression Algorithm
Burrows, Wheeler 1994 [6]
Compression ratio is better than gzip, close to PPM [7]
Compression is faster than PPM
Decompression is much faster
  – Suitable for distributing data
27
Block-Sort Compression Algorithm
  ababac$
    ↓ BW transform (suffix sorting)
  c$bbaaa
    ↓ MTF transform
  3441411
    ↓ Huffman code / arithmetic code
  011 00100 00100 1 00100 1 1
Code table:
  1 → 1
  2 → 010
  3 → 011
  4 → 00100
  5 → 00101
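The pipeline can be reproduced on the example with a short sketch. Two points are assumptions read off the example rather than stated on the slide: the move-to-front list starts as a, b, c, $ with 1-based output, and the code table coincides with the Elias gamma code, which stands in here for the Huffman or arithmetic coder. The function names are illustrative.

```python
def bwt(text):
    """BW transform via naive suffix sorting; text must end with a unique smallest terminator."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in order)   # text[-1] supplies the wrap-around for i = 0


def move_to_front(s, initial_order):
    """Replace each character by its 1-based position in a list, then move it to the front."""
    table = list(initial_order)
    out = []
    for ch in s:
        idx = table.index(ch)
        out.append(idx + 1)
        table.insert(0, table.pop(idx))
    return out


def gamma_code(v):
    """Elias gamma code of v >= 1: (length - 1) zeros followed by v in binary."""
    binary = bin(v)[2:]
    return "0" * (len(binary) - 1) + binary


bw = bwt("ababac$")
ranks = move_to_front(bw, "abc$")
print(bw, ranks, " ".join(gamma_code(v) for v in ranks))
```

Runs of equal letters in the BW string become runs of 1s after move-to-front, which is what the final entropy coder exploits.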
28
Suffix Array
T = acagcagg$
Suffixes in text order:
  1  acagcagg$
  2  cagcagg$
  3  agcagg$
  4  gcagg$
  5  cagg$
  6  agg$
  7  gg$
  8  g$
  9  $
After lexicographic sort (SA, suffix):
  9  $
  1  acagcagg$
  3  agcagg$
  6  agg$
  2  cagcagg$
  5  cagg$
  8  g$
  4  gcagg$
  7  gg$
29
BW Transform
BW[i] = T[SA[i]-1]
It consists of characters sorted in the lex. order of the suffixes that follow them
BW is a permutation of T
T can be recovered from BW
BW is compressed by a simple (order-0) compression algorithm
T = acagcagg$, BW = g$ccaggaa
  SA  BW  suffix
  9   g   $
  1   $   acagcagg$
  3   c   agcagg$
  6   c   agg$
  2   a   cagcagg$
  5   g   cagg$
  8   g   g$
  4   a   gcagg$
  7   a   gg$
30
Inverse BW Transform
T can be recovered from BW
SA can also be recovered [8]
  BW = g $ c c a g g a a
  SA = 9 1 3 6 2 5 8 4 7
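One common way to invert the transform is to follow the LF mapping backwards through the text. The sketch below assumes the terminator '$' is unique and smaller than every other character, so that row 1 of the sorted suffixes is the suffix "$"; `inverse_bwt` is an illustrative name, and this is not necessarily the procedure of [8].

```python
def inverse_bwt(bw, terminator="$"):
    """Recover T from its BW transform by walking the LF mapping from the row of '$'."""
    # first[c] = number of characters in bw that are smaller than c
    counts = {}
    for ch in bw:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # lf[i] = row (0-based) of the suffix that starts one position earlier than row i
    seen, lf = {}, []
    for ch in bw:
        lf.append(first[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    row = 0                          # row 0 is the suffix consisting of the terminator only
    reversed_text = [terminator]
    for _ in range(len(bw) - 1):
        reversed_text.append(bw[row])  # the character preceding the current suffix
        row = lf[row]
    return "".join(reversed(reversed_text))


print(inverse_bwt("g$ccaggaa"))
```

Recording the visited rows together with their text positions during the same walk would also yield SA.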
31
FM-index [9]
Pattern search can be done using only BW
To search for P = P[1..p]:
  c = P[p]
  l = C[c]+1, r = C[c+1]        (c+1: the letter following c in the alphabet)
  while (--p > 0) {
    c = P[p]
    l = C[c] + rank_c(BW, l-1) + 1
    r = C[c] + rank_c(BW, r)
  }
[l, r] is the range of lex. orders of P
  i            = 1 2 3 4 5 6 7 8 9
  BW           = g $ c c a g g a a
  first letter = $ a a a c c g g g
  C: $: 0, a: 1, c: 4, g: 6
32
Substring Search
Given the SA range [l, r] for pattern P, the range [l', r'] for cP is computed by
  l' = C[c] + rank_c(BW, l-1) + 1
  r' = C[c] + rank_c(BW, r)
Example (T = acagcagg$):
  BW = g $ c c a g g a a
  SA = 9 1 3 6 2 5 8 4 7
  C: $: 0, a: 1, c: 4, g: 6
  g:   [7,9]
  ag:  [3,4]
  cag: [5,6]
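The range update can be checked against the example with a minimal sketch. It assumes rank is computed by scanning the BW string (a real FM-index uses a wavelet tree or similar) and that the upper end of the initial range is C[c] plus the count of c, which equals C[c+1]; `fm_search` is an illustrative name.

```python
def fm_search(pattern, bw):
    """Return the 1-based SA range [l, r] of a non-empty pattern, or None if it does not occur."""
    counts = {}
    for ch in bw:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0                 # C[c] = number of characters smaller than c
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(c, i):
        """Occurrences of c in BW[1..i] (linear scan, for illustration only)."""
        return bw[:i].count(c)

    c = pattern[-1]
    if c not in C:
        return None
    l, r = C[c] + 1, C[c] + counts[c]
    for c in reversed(pattern[:-1]):  # extend the match one letter to the left
        if c not in C:
            return None
        l = C[c] + rank(c, l - 1) + 1
        r = C[c] + rank(c, r)
        if l > r:
            return None
    return l, r


bw = "g$ccaggaa"
print(fm_search("g", bw), fm_search("ag", bw), fm_search("cag", bw))
```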
33
LF mapping
LF[i] = C[BW[i]] + rank_{BW[i]}(BW, i)
LF[i] is the lex. order of the suffix T[j-1..n] for j = SA[i]
Example: LF[5] = C[a] + rank_a(BW, 5) = 1 + 1 = 2; SA[5] = 2 and SA[2] = 2 - 1 = 1
  i            = 1 2 3 4 5 6 7 8 9
  BW           = g $ c c a g g a a
  first letter = $ a a a c c g g g
  C: $: 0, a: 1, c: 4, g: 6
34
If BW is stored using the wavelet tree, rank can be computed in O(log σ / log log n) time
Pattern search takes O(|P| log σ / log log n) time
Size of BW: nH_0(BW) + O(n log σ / log log n) bits
If the indexes for lookup/inverse store every d-th suffix
  – Size: O(n log n / d) bits
  – Time: O(d log σ / log log n)
To make the index size o(n), set d = log^(1+ε) n
35
Entropy of String
Definition: order-0 entropy H_0 of string S
  H_0(S) = Σ_c p_c log(1/p_c)
  (p_c: probability of appearance of letter c)
Definition: order-k entropy
  – Assumption: Pr[S[i] = c] is determined from S[i-k..i-1] (context)
  H_k(S) = (1/n) Σ_s n_s Σ_c p_{s,c} log(1/p_{s,c})
  – n_s: the number of letters whose context is s
  – p_{s,c}: probability of c appearing in context s
Example string: abcababc
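Both definitions can be evaluated empirically with a short sketch. It assumes probabilities are estimated by relative frequencies and that the first k letters, which have no full context, are skipped; conventions differ on that last point. The names `h0` and `hk` are illustrative.

```python
from collections import Counter
from math import log2


def h0(symbols):
    """Empirical order-0 entropy in bits per symbol of a string or list of symbols."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return sum((count / n) * log2(n / count) for count in Counter(symbols).values())


def hk(s, k):
    """Empirical order-k entropy: order-0 entropies of the per-context groups, weighted by size."""
    n = len(s)
    followers = {}
    for i in range(k, n):
        # the k letters before position i form the context of s[i]
        followers.setdefault(s[i - k:i], []).append(s[i])
    return sum(len(group) * h0(group) for group in followers.values()) / n


print(h0("abcababc"), hk("abcababc", 1))
```

For a string such as "abababab" the order-0 entropy is 1 bit per letter, while the order-1 entropy is 0 because each letter is fully determined by its predecessor.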
36
Higher-Order Compression of Strings
In the string after BWT, characters with the same context are gathered
Compress the substring for each context to its H_0 ⇒ achieve H_k in total
  BW  suffix      context
  g   $           $
  $   acagcagg$   a
  c   agcagg$     a
  c   agg$        a
  a   cagcagg$    c
  g   cagg$       c
  g   g$          g
  a   gcagg$      g
  a   gg$         g
37
Summary: FM-index
Assume σ = polylog(n)
Index size: nH_k(S) + o(n) bits
Pattern search: O(|P|) time
lookup/inverse: O(log^(1+ε) n) time
Decoding a substring of length l: O(l + log^(1+ε) n) time