Compressed Index for a Dynamic Collection of Texts H.W. Chan, W.K. Hon, T.W. Lam The University of Hong Kong.


2 Problem Definition
Given a collection L = {T_1, T_2, …, T_k} of texts of total length n over an alphabet Σ.
We want to build an index for L such that, for any given pattern P, the occurrences of P in each T_i can be found quickly.
The index should also support fast insertion of a text T_i into L and fast deletion of a text T_i from L.

3 Previous Work & Our Result

Reference                        Space (bits)  Matching                     Insertion / deletion
[McCreight, JACM ’76]            O(n log n)    O(|P| + occ)                 O(|T_i|)
[Ferragina & Manzini, FOCS ’00]  O(n)          O(|P| log^3 n + occ log n)   amortized O(|T_i| log n) / amortized O(|T_i| log^2 n)
[This paper]                     O(n)          O(|P| log n + occ log^2 n)   worst case O(|T_i| log n)

4 Two Basic Tools: CSA, FM-index
Definition 1: The main component of the CSA (compressed suffix array) for a text T is a function Ψ such that Ψ[i] = SA^{-1}[SA[i] + 1], where SA[i] is the i-th entry of the suffix array of T and SA^{-1} is the inverse of SA.
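
Definition 1 can be made concrete with a small sketch. The code below is an illustration only: it builds the suffix array naively by sorting suffixes, uses 0-based indices instead of the 1-based indices on the slide, and assumes the text ends with a unique terminator `$`. The wrap-around for the last suffix (where SA[i] + 1 falls off the end of the text) is a convention assumed here, not something the slide specifies. The function names are hypothetical.

```python
def suffix_array(text):
    # Naive construction by sorting all suffixes; suitable for tiny examples only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def psi_array(text):
    """Return Psi with Psi[i] = SA^{-1}[SA[i] + 1], using 0-based indices."""
    sa = suffix_array(text)
    n = len(sa)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank          # inverse[p] = rank of the suffix starting at p
    # Assumption: the suffix at the last position wraps around to position 0.
    return [inverse[(pos + 1) % n] for pos in sa]


print(psi_array("banana$"))
```

Within the ranks of suffixes that start with the same character, the Ψ values are increasing, which is the property the index relies on.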

5 Two Basic Tools: CSA, FM-index
Definition 2: The FM-index of T is based on the Burrows-Wheeler array of T, an array of characters denoted by BWT such that BWT[i] = T[SA[i] - 1].
The main component of the FM-index is a set of |Σ| functions count_c, one for every c ∈ Σ, such that count_c[i] = number of occurrences of c in BWT[1…i].
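
As a sketch of Definition 2, the code below computes BWT and the count_c arrays for a small text, then uses them in the standard backward search of the FM-index to count pattern occurrences. It assumes 0-based indices, a unique terminator `$` at the end of the text, and explicit uncompressed arrays; the function names are hypothetical and the real index stores these values far more compactly.

```python
def bwt_and_counts(text):
    """Return (BWT, counts) where counts[c][i] = number of c in BWT[0..i]."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    # For p == 0, text[p - 1] is text[-1], i.e. the terminator (cyclic wrap).
    bwt = "".join(text[p - 1] for p in sa)
    counts = {c: [] for c in sorted(set(text))}
    running = dict.fromkeys(counts, 0)
    for ch in bwt:
        running[ch] += 1
        for c in counts:
            counts[c].append(running[c])
    return bwt, counts


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text by backward search over count_c."""
    bwt, counts = bwt_and_counts(text)
    # first[c] = number of characters in the text that are smaller than c.
    first, total = {}, 0
    for c in counts:                 # keys were inserted in sorted order
        first[c] = total
        total += counts[c][-1]
    lo, hi = 0, len(bwt)             # half-open range of suffix ranks
    for ch in reversed(pattern):
        if ch not in counts:
            return 0
        occ = counts[ch]
        lo = first[ch] + (occ[lo - 1] if lo > 0 else 0)
        hi = first[ch] + (occ[hi - 1] if hi > 0 else 0)
        if lo >= hi:
            return 0
    return hi - lo


print(bwt_and_counts("banana$")[0])
print(count_occurrences("banana$", "ana"))
```

Each count_c array is non-decreasing, so it can be treated as a sequence of increasing values in the same way as Ψ.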

6 Our Index
Our index is a dynamic version of the CSA and the FM-index for the concatenated text T_1 T_2 … T_k.
We exploit the property that both Ψ and count are essentially made up of a small number of sequences of increasing values.

7 Our Index
Maintaining a dynamic CSA and FM-index reduces to maintaining dynamic sequences of increasing values.
Observation 3: A balanced search tree is well suited to a dynamic sequence.
Observation 4: Difference encoding of increasing values saves space.
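
Observation 4 can be illustrated with a short sketch. The slide does not say which code is used for the differences; the example below assumes an Elias gamma code purely for illustration and only counts bits rather than producing a bit string. All function names are hypothetical.

```python
from itertools import accumulate


def to_gaps(values):
    """Difference-encode a non-decreasing sequence of non-negative integers."""
    gaps, prev = [], 0
    for v in values:
        if v < prev:
            raise ValueError("sequence must be non-decreasing")
        gaps.append(v - prev)
        prev = v
    return gaps


def from_gaps(gaps):
    """Recover the original values as prefix sums of the gaps."""
    return list(accumulate(gaps))


def gamma_bits(x):
    """Length in bits of the Elias gamma code of an integer x >= 1."""
    return 2 * (x.bit_length() - 1) + 1


def gap_encoded_bits(values):
    # Gaps may be 0, so gap + 1 is encoded (an assumption of this sketch).
    return sum(gamma_bits(g + 1) for g in to_gaps(values))


def fixed_width_bits(values):
    return len(values) * max(values).bit_length()


values = [3, 4, 7, 8, 9, 15, 1000, 1001]
print(to_gaps(values))
print(gap_encoded_bits(values), "bits vs", fixed_width_bits(values), "bits")
```

When most gaps are small, the gap-encoded size is well below the fixed-width size; the price is that reading the i-th value requires summing the preceding gaps.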

8 Our Index
Combining Observations 3 and 4 gives a Differential Balanced Search Tree, which handles the values in the dynamic CSA and FM-index.
Drawback: computing Ψ and count is slowed down by an O(log n) factor.
Pattern matching: O(|P| log n + occ log^2 n) time.
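
The idea of combining a balanced tree with difference encoding might be sketched as follows. This is not the authors' data structure: it is a simplified stand-in that assumes a treap as the balanced tree, stores one uncompressed gap per node, and handles non-negative integers only. Every node keeps the size and the sum of gaps of its subtree, so the i-th value is recovered by summing gaps along one root-to-node path. All names are hypothetical.

```python
import random


class Node:
    __slots__ = ("gap", "prio", "left", "right", "size", "total")

    def __init__(self, gap):
        self.gap, self.prio = gap, random.random()
        self.left = self.right = None
        self.size, self.total = 1, gap


def _pull(t):
    # Recompute subtree size and subtree sum of gaps from the children.
    t.size = 1 + (t.left.size if t.left else 0) + (t.right.size if t.right else 0)
    t.total = t.gap + (t.left.total if t.left else 0) + (t.right.total if t.right else 0)
    return t


def _merge(a, b):
    if not a or not b:
        return a or b
    if a.prio > b.prio:
        a.right = _merge(a.right, b)
        return _pull(a)
    b.left = _merge(a, b.left)
    return _pull(b)


def _split(t, v):
    """Split so the left part holds every element whose value is <= v."""
    if not t:
        return None, None
    here = (t.left.total if t.left else 0) + t.gap   # value of t inside this subtree
    if here <= v:
        a, b = _split(t.right, v - here)
        t.right = a
        return _pull(t), b
    a, b = _split(t.left, v)
    t.left = b
    return a, _pull(t)


class DifferentialSequence:
    def __init__(self):
        self.root = None

    def insert(self, v):
        """Insert value v (v >= 0), keeping the sequence sorted."""
        left, right = _split(self.root, v)
        delta = v - (left.total if left else 0)       # gap to the predecessor
        node = right
        while node:                                   # successor's gap shrinks by delta
            node.total -= delta
            if node.left is None:
                node.gap -= delta
                break
            node = node.left
        self.root = _merge(_merge(left, Node(delta)), right)

    def get(self, i):
        """Return the i-th smallest value (0-based) as a prefix sum of gaps."""
        node, acc = self.root, 0
        while node:
            ls = node.left.size if node.left else 0
            lt = node.left.total if node.left else 0
            if i == ls:
                return acc + lt + node.gap
            if i < ls:
                node = node.left
            else:
                acc, i, node = acc + lt + node.gap, i - ls - 1, node.right
        raise IndexError("index out of range")


seq = DifferentialSequence()
for v in [50, 3, 17, 99, 42]:
    seq.insert(v)
print([seq.get(i) for i in range(5)])
```

Reading a value walks one path of expected logarithmic length, which matches the O(log n) slowdown noted on the slide; a space-efficient version would pack many encoded gaps into each node instead of one.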

9 Insertion & Deletion (sketch)
Insertion corresponds to finding update points in the increasing sequences of Ψ and count.
To insert a text T into L, there are O(|T|) such update points.
The update points can be found by simulating a pattern matching query of T against L.
Total time: O(|T| log n).

10 Insertion & Deletion (sketch)
Deletion reverses the insertion process.
The update points can be found by querying Ψ iteratively, instead of simulating a pattern matching query.
Total time: O(|T| log n).
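
How iterating Ψ locates the update points for a deletion can be sketched on a tiny concatenated text. The code assumes explicit, naively built arrays, 0-based indices, `$` as the terminator of each text, and that the rank of the first suffix of the text to delete is already known (here it is simply looked up in SA^{-1}). The function name and its parameters are hypothetical.

```python
def ranks_of_text(concat, start, length):
    """Return the suffix ranks of concat[start : start + length], in text order.

    Only the first rank is looked up directly; the rest follow by applying Psi.
    """
    sa = sorted(range(len(concat)), key=lambda i: concat[i:])  # naive suffix array
    n = len(sa)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    psi = [inverse[(pos + 1) % n] for pos in sa]  # wrap-around is an assumption

    ranks, rank = [], inverse[start]
    for _ in range(length):
        ranks.append(rank)       # one update point per suffix of the deleted text
        rank = psi[rank]         # move to the rank of the next, shorter suffix
    return ranks


# Collection {"ab", "cab"} stored as "ab$cab$"; the second text starts at position 3.
print(ranks_of_text("ab$cab$", 3, 4))
```

Each step is a single Ψ query, so a text T gives |T| queries, consistent with the O(|T| log n) total when each query costs O(log n).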

11 Conclusion, Progress & Future Work
In the literature there is a dual problem called Dictionary Management: maintain a collection of patterns such that, when a text T is given later, all occurrences of each pattern in T are reported in one query.
Fast insertion/deletion of patterns is also required.
O(n) bits: some progress …

12 Conclusion, Progress & Future Work
There is another problem called Dynamic Text: maintain a single text T such that, when a pattern P is given later, all occurrences of P in T can be found.
The text T is subject to insertion/deletion of substrings.
O(n log n) bits: Sahinalp & Vishkin, FOCS ’96
O(n) bits: ??

