1 Compressed Index for Dictionary Matching
WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)
2 Outline
- Dictionary Matching Problem
- Summary of Results
- Description of Our Solution (Brief), based on:
  (I) Suffix Tree
  (II) A Simple Sampling Idea
  (III) Handling Irregularities
- Open Problems
3 Dictionary Matching
Input: A set of d short patterns { P_1, P_2, …, P_d } of total length n
Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report, for each P_j, all positions in T where it occurs
4 Dictionary Matching
Relevant parameters for measuring the index's performance:
- d = number of patterns
- n = total length of patterns
- |T| = length of T
- σ = size of the alphabet of T and the patterns
- occ = total number of occurrences in the search result
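To make the problem statement and its parameters concrete, here is a minimal brute-force sketch. It is not an index at all, only a reference for what a correct answer looks like. The function name `naive_dictionary_match`, the sample text, and the use of 0-based positions are assumptions made for illustration; the dictionary is the one used in the figures.

```python
def naive_dictionary_match(patterns, text):
    """Return {j: [0-based start positions of patterns[j] in text]} by brute force."""
    result = {j: [] for j in range(len(patterns))}
    for j, p in enumerate(patterns):
        start = text.find(p)
        while start != -1:
            result[j].append(start)
            # Restart one character later so overlapping occurrences are found too.
            start = text.find(p, start + 1)
    return result


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
text = "chathave"  # hypothetical text T

matches = naive_dictionary_match(patterns, text)
d = len(patterns)                             # number of patterns
n = sum(len(p) for p in patterns)             # total length of patterns
occ = sum(len(v) for v in matches.values())   # total occurrences reported
```

With this dictionary, d = 6 and n = 22; on the sample text, "chat", "hat", and "have" each occur once, so occ = 3.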
5 Summary of Results
Space (bits)              Search Time                        Ref
O(n log n)                O(|T| + occ)                       [AC 75]
O(n) when σ = constant    O((|T| + occ) log² n)              [CHLS 07]
O(n log σ)                O(|T| log log n + occ)             this work
(1 + o(1)) n log σ        O(|T| (log^ε n + log d) + occ)     this work
Notes: the last space bound is optimal, i.e. |patterns| + o(n log σ) bits; ε = constant in (0,1)
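The first row of the table, [AC 75], is the classical Aho-Corasick automaton. The following is a compact sketch of that baseline, written with plain Python lists and dicts rather than a space-tuned layout; the names `build_automaton` and `ac_search` are illustrative. Each state stores pointers, which is where the O(n log n)-bit space comes from.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for the pattern set."""
    goto, fail, out = [{}], [0], [[]]
    for j, p in enumerate(patterns):
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(j)
    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its string that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out


def ac_search(patterns, text):
    """Return sorted (0-based start position, pattern index) pairs."""
    goto, fail, out = build_automaton(patterns)
    hits, s = [], 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for j in out[s]:
            hits.append((i - len(patterns[j]) + 1, j))
    return sorted(hits)


hits = ac_search(["ate", "chair", "chat", "hat", "have", "vet"], "chathave")
```

The text is scanned once from left to right, which matches the O(|T| + occ) search time in the table for a constant-size alphabet.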
6 Existing Solution I: Patricia Trie
Compact trie storing all d patterns
[Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]
7 Existing Solution I: Patricia Trie
Advantage:
Space: |patterns| + O(d log n) bits
Very small overhead in addition to the input patterns
8 Existing Solution I: Patricia Trie
Searching Strategy:
For each position k in T:
  Match T from the root, starting at k
  Report every occurrence of a P_j found
Disadvantage:
Searching: worst-case O(|T| n + occ) time
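The searching strategy above can be sketched as follows. For brevity the sketch uses an uncompressed trie of nested dicts instead of a true Patricia trie (an assumed simplification that changes the space, not the strategy), and it counts the characters matched so the cost of restarting at the root is visible. The names `build_trie` and `search_from_every_position` are hypothetical.

```python
def build_trie(patterns):
    """Uncompressed trie; the key None holds the patterns ending at a node."""
    root = {}
    for j, p in enumerate(patterns):
        node = root
        for c in p:
            node = node.setdefault(c, {})
        node.setdefault(None, []).append(j)
    return root


def search_from_every_position(patterns, text):
    """Return ((start, pattern index) hits, number of characters matched)."""
    root = build_trie(patterns)
    hits, steps = [], 0
    for k in range(len(text)):
        node, i = root, k          # restart at the root for every position k
        while True:
            for j in node.get(None, []):
                hits.append((k, j))
            if i == len(text) or text[i] not in node:
                break
            node = node[text[i]]
            i += 1
            steps += 1
    return hits, steps


hits, steps = search_from_every_position(
    ["ate", "chair", "chat", "hat", "have", "vet"], "chathave")
bad_hits, bad_steps = search_from_every_position(["aaab"], "aaaaaa")
```

In the second call nothing is ever reported, yet most text characters are matched three times, once for each of three consecutive starting positions. This re-reading is the source of the multiplicative factor in the worst-case bound.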
9 Existing Solution II: Suffix Tree
Compact trie storing all suffixes of all d patterns
[Figure: suffix tree for { ate, chair, chat, hat, have, vet }]
10 Existing Solution II: Suffix Tree
Same Searching Strategy:
For each position k in T:
  Match T from the root, starting at k
  Report every occurrence of a P_j found
Matching time = O(|T|)
Searching: worst-case O(|T| + occ) time
11 Existing Solution II: Suffix Tree
Disadvantage:
Space: O(n log n) bits, which could be much larger than O(n log σ), the space for |patterns|
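A quick calculation shows how large the gap between the two bounds can be. The sketch below ignores constant factors and simply compares n⌈log n⌉ with n⌈log σ⌉; the function name `space_bits` and the sample values (n = 2^20, σ = 4, as for DNA-like patterns) are assumptions chosen for illustration.

```python
import math


def space_bits(n, sigma):
    """Return (n * ceil(log2 n), n * ceil(log2 sigma)), ignoring constants."""
    pointer_bits = n * math.ceil(math.log2(n))     # one (log n)-bit pointer per suffix
    pattern_bits = n * math.ceil(math.log2(sigma)) # patterns stored symbol by symbol
    return pointer_bits, pattern_bits


tree_bits, pattern_bits = space_bits(2 ** 20, 4)
ratio = tree_bits / pattern_bits
```

For these values the pointer-based bound is 10 times the size of the patterns themselves, and the ratio grows as n grows or σ shrinks.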
12 Our Solution
- no suffixes (Patricia trie): poor searching
- all suffixes (suffix tree): poor space
- some suffixes: good space + good searching
13 Our Solution: Sampling
Store only one suffix out of every fixed number of suffixes (the sampling factor)
[Figure: sampled suffix tree for { ate, chair, chat, hat, have, vet } with sampling factor = 2]
14 Our Solution: Sampling
Store only one suffix out of every fixed number of suffixes (the sampling factor)
[Figure: sampled suffix tree for { ate, chair, chat, hat, have, vet } with sampling factor = 2, with the irregularities marked]
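As a rough illustration of how sampling shrinks the set of stored suffixes, the sketch below lists the suffixes kept for the example dictionary. The slides do not say which suffixes are chosen, so the rule used here (keep the suffixes starting at offsets 0, step, 2·step, … of each pattern) is an assumption, as are the names `sampled_suffixes` and `step`.

```python
def sampled_suffixes(patterns, step):
    """Assumed sampling rule: keep the suffixes of each pattern that start
    at offsets 0, step, 2*step, ... Returns (pattern index, offset, suffix)."""
    return [(j, k, p[k:])
            for j, p in enumerate(patterns)
            for k in range(0, len(p), step)]


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
all_suffixes = sampled_suffixes(patterns, 1)   # what a full suffix tree stores
half = sampled_suffixes(patterns, 2)           # sampling factor 2
chair_kept = [s for j, k, s in half if j == 1]
```

Under this assumed rule, the 22 suffixes of the dictionary drop to 13; for "chair" the kept suffixes are "chair", "air", and "r". The tree built on the sampled set can no longer follow every suffix of the text directly, which is the kind of gap the irregularity handling has to cover.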
15 Our Solution: Sampling
Need to handle irregularities
Same Searching Strategy:
For each position k in T:
  Match T from the root, starting at k
  Report every occurrence of a P_j found
Matching time = O(|T|) despite irregularities
16 When the sampling factor = log n
Handling irregularities reduces to predecessor search in a set of (log n)-bit integers
Data structure: Y-fast trie
Search: O(|T| log log n + occ) time
Space: O(n log σ) bits
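For readers unfamiliar with the operation, a predecessor query returns the largest stored key that does not exceed the query value. The sketch below shows the operation only: it uses binary search over a sorted list (Python's `bisect`) as a stand-in, not a Y-fast trie. Binary search costs O(log m) for m keys, whereas a Y-fast trie answers the same query in O(log log U) time for a universe of size U, which gives log log n for (log n)-bit integers. The function name `predecessor` and the sample keys are illustrative.

```python
import bisect


def predecessor(sorted_keys, x):
    """Largest key <= x in sorted_keys, or None if every key is larger.
    Binary search is a stand-in here, not a Y-fast trie."""
    i = bisect.bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i else None


keys = [3, 8, 21, 34]            # hypothetical stored integers
p1 = predecessor(keys, 20)       # 8
p2 = predecessor(keys, 3)        # 3 (a key is its own predecessor)
p3 = predecessor(keys, 2)        # None
```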
17 When the sampling factor = (log n) / log σ
Handling irregularities reduces to predecessor search in a set of (log n)-bit strings
Data structure: String B-tree
Search: O(|T| (log^ε n + log d) + occ) time
Space: |patterns| + o(n log σ) bits
18 When the sampling factor = (log n) / log σ
Handling irregularities reduces to predecessor search in a set of (log n)-bit strings
Data structure: String B-tree, with the patterns stored as in [FerVen 07]
Search: O(|T| (log^ε n + log d) + occ) time
Space: n H_k + o(n log σ) bits
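The space bound above is stated in terms of H_k, which is read here as the standard k-th order empirical entropy: the average number of bits per symbol when each symbol is coded according to the k symbols preceding it. A minimal sketch of that definition follows; the function names `h0` and `hk` and the sample strings are illustrative, and the slides do not specify the exact string to which H_k is applied.

```python
import math
from collections import Counter, defaultdict


def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum(c / n * math.log2(n / c) for c in Counter(s).values())


def hk(s, k):
    """k-th order empirical entropy: group symbols by their preceding
    k-symbol context, then average the H_0 of each group, weighted by size."""
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


flat = h0("abababab")      # 1.0 bit per symbol without context
ctx = hk("abababab", 1)    # 0.0: each symbol is determined by the previous one
```

The example shows why an nH_k-type bound can be far below n log σ: a string that looks incompressible symbol by symbol may be fully predictable once context is taken into account.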
19 Open Problems
Compressed + Dynamic Version: Can an index support updates to the set of patterns?
  Target: achieve an nH_k-type space bound
External Memory Version: Can an index operate in external memory and still support fast searching?