1 Compressed Index for Dictionary Matching WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)


2 Outline: Dictionary Matching Problem; Summary of Results; Description of Our Solution (Brief), based on (I) Suffix Tree, (II) A Simple Sampling Idea, (III) Handling Irregularities; Open Problems.

3 Dictionary Matching. Input: A set of d short patterns, { P_1, P_2, …, P_d }, of total length n. Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report, for each P_j, all positions in T where it occurs.
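As a point of reference for the problem statement, here is a minimal brute-force sketch in Python that uses no index at all. The function name `naive_dictionary_match` and the example text are illustrative assumptions, not part of the talk; the pattern set is the one used in the later slides.

```python
def naive_dictionary_match(patterns, text):
    """Return {pattern: [0-based start positions in text]} by brute force."""
    result = {}
    for p in patterns:
        positions = []
        start = text.find(p)
        while start != -1:
            positions.append(start)
            # Restart one character later so overlapping occurrences are found.
            start = text.find(p, start + 1)
        result[p] = positions
    return result


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
matches = naive_dictionary_match(patterns, "chathave")
print(matches)
```

Each pattern scans the whole text separately, which is the cost an index is meant to avoid.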

4 Dictionary Matching. Relevant parameters to measure the index's performance: d = number of patterns; n = total length of patterns; |T| = length of T; σ = size of the alphabet of T and the patterns; occ = total number of occurrences in the search result.

5 Summary of Results

Space (bits)              Search Time                         Ref
O(n log n)                O(|T| + occ)                        [AC 75]
O(n) when σ = constant    O((|T| + occ) log^2 n)              [CHLS 07]
O(n log σ)                O(|T| log log n + occ)              ** this **
(1 + o(1)) n log σ        O(|T| (log^ε n + log d) + occ)      ** this **

Notes: (1 + o(1)) n log σ = |patterns| + o(n log σ) bits, which is optimal; ε = constant in (0,1).
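The [AC 75] row refers to the Aho-Corasick automaton. The following is a minimal Python sketch of that classical technique, included only to make the O(|T| + occ) baseline concrete; it stores plain Python dicts and lists and makes no attempt at the compressed space bounds in the table. The names `build_automaton` and `search` are illustrative.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of patterns that end at each state
    for idx, p in enumerate(patterns):
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(idx)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out


def search(patterns, text):
    """Return (start position, pattern) pairs for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    hits = []
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for idx in out[s]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits


hits = search(["ate", "chair", "chat", "hat", "have", "vet"], "chathave")
print(sorted(hits))
```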

6 Existing Solution I: Patricia Trie. Compact trie storing all d patterns. [Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]

7 Existing Solution I: Patricia Trie. Advantage: Space: |patterns| + O( d log n ) bits, i.e. very small overhead in addition to the input patterns.

8 Existing Solution I: Patricia Trie. Searching strategy: for each position k in T, match T from the root starting at k and report the occurrence of any P_j found. Disadvantage: Searching: worst-case O(|T|n + occ) time.
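The searching strategy on this slide can be sketched in Python as follows. For brevity the sketch uses a plain, uncompacted trie of nested dicts rather than a Patricia trie, so it illustrates the scanning loop and its cost, not the space bound; `build_trie` and `scan` are illustrative names.

```python
def build_trie(patterns):
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        # The key None marks the end of a pattern; it can never clash
        # with a character of the text.
        node[None] = p
    return root


def scan(trie, text):
    """Restart matching from the root at every text position k."""
    hits = []
    for k in range(len(text)):
        node = trie
        j = k
        while True:
            if None in node:
                hits.append((k, node[None]))
            if j == len(text) or text[j] not in node:
                break
            node = node[text[j]]
            j += 1
    return hits


trie = build_trie(["ate", "chair", "chat", "hat", "have", "vet"])
hits = scan(trie, "chathave")
print(hits)
```

Because the inner loop restarts from the root at every k, a single position can cost as much as the longest pattern, which is where the worst-case O(|T|n + occ) bound comes from.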

9 Existing Solution II: Suffix Tree. Compact trie storing all suffixes of all d patterns. [Figure: suffix tree for { ate, chair, chat, hat, have, vet }]

10 Existing Solution II: Suffix Tree. Same searching strategy: for each position k in T, match T from the root starting at k and report the occurrence of any P_j found. Matching time = O(|T|). Searching: worst-case O(|T| + occ) time.

11 Existing Solution II: Suffix Tree. Disadvantage: Space: O( n log n ) bits, which could be much larger than O( n log σ ), the space for |patterns|.
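To get a feel for the gap between the two bounds, here is a small Python calculation. The values n = 2^20 and σ = 4 are assumptions chosen for illustration (a DNA-sized alphabet), and constant factors hidden by the O-notation are ignored.

```python
import math


def suffix_tree_bits(n):
    # Order-of-magnitude estimate n * log2(n); constant factors ignored.
    return n * math.log2(n)


def pattern_bits(n, sigma):
    # Bits needed to write down n characters over an alphabet of size sigma.
    return n * math.log2(sigma)


n, sigma = 2**20, 4
ratio = suffix_tree_bits(n) / pattern_bits(n, sigma)
print(ratio)  # prints 10.0
```

With these numbers the suffix tree bound is ten times the size of the patterns themselves, and the ratio log n / log σ grows with n.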

12 Our Solution. No suffixes: poor searching. All suffixes: poor space. Some suffixes: good space + searching.

13 Our Solution: Sampling. Store only one suffix out of every group of consecutive suffixes; the group size is the sampling factor. [Figure: tree of sampled suffixes for { ate, chair, chat, hat, have, vet }, sampling factor = 2]

14 Our Solution: Sampling. Store only one suffix out of every group of consecutive suffixes; the group size is the sampling factor. [Figure: the same tree of sampled suffixes for { ate, chair, chat, hat, have, vet }, sampling factor = 2, with the irregularities marked]
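A minimal Python sketch of the sampling step, for the example pattern set with sampling factor 2. The slides do not say which suffixes are kept; this sketch assumes the suffixes starting at offsets 0, 2, 4, … of each pattern, and the name `sampled_suffixes` is hypothetical.

```python
def sampled_suffixes(pattern, factor):
    """Keep the suffixes starting at offsets 0, factor, 2*factor, ...

    Assumption: sampling is aligned to the start of each pattern.
    """
    return [pattern[i:] for i in range(0, len(pattern), factor)]


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
sample = {p: sampled_suffixes(p, 2) for p in patterns}
for p, suffixes in sample.items():
    print(p, suffixes)

total_all = sum(len(p) for p in patterns)
total_kept = sum(len(s) for s in sample.values())
print(total_all, total_kept)
```

Under this assumption 13 of the 22 suffixes are stored, so the tree has roughly half as many leaves as the full suffix tree.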

15 Our Solution: Sampling. Need to handle irregularities. Same searching strategy: for each position k in T, match T from the root starting at k and report the occurrence of any P_j found. Matching time = O(|T|) despite irregularities.

16 Handling irregularities when the sampling factor = log_σ n: predecessor search in a set of (log n)-bit integers, using a y-fast trie. Search: O(|T| log log n + occ) time. Space: O( n log σ ) bits.
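The predecessor query itself can be sketched in Python as below. This is a stand-in, not the structure named on the slide: a sorted list with binary search answers each query in O(log m) time for m keys, whereas a y-fast trie answers it in O(log log U) time for keys drawn from a universe of size U. How the index decides which integers to store is not covered by the slides, so the keys here are made up; `pack` and `predecessor` are illustrative names.

```python
import bisect


def pack(chunk, alphabet):
    """Encode a string as an integer, one base-sigma digit per character."""
    sigma = len(alphabet)
    value = 0
    for ch in chunk:
        value = value * sigma + alphabet.index(ch)
    return value


def predecessor(sorted_keys, x):
    """Largest key <= x, or None if every key is larger than x."""
    i = bisect.bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i else None


alphabet = "acgt"  # sigma = 4, so each character takes 2 bits
keys = sorted(pack(c, alphabet) for c in ["ac", "ga", "tt"])
print(keys)
print(predecessor(keys, pack("gt", alphabet)))
print(predecessor(keys, pack("aa", alphabet)))
```

A chunk of log_σ n characters packs into log n bits, which matches the "(log n)-bit integers" on the slide.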

17 Handling irregularities when the sampling factor = (log  n) / log  : predecessor search in a set of (log  n)-bit strings, using a String B-tree. Search: O(|T| (log^ε n + log d) + occ) time. Space: |patterns| + o(n log σ) bits.

18 Handling irregularities when the sampling factor = (log  n) / log  : predecessor search in a set of (log  n)-bit strings, using a String B-tree. Search: O(|T| (log^ε n + log d) + occ) time. Space: n H_k + o(n log σ) bits [FerVen 07].

19 Open Problems. Compressed + Dynamic Version: Can an index support updates to the set of patterns? Target: achieve an nH_k-type space bound. External Memory Version: Can an index operate in external memory and still support fast searching?

