1 Compressed Index for Dictionary Matching WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)


2 Outline
Dictionary Matching Problem
Summary of Results
Description of Our Solution (Brief), based on:
  (I) Suffix Tree
  (II) A Simple Sampling Idea
  (III) Handling Irregularities
Open Problems

3 Dictionary Matching
Input: A set of d short patterns { P_1, P_2, …, P_d } of total length n
Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report, for each P_j, all positions in T where it occurs

4 Dictionary Matching
Relevant parameters for measuring the index's performance:
  d = number of patterns
  n = total length of patterns
  |T| = length of T
  σ = size of the alphabet of T and the patterns
  occ = total number of occurrences in the search result

5 Summary of Results
Space (bits)              Search Time                        Ref
O(n log n)                O(|T| + occ)                       [AC 75]
O(n) when σ = constant    O((|T| + occ) log^2 n)             [CHLS 07]
O(n log σ)                O(|T| log log n + occ)             ** this **
(1 + o(1)) n log σ        O(|T| (log^ε n + log d) + occ)     ** this **
Notes: (1 + o(1)) n log σ bits is optimal: |patterns| + o(n log σ). ε = constant in (0,1).
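For orientation, the [AC 75] row of the table refers to the classic Aho-Corasick automaton. The following is a minimal Python sketch of that textbook technique, not code from the talk; the names `build_automaton` and `ac_search` are chosen here for illustration. It stores explicit pointers per state, which is where the O(n log n)-bit space comes from.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto/failure/output tables for the given patterns."""
    goto = [{}]   # goto[s][ch] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of patterns recognised on reaching each state
    for j, p in enumerate(patterns):
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(j)
    # Breadth-first pass: failure links of shallower states are final
    # before any deeper state needs them.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out

def ac_search(text, patterns):
    """Return (start position, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for j in out[s]:
            hits.append((i - len(patterns[j]) + 1, patterns[j]))
    return hits
```

The scan reads each text character once and follows failure links in amortised constant time, which matches the O(|T| + occ) search bound in the table.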

6 Existing Solution I: Patricia Trie
Compact trie storing all d patterns
[Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]

7 Existing Solution I: Patricia Trie
Advantage: Space: |patterns| + O(d log n) bits
→ Very small overhead in addition to the input patterns

8 Existing Solution I: Patricia Trie
Searching Strategy:
  For each position k in T:
    Match T from the root starting at k
    Report occurrence of any P_j found
→ Disadvantage: Searching takes worst-case O(|T| n + occ) time
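The searching strategy above can be sketched in a few lines of Python. This is an illustrative sketch only: it uses a plain, uncompacted trie of nested dictionaries rather than a Patricia trie, and the names `build_trie` and `scan` are made up for this example.

```python
def build_trie(patterns):
    """Plain trie as nested dicts; the key None stores the index of a
    pattern that ends at that node."""
    root = {}
    for j, p in enumerate(patterns):
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node[None] = j
    return root

def scan(text, patterns):
    """For each position k in text, match from the root starting at k
    and report every pattern found along the way."""
    root = build_trie(patterns)
    hits = []
    for k in range(len(text)):
        node, i = root, k
        while True:
            if None in node:
                hits.append((k, patterns[node[None]]))
            if i == len(text) or text[i] not in node:
                break
            node = node[text[i]]
            i += 1
    return hits
```

Every start position restarts at the root and may walk down a path as long as the longest pattern, which is the source of the O(|T| n + occ) worst case.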

9 Existing Solution II: Suffix Tree
Compact trie storing all suffixes of all d patterns
[Figure: suffix tree for { ate, chair, chat, hat, have, vet }]

10 Existing Solution II: Suffix Tree
Same Searching Strategy:
  For each position k in T:
    Match T from the root starting at k
    Report occurrence of any P_j found
Matching time = O(|T|)
Searching: worst-case O(|T| + occ) time

11 Existing Solution II: Suffix Tree
Disadvantage: Space: O(n log n) bits
→ could be much larger than O(n log σ), the space for the patterns
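To get a feel for the gap between the two bounds, the arithmetic can be worked for one concrete case. The values n = 2^20 and σ = 4 below are hypothetical, chosen only for illustration, and the constants hidden by the O-notation are ignored.

```python
from math import log2

def pointer_bits(n):
    """n log n: one (log n)-bit pointer per character, constants ignored."""
    return n * log2(n)

def text_bits(n, sigma):
    """n log sigma: the raw size of the patterns over an alphabet of size sigma."""
    return n * log2(sigma)

n, sigma = 2 ** 20, 4          # hypothetical example values
ratio = pointer_bits(n) / text_bits(n, sigma)
```

For these values `ratio` is log n / log σ = 20 / 2 = 10, so the pointer-based bound is ten times the size of the patterns themselves, before constant factors.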

12 Our Solution
no suffixes: poor searching
all suffixes: poor space
some suffixes: good space + searching

13 Our Solution: Sampling
Store one suffix for every τ suffixes
[Figure: sampled suffix tree with τ = 2 for { ate, chair, chat, hat, have, vet }]

14 Our Solution: Sampling
Store one suffix for every τ suffixes
[Figure: sampled suffix tree with τ = 2 for { ate, chair, chat, hat, have, vet }, with the irregularities marked]

15 Our Solution: Sampling
Same Searching Strategy:
  For each position k in T:
    Match T from the root starting at k
    Report occurrence of any P_j found
Need to handle irregularities
Matching time = O(|T|) despite irregularities
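As a rough illustration of how sampling shrinks the set of stored suffixes, the sketch below lists the sampled suffixes of the running example. It assumes that sampling keeps the suffixes starting at offsets 0, τ, 2τ, … of each pattern; the slides do not say which offsets are kept, so that choice, like the name `sampled_suffixes`, is an assumption of this example.

```python
def sampled_suffixes(patterns, tau):
    """Distinct suffixes starting at offsets 0, tau, 2*tau, ... of each
    pattern (assumed sampling rule, see the note above)."""
    return sorted({p[i:] for p in patterns for i in range(0, len(p), tau)})

patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
all_suffixes = sampled_suffixes(patterns, 1)   # tau = 1 keeps every suffix
half = sampled_suffixes(patterns, 2)           # tau = 2 as on the slides
```

Under this assumption the example has 17 distinct suffixes in total and 12 after sampling with τ = 2; for longer patterns the number of stored suffixes drops to roughly n / τ.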

16 Handling Irregularities
When τ = log_σ n:
→ predecessor search in a set of (log n)-bit integers (y-fast trie)
Search: O(|T| log log n + occ) time
Space: O(n log σ) bits
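The query that has to be answered here is a predecessor search: given x, find the largest stored key that is at most x. The sketch below shows the query itself using binary search over a sorted list from Python's standard `bisect` module. This is a stand-in, not a y-fast trie: it takes O(log m) time for m keys instead of the O(log log n) that a y-fast trie gives on (log n)-bit integers.

```python
from bisect import bisect_right

def predecessor(sorted_keys, x):
    """Largest key <= x in an ascending list, or None if there is none."""
    i = bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i else None

keys = [3, 8, 15, 42]   # hypothetical key set, already sorted
```

For example, `predecessor(keys, 10)` returns 8, and a query below the smallest key returns `None`.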

17 Handling Irregularities
When τ = (log_σ n) / log ε:
→ predecessor search in a set of (log_σ n)-bit strings (String B-tree)
Search: O(|T| (log^ε n + log d) + occ) time
Space: |patterns| + o(n log σ) bits

18 Handling Irregularities
When τ = (log_σ n) / log ε:
→ predecessor search in a set of (log_σ n)-bit strings (String B-tree)
Search: O(|T| (log^ε n + log d) + occ) time
Space: n H_k + o(n log σ) bits [FerVen 07]

19 Open Problems
Compressed + Dynamic Version: Can an index support updates to the set of patterns? Target: achieve an nH_k-type space bound.
External Memory Version: Can an index operate in external memory and still support fast searching?