1 A Lempel-Ziv text index on secondary storage Diego Arroyuelo and Gonzalo Navarro Combinatorial Pattern Matching 2007.

Presentation transcript:

1 A Lempel-Ziv text index on secondary storage Diego Arroyuelo and Gonzalo Navarro Combinatorial Pattern Matching 2007

2 Introduction The full-text searching problem: to find all the occ occurrences of a pattern P[1..m] in a text T[1..u] (both over an alphabet Σ of size σ). We are interested in indexed text searching: an index on T allows us to find the pattern occurrences quickly. In our work the index replaces the text (self-indexing) and is compressed (LZ), so it combines compression and search.

3 Applications and goals Main applications of text searching: computational biology (DNA and protein sequences); Oriental language texts (Japanese, Chinese, Korean, etc.); "natural language" texts (English, Spanish, etc.); music (MIDI pitch sequences); program code; etc. Compressed self-indexes reduce the space requirement (not storing the text + compressing) and are useful in cases where accessing the text is expensive (for example, web search engines).

4 Motivations The use of a compressed self-index may totally remove the need to use the disk. However: huge texts; sequential text searching + compression; compressed self-indexes improve disk performance (more disk accesses but smaller seek time).

5 Motivations By reducing the space of the index we aim at: saving disk space (important for storage media of limited size); reducing the seek time when searching (because the index is smaller).

6 Model of computation We assume a model of computation where: a disk page of size B can be transferred to main memory in a single disk access; we can hold a constant number of disk pages in main memory; we count every disk access; the text is static.

7 Related work String B-trees [FG, JACM 1999]: 3 – 4 times text size. Compact Pat Trees [CM, SODA 1996]: 5 – 6 times text size. Compressed Suffix Arrays [MNS, ISAAC 2003]: about 0.25 – 0.5 times text size; 2(1 + m · ⌈log_B u⌉) accesses for counting; O(log u) extra accesses per occurrence! Can we define a small and efficient index on secondary storage?
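To get a feel for the counting bound quoted for Compressed Suffix Arrays, the sketch below evaluates it for one set of values. It assumes the bound reads 2(1 + m⌈log_B u⌉), borrows u (200 megabytes) and B (8,192) from the experimental setup later in the talk, and picks m = 10 purely for illustration; measuring B in integers rather than symbols is also an assumption. The function names are hypothetical.

```python
def ceil_log(base, x):
    """Smallest integer h with base**h >= x, using integer arithmetic only."""
    h, power = 0, 1
    while power < x:
        power *= base
        h += 1
    return h


def csa_count_bound(m, u, B):
    """Evaluate 2 * (1 + m * ceil(log_B u)), the assumed reading of the bound."""
    return 2 * (1 + m * ceil_log(B, u))


u = 200 * 2**20   # text length, assuming one byte per symbol
B = 8192          # page size in integers (assumption about the unit)
m = 10            # pattern length, an illustrative value only

print(csa_count_bound(m, u, B))
```

The point of the exercise is that the bound grows linearly with m, and that each located occurrence then costs O(log u) further accesses.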

8 Searching LZ78 compressed texts: the LZ-index LZ78 parses the text into phrases, which the index represents with two tries (LZTrie and RevTrie). There are different types of occurrences…
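As a reminder of what "LZ78 parses the text into phrases" means, here is a minimal in-memory sketch of the parsing. The function name and the dictionary-based trie are illustrative choices, not the index's actual representation, and the sketch assumes the text ends with a unique terminator such as `$` so that the last phrase is complete.

```python
def lz78_parse(text):
    """Return the LZ78 phrases of text; phrases[0] is the empty phrase B_0."""
    trie = {0: {}}      # node id -> {symbol: child id}; node 0 is the root
    phrases = [""]
    node, start = 0, 0
    for i, c in enumerate(text):
        child = trie[node].get(c)
        if child is not None:
            node = child                    # keep extending a known phrase
            continue
        new_id = len(phrases)               # longest known phrase + one symbol
        trie[node][c] = new_id
        trie[new_id] = {}
        phrases.append(text[start:i + 1])
        node, start = 0, i + 1
    if start < len(text):                   # leftover repeats an earlier phrase
        phrases.append(text[start:])
    return phrases


for k, phrase in enumerate(lz78_parse("alabar a la alabarda$")):
    print(k, repr(phrase))
```

Every phrase is an earlier phrase extended by one symbol, so the set of phrases is prefix-closed. That property is what lets the phrases be stored as a trie with one node per phrase.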

9 Occurrences of Type 1 Occurrences contained in a single phrase. Consider the shortest possible LZ78 phrases containing P: by LZ78, P is a suffix of such phrases. The LZTrie subtrees of those phrases contain the occurrences of type 1.

10 Occurrences of Type 1 Occurrences contained in a single phrase. As P is a suffix of such phrases, P^r is a prefix of the corresponding reverse phrases. We need the Reverse Trie (RevTrie) to solve this problem, which requires navigation between tries (RevTrie for P^r, then LZTrie)!
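The suffix/reversed-prefix argument can be sketched on a plain list of phrases. The sorted list of reversed phrases below stands in for RevTrie (a prefix search selects a contiguous range, like a subtree), and a string prefix test stands in for LZTrie subtree membership. Both are simplifications for illustration, the function names are hypothetical, and the pattern is assumed to be non-empty.

```python
from bisect import bisect_left

# LZ78 phrases of "alabar a la alabarda$"; index 0 is the empty phrase B_0.
EXAMPLE_PHRASES = ["", "a", "l", "ab", "ar", " ", "a ", "la", " a",
                   "lab", "ard", "a$"]


def phrases_ending_with(phrases, pattern):
    """Ids of phrases having pattern as a suffix, via a prefix search for
    the reversed pattern among the reversed phrases (RevTrie stand-in)."""
    rev = sorted((p[::-1], i) for i, p in enumerate(phrases))
    keys = [r for r, _ in rev]
    pr = pattern[::-1]
    lo = bisect_left(keys, pr)      # first reversed phrase >= P^r
    hi = lo
    while hi < len(keys) and keys[hi].startswith(pr):
        hi += 1                     # strings prefixed by P^r are contiguous
    return sorted(i for _, i in rev[lo:hi])


def phrases_containing(phrases, pattern):
    """Ids of phrases containing pattern: the phrases that extend (descend
    from, in LZTrie) some phrase ending with pattern."""
    roots = [phrases[i] for i in phrases_ending_with(phrases, pattern)]
    return [i for i, p in enumerate(phrases)
            if any(p.startswith(r) for r in roots)]


print(phrases_ending_with(EXAMPLE_PHRASES, "a"))
print(phrases_containing(EXAMPLE_PHRASES, "a"))
```

The second function relies on the phrase set being prefix-closed: if P ends at some position inside a phrase, the prefix up to that position is itself a phrase ending with P.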

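11 Occurrences of Type 2 Occurrences spanning two consecutive phrases, k-1 and k: P is split as P = P1 P2, with phrase k-1 ending with P1 and phrase k starting with P2. (Figure: P1^r is searched in RevTrie and P2 in LZTrie; Node and RNode map the phrase identifiers k-1 and k to their trie nodes.)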
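A brute-force sketch of the type 2 condition: for every split P = P1 P2, look for consecutive phrases k-1 and k such that phrase k-1 ends with P1 and phrase k starts with P2. The real index answers both conditions with trie searches instead of scanning all phrases; the linear scan and the function name here are only for illustration.

```python
def type2_occurrences(phrases, pattern):
    """Text positions (0-based) of occurrences spanning exactly two
    consecutive phrases. phrases[0] is the empty phrase B_0."""
    starts, pos = [], 0
    for p in phrases:               # starts[k] = text position of phrase k
        starts.append(pos)
        pos += len(p)
    occ = []
    for i in range(1, len(pattern)):            # split point: |P1| = i
        p1, p2 = pattern[:i], pattern[i:]
        for k in range(2, len(phrases)):
            if phrases[k - 1].endswith(p1) and phrases[k].startswith(p2):
                occ.append(starts[k] - i)       # P1 ends where phrase k begins
    return sorted(occ)


phrases = ["", "a", "l", "ab", "ar", " ", "a ", "la", " a",
           "lab", "ard", "a$"]   # LZ78 phrases of "alabar a la alabarda$"
print(type2_occurrences(phrases, "ba"))
```

There are m - 1 ways to split a pattern of length m, so this part of the search runs one pair of trie searches per split.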

12 Occurrences of Type 3 Occurrences spanning more than two consecutive phrases. O(m^2) occurrences of type 3 in the worst case. O(m^2) random accesses in the worst case.

13 The LZ-index A compressed full-text self-index based on the LZTrie [Navarro, JDA 2004]. Four data structures compose the LZ-index. LZTrie: the trie formed by all the LZ78 phrases B_0,…,B_n. RevTrie: the trie formed by all the reverse LZ78 phrases B^r_0,…,B^r_n. Node: a mapping from phrase identifiers to their node in LZTrie. RNode: a mapping from phrase identifiers to their node in RevTrie. Overall, the LZ-index requires 4n log n (1+o(1)) = 4uH_k + o(u log σ) bits, for k = o(log_σ u). We don't need to store the text!

14 The LZ-index on secondary storage The LZ-index was originally designed for main memory. It has a non-regular pattern of access to the index components. We define a version of the LZ-index for secondary storage. We divide the problem as follows: solving the basic trie operations; reducing the navigation between structures.

15 Solving the basic trie operations We cut the tries into disjoint blocks of size at most B, using the Clark and Munro strategy. Every block stores a subtree of the whole trie. We arrange these blocks in a tree by adding inter-block pointers. We are able to compute parent(x), child(x, a), depth(x), subtreesize(x), preorder(x), and ancestor(x, y) with one extra disk access in the worst case.
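To make the idea of cutting a trie into blocks concrete, the sketch below uses a simple bottom-up greedy cut. It is a stand-in chosen for brevity and is not the Clark and Munro strategy itself: it only guarantees that every block is a connected piece of the trie with at most B nodes, and it does not try to minimize the number of blocks crossed by a root-to-leaf path. The function name and the dictionary representation of the tree are hypothetical.

```python
def partition_into_blocks(children, root, B):
    """Split a rooted tree into connected blocks of at most B nodes.

    children maps a node to the list of its children. Returns a dict
    mapping each block's root node to the list of nodes in that block.
    """
    order, stack = [], [root]
    while stack:                        # parents are listed before children
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    open_size, kept, block_roots = {}, {}, []
    for v in reversed(order):           # children are processed before parents
        kids = sorted(children.get(v, []), key=lambda c: open_size[c])
        size = 1 + sum(open_size[c] for c in kids)
        while size > B:                 # cut the largest child into its own block
            c = kids.pop()
            size -= open_size[c]
            block_roots.append(c)
        kept[v], open_size[v] = kids, size
    block_roots.append(root)

    blocks = {}
    for r in block_roots:               # collect the nodes left attached to r
        nodes, stack = [], [r]
        while stack:
            v = stack.pop()
            nodes.append(v)
            stack.extend(kept[v])
        blocks[r] = nodes
    return blocks


tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}    # complete binary tree, 7 nodes
print(partition_into_blocks(tree, 0, 3))
```

Each cut child would correspond to an inter-block pointer stored in its parent's block, which is how the blocks end up arranged in a tree.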

16 Reducing the navigation between structures Occurrences contained in a single phrase. We avoid random accesses to report only one occurrence. We would need a data structure capable of finding all these subtrees without random accesses. For counting...

17 Reducing the navigation between structures Occurrences spanning two consecutive phrases. (Figure: RevTrie, searched for P1^r, and LZTrie, searched for P2, showing nodes y and y', phrases k-1 and k, and the LR mapping.)

18 Reducing the navigation between structures We add some redundancy to reduce the number of accesses between index components. Many random accesses now become a single access + sequential scanning (please read the paper for other technical details). The overall space requirement is 8uH_k + o(u log σ) bits, for any k = o(log_σ u). The space can be reduced to 6uH_k + o(u log σ) bits if we only need to count pattern occurrences.

19 Experimental results We indexed an XML file from the Pizza&Chili Corpus (200 megabytes). We searched for 5,000 random patterns, with count and locate queries. We assume a disk page of 32 kilobytes (i.e., 8,192 integers of 32 bits).

20 Experimental results We compared against: Suffix Arrays for secondary storage, using the two-level hierarchy of [BYBZ, 1996]; String B-trees, using the model provided in [FG, 1996]; Compact Pat Trees (CPT) [CM, 1996].

21 Experimental results (count) (Plot comparing LZ-index, String B-trees, Suffix Array, and CPT.) 3.3 times smaller than String B-trees.

22 Experimental results (count) (Plot comparing LZ-index, String B-trees, Suffix Array, and CPT.)

23 Experimental results (locate) (Plot comparing LZ-index, String B-trees, Suffix Array, and CPT.) 2.6 times smaller than String B-trees. Average number of accesses to report the first occurrence: LZ-index: 11; String B-trees: 12.

24 Experimental results (locate) (Plot comparing LZ-index, String B-trees, Suffix Array, and CPT.)

25 Conclusions The LZ-index can be adapted to work on secondary storage, requiring up to 8uH_k + o(u log σ) bits, for any k = o(log_σ u). Our index is significantly smaller than any other practical secondary-memory data structure. The LZ-index requires more disk accesses, but a smaller index would have a smaller seek time.

26 Future work We assumed a constant main-memory space, but… To implement our index in a real practical setting. Handling dynamism (String B-trees require 13.5 times the text size!). Direct construction on secondary storage, adapting [AN, ISAAC 2005] to work on disk.

27 Questions? Contact

28 Thanks! Contact