Compressed Index for a Dynamic Collection of Texts H.W. Chan, W.K. Hon, T.W. Lam The University of Hong Kong.

Presentation transcript:

Compressed Index for a Dynamic Collection of Texts H.W. Chan, W.K. Hon, T.W. Lam The University of Hong Kong

Problem Definition
Given a collection L = { T_1, T_2, …, T_k } of texts of total length n over an alphabet Σ.
We want to create an index for L such that, given any pattern P, the occurrences of P in each of the T_i can be found quickly.
Also, the index should support fast insertion/deletion of a text T_i into/from L.

Previous Work & Our Result
Reference                        Space (bits)  Matching                    Insertion/deletion
[McCreight, JACM '76]            O(n log n)    O(|P| + occ)                O(|T_i|)
[Ferragina & Manzini, FOCS '00]  O(n)          O(|P| log^3 n + occ log n)  amortized O(|T_i| log n) / amortized O(|T_i| log^2 n)
[This paper]                     O(n)          O(|P| log n + occ log^2 n)  worst case O(|T_i| log n)

Two Basic Tools: CSA, FM-index
Definition 1: The main component of the CSA for a text T is a function Ψ such that Ψ[i] = SA^{-1}[SA[i] + 1], where SA[i] is the i-th entry in the suffix array and SA^{-1} is the inverse of SA.
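
Definition 1 can be checked on a tiny input. The sketch below is only an illustration: it builds the suffix array naively by sorting, uses 0-based indices, appends a sentinel "$" that sorts before every letter, and wraps the last suffix around to position 0. The sentinel, the wrap-around convention, and the function names are assumptions, not taken from the slides.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])


def psi_function(text):
    # psi[i] = SA^{-1}[SA[i] + 1], with the last suffix wrapping to position 0
    # (the wrap-around is an assumed convention for this sketch).
    sa = suffix_array(text)
    n = len(sa)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    return [inverse[(sa[i] + 1) % n] for i in range(n)]


sa = suffix_array("banana$")
psi = psi_function("banana$")
print(sa)   # [6, 5, 3, 1, 0, 4, 2]
print(psi)  # [4, 0, 5, 6, 3, 1, 2]
```

Within the rows whose suffixes start with the same character (rows 1 to 3 for "a", rows 5 to 6 for "n"), the Ψ values 0, 5, 6 and 1, 2 are increasing.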

Two Basic Tools: CSA, FM-index
Definition 2: The FM-index of T is based on the Burrows-Wheeler array of T, an array of characters denoted by BWT such that BWT[i] = T[SA[i] - 1]. The main component of the FM-index is a set of |Σ| functions count_c, one for every c ∈ Σ, such that count_c[i] = number of occurrences of c in BWT[1…i].
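
To make Definition 2 concrete, here is a small sketch that builds BWT from a naive suffix array and uses count_c in the standard FM-index backward search. It is not the compressed structure from the talk: count is computed by scanning a prefix of BWT, indices are 0-based, and the sentinel "$" and all function names are assumptions made for the example.

```python
from collections import Counter


def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_of(text):
    # BWT[i] = T[SA[i] - 1]; index -1 wraps to the sentinel at the end.
    return "".join(text[pos - 1] for pos in suffix_array(text))


def count(bwt, c, i):
    # Number of occurrences of c among the first i characters of BWT.
    return bwt[:i].count(c)


def backward_search(text, pattern):
    # Returns the sorted start positions of pattern in text.
    sa = suffix_array(text)
    bwt = "".join(text[pos - 1] for pos in sa)
    smaller, total = {}, 0
    for c, freq in sorted(Counter(text).items()):
        smaller[c] = total  # characters in text strictly smaller than c
        total += freq
    lo, hi = 0, len(text)   # current suffix array range [lo, hi)
    for c in reversed(pattern):
        if c not in smaller:
            return []
        lo = smaller[c] + count(bwt, c, lo)
        hi = smaller[c] + count(bwt, c, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


print(bwt_of("banana$"))                  # annb$aa
print(backward_search("banana$", "ana"))  # [1, 3]
```

Each pattern character costs two count queries, so the speed of count directly determines the matching time.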

Our Index
Our index is a dynamic version of the CSA + FM-index for the concatenated text T_1 T_2 … T_k.
We exploit the property that both Ψ and count are essentially a small number of sequences of increasing values.

Our Index
Maintaining a dynamic CSA and FM-index reduces to maintaining dynamic sequences of increasing values.
Observation 3: A balanced search tree is good for a dynamic sequence.
Observation 4: Difference encoding for increasing values can save space.
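
Observation 4 can be illustrated with a few lines of code. The sketch below stores the gaps between consecutive values instead of the values themselves and compares a rough bit count for both forms. The sample sequence and the bit-counting rule (binary length of each number, ignoring any overhead for delimiting codes) are assumptions chosen for illustration.

```python
from itertools import accumulate


def delta_encode(values):
    # Replace each value by its difference from the previous one.
    gaps, prev = [], 0
    for v in values:
        gaps.append(v - prev)
        prev = v
    return gaps


def delta_decode(gaps):
    # Prefix sums of the gaps give back the original values.
    return list(accumulate(gaps))


def bits_needed(numbers):
    # Rough cost: binary length of each number, at least one bit each.
    return sum(max(1, x.bit_length()) for x in numbers)


values = [3, 7, 8, 15, 1000, 1003]
gaps = delta_encode(values)
print(gaps)                                    # [3, 4, 1, 7, 985, 3]
print(bits_needed(values), bits_needed(gaps))  # 33 21
```

The price is that reading the i-th value now requires summing gaps, which is why the gaps need a search structure on top of them.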

Our Index
Combining Observations 3 and 4 gives a Differential Balanced Search Tree to handle the values in the dynamic CSA and FM-index.
Drawback: computation of Ψ and count is slowed down by an O(log n) factor.
Pattern matching: O(|P| log n + occ log^2 n) time.
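
The following is a minimal sketch of the general idea of keeping gaps in a search tree, not the structure from the paper. A treap (balanced only in expectation) stands in for the balanced search tree, each node stores one gap plus the size and gap sum of its subtree, and all names are hypothetical. Reading a value walks one root-to-leaf path, which is where an O(log n) slowdown comes from.

```python
import random

_rng = random.Random(0)  # fixed seed so runs are repeatable


class Node:
    def __init__(self, gap):
        self.gap, self.prio = gap, _rng.random()
        self.left = self.right = None
        self.size, self.total = 1, gap


def _size(t):
    return t.size if t else 0


def _total(t):
    return t.total if t else 0


def _pull(t):
    # Recompute subtree size and subtree gap sum from the children.
    t.size = 1 + _size(t.left) + _size(t.right)
    t.total = t.gap + _total(t.left) + _total(t.right)
    return t


def merge(a, b):
    if not a or not b:
        return a or b
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        return _pull(a)
    b.left = merge(a, b.left)
    return _pull(b)


def split(t, k):
    # Split into (tree with the first k gaps, tree with the rest).
    if not t:
        return None, None
    if _size(t.left) >= k:
        l, r = split(t.left, k)
        t.left = r
        return l, _pull(t)
    l, r = split(t.right, k - _size(t.left) - 1)
    t.right = l
    return _pull(t), r


def value_at(t, i):
    # i-th stored value (0-based) = sum of gaps 0..i along one path.
    acc = 0
    while t:
        ls = _size(t.left)
        if i < ls:
            t = t.left
            continue
        acc += _total(t.left) + t.gap
        if i == ls:
            return acc
        i -= ls + 1
        t = t.right
    raise IndexError("position out of range")


def count_less(t, x):
    # Number of stored values strictly smaller than x.
    acc = cnt = 0
    while t:
        here = acc + _total(t.left) + t.gap  # value stored at this node
        if here < x:
            acc, cnt, t = here, cnt + _size(t.left) + 1, t.right
        else:
            t = t.left
    return cnt


def insert_value(t, x):
    # Insert x; the successor's gap shrinks so later values are unchanged.
    k = count_less(t, x)
    prev = value_at(t, k - 1) if k > 0 else 0
    left, right = split(t, k)
    succ, rest = split(right, 1)
    if succ:
        succ.gap -= x - prev
        _pull(succ)
    return merge(merge(left, Node(x - prev)), merge(succ, rest))


tree = None
for x in [50, 10, 30, 20, 40]:
    tree = insert_value(tree, x)
print([value_at(tree, i) for i in range(5)])  # [10, 20, 30, 40, 50]
```

Only gaps are stored, so an insertion touches the new gap and its successor; every other stored value keeps its meaning without being rewritten.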

Insertion & Deletion (sketch idea)
Insertion corresponds to finding update points in the increasing sequences of Ψ and count.
To insert a text T into L, there are O(|T|) such update points.
Update points can be found by simulating a pattern matching query of T against L.
Total time: O(|T| log n)

Insertion & Deletion (sketch idea)
Deletion reverses the insertion process.
Update points can be found by querying Ψ iteratively, instead of simulating a pattern matching query.
Total time: O(|T| log n)
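
The idea of querying Ψ iteratively can be sketched on a static example. Starting from the suffix array row of a text's first suffix, each application of Ψ moves to the row of the next suffix of the same text, so |T| steps visit every row that belongs to T. The code below only lists those rows; it does not perform the dynamic update. The concatenation "ab$ba$", the shared "$" terminator, the way the start row is looked up, and the function names are all assumptions for illustration.

```python
def build_psi(text):
    # Naive suffix array, its inverse, and psi (0-based, wrap-around at the end).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    n = len(sa)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    psi = [inverse[(sa[i] + 1) % n] for i in range(n)]
    return sa, inverse, psi


def rows_of_text(psi, start_row, length):
    # Follow psi for `length` steps, collecting the rows visited.
    rows, row = [], start_row
    for _ in range(length):
        rows.append(row)
        row = psi[row]
    return rows


collection = "ab$ba$"                 # T_1 = "ab$", T_2 = "ba$"
sa, inverse, psi = build_psi(collection)
rows = rows_of_text(psi, inverse[3], 3)  # T_2 starts at position 3
first_column = sorted(collection)        # first character of each sorted suffix
spelled = "".join(first_column[r] for r in rows)
print(rows)     # [5, 2, 0]
print(spelled)  # ba$
```

Reading the first character of each visited row spells out the text itself, which confirms that the walk touched exactly the rows of T_2.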

Conclusion, Progress & Future Work
In the literature, there is a dual problem called Dictionary Management, which maintains a collection of patterns such that, when a text T is given later, all occurrences of each pattern in T are reported in one query. Fast insertion/deletion of patterns is also required.
O(n) bits: some progress …

Conclusion, Progress & Future Work
There is another problem called Dynamic Text, which maintains a single text T and, when a pattern P is given later, supports finding all occurrences of P in T. The text T is subject to insertion/deletion of substrings.
O(n log n) bits: Sahinalp & Vishkin, FOCS '96
O(n) bits: ??