Szymon Grabowski, Marcin Raniszewski Institute of Applied Computer Science, Lodz University of Technology, Poland The Prague Stringology Conference, 1-3.

Slides:



Advertisements
Similar presentations
CS Data Structures Chapter 8 Hashing.
Advertisements

Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Space-for-Time Tradeoffs
CPSC 335 Dr. Marina Gavrilova Computer Science University of Calgary Canada.
CSCE 3400 Data Structures & Algorithm Analysis
A New Compressed Suffix Tree Supporting Fast Search and its Construction Algorithm Using Optimal Working Space Dong Kyue Kim 1 andHeejin Park 2 1 School.
Two implementation issues Alphabet size Generalizing to multiple strings.
Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.
Compressed Compact Suffix Arrays Veli Mäkinen University of Helsinki Gonzalo Navarro University of Chile compact compress.
A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo.
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Wavelet Trees Ankur Gupta Butler University. Text Dictionary Problem The input is a text T drawn from an alphabet Σ. We want to support the following.
Log Files. O(n) Data Structure Exercises 16.1.
Modern Information Retrieval
BTrees & Bitmap Indexes
Fast and Practical Algorithms for Computing Runs Gang Chen – McMaster, Ontario, CAN Simon J. Puglisi – RMIT, Melbourne, AUS Bill Smyth – McMaster, Ontario,
Design a Data Structure Suppose you wanted to build a web search engine, a la Alta Vista (so you can search for “banana slugs” or “zyzzyvas”) index say.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
CS336: Intelligent Information Retrieval
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
Hash Tables1 Part E Hash Tables  
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
Hash Tables1 Part E Hash Tables  
Indexing and Searching
1 A Lempel-Ziv text index on secondary storage Diego Arroyuelo and Gonzalo Navarro Combinatorial Pattern Matching 2007.
Compressed Index for a Dynamic Collection of Texts H.W. Chan, W.K. Hon, T.W. Lam The University of Hong Kong.
A Fast Algorithm for Multi-Pattern Searching Sun Wu, Udi Manber May 1994.
Data Structures Hashing Uri Zwick January 2014.
FM-KZ: An even simpler alphabet-indepent FM-index Szymon Grabowski Computer Engineering Dept., Tech. Univ. of Łódź, Poland Prague.
Information and Coding Theory Heuristic data compression codes. Lempel- Ziv encoding. Burrows-Wheeler transform. Juris Viksna, 2015.
CS212: DATA STRUCTURES Lecture 10:Hashing 1. Outline 2  Map Abstract Data type  Map Abstract Data type methods  What is hash  Hash tables  Bucket.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
Data Compression By, Keerthi Gundapaneni. Introduction Data Compression is an very effective means to save storage space and network bandwidth. A large.
 The amount of data we deal with is getting larger  Not only do larger files require more disk space, they take longer to transmit  Many times files.
Optimizing multi-pattern searches for compressed suffix arrays Kalle Karhu Department of Computer Science and Engineering Aalto University, School of Science,
MA/CSSE 473 Day 27 Hash table review Intro to string searching.
Lossless Compression CIS 465 Multimedia. Compression Compression: the process of coding that will effectively reduce the total number of bits needed to.
Can’t provide fast insertion/removal and fast lookup at the same time Vectors, Linked Lists, Stack, Queues, Deques 4 Data Structures - CSCI 102 Copyright.
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
MA/CSSE 473 Day 23 Student questions Space-time tradeoffs Hash tables review String search algorithms intro.
Cache-efficient string sorting for Burrows-Wheeler Transform Advait D. Karande Sriram Saroop.
Compressed Prefix Sums O’Neil Delpratt Naila Rahman Rajeev Raman.
Space-time Tradeoffs for Longest-Common-Prefix Array Construction Simon J. Puglisi and Andrew Turpin
Chapter 5: Hashing Part I - Hash Tables. Hashing  What is Hashing?  Direct Access Tables  Hash Tables 2.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
Hashing Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005.
1 Chapter 7 Skip Lists and Hashing Part 2: Hashing.
Semi-dynamic compact index for short patterns and succinct van Emde Boas tree 1 Yoshiaki Matsuoka 1, Tomohiro I 2, Shunsuke Inenaga 1, Hideo Bannai 1,
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
CS654: Digital Image Analysis Lecture 34: Different Coding Techniques.
Joint Advanced Student School Compressed Suffix Arrays Compression of Suffix Arrays to linear size Fabian Pache.
Evidence from Content INST 734 Module 2 Doug Oard.
Chapter 15 A External Methods. © 2004 Pearson Addison-Wesley. All rights reserved 15 A-2 A Look At External Storage External storage –Exists beyond the.
Efficient Processing of Updates in Dynamic XML Data Changqing Li, Tok Wang Ling, Min Hu.
ETRI Linear-Time Search in Suffix Arrays July 14, 2003 Jeong Seop Sim, Dong Kyue Kim Heejin Park, Kunsoo Park.
© 2006 Pearson Addison-Wesley. All rights reserved15 A-1 Chapter 15 External Methods.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Linear Time Suffix Array Construction Using D-Critical Substrings
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
COMP9319 Web Data Compression and Search
Tries 07/28/16 11:04 Text Compression
Tries 5/27/2018 3:08 AM Tries Tries.
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
13 Text Processing Hongfei Yan June 1, 2016.
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Suffix Arrays and Suffix Trees
Presentation transcript:

Szymon Grabowski, Marcin Raniszewski Institute of Applied Computer Science, Lodz University of Technology, Poland The Prague Stringology Conference, 1-3 September 2014

1. Introduction 2. Related work 3. SA-hash 4. FBCSA 5. Experimental results 6. Conclusions / Future work 2 Two simple full-text indexes based on the suffix array PSC 2014

3 In this book, the first of several which I am writing, I must confine myself to a chronicle of those events which I myself observed and experienced, and those supported by reliable evidence; preceded by two chapters briefly outlining the background and causes of the November Revolution. I am aware that these two chapters make difficult reading, but they are essential to an understanding of what follows. T (text): P (pattern): are In this book, the first of several which I am writing, I must confine myself to a chronicle of those events which I myself observed and experienced, and those supported by reliable evidence; preceded by two chapters briefly outlining the background and causes of the November Revolution. I am aware that these two chapters make difficult reading, but they are essential to an understanding of what follows. Operations: locate: locate: count: count: 2 [291, 351] find all j that: P = T[j.. j + len(P)] allow searches for any substring Two simple full-text indexes based on the suffix array PSC 2014

4  n = len(T), m = len(P)  Suffix array [Manber & Myers, 1990]: array of n pointers to text suffixes arranged in the lexicographic order  Typically needs 5n bytes (n log n (structure) + n log |Σ| (text))  The pattern is searched using binary search (search time is O(m log n) in the worst case) Two simple full-text indexes based on the suffix array PSC 2014

 Suffix tree: ◦ can be built in a linear time ◦ efficient for small alphabet (widely use in bioinformatics) ◦ not-efficient for large alphabet ◦ large space requirements (most implementations need 20n bytes or even more!) 5 Two simple full-text indexes based on the suffix array PSC 2014

 Since around 2000 great interest in succinct text indexes: compressed suffix array (CSA), FM-index  Compressed indexes: ◦ not much slower than SA in count queries ◦ from 2 to 3 orders of magnitude slower in locate queries (large number of matches)  Exploited repetitions in the SA [ Mäkinen, 2003], [González et al., 2014]  Semi-external data structures [Gog & Moffat, 2013] 6 Two simple full-text indexes based on the suffix array PSC 2014

 Manber and Myers presented a nice trick saving several first steps in the binary search: if we know the SA interval for all the possible first k symbols of pattern, we can immediately start the binary search in a corresponding interval  Our proposal is to apply hashing on relatively long strings with extra trick to reduce the number of unnecessary references to the text 7 Two simple full-text indexes based on the suffix array PSC 2014

 The hash function is calculated for the distinct k- symbol (k ≥ 2) prefixes of suffixes from the suffix array  The value written to hash table (HT) is a pair: the position in the SA of the first and last suffixes with the given prefix  Linear probing is used as the collision resolution method 8 Two simple full-text indexes based on the suffix array PSC 2014

unsigned long sdbm(unsigned char* str) { unsigned long hash = 0; int c; while (c = *str++) hash = c + (hash << 6) + (hash << 16) - hash; return hash; } 9 Two simple full-text indexes based on the suffix array PSC 2014

10 Two simple full-text indexes based on the suffix array PSC 2014

 A variant of Mäkinen's compact suffix array (finding repeating suffix areas of fixed size, e.g., 32 bytes)  Maintain a int32-aligned data layout, beneficial for speed and simplicity  Result: two arrays arr1 and arr2: ◦ arr1 stores bits and references to areas in arr2 ◦ arr2 stores SA and text indexes (32-bit integers)  We used: SA, SA -1 and T BWT: ◦ SA -1 [j] = i ⇔ SA[i] = j ◦ T BWT [i] = T[SA[i] – 1] 11 Two simple full-text indexes based on the suffix array PSC 2014

 There are two construction-time parameters: ◦ block size bs ◦ sampling step ss  The parameter bs tells how many successive SA indexes are encoded together and is assumed to be a multiple of 32  The parameter ss means that every ss-th text index will be represented verbatim 12 Two simple full-text indexes based on the suffix array PSC 2014

13 SA[ ] = [1000, 522, 801, 303, 906, 477, 52, 610] bs = 8 T BWT [ ] = [a, b, a, c, d, d, b, b] Σ = {a, b, c, d, e} we choose 3 most common symbols:b, a, d [999, 521, 800, 302, 905, 476, 51, 609] we conclude that SA has areas of values: [521, 51, 609] [999, 800] [905, 476] text indexes: SA -1 [521] = 506 ⇒ SA[ ] = [521, 51, 609] SA -1 [999] = 287 ⇒ SA[ ] = [999, 800] SA -1 [905] = 845 ⇒ SA[ ] = [905, 476] Two simple full-text indexes based on the suffix array PSC 2014

14 SA[ ] = [1000, 522, 801, 303, 906, 477, 52, 610] T BWT [ ] = [a, b, a, c, d, d, b, b] most common symbols: b, a, d SA[ ] = [521, 51, 609] SA[ ] = [999, 800] SA[ ] = [905, 476] b→00 a→01d→10 c→11 arr1.appendBits([01, 00, 01,11,10, 10, 00, 00]) ss = 5 arr1.appendBits([1, 0, 0, 1, 0, 0, 0, 1]) arr2.appendInts([506, 287, 845]) arr2.appendInts([1000, 303, 610]) arr1.appendInts([start_key_arr2]) We code: SA[ ] = [1000, 522, 801, 303, 906, 477, 52, 610] Two simple full-text indexes based on the suffix array PSC 2014 SA[ ] = [1000, 522, 801, 303, 906, 477, 52, 610]

 Recursive calls until verbatim written text index is found  Maximum number of recursions cannot exceed ss parameter  ss parameter is a time-space tradeoff: using larger ss reduces the overall space but decoding a particular SA index typically involves more recursive invocations 15 Two simple full-text indexes based on the suffix array PSC 2014

 All experiments were run on a laptop computer with an Intel i3 2.1 GHz CPU, equipped with 8GB of DDR3 RAM and running Windows 7 Home Premium SP1 64- bit  All codes were written in C++ and compiled with Microsoft Visual Studio 2010  We used A_strcmp function from asmlib library v written by Agner Fog ( 16 Two simple full-text indexes based on the suffix array PSC 2014

 The test datasets were taken from the Pizza & Chili site  We used the 200-megabyte versions of the files: ◦ dna ◦ english ◦ proteins ◦ sources ◦ xml  In order to test the search algorithms, we generated 500K patterns for each used pattern length  Each pattern returns at least one match 17 Two simple full-text indexes based on the suffix array PSC 2014

18 Two simple full-text indexes based on the suffix array PSC 2014

19 Two simple full-text indexes based on the suffix array PSC 2014

 SA-hash: ◦ fastest index ◦ requires significantly more space than SA ◦ minimal pattern length  SA-allHT (not implemented): additional hash table for each pattern length from 1 to k (price: even more space use) 20 Two simple full-text indexes based on the suffix array PSC 2014

21 Two simple full-text indexes based on the suffix array PSC 2014

22 Two simple full-text indexes based on the suffix array PSC 2014

23 Two simple full-text indexes based on the suffix array PSC 2014

24 Two simple full-text indexes based on the suffix array PSC 2014

25 Two simple full-text indexes based on the suffix array PSC 2014

 We proposed two suffix array based full-text indexes with interesting space-time tradeoffs  SA-hash augments the standard SA with a hash table, to speed up searches by factor about 2.5–3, with 0.3n – 2.0n bytes of extra space  Fixed Block based Compact SA (FBCSA) is related to Mäkinen's CSA, but uses an int32-aligned data layout 26 Two simple full-text indexes based on the suffix array PSC 2014

 SA-hash: replace sdbm hash function with a more sophisticated one (xxhash?) with fewer collisions. Or try perfect or cuckoo hashing  SA-hash: variable-length prefixes as hash keys?  FBCSA: some design decisions were rather ad hoc and should be tested in more systematic ways, with hopefully new relevant space-time tradeoffs 27 Two simple full-text indexes based on the suffix array PSC 2014

28 Two simple full-text indexes based on the suffix array PSC 2014