A general compression algorithm that supports fast searching Szymon Grabowski Computer Engineering Dept., Tech. Univ. of Łódź, Poland

Presentation transcript:

A general compression algorithm that supports fast searching. Kimmo Fredriksson, Dept. of Computer Science, Univ. of Joensuu, Finland; Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland. Appeared in Information Processing Letters (IPL), 100(6):226–232.

2 Compressed pattern searching problem (Amir & Benson, 1992). Input: text T’ available in a compressed form, pattern P. Output: report all occurrences of P in T (i.e. decompressed T’) without decompressing the whole T’. Of course, a compressed search algorithm can be called practical only if its search time is lower than with the naïve “first decompress, then search” approach. Basic notation: |T| = n, |T’| = n’, |P| = m, |Σ| = σ. K.Fredriksson & Sz. Grabowski, A general compression algorithm that supports fast searching

3 Pros and cons of on-line and off-line searching. On-line algorithms: immediate to use (raw text), simple, flexible, but slow. Off-line algorithms (indexes): much faster, but the simplest and fastest solutions (suffix tree, suffix array) need much space (at least 5n including the text), while the more succinct ones (FM-index, CSA, many variants of...) are quite complicated. Indexed searching is much less flexible than on-line searching (hard or impossible to adapt to various approximate matching models, hard to handle a dynamic scenario).

4 Compressed pattern searching: something in between. May be faster (but not dramatically) than on-line searching in uncompressed text. Space: typically 0.5n or less. Relatively simple. Easier to implement approximate matching, handle dynamic text etc. So here was our motivation...

5 State-of-the-art in compressed pattern searching. Word based vs. full-text schemes. Word based algorithms are better (faster, better compression, more flexible for advanced queries, easier...) as long as they can be applied, i.e. when the text is naturally segmented into words. Works like a charm with English. Slightly worse with agglutinative languages (German, Finnish...). Even worse with Polish, Russian... Doesn’t work at all with oriental languages (Chinese, Korean, Japanese). Doesn’t work with DNA, proteins, MIDI...

6 State-of-the-art in compressed pattern searching, cont’d. Full-text algorithms. (Approximate) searching in RLE-compressed data (Apostolico et al., 1999; Mäkinen et al., 2001, 2003): nice theory but limited applications (fax images?). Direct search in binary Huffman stream (Klein & Shapira, 2001; Takeda et al., 2001, 2002; Fredriksson & Tarhio, 2003): mediocre compression ratio, but relatively simple. Ziv-Lempel based schemes (Kida et al., 1999; Navarro & Tarhio, 2000): quite good compression but complicated and not very fast.

7 Our proposal, main traits. Full-text compression. Based on q-grams. Actually two search algorithms: very fast for “long” patterns (m ≥ 2q–1), somewhat slower and more complicated for short patterns (m < 2q–1). Compresses plain NL text to 45–50% of the original size (worse than Ziv-Lempel but better than character based Huffman).

8 Our proposal, compression scheme. Choose q (larger q → better asymptotic compression, but a larger dictionary and the slower “short pattern” search variant triggered more often). Practical trade-off for human text: q = 4. Split text T into non-overlapping q-grams, build a dictionary over those units, dump the dictionary to the output file, then encode the q-grams according to the built dictionary, using some byte-oriented code that enables pattern searching with skips (this could be tagged Huffman (Moura et al., 2000), but (s,c)-DC (Brisaboa et al., 2003b) and ETDC (Brisaboa et al., 2003b) are more efficient).
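The compression steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' C implementation: the function names (`split_qgrams`, `etdc_codeword`, `compress`, `decompress`) are hypothetical, the dictionary is kept in memory as a frequency-ranked list instead of being dumped to a file, and a text tail shorter than q is simply treated as one more dictionary entry (an assumption). The codeword assignment follows the usual End-Tagged Dense Code rule, where only the last byte of a codeword has its high bit set.

```python
from collections import Counter


def split_qgrams(text, q):
    """Cut text into consecutive non-overlapping q-grams (the tail may be shorter)."""
    return [text[i:i + q] for i in range(0, len(text), q)]


def etdc_codeword(rank):
    """ETDC codeword for a frequency rank (0 = most frequent q-gram)."""
    out = [128 + rank % 128]      # final byte, tagged with the high bit
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)    # continuation bytes stay below 128
        rank //= 128
    return bytes(reversed(out))


def compress(text, q=4):
    """Return (dictionary ranked by frequency, ETDC-encoded q-gram stream)."""
    grams = split_qgrams(text, q)
    ranked = [g for g, _ in Counter(grams).most_common()]
    code = {g: etdc_codeword(r) for r, g in enumerate(ranked)}
    return ranked, b"".join(code[g] for g in grams)


def decompress(ranked, data):
    """Inverse of compress: rebuild ranks byte by byte, then look up q-grams."""
    out, rank = [], 0
    for byte in data:
        if byte < 128:                       # continuation byte
            rank = rank * 128 + byte + 1
        else:                                # tagged byte closes the codeword
            out.append(ranked[rank * 128 + byte - 128])
            rank = 0
    return "".join(out)
```

Because a codeword ends exactly at a byte with the high bit set, a byte-level match can be checked for codeword alignment by looking at the single byte preceding it, which is what makes skipping (Boyer-Moore style) search possible on the encoded stream.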

9 Searching for long patterns. Generate q possible alignments of pattern P[0..m–1]. That is, the last char of P may be either the 1st symbol, or the 2nd, etc., or the qth symbol of some q-gram. We cannot ignore any alignment, as this could result in missed matches. Now, truncate at most q–1 characters at each pattern alignment boundary: those that belong to “broken” q-grams. Encode each alignment according to the text dictionary. Use any multiple string searching algorithm (we use BNDM adapted for multiple matching) to search for the q alignments in parallel; verify matches with the truncated prefix/suffix.
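A small sketch of the alignment step, under the assumption that each alignment is represented as a triple (truncated prefix, list of full q-grams, truncated suffix). The function name `alignments` and this representation are illustrative choices, not taken from the paper.

```python
def alignments(pattern, q):
    """Return the q alignments of pattern as (prefix, full q-grams, suffix).

    For alignment number `shift`, the first `shift` characters of the pattern
    fall into a "broken" q-gram and are cut off as the prefix; whatever is
    left after the last full q-gram is cut off as the suffix.
    """
    result = []
    for shift in range(q):
        n_full = (len(pattern) - shift) // q          # number of full q-grams
        end = shift + n_full * q
        grams = [pattern[i:i + q] for i in range(shift, end, q)]
        result.append((pattern[:shift], grams, pattern[end:]))
    return result
```

For the pattern nasty_bananas with q = 3 this yields the q-gram sequences nas ty_ ban ana (suffix s), ast y_b ana nas (prefix n) and sty _ba nan (prefix na, suffix as). The cut-off prefixes and suffixes are kept because they are needed later to verify candidate matches.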

10 Searching for long patterns, pattern preprocessing, pseudo code

11 Searching for long patterns, example. Let P = nasty_bananas and q = 3, with ETDC coding. Three alignments are generated: [nas][ty_][ban][ana]s, n[ast][y_b][ana][nas], na[sty][_ba][nan]as.

12 Searching for long patterns, example, cont’d. We encode the 3-grams. The pattern alignments may turn into something like the codeword sequences for: (1) nas ty_ ban ana; (2) ast y_b ana nas; (3) sty _ba nan.

13 Searching for long patterns, example, cont’d. The shortest of those encodings (previous slide) has 7 bytes (the 3rd one), therefore we truncate the other two sequences to 7 bytes. Those three sequences are the input for the BNDM algorithm; potential matches must be verified.
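The search-and-verify idea for long patterns can be illustrated at the level of q-gram symbols. The sketch below is a simplification under several stated assumptions: it works on the sequence of text q-grams rather than on ETDC bytes, it replaces the multi-pattern BNDM search with a plain scan, and it skips the truncation to a common byte length. The function name `find_long` is hypothetical.

```python
def find_long(pattern, text, q):
    """Start positions of pattern in text, found via its q alignments.

    Intended for "long" patterns (len(pattern) >= 2*q - 1), where every
    alignment contains at least one full q-gram.
    """
    text_grams = [text[i:i + q] for i in range(0, len(text), q)]
    m = len(pattern)
    hits = set()
    for shift in range(q):
        n_full = (m - shift) // q
        if n_full == 0:
            continue                      # would need the short-pattern method
        end = shift + n_full * q
        core = [pattern[i:i + q] for i in range(shift, end, q)]
        prefix, suffix = pattern[:shift], pattern[end:]
        for j in range(len(text_grams) - n_full + 1):
            if text_grams[j:j + n_full] != core:
                continue                  # the full q-grams do not match
            # Verification: compare the truncated ends with the neighbours.
            before = text_grams[j - 1] if j > 0 else ""
            after = text_grams[j + n_full] if j + n_full < len(text_grams) else ""
            if before.endswith(prefix) and after.startswith(suffix):
                hits.add(j * q - shift)
    return sorted(hits)
```

Each true occurrence is found by exactly one alignment, namely the one whose cut-off prefix length matches the occurrence's offset relative to the text's q-gram grid, which is why no alignment may be skipped.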

14 Searching for short patterns. If m < 2q–1, at least one alignment will not contain even one “full” q-gram. As a result, the presented algorithm won’t work. We solve this by adapting the method from (Fredriksson, 2003). The idea is to have an implicit decoding of the text, encoded into a Shift-Or (Baeza-Yates & Gonnet, 1992; Wu & Manber, 1992) automaton, i.e. the automaton makes implicit transitions using the original text symbols, while the input is the q-gram symbols of the compressed text.
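One way to picture an automaton that consumes q-gram symbols while implicitly stepping over the original characters is the bit-parallel sketch below. It is an illustration only, with several assumptions: it uses the Shift-And formulation (the complement of Shift-Or), it widens the state to m+q–1 bits so that matches ending inside a q-gram are not lost, it builds the mask table only for q-grams that occur in the input, and it expects every input q-gram to have exactly q characters. The names `build_masks` and `find_short` are hypothetical, and details may differ from the method of (Fredriksson, 2003).

```python
def build_masks(pattern, grams, q):
    """For each distinct q-gram, a bit mask over the m+q-1 automaton states.

    Bit j is set when the q-gram is compatible with the pattern if the
    q-gram's last character is aligned with pattern position j. Positions
    outside 0..m-1 act as wildcards.
    """
    m = len(pattern)
    masks = {}
    for g in set(grams):
        mask = 0
        for j in range(m + q - 1):
            ok = True
            for k in range(q):
                pos = j - (q - 1 - k)     # pattern position facing g[k]
                if 0 <= pos < m and pattern[pos] != g[k]:
                    ok = False
                    break
            if ok:
                mask |= 1 << j
        masks[g] = mask
    return masks


def find_short(pattern, grams, q):
    """Start positions of pattern in the text spelled by the q-gram sequence."""
    m = len(pattern)
    full = (1 << (m + q - 1)) - 1         # keep only m+q-1 state bits
    fresh = (1 << q) - 1                  # occurrences may start inside a q-gram
    masks = build_masks(pattern, grams, q)
    state, hits = 0, []
    for i, g in enumerate(grams):
        # One transition per q-gram stands for q character transitions.
        state = ((state << q) | fresh) & masks[g] & full
        for d in range(q):                # match ending d chars before the q-gram end
            if (state >> (m - 1 + d)) & 1:
                hits.append(i * q + (q - 1 - d) - (m - 1))
    return sorted(hits)
```

In a compressed-text setting the masks would be precomputed per dictionary entry, so that each codeword read from the compressed stream triggers a single table lookup and one state update.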

15 Test methodology. All algorithms implemented in C, compiled with gcc. Test machine: P4 2 GHz, 512 MB, running GNU/Linux. Text files: Dickens (10.2 MB), English plain text; Bible (~4 MB), in English, Spanish and Finnish, plain text; XML collection (5.3 MB); DNA (E. coli) (4.6 MB), σ = 4; proteins (5 MB), σ = 23. (All test files available at szgrabowski.kis.p.lodz.pl/research/data.zip)

16 Experimental results. Compression ratio

17 The effect of varying q on the dictionary size and the overall compression. Dickens / ETDC coding. q = 4 → somewhat worse compression here than for q = 5 but a much smaller dictionary, so it may be preferred.
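To get a feel for this trade-off on one's own data, the number of q-gram tokens and the dictionary size can be tabulated for several values of q. This is only a rough sketch: the function name `qgram_stats` is hypothetical, and the dictionary size is approximated as the total number of characters in the distinct q-grams, which ignores how the dictionary is actually stored.

```python
def qgram_stats(text, q_values):
    """Token count, distinct q-grams and raw dictionary size for each q."""
    stats = {}
    for q in q_values:
        grams = [text[i:i + q] for i in range(0, len(text), q)]
        distinct = set(grams)
        stats[q] = {
            "tokens": len(grams),                       # symbols to encode
            "distinct": len(distinct),                  # dictionary entries
            "dictionary_chars": sum(len(g) for g in distinct),
        }
    return stats
```

Fewer tokens mean fewer codewords in the compressed stream, while more distinct q-grams mean a larger dictionary and, on average, longer codewords.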

18 Decompression times (excl. I/O times) [s]. On the XML file, where the word based methods can be used, the q-gram based algorithms are almost twice as fast, partly because of the better compression they provide for this case.

19 Search times [s]. Short patterns used for the test: random excerpts from the text of length 2q–2 (i.e. the longest “short” patterns). Long patterns in the test: the minimum pattern lengths that produced compressed patterns of length at least 2.

20 Conclusions. We have presented a compression algorithm for arbitrary data which enables pattern search with Boyer-Moore skips directly in the compressed representation. The algorithm is simple and the conducted experiments validate the claim of its practicality. For natural texts, however, this scheme cannot match, e.g., the original (s,c)-dense code in compression ratio, but this is the price we pay for removing the limitation to word based textual data. Searching speed for long enough patterns can be higher than in uncompressed text.

21 Future plans. Flexible text partitioning: apart from q-grams, allowing for shorter tokens (should give a significant compression boost on NL texts). Succinct dictionary representation (currently a naïve approach is used). Handling updates to T. Adapting the scheme for approximate searching (very promising!). Finding (quickly) an appropriate q for a given text.