FM-KZ: An even simpler alphabet-independent FM-index
Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland
Gonzalo Navarro, Dept. of Computer Science, Univ. of Chile, Chile
Alejandro Salinger, David R. Cheriton School of Computer Science, Univ. of Waterloo, Canada
Rafał Przywarski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland
Prague Stringology Club, Praha, Aug 2006

2 Full-text indexing – past:
suffix tree (aka "lord of the strings"): powerful, flexible, but needs at least 10n space (avg. case, assuming indices 4x larger than characters);
suffix array: 4n space, otherwise quite practical.
Full-text indexing – now:
compressed suffix array (CSA) (Grossi & Vitter, 2000; ...);
FM-index, based on the BWT (Ferragina & Manzini, 2000);
LZ-index, based on the suffix tree with LZ78 (Navarro, 2003);
alphabet-friendly FM (Ferragina et al., 2004); ...

3 Compressed indexes
Common feature: the original text may be omitted; its compressed representation alone suffices for handling queries.
Most of the compressed indexes are based on the Burrows–Wheeler transform (BWT).
Rapid development in theory (see the survey by Navarro & Mäkinen, 2006); implementations somewhat lag behind...
This work is practice-oriented, a step on from our earlier work (SPIRE'04, PSC'05).

4 Burrows–Wheeler transform (BWT), an example
[figure: the rotations as they go vs. the sorted rotations, with the F and L columns marked]
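
The rotation table itself did not survive extraction, but the transform is easy to reproduce. Below is a minimal sketch (ours, not from the slides) that obtains the L column by sorting all rotations; a real index would of course build the BWT from a suffix array instead of this quadratic comparison sort:

    // Naive BWT via sorted rotations (illustration only).
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    std::string bwt_naive(const std::string& t) {
        const size_t n = t.size();
        std::vector<size_t> rot(n);
        for (size_t i = 0; i < n; ++i) rot[i] = i;
        // Sort rotation start positions by comparing rotations lexicographically.
        std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
            for (size_t k = 0; k < n; ++k) {
                char ca = t[(a + k) % n], cb = t[(b + k) % n];
                if (ca != cb) return ca < cb;
            }
            return false;
        });
        // The L column is the last character of each sorted rotation.
        std::string L(n, ' ');
        for (size_t i = 0; i < n; ++i) L[i] = t[(rot[i] + n - 1) % n];
        return L;
    }

    int main() {
        // '$' acts as the unique, lexicographically smallest terminator.
        std::cout << bwt_naive("mississippi$") << "\n";   // prints ipssm$pissii
    }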

5 Pattern searching in the BWT sequence: the LF-mapping mechanism
Starting point in Ferragina & Manzini's index (2000): search time O(m log n), space occupancy O(n log n) bits.
Note that in this form the complexities match the plain suffix array, but the text T itself may be eliminated!
Better? Ferragina & Manzini (2000) reach O(m) time with (roughly) O(H_k n) bits of space, assuming a small alphabet.

6 Searching in the BWT sequence, an example
[figure: the BWT matrix, F and L columns, and a feasible form of the L column]

7 FM-Huffman (Grabowski et al., 2004)
Idea: search in the BWT sequence, but use a binary (or, generally, constant-size) alphabet.
Use the rank() operation on a binary sequence (Jacobson, 1989; Munro, 1996; Clark, 1996): rank(k) tells the number of 1's among the first k bits, k ≤ n, in O(1) time, using o(n) extra space.
Binary representation? Yes, you guessed it: Huffman coding (an approximation of the order-0 entropy).
Soon we'll see this is not as good as it might first seem.

8 Searching (counting query) in FM-Huffman
[figure: pseudocode for searching for the encoded pattern P' in the bit-vector B]

9 FM-Huffman index
Index construction:
1. Huffman-encode the text T, obtaining T' (n' bits).
2. Compute the BWT of T'; call it B.
3. Create another bit array, Bh, which marks the bits of B that start Huffman codewords.
Query handling:
4. Huffman-encode the pattern P, obtaining P'.
5. Search as shown on slide 8, BUT the BWT sequence is kept as-is (array B) and the additional space overhead is sublinear in n'.
6. Verify each match with additional bits (the Bh array + again extra structures sublinear in n').
Main drawback: Bh is as large as B.

10 What instead of binary Huffman? (Consider both space and search time.)
k-ary Huffman (Grabowski et al., PSC'05), k typically 4 or 16:
− the B array needs more space (usually only slightly more);
+ the Bh array is almost halved;
− rank structures for each of the 4 symbols are needed (but over a halved sequence).
In total: some 10% space gain for English and proteins (almost no gain for DNA), and a significant speedup in most cases (fewer codeword chars → fewer rank operations).

11 Now more radical: remove Bh completely
Removing Bh is possible if our encoding has a self-synchronizing property: every codeword beginning must be recognized instantaneously.
A very naive solution: unary coding. Anything better? Yes: Kautz–Zeckendorf coding.
The search is exactly like on slide 8 (for binary FM-Huffman), only line 9 now becomes:
if ep < sp then occ = 0 else occ = ep − (sp − 1)
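
As a hedged illustration (the struct name BitFM and the plain prefix-sum array are ours; the latter stands in for the o(n)-space rank structure), a counting query over a bit-level BWT with the modified final line might look as follows:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct BitFM {
        std::vector<uint8_t> B;    // bit-level BWT of the encoded text
        std::vector<int>     pre;  // pre[i] = # of 1s in B[0..i-1] (rank stand-in)
        int C[2];                  // C[c] = # of bits smaller than c in B

        explicit BitFM(const std::vector<uint8_t>& bwt)
            : B(bwt), pre(bwt.size() + 1, 0) {
            for (size_t i = 0; i < B.size(); ++i) pre[i + 1] = pre[i] + B[i];
            C[0] = 0;
            C[1] = (int)B.size() - pre[B.size()];   // all 0-rows precede 1-rows
        }
        int rank(int c, int i) const {   // # of bits equal to c in B[0..i-1]
            return c ? pre[i] : i - pre[i];
        }
        // Count the occurrences of the encoded pattern P' (processed backwards).
        // Thanks to the self-synchronizing KZ prefix, every bit-level match is
        // codeword-aligned, so no Bh verification step is needed.
        int count(const std::vector<uint8_t>& P) const {
            int sp = 1, ep = (int)B.size();          // 1-based row range [sp, ep]
            for (int i = (int)P.size() - 1; i >= 0 && sp <= ep; --i) {
                int c = P[i];
                sp = C[c] + rank(c, sp - 1) + 1;
                ep = C[c] + rank(c, ep);
            }
            if (ep < sp) return 0;
            return ep - (sp - 1);                    // the modified line 9
        }
    };

    int main() {
        // B here is the BWT of the cyclic bit string 110100; the bit pattern
        // 01 occurs twice in it, and backward search finds exactly that.
        BitFM fm({1, 1, 0, 0, 1, 0});
        std::cout << fm.count({0, 1}) << "\n";       // prints 2
    }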

12 Kautz–Zeckendorf coding
Basic variant (we denote it KZ2): all the codewords start with 110, and 110 appears nowhere else.
Let B be encoded with KZ2. If during the LF-mapping we read a 0 followed by two 1's, we know we are at a codeword boundary.
Note that we allow a 1 at a codeword end! So even three 1's can occur "in a row", but 110 occurs only at a codeword beginning.

13 Kautz–Zeckendorf coding, cont'd
KZ2 encoding (in an alternative variant: each codeword has a 1 at the beginning and at the end, and no two adjacent 1's elsewhere) presents an integer as a sum of Fibonacci numbers in a unique form.
Fibonacci sequence (note the single 1 at the start): 1, 2, 3, 5, 8, 13, 21, 34, ...
So, for example, 27 will be represented as (least significant digit first) 1001001, since 27 = 1 + 5 + 21.
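
A small sketch (ours) of the greedy rule that computes this representation, reproducing the example above:

    #include <iostream>
    #include <string>
    #include <vector>

    // Zeckendorf / Fibonacci-base digits, least significant digit first,
    // over the sequence 1, 2, 3, 5, 8, 13, 21, ... (as on the slide).
    std::string zeckendorf_lsd(unsigned x) {
        std::vector<unsigned> fib = {1, 2};
        while (fib.back() <= x)
            fib.push_back(fib[fib.size() - 1] + fib[fib.size() - 2]);
        std::string msd;                 // built MSD-first by the greedy rule
        bool started = false;
        for (int i = (int)fib.size() - 1; i >= 0; --i) {
            if (fib[i] <= x)  { msd += '1'; x -= fib[i]; started = true; }
            else if (started) msd += '0';
        }
        return std::string(msd.rbegin(), msd.rend());   // reverse to LSD-first
    }

    int main() {
        std::cout << zeckendorf_lsd(27) << "\n";   // prints 1001001
    }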

14 What is the avg codeword length for KZ?
We don't know exactly. But asymptotically (for a large alphabet) it can be upper-bounded by log_φ 2 · (H_0 + r), where r < 1 is the Huffman redundancy for the given distribution and φ = (1 + √5) / 2 is the golden ratio.
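
Where the expansion factor comes from (a reconstruction sketch, not spelled out on the slide): since the k-th Fibonacci number is about φ^k / √5, a Fibonacci-base codeword for the integer x has roughly log_φ x = log_2 x / log_2 φ ≈ 1.44 log_2 x digits; so, relative to an optimal binary code, KZ2 stretches codeword lengths asymptotically by at most a factor of log_φ 2 ≈ 1.44.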

15 Benefits of KZ
No Bh array (and no rank structure for it), so we don't perform the final pair of rank operations either.
With FM-Huffman, selectnext (telling the position of the next 1) is needed at the start of report/display query handling; now all the matches lie in a contiguous range of rows.
Drawbacks of KZ
B (and its rank structure) is longer, as the KZ code is longer than Huffman.
Longer encoded patterns mean more rank operations (as opposed to FM-Huff4).
Harder analysis...

16 An application of Fibonacci numbers...
1 mile = 1.609 km. Does that number ring a bell? (φ ≈ 1.618.)
How does a mathematician convert miles into kilometers? (According to Graham, Knuth, and Patashnik, Concrete Mathematics.)
Represent the distance in the Fibonacci base (e.g. KZ2), shift left by 1, and sum up what you've obtained.
Example: 80 miles. 80 = 1 + 3 + 21 + 55 = 101000101 (fib, LSD first).
After the shift: 0101000101 (fib) = 2 + 5 + 34 + 89 = 130 km.
Ratio 1.625, not bad...
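
The same trick in code (our illustration): greedily decompose the number into non-adjacent Fibonacci numbers, then replace each by its successor:

    #include <iostream>
    #include <vector>

    // Miles -> km by a left shift in the Fibonacci base: replacing each
    // Fibonacci number by the next one multiplies the value by roughly
    // phi = 1.618..., close to the 1.609 km in a mile.
    unsigned miles_to_km_fib(unsigned miles) {
        std::vector<unsigned> fib = {1, 2};
        while (fib.back() <= miles)
            fib.push_back(fib[fib.size() - 1] + fib[fib.size() - 2]);
        fib.push_back(fib[fib.size() - 1] + fib[fib.size() - 2]);  // room to shift
        unsigned km = 0;
        for (int i = (int)fib.size() - 2; i >= 0; --i)   // greedy Zeckendorf
            if (fib[i] <= miles) { miles -= fib[i]; km += fib[i + 1]; }
        return km;
    }

    int main() {
        std::cout << miles_to_km_fib(80) << "\n";   // prints 130 (true: ~128.7)
    }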

17 Generalized Kautz–Zeckendorf
KZ1: 10 prefix (unary coding!); KZ2: 110 prefix; KZ3: 1110 prefix, etc.
Is KZ2 best? Not always. For example, for DNA (4 symbols) the seemingly very naive unary coding has a 2.5-bit avg codeword length (assuming incompressible symbols, i.e. H_0 = 2 bits/symbol).
...Oops, this is for a slightly twisted variant: the codewords are simply 1, 10, 100 and 1000, giving (1 + 2 + 3 + 4) / 4 = 2.5 bits on average.

18 Reporting queries (basic idea) – same for all the FM-* variants
One extra bit per original symbol is needed (plus some sublinear data), and one position index per h symbols (h is a user-selected parameter, e.g., 32).
We sample positions of T' at regular intervals, but only at codeword beginnings.
Handling a query: for each found occurrence of P, keep moving digit-by-digit backward in T' until a sampled position (signalled by a flag) is met; read its index (the original position) and we are done. The backward movement in T' is thus bounded.
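
A minimal runnable sketch of the sampling scheme (ours). For brevity it works on a plain character-level FM-index; FM-KZ does the same, except that it walks bit-by-bit and counts codeword boundaries. The scanning rank() stands in for the real o(n)-space structures:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct TinyFM {
        std::string L;             // BWT last column
        int C[256] = {0};          // C[c] = # of characters < c in the text
        std::vector<int> sample;   // sampled text position per row, -1 if none

        TinyFM(const std::string& t, int h) {   // t must end with a unique '$'
            int n = (int)t.size();
            std::vector<int> sa(n);
            for (int i = 0; i < n; ++i) sa[i] = i;
            std::sort(sa.begin(), sa.end(), [&](int a, int b) {
                return t.compare(a, n, t, b, n) < 0; });   // naive suffix sort
            L.resize(n); sample.assign(n, -1);
            for (int i = 0; i < n; ++i) {
                L[i] = t[(sa[i] + n - 1) % n];
                if (sa[i] % h == 0) sample[i] = sa[i];     // every h-th position
            }
            int cnt[256] = {0};
            for (char c : t) cnt[(unsigned char)c]++;
            for (int c = 1; c < 256; ++c) C[c] = C[c - 1] + cnt[c - 1];
        }
        int rank(char c, int i) const {    // # of c in L[0..i-1]; scan = sketch
            return (int)std::count(L.begin(), L.begin() + i, c);
        }
        int LF(int i) const { return C[(unsigned char)L[i]] + rank(L[i], i); }
        int locate(int row) const {   // original position of the suffix at `row`
            int steps = 0;
            while (sample[row] < 0) { row = LF(row); ++steps; }  // walk backward
            return sample[row] + steps;
        }
    };

    int main() {
        TinyFM fm("mississippi$", 4);
        for (int r = 0; r < 12; ++r) std::cout << fm.locate(r) << " ";
        std::cout << "\n";   // prints the suffix array of mississippi$
    }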

19 Experimental results
Datasets:
80 MB of English text (from TREC-3);
60 MB of DNA (from a BLAST database), 5 characters!
55 MB of proteins (from a BLAST database).
Test platform: Intel Xeon 3.06 GHz, 2 GB of RAM, 512 KB cache, Gentoo Linux, gcc -O9.

20 Experimental results, cont'd
Counting queries: pattern lengths from 10 to 100; for each length, 1000 patterns taken from random positions of each text.
Reporting queries: random patterns of a fixed length.
Display queries: 1000 random patterns; 100 chars displayed around each found occurrence.
Competitors:
FM-index (a very simple and fast, byte-oriented variant by Navarro, 2004);
Compressed Suffix Array (CSA) (Sadakane, 2000);
Run-Length FM (RLFM) (Mäkinen & Navarro, 2005);
Succinct Suffix Array (SSA) (Mäkinen & Navarro, 2005);
LZ-index (Navarro, 2004);
FM-Huffman2 and FM-Huffman4 (Grabowski et al., 2005);
FM-KZ1 and FM-KZ2 (this work).

21 English text, search time
[figure: time in sec for varying pattern lengths]

22 English text, space vs. search time
[figure: time in sec per character]

23 DNA, space vs. search time
[figure: time in sec per character]

24 Proteins, space vs. search time
[figure: time in sec per character]

25 Observations
CSA and RLFM: hardly ever competitive.
FM-Huff16: fastest for counting queries on English and proteins.
FM-KZ1: most succinct, and among the fastest on DNA.
Reporting time: the FM-Huff variants lose to the FM-index on English and proteins; they (k = 2 and k = 4) win on DNA instead (but there SSA is even better, and more flexible at low space use).
Display time: FM-KZ1 is best for DNA. Best for proteins: FM, then FM-KZ2 (but the fastest is FM-Huff16). English text: similar to proteins, but the LZ-index is as fast as FM-Huff16 and needs about 25% less space.
The original binary Huffman: never competitive.

26 Presented algorithm – properties
Search time: O((H_0 + 1)m + occ) on average; O(m log n + occ) in the worst case.
Space occupancy: less than (H_0 + 1)n + o(H_0 n) bits.
Pros and cons (summary): a very simple and practical succinct index; no dependence on the alphabet size; among the fastest (but not the most succinct) compressed indexes; worse "in theory" than some recent indexes (but simpler); quite flexible.

27 To do
Better analysis?
Some more little tricks (and tweaks), e.g., the B array may be truncated somewhat: good for space and even for speed (elimination of some rank operations).
More experiments with a more succinct rank (e.g. a 5%-overhead rank is only moderately slower than the 10% one, definitely not twice as slow; quite an option for Huff4 and Huff16).
Higher-arity KZ?

28 DNA, search time
[figure]

29 Proteins, search time
[figure]

30 English text, report time
[figure]

31 DNA, report time
[figure]

32 Proteins, report time
[figure]

33 Possible extensions
D-ary Huffman: somewhat less space (if D is chosen carefully for the given data), faster search.
Faster and more succinct rank/selectnext solutions.
Zeckendorf coding (instead of Huffman): an even simpler structure (no Bh array), "free" counting (no "+occ" additive term) and less space for some distributions, but slower in practice.

34 Experimental results, counting queries
C++ code by V. Mäkinen, g++ compiler; machine: Pentium IV 2.6 GHz, 2 GB RAM, Red Hat Linux.
English text (ZIFF-2).
HuffFM-index size: 1.84 times the text size (FM-index: 1.33, CSA: 0.71, CCSA: 1.64).
Counting queries; times summed over 10,000 search patterns.

35 Searching (counting query) in the original FM-index
[figure: pseudocode for searching for pattern P in T^bwt]
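
Since the pseudocode figure did not survive, here is the well-known backward-search counting loop of Ferragina & Manzini as a self-contained sketch; occ() by scanning stands in for the index's rank structures:

    #include <algorithm>
    #include <iostream>
    #include <string>

    // occ(c, i) = # of occurrences of c in bwt[0..i-1].
    static int occ(const std::string& bwt, char c, int i) {
        return (int)std::count(bwt.begin(), bwt.begin() + i, c);
    }

    int fm_count(const std::string& bwt, const std::string& P) {
        int C[256] = {0}, cnt[256] = {0};
        for (char c : bwt) cnt[(unsigned char)c]++;
        for (int c = 1; c < 256; ++c) C[c] = C[c - 1] + cnt[c - 1];
        int sp = 0, ep = (int)bwt.size();      // half-open row range [sp, ep)
        for (int i = (int)P.size() - 1; i >= 0 && sp < ep; --i) {
            unsigned char c = P[i];
            sp = C[c] + occ(bwt, (char)c, sp);
            ep = C[c] + occ(bwt, (char)c, ep);
        }
        return ep - sp;                        // # of occurrences of P in T
    }

    int main() {
        // BWT of mississippi$; the pattern ssi occurs twice in the text.
        std::cout << fm_count("ipssm$pissii", "ssi") << "\n";   // prints 2
    }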

36 How to implement rank()
Basic: needs o(n) bits of extra space; 3 levels (with a 2-d array lookup in the last stage); "real" O(1) time.
Basic rank idea: add 2 levels of precomputed counters (the first holding O(log n)-bit values, the second only O(log log n)-bit values), followed by scanning in O(w)-bit chunks (w = O(log n) on a RAM).
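
A hedged sketch of the counters-plus-scan idea (the block sizes, types, and names are our choices; __builtin_popcountll is the GCC/Clang intrinsic):

    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Rank {
        static const int SB = 512, W = 64;   // superblock / word sizes in bits
        std::vector<uint64_t> bits;
        std::vector<uint32_t> super;   // # of 1s before each superblock
        std::vector<uint16_t> block;   // # of 1s before each word, in-superblock

        explicit Rank(const std::vector<uint64_t>& v) : bits(v) {
            uint32_t total = 0; uint16_t inSuper = 0;
            for (size_t i = 0; i < v.size(); ++i) {
                if (i % (SB / W) == 0) { super.push_back(total); inSuper = 0; }
                block.push_back(inSuper);
                int pc = __builtin_popcountll(v[i]);
                total += pc; inSuper += (uint16_t)pc;
            }
        }
        // rank1(i) = # of 1s among the first i bits (LSB-first within words);
        // query positions must lie strictly inside the stored sequence.
        uint32_t rank1(size_t i) const {
            size_t w = i / W, r = i % W;
            uint32_t res = super[w / (SB / W)] + block[w];
            if (r) res += __builtin_popcountll(bits[w] & (~0ULL >> (W - r)));
            return res;
        }
    };

    int main() {
        Rank rk({0xF0F0F0F0F0F0F0F0ULL, 0xFFULL});
        std::cout << rk.rank1(8) << " " << rk.rank1(64) << " "
                  << rk.rank1(72) << "\n";   // prints 4 32 40
    }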

37 How to implement rank() better (González et al., WEA 2005)
Byte-oriented, with popcounting at the byte level: a small (cache-friendly) lookup table for the last stage.
We've experimented with the parameter settings. The basic rank needs approx. 67% of the original bit sequence; this practical implementation needs 37.5%.
Another option (no experimental results for it in this paper): a single counter level + sequential scan, a clear space/time trade-off.
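
And a sketch of the byte-level popcount for the last stage, with the small lookup table (illustrative parameters, not the exact layout of the paper):

    #include <cstdint>
    #include <iostream>

    static uint8_t POP8[256];
    void initPop8() {                      // POP8[b] = # of 1-bits in byte b
        for (int b = 1; b < 256; ++b) POP8[b] = (uint8_t)((b & 1) + POP8[b >> 1]);
    }

    // # of 1s among the first i bits of the byte array bv (sequential scan;
    // in the full structure, counter levels limit this scan to one block).
    uint32_t rankBytes(const uint8_t* bv, size_t i) {
        uint32_t r = 0; size_t k = 0;
        for (; (k + 1) * 8 <= i; ++k) r += POP8[bv[k]];
        if (i % 8) r += POP8[bv[k] & ((1u << (i % 8)) - 1)];
        return r;
    }

    int main() {
        initPop8();
        uint8_t bv[] = {0xF0, 0xFF};
        std::cout << rankBytes(bv, 8) << " " << rankBytes(bv, 12) << "\n";  // 4 8
    }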

38 Experimental results, overall index space
How varying k affects the total space used. Alphabet size for DNA: 5; for proteins: 24.
[figure]

39 Experimental results, space consumed by substructures
[figure]

40 English text, space vs. display time
[figure]

41 DNA, space vs. display time
[figure]

42 Proteins, space vs. display time
[figure]

43 The proposed index step-by-step, example
[figure]