
On Compression and Indexing: Two Sides of the Same Coin
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

What do we mean by Indexing?

Types of data: linguistic or tokenizable text, versus raw sequences of characters or bytes (DNA sequences, audio-video files, executables).
Types of query: word-based queries versus character-based queries (arbitrary substring, complex match).

Two indexing approaches:
- Word-based indexes, where a notion of "word" must be devised: inverted files, signature files, bitmaps.
- Full-text indexes, with no constraint on text or queries: suffix array, suffix tree, String B-tree [Ferragina-Grossi, JACM 99].

What do we mean by Compression?

Compression has two positive effects:
- Space saving (equivalently, double the memory at the same cost).
- Performance improvement: better use of the memory levels close to the processor, increased disk and memory bandwidth, reduced (mechanical) seek time.

Since March 2001 the Memory eXpansion Technology (MXT) has been available on IBM eServers x330: the same performance as a PC with double the memory, but at half the cost.

Moral: it is more economical to store data in compressed form than uncompressed, and CPU speeds nowadays make (de)compression essentially costless!

Compression and Indexing: two sides of the same coin!

Do we witness a paradoxical situation?
- An index injects redundant data, in order to speed up pattern searches.
- Compression removes redundancy, in order to squeeze the space occupancy.

No: new results prove a mutual reinforcement behaviour!
- Better indexes can be designed by exploiting compression techniques (in terms of space occupancy).
- Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio).

Moral: CPM researchers must have a multidisciplinary background, ranging from data-structure design to data compression, from architectural knowledge to database principles, up to algorithmic engineering and more...

Our journey, today...

Index design (Weiner, 73) and compressor design (Shannon, 48) converge in two tools: the Suffix Array (1990) and the Burrows-Wheeler Transform (1994). From these we obtain:
- a compressed index, with space close to gzip/bzip and query time close to O(|P|);
- a compression booster, a tool to transform a poor compressor into a better compression algorithm;
- the wavelet tree, yielding improved indexes and compressors.

The Suffix Array [Baeza-Yates-Gonnet, 87; Manber-Myers, 90]

Let SUF(T) be the sorted set of suffixes of T. For T = mississippi#, SUF(T) is:

  #
  i#
  ippi#
  issippi#
  ississippi#
  mississippi#
  pi#
  ppi#
  sippi#
  sissippi#
  ssippi#
  ssissippi#

Storing SUF(T) explicitly would take Θ(N²) space; the suffix array SA stores only the N suffix start pointers, so SA + T occupy about N log_2 N + N log_2 |Σ| bits.

Prop 1. All suffixes in SUF(T) having prefix P are contiguous in SA (e.g. P = si).
Prop 2. These suffixes immediately follow P's lexicographic position.
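The construction can be sketched in a few lines. This is a didactic, O(N² log N) sort-based version (linear-time constructions exist, e.g. the skew/DC3 algorithm, but are longer); it is not the construction used in the talk's experiments.

```python
def suffix_array(t):
    # Sort the suffix start positions by the suffixes they point to.
    # Didactic O(N^2 log N); linear-time algorithms exist.
    return sorted(range(len(t)), key=lambda i: t[i:])

T = "mississippi#"
SA = suffix_array(T)
# T[SA[0]:], T[SA[1]:], ... are in lexicographic order, and (Prop 1)
# the suffixes prefixed by P = "si" occupy a contiguous range of SA.
```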

Searching in the Suffix Array [Manber-Myers, 90]

For P = si on T = mississippi#, a binary search over SA takes O(log_2 N) steps, each costing O(|P|) character comparisons: O(|P| log_2 N) time overall. The LCP refinement of Manber-Myers lowers this to O(|P| + log_2 N) time; on disk one gets O(|P|/B + log_B N) I/Os [JACM 99], and a self-adjusting version exists [FOCS 02].

Note that the suffix permutation cannot be an arbitrary element of {1,...,N}!: the number of binary texts is 2^N « N! , the number of permutations of {1, 2, ..., N}. Hence N log_2 N bits is not a lower bound on the bit-space occupancy.
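The O(|P| log_2 N) search can be sketched as two binary searches, one for each end of the range of Prop 1 (a minimal version, without the LCP speed-up):

```python
def sa_range(t, sa, p):
    # Half-open SA range [fr, lr) of the suffixes prefixed by p.
    # Two binary searches of O(log N) steps, O(|p|) comparisons each.
    fr, hi = 0, len(sa)
    while fr < hi:                       # leftmost suffix >= p
        mid = (fr + hi) // 2
        if t[sa[mid]:] < p:
            fr = mid + 1
        else:
            hi = mid
    lr, hi = fr, len(sa)
    while lr < hi:                       # leftmost suffix whose |p|-prefix > p
        mid = (lr + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:
            lr = mid + 1
        else:
            hi = mid
    return fr, lr

t = "mississippi#"
sa = sorted(range(len(t)), key=lambda i: t[i:])
fr, lr = sa_range(t, sa, "si")
# lr - fr occurrences, starting at text positions sa[fr:lr]
```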

The Burrows-Wheeler Transform (1994)

Given the text T = mississippi#, take all its cyclic rotations and sort the rows lexicographically. F is the first column, L the last:

  F              L
  # mississipp   i
  i #mississip   p
  i ppi#missis   s
  i ssippi#mis   s
  i ssissippi#   m
  m ississippi   #
  p i#mississi   p
  p pi#mississ   i
  s ippi#missi   s
  s issippi#mi   s
  s sippi#miss   i
  s sissippi#m   i

The output of the transform is the last column L.
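In code the transform is one sort over the rotations (a quadratic teaching sketch; as a later slide notes, real implementations derive L from the suffix array instead):

```python
def bwt(t):
    # Last column L of the lexicographically sorted rotation matrix.
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations)

L = bwt("mississippi#")   # -> "ipssm#pissii"
```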

Why is L so interesting for compression?

A key observation: L is locally homogeneous, hence highly compressible.

Algorithm Bzip:
1. Move-to-Front coding of L
2. Run-Length coding
3. Statistical coder: Arithmetic or Huffman

Bzip vs. Gzip: roughly 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!

Building the BWT amounts to a suffix-array construction; inverting the BWT is a simple array visit: O(N) time overall.
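Step 1, Move-to-Front, can be sketched as follows; on a locally homogeneous L it turns runs of equal characters into runs of zeros, which Run-Length coding and the statistical coder then crunch:

```python
def mtf_encode(s, alphabet):
    # Emit each symbol's current position in the list,
    # then move that symbol to the front of the list.
    lst = list(alphabet)
    out = []
    for c in s:
        i = lst.index(c)
        out.append(i)
        lst.pop(i)
        lst.insert(0, c)
    return out

# A run of equal characters becomes 0s after its first symbol:
codes = mtf_encode("ipssm#pissii", "#imps")
```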

Suffix Array vs. BW-Transform

Reading the last character of each sorted rotation of T = mississippi# yields L = ipssm#pissii. Since each rotation is a suffix followed by a prefix, L[i] is the character preceding the i-th smallest suffix, i.e. L[i] = T[SA[i] - 1]. So L encodes both SA and T. Can we search within L?

A compressed index [Ferragina-Manzini, IEEE FOCS 2000]

Bridging data-structure design and compression techniques: the suffix array meets the Burrows-Wheeler Transform.

The theoretical result:
- Query complexity: O(p + occ · log N) time.
- Space occupancy: O(N · H_k(T)) + o(N) bits, where H_k is the k-th order empirical entropy; the o(N) term vanishes if T is compressible. The index does not depend on k, and the bound holds for all k simultaneously.

The corollary: the suffix array is compressible, and the result is a self-index.

In practice the index is very appealing:
- space close to the best known compressors, i.e. bzip;
- query times of a few milliseconds on hundreds of MBs.

A plethora of papers followed on O(n)-space and H_k-space indexes: Grossi-Gupta-Vitter (03), Sadakane (02), ...; by now more than 20 papers by more than 20 authors on related subjects.

A useful tool: the LF mapping

How do we map L's characters onto F's characters? We need to distinguish equal characters: the LF mapping sends the i-th occurrence of a character c in L to the i-th occurrence of c in F.

Substring search in T (counting occurrences)

Available info: the array C, where C[c] = number of characters of T smaller than c, i.e. the (0-based) row of the first suffix starting with c. For T = mississippi#: C[#] = 0, C[i] = 1, C[m] = 5, C[p] = 6, C[s] = 8.

Search for P = si on L = ipssm#pissii:
- First step: the rows prefixed by the last character P[p] = i form the range [fr, lr) = [C[i], C[m]).
- Inductive step: given [fr, lr) for P[j+1..p]: (1) take c = P[j]; (2) find the first occurrence of c in L[fr, lr); (3) find the last occurrence of c in L[fr, lr); (4) apply the L-to-F mapping to these two characters; the new [fr, lr) delimits the rows prefixed by P[j..p].

At the end, occ = lr - fr: for P = si we get occ = 2, the two rows prefixed by si.
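The counting procedure follows directly from the two ingredients above, C and a rank function over L. Here rank is computed naively by scanning L; the real FM-index answers rank in constant time on the compressed L:

```python
def fm_count(L, p):
    # C[c] = number of characters in L (hence in T) smaller than c.
    C = {c: sum(x < c for x in L) for c in set(L)}
    rank = lambda c, i: L[:i].count(c)   # occurrences of c in L[:i]
    fr, lr = 0, len(L)                   # rows [fr, lr) prefixed by P[j..]
    for c in reversed(p):                # scan the pattern backward
        if c not in C:
            return 0
        fr = C[c] + rank(c, fr)          # LF-map the first matching c
        lr = C[c] + rank(c, lr)          # LF-map past the last matching c
        if fr >= lr:
            return 0
    return lr - fr

occ = fm_count("ipssm#pissii", "si")     # -> 2
```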

Many details are missing...

- The column L is actually kept compressed, and yet O(p) time to count P's occurrences is still guaranteed.
- The Locate operation takes O(log N) time, using some additional data structures within o(N) bits.
- Efficient and succinct index construction is possible [Hon et al., FOCS 03].
- Bio-application: the Human-genome index fits in a PC [Sadakane et al., 02].

Interesting issues:
- What about arbitrary alphabets? [Grossi et al., 03; Ferragina et al., 04]
- What about disk-aware, cache-oblivious, or self-adjusting versions?
- What about challenging applications: bio, DB, data mining, handheld PCs, ...?

Where we are...

We investigated one direction of the reinforcement relation: compression ideas → index design. Let us now turn to the other direction: indexing ideas → compressor design, via the booster.

What do we mean by boosting?

It is a technique that takes a poor compressor A and turns it into a compressor with better performance guarantees. A memoryless compressor is "poor" in that it assigns codewords to symbols according only to their frequencies (e.g. Huffman). It incurs some obvious limitations: T = a^n b^n and T = a random string of length 2n with the same number of a's and b's receive the same codeword lengths, although the former is far more compressible.

What do we mean by boosting? (cont.)

Writing c = A_boost(T), qualitatively we would like to achieve:
- c is shorter than A(T) whenever T is compressible; the more compressible T is, the shorter c is;
- Time(A_boost) = O(Time(A)), i.e. no slowdown;
- A is used as a black box.

Two key components: the Burrows-Wheeler Transform and the suffix tree.

The empirical entropy H_0

H_0(T) = − Σ_i (n_i/n) log_2 (n_i/n), where n_i is the frequency in T of the i-th alphabet symbol and n = |T|.

|T| · H_0(T) is the best you can hope for from a memoryless compressor; Huffman and arithmetic coders come close to this bound. We get better compression by using a codeword that depends on the k symbols preceding the one to be compressed (its context).
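H_0 is a one-liner; note that it depends only on symbol frequencies, which is exactly the memoryless limitation discussed on the previous slide:

```python
from collections import Counter
from math import log2

def H0(s):
    # Zeroth-order empirical entropy: -sum_i (n_i/n) * log2(n_i/n)
    n = len(s)
    return -sum((ni / n) * log2(ni / n) for ni in Counter(s).values())

# a^n b^n and a shuffled string with the same symbol counts
# both have H0 = 1 bit/char:
h1, h2 = H0("aaaabbbb"), H0("abbabaab")
```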

The empirical entropy H_k

H_k(T) = (1/|T|) Σ_{|ω|=k} |T[ω]| · H_0(T[ω]), where T[ω] is the string of symbols that precede the occurrences of context ω in T.

Example: given T = mississippi, we have T[i] = mssp and T[is] = ms.

To compress T up to H_k(T) it suffices to compress each T[ω] up to its H_0 (using Huffman or arithmetic coding), for every k-long context ω. Problems with this approach:
- How do we go from all the T[ω] back to the string T?
- How do we choose the best k efficiently?

The tools: BWT and suffix tree.
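The definition translates directly into code. Following the slide's convention, T[ω] collects the symbol preceding each occurrence of ω (a self-contained sketch):

```python
from collections import Counter
from math import log2

def contexts(t, k):
    # t[w] = string of symbols that precede each occurrence of context w
    out = {}
    for i in range(1, len(t) - k + 1):
        out.setdefault(t[i:i + k], []).append(t[i - 1])
    return {w: "".join(cs) for w, cs in out.items()}

def H0(s):
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def Hk(t, k):
    # (1/|t|) * sum over k-long contexts w of |t[w]| * H0(t[w])
    return sum(len(tw) * H0(tw) for tw in contexts(t, k).values()) / len(t)
```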

Remember that...

T[ω] is a permutation of a piece of bwt(T). For T = mississippi#, the rows of the sorted rotation matrix whose suffixes start with a context ω are contiguous, so the corresponding characters of bwt(T) = ipssm#pissii form exactly the multiset of T[ω]: e.g. the rows starting with "is" contribute the piece sm, a permutation of T[is] = ms. Hence we can use the BWT to approximate H_k: to compress T up to H_k(T), compress the pieces of bwt(T) up to their H_0.
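The permutation claim is easy to check in code: partition bwt(T) by the k-long context each row's suffix starts with (a self-contained sketch using the suffix-array view of the BWT):

```python
def bwt_pieces(t, k):
    # Group the BWT characters (the char cyclically preceding each
    # sorted suffix) by the first k characters of that suffix.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    pieces = {}
    for r in sa:
        pieces.setdefault(t[r:r + k], []).append(t[r - 1])
    return {w: "".join(p) for w, p in pieces.items()}

pieces = bwt_pieces("mississippi#", 2)
# pieces["is"] is a permutation of T[is] = "ms"
```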

What are the pieces of bwt(T) to compress?

Recall that each T[ω] is a permutation of a piece of bwt(T). Partitioning bwt(T) by the 1-long contexts of the rows and compressing each piece up to its H_0, we achieve H_1(T); partitioning by the 2-long contexts, we achieve H_2(T). We thus have a workable way to approximate H_k for any k.

Finding the best pieces to compress

Goal: find the best BWT-partition induced by a leaf cover of the suffix tree of T. Each suffix-tree node corresponds to a context, hence to a contiguous range of rows, hence to a piece of bwt(T); a leaf cover therefore partitions bwt(T) into such pieces. Some leaf covers are related to H_k: the cover L_1 cutting at depth 1 yields the partition achieving H_1(T), the cover L_2 at depth 2 yields H_2(T), and so on.

A compression booster [Ferragina-Manzini, SODA 04]

Let A be the compressor we wish to boost. Let bwt(T) = t_1, ..., t_r be the partition induced by a leaf cover L, and define cost(L, A) = Σ_j |A(t_j)|.

Goal: find the leaf cover L* of minimum cost. A post-order visit of the suffix tree suffices, hence linear time. We have: cost(L*, A) ≤ cost(L_k, A) ≈ H_k(T), for every k.

Technically, we show a bound of the form |c| ≤ λ · |s| · H_k(s) + f(|s|), holding simultaneously for every k ≥ 0 (up to lower-order terms in k and log_2 |s|), where λ and f(·) depend on the 0-th order guarantee of A.

Researchers may now concentrate on the apparently simpler task of designing 0-th order compressors. [Further results joint with Giancarlo and Sciortino.]

May we close the mutual reinforcement cycle?

The Wavelet Tree [Grossi-Gupta-Vitter, SODA 03]. Using the WT within each piece of the optimal BWT-partition, we get:
- a compressed index that scales well with the alphabet size [joint with Manzini, Mäkinen, Navarro];
- a reduction of the compression problem to achieving H_0 on binary strings [joint with Giancarlo, Manzini, Sciortino].

Interesting issues:
- What about the space-efficient construction of the BWT?
- What about these tools for XML or images?
- Other application contexts: bio, DB, data mining, networks, ...
- From theory to technology: libraries, libraries, ... [e.g. LEDA].


A historical perspective

Shannon showed a narrower result for a stationary ergodic source S. Idea: compress groups of k characters of the string T; result: the compression ratio tends to the entropy of S as k grows. Various limitations:
- It works for a source S, not for an individual string.
- It must modify A's structure, because of the alphabet change.
- For a given string T, the best k is found by trying k = 0, 1, ..., |T|: a Θ(|T|²) time slowdown.
- k is eventually fixed, and this is not an optimal choice!

The booster instead works on any string s, uses A as a black box, runs in O(|s|) time, and exploits variable-length contexts. Its two key components: the Burrows-Wheeler Transform and the suffix tree.

How do we find the best partition (i.e. the best k)?

- Approximate it via MTF [Burrows-Wheeler, 94]: MTF is efficient in practice [bzip2], but theory and practice showed that we can aim for more!
- Use dynamic programming [Giancarlo-Sciortino, CPM 03]: it finds the optimal partition, but it is very slow, with time complexity cubic in |T|.
- Surprisingly, full-text indexes help in finding the optimal partition in optimal linear time!

Example: not one best k

Take s = (bad)^n (cad)^n (xy)^n (xz)^n, and compare 1-long with 2-long contexts:
- For context x: x_s = y^n z^n, which costs more than the split into the 2-long contexts yx and zx, where yx_s = y^(n-1) and zx_s = z^(n-1).
- For context a: a_s = d^2n, which costs less than the split into ba and ca, where ba_s = d^n and ca_s = d^n.

So the optimal partition mixes contexts of different lengths: no single k is best.

A word-based compressed index

T = ...bzip...bzip2unbzip2unbzip... What about word-based occurrences of P = bzip? A plain substring search also reports bzip as a prefix (bzip2), as a suffix (unbzip), and as an internal substring (unbzip2), and the post-processing phase needed to filter them out can be time-consuming!

The FM-index can be adapted to support word-based searches: preprocess T and transform it into a digested text DT, then build the FM-index over DT. A word search in T becomes a substring search in DT.

The WFM-index

A variant of the Huffman algorithm:
- the symbols of the Huffman tree are the words of T;
- the Huffman tree has fan-out 128;
- codewords are byte-aligned and tagged: in each byte, 1 tag bit says whether it is the first byte of a codeword (yes/no), and the remaining 7 bits carry the Huffman code.

Example: T = bzip or not bzip is digested into DT, the sequence of the codewords of [bzip] [or] [not] [bzip] (separators included).

WFM-index = 1. dictionary of words (the Huffman tree itself) + 2. FM-index built on DT. In practice: space ≈ 22% of T, word search for P = bzip ≈ 4 ms.

The BW-Transform is invertible

Two key properties:
1. We can map L's characters to F's characters: the i-th occurrence of a character c in L corresponds to the i-th occurrence of c in F (the LF mapping).
2. L[i] precedes F[i] in T.

Hence we can reconstruct T backward: starting from the row of #, output its L-character, jump via the LF mapping to that character's row in F, and repeat (for mississippi#: ... i → p → i ...).

Building the BWT = suffix-array construction; inverting the BWT = an array visit: O(N) time overall.
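The backward reconstruction can be sketched as follows, assuming T ends with a unique smallest sentinel # (so that row 0 of the sorted matrix is the rotation starting with #):

```python
def ibwt(L):
    # C[c] = first row of F starting with c; LF[i] = row of F holding the
    # same character occurrence as L[i] (equal chars keep their rank).
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    seen, LF = {}, []
    for c in L:
        LF.append(C[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    out, row = [], 0                 # row 0 starts with the sentinel '#'
    for _ in range(len(L)):
        out.append(L[row])           # L[row] precedes F[row] in T
        row = LF[row]
    t = "".join(reversed(out))       # the rotation of T starting at '#'
    return t[1:] + t[0]

text = ibwt("ipssm#pissii")          # -> "mississippi#"
```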

The Wavelet Tree [Grossi-Gupta-Vitter, SODA 03]
[collaboration: Giancarlo, Manzini, Mäkinen, Navarro, Sciortino]

Example: mississippi over the alphabet {i, m, p, s}. The root bitvector marks whether each symbol belongs to {i, m} or to {p, s}; the left child then handles the subsequence miiii, the right child the subsequence sssspp, and so on recursively down to single symbols.

Use the WT within each BWT piece: alphabet independence, with binary string compression/indexing as the only remaining problem.
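A minimal pointer-based wavelet tree with rank support closes the picture (a sketch: real implementations use succinct bitvectors with O(1) rank, which is what makes FM-index backward search fast on large alphabets):

```python
class WaveletTree:
    # Split the alphabet in two halves; a bitvector records, per position,
    # which half the symbol lies in; recurse on the two subsequences.
    def __init__(self, s, alphabet=None):
        self.alpha = sorted(set(s)) if alphabet is None else alphabet
        if len(self.alpha) <= 1:
            self.bits = None                 # leaf: a single symbol
            return
        mid = len(self.alpha) // 2
        left = set(self.alpha[:mid])
        self.bits = [0 if c in left else 1 for c in s]
        self.l = WaveletTree("".join(c for c in s if c in left),
                             self.alpha[:mid])
        self.r = WaveletTree("".join(c for c in s if c not in left),
                             self.alpha[mid:])

    def rank(self, c, i):
        # occurrences of c in s[:i], one binary rank per level
        if self.bits is None:
            return i if self.alpha == [c] else 0
        mid = len(self.alpha) // 2
        ones = sum(self.bits[:i])
        if c in self.alpha[:mid]:
            return self.l.rank(c, i - ones)
        return self.r.rank(c, ones)

wt = WaveletTree("ipssm#pissii")
# wt.rank("s", 5): occurrences of "s" among the first 5 characters
```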