Wavelet Trees Ankur Gupta Butler University

Text Dictionary Problem The input is a text T drawn from an alphabet Σ. We want to support the following queries –char(i) – returns the symbol at position i –rank_c(i) – the number of occurrences of c in T up to position i –select_c(i) – the position of the ith occurrence of symbol c in T Text T can be compressed to nH_0 space, answering queries in –O(log |Σ|) time using the wavelet tree [GGV03] –O(log log |Σ|) time using [GMR06], but with more space When |Σ| = polylog(n), queries can be answered in O(1) time [FMMN04]
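As a baseline, the semantics of the three queries can be sketched naively on an uncompressed string (a minimal reference implementation, not the wavelet tree itself; positions are 1-based as in the slides, and all names are of my choosing):

```python
def char(T, i):
    """Symbol at position i (1-based)."""
    return T[i - 1]

def rank(T, c, i):
    """Number of occurrences of c in T[1..i]."""
    return T[:i].count(c)

def select(T, c, j):
    """Position (1-based) of the j-th occurrence of c in T, or None."""
    count = 0
    for pos, ch in enumerate(T, start=1):
        if ch == c:
            count += 1
            if count == j:
                return pos
    return None

T = "preparedpeppers"
print(char(T, 7))         # e
print(rank(T, "r", 10))   # 2
print(select(T, "r", 2))  # 6
```

These naive versions take O(n) time per query; the point of the wavelet tree is to match these answers in O(log |Σ|) time within compressed space.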

Example: the wavelet tree for T = preparedpeppers (figure reconstructed as text; each internal node halves its alphabet and stores one bitvector):

Root: preparedpeppers  110101001011011  ({a,d,e} → 0, {p,r,s} → 1)
  Left: eaedee         101111           ({a} → 0, {d,e} → 1)
    Left: a            1
    Right: eedee       11011            ({d} → 0, {e} → 1)
      Left: d          1
      Right: eeee      1111
  Right: prprppprs     010100011        ({p} → 0, {r,s} → 1)
    Left: ppppp        11111
    Right: rrrs        0001             ({r} → 0, {s} → 1)
      Left: rrr        111
      Right: s         1

Compute rank_r(10) (answer is 2):
–At the root, r maps to bit 1, so actually compute rank_1(10) = 5 and descend right.
–At prprppprs, r maps to bit 1: actually compute rank_1(5) = 2; descend right.
–At rrrs, r maps to bit 0: actually compute rank_0(2) = 2; descend left.
–At the leaf rrr, rank_1(2) = 2.

Compute select_r(2) on the same wavelet tree for preparedpeppers (answer is 6). Select proceeds bottom-up, from the leaf toward the root:
–At the leaf rrr, the 2nd occurrence of r is at position 2.
–At rrrs, r maps to bit 0, so actually compute select_0(2) = 2.
–At prprppprs, r maps to bit 1: actually compute select_1(2) = 4.
–At the root, actually compute select_1(4) = 6.

Compute char(7) on the same wavelet tree for preparedpeppers (answer is e):
–At the root, bit 7 is 0, so descend left with position rank_0(7) = 3.
–At eaedee, bit 3 is 1, so descend right with position rank_1(3) = 2.
–At eedee, bit 2 is 1, so descend right with position rank_1(2) = 2.
–We reach the leaf eeee, so char(7) = e.

Some comments We don’t have to store any of the “all 1s” leaf nodes –They are shown only to help the example. What does the wavelet tree imply? –It converts the representation of a string over an alphabet into the representation of several bitvectors. –Useful to achieve, ultimately, high-order compression. –Easy to implement – a very simple structure and query pattern
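Since the structure is so simple, the whole scheme fits in a few dozen lines. Below is a minimal, illustrative balanced wavelet tree following the recursion in the example slides; it stores bitvectors as plain lists and answers rank/select on them by scanning, whereas a real implementation would plug in succinct O(1)-time rank/select bitvectors. All names are of my choosing.

```python
class WaveletTree:
    def __init__(self, text, alphabet=None):
        # Root alphabet = distinct symbols of the text, split in half at each level.
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        if len(self.alphabet) == 1:
            self.leaf = self.alphabet[0]
            return
        self.leaf = None
        mid = len(self.alphabet) // 2
        left_set = set(self.alphabet[:mid])
        self.bits = [0 if c in left_set else 1 for c in text]
        self.left = WaveletTree([c for c in text if c in left_set],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in text if c not in left_set],
                                 self.alphabet[mid:])

    def rank(self, c, i):
        """Occurrences of c among the first i symbols (1-based)."""
        if self.leaf is not None:
            return i                      # leaf holds only c's
        b = 0 if c in self.left.alphabet else 1
        i = self.bits[:i].count(b)        # rank_b(i) on this node's bitvector
        return (self.left if b == 0 else self.right).rank(c, i)

    def select(self, c, j):
        """Position of the j-th occurrence of c; works bottom-up."""
        if self.leaf is not None:
            return j
        b = 0 if c in self.left.alphabet else 1
        j = (self.left if b == 0 else self.right).select(c, j)
        count = 0                         # select_b(j) on this node's bitvector
        for pos, bit in enumerate(self.bits, start=1):
            if bit == b:
                count += 1
                if count == j:
                    return pos

    def char(self, i):
        """Symbol at position i (1-based)."""
        if self.leaf is not None:
            return self.leaf
        b = self.bits[i - 1]
        child = self.left if b == 0 else self.right
        return child.char(self.bits[:i].count(b))

wt = WaveletTree("preparedpeppers")
print(wt.rank("r", 10), wt.select("r", 2), wt.char(7))  # 2 6 e
```

The three methods trace exactly the walks shown in the example slides: rank and char go top-down, select goes bottom-up.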

Shapin’ Up To Something Special What about the shape of a wavelet tree? –Does it affect space? No. (You will see why in a bit.) –Time? Yes. Good news! Reorganize it to optimize query time... –Use a Huffman-shaped tree based on query access frequencies. –If the weights are symbol frequencies, queries now take O(H_0) average time instead of O(log |Σ|).
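The Huffman-shaping claim can be checked numerically: the average leaf depth of a Huffman-shaped tree equals the average Huffman code length, which is within one bit of H_0. A small sketch (assuming query frequencies equal symbol frequencies; helper names are mine):

```python
import heapq
import math
from collections import Counter

def h0(text):
    """Zeroth-order empirical entropy in bits per symbol."""
    n = len(text)
    return sum((w / n) * math.log2(n / w) for w in Counter(text).values())

def huffman_avg_depth(text):
    """Average leaf depth of a Huffman tree built from symbol frequencies."""
    freq = Counter(text)
    # Heap entries: (weight, tiebreak, {symbol: depth-so-far}).
    heap = [(w, i, {c: 0}) for i, (c, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {c: d + 1 for c, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    depths = heap[0][2]
    n = len(text)
    return sum(freq[c] * depths[c] for c in freq) / n

T = "preparedpeppers"
print(round(h0(T), 3), round(huffman_avg_depth(T), 3))  # 2.283 2.333
```

For this text the average query descends about 2.33 levels instead of the ⌈log 6⌉ = 3 levels of the balanced shape.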

Wavelet Tree Space/Time Simple bitvectors –n bits per level and log |Σ| levels → n log |Σ| bits overall –O(n log log n / log n) extra bits for rank/select [J89] –Same space as the original text, but we can now support rank_c/select_c/char in O(log |Σ|) time. (RAM model) Fancy bitvectors –[RRR02] gets nH_0 + O(n log log n / log n) bits of space with the same O(log |Σ|) query time
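The o(n) extra bits for rank come from a two-level counting directory in the style of [J89]: absolute counts at superblock boundaries plus small relative counts at block boundaries, with a final popcount inside one block. A toy sketch with illustrative constants (in theory the superblock is log²n bits and the block is ½·log n bits; names are mine):

```python
class RankBitvector:
    SUPER = 16   # superblock size in bits (log^2 n in theory)
    BLOCK = 4    # block size in bits ((log n)/2 in theory)

    def __init__(self, bits):
        self.bits = bits
        self.super_counts = []   # ones before each superblock (absolute)
        self.block_counts = []   # ones before each block, within its superblock
        total = within = 0
        for i in range(len(bits) + 1):
            if i % self.SUPER == 0:
                self.super_counts.append(total)
                within = 0
            if i % self.BLOCK == 0:
                self.block_counts.append(within)
            if i < len(bits):
                within += bits[i]
                total += bits[i]

    def rank1(self, i):
        """Ones in bits[0..i): two table lookups plus one in-block popcount."""
        return (self.super_counts[i // self.SUPER]
                + self.block_counts[i // self.BLOCK]
                + sum(self.bits[(i // self.BLOCK) * self.BLOCK : i]))

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
rv = RankBitvector(bits)
print(rv.rank1(10))  # 6
```

The in-block sum stands in for the constant-time table lookup (or hardware popcount) used in the real structure.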

Even Skewed Is a Shape

Empirical Entropy Text T of n symbols drawn from alphabet Σ (n lg |Σ| bits uncompressed) Entropy: a measure of compressed size based on the symbol statistics of the text itself –Zeroth-order entropy: H_0(T) = Σ_c (n_c/n) lg(n/n_c), where n_c is the number of occurrences of symbol c Higher-order entropy H_h (of order h) –Considers the context x of the preceding h symbols –Each Prob[y|x] term is thus conditioned on a context x of h symbols –Note that H_h(T) ≤ lg |Σ| –Now the text takes nH_h ≤ n lg |Σ| bits of space to encode
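These definitions are easy to make concrete: H_h is the length-weighted H_0 of the symbols that follow each h-symbol context. A small sketch (function names are mine):

```python
import math
from collections import Counter, defaultdict

def H0(text):
    """Zeroth-order empirical entropy, bits per symbol."""
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in Counter(text).values())

def Hh(text, h):
    """Order-h empirical entropy: H0 of followers, averaged over contexts."""
    if h == 0:
        return H0(text)
    contexts = defaultdict(list)
    for i in range(h, len(text)):
        contexts[text[i - h:i]].append(text[i])   # symbol after each context
    n = len(text)
    return sum(len(s) * H0(s) for s in contexts.values()) / n

T = "mississippi"
print(round(H0(T), 3), round(Hh(T, 1), 3))  # 1.823 0.796
```

Note how conditioning on even one preceding symbol cuts the entropy of this text by more than half.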

One Text Indexing Result, Because Frankly, There Are Lots Main results (using the CSA [GGV03]) –Space usage: nH_h + o(n log |Σ|) bits –Search time: O(m log |Σ| + polylog(n)) –Can improve to o(m) time with a constant factor more space When the text T is highly compressible (i.e. nH_h = o(n)), we achieve the first index sublinear in both space and search time Second-order terms represent the space needed for –Fast indexing –Storing count statistics for the text We also obtain nearly tight bounds on the encoding length of the Burrows-Wheeler Transform (BWT)

Tell Me More! How Do You Do It? The idea (figure of SA_0 over text positions omitted): suppose for this example we already know SA_1, the suffix array restricted to even text positions.
–For an entry of SA_0 holding an even text position, use SA_1. Example: SA_0[5] = 2·SA_1[rank(5)] = 8, where rank(5) counts the marked (even-position) entries among the first 5 entries of SA_0.
–For an entry holding an odd text position, use the neighbor function Φ_0. Example: SA_0[2] = SA_0[Φ_0(2)] − 1 = SA_0[5] − 1 = 7.
The neighbor function Φ_0 tells us the position in the suffix array of the next suffix in text order. Perform these steps recursively to obtain a compressed suffix array.
–Encoding increasing subsequences of Φ_0 together achieves zeroth-order entropy.
–Subdividing those subsequences by context and encoding each separately achieves high-order entropy.
It turns out that the neighbor function Φ is the primary bottleneck for space.
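The defining property of the neighbor function is SA[Φ(i)] = SA[i] + 1 (wrapping around at the end of the text), so repeatedly following Φ walks the suffix array in text order. A toy sketch on a naively built suffix array (names are mine):

```python
def suffix_array(T):
    """Naive O(n^2 log n) suffix array, fine for a toy example."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def phi(SA):
    """Neighbor function: SA[phi(i)] = SA[i] + 1 (mod n)."""
    inv = {pos: idx for idx, pos in enumerate(SA)}   # inverse suffix array
    n = len(SA)
    return [inv[(SA[i] + 1) % n] for i in range(n)]

T = "mississippi$"
SA = suffix_array(T)
P = phi(SA)

# Walking i -> P[i], starting from the full suffix, visits suffixes in text order.
i = SA.index(0)
order = []
for _ in range(len(T)):
    order.append(SA[i])
    i = P[i]
print(order)  # text positions 0 through 11, in order
```

The compressed suffix array exploits the fact that Φ, unlike SA itself, decomposes into a few increasing subsequences that compress well.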

Burrows-Wheeler Transform (BWT) and the Neighbor Function Φ The Φ function has a strong relationship to the Burrows-Wheeler Transform (BWT). The BWT has had a profound impact on a myriad of fields. –Simply put, it pre-processes an input text T by a reversible transform. –The result is easily compressible using simple methods. The BWT (and the Φ function) are at the heart of many text compression and indexing techniques, such as bzip2. We also call the Φ function the FL mapping of the BWT.
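As a concrete illustration, here is the textbook quadratic-time version of the transform and its inverse (real tools like bzip2 use far more efficient constructions; the `$` end-marker is the usual convention):

```python
def bwt(T):
    """Last column of the sorted cyclic shifts of T."""
    n = len(T)
    shifts = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in shifts)

def inverse_bwt(L):
    """Invert by repeatedly prepending L to the partial rotations and sorting."""
    table = [""] * len(L)
    for _ in range(len(L)):
        table = sorted(L[i] + table[i] for i in range(len(L)))
    return next(row for row in table if row.endswith("$"))

T = "mississippi$"
B = bwt(T)
print(B)               # ipssm$pissii
print(inverse_bwt(B))  # mississippi$
```

Notice how the transform groups equal symbols into runs (`ss`, `ii`), which is exactly what makes the result easy to compress with simple methods.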

Burrows-Wheeler Transform (BWT)

A Shifty Little BWT (Figure omitted: the sorted cyclic shifts of the text, with entries grouped into per-context lists such as list_i and list_s.)

Where Oh Where Is My Wavelet Tree? For each list from the previous slide, we store a wavelet tree to achieve zeroth-order entropy –The collection of zeroth-order compressors gives high-order entropy based on the context (not shown in this talk). Technical point: the number of alphabet symbols cannot be more than the text length –We “rerank” symbols to meet this requirement (negligible extra cost in space, O(1) time)
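Reranking itself is a tiny transformation: map the symbols that actually occur in a list to a dense alphabet 0..σ−1 and keep a decode table. A hypothetical sketch (names are mine):

```python
def rerank(text):
    """Map occurring symbols to dense ranks 0..sigma-1; return decode table."""
    symbols = sorted(set(text))
    to_rank = {c: r for r, c in enumerate(symbols)}
    return [to_rank[c] for c in text], symbols

reranked, decode = rerank("prprppprs")
print(reranked)                               # [0, 1, 0, 1, 0, 0, 0, 1, 2]
print("".join(decode[r] for r in reranked))   # prprppprs
```

After reranking, the effective alphabet of each list is never larger than the list itself, so the wavelet tree bounds apply as stated.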

Any questions?