Presentation transcript:

- Divide the encoded file into blocks of size b
- Use an auxiliary bit vector to indicate the beginning of each block
- Time: O(b)
- Time vs. memory storage tradeoff

 Grossi, Gupta and Vitter –

- Grossi and Ottaviano: wavelet trees based on Patricia tries
- Brisaboa, Ladra, Navarro (IPM 2013): wavelet tree for byte codes
- Kulekci (DCC 2014): Elias and Rice codes
- P. Prochazka, J. Holub (DCC 2014): compression for similar biological sequences

- Fibonacci Codes
- Rank and Select
- Random Access using auxiliary index
- Random Access using Wavelet trees
- Improved Wavelet trees for Random Access
- Experimental Results

… Basis elements of a numeration system

Basis elements: the Fibonacci numbers 1, 2, 3, 5, 8, 13, …
No adjacent 1's in a representation

EExample: 19 = PProblem: Not instantaneous Solution: Reverse the codeword EExample: 19 = {{11, 011, 0011, 1011, 00011, 10011, 01011, , , , , , , …}

SSet of strings ending in 11 with no other adjacent 1’s {{11, 011, 0011, 1011, 00011, 10011, 01011, , , , , , , …}

Given a bit vector B of length n:
- rank_1(B, i) (resp. rank_0(B, i)): the number of 1s (resp. 0s) up to and including position i in B
- select_1(B, i) (resp. select_0(B, i)): the index of the i-th 1 (resp. 0) in B

- rank_1(B, i) = i - rank_0(B, i)
  › compute only rank_1(B, i)
- Naive solution: store the rank answers
- Example:
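The naive solution can be sketched in Python as follows, assuming 1-based positions as in the definitions above. The names `build_prefix`, `rank1`, `rank0` and `select1` are illustrative. Answering select by binary search over the stored rank answers is a convenience of this sketch, not something the slides specify.

```python
from bisect import bisect_left
from itertools import accumulate


def build_prefix(bits):
    """prefix[i] = number of 1s in positions 1..i (prefix[0] = 0)."""
    return [0] + list(accumulate(bits))


def rank1(prefix, i):
    return prefix[i]


def rank0(prefix, i):
    # Every position up to i holds either a 0 or a 1.
    return i - prefix[i]


def select1(prefix, j):
    """Position of the j-th 1: the first i whose stored rank reaches j."""
    i = bisect_left(prefix, j)
    if j < 1 or i >= len(prefix):
        raise ValueError("fewer than j ones in the bit vector")
    return i


B = [1, 0, 1, 1, 0, 0, 1]
prefix = build_prefix(B)
print(rank1(prefix, 4), rank0(prefix, 4), select1(prefix, 3))
```

Storing one counter per position costs far more space than B itself, which is the motivation for sampling the answers.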

- Store rank answers every lg^2 n bits of B
  › Use lg n bits for each answer
- Divide each chunk into sub-chunks of (lg n)/2 bits
- Store rank answers relative to the last sample every (lg n)/2 bits
  › Use 2 lg lg n bits per sub-sample
- Bottom level: use a simple lookup table
Space Complexity -
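The sampling scheme above might look like this in Python. This is a sketch under stated assumptions: positions are 1-based, the chunk sizes are rounded so that a chunk is a whole number of sub-chunks, and the bottom level counts the few remaining bits directly instead of consulting a precomputed lookup table. The class name `TwoLevelRank` is illustrative.

```python
import math


class TwoLevelRank:
    def __init__(self, bits):
        lg = max(2, math.ceil(math.log2(max(len(bits), 4))))
        self.sub = max(1, lg // 2)       # sub-chunk length, about (lg n)/2
        self.sup = self.sub * 2 * lg     # chunk length, about lg^2 n
        self.bits = bits
        self.super_rank = []             # absolute rank before each chunk
        self.sub_rank = []               # rank since chunk start, per sub-chunk
        total = rel = 0
        # The "+ 1" adds a final sample when len(bits) is a multiple of sub.
        for start in range(0, len(bits) + 1, self.sub):
            if start % self.sup == 0:
                self.super_rank.append(total)
                rel = 0
            self.sub_rank.append(rel)
            ones = sum(bits[start:start + self.sub])
            total += ones
            rel += ones

    def rank1(self, i):
        """Number of 1s in positions 1..i (inclusive); rank1(0) = 0."""
        s = i // self.sub
        # Stand-in for the lookup table: count inside one sub-chunk.
        tail = sum(self.bits[s * self.sub:i])
        return self.super_rank[i // self.sup] + self.sub_rank[s] + tail

    def rank0(self, i):
        return i - self.rank1(i)


B = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
r = TwoLevelRank(B)
print(r.rank1(7), r.rank0(7))
```

Each query adds one chunk sample, one sub-chunk sample and a count over fewer than `sub` bits, so its cost does not grow with the distance from the start of B.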

[Figure: lookup table listing the possible blocks (…, 1111…0, 1111…1) and the output stored for each.]

1. E(T) ← compress T
2. Generate B of size |E(T)| so that B[i] = 1 iff E(T)[i] is the first bit of a codeword
3. Construct a rank/select data structure for B
Space Complexity
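The three steps above can be sketched in Python as follows. The code table is an assumption made for illustration: it assigns the first seven Fibonacci codewords to the symbols of COMPRESSORS by decreasing frequency. `select1` is a plain linear scan standing in for a real rank/select structure, element indices are 1-based, and bit offsets are 0-based.

```python
# Assumed code table (hypothetical assignment, most frequent symbol first).
CODE = {"S": "11", "O": "011", "R": "0011", "C": "1011",
        "M": "00011", "P": "10011", "E": "01011"}


def encode_with_marks(text, code):
    """Return E(T) and the bit vector B marking the first bit of each codeword."""
    encoded, marks = [], []
    for symbol in text:
        cw = code[symbol]
        encoded.append(cw)
        marks.extend([1] + [0] * (len(cw) - 1))
    return "".join(encoded), marks


def select1(marks, j):
    """0-based offset of the j-th 1 in marks (linear scan, for illustration)."""
    seen = 0
    for pos, bit in enumerate(marks):
        seen += bit
        if bit and seen == j:
            return pos
    raise IndexError("fewer than j codewords")


def access(encoded, marks, code, i):
    """Decode the i-th element of T directly from E(T)."""
    start = select1(marks, i)
    try:
        end = select1(marks, i + 1)
    except IndexError:
        end = len(encoded)          # the last codeword runs to the end
    decode = {cw: symbol for symbol, cw in code.items()}
    return decode[encoded[start:end]]


E, B = encode_with_marks("COMPRESSORS", CODE)
print(access(E, B, CODE, 5))
```

The cost of this approach is the extra |E(T)| bits for B plus the rank/select overhead, which is what the wavelet tree variant tries to avoid.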

- T = COMPRESSORS
- Σ = {C, M, P, E, O, R, S}
- Occ = {1, 1, 1, 1, 2, 2, 3}
- E(T) =

extract(V_root, i) {
    code ← ε
    v ← V_root
    while v is not a leaf
        if B_v[i] = 0
            i ← rank_0(B_v, i)
            v ← left(v)
            code ← code · 0
        else
            i ← rank_1(B_v, i)
            v ← right(v)
            code ← code · 1
    return D(code)
}

select_x(T, i) {
    w ← leaf corresponding to f(x)
    while w ≠ V_root
        v ← father of w
        if w is a left child of v
            i ← index of the i-th 0 in B_v
        else
            i ← index of the i-th 1 in B_v
        w ← v
    return i
}
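A compact Python sketch of extract and select on a wavelet tree whose shape follows the codewords. Several things here are assumptions: the code table (the first seven Fibonacci codewords assigned by decreasing frequency), the naive list-based `rank` and `select_bit` helpers, and the choice to reach the leaf by walking down the codeword instead of keeping parent pointers. Positions are 1-based.

```python
# Assumed code table (hypothetical assignment, most frequent symbol first).
CODE = {"S": "11", "O": "011", "R": "0011", "C": "1011",
        "M": "00011", "P": "10011", "E": "01011"}


class Node:
    def __init__(self, seq, depth=0):
        """seq: codewords of the elements routed to this node, in text order."""
        self.bits = []                  # stays empty for a leaf
        self.child = [None, None]
        if len(seq[0]) == depth:        # codeword exhausted: this is a leaf
            return
        self.bits = [int(cw[depth]) for cw in seq]
        for b in (0, 1):
            sub = [cw for cw in seq if int(cw[depth]) == b]
            if sub:
                self.child[b] = Node(sub, depth + 1)


def rank(bits, b, i):
    """Number of occurrences of bit b in positions 1..i."""
    return bits[:i].count(b)


def select_bit(bits, b, j):
    """Position of the j-th occurrence of bit b."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == b:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences")


def extract(root, i, code):
    """Return the i-th symbol of T by descending from the root."""
    decode = {cw: symbol for symbol, cw in code.items()}
    word, v = "", root
    while v.bits:
        b = v.bits[i - 1]
        i = rank(v.bits, b, i)          # rank in the current node, then descend
        word += str(b)
        v = v.child[b]
    return decode[word]


def select(root, codeword, j):
    """Position in T of the j-th occurrence of the symbol with this codeword."""
    path, v = [], root
    for ch in codeword:                 # locate the leaf, remembering the path
        path.append((v, int(ch)))
        v = v.child[int(ch)]
    for v, b in reversed(path):         # climb back up, mapping j at each level
        j = select_bit(v.bits, b, j)
    return j


T = "COMPRESSORS"
root = Node([CODE[c] for c in T])
print(extract(root, 8, CODE), select(root, CODE["S"], 2))
```

Note that the rank is taken in the current node's bit vector before moving to the child, since the child's vector is indexed by that rank.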

- Single-child nodes carry redundant information
  › Pruning them is similar to the collapsing strategy used for suffix trees

 E(T)=  E(T)=

if suffix of code = 0
    code ← code · 11
if suffix of code ≠ 11
    code ← code · 1
return D(code)
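The completion rule above can be tried out with a small Python sketch. `complete` follows the two tests literally. `pruned_prefix` is a hypothetical helper that models pruning as cutting each codeword at its shortest prefix shared with no other codeword. That model is an assumption of this sketch, and it relies on the alphabet using the shortest Fibonacci codewords first.

```python
CODEWORDS = ["11", "011", "0011", "1011", "00011", "10011", "01011"]


def complete(code):
    """Rebuild the full codeword from the bits read in the pruned tree."""
    if code.endswith("0"):
        return code + "11"      # after a 0 only the terminating 11 is missing
    if not code.endswith("11"):
        return code + "1"       # a single trailing 1 needs its partner
    return code                 # already ends in 11


def pruned_prefix(cw, codewords):
    """Shortest prefix of cw that no other codeword shares (hypothetical model)."""
    for k in range(1, len(cw) + 1):
        prefix = cw[:k]
        if sum(other.startswith(prefix) for other in codewords) == 1:
            return prefix
    return cw


for cw in CODEWORDS:
    p = pruned_prefix(cw, CODEWORDS)
    print(cw, p, complete(p))
```

In this example every codeword is recovered from its shortened prefix, so the bits below the pruned leaf need not be stored.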

- Recursive definition of a FWT of depth h+1
- Assumption: if the tree is of depth h+1, then all the F_h codewords of length h+1 are in the alphabet.

$N_{h+1} = N_h + N_{h-1} + 3$
[Figure: the tree $T_{h+1}$, drawn with $T_h$ and $T_{h-1}$.]

- $N_{h+1} = N_h + 3F_h$
- $N_{h+1} = 3F_{h+2} - 3$
- $P_{h-1} = 2F_{h+2} - 3$
- $P_{h-1} / N_{h+1} = (2F_{h+2} - 3) / (3F_{h+2} - 3) \to 2/3$ as $h \to \infty$
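The node counts above can be checked numerically with a short Python sketch. It assumes that $F_1 = F_2 = 1$ and that $N_{h+1}$ counts all nodes, root included, of the binary trie holding every Fibonacci codeword of length at most h+1. The formula for P is taken from the slide as given. All function names are illustrative.

```python
from itertools import product


def fib(k):
    """Fibonacci numbers with F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a


def codewords(length):
    """All Fibonacci codewords of exactly `length` bits (length >= 2)."""
    out = []
    for body in product("01", repeat=length - 2):
        cw = "".join(body) + "11"
        if "11" not in cw[:-1]:     # the final 11 is the only adjacent pair
            out.append(cw)
    return out


def trie_nodes(depth):
    """Number of nodes of the trie of all codewords of length <= depth."""
    prefixes = {""}                 # the root
    for length in range(2, depth + 1):
        for cw in codewords(length):
            prefixes.update(cw[:k] for k in range(1, length + 1))
    return len(prefixes)


def ratio(h):
    f = fib(h + 2)
    return (2 * f - 3) / (3 * f - 3)


for h in range(1, 8):
    print(h + 1, len(codewords(h + 1)), trie_nodes(h + 1), 3 * fib(h + 2) - 3)
print(ratio(30))
```

Each of the $F_h$ codewords of length h+1 adds exactly three new nodes to the trie, which is where the term $3F_h$ comes from.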

- English (Heaps): distribution of 26 characters and 371 bigrams
- Finnish (Pesonen): 29 letters
- French (Trésor de la Langue Française): 26 letters
- German (Bauer & Goos): 30 letters
- Hebrew and Aramaic (The Responsa Retrieval Project): 30 letters, 735 bigrams
- Italian: 26 letters
- Spanish: 26 letters
- Portuguese: 26 letters

[Table: columns File, n, Height, FWT, Pruned, Huffman; rows English, Finnish, French, German, Hebrew, Italian, Portuguese, Spanish, Russian, English, Hebrew; values not recoverable.]