Succinct Data Structures


Kunihiko Sadakane, National Institute of Informatics

Compressing Arrays
- Input: an array (string) S[0..n-1] of length n, with S[i] ∈ A and alphabet size σ = |A|
- Query: return S[i] for a given position i
- Query: return a substring S[i..i+w-1] of S at a given position (w = O(log_σ n) characters, i.e. O(log n) bits)
- O(1) query time on the word RAM with word length O(log n) bits
- Index size: n H_k(S) + o(n log σ) bits
- If the compressed suffix array is used to represent the string, the query time is not O(1)
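For concreteness, here is a minimal Python sketch of the query interface this slide describes. The class and method names (CompressedString, access, substring) are illustrative, not from the slides, and the sketch stores S uncompressed; only the interface and the query bounds are the point.

    class CompressedString:
        """Random access to a string; a succinct version would use
        n*H_k(S) + o(n*log sigma) bits instead of storing s verbatim."""

        def __init__(self, s, sigma):
            self.s = s            # the string S[0..n-1]
            self.sigma = sigma    # alphabet size
            self.n = len(s)

        def access(self, i):
            """Return S[i]."""
            return self.s[i]

        def substring(self, i, w):
            """Return S[i..i+w-1]; for w = O(log_sigma n) the answer fits
            in O(log n) bits, so a word-RAM structure can return it in
            O(1) time."""
            return self.s[i:i + w]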

Theorem
- Size: asymptotically the same as LZ78 [3]
- Any consecutive log_σ n characters (log n bits) of S at a given position i can be obtained in O(1) time
- This access time is the same as on the uncompressed string, so the data can be regarded as uncompressed

LZ78 (LZW) Compression [3]
- Divide the string into phrases using a dictionary
- Each phrase is encoded as a number (a dictionary index)
- The dictionary is updated after each phrase
- The compression ratio converges to the entropy as the string grows

(Slide figure: a worked example parsing the input "aaabaabaabab", showing the dictionary entries built during parsing and the resulting output; see the sketch below.)
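A minimal LZ78 parser sketch in Python, illustrating the phrase division above (the function name and the pair-based output format are illustrative; LZW differs in that its dictionary is pre-seeded with the single characters and only indices are output):

    def lz78_compress(s):
        """Parse s into LZ78 phrases: each phrase extends the longest
        previously seen phrase by one character and is emitted as a
        (dictionary index, character) pair; index 0 is the empty phrase."""
        dictionary = {"": 0}      # phrase -> index
        output = []
        phrase = ""
        for c in s:
            if phrase + c in dictionary:
                phrase += c       # keep extending the current match
            else:
                output.append((dictionary[phrase], c))
                dictionary[phrase + c] = len(dictionary)
                phrase = ""
        if phrase:                # flush a final phrase that is a repeat
            output.append((dictionary[phrase[:-1]], phrase[-1]))
        return output

    # The input string from the slide's example:
    print(lz78_compress("aaabaabaabab"))
    # -> [(0, 'a'), (1, 'a'), (0, 'b'), (2, 'b'), (4, 'a'), (0, 'b')]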

Compression Ratio of LZ78
- Let c be the number of phrases when a string S of length n is parsed; c = O(n / log_σ n)
- Compressed size: c(log c + log σ) bits
- If S is generated from a stationary ergodic source with entropy H, then (1/n) c log c → H as n → ∞
- For the order-k empirical entropy H_k (σ: alphabet size) [4]: c log c ≤ n H_k(S) + O(n k log σ / log_σ n)

Difficulty in Partial Decoding
- To achieve the order-k entropy, the code for each character must be determined by the preceding k characters
- Hence, to decode a substring, its preceding substring is also needed
- However, ...

The O(n k log σ / log_σ n) term in the LZ78 compressed size indicates that there are O(k log σ) bits of redundancy for each word (log n bits). That is, even if k characters are stored without compression for every word, the redundancy does not grow asymptotically. Moreover, the information necessary to decode one word is only its preceding k characters; it is not necessary to decode any other part. Note that k log σ < log n must hold.
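A back-of-the-envelope check of this accounting (a sketch with constants suppressed; "word" means a machine word of log n bits):

    \underbrace{\tfrac{n \log \sigma}{\log n}}_{\text{number of words}}
      \cdot O(k \log \sigma)
      = O\!\Bigl(\tfrac{n k \log \sigma}{\log_\sigma n}\Bigr)
      = o(n \log \sigma)
      \quad \text{whenever } k \log \sigma = o(\log n).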

Simple Data Structure 1 (S1)
- Divide S into blocks of w = ½ log_σ n characters
- Encode the characters by a Huffman code: at most n(1 + H_0(S)) bits
- Store pointers to the blocks
- The characters in a block are decoded in O(1) time by table lookup
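A toy Python sketch of S1 (Huffman-coded blocks plus per-block pointers). In a real implementation each block would be decoded by a single lookup in a precomputed table indexed by its O(log n)-bit code; here decoding walks the Huffman code bit by bit for brevity.

    import heapq
    from collections import Counter

    def huffman_code(s):
        """Build a Huffman code (char -> bit string) for the characters of s."""
        heap = [(f, i, {c: ""}) for i, (c, f) in enumerate(Counter(s).items())]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {c: "0" + b for c, b in c1.items()}
            merged.update({c: "1" + b for c, b in c2.items()})
            heapq.heappush(heap, (f1 + f2, n, merged))
            n += 1
        return {c: b or "0" for c, b in heap[0][2].items()}  # 1-symbol case

    class S1:
        def __init__(self, s, w):
            self.code = huffman_code(s)
            blocks = [s[i:i + w] for i in range(0, len(s), w)]
            enc = ["".join(self.code[c] for c in b) for b in blocks]
            self.bits = "".join(enc)          # at most n(1 + H_0(S)) bits
            self.ptr = [0]                    # pointer to each block's code
            for e in enc:
                self.ptr.append(self.ptr[-1] + len(e))
            self.decode = {b: c for c, b in self.code.items()}

        def block(self, j):
            """Decode block j (S1 proper does this with one table lookup)."""
            out, cur = [], ""
            for bit in self.bits[self.ptr[j]:self.ptr[j + 1]]:
                cur += bit
                if cur in self.decode:
                    out.append(self.decode[cur])
                    cur = ""
            return "".join(out)

    s = "aaabaabaabab"
    s1 = S1(s, w=4)
    assert "".join(s1.block(j) for j in range(3)) == s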

Simple Data Structure 2 (S2) [5]
- Divide S into blocks of w = ½ log_σ n characters
- For each block, the first k characters are stored as they are (uncompressed)
- The remaining w - k characters are encoded using arithmetic codes defined by their context of length k
- Over all blocks, the space is at most n H_k(S) + O(n k log σ / log_σ n) bits (the raw characters cost k log σ bits per block), plus the arithmetic-coding redundancy below

- The redundancy of the arithmetic codes is at most 2 bits per block, i.e. 2n/w = O(n log σ / log n) bits in total
- Store pointers to the blocks
- A lookup table decodes the arithmetic codes in O(1) time
- In total: n H_k(S) + o(n log σ) bits (if k = o(log_σ n))

Simple Data Structure 3 (S3) [6]
- Divide S into blocks of w = ½ log_σ n characters
- Regard each block as a single character, represented by an integer from 1 to σ^w, and assign it a code:
  - count the frequency of each distinct block value
  - assign the codewords 0, 1, 00, 01, 10, 11, 000, 001, ... in decreasing order of frequency (see the sketch below)
- Store pointers to the blocks
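A short Python sketch of the S3 codeword assignment (frequency-ranked binary strings only; the pointer and decoding-table machinery from the next slide is omitted, and the helper names are illustrative):

    from collections import Counter
    from itertools import count, product

    def s3_codes(s, w):
        """Assign the codewords 0, 1, 00, 01, 10, 11, ... to the distinct
        blocks of w characters, most frequent block first."""
        blocks = [s[i:i + w] for i in range(0, len(s), w)]

        def binary_strings():  # 0, 1, 00, 01, 10, 11, 000, ...
            for length in count(1):
                for bits in product("01", repeat=length):
                    yield "".join(bits)

        ranked = [b for b, _ in Counter(blocks).most_common()]
        return dict(zip(ranked, binary_strings()))

    # The most frequent block pattern receives the shortest code:
    print(s3_codes("aaabaabaabab", w=2))
    # -> {'ab': '0', 'aa': '1', 'ba': '00'}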

Size Analysis
- Pointers to blocks: O(n log σ · log log n / log n) bits
- A table to decode the substring of a block from its code
- Lemma: the sum of the code lengths over all blocks is at most the size of the S2 encoding, n H_k(S) + O(n k log σ / log_σ n) bits

Proof: compare the codes for one block.
- In S2, the first k characters are stored without compression and the remaining ones are encoded by arithmetic codes
- In S3, all w characters of the block are encoded by one code
- Since S3 assigns shorter codes to more frequent block patterns, its total code length is no longer than that of S2
Note: the size of S3 does not depend on k ⇒ the bound holds simultaneously for every k = o(log_σ n)

References
[1] Jesper Jansson, Kunihiko Sadakane, Wing-Kin Sung. Compressed Random Access Memory. arXiv:1011.1708v1.
[2] Kunihiko Sadakane, Gonzalo Navarro. Fully-Functional Succinct Trees. Proc. SODA 2010, pages 134-149.
[3] Jacob Ziv, Abraham Lempel. Compression of Individual Sequences via Variable-Rate Coding. IEEE Transactions on Information Theory, 24(5):530-536, September 1978.
[4] S. Rao Kosaraju, Giovanni Manzini. Compression of Low Entropy Strings with Lempel-Ziv Algorithms. SIAM Journal on Computing, 29(3):893-911, 1999.
[5] Rodrigo González, Gonzalo Navarro. Statistical Encoding of Succinct Data Structures. Proc. CPM 2006, pages 295-306, LNCS 4009.
[6] Paolo Ferragina, Rossano Venturini. A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theoretical Computer Science, 372(1):115-121, 2007.