Compression of a Dictionary. Jan Lánský, Michal Žemlička. Dept. of Software Engineering, Faculty of Mathematics and Physics, Charles University.

Synopsis: Introduction, Existing methods, Trie-based methods, Results, Conclusion.

Introduction. Why are we compressing a dictionary?

Large Alphabet Compression. Text files: compression over an alphabet of words or syllables. The alphabet (dictionary) must be transferred with the coded message. Word-based methods: Moffat 1989. Syllable-based methods: Lánský, Žemlička 2005.

Influence of File Size. Large files: the dictionary takes a small part of the message, so compressing the dictionary has little influence on the compression ratio. Small files: the dictionary takes a large part of the message, so compressing the dictionary has a large influence on the compression ratio.

Existing Methods. Commonly used methods for compressing a dictionary of words or syllables.

Character by Character (CD). The code of a string is composed of: the code of the string type (Moffat: 2 types of words, word and non-word; Lánský, Žemlička: 5 types of syllables), the encoded length of the string, and the codes of its symbols.

Character by Character (CD) - Examples. code("to") = codeType(lower), codeLength(2), codeLower('t'), codeLower('o'). code("153") = codeType(numeric), codeLength(3), codeDigit('1'), codeDigit('5'), codeDigit('3').
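
As a rough illustration of the CD decomposition, the Python sketch below splits a string into the same conceptual parts: a type code, a length code, and one code per symbol. The type detection and the symbolic part names are simplifications for illustration, not the exact codes used in the paper.

# A minimal sketch of the CD idea: type code, length code, one code per symbol.
def encode_cd(s):
    """Return the conceptual parts of the CD code of one dictionary string."""
    if s.islower():
        string_type = "lower"
    elif s.isupper():
        string_type = "upper"
    elif s.isdigit():
        string_type = "numeric"
    else:
        string_type = "other"                  # "mixed" omitted for brevity
    parts = [("type", string_type), ("length", len(s))]
    parts += [("symbol", c) for c in s]        # one code per symbol of the string
    return parts

print(encode_cd("to"))    # [('type', 'lower'), ('length', 2), ('symbol', 't'), ('symbol', 'o')]
print(encode_cd("153"))   # [('type', 'numeric'), ('length', 3), ('symbol', '1'), ...]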

External Compression. All strings from the dictionary are concatenated using a separator. The resulting string is compressed by LZW (denoted LZWD), bzip2 (denoted bzipD), ...
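
A minimal sketch of this external approach, with Python's bz2 module standing in for bzip2 (bzipD); the newline separator is an assumption.

# Concatenate all dictionary strings with a separator and compress the result.
import bz2

def compress_dictionary(strings, separator="\n"):
    data = separator.join(strings).encode("utf-8")
    return bz2.compress(data)

def decompress_dictionary(blob, separator="\n"):
    return bz2.decompress(blob).decode("utf-8").split(separator)

words = ["the", "to", "ACM", "AC"]
blob = compress_dictionary(words)
assert decompress_dictionary(blob) == words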

Trie-Based Methods: TD1, TD2, TD3. Compression of a dictionary using its trie structure.

Dictionary. Data structure: trie. Nodes may represent strings; a father represents a prefix of its sons. The mapping between strings and their order is unique in the whole dictionary; the order is obtained during compression.

Trie Data Structure. For each node we know: whether the node represents a string (represents), the number of sons (count), the array of sons (son), and the extension symbol of each son (extension).
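
A minimal sketch of such a trie node in Python. The field names mirror the slide (represents, sons with their extension symbols); storing the sons as a sorted list of (extension, child) pairs is an implementation assumption, and the slide's count is simply the length of that list.

class TrieNode:
    def __init__(self):
        self.represents = False      # does this node stand for a dictionary string?
        self.sons = []               # list of (extension symbol, child TrieNode), kept sorted

def insert(root, string):
    node = root
    for ch in string:
        for ext, child in node.sons:
            if ext == ch:
                node = child
                break
        else:                        # no son with this extension yet
            child = TrieNode()
            node.sons.append((ch, child))
            node.sons.sort(key=lambda p: p[0])
            node = child
    node.represents = True

root = TrieNode()
for w in ["the", "to", "ACM", "AC", ".\n"]:
    insert(root, w)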

TD1 - Encoding
EncodeTD1(node):
  EncodeGamma(node.count)              - number of sons
  EncodeBit(node.represents)           - bit 0 or 1
  for each son s:
    distance = s.extension - previous(s).extension   (previous extension is 0 for the first son)
    EncodeDelta(distance)
    EncodeTD1(s)

TD1 - Example. Dictionary: "the", "to", "ACM", "AC", ".\n", ... Code of node 'C': Code(1) - count, Bit(1) - represents, Code(67-0) - distance, then code of node 'M', ...
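
Putting the pseudocode and the example together, here is a small runnable Python sketch of TD1. Elias gamma and delta are the standard codes; the +1 offset on the son count (so that leaves remain encodable) and the use of raw ASCII values as extension symbols are assumptions, not details fixed by the slides.

from dataclasses import dataclass, field

@dataclass
class Node:
    represents: bool = False
    sons: list = field(default_factory=list)   # (extension char, child Node), sorted by char

def elias_gamma(n):                             # n >= 1
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def elias_delta(n):                             # n >= 1
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]

def encode_td1(node):
    bits = elias_gamma(len(node.sons) + 1)      # number of sons (+1 offset is an assumption)
    bits += "1" if node.represents else "0"     # represents-a-string bit
    prev = 0
    for ext, child in node.sons:
        bits += elias_delta(ord(ext) - prev)    # distance to the previous son, e.g. 67 - 0 for 'C'
        bits += encode_td1(child)
        prev = ord(ext)
    return bits

# Tiny trie holding "AC" and "ACM": root -> 'A' -> 'C' (string "AC") -> 'M' (string "ACM")
m = Node(represents=True)
c = Node(represents=True, sons=[("M", m)])
a = Node(sons=[("C", c)])
root = Node(sons=[("A", a)])
print(encode_td1(root))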

TD2 - Improvement. In TD1 the distances between sons are calculated from the binary values of the extending symbols and are encoded by Elias delta coding, which represents smaller numbers by shorter codes and larger numbers by longer codes. Goal: decrease the distances.

TD2 - Improvement. Reordering of the alphabet: primarily by symbol type, secondarily by symbol frequency (0-27 lower-case letters, 28-53 upper-case letters, 54-63 digits, from 64 other symbols). In TD2 the distances between sons are computed in this new alphabet, which gives shorter distances and shorter codes.

TD2 - Example. Dictionary: "the", "to", "ACM", "AC", ".\n", ... Code of node 'C': Code(1) - count, Bit(1) - represents, Code(34-0) - distance, then code of node 'M', ...
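
A Python sketch of the TD2 alphabet reordering: the primary key is the symbol type, the secondary key is descending frequency, so frequent symbols get small ordinals and the son-to-son distances shrink. The sample string used to estimate frequencies here is purely illustrative; the resulting ordinals will not match the paper's exact ordering (e.g. 34 for 'C').

from collections import Counter

def reorder_alphabet(sample_text):
    freq = Counter(sample_text)
    def type_rank(ch):
        if ch.islower():
            return 0
        if ch.isupper():
            return 1
        if ch.isdigit():
            return 2
        return 3
    ordered = sorted(set(sample_text), key=lambda ch: (type_rank(ch), -freq[ch], ch))
    return {ch: i for i, ch in enumerate(ordered)}   # symbol -> new ordinal

order = reorder_alphabet('the to ACM AC 153 .\n')
# TD2 computes son distances from order[ch] instead of the raw symbol value.
print(order['t'], order['A'], order['1'])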

TD3 - Improvement. 5 types of words and syllables: Lower ("hour"), Upper ("HOUR"), Mixed ("Hour"), Numeric ("123"), Other ("???"). After coding 1-2 symbols of a string we can determine its type and improve its coding: 2 symbols for Mixed/Upper, 1 symbol otherwise.

TD3 - Improvement. Function first: first(lower-case letter) = 0, first(upper-case letter) = 28, first(digit) = 54, first(other) = 64. In TD3, if we know the type of the string, we decrease the distance of the first son by the value of first for the son's extension.
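
A small sketch of how the first function is applied: once the string type is known, the extension of the first son lies in one block of the reordered alphabet, so its distance can start from first(type) instead of 0. The ordinal 56 in the usage line is a made-up example.

FIRST = {"lower": 0, "upper": 28, "numeric": 54, "other": 64}

def first_son_distance(extension_ordinal, string_type):
    """Distance coded for the first son once the string type is known."""
    return extension_ordinal - FIRST[string_type]

# A digit at ordinal 56 in a string already known to be Numeric:
print(first_son_distance(56, "numeric"))   # 2 instead of 56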

TD3 - Example. Dictionary: "the", "to", "ACM", "AC", ".\n", ...

Results. Comparison of TD1, TD2, TD3, CD, LZWD and BzipD on dictionaries of words and syllables in Czech, English, and German.

Results - syllables

TD3 outperforms the other methods for all languages and file sizes. Syllables are short, so the trie of syllables is dense. Example: a 10 kB Czech file needs 770 bytes of dictionary with TD3, fewer than with CD (the second-best method).

Results - words

Czech: TD3 is best on files of 50 kB and larger (long words, dense trie of words). English: TD3 is best on files of 200 kB and larger (short words, quite dense trie of words). German: TD3 is best on files of 2 MB and larger (long words, quite sparse trie of words).

Results - words. How successful are the methods? Smaller files: 1. CD, 2.-3. TD3 and BzipD, 4. LZWD. Middle-sized files: 1. BzipD, 2. TD3, 3. CD, 4. LZWD. Larger files: 1. TD3, 2. BzipD, 3. CD, 4. LZWD.

Conclusion. On what types of dictionaries is TD3 good?

Conclusion. Where is TD3 successful? On dense tries with short strings: dictionaries of syllables and larger dictionaries of words. TD3 is not bad on other types of dictionaries either; it is usually at least the second-best method.