Recent Results in Combined Coding for Word-Based PPM Radu Rădescu George Liculescu Polytechnic University of Bucharest Faculty of Electronics, Telecommunications and Information Technology Applied Electronics and Information Engineering Department ACCT2008

1. Introduction

- Internal words are considered present at the decoding phase because they are generated internally, so the decoder can reproduce them on its own.
- External words that may be present at decoding are inserted externally, at both the coding and the decoding stages.
- An optimization of the data tree is considered, so that it can be used for word-based coding (words being strings of octets).
- In order to minimize the search time, an optimized structure must be used: the red-black tree.
- The red-black tree is a binary tree that stores in every node one extra piece of information – the color of the node – which can be red or black.
- The red-black tree guarantees that no root-to-leaf path is more than twice as long as any other, which keeps the tree approximately balanced.
- The method used here is dedicated to the PPM algorithm only, and it performs an adaptive search for words.
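As a rough illustration of this data-structure choice (a sketch, not the authors' implementation; all names are hypothetical), the snippet below keeps the word alphabet in a std::map, which mainstream C++ standard libraries implement as a red-black tree, so adding and searching a word (a string of octets) both stay logarithmic in the alphabet size:

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Hypothetical per-word bookkeeping for the PPM word alphabet.
struct WordInfo {
    std::uint64_t realCount = 0;    // appearances actually seen in the input
    bool presentAtDecoding = false; // can the decoder reproduce this word?
};

int main() {
    // std::map is ordered and typically backed by a red-black tree,
    // so insertion and lookup are both O(log n) in the alphabet size n.
    std::map<std::string, WordInfo> alphabet;

    // Adding a word discovered adaptively during coding.
    WordInfo& w = alphabet["compression"];
    w.realCount += 1;
    w.presentAtDecoding = true; // internally generated, so reproducible

    // Searching for a word before deciding how to encode it.
    auto it = alphabet.find("compression");
    if (it != alphabet.end())
        std::cout << it->first << " seen " << it->second.realCount << " time(s)\n";
    return 0;
}
```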

2. PPM encoding with the extended alphabet

- Encoding with the extended alphabet is similar to encoding with the basic alphabet (8-bit symbols). In order to determine which word is coded next, we must check all the words starting from the current position in the coding stage.
- The search must be made for every word that has a chosen potential, which adds a very large amount of extra work. To compute the gain, we use the real number of appearances (imposed internally) for internal words, and the maximum between the real and the false number of appearances (imposed externally) for external words.
- In order to reduce the time spent adding to and searching within the tree, all the words stored in other structures are references to words in the alphabet. All comparisons between words can therefore be made on references, without comparing their elements.
- The only byte-level comparison between words occurs when a word is searched for in the alphabet.
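A hedged sketch of the two ideas above (the structure and field names are assumptions, not taken from the paper): the count entering the gain is the real count for internal words and the maximum of the real and false counts for external words, and, since every structure outside the alphabet holds a pointer to the single alphabet copy of a word, equality tests reduce to pointer comparisons:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

// Hypothetical word record; the field names are illustrative only.
struct Word {
    std::uint64_t realCount = 0;  // appearances counted during coding
    std::uint64_t falseCount = 0; // count imposed from outside (external words)
    bool external = false;        // inserted externally at coder and decoder
};

// Count entering the gain computation for this word.
std::uint64_t effectiveCount(const Word& w) {
    return w.external ? std::max(w.realCount, w.falseCount) : w.realCount;
}

// All structures reference the alphabet's single copy of each word,
// so word equality outside the alphabet is just a pointer comparison;
// bytes are compared only when a word is first looked up in the alphabet.
bool sameWord(const Word* a, const Word* b) { return a == b; }

int main() {
    Word ext{3, 10, true}; // external: effective count is max(3, 10) = 10
    Word in{7, 0, false};  // internal: effective count is the real count, 7
    std::cout << effectiveCount(ext) << " " << effectiveCount(in) << "\n";
    std::cout << std::boolalpha << sameWord(&ext, &ext) << "\n";
    return 0;
}
```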

3. Combining external words & adaptively generated words

- We consider only the words marked as present at decoding. This is why every word that was external and absent at decoding must be marked as external and present at decoding only after it has been encoded character by character, followed by a special word and a counter.
- The disadvantage of combining external words with internal words is that the internal ones have priority, replacing the external ones absent at the decoding step. External words are the result of other algorithms or of the user's experience, and they often carry useful information that may improve the encoding. When an internal word replaces an external one, this useful information is lost.
- The advantage of the combination is that a word absent at decoding can be replaced with an adaptively generated one that was seen many times before the end of its survival period. The final result is a gain.
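One way the marking rule above could look in code (a sketch under assumed interfaces; the emit* functions stand in for the actual PPM arithmetic-coding steps and are hypothetical): a word not yet present at decoding is spelled out character by character, followed by the special word and its counter, and only then flagged as present, so later occurrences can be sent as a single symbol:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical encoder interface; bodies elided for brevity.
struct Encoder {
    void emitSymbol(unsigned char /*c*/) { /* code one 8-bit symbol */ }
    void emitSpecialWord() { /* code the reserved marker word */ }
    void emitCounter(std::uint32_t /*n*/) { /* code the word's counter */ }
    void emitWordIndex(std::size_t /*i*/) { /* code a word the decoder knows */ }
};

struct Word {
    std::string bytes;              // the word's octets
    std::uint32_t counter = 0;      // counter sent along with the word
    bool presentAtDecoding = false; // may this word be sent as one symbol?
    std::size_t index = 0;          // position in the alphabet
};

// Encode one word occurrence following the rule from the slide.
void encodeWord(Encoder& enc, Word& w) {
    if (w.presentAtDecoding) {
        enc.emitWordIndex(w.index); // decoder can already reproduce it
        return;
    }
    for (unsigned char c : w.bytes) // spell the word out once...
        enc.emitSymbol(c);
    enc.emitSpecialWord();          // ...then the special word...
    enc.emitCounter(w.counter);     // ...and its counter
    w.presentAtDecoding = true;     // only now is it marked present
}

int main() {
    Encoder enc;
    Word w{"bach", 4, false, 0};
    encodeWord(enc, w); // first time: characters + special word + counter
    encodeWord(enc, w); // afterwards: a single word symbol
    return 0;
}
```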

4. Files used for experimental results: case 1

- The aaa file contains a single repeated character: a.
- The limit_comp.xmcd file is in an XML format whose tags contain elements of the same type.
- The ByteEditor.exe file is an executable file, extended and not compressed.
- The concertBach file did not contain words matching the rules imposed by the search performed in a stage other than coding, so no experiment using that type of search was run on it.

4. Experimental results: case 1

4. Experimental results: case 2 (Calgary Corpus)

4. Experimental results: remarks

- The best compression of the aaa, limit_comp.xmcd, and progc.cs files is obtained by combining the adaptive search with the search performed in a separate stage, because coding a missing word means coding all of its characters, whereas coding a present word avoids this cost.
- The encoding time usually increases compared with PPM over the regular alphabet, because words represented by strings of bytes (not only single bytes) must be checked.
- The concertBach and ByteEditor.exe files are coded better using the adaptive search, because the restrictions imposed on the search performed in a stage separate from coding are too strong. These files contain short words that appear repeatedly, and the adaptive search manages to find some of them because its minimum word length is 5, while for the search before coding the minimum length is 20.

5. Parameters for adaptive search without restrictions

6. Remarks for adaptive search without restrictions

- For the proposed files, the adaptive search combined with adding words without restrictions gives better compression results but worse running times, from the gain and minimum-length point of view.
- The time increases because there are many words in the alphabet and searching through them takes long.
- The compression ratio is better because the words are discovered and used early.
- Although the alphabet contains many words and the probability of a word at the −1 PPM prediction level (the reciprocal of the alphabet size) is small, the encoding is not strongly affected, because −1 level situations are rare.
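As a worked illustration of why a large alphabet hurts the −1 level so little (the numbers are hypothetical, not taken from the experiments), the code length of an order −1 event grows only logarithmically with the alphabet size $|A|$:

$$
P_{-1}(w) = \frac{1}{|A|}, \qquad \ell_{-1}(w) = -\log_2 \frac{1}{|A|} = \log_2 |A| \ \text{bits},
$$

so enlarging the alphabet from $|A| = 2^{16}$ to $2^{17}$ words adds only one bit to each (rare) −1 level coding event.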

7. Conclusions

- The most efficient approach is the search before encoding, together with the use of the maximum-length word at encoding.
- The adaptive search can be applied to files with many repeated words and has the advantage of being performed during the coding phase itself.
- Combining the two search procedures is useful only for files in which repeated words occur close together while other words are separated by long distances, so that the latter are not included in the alphabet by the adaptive search.
- The gain function depends on the file type for most of the files.
- The difference between extended-alphabet PPM encoding and WinRAR compression is about 1.5 bits/character.
- The encoding with adaptive search without restrictions is the most efficient, and most files are compressed better with extended-alphabet PPM algorithms.

Thank you!