Source Coding Data Compression A.J. Han Vinck

DATA COMPRESSION: NO LOSS of information and exact reproduction (low compression ratio, about 1:4). General problem statement: “find a means for spending as little time as possible on packing as much data as possible into as little space as possible, with no loss of information”

GENERAL IDEA: represent likely symbols with short binary words, where “likely” is derived from
- prediction of the next symbol in the source output: after “q” the likely continuations ue, ua, ui, uo can be coded as q-00, q-01, q-10, q-11
- context between the source symbols: words, sounds, context in pictures

Why compress?
- Lossless compression often reduces file size by 40% to 80%
- More economical to transport and store
- Most Internet content is compressed for transmission
- Compression before encryption can make code-breaking more difficult
- Conserves battery power and storage space on mobile devices
- Compression and decompression can be hardwired

Some history
1948 – Shannon-Fano coding
1952 – Huffman coding
  - reduces redundancy in symbol coding
  - demonstrably optimal among symbol-by-symbol (prefix) codes
1977 – Lempel-Ziv coding
  - first major “dictionary method”
  - maps repeated word patterns to code words

MODEL KNOWLEDGE
- best performance: exact prediction!
- exact prediction: no new information!
- no new information: no message to transmit

Example, no prediction: each source symbol is mapped directly to a codeword; representation length = 3 bits per source symbol. [Slide shows a table of source symbols and their fixed-length codes.]

Example with prediction: ENCODE the DIFFERENCE between the predicted and the actual symbol. With the differences -1, 0, +1 encoded by a variable-length code, the average length is L = .25·2 + .5·1 + .25·2 = 1.5 bit/difference symbol.

Binary tree codes give the relation between source symbols and codewords, e.g. A := 11, B := 10, C := 0 (the remaining leaf).
General properties:
- every node has two successors: leaves and/or nodes
- the path to a leaf gives the associated codeword
- source letters are assigned only to leaves, i.e. no codeword is a prefix of another codeword

Tree codes are prefix codes and uniquely decodable, i.e. a string of codewords can be uniquely decomposed into the individual codewords. Non-prefix codes may also be uniquely decodable, example: A := 1, B := 10, C := 100.
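A minimal sketch of prefix decoding (Python, using the example code A := 11, B := 10, C := 0 from the tree above): because no codeword is a prefix of another, the decoder can read bits greedily and emit a symbol as soon as its buffer matches a codeword, with no separators needed.

```python
# Greedy decoding of a prefix (tree) code; table taken from the tree example above.
CODE = {"A": "11", "B": "10", "C": "0"}
DECODE = {bits: sym for sym, bits in CODE.items()}   # invert the table

def decode(bitstring: str) -> str:
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in DECODE:            # a complete codeword has been read
            out.append(DECODE[buf])
            buf = ""
    assert buf == "", "bit string ended in the middle of a codeword"
    return "".join(out)

print(decode("11010"))  # "11" + "0" + "10" -> "ACB"
```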

Binary tree codes: the average codeword length is L = Σ_i P(i)·n_i.
Property: an optimal code has minimum L.
Property: for an optimal code the two least probable codewords
- have the same length
- are the longest
- and, by manipulating the assignment, can be made to differ only in the last code digit
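To make the definition concrete, a small helper (Python; the probabilities and length assignments below are illustrative assumptions, not the table from these slides) computes L = Σ P(i)·n_i and shows the effect of putting short codewords on likely symbols:

```python
# Average codeword length L = sum_i P(i) * n_i for a given assignment.
def average_length(probs, lengths):
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * n for p, n in zip(probs, lengths))

probs = [0.5, 0.25, 0.125, 0.125]                 # assumed example distribution
print(average_length(probs, [1, 2, 3, 3]))        # 1.75: short words on likely symbols
print(average_length(probs, [3, 3, 2, 1]))        # 2.625: same lengths, bad assignment
```

The same reassignment effect is what STEP 2 of the tree-encoding example below exploits.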

Tree encoding (1): for data / text the compression should be lossless (no errors).
STEP 1: assign messages to nodes. [Slide shows a table: symbols a, b, c, d, e with codewords, lengths n_i and probabilities P(i).]
AVERAGE CODEWORD LENGTH: 2.75 bits/source symbol.

Tree encoding (2)
STEP 2: OPTIMIZE the ASSIGNMENT (MINIMIZE the average length), i.e. give the shortest codewords to the most probable symbols. [Slide shows the same table with the symbols reordered e, d, c, b, a.]
AVERAGE CODEWORD LENGTH: 2.35 bits/source symbol!

Kraft inequality: prefix codes with M code words satisfy

  Σ_{k=1..M} 2^(-n_k) ≤ 1,

where n_k is the code word length for message k.
Proof: let n_M be the longest codeword length. In a code tree of depth n_M, a terminal node (codeword) at depth n_k eliminates 2^(n_M - n_k) of the 2^(n_M) available nodes at depth n_M. Since at most 2^(n_M) nodes can be eliminated in total, Σ_k 2^(n_M - n_k) ≤ 2^(n_M), which gives the inequality.

Example: depth = 4, so 16 nodes at the deepest level; a codeword of length 1 eliminates 8 of them, a codeword of length 2 eliminates 4, a codeword of length 3 eliminates 2.
Homework: can we replace ≤ by = in the Kraft inequality?

Kraft inequality (converse): suppose that the length specification of M code words satisfies the Kraft inequality, written as

  Σ_i N_i 2^(-i) ≤ 1,

where N_i is the number of code words of length i. Then we can construct a prefix code with the specified lengths.
Note that multiplying by 2^n gives, for every level n: Σ_{i=1..n} N_i 2^(n-i) ≤ 2^n.

Kraft inequality: from this, at every level fewer nodes are used than are available. E.g. at level 3 there are 8 nodes, minus the nodes cancelled by the codewords at levels 1 and 2; the remaining nodes are free to carry the N_3 codewords of length 3, so the construction never gets stuck.

Performance: suppose that we select the code word lengths as n_k = ⌈-log₂ P(k)⌉. Then a prefix code exists, since

  Σ_k 2^(-n_k) ≤ Σ_k P(k) = 1,

with average length

  L = Σ_k P(k)·n_k < Σ_k P(k)·(-log₂ P(k) + 1) = H(U) + 1.
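A quick numerical check of this argument (Python; the distribution is an assumed example, not from the slides): the lengths n_k = ⌈-log₂ P(k)⌉ satisfy the Kraft inequality, so a prefix code exists, and the average length lands between H(U) and H(U) + 1.

```python
import math

P = [0.4, 0.3, 0.2, 0.1]                              # assumed example distribution
n = [math.ceil(-math.log2(p)) for p in P]             # chosen lengths: [2, 2, 3, 4]
kraft = sum(2.0 ** -nk for nk in n)                   # 0.6875 <= 1: prefix code exists
H = -sum(p * math.log2(p) for p in P)                 # entropy H(U) ~ 1.846
L = sum(p * nk for p, nk in zip(P, n))                # average length = 2.4

print(n, kraft)
print(H, L, H + 1)                                    # H(U) <= L < H(U) + 1
```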

Lower bound for prefix codes: we show that L ≥ H(U). We write

  L - H(U) = Σ_k P(k)·n_k + Σ_k P(k)·log₂ P(k) = -Σ_k P(k)·log₂( 2^(-n_k) / P(k) ) ≥ -log₂ Σ_k 2^(-n_k) ≥ 0,

using Jensen's inequality (concavity of the logarithm) and the Kraft inequality. Equality can be established for P(k) = 2^(-n_k).

Huffman Coding (used in JPEG, MPEG, MP3):
1. take together the two smallest probabilities: P(i) + P(j)
2. replace symbols i and j by a new symbol with this probability
3. go to 1, until only one symbol is left
Example: [slide shows the merging tree and the resulting code]. A sketch of this procedure follows below.
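A compact sketch of the merge procedure (Python, heap-based; the distribution is an assumed example, and the exact bit patterns depend on how ties are broken):

```python
import heapq
from itertools import count

def huffman(probs):
    """Repeatedly merge the two least probable entries into one new symbol."""
    tiebreak = count()                               # keeps heap entries comparable
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)              # two smallest probabilities
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}     # assumed example distribution
code = huffman(probs)
print(code)                                          # codeword lengths 1, 2, 3, 3
print(sum(probs[s] * len(w) for s, w in code.items()))  # L = 1.9 bits/symbol
```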

Huffman Coding: optimality. Given a code C with average length L and M symbols, construct C′: replace the 2 least probable symbols C_M and C_{M-1} in C by a new symbol C′_{M-1} with probability P(M) + P(M-1). Then L = L′ + P(M) + P(M-1), so to minimize L we have to minimize L′.

Properties
ADVANTAGES:
- uniquely decodable code
- smallest average codeword length
DISADVANTAGES:
- large tables give complexity
- variable word length
- sensitive to channel errors

Conclusion: tree coding (Huffman) is not universal! It is only valid for one particular type of source. For COMPUTER DATA we want data reduction that is
- lossless: no errors at reproduction
- universal: effective for different types of data

Performance of Huffman coding: using the probability distribution for the source U, a prefix code exists with average length L < H(U) + 1. Since Huffman is optimum, this bound also holds for Huffman codes. Improvements can be made when we take J symbols together; then
- J·H(U) ≤ L < J·H(U) + 1, and
- H(U) ≤ L′ = L/J < H(U) + 1/J.
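An illustration of the block-coding gain (an assumed example, not from the slides): for a binary source with P = (0.9, 0.1), H(U) ≈ 0.469 bit. Coding single symbols, the best Huffman code still needs L = 1 bit/symbol. Taking J = 2 symbols together gives pair probabilities (0.81, 0.09, 0.09, 0.01) with Huffman codeword lengths (1, 2, 3, 3), so L = 0.81 + 0.18 + 0.27 + 0.03 = 1.29 bits per pair, i.e. L′ = L/J = 0.645 bits/symbol, well inside the bound H(U) ≤ L′ < H(U) + 1/J ≈ 0.969.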

Encoding idea Lempel-Ziv-Welch (LZW): assume we have just read a segment w from the text and a is the next symbol.
- If wa is not in the dictionary: write the index of w in the output file, add wa to the dictionary, and set w := a.
- If wa is in the dictionary: process the next symbol with segment wa.

Encoding example. Initial dictionary: address 0: a, address 1: b, address 2: c. String: a a b a a c a b c a b c b
- aa not in dictionary: output 0, add aa at address 3, continue with a
- ab not in dictionary: output 0, add ab at address 4, continue with b
- ba not in dictionary: output 1, add ba at address 5, continue with a
- aa in dictionary, aac not: output 3, add aac at address 6, continue with c
- ca not in dictionary: output 2, add ca at address 7, continue with a
- ab in dictionary, abc not: output 4, add abc at address 8, continue with c
- ca in dictionary, cab not: output 7, add cab at address 9, continue with b
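A minimal LZW encoder sketch (Python) implementing the rule from the previous slide; run on the part of the string traced above it reproduces the outputs 0 0 1 3 2 4 7, plus a final 1 when the leftover segment b is flushed at the end (a step the table does not show):

```python
def lzw_encode(text, alphabet):
    """Extend segment w while w+a is known; otherwise emit index of w, add w+a."""
    dictionary = {sym: i for i, sym in enumerate(alphabet)}
    w, output = "", []
    for a in text:
        if w + a in dictionary:
            w = w + a                                  # keep extending the segment
        else:
            output.append(dictionary[w])               # emit index of the known segment
            dictionary[w + a] = len(dictionary)        # new entry at the next address
            w = a
    if w:
        output.append(dictionary[w])                   # flush the last segment
    return output

print(lzw_encode("aabaacabcab", "abc"))  # [0, 0, 1, 3, 2, 4, 7, 1]
```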

UNIVERSAL (LZW) decoder:
1. Start with the basic symbol set.
2. Read a code c from the compressed file: the address c in the dictionary determines the segment w; write w in the output file.
3. Add wa to the dictionary, where a is the first letter of the next segment.

Decoding example. Initial dictionary: address 0: a, address 1: b, address 2: c. Input: 0 0 1 3 2 4 7
- input 0: output a; the pending entry a? waits for the first letter of the next segment
- input 0: output a, which determines ? = a; update aa at address 3
- input 1: output b, which determines the pending ? = b; update ab at address 4
- input 3: output aa; update ba at address 5
- input 2: output c; update aac at address 6
- input 4: output ab; update ca at address 7
- input 7: output ca; update abc at address 8
Decoded so far: a a b a a c a b c a
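A matching decoder sketch (Python). It also handles the special case discussed two slides below, where a received address is not yet in the decoder's dictionary: the missing entry must be the previous segment plus its own first letter.

```python
def lzw_decode(codes, alphabet):
    """Each address yields a segment; the pending entry is the previous segment
    plus the first letter of the current one."""
    dictionary = {i: sym for i, sym in enumerate(alphabet)}
    output, w = [], None
    for c in codes:
        if c in dictionary:
            entry = dictionary[c]
        else:                                        # address not yet in the dictionary
            entry = w + w[0]                         # previous segment + its first letter
        output.append(entry)
        if w is not None:
            dictionary[len(dictionary)] = w + entry[0]   # complete the pending entry
        w = entry
    return "".join(output)

print(lzw_decode([0, 0, 1, 3, 2, 4, 7, 1], "abc"))  # "aabaacabcab"
print(lzw_decode([0, 1, 2, 4], "ab"))               # "abababa": address 4 arrives early
```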

Conclusion (LZW). IDEA: try to copy long parts of the source output.
- on dictionary overflow, throw the least-recently used entry away in encoder and decoder
- universal
- lossless
Homework: encode/decode the sequence. Try to solve the problem that occurs!

Some history. LZW is used in GIF, TIFF, the V.42bis modem compression standard, and PostScript Level 2.
- 1977: published by Abraham Lempel and Jacob Ziv
- 1984: LZ-Welch algorithm published in IEEE Computer
- 1986: Sperry patent transferred to Unisys
- the GIF file format required use of the LZW algorithm

Summary of operations
ENCODING (segment read → output, dictionary update, location):
- W1 A → output loc(W1), add W1A at location N
- W2 F → output loc(W2), add W2F at location N+1
- W3 X → output loc(W3), add W3X at location N+2
DECODING (input → output, dictionary update, location):
- loc(W1) → output W1; entry W1? pending
- loc(W2) → output W2; pending entry completed as W1A at location N (W2? now pending)
- loc(W3) → output W3; pending entry completed as W2F at location N+1 (W3? now pending)

Problem and solution
ENCODING:
- W1 A → output loc(W1), add W1A at location N
- W2 = W1A, next symbol F → output loc(W2) = N, add W2F at location N+1
DECODING:
- loc(W1) → output W1; entry W1? pending
- loc(W2 = W1A) = N → this address is not yet in the decoder's dictionary!
Since W2 = W1A, the pending ? is the first letter of W2, which equals the first letter of W1; so the ? can be solved and W2 is entered at location N as W1A.
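For illustration (an assumed example, not from the slides): with initial dictionary 0: a, 1: b, the string a b a b a b a encodes to 0 1 2 4. When the decoder receives 4, that address is not yet in its dictionary; the pending entry is W1? = ab?, and since the unknown segment W2 starts with W1 = ab, its first letter must be a, so the decoder enters aba at address 4 and outputs it. Decoded: a, b, ab, aba = a b a b a b a. The decoder sketch above handles exactly this case.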

Shannon-Fano coding. Suppose that we have a source with M symbols. Every symbol u_i occurs with probability P(u_i). We try to encode symbol u_i with n_i = ⌈-log₂ P(u_i)⌉ bits. Then the average representation length is L = Σ_i P(u_i)·n_i < H(U) + 1.

Code realization. Order the symbols by decreasing probability and define the cumulative probability Q(u_1) = 0, Q(u_i) = Σ_{k=1..i-1} P(u_k).

Continued. The codeword for u_i is the binary expansion of Q(u_i), truncated to length n_i.
Property: the code is a prefix code with the promised length.
Proof: let i ≥ k+1. Then Q(u_i) - Q(u_k) = Σ_{j=k..i-1} P(u_j) ≥ P(u_k) ≥ 2^(-n_k).

Continued.
1. Since Q(u_i) - Q(u_k) ≥ 2^(-n_k), the binary (radix-2) representations of Q(u_i) and Q(u_k) differ at least once within the first n_k positions.
2. The codewords for u_i and u_k have lengths n_i ≥ n_k.
3. Hence the truncated representation of Q(u_k) can never be a prefix of the codeword for u_i.

Example: P(u_0, u_1, ..., u_7) = (5/16, 3/16, 1/8, 1/8, 3/32, 1/16, 1/16, 1/32)
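A sketch of the code realization above applied to this distribution (Python, using exact fractions): it computes n_i = ⌈-log₂ P(u_i)⌉, the cumulative probabilities Q(u_i), and the truncated binary expansions. For these probabilities it should produce the codewords 00, 010, 100, 101, 1100, 1101, 1110, 11110 (lengths 2, 3, 3, 3, 4, 4, 4, 5), which form a prefix code as promised.

```python
import math
from fractions import Fraction

P = [Fraction(5, 16), Fraction(3, 16), Fraction(1, 8), Fraction(1, 8),
     Fraction(3, 32), Fraction(1, 16), Fraction(1, 16), Fraction(1, 32)]

Q = Fraction(0)                                    # cumulative probability Q(u_i)
for i, p in enumerate(P):
    n = math.ceil(-math.log2(p))                   # codeword length n_i
    bits, q = "", Q
    for _ in range(n):                             # first n_i bits of the expansion of Q
        q *= 2
        bits += "1" if q >= 1 else "0"
        q -= int(q)
    print(f"u{i}: P = {p}, n = {n}, Q = {Q}, codeword = {bits}")
    Q += p
```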