COMP261 Lecture 22 Data Compression 2.

Data/Text Compression
Reducing the memory required to store some information: original text/image/sound → compress → compressed text/image/sound.

Coding Problem
Given a set of symbols/messages, encode them as bit strings. Need a separate code for each message. Try to minimise the total number of bits. Messages: characters, numbers, ….

Equal Length Codes
Use the same number of bits for every value/message to be encoded. E.g. digits: each digit gets its own code of the same length. E.g. letters: a b c d e f g … z each get their own code of the same length. With N bits, we can have up to 2^N different codes.

Equal Length Codes: How many bits are needed?
Digits: 10 messages, 4 bits. Letters a b c d e f g … z: 26 messages, 5 bits. N different messages need ⌈log2 N⌉ bits per message: 10 numbers, message length = 4; 26 letters, message length = 5. If there are many repeated messages, can we do better?
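
The bit counts above can be checked with a short Python sketch. The function names `bits_per_message` and `fixed_length_codes` are made up for this illustration, and the choice of numbering the messages 0, 1, 2, … in the order given is an assumption, not something the slides specify.

```python
def bits_per_message(n_messages: int) -> int:
    """Smallest N such that 2**N >= n_messages, i.e. ceil(log2(n_messages))."""
    if n_messages < 1:
        raise ValueError("need at least one message")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (n_messages - 1).bit_length()


def fixed_length_codes(messages):
    """Give every message an equal-length binary code (assumed order: as listed)."""
    messages = list(messages)
    width = bits_per_message(len(messages))
    return {m: format(i, "b").zfill(width) for i, m in enumerate(messages)}


print(bits_per_message(10))  # 4
print(bits_per_message(26))  # 5
print(fixed_length_codes("abc"))
```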

Frequency based encoding
Vary the code length, using fewer bits for more common messages. E.g. digits: suppose 0 occurs 50% of the time, 1 occurs 20% of the time, 2 to 5 occur 5% each, and 6 to 10 occur 2% each. Encode: 0 by '0', 1 by '1', 2 by '10', 3 by '11', 4 by '100', 5 by '101', …, 10 by '1010'.

Variable length encoding
More efficient to have variable length codes. Problem: where are the boundaries? Need codes that tell you where the end is. Prefix-free/instantaneous coding: no code is the prefix of another code.
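
A small Python sketch of the prefix-free test may make the boundary problem concrete. The helper name `is_prefix_free` and the four-code table `prefix_free` are invented for this example; only the plain binary numerals come from the frequency-based encoding slide.

```python
def is_prefix_free(codes):
    """Return True if no code in `codes` is a prefix of (or equal to) another."""
    ordered = sorted(codes)
    # After sorting, a code that is a prefix of some other code is always
    # immediately followed by a code that starts with it, so checking
    # adjacent pairs is enough.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


# Plain binary numerals for 0, 1, 2, 3: '1' is a prefix of '10' and '11',
# so the bit string '10' could mean "2" or "1 then 0".
binary_numerals = ["0", "1", "10", "11"]

# A hypothetical prefix-free alternative for four messages.
prefix_free = ["0", "10", "110", "111"]

print(is_prefix_free(binary_numerals))  # False
print(is_prefix_free(prefix_free))      # True
```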

Example: Building the tree
View the powerpoint animation!
Message frequencies: 0: 50%, 1: 20%, 2: 5%, 3: 5%, 4: 5%, 5: 5%, 6: 2%, 7: 2%, 8: 2%, 9: 2%, 10: 2%
New nodes added in the order indicated by their letters! The letters don't mean anything.
a: 4% = 9 (2%) + 10 (2%)
b: 4% = 7 (2%) + 8 (2%)
c: 6% = 6 (2%) + b (4%)
d: 9% = a (4%) + 5 (5%)
e: 10% = 4 (5%) + 3 (5%)
f: 11% = 2 (5%) + c (6%)
g: 19% = e (10%) + d (9%)
h: 30% = f (11%) + g (19%)
j: 50% = 1 (20%) + h (30%)
root: 100% = 0 (50%) + j (50%)

Example: Assigning the codes
Assign parent code + 0 to left child, parent code + 1 to right child.
msg   freq   code
0     50%    0
1     20%    10
2     5%     1100
3     5%     11101
4     5%     11100
5     5%     11111
6     2%     11010
7     2%     110110
8     2%     110111
9     2%     111100
10    2%     111101
Internal nodes: j = 1, h = 11, f = 110, g = 111, c = 1101, e = 1110, d = 1111, b = 11011, a = 11110.
average msg length = (1*.5)+(2*.2)+(4*.05)+(5*.17)+(6*.08) = 2.43 bits
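
The average message length can be recomputed from the code table as a sanity check. In this Python sketch the codes are the ones read off the example tree; the names `FREQ`, `CODE` and `average_length` are chosen for illustration only.

```python
# Frequencies in percent, and codes as read off the example tree.
FREQ = {0: 50, 1: 20, 2: 5, 3: 5, 4: 5, 5: 5, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2}
CODE = {0: "0", 1: "10", 2: "1100", 3: "11101", 4: "11100", 5: "11111",
        6: "11010", 7: "110110", 8: "110111", 9: "111100", 10: "111101"}


def average_length(freq, code):
    """Expected number of bits per message: sum of frequency * code length."""
    total = sum(freq.values())
    return sum(freq[m] * len(code[m]) for m in freq) / total


avg = average_length(FREQ, CODE)
fixed = (len(FREQ) - 1).bit_length()  # equal-length alternative for 11 messages
print(avg, fixed)  # 2.43 4
```

With these frequencies the variable-length code needs 2.43 bits per message on average, against 4 bits per message for an equal-length code over the same 11 messages.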

Huffman Coding
Generates the best (minimum average length) set of prefix-free codes, given frequencies/probabilities on all the messages. Creates a binary tree to construct the codes.
Construct a leaf node for each message with the given probability.
Create a priority queue of messages/nodes (lowest probability = highest priority).
while there is more than one node in the queue (i.e. more than one tree):
    remove the top two nodes
    create a new tree node with these two nodes as children
    node probability = sum of two nodes
    add new node to the queue
final node is root of tree.
Traverse tree to assign codes: if node has code c, assign c0 to left child, c1 to right child.
This is a "greedy" algorithm: at each step it just takes the two lowest-probability nodes, and that local choice always leads to the best code.
See video on YouTube: Text compression with Huffman coding
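
The priority-queue procedure can be sketched in Python with the standard `heapq` module. This is one possible rendering, not the lecture's own code: the function name `huffman_codes`, the tagged-tuple tree representation, and the insertion-order tie-breaking are all assumptions, so individual codes may differ from the slide's tree while the average length stays the same.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Map each message in `freqs` (message -> weight, non-empty) to a bit string."""
    tiebreak = count()  # unique counter so heap comparisons never reach the trees
    # Heap entries are (weight, tiebreak, tree); a tree is either
    # ("leaf", message) or ("node", left_tree, right_tree).
    heap = [(w, next(tiebreak), ("leaf", m)) for m, w in freqs.items()]
    heapq.heapify(heap)

    # Repeatedly merge the two lowest-weight trees into one.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), ("node", t1, t2)))

    # Traverse from the root: code + "0" to the left child, + "1" to the right.
    _, _, root = heap[0]
    codes = {}
    stack = [(root, "")]
    while stack:
        tree, code = stack.pop()
        if tree[0] == "leaf":
            codes[tree[1]] = code or "0"  # a lone message still needs one bit
        else:
            stack.append((tree[1], code + "0"))
            stack.append((tree[2], code + "1"))
    return codes


freqs = {0: 50, 1: 20, 2: 5, 3: 5, 4: 5, 5: 5, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2}
codes = huffman_codes(freqs)
print(sum(freqs[m] * len(codes[m]) for m in freqs) / 100)  # 2.43
```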

Huffman Coding
To decode, we need a table of the codes used. If we label the edges of the tree with 0's and 1's, as added at each level, we get a trie which can be used like a scanner to split the coded string/file into separate codes to be decoded. Last 3 slides added later and covered at start of lecture 23.
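
As a rough Python sketch of the scanner idea, the trie can be held as nested dictionaries keyed by '0' and '1', with messages at the leaves. The code table `TABLE` is a small made-up prefix-free example, and `build_trie` and `decode` are illustrative names, not part of the lecture material.

```python
def build_trie(codes):
    """Turn a message -> bit-string table into nested dicts with messages at leaves."""
    root = {}
    for message, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = message
    return root


def decode(bits, trie):
    """Walk the trie bit by bit, emitting a message each time a leaf is reached."""
    out = []
    node = trie
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):  # reached a leaf: one whole code consumed
            out.append(node)
            node = trie                 # restart at the root for the next code
    if node is not trie:
        raise ValueError("bit string ends in the middle of a code")
    return out


# Hypothetical prefix-free table, for illustration only.
TABLE = {"a": "0", "b": "10", "c": "110", "d": "111"}
trie = build_trie(TABLE)
encoded = "".join(TABLE[m] for m in "abacad")
print(decode(encoded, trie))  # ['a', 'b', 'a', 'c', 'a', 'd']
```

Because the table is prefix-free, the decoder never needs to look ahead: the first leaf it reaches is always the right message.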

Lempel-Ziv 77 revisited
Basic idea is to store a pointer (offset, length) to a maximal substring that has occurred earlier, within a given window size. Basic algorithm is:
for cursor = 0 to maxcursorval:
    look for longest prefix of text[cursor..text.length] occurring in text[max(cursor-windowsize,0)..cursor-1]
    if found, add [offset, length, text[cursor+length]] to output and move cursor past the match
    else add [0, 0, text[cursor]] to output
Can use various approaches to find the substring.
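
A compact Python sketch of this scheme, with a matching decompressor, might look as follows. Several details are assumptions rather than things stated on the slide: the offset is counted backwards from the cursor, a match must lie entirely inside the window, and every triple ends with a literal character (so a match is never allowed to swallow the last character of the text). The names `lz77_compress` and `lz77_decompress` are illustrative.

```python
def lz77_compress(text, window_size=100):
    """Encode `text` as a list of (offset, length, next_char) triples."""
    out = []
    cursor = 0
    while cursor < len(text):
        start = max(cursor - window_size, 0)
        window = text[start:cursor]
        offset, length = 0, 0
        # Grow the candidate while it still occurs in the window, always
        # keeping one character back to serve as the literal of the triple.
        while cursor + length < len(text) - 1:
            pos = window.find(text[cursor:cursor + length + 1])
            if pos < 0:
                break
            length += 1
            offset = cursor - (start + pos)  # distance back from the cursor
        out.append((offset, length, text[cursor + length]))
        cursor += length + 1
    return out


def lz77_decompress(triples):
    """Rebuild the text by copying `length` chars from `offset` back, then the literal."""
    out = []
    for offset, length, char in triples:
        start = len(out) - offset
        out.extend(out[start:start + length])
        out.append(char)
    return "".join(out)


sample = "a contrived text containing riveting contrasting"
triples = lz77_compress(sample)
print(lz77_decompress(triples) == sample)  # True
```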

Lempel-Ziv 77
Note corrections! cursor - windowSize should never point before 0; cursor + lookahead mustn't go past end of text.
cursor ← 0
windowSize ← 100 // some suitable size
while cursor < text.size
    lookahead ← 0
    prevMatch ← 0
    loop
        match ← stringMatch( text[cursor .. cursor+lookahead], text[(cursor<windowSize) ? 0 : cursor-windowSize .. cursor-1] )
        if match succeeded and cursor+lookahead < text.size-1 then
            prevMatch ← match
            lookahead ← lookahead + 1
        else
            output( [suitable value for prevMatch, lookahead, text[cursor+lookahead]] )
            cursor ← cursor + lookahead + 1
            break
This looks for an occurrence of text[cursor..cursor+lookahead] in text[start..cursor-1], for increasing values of lookahead, until none is found, then outputs a triple. This is wasteful: we know there is no match before prevMatch, so there's no point looking there again! Could we improve by starting from prevMatch? Or find longest match starting at each position in window and record longest?
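
The second suggestion (find the longest match starting at each window position and record the longest) could be sketched in Python as below. This is one possible reading of that idea, not the lecture's solution: `longest_match` and `lz77_compress_scan` are invented names, offsets are assumed to count backwards from the cursor, and matches are kept entirely inside the window, with one character always held back as the literal.

```python
def longest_match(text, cursor, window_size=100):
    """Return (offset, length) of the longest window match for text[cursor:]."""
    start = max(cursor - window_size, 0)
    max_len = len(text) - 1 - cursor  # hold one character back for the literal
    best_offset, best_len = 0, 0
    for pos in range(start, cursor):
        length = 0
        # Extend the match from this window position as far as it will go.
        while (length < max_len and pos + length < cursor
               and text[pos + length] == text[cursor + length]):
            length += 1
        if length > best_len:
            best_offset, best_len = cursor - pos, length
    return best_offset, best_len


def lz77_compress_scan(text, window_size=100):
    """Emit (offset, length, next_char) triples using a single pass over the window."""
    out = []
    cursor = 0
    while cursor < len(text):
        offset, length = longest_match(text, cursor, window_size)
        out.append((offset, length, text[cursor + length]))
        cursor += length + 1
    return out


print(longest_match("abcabcabd", 3))  # (3, 3)
```

Each window position is visited once per triple, instead of re-running a full string search for every value of lookahead.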
