269202 ALGORITHMS FOR ISNE DR. KENNETH COSH WEEK 13.

REVIEW  Game Theory  Minimax Algorithm  Horizon Effect  Decision Theory  Probability & Utility  MEU

THIS WEEK  Data Compression Techniques  We can improve the efficiency of many algorithms by compressing data.  Storing ‘M’ or ‘F’ rather than ‘Male’ or ‘Female’ takes less space AND less time to transfer between parts of a program.  Let’s investigate how we can efficiently encode data.

ENCODING 128  128  80  10000000  CXXVIII  ρκη  |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

ASSUMPTIONS  We have ‘n’ different symbols we can use to encode a message.  In Morse code n=3.  In Binary n=2.  All symbols m_i forming the set M have probabilities of occurrence P(m_i) such that P(m_1) + … + P(m_n) = 1  Infrequently occurring symbols can be assigned a long code word, while short code words are reserved for frequent symbols.

CODING OBJECTIVES  Each codeword corresponds to exactly one symbol.  Decoding should not require any look ahead.  This is known as the ‘prefix’ property.

PREFIX PROPERTY  Symbols: A, B, C  Codes: 1, 2, 12  Message: 12  Is it ‘AB’? Is it ‘C’?

PREFIX PROPERTY  Symbols: A, B, C  Codes: 1, 22, 12  Message: 1222  Read in 1, is it an A?  Read in 2, was it a C?  Read in 2, should it be AB?  Read in 2, ah, finally we can conclude it was CB.

PREFIX PROPERTY  Symbols: A, B, C  Codes: 11, 12, 21  No Ambiguity here.
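The three code sets above can be checked mechanically. The following is a minimal sketch in Python; the function name is_prefix_free is a hypothetical helper, not something from the lecture. It simply tests whether any codeword is the start of a different codeword.

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of a different codeword."""
    for a in codes:
        for b in codes:
            if a != b and b.startswith(a):
                return False
    return True


# The three code sets from the slides, for symbols A, B, C.
print(is_prefix_free(["1", "2", "12"]))    # False: '1' is a prefix of '12'
print(is_prefix_free(["1", "22", "12"]))   # False: '1' is a prefix of '12'
print(is_prefix_free(["11", "12", "21"]))  # True: no codeword starts another
```

Note that the second set can still be decoded without ambiguity, but only by looking ahead, which is exactly what the prefix property is meant to avoid.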

CODE OPTIMISATION  The code for a symbol should be no longer than the code for any less likely symbol;  if P(m_i) ≤ P(m_j) then L(m_i) ≥ L(m_j)  There should be no unused short codes, either as stand-alone encodings or as prefixes for longer codes.  01, 000, 001, 100, 101 is not ideal as 11 is not used.
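One way to spot unused short codes is the Kraft sum, which is not mentioned in the slides and is brought in here as an assumption-free but external tool: for a prefix-free code over n code symbols, the sum of n^(-length) over all codewords is at most 1, and for a binary code it equals 1 exactly when no part of the code space is wasted. The sketch below (kraft_sum is a hypothetical helper name) applies it to the example above.

```python
from fractions import Fraction


def kraft_sum(codes, n=2):
    """Sum of n**(-len(code)) over all codewords; n is the number of code symbols."""
    return sum(Fraction(1, n ** len(code)) for code in codes)


slide_codes = ["01", "000", "001", "100", "101"]
print(kraft_sum(slide_codes))           # 3/4: the quarter of the code space starting 11 is unused
print(kraft_sum(slide_codes + ["11"]))  # 1: adding 11 fills the gap
```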

HUFFMAN CODING  Huffman coding is a method for choosing a representation for each symbol, resulting in a prefix-free code.  The bit string representing some particular symbol is never a prefix of the bit string representing any other symbol.  The most common characters are expressed using shorter strings of bits than are used for less common symbols.

HUFFMAN CODING  Huffman coding starts by creating a heap (a binary tree), based on the frequencies of the symbols to be encoded.  The root of the heap should be 1.0, the sum of both child nodes (or the total count when raw counts are used, 120 in the example below).  Each symbol appears in a leaf node.  Take the set (with corresponding occurrence frequencies out of 120);

Symbol     A   B   C   D   E   F   G   H   I
Frequency  10  15  5   15  20  5   15  30  5

CREATING A HEAP  There are several ways a heap can be created from the given data, here is one option (children are indented under their parent, left child first);

120
  55
    25
      A (10)
      15
        C (5)
        10
          F (5)
          I (5)
    30
      B (15)
      D (15)
  65
    H (30)
    35
      G (15)
      E (20)

HUFFMAN CODING  Each symbol is then given a code based on its position in the tree above: 0 for a left branch, 1 for a right branch;

A = 000
B = 010
C = 0010
D = 011
E = 111
F = 00110
G = 110
H = 10
I = 00111
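The construction can be sketched in Python using the standard heapq module as the priority queue. This is a minimal sketch, not the lecture's own implementation; huffman_codes and the tie-breaking counter are illustrative choices. Because several symbols share the same frequency, the individual codes it prints may differ from the ones on the slide, but the total encoded length is the same.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a prefix-free code from a {symbol: frequency} mapping."""
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    # Repeatedly merge the two least frequent nodes into a new parent node.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + "0")  # 0 for the left branch
            walk(node[1], prefix + "1")  # 1 for the right branch
        else:                            # leaf: a symbol
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


freqs = {"A": 10, "B": 15, "C": 5, "D": 15, "E": 20,
         "F": 5, "G": 15, "H": 30, "I": 5}
codes = huffman_codes(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes)
print(total_bits)  # 355, the same total as the slide's code
```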

HUFFMAN DECODING  A = 000  B = 010  C = 0010  D = 011  E = 111  F = 00110  G = 110  H = 10  I = 00111

0010000010010000110111
000011011111011
01111100000110
0100011111010111000011111011
10111000011000001010111

Does this code fit our prefix requirements?
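Decoding with a prefix-free code needs no look-ahead: read one bit at a time and emit a symbol as soon as the bits read so far match a codeword. A minimal Python sketch of that loop, using the code table from the slide (the function name decode is a hypothetical helper), could look like this:

```python
CODES = {"A": "000", "B": "010", "C": "0010", "D": "011", "E": "111",
         "F": "00110", "G": "110", "H": "10", "I": "00111"}


def decode(bits, codes=CODES):
    """Decode a bit string one bit at a time, with no look-ahead."""
    by_code = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:  # a complete codeword has been read
            out.append(by_code[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword: " + current)
    return "".join(out)


messages = [
    "0010000010010000110111",
    "000011011111011",
    "01111100000110",
    "0100011111010111000011111011",
    "10111000011000001010111",
]
for message in messages:
    print(decode(message))
```

Running it prints the five decoded words from the exercise.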

ADAPTIVE HUFFMAN  Huffman coding was invented back in the 1950s, and has a key assumption: that we know the expected frequency of each symbol.  We could get this frequency from a large corpus, such as the BNC (British National Corpus), but then the resulting Huffman tree might not be ideal for all circumstances.  A technical computing paper is likely to contain more non-alphabetic symbols than ‘War and Peace’.  Adaptive Huffman is usable in this situation, as it reconstructs the tree as data is read in.

ADAPTIVE HUFFMAN  Initially all symbols are unused and so stored in a leaf with value 0. As symbols are used they are moved up the tree.  As symbols are used more frequently, they are compared and if necessary swapped with their siblings.  Sibling comparison is easy if the heap is stored as a doubly linked list.

DOUBLY LINKED LIST  [Figure: the example Huffman tree (root 120) with its nodes also chained together in a doubly linked list.]

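CABBAGES ARE BAD  [Figure: the adaptive Huffman tree as the first letters of ‘cabbages are bad’ are read.]
Start: a single leaf of weight 0 holding the unused symbols a b c d e g r s.
Add c: root weight 1; c becomes a leaf of weight 1; the 0 leaf holds a b d e g r s.
Add a: root weight 2; a becomes a leaf of weight 1; the 0 leaf holds b d e g r s.
Add b: root weight 3; b becomes a leaf of weight 1; the 0 leaf holds d e g r s.
Transpose: nodes are swapped to restore the ordering of the tree.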

ADAPTIVE HUFFMAN  Note that Adaptive Huffman requires just one pass through the message to encode it;  The first time the b is transmitted it has the code 101.  The second time the b is transmitted it has the code 0.  As symbols rise up the tree their codes become shorter.  So long as the decoder applies the same tree-update procedure as the encoder, it can reconstruct the original message.
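The key point, that encoder and decoder stay in step by updating the same model after every symbol, can be illustrated with a deliberately simplified sketch. This is NOT the incremental sibling-swapping algorithm from the slides: as an assumption made for brevity, it rebuilds a whole Huffman code from the running counts before each symbol, which is slow and will not reproduce the slide's codes (101, then 0), but it shows the same one-pass, synchronised behaviour. The names build_codes, encode and decode, and the fixed alphabet, are hypothetical.

```python
import heapq
from itertools import count


def build_codes(counts):
    """Deterministic Huffman code for the current counts (simplified model)."""
    tie = count()
    heap = [(c, next(tie), sym) for sym, c in sorted(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tie), (left, right)))
    codes, stack = {}, [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            codes[node] = prefix
    return codes


def encode(message, alphabet):
    counts = {s: 0 for s in alphabet}  # all symbols start unused
    pieces = []
    for sym in message:
        pieces.append(build_codes(counts)[sym])  # code from counts so far
        counts[sym] += 1                         # then update the model
    return pieces


def decode(bits, alphabet):
    counts = {s: 0 for s in alphabet}
    out, pos = [], 0
    while pos < len(bits):
        by_code = {c: s for s, c in build_codes(counts).items()}
        current = ""
        while current not in by_code:
            if pos >= len(bits):
                raise ValueError("bit stream ended inside a codeword")
            current += bits[pos]
            pos += 1
        out.append(by_code[current])
        counts[by_code[current]] += 1  # same update as the encoder
    return "".join(out)


alphabet = "abcdegrs "  # assumed alphabet: the letters of the message plus space
pieces = encode("cabbages are bad", alphabet)
print(" ".join(pieces))
print(decode("".join(pieces), alphabet))
```

The decoder never sees the frequency table; it recovers the message only because it repeats the encoder's updates in the same order.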

RUN-LENGTH ENCODING  The word cabbage has a repeated b, so run-length encoding might be able to shorten the code, by indicating that there are 2 b’s.  This is similar to the ๆ symbol in Thai.  Naive run-length encoding doesn’t work if there are numbers in the sequence, because the decoder cannot tell a run count from a digit in the data!
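A minimal run-length sketch in Python follows. The output format (a count written in front of any character that repeats, single characters left bare) is an assumption chosen for illustration, as are the names rle_encode and rle_decode; the slides do not fix a format. The last line shows the problem with digits in the data.

```python
from itertools import groupby


def rle_encode(text):
    """Write each run of a repeated character as <count><char>; singles stay bare."""
    out = []
    for ch, run in groupby(text):
        n = len(list(run))
        out.append((str(n) if n > 1 else "") + ch)
    return "".join(out)


def rle_decode(encoded):
    """Inverse of rle_encode, assuming the original text contained no digits."""
    out, digits = [], ""
    for ch in encoded:
        if ch.isdigit():
            digits += ch  # still reading a run count
        else:
            out.append(ch * int(digits or "1"))
            digits = ""
    return "".join(out)


print(rle_encode("cabbage"))           # ca2bage
print(rle_decode("ca2bage"))           # cabbage
print(rle_decode(rle_encode("a1b")))   # ab: the '1' was mistaken for a run count
```

A practical scheme would need an escape character or a fixed-width count field to tell counts apart from data.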
