Information and Coding Theory


1 Information and Coding Theory
Transmission over lossless channels. Entropy. Compression codes - Shannon code, Huffman code, arithmetic code. Juris Viksna, 2015

2 Information transmission
We will focus on compression/decompression parts, assuming that there are no losses during transmission. [Adapted from D.MacKay]

3 Noiseless channel [Adapted from D.MacKay]

4 Noiseless channel How many bits do we need to transfer a particular piece of information? All possible n-bit messages, each with probability 1/2^n. Obviously n bits will be sufficient. It is also not hard to see that n bits will be necessary to distinguish between all possible messages.

5 Noiseless channel Only two of the possible n-bit messages actually occur, each with probability ½; all other messages have probability 0. n bits will still be sufficient. However, we can do quite nicely with just 1 bit!

6 Noiseless channel Only three of the possible n-bit messages occur: "00" and "01" with probability ¼ each, and "10" with probability ½ (all others have probability 0). Try to use 2 bits for "00" and "01" and 1 bit for "10": 00 → 00, 01 → 01, 10 → 1. The average code length is then ¼·2 + ¼·2 + ½·1 = 1.5 bits.

7 Noiseless channel All possible n-bit messages, the probability of message i being p_i. We can try to generalize this by defining entropy (the minimal average number of bits we need to distinguish between messages) in the following way. The word is derived from the Greek εντροπία "a turning towards" (εν- "in" + τροπή "a turning").
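The defining formula is the standard Shannon entropy (with logarithms taken to base 2, cf. slide 11):

H(X) \;=\; -\sum_{i} p_i \log_2 p_i \;=\; \sum_{i} p_i \log_2 \frac{1}{p_i}

For 2^n equally likely messages (each p_i = 1/2^n) this gives H(X) = n bits, matching slide 4.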

8 Entropy - The idea The entropy, H, of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X. [Adapted from T.Mitchell]

9 Entropy - The idea [Adapted from T.Mitchell]

10 Entropy - Definition Example [Adapted from D.MacKay]

11 Entropy - Definition NB!!!
If not explicitly stated otherwise, in this course (as well as in Computer Science in general) the expression log x denotes the base-2 logarithm (i.e. log2 x). [Adapted from D.MacKay]

12 Entropy - Definition The entropy, H, of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X. [Adapted from T.Mitchell]

13 Entropy - Some examples
[Adapted from T.Mitchell]

14 Entropy - Some examples
[Adapted from T.Mitchell]

15 Binary entropy function
Entropy of a Bernoulli trial as a function of the success probability, often called the binary entropy function, Hb(p). The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.
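In formulas, the function plotted here is

H_b(p) \;=\; -p\,\log_2 p \;-\; (1-p)\,\log_2(1-p),

with maximum H_b(1/2) = 1 bit and H_b(0) = H_b(1) = 0 (no uncertainty at all).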

16 Entropy - some properties
[Adapted from D.MacKay]

17 Entropy - some properties
Entropy is maximized if the probability distribution is uniform, i.e. all probabilities p_i are equal.
Sketch of proof: take any two probabilities p and q and replace both of them by their average (p+q)/2; this does not decrease the entropy. The contribution of these two outcomes is
H(p,q) = −(p log p + q log q), and after averaging it becomes
H((p+q)/2, (p+q)/2) = −(p+q) log((p+q)/2).
Since f(x) = −x log x is strictly concave, f((p+q)/2) ≥ (f(p) + f(q))/2, hence
H((p+q)/2, (p+q)/2) = 2 f((p+q)/2) ≥ f(p) + f(q) = H(p,q),
with equality only when p = q. Repeating this averaging argument shows that the uniform distribution maximizes H. In addition we also need some continuity assumptions about H.

18 Joint entropy Assume that we have a set of symbols  with known frequencies of symbol occurrences. We have assumed that on average we will need H() bits to distinguish between symbols. What about sequences of length n of symbols from  (assuming independent occurrence of each symbol with the given frequency)? The entropy of n will be: it turns out that H(n) = nH(). Later we will show that (assuming some restrictions) the encoding that use nH() bits on average are the best we can get.

19 Joint entropy The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X,Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies. [Adapted from D.MacKay]
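The standard definition behind this statement:

H(X,Y) \;=\; -\sum_{x,y} p(x,y)\,\log_2 p(x,y).

If X and Y are independent, p(x,y) = p(x)p(y) and the sum splits into H(X) + H(Y).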

20 Conditional entropy The conditional entropy of X given random variable Y (also called the equivocation of X about Y) is the average conditional entropy over Y: [Adapted from D.MacKay]
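The standard formula (the slide's own notation is not preserved in this transcript):

H(X\mid Y) \;=\; \sum_{y} p(y)\,H(X\mid Y=y) \;=\; -\sum_{x,y} p(x,y)\,\log_2 p(x\mid y),

which satisfies the chain rule H(X,Y) = H(Y) + H(X|Y).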

21 Conditional entropy [Adapted from D.MacKay]

22 Mutual information Mutual information measures the amount of information that can be obtained about one random variable by observing another. Mutual information is symmetric: [Adapted from D.MacKay]
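The standard identities (the slide's formula is not preserved in this transcript):

I(X;Y) \;=\; H(X) - H(X\mid Y) \;=\; H(Y) - H(Y\mid X) \;=\; H(X) + H(Y) - H(X,Y).

Symmetry I(X;Y) = I(Y;X) is immediate from the last form; moreover I(X;Y) ≥ 0, with equality exactly when X and Y are independent.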

23 Entropy (summarized) Relations between entropies, conditional entropies, joint entropy and mutual information. [Adapted from D.MacKay]

24 Entropy - example [Adapted from D.MacKay]

25 Entropy and data compression
We have some motivation to think that H(X) should represent the minimal number of bits that on average will be needed to transmit a random message x ∈ X. Another property that could be expected from a good compression code is that the probabilities of all code words should be as similar as possible. [Adapted from D.MacKay]

26 Coin weighting problem
The minimal number of weighings needed is three. Can you devise a strategy that uses only three weighings? Can you show that there is no strategy requiring fewer than 3 weighings? It turns out that a "good" strategy needs to use the "most informative" weighings, with the probabilities of all their outcomes being as similar as possible. [Adapted from D.MacKay]

27 Coin weighting problem
A strategy that uses only 3 weighings. [Adapted from D.MacKay]

28 Binary encoding - The problem
Straightforward approach - use 3 bits to encode each character (e.g. '000' for a, '001' for b, '010' for c, '011' for d, '100' for e, '101' for f). The length of the data file will then be 3 bits per character. Can we do better? [Adapted from S.Cheng]
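As a sanity check on the fixed-length approach (the exact file length from the original slide is not reproduced here): with 6 distinct characters any fixed-length binary code needs

\lceil \log_2 6 \rceil \;=\; 3 \text{ bits per character},

so a file of n characters occupies 3n bits.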

29 Variable length codes [Adapted from S.Cheng]

30 Encoding [Adapted from S.Cheng]

31 Decoding [Adapted from S.Cheng]

32 Prefix codes [Adapted from S.Cheng]

33 Prefix codes [Adapted from S.Cheng]

34 Binary trees and prefix codes
[Adapted from S.Cheng]

35 Binary trees and prefix codes
[Adapted from D.MacKay]

36 Binary trees and prefix codes
Another requirement for uniquely decodable codes – Kraft inequality: For any uniquely decodable code the codeword lengths must satisfy: [Adapted from D.MacKay]
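The inequality itself is the Kraft inequality for binary codes,

\sum_{i} 2^{-l_i} \;\le\; 1,

where l_i are the codeword lengths. Conversely, for any lengths satisfying the inequality there exists a prefix code with exactly those lengths, so restricting attention to prefix codes loses nothing.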

37 Binary trees and prefix codes
[Adapted from D.MacKay]

38 Binary trees and prefix codes
[Adapted from S.Cheng]

39 Optimal codes Is this prefix code optimal? [Adapted from S.Cheng]

40 Optimal codes [Adapted from S.Cheng]

41 Shannon encoding [Adapted from M.Brookes]

42 Huffman encoding [Adapted from S.Cheng]

43 Huffman encoding - example
[Adapted from S.Cheng]

44 Huffman encoding - example
[Adapted from S.Cheng]

45 Huffman encoding - example
[Adapted from S.Cheng]

46 Huffman encoding - example
[Adapted from S.Cheng]

47 Huffman encoding - example
[Adapted from S.Cheng]

48 Huffman encoding - example 2
Construct Huffman code for symbols with frequencies: A 15 D 6 F 6 H 3 I 1 M 2 N 2 U 2 V 2 # 7

49 Huffman encoding - example 2
[Adapted from H.Lewis, L.Denenberg]

50 Huffman encoding - algorithm
[Adapted from D.MacKay]
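Since the algorithm itself appears here only as a figure, the following is a minimal Python sketch of the standard greedy construction (repeatedly merge the two least frequent subtrees), applied to the frequencies of slide 48; function and variable names are illustrative, not taken from the slides.

import heapq

def huffman_code(freqs):
    # Build a binary Huffman code for a {symbol: frequency} mapping.
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # symbol (leaf) or a pair (left_subtree, right_subtree).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:                    # degenerate single-symbol alphabet
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # the two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    code = {}
    def walk(tree, prefix):               # label left edges '0', right edges '1'
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

# Frequencies from slide 48 ("example 2"):
freqs = {"A": 15, "D": 6, "F": 6, "H": 3, "I": 1,
         "M": 2, "N": 2, "U": 2, "V": 2, "#": 7}
code = huffman_code(freqs)
avg = sum(freqs[s] * len(code[s]) for s in freqs) / sum(freqs.values())
print(code)
print("average code length: %.2f bits per symbol" % avg)

Ties are broken by insertion order here, so the resulting tree may differ in shape from the one on the slides, but the average code length is the same for any valid tie-breaking.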

51 Huffman encoding - algorithm
[Adapted from S.Cheng]

52 Huffman code for English alphabet
[Adapted from D.MacKay]

53 Huffman encoding - optimality
[Adapted from H.Lewis and L.Denenberg]

54 Huffman encoding - optimality
If n1 and n2 are siblings in T, replacing them by n3 gives a tree T′ with:
W(T′) = W(T) − d·w(n1) − d·w(n2) + (d−1)·w(n3) = W(T) − w(n3),
where d is the depth of n1 and n2 and w(n3) = w(n1) + w(n2). [Adapted from H.Lewis and L.Denenberg]

55 Huffman encoding - optimality
Otherwise let n1 be at depth d1 and n2 at depth d2, with d1 ≤ d2. First exchange n1 with the sibling subtree T2 of n2, producing a tree T′ with:
W(T′) = W(T) − d1·w(n1) − d2·w(T2) + d2·w(n1) + d1·w(T2) = W(T) − (d2 − d1)(w(T2) − w(n1)) ≤ W(T),
since w(n1) ≤ w(T2). Now n1 and n2 are siblings, and we proceed with T′ as in the previous case. [Adapted from H.Lewis and L.Denenberg]

56 Huffman encoding - optimality
Proof by induction: for n = 1 the claim is obvious. Assume T is obtained by the Huffman algorithm and X is an optimal tree. Construct T' and X' as described by the lemma. Then:
w(T') ≤ w(X') (by the induction hypothesis)
w(T) = w(T') + C(n1) + C(n2)
w(X) ≥ w(X') + C(n1) + C(n2)
hence w(T) ≤ w(X). [Adapted from H.Lewis and L.Denenberg]

57 Huffman encoding and entropy
W(Σ) - average number of bits used by the Huffman code; H(Σ) - entropy. Then H(Σ) ≤ W(Σ) < H(Σ)+1.
First assume that all probabilities are of the form 1/2^k. Then we can prove by induction that H(Σ) = W(Σ) (a symbol with probability 1/2^k will always be at depth k):
this is obvious if |Σ| = 1 or |Σ| = 2;
otherwise there will always be two symbols having the smallest probabilities, both equal to the same 1/2^k; these are joined by the Huffman algorithm, which reduces the problem to an alphabet containing one symbol less.

58 Huffman encoding and entropy
W(Σ) - average number of bits used by the Huffman code; H(Σ) - entropy. Then W(Σ) < H(Σ)+1.
Consider symbols a with probabilities 1/2^(k+1) ≤ p(a) < 1/2^k and modify the alphabet:
for each such a, reduce its probability to 1/2^(k+1);
add extra symbols with probabilities of the form 1/2^k, so that the sum of all probabilities is again 1;
construct the Huffman encoding tree - the depth of each initial symbol will be k+1, thus W(Σ) < H(Σ)+1;
finally, prune the tree by deleting the extra symbols; this procedure can only decrease W(Σ).

59 Huffman encoding and entropy
Can we claim that H(Σ) ≤ W(Σ) < H(Σ)+1? In the general case a symbol with probability 1/2^k can be at a depth other than k: consider two symbols with probabilities 1/2^k and 1 − 1/2^k; both of them will be at depth 1. However, changing both probabilities to ½ can only increase the entropy. By induction we can show that all symbol probabilities can be changed to the form 1/2^k in such a way that the entropy does not decrease and the Huffman tree does not change its structure. Thus we will always have H(Σ) ≤ W(Σ) < H(Σ)+1.

60 Arithmetic coding Unlike the variable-length codes described previously, arithmetic coding generates non-block codes. In arithmetic coding, a one-to-one correspondence between source symbols and code words does not exist. Instead, an entire sequence of source symbols (a message) is assigned a single arithmetic code word. The code word itself defines an interval of real numbers between 0 and 1. As the number of symbols in the message increases, the interval used to represent it becomes smaller and the number of information units (say, bits) required to represent the interval becomes larger. Each symbol of the message reduces the size of the interval in accordance with its probability of occurrence. The number of bits per symbol is supposed to approach the limit set by entropy.

61 Arithmetic coding Let the message to be encoded be a1a2a3a3a4

62 Arithmetic coding (Figure: step-by-step narrowing of the encoding interval; the boundary values shown are 0.2, 0.4, 0.8 for the initial partition, then 0.04, 0.08, 0.16, then 0.048, 0.056, 0.072, then 0.0592, 0.0624, 0.0688.)

63 Arithmetic coding So, any number in the final interval (its upper end is 0.0688), for example 0.068, can be used to represent the message. Here 3 decimal digits are used to represent the 5-symbol source message. This translates into 3/5 = 0.6 decimal digits per source symbol and compares favourably with the entropy of −(3×0.2·log10 0.2 + 0.4·log10 0.4) ≈ 0.58 decimal digits per symbol. As the length of the sequence increases, the resulting arithmetic code approaches the bound set by entropy. In practice, the length fails to reach the lower bound because of: the addition of the end-of-message indicator that is needed to separate one message from another; the use of finite-precision arithmetic.
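A toy Python sketch of the interval narrowing described above; it uses plain floating-point arithmetic, so it is a demonstration rather than a practical coder, and the probability model (a1..a4 with probabilities 0.2, 0.2, 0.4, 0.2) is an assumption inferred from the interval boundaries visible in this example and in the decoding slide.

def cumulative(model):
    # Return {symbol: (low, high)} cumulative ranges from {symbol: probability}.
    ranges, c = {}, 0.0
    for sym, p in model.items():
        ranges[sym] = (c, c + p)
        c += p
    return ranges

def encode(message, model):
    ranges = cumulative(model)
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        r_low, r_high = ranges[sym]
        low, high = low + width * r_low, low + width * r_high
    return low, high          # any number in [low, high) identifies the message

def decode(value, model, n):
    ranges = cumulative(model)
    low, high = 0.0, 1.0
    out = []
    for _ in range(n):
        width = high - low
        for sym, (r_low, r_high) in ranges.items():
            if low + width * r_low <= value < low + width * r_high:
                out.append(sym)
                low, high = low + width * r_low, low + width * r_high
                break
    return out

model = {"a1": 0.2, "a2": 0.2, "a3": 0.4, "a4": 0.2}
low, high = encode(["a1", "a2", "a3", "a3", "a4"], model)
print(low, high)              # the upper end of the final interval is 0.0688
print(decode(low, model, 5))  # recovers the 5 symbols

Running it reproduces the 0.0688 upper end of the final interval quoted above.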

64 Arithmetic coding Message is lluure? (we use "?" as the message terminator).
Initial partition of the (0,1) interval; the final range is [ , ). Transmit any number within that range - 16 bits suffice. (A Huffman coder needs 18 bits; a fixed-length coder 21 bits.) [Adapted from X.Wu]

65 Arithmetic coding

66 Arithmetic coding Decoding:
Decode the received code word (assuming we know that the number of symbols = 5). Since 0.8 > code word > 0.4, the first symbol must be a3. Repeating the subdivision (the figure shows the successive boundaries 1.0, 0.8, 0.4, 0.2, 0.0; 0.72, 0.56, 0.48; 0.688, 0.624, 0.592; 0.5856, 0.5728, 0.5664; 0.56896), the message is: a3a3a1a2a4.

67 Golomb-Rice codes The Golomb code of parameter m for a positive integer n is given by coding n div m (the quotient) in unary and n mod m (the remainder) in binary. When m is a power of 2, a simple realization is also known as the Rice code.
Example: n = 22, m = 4. n = 22 = '10110'. Shift n right by k = log m (= 2) bits; we get '101', i.e. 5. Output 5 '0's followed by a '1', then output the last k bits of n. So the Golomb-Rice code for 22 is '00000110'.
Decoding is simple: count the '0's up to the first '1' - this gives us the number 5. Then read the next k (= 2) bits, '10', and n = m × 5 + 2 = 20 + 2 = 22. [Adapted from X.Wu]
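A minimal sketch of the Rice special case (m = 2^k) in Python, matching the worked example above; function names are illustrative.

def rice_encode(n, k):
    # Quotient n >> k in unary ('0' * q followed by '1'), remainder in k bits.
    q, r = n >> k, n & ((1 << k) - 1)
    return "0" * q + "1" + format(r, "0{}b".format(k))

def rice_decode(bits, k):
    # Inverse of rice_encode for a single code word.
    q = bits.index("1")                   # number of leading zeros = quotient
    r = int(bits[q + 1:q + 1 + k], 2)     # next k bits are the remainder
    return (q << k) + r

print(rice_encode(22, 2))            # '00000110'
print(rice_decode("00000110", 2))    # 22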

68 Golomb-Rice codes The Golomb code of parameter m for a positive integer n is given by coding n div m (quotient) in unary and n mod m (remainder) in binary; when m is a power of 2, a simple realization is also known as the Rice code.
Which parameters should one choose, and why are these codes good? With p = P(X = 0), a common recommendation is m ≈ −1/log(1 − p). It turns out that for large m such Golomb codes are quite good and in a certain sense equivalent (???) to a Huffman code - and there is no need to compute the code explicitly, which saves time (building an explicit Huffman code is practically not doable for larger m). They are widely used in audio and image compression (FLAC, MPEG-4).
Well, at least that is a "popular textbook claim". What certainly matters is the choice of the block length K: for the code to make any sense, K should be larger, but not much larger, than m. It seems the best choice of K is sometimes even determined experimentally... It is also difficult to find clear statements regarding the equivalence to Huffman codes (although there are good experimental demonstrations that with well-chosen parameters the performance might approach that of arithmetic coding). [Adapted from X.Wu]

