Introduction to Information theory A.J. Han Vinck University of Duisburg-Essen April 2012.

2 Content: Introduction; Entropy and some related properties; Source coding; Channel coding

3 First lecture: What is information theory about; Entropy, or the shortest average representation length; Some properties of entropy; Mutual information; Data processing theorem; Fano inequality

4 Field of interest: Information theory deals with the problem of efficient and reliable transmission of information. It specifically encompasses theoretical and applied aspects of: coding, communications and communication networks; complexity and cryptography; detection and estimation; learning, Shannon theory, and stochastic processes.

5 Some of the successes of IT: Satellite communications: Reed-Solomon codes (also in the CD player) and the Viterbi algorithm; Public-key cryptosystems (Diffie-Hellman); Compression algorithms: Huffman, Lempel-Ziv, MP3, JPEG, MPEG; Modem design with coded modulation (Ungerböck); Codes for recording (CD, DVD).

6 OUR definition of information: Information is knowledge that can be used, i.e. data is not necessarily information. We: 1) specify a set of messages of interest to a receiver, 2) select a message to be transmitted, 3) sender and receiver form a pair.

7 Communication model: source → analogue-to-digital conversion → compression/reduction → security → error protection → from bit to signal (digital).

8 A generator of messages: the discrete source. Output x ∈ {finite set of messages}. Example: binary source: x ∈ {0, 1} with P(x = 0) = p, P(x = 1) = 1 − p. M-ary source: x ∈ {1, 2, …, M} with Σᵢ Pᵢ = 1.

9 Express everything in bits: 0 and 1. Discrete finite ensemble: a, b, c, d → 00, 01, 10, 11; in general, k binary digits specify 2^k messages, and M messages need ⌈log₂M⌉ bits. Analogue signal (problem is sampling speed): 1) sample and 2) represent each sample value in binary. (Figure: a waveform v(t) is sampled and output as 00, 10, 01, 01, 11.)

10 The entropy of a source: a fundamental quantity in information theory. The minimum average number of binary digits needed to specify a source output (message) uniquely is called the SOURCE ENTROPY.

11 SHANNON (1948): 1) Source entropy := H(X) = −Σᵢ P(X = i)·log₂P(X = i) = the minimum average representation length L; 2) this minimum can be obtained! QUESTION: how to represent a source output in digital form? QUESTION: what is the source entropy of text, music, pictures? QUESTION: are there algorithms that achieve this entropy?
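As an illustration (not part of the slides), a short Python sketch that computes this entropy for a hypothetical probability vector:

```python
import math

def entropy(probs):
    """Source entropy H = -sum p*log2(p) in bits; terms with p == 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical source with four messages
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(entropy([0.25] * 4))                 # 2.0 bits (uniform source: log2 M)
```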

12 Properties of entropy. A: For a source X with M different outputs: log₂M ≥ H(X) ≥ 0; the „worst" we can do is just assign log₂M bits to each source output. B: For a source X „related" to a source Y: H(X) ≥ H(X|Y); Y gives additional info about X. When X and Y are independent, H(X) = H(X|Y).

13 Joint entropy: H(X,Y) = H(X) + H(Y|X), also H(X,Y) = H(Y) + H(X|Y). Intuition: first describe Y and then X given Y. From this: H(X) − H(X|Y) = H(Y) − H(Y|X). Homework: check the formula.

14 Cont. As a formula: H(X,Y) = −Σ_{x,y} P(x,y)·log₂P(x,y) = −Σ_{x,y} P(x,y)·log₂[P(x)P(y|x)] = H(X) + H(Y|X).

15 Entropy: Proof of A. We use the following important inequalities: 1 − 1/M ≤ ln M ≤ M − 1, and log₂M = ln M · log₂e (since ln x = y ⇒ x = e^y, and log₂x = ln x · log₂e). Homework: draw the inequality.

16 Entropy: Proof of A
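A sketch of the standard argument, based on the general form ln z ≤ z − 1 of the inequality from the previous slide (a reconstruction, not the slide's own derivation):

```latex
H(X)-\log_2 M \;=\; \sum_i p_i \log_2 \frac{1}{M p_i}
 \;=\; \log_2 e \sum_i p_i \ln \frac{1}{M p_i}
 \;\le\; \log_2 e \sum_i p_i\!\left(\frac{1}{M p_i}-1\right)
 \;=\; \log_2 e\left(\sum_i \frac{1}{M} - 1\right) \;\le\; 0,
```

and H(X) ≥ 0 because every term −pᵢ·log₂pᵢ ≥ 0; equality H(X) = log₂M holds iff pᵢ = 1/M for all i.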

17 Entropy: Proof of B

18 The connection between X and Y (channel diagram): inputs X = 0, 1, …, M−1 with probabilities P(X = i), outputs Y = 0, 1, …, N−1, connected by the transition probabilities P(Y = j | X = i).

19 Entropy: corollary. H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y); H(X,Y,Z) = H(X) + H(Y|X) + H(Z|XY) ≤ H(X) + H(Y) + H(Z).

20 Binary entropy interpretation: let a binary sequence of length n contain pn ones; there are about 2^{n·h(p)} such sequences, so we can specify each sequence with log₂2^{n·h(p)} = n·h(p) bits. Homework: prove the approximation using the Stirling approximation ln N! ≈ N·ln N for N large. Use also log_a x = y ⇔ log_b x = y·log_b a.

21 The binary entropy: h(p) = −p·log₂p − (1−p)·log₂(1−p). Note: h(p) = h(1−p).
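A small Python check (illustrative, not from the slides) of h(p) and of the approximation log₂C(n, pn) ≈ n·h(p) from the previous slide:

```python
import math

def h(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, p = 1000, 0.3
k = int(p * n)
exact = math.log2(math.comb(n, k))   # log2 of the number of length-n sequences with k ones
print(exact, n * h(p))               # ~876 vs ~881: close for large n
print(h(0.3), h(0.7))                # symmetry: h(p) = h(1-p)
```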

22 Homework: consider the figure (a set of points in the X–Y plane, all points equally likely). Calculate H(X), H(X|Y) and H(X,Y).

23 Source coding. Two principles: data reduction: remove irrelevant data (lossy, gives errors); data compression: present data in a compact (short) way (lossless). Transmitter side: original data → remove irrelevance → relevant data → compact description; receiver side: „unpack" → „original data".

24 Shannon's (1948) definition of transmission of information: reproducing at one point (in time or space), either exactly or approximately, a message selected at another point. Shannon uses Binary Information digiTS (BITS), 0 or 1: n bits specify M = 2^n different messages, OR M messages are specified by n = ⌈log₂M⌉ bits.

25 Example: fixed-length representation: each of the letters a, b, …, y, z is assigned its own 5-bit pattern (e.g. a → 00000, …, z → 11001). The alphabet has 26 letters, so ⌈log₂26⌉ = 5 bits suffice. ASCII uses 7 bits and represents 128 characters.

26 ASCII: a table to transform our letters and signs into binary (7 bits = 128 messages). ASCII stands for American Standard Code for Information Interchange.

27 Example: suppose we have a dictionary whose words can be numbered (encoded) with 15 bits each. If the average word length is 5 letters, we need „on the average" 3 bits per letter.

28 Another example. Source output a, b, or c, translated to binary: a → 00, b → 01, c → 10; efficiency = 2 bits/output symbol. Improve efficiency by encoding blocks of 3 outputs: aaa, aab, aba, …, ccc (27 combinations) fit into 5 bits, giving efficiency = 5/3 bits/output symbol (see the sketch below). Homework: calculate the optimum efficiency.
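A hypothetical Python sketch of the block idea: 3³ = 27 triples fit into ⌈log₂27⌉ = 5 bits, while the entropy limit for equally likely symbols is log₂3 ≈ 1.585 bits/symbol:

```python
import math
from itertools import product

symbols = ['a', 'b', 'c']

# Single symbols: ceil(log2 3) = 2 bits each
bits_per_symbol_single = math.ceil(math.log2(len(symbols)))

# Blocks of 3 symbols: 27 blocks fit into ceil(log2 27) = 5 bits
blocks = list(product(symbols, repeat=3))
bits_per_block = math.ceil(math.log2(len(blocks)))

print(bits_per_symbol_single)   # 2 bits/symbol
print(bits_per_block / 3)       # 5/3 ≈ 1.667 bits/symbol
print(math.log2(3))             # ≈ 1.585 bits/symbol: the entropy limit
```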

29 Source coding (Morse idea). Example: a system generates the symbols X, Y, Z, T with probabilities P(X) = ½, P(Y) = ¼, P(Z) = P(T) = 1/8. Source encoder: X → 0; Y → 10; Z → 110; T → 111. Average transmission length = ½·1 + ¼·2 + 2·(1/8)·3 = 1¾ bits/symbol. A naive approach gives X → 00; Y → 10; Z → 11; T → 01, with average transmission length 2 bits/symbol.
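An illustrative Python sketch (assumed encoder/decoder, not from the slides) of this prefix code, reproducing the average length of 1¾ bits per symbol:

```python
code = {'X': '0', 'Y': '10', 'Z': '110', 'T': '111'}   # prefix-free code from the slide
prob = {'X': 0.5, 'Y': 0.25, 'Z': 0.125, 'T': 0.125}

def encode(msg):
    return ''.join(code[s] for s in msg)

def decode(bits):
    inv, out, cur = {v: k for k, v in code.items()}, [], ''
    for b in bits:
        cur += b
        if cur in inv:           # prefix-free: a match is always a complete codeword
            out.append(inv[cur])
            cur = ''
    return ''.join(out)

msg = 'XYXZTXXY'
assert decode(encode(msg)) == msg
avg_len = sum(prob[s] * len(code[s]) for s in code)
print(avg_len)    # 1.75 bits/symbol, which here equals the source entropy
```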

30 Example: variable-length representation of messages. A table lists the letters e, a, x, q with their frequencies of occurrence P(·) and two codes C1 and C2; an example sequence is …aeeqea…. Note: C2 is uniquely decodable! (check!)

31 Efficiency of C1 and C2: comparing the average number of coding symbols of C1 with that of C2 shows that C2 is more efficient than C1.

32 Source coding theorem: Shannon shows that (uniquely decodable) source coding algorithms exist whose average representation length approaches the entropy of the source. We cannot do with less.

33 Basic idea of cryptography: at the sender, an operation with a secret key turns the (open) message into a cryptogram (closed), which is sent; at the receiver, an operation with the secret key turns the cryptogram back into the (open) message.

34 Source coding in message encryption (1): a message of n parts, Part 1, Part 2, …, Part n (for example, every part 56 bits), is enciphered part by part with a key into n cryptograms and deciphered at the receiver. Attacker: n cryptograms to analyze for a particular message of n parts; dependency exists between the parts of the message, so dependency exists between the cryptograms.

35 Source coding in message encryption (2): the n message parts (for example, every part 56 bits) are first source encoded (assume a data compression factor of n-to-1), then enciphered with the key into a single cryptogram; the receiver deciphers and source decodes back into Part 1, Part 2, …, Part n. Attacker: only 1 cryptogram to analyze for a particular message of n parts. Hence, less material for the same message!

36 Transmission of information: mutual information definition; capacity; idea of error correction; information processing; Fano inequality.

37 Mutual information. I(X;Y) := H(X) − H(X|Y) = H(Y) − H(Y|X) (homework: show this!), i.e. the reduction in the description length of X given Y, or: the amount of information that Y gives about X; note that I(X;Y) ≥ 0. Equivalently: I(X;Y|Z) = H(X|Z) − H(X|YZ), the amount of information that Y gives about X given Z.
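A short Python sketch (the joint distribution is a made-up example) computing I(X;Y) as H(X) + H(Y) − H(X,Y), which equals H(X) − H(X|Y):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution P(x, y): a binary symmetric channel with
# crossover probability 0.1 and uniform input
pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

px = {x: sum(p for (xx, _), p in pxy.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in pxy.items() if yy == y) for y in (0, 1)}

I = H(px.values()) + H(py.values()) - H(pxy.values())
print(I)   # ≈ 0.531 bits = 1 - h(0.1)
```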

38 Three classical channels: the binary symmetric channel (satellite), the erasure channel (network), and the Z-channel (optical); each has binary input X ∈ {0,1}, and the outputs are Y ∈ {0,1}, Y ∈ {0,E,1} and Y ∈ {0,1}, respectively. Homework: find the maximum of H(X) − H(X|Y) and the corresponding input distribution (a computational sketch follows below).
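For the homework, a hedged Python sketch that evaluates H(Y) − H(Y|X) = H(X) − H(X|Y) for a given transition matrix and sweeps the input probability; the three matrices below are the usual textbook forms of these channels, with crossover/erasure probability 0.1 chosen only as an example:

```python
import math

def mutual_information(p0, channel):
    """I(X;Y) = H(Y) - H(Y|X) for binary input P(X=0) = p0 and channel[x][y] = P(y|x)."""
    px = [p0, 1 - p0]
    ny = len(channel[0])
    py = [sum(px[x] * channel[x][y] for x in range(2)) for y in range(ny)]
    h_y = -sum(p * math.log2(p) for p in py if p > 0)
    h_y_given_x = -sum(px[x] * channel[x][y] * math.log2(channel[x][y])
                       for x in range(2) for y in range(ny) if channel[x][y] > 0)
    return h_y - h_y_given_x

bsc     = [[0.9, 0.1], [0.1, 0.9]]             # binary symmetric, crossover 0.1
erasure = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]   # outputs 0, E, 1; erasure prob 0.1
z_chan  = [[1.0, 0.0], [0.1, 0.9]]             # Z-channel: only 1 -> 0 errors

for name, ch in [("BSC", bsc), ("erasure", erasure), ("Z", z_chan)]:
    best = max((mutual_information(p / 100, ch), p / 100) for p in range(1, 100))
    print(name, "max I =", round(best[0], 3), "at P(X=0) =", best[1])
```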

39 Example 1. Suppose that X ∈ {000, 001, …, 111} with H(X) = 3 bits, and the channel outputs Y = parity of X. Then H(X|Y) = 2 bits: we transmitted H(X) − H(X|Y) = 1 bit of information! We know that X|Y ∈ {000, 011, 101, 110} or X|Y ∈ {001, 010, 100, 111}. Homework: suppose the channel output gives the number of ones in X. What is then H(X) − H(X|Y)?

40 Transmission efficiency. Example: the erasure channel with inputs 0 and 1 used with probability ½ each and erasure probability e; the outputs 0, E, 1 occur with probabilities (1−e)/2, e, (1−e)/2. H(X) = 1, H(X|Y) = e, so H(X) − H(X|Y) = 1 − e = the maximum!
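A one-line check of H(X|Y) = e for the uniform input (a sketch): given Y = 0 or Y = 1 the input is known, and given Y = E (probability e) the input is still uniform, so

```latex
H(X\mid Y)=\sum_{y} P(y)\,H(X\mid Y=y)
          =\frac{1-e}{2}\cdot 0+\frac{1-e}{2}\cdot 0+e\cdot 1 = e .
```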

41 Example 2. Suppose we have 2^n messages specified by n bits, transmitted over the erasure channel (0 → 0 and 1 → 1 with probability 1−e, 0/1 → E with probability e). After n transmissions we are left with about ne erasures; thus the number of messages we cannot specify is 2^{ne}, and we transmitted n(1−e) bits of information over the channel!

42 Transmission efficiency: easily obtainable with feedback! Send 0 or 1; receive 0, 1 or E. A received 0 or 1 is correct; if an erasure occurs, repeat until correct. The average time to transmit 1 correct bit is T = (1−e) + 2e(1−e) + 3e²(1−e) + ⋯ = 1/(1−e), so R = 1/T = 1 − e.
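The geometric series behind this average (shown for completeness):

```latex
T=\sum_{k=1}^{\infty} k\,e^{k-1}(1-e)=(1-e)\,\frac{1}{(1-e)^{2}}=\frac{1}{1-e},
\qquad R=\frac{1}{T}=1-e .
```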

43 Transmission efficiency. I need on the average H(X) bits/source output to describe the source symbols X; after observing the channel output Y, I need only H(X|Y) bits/source output. The reduction in description length is called the transmitted information: R = H(X) − H(X|Y) = H(Y) − H(Y|X), from earlier calculations. We can maximize R by changing the input probabilities; the maximum is called the CAPACITY (Shannon 1948).

44 Transmission efficiency. Shannon shows that error-correcting codes exist that have an efficiency k/n approaching the capacity (n channel uses for k information symbols) and a decoding error probability → 0 when n is very large. Problem: how to find these codes.

45 In practice: transmit 0 or 1, receive 0 or 1. Transmit 0, receive 0: correct; transmit 0, receive 1: incorrect; transmit 1, receive 1: correct; transmit 1, receive 0: incorrect. What can we do about it?

46 Reliable: 2 examples. (1) Transmit A := 00 or B := 11; receive 00 or 11: OK, receive 01 or 10: not OK, 1 error detected! (2) Transmit A := 000 or B := 111; decode 000, 001, 010, 100 → A and 111, 110, 101, 011 → B: 1 error corrected! (A minimal sketch follows below.)
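A minimal Python sketch (illustrative) of the second example, the length-3 repetition code with majority-vote decoding:

```python
def encode(bit):
    return [bit] * 3                       # A := 000, B := 111

def decode(word):
    return 1 if sum(word) >= 2 else 0      # majority vote

# every single-error pattern is corrected
for bit in (0, 1):
    codeword = encode(bit)
    for i in range(3):
        received = codeword.copy()
        received[i] ^= 1                   # flip one bit
        assert decode(received) == bit
print("all single errors corrected")
```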

47 Data processing (1). Let X, Y and Z form a Markov chain X → Y → Z, i.e. Z is independent from X given Y: P(x,y,z) = P(x) P(y|x) P(z|y). Then I(X;Y) ≥ I(X;Z). Conclusion: processing destroys information.

48 Data processing (2). To show that I(X;Y) ≥ I(X;Z). Proof: I(X;(Y,Z)) = H(Y,Z) − H(Y,Z|X) = H(Y) + H(Z|Y) − H(Y|X) − H(Z|YX) = I(X;Y) + I(X;Z|Y). Also I(X;(Y,Z)) = H(X) − H(X|YZ) = H(X) − H(X|Z) + H(X|Z) − H(X|YZ) = I(X;Z) + I(X;Y|Z). Now I(X;Z|Y) = 0 (independence) and I(X;Y|Z) ≥ 0, thus I(X;Y) ≥ I(X;Z).

49 I(X;Y) ≥ I(X;Z)? The question is: H(X) − H(X|Y) ≥ H(X) − H(X|Z), or H(X|Z) ≥ H(X|Y)? Proof: 1) H(X|Z) − H(X|Y) ≥ H(X|ZY) − H(X|Y) (extra conditioning cannot increase H); 2) from P(x,y,z) = P(x)P(y|x)P(z|xy) = P(x)P(y|x)P(z|y) it follows that H(X|ZY) = H(X|Y); 3) thus H(X|Z) − H(X|Y) ≥ H(X|ZY) − H(X|Y) = 0.

50 Fano inequality (1). Suppose we have the following situation: Y is the observation of X through a channel p(y|x), and a decoder maps Y to an estimate X‘. Y determines a unique estimate X‘: correct with probability 1−P, incorrect with probability P.

51 Fano inequality (2). Since Y uniquely determines X‘, we have H(X|Y) = H(X|(Y,X‘)) ≤ H(X|X‘). X‘ differs from X with probability P. Thus, for L experiments, we can describe X given X‘ by: firstly, describing the positions where X‘ ≠ X with L·h(P) bits; secondly, the positions where X‘ = X need no extra bits, and for the LP positions where X‘ ≠ X we need log₂(M−1) bits each to specify X. Hence, normalized by L: H(X|Y) ≤ H(X|X‘) ≤ h(P) + P·log₂(M−1).

52 Fano inequality (3). H(X|Y) ≤ h(P) + P·log₂(M−1). (Figure: the bound h(P) + P·log₂(M−1) plotted against P, rising from 0 to log₂M at P = (M−1)/M.) Fano relates the conditional entropy to the detection error probability. Practical importance: for a given channel with conditional entropy H(X|Y), the detection error probability has a lower bound: it cannot be better than this bound!
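A small numerical sketch (illustrative) that recovers the lower bound on P by scanning for the smallest P satisfying the Fano bound:

```python
import math

def h(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_lower_bound(h_x_given_y, M, step=1e-4):
    """Smallest P in [0, (M-1)/M] with h(P) + P*log2(M-1) >= H(X|Y)."""
    p = 0.0
    while p <= (M - 1) / M:
        if h(p) + p * math.log2(M - 1) >= h_x_given_y:
            return p
        p += step
    return (M - 1) / M

print(fano_lower_bound(2.0, 4))   # ≈ 0.75: no observation of a uniform X on 4 values
print(fano_lower_bound(1.0, 4))   # ≈ 0.19: a smaller H(X|Y) gives a smaller lower bound
```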

53 Fano inequality (3): example. X ∈ {0, 1, 2, 3}; P(X = 0, 1, 2, 3) = (¼, ¼, ¼, ¼); X can be observed as Y. Example 1: no observation of X, so P = ¾ and H(X) = 2 ≤ h(¾) + ¾·log₂3 (which holds with equality). Examples 2 and 3 (channel diagrams from x to y with transition probabilities 1/3 and 1/2, respectively): the resulting H(X|Y), substituted into the Fano bound, gives a lower bound on P, e.g. P > 0.23 in Example 3.

54 List decoding. Suppose that the decoder forms a list of size L, and P_L is the probability that X is in the list. Then H(X|Y) ≤ h(P_L) + P_L·log₂L + (1−P_L)·log₂(M−L). The bound is not very tight, because of the log₂L term. Can you see why?

55 Fano. Shannon showed that it is possible to compress information and produced examples of such codes, which are now known as Shannon-Fano codes. Robert Fano was an electrical engineer at MIT (the son of G. Fano, the Italian mathematician who pioneered the development of finite geometries and for whom the Fano plane is named).

56 Application of source coding: example MP3. Digital audio signals: without data reduction, 16-bit samples at a sampling rate of 44.1 kHz for Compact Discs mean that roughly 1.4 Mbit represent just one second of stereo music in CD quality (see the arithmetic below). With data reduction: MPEG audio coding is realized by perceptual coding techniques addressing the perception of sound waves by the human ear. It maintains a sound quality that is significantly better than what you get by just reducing the sampling rate and the resolution of your samples. Using MPEG audio, one may achieve a typical data reduction of 1:4 by Layer 1 (corresponds to 384 kbps for a stereo signal), 1:6…1:8 by Layer 2, and 1:10…1:12 by Layer 3.
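The uncompressed CD rate follows from the sampling parameters (2 channels × 16 bits × 44.1 kHz):

```latex
2 \times 16 \times 44\,100 = 1\,411\,200 \ \text{bit/s} \approx 1.4\ \text{Mbit}
\quad\text{for one second of stereo music in CD quality.}
```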