Lecture 3. Relation with Information Theory and Symmetry of Information

Shannon entropy of a random variable X over sample space S: H(X) = ∑ P(X=x) log 1/P(X=x), the sum taken over x in S.
Interpretation: H(X) bits are necessary on P-average to describe the outcome x in a prefix-free code.
Example. If P is uniform over a finite S, we have H(X) = ∑ (1/|S|) log |S| = log |S|.
C(x), the Kolmogorov complexity, is the length of a minimum description (shortest program) for one fixed x.
It is a beautiful fact that H(X) and the P-expectation of C(x) converge to the same thing. The expected complexity and the symmetry of information below are two examples of how the concepts approximately coincide.
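As a quick illustration of the definition (a hedged sketch, not part of the original slides), the Python snippet below computes H(X) from a probability vector; for the uniform distribution over |S| outcomes it returns log |S|, matching the example above.

```python
# Minimal sketch: Shannon entropy of a finite distribution, in bits.
import math

def entropy(p):
    # H(X) = sum_x P(X=x) * log2(1/P(X=x)); zero-probability outcomes contribute 0.
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

S = 8                                        # |S| = 8 equally likely outcomes
print(entropy([1.0 / S] * S), math.log2(S))  # both print 3.0
print(entropy([0.5, 0.25, 0.125, 0.125]))    # 1.75 bits
```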

Prefix (free) codes and the ubiquitous Kraft Inequality

A prefix (free) code is a code such that no code word is a proper prefix of another one.
Example: 1, 01, 000, 001 with length set l_1=1, l_2=2, l_3=3, l_4=3.
Kraft Inequality: ∑_x 2^{-l_x} ≤ 1 (*).
(a) (*) holds if {l_x} is the length set of a prefix code.
(b) If {l_x} satisfies (*), then there exists a prefix code with that length set.
Proof. Consider a binary rooted tree with 0 labeling the left edge and 1 labeling the right edge. The prefix code words are leaves labeled with the concatenated labels of the edges on the path from the root to the leaf. The code word length is the number of edges. Put weight 1 on the root, weight ½ on each of its sons, ¼ on each of their sons, and so on. So the prefix code leaf for x has weight 2^{-l_x}, and the sum of the weights is ≤ 1. QED
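A hedged sketch of part (b): the snippet below checks the Kraft inequality for a length set and then builds some prefix code with exactly those lengths by assigning codewords greedily in order of increasing length (a canonical-code version of the tree argument; the particular codewords differ from the example above, but the length set is the same).

```python
# Sketch: verify Kraft's inequality and construct a prefix code from a length set.

def kraft_sum(lengths):
    return sum(2.0 ** -l for l in lengths)

def prefix_code(lengths):
    # Canonical construction: assign codewords in order of increasing length,
    # moving to the next free node of the code tree after each assignment.
    assert kraft_sum(lengths) <= 1.0, "length set violates Kraft's inequality"
    codewords, code, prev_len = {}, 0, 0
    for i in sorted(range(len(lengths)), key=lambda i: lengths[i]):
        l = lengths[i]
        code <<= (l - prev_len)                  # descend to depth l in the tree
        codewords[i] = format(code, 'b').zfill(l)
        code, prev_len = code + 1, l             # next free node at depth l
    # Sanity check: no codeword is a proper prefix of another.
    ws = list(codewords.values())
    assert not any(a != b and b.startswith(a) for a in ws for b in ws)
    return codewords

lengths = [1, 2, 3, 3]                # the length set from the example above
print(kraft_sum(lengths))             # 1.0, so (*) holds (with equality)
print(prefix_code(lengths))           # e.g. {0: '0', 1: '10', 2: '110', 3: '111'}
```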

Kullback-Leibler Divergence

The KL-divergence between two probability mass functions P and Q is D(P║Q) = ∑ P(x) log (P(x)/Q(x)).
It is asymmetric: in general D(P║Q) ≠ D(Q║P).
It is always ≥ 0, and = 0 only if P = Q.
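A hedged sketch with an arbitrary pair of distributions, illustrating both properties (it assumes Q(x) > 0 wherever P(x) > 0):

```python
# Sketch: KL-divergence in bits between two finite distributions.
import math

def kl(P, Q):
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P, Q = [0.5, 0.5], [0.9, 0.1]
print(kl(P, Q), kl(Q, P))   # about 0.74 vs. 0.53: asymmetric
print(kl(P, P))             # 0.0: D(P║P) = 0
```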

Noiseless Coding

The expected code word length with code c for random variable X is l(X,c) = ∑ P(X=x) |c(x)|.
Noiseless Coding Theorem (Shannon): H(X) ≤ min {l(X,c): c is a prefix code} ≤ H(X)+1.
Proof: (left ≤) Let P be as above, and let {l_x : x in S} be the length set of a prefix code. Let Q be the probability mass function defined by Q(x) = 2^{-l_x} (Q is a probability mass function by the Kraft inequality). Then
-D(P║Q) = ∑ P(x) log 1/P(x) - ∑ P(x) log 1/Q(x) = H(P) - ∑ P(x) log 1/Q(x) = H(P) - ∑ P(x) l_x ≤ 0, and = 0 only if Q = P.
(right ≤) The Shannon-Fano code achieves this by coding x as c(x) with length ≤ log 1/P(x) + 1 (code word length = log 1/P(x) rounded upwards). QED
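A hedged numerical sketch of both inequalities: the Shannon-Fano lengths l_x = ⌈log 1/P(x)⌉ satisfy the Kraft inequality (so a prefix code with these lengths exists, by part (b) above), and the resulting expected length lies between H(X) and H(X)+1. The distribution below is an arbitrary example.

```python
# Sketch: Shannon-Fano code lengths and the noiseless coding bounds.
import math

P = [0.4, 0.3, 0.2, 0.1]
lengths = [math.ceil(math.log2(1.0 / p)) for p in P]   # l_x = ceil(log 1/P(x))
H = sum(p * math.log2(1.0 / p) for p in P)
expected = sum(p * l for p, l in zip(P, lengths))

print(sum(2.0 ** -l for l in lengths) <= 1.0)  # Kraft holds: a prefix code exists
print(H, expected, H + 1)                      # H <= expected length <= H + 1
```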

Asymptotic Equivalence of Entropy and Expected Complexity

String x = y_1 ... y_m (l(y_i) = n), and p_k = d({i : y_i = k}) / m for k = 1, ..., 2^n = N.
Theorem (2.8.1 in the book). C(x) ≤ m(H + ε(m)) with H = ∑ p_k log 1/p_k and ε(m) = 2^{n+1} l(m)/m (ε(m) → 0 for m → ∞ and n fixed).
Proof. C(x) ≤ 2l(mp_1) + ... + 2l(mp_N) + l(j), with j ≤ m! / ((mp_1)! ... (mp_N)!), the number of strings with these block frequencies. Since mp_k ≤ m we have C(x) ≤ 2N l(m) + l(j), and writing the multinomial coefficient as factorials and using Stirling's approximation, the theorem is proved. ■
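The two-part encoding in this proof can be made concrete. The hedged sketch below (for an arbitrary example string; it does not compute the uncomputable C(x)) tallies the description length "N frequency counts plus an index into the set of strings with those counts" and compares it with the bound m(H + ε(m)).

```python
# Sketch: the two-part description length used in the proof of Theorem 2.8.1.
import math
from collections import Counter

def two_part_length(x, n):
    # x is a bit string whose length is a multiple of n.
    blocks = [x[i:i + n] for i in range(0, len(x), n)]
    m, N = len(blocks), 2 ** n
    counts = Counter(blocks)
    # log2 of the multinomial coefficient m! / prod_k (m p_k)!  (the index j).
    log_j = (math.lgamma(m + 1)
             - sum(math.lgamma(c + 1) for c in counts.values())) / math.log(2)
    header = 2 * N * m.bit_length()            # the counts, about 2 l(m) bits each
    return header + log_j

n, x = 4, "0110" * 100 + "1010" * 60           # m = 160 blocks of n = 4 bits
m = len(x) // n
counts = Counter(x[i:i + n] for i in range(0, len(x), n))
H = sum((c / m) * math.log2(m / c) for c in counts.values())
eps = 2 ** (n + 1) * m.bit_length() / m        # epsilon(m) = 2^{n+1} l(m) / m
print(two_part_length(x, n), m * (H + eps))    # roughly 405 <= roughly 409
```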

Continued

If x is an outcome of a sequence of independent trials of a random variable X, the inequality can be replaced by an asymptotic equality w.h.p. Namely, take X uniform with 2^n outcomes: H(X) = ∑ P(X=x) log 1/P(X=x) = n and E = ∑ P(X=x) C(x) (l(x) = n).
There are at least 2^n(1-2^{-c+1}) many x's with C(x) ≥ n-c. Hence
n/(n+O(1)) ≤ H(X)/E ≤ n/((1-2^{-c+1})(n-c)).
Substitute c = log n to obtain lim H(X)/E = 1 for n → ∞.

Symmetry of Information

In Shannon information theory, the symmetry of information is well known: I(X;Y) = I(Y;X), with I(X;Y) = H(Y) - H(Y|X) (the information in random variable X about random variable Y). Here X, Y are random variables with probabilities P(X=x) = p_x, and the entropy is H(X) = ∑ p_x log 1/p_x.
The proof is by simple rewriting: I(X;Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y), an expression symmetric in X and Y.
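A hedged numerical sketch on an arbitrary small joint distribution, checking the symmetry via H(Y|X) = H(X,Y) - H(X):

```python
# Sketch: check I(X;Y) = I(Y;X) for a small joint distribution P(X=i, Y=j).
import math

def H(p):
    return sum(v * math.log2(1.0 / v) for v in p if v > 0)

joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
px = [sum(p for (i, j), p in joint.items() if i == v) for v in (0, 1)]
py = [sum(p for (i, j), p in joint.items() if j == v) for v in (0, 1)]
Hxy = H(list(joint.values()))

I_xy = H(py) - (Hxy - H(px))    # H(Y) - H(Y|X)
I_yx = H(px) - (Hxy - H(py))    # H(X) - H(X|Y)
print(I_xy, I_yx)               # equal: both are H(X) + H(Y) - H(X,Y)
```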

Algorithmic Symmetry of Information

In Kolmogorov complexity, the symmetry of information is: I(x;y) = I(y;x), with I(x;y) = C(y) - C(y|x), up to an additive log term. The proof, as well as the meaning, is totally different from Shannon's concept. The term C(y) - C(y|x) is known as "the information x knows about y". That this information is symmetric was first proved by Levin and Kolmogorov (in Zvonkin-Levin, Russian Math. Surveys, 1970).
Theorem (2.8.2 in the book). C(x) - C(x|y) = C(y) - C(y|x), up to an additive log term.
Proof. Essentially we will prove: C(x,y) = C(y|x) + C(x) + O(log C(x,y)). (Since also C(x,y) = C(x|y) + C(y) + O(log C(x,y)), the Theorem follows.)
(≤) It is trivial that C(x,y) ≤ C(y|x) + C(x) + O(log C(x,y)).
(≥) We still need to prove C(x,y) ≥ C(y|x) + C(x) + O(log C(x,y)).

Proving: C(x,y) ≥ C(y|x) + C(x) + O(log C(x,y)).
Assume to the contrary: for each c ≥ 0, there are x and y s.t.
C(x,y) < C(y|x) + C(x) - c log C(x,y). (1)
Let A = {(u,z) : C(u,z) ≤ C(x,y)}. Given C(x,y), the set A can be recursively enumerated.
Let A_x = {z : C(x,z) ≤ C(x,y)}. Given C(x,y) and x, we have a simple algorithm to recursively enumerate A_x. One can describe y, given x, using its index in the enumeration of A_x, and C(x,y). Hence
C(y|x) ≤ log |A_x| + 2 log C(x,y) + O(1). (2)
By (1) and (2), for each c, there are x, y s.t. |A_x| > 2^e, where e = C(x,y) - C(x) + (c-2) log C(x,y) - O(1).
But now we obtain a too short description for x, as follows. Given C(x,y) and e, we can recursively enumerate the strings u which are candidates for x by satisfying the condition
A_u = {z : C(u,z) ≤ C(x,y)} and 2^e < |A_u|. (3)
Denote the set of such u by U. Clearly, x ∈ U. Also
{(u,z) : u ∈ U and z ∈ A_u} ⊆ A. (4)
The number of elements in A cannot exceed the available number of programs that are short enough to satisfy its definition:
|A| ≤ 2^{C(x,y)+O(1)}. (5)
Note that {u} × A_u, for different u in U, are disjoint subsets of A. Using (3), (4), (5),
|U| ≤ |A| / min {|A_u| : u ∈ U} < |A| / 2^e ≤ 2^{C(x,y)+O(1)} / 2^e.
Hence we can reconstruct x from C(x,y), e, and the index of x in the enumeration of U. Therefore
C(x) < 2 log C(x,y) + 2 log e + C(x,y) - e + O(1).
Substituting e as given above yields C(x) < C(x) for large c, a contradiction. QED

Symmetry of Information is sharp

Example. Let n be random: C(n) = |n| + O(1). Let x be random of length n: C(x|n) = n + O(1) and C(x) = n + O(1). Then
C(n) - C(n|x) = |n| + O(1) = log n + O(1) (since C(n|x) = O(1): n = l(x) is computable from x), and
C(x) - C(x|n) = n - n + O(1) = O(1).
So I(x;n) = I(n;x) + log n + O(1).

Kamae Theorem

For each natural number m, there is a string x such that for all but finitely many strings y,
I(y;x) = C(x) - C(x|y) ≥ m.
That is: there exist finite objects x such that almost all finite objects y contain a large amount of information about them.