Lecture 6. Prefix Complexity K, Randomness, and Induction

The plain Kolmogorov complexity C(x) has a lot of “minor” but bothersome problems:
- Not subadditive: C(x,y) ≤ C(x) + C(y) holds only modulo a log n term. There exist x, y such that C(x,y) > C(x) + C(y) + log n - c. (This is because there are (n+1)2^n pairs x, y with |x| + |y| = n, so some pair in this set has complexity at least n + log n.)
- Nonmonotonicity over prefixes.
- Problems when defining random infinite sequences in connection with Martin-Löf theory, where we wish to identify infinite random sequences with those whose finite initial segments are all incompressible (Lecture 2).
- Problems with Solomonoff’s initial universal distribution P(x) = 2^{-C(x)}: the sum ∑_x P(x) = ∞, so it is not a (semi)measure.

In order to fix the problems …

Let x = x_1 x_2 … x_n. Define
- x̄ = x_1 0 x_2 0 … x_n 1 (each bit followed by a 0, except the last bit, which is followed by a 1), and
- x' = the x̄-encoding of |x|, followed by x itself.
Thus x' is a prefix code with |x'| ≤ |x| + 2 log|x|, and x' is a self-delimiting version of x.
Let the reference TMs have only the binary alphabet {0,1}, with no blank B. The programs p should form an effective prefix code: ∀ p, p': p is not a proper prefix of p'.
The result is the self-delimiting (prefix) Kolmogorov complexity (Levin 1974, Chaitin 1975). We use K for prefix Kolmogorov complexity, to distinguish it from C, the plain Kolmogorov complexity.
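
To make the encoding concrete, here is a minimal Python sketch of x̄, of x', and of reading x' back off the front of a longer bit stream. The function names (bar, prefix_code, decode) are ours, not from the lecture; bit strings are represented as Python strings of '0'/'1'.

```python
def bar(x: str) -> str:
    """Encode x as x_1 0 x_2 0 ... x_n 1: every bit is followed by 0, the last by 1."""
    return "".join(b + ("1" if i == len(x) - 1 else "0") for i, b in enumerate(x))

def prefix_code(x: str) -> str:
    """x' = bar(binary(|x|)) followed by x, so |x'| <= |x| + 2 log|x| + O(1)."""
    return bar(bin(len(x))[2:]) + x

def decode(stream: str) -> tuple[str, str]:
    """Read one self-delimiting codeword x' off the front of a bit stream."""
    bits, i = [], 0
    while True:                       # recover |x| from the bar-encoded length prefix
        bits.append(stream[i])
        i += 2
        if stream[i - 1] == "1":      # flag bit 1 marks the last length bit
            break
    n = int("".join(bits), 2)
    return stream[i:i + n], stream[i + n:]   # (x, rest of the stream)

x = "1011"
codeword = prefix_code(x)
assert decode(codeword + "000111") == (x, "000111")   # decodable despite trailing data
```

Because no codeword is a prefix of another, concatenated codewords (and trailing data) can be split apart again, which is exactly the property required of the programs p above.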

Properties

- By Kraft’s inequality (for the proof, look at the binary tree): ∑_{x∈{0,1}*} 2^{-K(x)} ≤ 1.
- Naturally subadditive.
- Not monotonic over prefixes (for that we need another variant, such as monotone Kolmogorov complexity).
- C(x) ≤ K(x) ≤ C(x) + 2 log C(x).
- K(x) ≤ K(x|n) + K(n) + O(1).
- K(x|n) ≤ C(x) + O(1) ≤ C(x|n) + K(n) + O(1) ≤ C(x|n) + log* n + log n + log log n + … + O(1).
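
A quick numerical sanity check of Kraft’s inequality, using the concrete prefix code x' from the previous slide as a computable stand-in for the (uncomputable) K: since |x'| = |x| + 2·|binary(|x|)|, the sum ∑_x 2^{-|x'|} can be accumulated length by length.

```python
# Sum 2^(-|x'|) over all strings x of length 1..39, where
# |x'| = |x| + 2*len(binary representation of |x|), as in the previous sketch.
total = 0.0
for n in range(1, 40):
    codeword_len = n + 2 * len(bin(n)[2:])      # |x'| for every x of length n
    total += (2 ** n) * 2.0 ** (-codeword_len)  # there are 2^n strings of length n
print(total)   # approaches 1/2, comfortably within the Kraft bound of 1
```

For this particular code the full infinite sum is 1/2; Kraft’s inequality guarantees only that it cannot exceed 1, and the same bound applied to the prefix-free set of shortest programs gives ∑_x 2^{-K(x)} ≤ 1.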

Alice’s revenge

Remember Bob at the cheating casino, who flipped 100 heads in a row. Now Alice can have a winning strategy. She proposes the following:
- She pays $1 to Bob every time she loses on a 0-flip, and gets $1 every time she wins on a 1-flip.
- She pays $1 extra at the start of the game.
- She receives $2^{100-K(x)} in return, for the flip sequence x of length 100.
This is a fair proposal, since the expected payout for 100 flips of a fair coin is ∑_{|x|=100} 2^{-100} 2^{100-K(x)} ≤ $1. But if Bob cheats with 1^100, then K(1^100) ≤ log 100 + 2 log log 100 + c, and Alice receives about $2^{100 - log 100}.
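
Spelling out the fairness computation in LaTeX, using only the Kraft inequality from the Properties slide:

\[
\mathbb{E}[\text{payout}]
  \;=\; \sum_{|x|=100} 2^{-100}\, 2^{\,100-K(x)}
  \;=\; \sum_{|x|=100} 2^{-K(x)}
  \;\le\; \sum_{x} 2^{-K(x)}
  \;\le\; 1 .
\]

So against a fair coin Alice expects to get back at most the $1 she paid up front, while the regular outcome 1^100 has K(1^100) = O(log 100) and pays her roughly $2^{100 - log 100}.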

Chaitin’s mystery number Ω

Define Ω = ∑_{p halts} 2^{-|p|} (< 1 by Kraft’s inequality, since there is at least one nonhalting program p). Ω is an irrational number.
Theorem 1. Let X_i = 1 iff the i-th program halts. Then Ω_{1:n} encodes X_{1:2^n}; i.e., from Ω_{1:n} we can compute X_{1:2^n}.
Proof. Dovetail the computations of all programs, maintaining a lower approximation Ω' of Ω by adding 2^{-|p|} each time a program p is found to halt. Wait until Ω' ≥ 0.Ω_1Ω_2…Ω_n. From then on, any program p with |p| ≤ n that has not halted yet never will: otherwise Ω ≥ Ω' + 2^{-n} > Ω (the bits of Ω beyond position n contribute less than 2^{-n}), a contradiction. QED
Bennett: Ω_{1:10,000} yields all interesting mathematics.
Theorem 2. For some c and all n: K(Ω_{1:n}) ≥ n - c.
Remark. Thus Ω is a particular random sequence!
Proof. By Theorem 1, given Ω_{1:n} we can obtain all halting programs of length ≤ n. For any x that is not an output of one of these programs, K(x) > n. Since from Ω_{1:n} we can compute such an x, it must be the case that K(Ω_{1:n}) ≥ n - c. QED
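
The halting argument of Theorem 1 can be phrased as a dovetailing procedure. The sketch below is illustrative only: it assumes a hypothetical step-bounded simulator halts_within(p, t) for the reference prefix machine, and that omega_prefix is the exact dyadic value 0.Ω_1…Ω_n; neither is actually available.

```python
from itertools import product

def halting_up_to(n: int, omega_prefix: float, halts_within):
    """Decide halting for every program of length <= n, given the first n bits of Ω
    (as the dyadic rational omega_prefix) and a step-bounded simulator halts_within."""
    short_programs = ["".join(bits) for k in range(1, n + 1)
                      for bits in product("01", repeat=k)]
    halted, approx, t = set(), 0.0, 0
    # Dovetail over ALL programs: at stage t, run every program of length <= t
    # for t steps, adding 2^{-|p|} to the lower approximation of Ω for each halter.
    while approx < omega_prefix:
        t += 1
        for k in range(1, t + 1):
            for bits in product("01", repeat=k):
                p = "".join(bits)
                if p not in halted and halts_within(p, t):
                    halted.add(p)
                    approx += 2.0 ** (-len(p))
    # Once approx >= 0.Ω_1...Ω_n, a not-yet-halted program of length <= n can never
    # halt: it would add at least 2^{-n} and push Ω above 0.Ω_1...Ω_n + 2^{-n} > Ω.
    return {p: (p in short_programs and p in halted) for p in short_programs}
```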

Universal distribution

A (discrete) semimeasure is a function P that satisfies ∑_{x∈N} P(x) ≤ 1. An enumerable (= lower semicomputable) semimeasure P_0 is universal (maximal) if for every enumerable semimeasure P there is a constant c_P such that for all x∈N, c_P P_0(x) ≥ P(x). We say that P_0 dominates each P. We can set c_P = 2^{K(P)}.
The next two theorems are due to L.A. Levin.
Theorem. There is a universal enumerable semimeasure m. We can set m(x) = ∑_P P(x)/c_P, the sum taken over all enumerable probability mass functions P (there are countably many).
Coding Theorem. log 1/m(x) = K(x) + O(1). Proofs omitted.
Remark. This universal distribution m is one of the foremost notions in KC theory. As a prior probability in Bayes’ rule, it maximizes ignorance by assigning maximal probability to all objects (since it dominates every other enumerable distribution up to a multiplicative constant).
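
To make the domination explicit with the constant c_P = 2^{K(P)} above, one standard way to write m (in LaTeX notation, with P_1, P_2, … an effective enumeration of the enumerable (semi)measures; this is just the definition m(x) = ∑_P P(x)/c_P with the weights spelled out) is:

\[
m(x) \;=\; \sum_{j \ge 1} 2^{-K(P_j)}\, P_j(x)
\qquad\Longrightarrow\qquad
2^{K(P)}\, m(x) \;\ge\; P(x) \quad \text{for every enumerable } P,
\]
\[
\sum_x m(x) \;\le\; \sum_{j \ge 1} 2^{-K(P_j)} \;\le\; 1 \quad\text{(Kraft)},
\]

so m is itself an enumerable semimeasure and dominates each P with the claimed constant.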

Randomness Test for Finite Strings

Lemma. If P is computable, then δ_0(x) = log m(x)/P(x) is a universal P-test. (Note that -K(P) ≤ log m(x)/P(x), by the dominating property of m.)
Proof.
(i) δ_0 is lower semicomputable.
(ii) ∑_x P(x) 2^{δ_0(x)} = ∑_x m(x) ≤ 1.
(iii) If δ is a P-test, then f(x) = P(x) 2^{δ(x)} is lower semicomputable and ∑_x f(x) ≤ 1. Hence, by the universality of m, f(x) = O(m(x)). Therefore δ(x) ≤ δ_0(x) + O(1). QED

Individual randomness (finite |x|)

Theorem. x is P-random iff log m(x)/P(x) ≤ 0 (or at most a small value).
Recall: log 1/m(x) = K(x) (ignoring O(1) terms).
Example. Let P be the uniform distribution. Then log 1/P(x) = |x|, and x is random iff K(x) ≥ |x|.
1. Let x = 00…0 (|x| = n, all zeros). Then K(x) ≤ log n + 2 log log n, so K(x) << |x| and x is not random.
2. Let y be a typical sequence of n fair coin flips (|y| = n). Then K(y) ≥ n, so K(y) ≥ |y| and y is random.
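
K itself is uncomputable, but any real compressor gives an upper bound on it, so a quick experiment already exhibits the contrast between the two examples above (illustration only, not a randomness test):

```python
import os, zlib

n = 10_000
x = b"\x00" * n            # highly regular: K(x) = O(log n)
y = os.urandom(n)          # typical outcome of fair coin flips: K(y) close to |y|

print(len(zlib.compress(x, 9)))   # a few dozen bytes: far below n
print(len(zlib.compress(y, 9)))   # roughly n bytes: incompressible in practice
```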

Occam’s Razor

m(x) = 2^{-K(x)} embodies Occam’s Razor: simple objects (with low prefix complexity) have high probability, and complex objects (with high prefix complexity) have low probability.
- x = 00…0 (n zeros) has K(x) ≤ log n + 2 log log n and m(x) ≥ 1/(n (log n)^2).
- y = a random string of length n has K(y) ≥ n and m(y) ≤ 1/2^n.

Randomness Test for Infinite Sequences: Schnorr’s Theorem

Theorem. An infinite binary sequence ω is (Martin-Löf) random (i.e., random with respect to the uniform measure λ) iff there is a constant c such that for all n, K(ω_{1:n}) ≥ n - c.
Proof omitted; see the textbook. (Note: compare with the C-complexity version in Lecture 2.)

Complexity oscillations of initial segments of infinite high-complexity sequences

Entropy

Theorem. If P is a computable probability mass function with finite entropy H(P), then H(P) ≤ ∑_x P(x) K(x) ≤ H(P) + K(P) + O(1).
Proof. Lower bound: by the Noiseless Coding Theorem, since {K(x)} is the length set of a prefix-free code. Upper bound: m(x) ≥ 2^{-K(P)} P(x) for all x, hence K(x) = log 1/m(x) + O(1) ≤ K(P) + log 1/P(x) + O(1); taking the P-expectation of both sides gives ∑_x P(x) K(x) ≤ K(P) + H(P) + O(1). QED

Symmetry of Information

Theorem. Let x* denote the shortest program for x (the first one in a standard enumeration). Then, up to an additive constant,
  K(x,y) = K(x) + K(y|x*) = K(y) + K(x|y*) = K(y,x).
Proof omitted; see the textbook. QED
Remark 1. Let I(x:y) = K(x) - K(x|y*) (the information in x about y). Then I(x:y) = I(y:x) up to an additive constant, so we call I(x:y) the algorithmic mutual information; it is symmetric up to a constant.
Remark 2. K(x|y*) = K(x|y, K(y)).

Complexity of Complexity

Theorem. For every n there are strings x of length n such that (up to an additive constant)
  log n - log log n ≤ K(K(x)|x) ≤ log n.
Proof. The upper bound is immediate: K(x) ≤ n + 2 log n, hence K(K(x)|x) ≤ K(K(x)|n) + O(1) ≤ log n + O(1). The lower bound is involved and omitted; see the textbook. QED
Corollary. Let x have length n. Then K(K(x),x) = K(x) + K(K(x)|x,K(x)) = K(x), but K(x) + K(K(x)|x) can be as large as K(x) + log n - log log n. Hence the Symmetry of Information is sharp.

Average-case complexity under m

Theorem [Li-Vitányi]. If the inputs to an algorithm A are distributed according to m, then the average-case time complexity of A is of the same order of magnitude as A’s worst-case time complexity.
Proof. Let T(n) be the worst-case time complexity and t(x) the running time on input x. Define P as follows: let a_n = ∑_{|x|=n} m(x); if |x| = n and x is the first string such that t(x) = T(n), then P(x) := a_n, else P(x) := 0. P is enumerable, hence c_P m(x) ≥ P(x). Then the average time complexity of A under m is
  T(n|m) = ∑_{|x|=n} m(x) t(x) / ∑_{|x|=n} m(x)
         ≥ (1/c_P) ∑_{|x|=n} P(x) T(n) / ∑_{|x|=n} m(x)
         = (1/c_P) ∑_{|x|=n} [P(x) / ∑_{|x|=n} P(x)] T(n)
         = (1/c_P) T(n). QED
Intuition: the input x with the worst running time has low Kolmogorov complexity, hence large m(x).
Example: Quicksort with a fixed pivot rule. “Easy” (highly regular) inputs such as an already sorted list have high m-probability, and exactly such inputs trigger the worst case.
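
A small experiment illustrating the intuition (not the theorem itself): for Quicksort with a fixed first-element pivot, the “simple” already-sorted input, which has high probability under m, is exactly a worst-case input.

```python
import random

def quicksort_comparisons(a):
    """Number of comparisons made by first-element-pivot Quicksort (iterative)."""
    comparisons, stack = 0, [list(a)]
    while stack:
        seg = stack.pop()
        if len(seg) <= 1:
            continue
        pivot, rest = seg[0], seg[1:]
        comparisons += len(rest)
        stack.append([v for v in rest if v < pivot])
        stack.append([v for v in rest if v >= pivot])
    return comparisons

n = 2000
sorted_input = list(range(n))               # low complexity, high m(x): worst case
random_input = random.sample(range(n), n)   # high complexity, typical case
print(quicksort_comparisons(sorted_input))  # about n^2/2   = 2,000,000 comparisons
print(quicksort_comparisons(random_input))  # about 2n ln n =    30,000 comparisons
```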

General Prediction

Hypothesis formation, experiment, outcomes, hypothesis adjustment, prediction, experiment, outcomes, …
Encode this (infinite) sequence as 0’s and 1’s. The investigated phenomenon can then be viewed as a measure μ over {0,1}^∞, with probability μ(y|x) = μ(xy)/μ(x) of observing y after having seen x.
If we know μ, then we can predict as well as is possible.

Solomonoff’s Approach

Solomonoff (1960, 1964): given a sequence of observations S, predict the next bit of S.
Using Bayes’ rule: P(S1|S) = P(S1) P(S|S1) / P(S) = P(S1) / P(S). Here S1 is S extended by the bit 1, P(S1) is its prior probability, and we know P(S|S1) = 1.
Choose the universal prior probability: P(S) = M(S) = ∑ 2^{-|p|}, the sum taken over all p that are minimal programs with U(p…) = S… .
M is the continuous version of m (for infinite sequences in {0,1}^∞).

Prediction a la Solomonoff

Every predictive task is essentially extrapolation of a binary sequence: given the bits observed so far, is the next bit a 0 or a 1?
The universal semimeasure M, with M(x) = M{xω : ω ∈ {0,1}^∞} for x ∈ {0,1}*, multiplicatively dominates, up to a constant factor, every (semi)computable semimeasure μ.
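
The sketch below is only a toy, finite analogue of this mixture (M itself is not computable): a handful of simple computable predictors, here order-k Markov models, are mixed with weight 2^{-(k+1)} as a crude stand-in for 2^{-K(model)}, multiplied by how well each model explained the bits seen so far. The names markov_predict and mixture_predict are ours.

```python
def markov_predict(history: str, k: int) -> float:
    """P(next bit = '1') under an order-k Markov model with Laplace smoothing."""
    ctx = history[len(history) - k:] if k else ""
    ones = zeros = 1
    for i in range(k, len(history)):
        if history[i - k:i] == ctx:
            if history[i] == "1":
                ones += 1
            else:
                zeros += 1
    return ones / (ones + zeros)

def mixture_predict(history: str, max_order: int = 3) -> float:
    """Mix orders 0..max_order, weighting model k by 2^-(k+1) times the
    sequential likelihood it assigned to the data seen so far."""
    total = p_one = 0.0
    for k in range(max_order + 1):
        likelihood = 1.0
        for t in range(len(history)):          # how well did model k predict bit t?
            q = markov_predict(history[:t], k)
            likelihood *= q if history[t] == "1" else (1.0 - q)
        w = 2.0 ** (-(k + 1)) * likelihood
        total += w
        p_one += w * markov_predict(history, k)
    return p_one / total

print(mixture_predict("0101010101"))   # well below 0.5: next bit predicted to be '0'
```

The simpler (low-order) models get the larger prior weight, mirroring the 2^{-|p|} weighting in M, and the data then shifts the posterior toward whichever simple model actually explains the sequence.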

General Task

The task of AI and of prediction science: for a phenomenon expressed by a measure μ, determine
  μ(y|x) = μ(xy)/μ(x),
the probability that, after having observed data x, the next observations are y.

Solomonoff: M(x) is a good predictor

Expected squared error in the n-th prediction:
  S_n = ∑_{|x|=n-1} μ(x) [ μ(0|x) - M(0|x) ]².
Theorem. ∑_n S_n ≤ (½ ln 2) K(μ), a constant depending only on μ.
Hence the prediction errors S_n are summable: they go to zero, and on average faster than 1/n (otherwise the sum would diverge).

Predictor in ratio

Theorem. For fixed-length y and computable μ: M(y|x)/μ(y|x) → 1 as the length of x grows, with μ-measure 1.
Hence we can estimate the conditional μ-probability by M with almost no error.
Question: does this imply Occam’s razor, i.e., “the shortest program predicts best”?

M is a universal predictor for all computable μ, in expectation

But M is a continuous (semi)measure over {0,1}^∞ and weighs all programs for x, including the shortest one:
  M(x) = ∑_{U(p…)=x…} 2^{-|p|} (p minimal).
Lemma (P. Gács). For some x, log 1/M(x) << the length of the shortest program for x.
This is different from the Coding Theorem in the discrete case, where always log 1/m(x) = K(x) + O(1).
Corollary: using the shortest program for the data is not always the best predictor!

Theorem (Vitányi-Li)

For almost all x (i.e., with μ-measure 1):
  log 1/M(y|x) = Km(xy) - Km(x) + O(1),
with Km the (monotone) complexity: the length |p| of the shortest program p with U(p…) = x… .
Hence it is a good heuristic to choose the extrapolation y that minimizes the length difference between the shortest program producing xy… and the one producing x…; i.e., Occam’s razor!
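
Km is uncomputable, so as a rough practical stand-in for this heuristic one can use a real compressor as an upper-bound proxy and extend the data with whichever bit makes the compressed length grow least (illustration only; a compressor is not Km):

```python
import zlib

def predict_next_bit(x: bytes) -> int:
    """Pick the continuation whose compressed length grows the least after x."""
    def cost(b: int) -> int:
        return len(zlib.compress(x + bytes([b]), 9))
    return 0 if cost(0) <= cost(1) else 1

history = bytes([0, 1] * 500)          # alternating 0,1,0,1,...
print(predict_next_bit(history))       # expected: 0, continuing the alternation
```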