Quantifying Knowledge
Fouad Chedid, Department of Computer Science, Notre Dame University, Lebanon

Information Content Versus Knowledge While the information content of a string x can be measured by its Kolmogorov complexity K(x), it is not clear how to measure the knowledge stored in x. We argue that the knowledge contained in a string x is relative to the hypothesis assumed to compute x.

If H is the hypothesis used to explain x, then we suggest measuring the knowledge in x by K(H). The absolute knowledge in x is K(H_0), where H_0 is a simplest hypothesis for x.

Using Bayes’ rule and Solomonoff’s universal distribution, we obtain K(x) = K(H) + K(x|H) − K(H|x). Here one would expect H to be consistent with x, so K(H|x) should be small. Discarding K(H|x) gives the approximation K(x) ≈ K(H) + K(x|H).
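
One way to see this step (a sketch, using the coding-theorem identification −log m(y) ≈ K(y) for Solomonoff’s universal distribution m, with all equalities holding only up to additive logarithmic terms):

\[
P(H \mid x) = \frac{P(H)\,P(x \mid H)}{P(x)}
\quad\Longrightarrow\quad
-\log P(H \mid x) = -\log P(H) - \log P(x \mid H) + \log P(x).
\]

Replacing each probability by m and each −log m(·) by the corresponding complexity gives

\[
K(H \mid x) \approx K(H) + K(x \mid H) - K(x),
\qquad\text{equivalently}\qquad
K(x) \approx K(H) + K(x \mid H) - K(H \mid x).
\]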

We interpret K(H) as a measure of the knowledge part in x relative to H. K(x|H) is a measure of the accidental information (noise) in x relative to H.

A Simple Example Suppose we record our observations of an ongoing phenomenon and stop gathering data at time t_1, after having obtained the segment x = 10101010101010. The information in x, namely K(x), is about the number of bits in a shortest program for x, something like: For i = 1 to 7 print “10”.

This program assumes the hypothesis H = “x contains the repeating element 10”. It is this H that we call the knowledge in x. The amount K(x|H), which is about log 7, measures the amount of noise in x under H. Other hypotheses exist that trade off the amount of knowledge against the level of noise that can be tolerated; this trade-off is application-dependent.
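
As a toy illustration of this split, the following Python sketch encodes x as a generating rule (the knowledge part) plus a repetition count (the noise part). The byte and bit counts are crude stand-ins for the uncomputable Kolmogorov complexities, and the helper generate is ours, introduced only for illustration:

import math

x = "10" * 7  # the observed segment from the example

# The hypothesis H: "x is some number of repetitions of the block 10",
# written as a tiny generating rule.
def generate(n):
    return "10" * n

# Proxy for K(H): the size of the generating rule's text.
bits_for_H = 8 * len('return "10" * n')

# Proxy for K(x|H): the repetition count n = 7, about log2(7) bits.
n = len(x) // 2
bits_given_H = math.ceil(math.log2(n + 1))

assert generate(n) == x  # H plus the count reconstructs x exactly
print(bits_for_H, "proxy bits of knowledge,", bits_given_H, "proxy bits of noise")

Under a weaker hypothesis such as “x is some 14-bit string”, the knowledge part would shrink while the noise part would grow to about 14 bits, which is the trade-off mentioned above.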

This work is close in spirit to Kolmogorov’s 1974 proposal to found statistical theory on finite combinatorial and computational principles independent of probabilistic assumptions, taking as its starting point the relation between the individual data and its explanation (model or hypothesis), as expressed by Kolmogorov’s structure function.

Kolmogorov’s Approach to Non-probabilistic Statistics The approach is stated as the relation between an individual data sample and a specific constrained data model, expressed by Kolmogorov’s structure function Φ(·). Let data be finite binary strings and models be finite sets of binary strings. Consider model classes consisting of models of a given maximal Kolmogorov complexity.

Kolmogorov’s structure function Φ_x(k) of the given data x expresses the relation between the complexity constraint k on a model class and the least log-cardinality of a model in the class containing the data: Φ_x(k) = min_S {log |S| : S ∋ x, K(S) ≤ k}.
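
To make the definition concrete, here is a small Python sketch that evaluates Φ_x(k) over an explicit, hand-picked family of models for the string x from the earlier example. Since K(S) is uncomputable, the complexity value attached to each model is an assumption chosen for illustration only:

import math

x = "10" * 7  # the 14-bit segment from the simple example

# Candidate models: (assumed complexity in bits standing in for K(S), the set S).
models = [
    (4, {format(i, "014b") for i in range(2 ** 14)}),  # "any 14-bit string"
    (6, {"10" * i for i in range(1, 33)}),             # "some repetitions of 10"
    (8, {x}),                                          # "exactly x" (a singleton)
]

def structure_function(x, models, k):
    """Phi_x(k) = min { log2|S| : S contains x, assumed K(S) <= k }."""
    candidates = [math.log2(len(S)) for c, S in models if x in S and c <= k]
    return min(candidates) if candidates else math.inf

for k in (3, 4, 6, 8):
    print(k, structure_function(x, models, k))  # inf, 14.0, 5.0, 0.0

The curve is non-increasing: the more model bits we allow, the less residual log-cardinality (noise) is needed to pin down x.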

Kolmogorov explains: To each constructive object x corresponds a function Φ_x(k) of a natural number k, the log of minimal cardinality of x-containing sets that allow definitions of complexity at most k. If the element x itself allows a simple definition, then the function Φ drops to 0 even for small k. Lacking such a definition, the element is random in the negative sense. But it is positively probabilistically random only when the function Φ, having taken the value Φ_0 at a relatively small k = k_0, then changes approximately as Φ_x(k) = Φ_0 − (k − k_0).

This function Φ_x(k), its variants, and its relation to model selection have been the subject of numerous publications, but in my opinion it had not previously been well understood. We view Kolmogorov’s structure function as a rewrite of Bayes’ rule using Solomonoff’s universal distribution, as explained earlier.

Understanding Kolmogorov’s Structure Function Φ_x(k), the log of minimal cardinality of x-containing sets that allow definitions of complexity at most k, is a particular case of K(x|H), where H is a finite set containing x and K(H) ≤ k. Thus we interpret Φ_x(k) as a measure of the amount of accidental information (noise) in x when x is bound to a model of Kolmogorov complexity at most k. If x is typical of a finite set S, then we expect K(x|S) to be about log |S|.

The terms Φ_0 and k_0 in Kolmogorov’s structure function correspond to a hypothesis H_0 of small Kolmogorov complexity k_0 which explains nothing about x. In this case I(H_0 : x) = 0, which leads to K(x|H_0) = K(x) and K(H_0|x) = K(H_0) = k_0. So the approximation K(x) = K(H) + K(x|H) − K(H|x), or equivalently K(x|H) = K(x) − (K(H) − K(H|x)), would degenerate to Kolmogorov’s structure function Φ_x(k) = Φ_0 − (k − k_0).
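
A short check of this identification (a sketch under the slide’s approximations, writing I(H : x) = K(H) − K(H|x) for the algorithmic mutual information):

\[
K(x \mid H) = K(x) - \big(K(H) - K(H \mid x)\big) = K(x) - I(H : x).
\]

At k = k_0 the hypothesis shares no information with x, so I(H_0 : x) = 0 and Φ_0 = K(x|H_0) = K(x). If, for k > k_0, every additional bit of model complexity is information about x, then I(H : x) ≈ k − k_0 and

\[
\Phi_x(k) \approx K(x) - (k - k_0) = \Phi_0 - (k - k_0).
\]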

In general, for any hypothesis H for x of Kolmogorov complexity at most k, we have K(x) = K(x|H) + K(H) − K(H|x) ≤ K(x|H) + K(H), and for a model attaining Φ_x(k) this is at most Φ_x(k) + k. Thus Φ_x(k) ≥ K(x) − k. This explains why Kolmogorov drew a picture of Φ_x(k) as a function of k monotonically approaching the diagonal (the sufficiency line L(k), on which Φ_x(k) + k = K(x)). This diagonal line corresponds to the minimum value of Φ_x(k), attained when there exists some H of Kolmogorov complexity at most k such that K(x) = k + Φ_x(k).
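
A slightly more explicit version of the inequality chain, written for a set S that attains Φ_x(k) and ignoring additive logarithmic terms:

\[
K(x) \le K(S) + K(x \mid S) \le k + \log|S| = k + \Phi_x(k),
\]

since x can be reconstructed from a description of S followed by the index of x inside S. Rearranging gives Φ_x(k) ≥ K(x) − k, the diagonal bound stated above.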

Such an H, with k = K(H), is called a sufficient statistic for x, and the expression k + Φ_x(k) is treated as a two-part code separating the meaningful information in x, represented by k, from the meaningless accidental information (noise) in x given the hypothesis H.

A Simple Derivation of a Fundamental Result Vitányi’s Best Fit Function: The randomness deficiency δ(x|S) of a string x in the set S is defined by δ(x|S) = log |S| − K(x|S) for x ∈ S, and ∞ otherwise. The minimal randomness deficiency function is β_x(k) = min_S {δ(x|S) : S ∋ x, K(S) ≤ k}. A model S for which x incurs the deficiency β_x(k) is a best-fit model. We say S is optimal for x and K(S|x) ≈ 0.

Rissanen’s Minimum Description Length Function: Consider the two-part code for x consisting of the constrained model cost K(S) and the length log |S| of the index of x in S. The MDL function is λ_x(k) = min_S {K(S) + log |S| : S ∋ x, K(S) ≤ k}.

The results in [Vereshchagin and Vitányi 2004] are obtained by analysis of the relation between the three structure functions β_x(k), Φ_x(k), and λ_x(k). The most fundamental result there is the equality β_x(k) = Φ_x(k) + k − K(x) = λ_x(k) − K(x), which holds within additive terms that are logarithmic in |x|. This result improves on a previous result by Gács, Tromp, and Vitányi (2001), who proved that β_x(k) ≤ Φ_x(k) + k − K(x) + O(1) and mentioned that it would be nice to have an inequality in the other direction as well.

We understand the structure functions Φ_x(k) and β_x(k) as being equivalent to K(x|S) and K(S|x), respectively, where K(S) ≤ k. Using the approximation K(x) = K(S) + K(x|S) − K(S|x), or equivalently K(x|S) + K(S) = K(x) + K(S|x), gives the equality Φ_x(k) + k = K(x) + β_x(k) = λ_x(k).
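
Spelling out the substitution (a sketch, with S the minimizing model at level k, so that K(S) ≈ k, K(x|S) ≈ Φ_x(k), and K(S|x) ≈ β_x(k), all up to the logarithmic terms mentioned above):

\[
K(x \mid S) + K(S) = K(x) + K(S \mid x)
\quad\Longrightarrow\quad
\Phi_x(k) + k \approx K(x) + \beta_x(k),
\]

and since the two-part code length of this S is K(S) + log |S| ≈ k + Φ_x(k), the same quantity is approximately λ_x(k), recovering β_x(k) = Φ_x(k) + k − K(x) = λ_x(k) − K(x).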

We mention that the approach used in the previous two references relies on a much more complicated argument in which a shortest program for a string x is assumed to be divisible into two parts, the model part (of length K(S)) and the data-to-model part (of length K(x|S)), which is a very difficult division to carry out. Gács and Vitányi credited the Invariance Theorem for this deep and useful fact. This view leads to the equation K(x) = min_T {K(T) + K(x|T) : T ∈ {T_0, T_1, …}}, which holds up to additive constants, where T_0, T_1, … is the standard enumeration of Turing machines.

The whole theory of algorithmic statistics is based on this interpretation of K(x) as the length of a shortest two-part code for x. We argue that the use of the Invariance Theorem to suggest a two-part code for an object is too artificial. We prefer the three-part code suggested by Bayes’ rule, and we think of the two-part code as an approximation of the three-part code in which the model is considered to be optimal.

Thank You