A Syntactic Justification of Occam’s Razor
John Woodward, Jerry Swan
Foundations of Reasoning Group, University of Nottingham Ningbo, China (宁波诺丁汉大学)

Overview
- Occam’s Razor
- Definitions
- Sampling of Program Spaces (Langdon)
- Two Assumptions
- Proof of Occam’s Razor
- Further Work
- Context

Occam’s Razor
- Occam’s Razor has been adopted by the machine learning community to mean:
- “Given two hypotheses which agree with the observed data, pick the simplest, as this is more likely to make the correct predictions.”

Definitions
- Hypothesis (program): a program, i.e. an effective procedure which allows us to make predictions.
- Size: the number of bits needed to represent a hypothesis.
- Concept (function): a set of predictions (input-output pairs).
- Complexity: the size of the smallest program which computes a given function.
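To make these definitions concrete, here is a minimal Python sketch (our own toy example; the boolean expressions and the symbol-count “sizes” are illustrative stand-ins, not anything from the slides). Three hypotheses denote the same concept, and the complexity of that concept is the size of the smallest of them.

```python
# A toy illustration of the definitions (our own example): hypotheses are
# boolean expressions over x and y, a concept is the set of input-output
# pairs a hypothesis produces, and complexity is the size of the smallest
# hypothesis computing a given concept.  The "sizes" are symbol counts,
# standing in for a bit count.
programs = {
    "x and y": 3,
    "y and x": 3,
    "(x and y) or (x and y)": 9,
}

def concept(expr):
    """The set of predictions (input-output pairs) made by a hypothesis."""
    return frozenset(((x, y), int(eval(expr, {"x": x, "y": y})))
                     for x in (0, 1) for y in (0, 1))

# All three hypotheses denote the same concept; its complexity is the size
# of the smallest program representing it.
target = concept("x and y")
complexity = min(size for expr, size in programs.items()
                 if concept(expr) == target)
print(complexity)   # 3
```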

From Hypotheses to Concepts.

Langdon 1 (Foundations of Genetic Programming)
- The limiting distribution of functions is independent of program size.
- There is a correlation between a function’s frequency in the limiting distribution and its complexity.
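The following small simulation is in the spirit of Langdon’s sampling experiments, but under our own assumptions (nand-only expression trees over two inputs, grouped by the number of nand nodes); it is a sketch, not Langdon’s exact setup. For each program size it prints the fraction of programs computing the three most common truth tables; those fractions settle down as programs grow larger.

```python
# Sketch: count (without listing) all nand-trees over inputs x and y with
# exactly n nand nodes, and look at how often each boolean function
# (truth table on the four input pairs) is computed.
from collections import Counter
from itertools import product

INPUTS = list(product([0, 1], repeat=2))          # (0,0), (0,1), (1,0), (1,1)
X = tuple(x for x, _ in INPUTS)                   # truth table of the leaf "x"
Y = tuple(y for _, y in INPUTS)                   # truth table of the leaf "y"

def nand(a, b):
    return tuple(1 - (u & v) for u, v in zip(a, b))

# counts[n] maps truth table -> number of programs with exactly n nand nodes
counts = [Counter([X, Y])]
for n in range(1, 8):
    level = Counter()
    for i in range(n):                            # split nodes between subtrees
        for left, nl in counts[i].items():
            for right, nr in counts[n - 1 - i].items():
                level[nand(left, right)] += nl * nr
    counts.append(level)
    total = sum(level.values())
    top = {t: round(c / total, 3) for t, c in level.most_common(3)}
    print(f"size {n}: {top}")
```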

Langdon 2 (Foundations of Genetic Programming)

Hypothesis-Concept Spaces

Notation
- P is the hypothesis space (i.e. a set of programs).
- |P| is the size of the hypothesis space (i.e. the cardinality of the set of programs).
- F is the concept space (i.e. the set of functions represented by the programs in P).
- |F| is the size of the concept space (i.e. the cardinality of the set of functions).
- If two programs pi and pj map to the same function (i.e. they are interpreted as the same function, I(pi) = f = I(pj)), they belong to the same equivalence class (i.e. pi is in [pj] ↔ I(pi) = I(pj)).
- The notation [p] denotes the equivalence class which contains the program p (i.e. if I(pi) = I(pj) then [pi] = [pj]). The size of an equivalence class [p] is denoted by |[p]|.
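A minimal sketch of this notation over a hypothetical eight-program hypothesis space (the choice of expressions is ours, purely for illustration): the interpretation I maps each program to the truth table it computes, and programs with the same interpretation fall into the same equivalence class.

```python
# P: a toy hypothesis space of eight boolean expressions over x and y.
# I(p): the function a program computes, represented as its truth table.
# [p]: the equivalence class of programs with the same interpretation.
from collections import defaultdict
from itertools import product

INPUTS = list(product([0, 1], repeat=2))

def I(p):
    """Interpretation: the truth table of program p over all (x, y)."""
    return tuple(int(eval(p, {"x": x, "y": y})) for x, y in INPUTS)

P = ["x", "y", "x and y", "y and x", "x or y",
     "not (x and y)", "not x or not y", "(x and y) or (x and y)"]

classes = defaultdict(list)              # f -> [p], the equivalence class
for p in P:
    classes[I(p)].append(p)

F = list(classes)                        # the concept space
print("|P| =", len(P), " |F| =", len(F))
for f, eq in classes.items():
    print(f, "|[p]| =", len(eq), eq)
```

Running it shows |P| = 8 and |F| = 5, with the class of “x and y” containing three syntactically different programs.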

Two assumptions
1. We sample the hypothesis space uniformly, so the probability of sampling any given program is 1/|P|.
2. There are fewer hypotheses that represent complex functions: |[p1]| > |[p2]| ↔ c(f1) < c(f2), where I(p1) = f1 and I(p2) = f2.
- Note that |[p1]|/|P| = p(I(p1)) = p(f1); that is, the probability of sampling a function is the size of the equivalence class containing all the programs which are interpreted as that function, divided by the size of the hypothesis space.
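Written out as a formula, with illustrative numbers of our own (they match the toy eight-program space sketched above, not anything in the slides):

```latex
p(f) = \frac{|[p]|}{|P|}, \qquad \text{e.g. } |P| = 8,\; |[p_1]| = 3,\; |[p_2]| = 1
\;\Rightarrow\; p(f_1) = \tfrac{3}{8} > p(f_2) = \tfrac{1}{8}.
```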

Proof
- Start from a statement of assumption 2:
  |[p1]| > |[p2]| ↔ c(f1) < c(f2)
- Divide the left-hand side of the equivalence by |P|:
  |[p1]|/|P| > |[p2]|/|P| ↔ c(f1) < c(f2)
- Since |[p1]|/|P| = p(I(p1)) = p(f1), we can rewrite this as
  p(f1) > p(f2) ↔ c(f1) < c(f2)
- This is a statement of Occam’s razor.
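The same derivation written out as a chain of equivalences (the notation follows the slides; the LaTeX layout is ours):

```latex
\begin{aligned}
|[p_1]| > |[p_2]| \;&\leftrightarrow\; c(f_1) < c(f_2)
    && \text{(assumption 2)} \\
\frac{|[p_1]|}{|P|} > \frac{|[p_2]|}{|P|} \;&\leftrightarrow\; c(f_1) < c(f_2)
    && \text{(divide by } |P| > 0\text{)} \\
p(f_1) > p(f_2) \;&\leftrightarrow\; c(f_1) < c(f_2)
    && \text{(assumption 1: } p(f_i) = |[p_i]| / |P|\text{)}
\end{aligned}
```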

Restatement of Occam’s Razor
- Occam’s Razor is often stated as “prefer the shortest consistent hypothesis”.
- Our restatement of Occam’s Razor: “The preferred function is the one that is represented most frequently. The equivalence class which contains the shortest program is represented most frequently.”

Summary
- Occam’s razor states “pick the simplest hypothesis consistent with the data”.
- We agree, but for a different reason.
- Restatement: pick the function that is represented most frequently (i.e. belongs to the largest equivalence class). Occam’s razor is concerned with probability, and we have presented a simple counting argument.
- Unlike some interpretations of Occam’s razor, we do not discard the more complex hypotheses; we count them in [p].
- We offer no reason to believe the world is simple; our razor only gives a reason to predict using the simplest hypothesis.

Further work
- There are fewer hypotheses that represent complex functions: |[p1]| > |[p2]| ↔ c(f1) < c(f2).
- Why are some functions represented more frequently than other functions?
- The base functions may include functions which are:
  - symmetric, i.e. f(x, y) = f(y, x), e.g. nand;
  - complementary, i.e. f(g(x)) = g(f(x)), e.g. inc and dec.
  (See the sketch below for a small illustration.)
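A small check (our own illustration, using Python integers as stand-ins for the primitives named above): a symmetric primitive such as nand behaves identically under argument swapping, and a complementary pair such as inc and dec commutes, so in both cases several distinct programs compute the same function and land in the same equivalence class.

```python
def nand(a, b):
    return 1 - (a & b)

def inc(n):
    return n + 1

def dec(n):
    return n - 1

# Symmetry: nand(x, y) and nand(y, x) compute the same function.
print(all(nand(x, y) == nand(y, x) for x in (0, 1) for y in (0, 1)))    # True

# Complementarity: inc(dec(x)) and dec(inc(x)) compute the same function.
print(all(inc(dec(n)) == dec(inc(n)) for n in range(-5, 6)))            # True
```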

Further work
- Prove our assumptions.
- Does the result depend on the primitive set?
- How are the primitive functions linked together (e.g. trees, lists, directed acyclic graphs)? Does this make a difference to the result?

How does nature compute?
- Heuristics such as Occam’s razor need not be explicitly present as rules.
- Random searches of an agent’s generating capacity may implicitly carry such heuristics.
- Axiomatic reasoning probably comes late.

Thanks & Questions?
1) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.
2) Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
3) William B. Langdon. Scaling of program fitness spaces. Evolutionary Computation, 7(4), 1999.
4) Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
5) S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall.
6) G. I. Webb. Generality is more significant than complexity: Toward an alternative to Occam’s razor. In 7th Australian Joint Conference on Artificial Intelligence – Artificial Intelligence: Sowing the Seeds for the Future, pages 60-67, Singapore, 1994. World Scientific.
7) Ming Li and Paul Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications (2nd Ed.). Springer-Verlag, 1997.