A Syntactic Justification of Occam’s Razor
John Woodward, Andy Evans, Paul Dempster
Foundations of Reasoning Group, University of Nottingham Ningbo, China (宁波诺丁汉大学)

Overview
 Occam’s Razor
 Sampling of Program Spaces (Langdon)
 Definitions
 Assumptions
 Proof
 Further Work
 Context

Occam’s Razor
 Occam’s Razor has been adopted by the machine learning community to mean:
 “Given two hypotheses which agree with the observed data, pick the simplest, as this is more likely to make the correct predictions.”

Definitions
 Program (hypothesis)
 Size
 Function (set of predictions, i.e. concept)
 Complexity


Langdon 1 (Foundations of Genetic Programming)
1. The limiting distribution of functions is independent of program size!
There is a correlation between a function’s frequency in the limiting distribution and its complexity.
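Langdon’s claim can be illustrated with a small Monte Carlo sketch. This is not from the slides: it assumes programs are random NAND expression trees over two Boolean inputs x and y, and the sizes and sample counts are arbitrary choices. A "function" here is the truth table a tree computes.

```python
# Illustrative sketch only: compare the distribution of functions induced by
# random NAND trees at two quite different program sizes.
import random
from collections import Counter

def random_tree(n_internal):
    """Return a random expression with n_internal NAND nodes over the leaves x, y."""
    if n_internal == 0:
        return random.choice(['x', 'y'])
    left = random.randint(0, n_internal - 1)          # nodes allotted to the left subtree
    return ('nand', random_tree(left), random_tree(n_internal - 1 - left))

def evaluate(expr, x, y):
    if expr == 'x':
        return x
    if expr == 'y':
        return y
    _, a, b = expr
    return 1 - (evaluate(a, x, y) & evaluate(b, x, y))  # NAND of the two subtrees

def truth_table(expr):
    """The function an expression computes, as a tuple over all four inputs."""
    return tuple(evaluate(expr, x, y) for x in (0, 1) for y in (0, 1))

def function_distribution(size, samples=20000):
    counts = Counter(truth_table(random_tree(size)) for _ in range(samples))
    return {f: c / samples for f, c in counts.items()}

if __name__ == '__main__':
    for size in (5, 15):                              # two quite different program sizes
        top = sorted(function_distribution(size).items(), key=lambda kv: -kv[1])[:4]
        print(f"size {size}: most frequent functions {top}")
```

If the claim holds in this toy setting, the most frequent truth tables and their relative frequencies look broadly similar at both sizes.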

Langdon 2 (Foundations of Genetic Programming)

Hypothesis-Concept Spaces

Notation
 P is the hypothesis space (i.e. a set of programs).
 |P| is the size of the space (i.e. the cardinality of the set of programs).
 F is the concept space (i.e. the set of functions represented by the programs in P).
 |F| is the size of the space (i.e. the cardinality of the set of functions).
 If two programs pi and pj map to the same function (i.e. they are interpreted as the same function, I(pi) = f = I(pj)), they belong to the same equivalence class (i.e. pi is in [pj] ↔ I(pi) = I(pj)). The notation [p] denotes the equivalence class which contains the program p (i.e. given I(pi) = I(pj), [pi] = [pj]). The size of an equivalence class [p] is denoted |[p]|.
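The following sketch, which is purely illustrative (the choice of NAND expression trees over x and y, and the size bound, are assumptions, not part of the slides), makes the notation concrete by enumerating a tiny hypothesis space P, interpreting each program as the truth table it computes, and grouping programs into equivalence classes.

```python
# P is a tiny space of NAND expression trees over x and y, I(p) is the truth
# table p computes, and [p] groups the programs with equal I(p).
from collections import defaultdict

def programs_exact(n):
    """All NAND trees with exactly n internal nodes over the leaves x, y."""
    if n == 0:
        return ['x', 'y']
    trees = []
    for left in range(n):
        for a in programs_exact(left):
            for b in programs_exact(n - 1 - left):
                trees.append(('nand', a, b))
    return trees

def evaluate(expr, x, y):
    if expr == 'x':
        return x
    if expr == 'y':
        return y
    _, a, b = expr
    return 1 - (evaluate(a, x, y) & evaluate(b, x, y))   # NAND

def I(p):
    """Interpretation: the function (truth table over all inputs) p represents."""
    return tuple(evaluate(p, x, y) for x in (0, 1) for y in (0, 1))

P = [t for n in range(4) for t in programs_exact(n)]   # hypothesis space P
classes = defaultdict(list)                            # f -> equivalence class [p]
for p in P:
    classes[I(p)].append(p)

print("|P| =", len(P), "   |F| =", len(classes))
for f, members in sorted(classes.items(), key=lambda kv: -len(kv[1])):
    print("f =", f, "  |[p]| =", len(members))
```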

Two assumptions
1. We sample the hypothesis space uniformly: the probability of sampling a given program is 1/|P|.
2. There are fewer hypotheses that represent complex functions:
 |[p1]| > |[p2]| ↔ c(f1) < c(f2), where I(p1) = f1 and I(p2) = f2.
 Note that |[p1]|/|P| = p(I(p1)) = p(f1), that is |[p1]|/|P| = p(f1),
 i.e. the probability of sampling a function is the ratio of the size of the equivalence class containing all the programs which are interpreted as that function to the size of the hypothesis space.
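The note attached to the assumptions can be spelled out as a one-line counting step (written out here for clarity; under Assumption 1 every program is sampled with probability 1/|P|):

```latex
\Pr(f_1) \;=\; \sum_{p \,:\, I(p) = f_1} \Pr(p)
        \;=\; \sum_{p \in [p_1]} \frac{1}{|P|}
        \;=\; \frac{|[p_1]|}{|P|}
```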

Proof
 Starting from a statement of Assumption 2:
 |[p1]| > |[p2]| ↔ c(f1) < c(f2)
 Dividing both sides of the inequality by |P|:
 |[p1]|/|P| > |[p2]|/|P| ↔ c(f1) < c(f2)
 As |[p1]|/|P| = p(I(p1)) = p(f1), we can rewrite this as
 p(f1) > p(f2) ↔ c(f1) < c(f2)
 which is a mathematical statement of Occam’s razor.
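The same chain of steps, set out as a compact derivation (no new content, just the slide's algebra in LaTeX):

```latex
\begin{aligned}
|[p_1]| > |[p_2]| \;&\Leftrightarrow\; c(f_1) < c(f_2)
  && \text{(Assumption 2)}\\[2pt]
\frac{|[p_1]|}{|P|} > \frac{|[p_2]|}{|P|} \;&\Leftrightarrow\; c(f_1) < c(f_2)
  && \text{(divide the class sizes by } |P| > 0\text{)}\\[2pt]
p(f_1) > p(f_2) \;&\Leftrightarrow\; c(f_1) < c(f_2)
  && \text{(since } p(f_i) = |[p_i]| / |P|\text{)}
\end{aligned}
```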

Restatement of Occam’s Razor
 Often stated as “prefer the shortest consistent hypothesis”.
 Restatement of Occam’s Razor: the preferred function is the one that is represented most frequently. The equivalence class which contains the shortest program is the one represented most frequently.

Summary
 Occam’s razor states “pick the simplest hypothesis consistent with the data”.
 We agree, but for a different reason.
 Restatement: pick the function that is represented most frequently (i.e. belongs to the largest equivalence class). Occam’s razor is concerned with probability, and we present a simple counting argument.
 Unlike many interpretations of Occam’s razor, we do not throw out the more complex hypotheses; we count them in [p].
 We offer no reason to believe the world is simple; our razor only gives a reason to predict using the simplest hypothesis.

Further work: to prove Assumption 2
 There are fewer hypotheses that represent complex functions: |[p1]| > |[p2]| ↔ c(f1) < c(f2).
 Why are some functions represented more frequently than other functions?
 The base functions may contain functions which are:
 Symmetrical, i.e. f(x, y) = f(y, x), e.g. nand.
 Complementary, i.e. f(g(x)) = g(f(x)), e.g. inc and dec.
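As a first step toward Assumption 2, the same toy NAND space as above can be enumerated to compare, for each function, the size of its shortest program (a stand-in for c(f)) with the size of its equivalence class |[p]|, and to exhibit the symmetry effect mentioned on the slide. The representation and size bound are again illustrative assumptions, not part of the original argument.

```python
# Hedged sketch: in a tiny space of NAND trees, list |[p]| next to the size of
# the shortest program in each class, and show that symmetry of nand puts
# syntactically distinct programs into the same equivalence class.
from collections import defaultdict

def programs_exact(n):
    """All NAND trees with exactly n internal nodes over the leaves x, y."""
    if n == 0:
        return ['x', 'y']
    trees = []
    for left in range(n):
        for a in programs_exact(left):
            for b in programs_exact(n - 1 - left):
                trees.append(('nand', a, b))
    return trees

def size(expr):
    """Number of NAND nodes in a program (our measure of program size)."""
    return 0 if isinstance(expr, str) else 1 + size(expr[1]) + size(expr[2])

def evaluate(expr, x, y):
    if expr == 'x':
        return x
    if expr == 'y':
        return y
    _, a, b = expr
    return 1 - (evaluate(a, x, y) & evaluate(b, x, y))

def I(p):
    return tuple(evaluate(p, x, y) for x in (0, 1) for y in (0, 1))

P = [t for n in range(4) for t in programs_exact(n)]
classes = defaultdict(list)
for p in P:
    classes[I(p)].append(p)

# Class size |[p]| versus the shortest program in the class (proxy for c(f)).
for f, members in sorted(classes.items(), key=lambda kv: -len(kv[1])):
    print("f =", f, " |[p]| =", len(members), " min size =", min(map(size, members)))

# Two syntactically distinct programs, one equivalence class (symmetry of nand).
p1, p2 = ('nand', 'x', 'y'), ('nand', 'y', 'x')
print(p1 != p2, I(p1) == I(p2))   # prints: True True
```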

Further work
 Further work: to prove our assumptions.
 Does it depend on the primitive set?
 How are the primitives linked together (e.g. trees, lists, directed acyclic graphs, ...)?

How does nature compute?
 Heuristics such as Occam’s razor need not be explicitly present as rules.
 Random searches of an agent’s generating capacity may implicitly carry heuristics.
 Axiomatic reasoning probably comes late.

Thanks & Questions?
1) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.
2) Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
3) William B. Langdon. Scaling of program fitness spaces. Evolutionary Computation, 7(4), 1999.
4) Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
5) S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall.
6) G. I. Webb. Generality is more significant than complexity: Toward an alternative to Occam’s razor. In 7th Australian Joint Conference on Artificial Intelligence – Artificial Intelligence: Sowing the Seeds for the Future, pages 60–67, Singapore, 1994. World Scientific.
7) Ming Li and Paul Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.). Springer-Verlag, 1997.