
Two Broad Classes of Functions for Which a No Free Lunch Result Does Not Hold
Matthew J. Streeter
Genetic Programming, Inc., Mountain View, California
GECCO 2003, July, Chicago

Overview
– Definitions & prior work
– Bayesian formulation of No Free Lunch & its implications
– No Free Lunch and description length
– No Free Lunch and infinite sets

Definitions (search algorithm framework)
– X = set of points (individuals) in the search space
– Y = set of cost (fitness) values
– x_i = an individual; y_i = the fitness of x_i
– A search algorithm takes as input a list {(x_0, y_0), (x_1, y_1), …, (x_n, y_n)} of previously visited points and their cost values, and produces the next point x_{n+1} as output
– We consider only deterministic, non-retracing search algorithms, but the conclusions can be extended to stochastic, potentially retracing algorithms using the arguments of Wolpert and Macready (1997)
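
To make this interface concrete, here is a minimal Python sketch (not from the slides; the names and toy algorithms are illustrative): a deterministic, non-retracing search algorithm is simply a function from the history of visited (point, cost) pairs to an unvisited point.

# Minimal sketch of the search-algorithm interface described above.
# Assumption (not in the slides): X is a small finite list of ints and a
# search algorithm is any function (history, X) -> next unvisited point.
from typing import Callable, List, Tuple

History = List[Tuple[int, int]]                        # [(x_0, y_0), ..., (x_n, y_n)]
SearchAlgorithm = Callable[[History, List[int]], int]  # returns x_{n+1}

def in_order(history: History, X: List[int]) -> int:
    """Trivial deterministic, non-retracing algorithm: visit X in fixed order."""
    visited = {x for x, _ in history}
    return next(x for x in X if x not in visited)

def near_best(history: History, X: List[int]) -> int:
    """Another deterministic algorithm: sample the unvisited point closest to
    the best point seen so far (a caricature of local search)."""
    if not history:
        return X[0]
    visited = {x for x, _ in history}
    best_x = max(history, key=lambda pair: pair[1])[0]
    return min((x for x in X if x not in visited), key=lambda x: abs(x - best_x))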

Definitions (search algorithm framework)
– A = a search algorithm
– F = a set of functions to be optimized
– P = a probability distribution over F
– f = a particular function in F
– Performance vector V(A, f) ≡ the vector of cost (fitness) values that A receives when run against f
– Performance measure M(V(A, f)) ≡ a function that assigns scores to performance vectors
– Overall performance M_O(A) ≡ the expected value of M(V(A, f)) when f is drawn from F according to P
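
For a small, fully enumerable F with uniform P, these quantities can be computed directly. The sketch below is illustrative (the helper names and the choice of "best fitness found" as M are mine, not the slides').

# Computing V(A, f), M, and M_O(A) by brute force over a tiny (F, P).
# Assumptions (not in the slides): X = {0,1,2,3}, Y = {0,1}, P uniform over
# all functions X -> Y, and M = best fitness found.
import itertools
from statistics import mean

X = [0, 1, 2, 3]
Y = [0, 1]
F = [dict(zip(X, ys)) for ys in itertools.product(Y, repeat=len(X))]  # all f: X -> Y

def performance_vector(algorithm, f, n_steps=3):
    """V(A, f): the cost values A receives when run against f for n_steps."""
    history = []
    for _ in range(n_steps):
        x = algorithm(history, X)
        history.append((x, f[x]))
    return [y for _, y in history]

def M(vector):
    """An example performance measure: best fitness found."""
    return max(vector)

def overall_performance(algorithm):
    """M_O(A): expected M(V(A, f)) with f drawn uniformly from F."""
    return mean(M(performance_vector(algorithm, f)) for f in F)

def in_order(history, X):
    visited = {x for x, _ in history}
    return next(x for x in X if x not in visited)

# Because this F is the set of all functions X -> Y, every algorithm gets the
# same overall performance (a No Free Lunch situation).
print(overall_performance(in_order))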

Definitions (search algorithm framework)
A No Free Lunch result applies to (F, P) iff, for any performance measure M and any algorithms A and B, M_O(A) = M_O(B)

Definitions (permutations of functions)
– A permutation σf of a function f is a rearrangement of f's y values with respect to its x values
– A set of functions F is closed under permutation iff, for any f ∈ F and any permutation σf of f, σf ∈ F
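
A brute-force sketch (illustrative, not from the slides) of checking closure under permutation: representing each function on a fixed domain as a tuple of y values, F is closed under permutation exactly when it contains every rearrangement of each of its members.

# Brute-force check of closure under permutation for a small function set.
from itertools import permutations

def closed_under_permutation(F):
    """F: a set of functions on a fixed finite domain, each given as a tuple of y values."""
    return all(tuple(p) in F for f in F for p in permutations(f))

# The set of all needle-in-a-haystack functions on a 3-point domain is closed:
needles = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}
print(closed_under_permutation(needles))         # True

# A single non-constant function by itself is not:
print(closed_under_permutation({(0, 0, 1, 2)}))  # False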

Some No Free Lunch results
– NFL holds for (F, P) where P is uniform and F is the set of all functions f: X → Y, with X and Y finite (Wolpert and Macready 1997)
– NFL holds for (F, P) where P is uniform iff F is closed under permutation (Schumacher, Vose and Whitley 2001)

Known limitations of NFL
– Does not hold for some classes of extremely simple functions (Droste, Jansen, and Wegener 1999)
– Under certain restrictions on the "successor graph", SUBMEDIAN-SEEKER outperforms random search (Christensen and Oppacher 2001)
– Does not hold when F has fewer than the maximum number of local optima or less than the maximum steepness (Igel and Toussaint 2001)

Bayesian formulation of NFL
– Let f be a function drawn at random from F under probability distribution P_F
– Consider running a search algorithm for n steps to obtain the set of (point, cost) pairs: S = {(x_0, y_0), (x_1, y_1), …, (x_n, y_n)}
– Let P(f(x_i) = y | S) denote the conditional probability that f(x_i) = y, given our knowledge of S

Bayesian formulation of NFL
A No Free Lunch result holds for (F, P_F) if and only if:
P(f(x_i) = y | S) = P(f(x_j) = y | S)
for all x_i, x_j, S, and y, where x_i and x_j have not yet been visited
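
With F enumerable and P_F uniform, P(f(x) = y | S) is simply the fraction of functions consistent with S that assign y to x, so this condition can be checked directly. A small illustrative Python sketch (names and example sets are mine, not the slides'):

# Checking the Bayesian NFL condition by enumerating a tiny function class.
# Assumptions (not in the slides): X = {0,1,2,3}, Y = {0,1}, uniform P_F.
import itertools

X = [0, 1, 2, 3]
Y = [0, 1]
F_all = [dict(zip(X, ys)) for ys in itertools.product(Y, repeat=len(X))]  # closed under permutation

def cond_prob(F, S, x, y):
    """P(f(x) = y | S) under a uniform prior over F."""
    consistent = [f for f in F if all(f[xi] == yi for xi, yi in S)]
    return sum(f[x] == y for f in consistent) / len(consistent)

S = [(1, 0)]                                 # observed: f(1) = 0
unvisited = [x for x in X if x != 1]

# Permutation-closed F: the probabilities agree across unvisited points (NFL holds).
print([cond_prob(F_all, S, x, 0) for x in unvisited])       # [0.5, 0.5, 0.5]

# Non-closed F (monotone non-decreasing functions only): they differ (NFL fails).
F_mono = [f for f in F_all if all(f[X[i]] <= f[X[i + 1]] for i in range(len(X) - 1))]
print([cond_prob(F_mono, S, x, 0) for x in unvisited])      # [1.0, 0.666..., 0.333...]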

Proof (If)
By induction:
– Base case: If S = {}, P(f(x_i) = y) = P(f(x_j) = y), so the search algorithm has no influence on the first cost (fitness) value obtained
– Inductive step: If S = {(x_0, y_0), (x_1, y_1), …, (x_k, y_k)} and the search algorithm cannot influence the first k cost values, then P(f(x_i) = y | S) = P(f(x_j) = y | S), so the search algorithm cannot influence y_{k+1}
Because the list of cost values is independent of the search algorithm, performance is independent of the search algorithm

Proof (Only if)
– Assume that for some x_i, x_j, S, y, P(f(x_i) = y | S) ≠ P(f(x_j) = y | S), where S = {(x_0, y_0), (x_1, y_1), …, (x_n, y_n)}; without loss of generality, P(f(x_i) = y | S) > P(f(x_j) = y | S)
– Consider a performance measure M that only rewards performance vectors beginning with the prefix y_0, y_1, …, y_n, y
– We can construct search algorithms A and B that behave identically, except that A samples x_i after observing S whereas B samples x_j
– It follows that M_O(A) > M_O(B), so NFL does not hold

Analysis
P(f(x_i) = y | S) = P(f(x_j) = y | S) implies that:
– E[|f(x_i) - f(x_j)|] = E[|f(x_k) - f(x_l)|]: no correlation between genotypic similarity and similarity of fitness
– P(y_min ≤ f(x_i) ≤ y_max | S) = P(y_min ≤ f(x_j) ≤ y_max | S): we cannot use S to estimate the probability of unseen points lying in some fitness interval [y_min, y_max]

Contrast with Holland's assumptions
– Holland assumed that by sampling a schema, you can extrapolate about the fitness of unseen instances of that schema; this implies P(y_min ≤ f(x_i) ≤ y_max | S) ≠ P(y_min ≤ f(x_j) ≤ y_max | S)
– The objection has been made before that the NFL assumptions violate Holland's assumption (Droste, Jansen, and Wegener 1999), but I have just shown that NFL holds only when you violate this assumption

No Free Lunch and Description Length

Definitions
– The description length of a function is the length of the shortest program that implements that function (a.k.a. Kolmogorov complexity)
– Description length depends on which Turing machine you use, but the differences between any two TMs are bounded by a constant (Compiler Theorem)
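
True Kolmogorov complexity is uncomputable, but the flavor of "description length" can be illustrated with an off-the-shelf compressor, which gives a crude upper bound. This is a heuristic sketch of my own (not from the slides), and the absolute numbers depend entirely on the compressor playing the role of the Turing machine.

# A crude, compressor-based upper bound on description length (illustrative only).
import random
import zlib

def description_length_proxy(y_values: bytes) -> int:
    """Upper-bound proxy: size in bytes of the compressed table of y values."""
    return len(zlib.compress(y_values, 9))

n = 4096
constant = bytes(n)                              # f(x) = 0 everywhere
modular  = bytes(x % 256 for x in range(n))      # f(x) = x mod 256
random.seed(0)
random_f = bytes(random.randrange(256) for _ in range(n))   # a "typical" function

for name, f in [("constant", constant), ("x mod 256", modular), ("random", random_f)]:
    print(name, description_length_proxy(f))
# Highly structured functions compress to a few dozen bytes; a random one does not.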

Previous work on NFL and problem description length
– Schumacher 2001: NFL can hold for sets of functions that are highly compressible (e.g., the set of needle-in-the-haystack functions)
– Droste, Jansen, and Wegener 1999: NFL does not hold for certain sets with highly restricted description length ("Perhaps Not a Free Lunch, but at Least a Free Appetizer")

My results concerning description length
– NFL does not hold in general for sets with bounded description length
– For sets defined by a bound on description length, any reasonable bound rules out NFL

Proof outline
– F_k ≡ all functions f: X → Y with description length k or less (there are at most 2^(k+1) - 1 programs of length k or less, so |F_k| ≤ 2^(k+1) - 1)
– If k is sufficiently large, F_k will contain a hashing function like h(x) = x mod |Y|
– If the number of permutations of h(x) is more than 2^(k+1) - 1, F_k cannot be closed under permutation; therefore:
– The NFL result will not hold for (F_k, P) where P is uniform (Schumacher, Vose, and Whitley 2001)

Illustration
Let X consist of all n-bit chromosomes, Y consist of all m-bit fitness values, and F_k be defined as before
It turns out the number of permutations of h(x) = x mod |Y| is (for m ≤ n, since each of the 2^m fitness values occurs 2^(n-m) times):
(2^n)! / ((2^(n-m))!)^(2^m)

Illustration
For example, with chromosome length n = 16 and fitness-value length m = 1, NFL does not hold for any description-length bound k with k_mod ≤ k ≤ 6.55 × 10^4, where k_mod is the description length of h(x) = x mod |Y|; for longer chromosomes and fitness values the upper bound of this range is larger still
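
The counting can be checked numerically. Here is a small Python sketch (mine, not the slides') that evaluates the permutation count from the formula above and the largest k that the counting argument rules out.

# Counting permutations of h(x) = x mod |Y| and the largest k ruled out
# by the argument above (illustrative sketch).
from math import factorial

def num_permutations_of_h(n: int, m: int) -> int:
    """Distinct rearrangements of h(x) = x mod 2**m over n-bit inputs (m <= n):
    (2^n)! / ((2^(n-m))!)^(2^m), since each fitness value occurs 2^(n-m) times."""
    return factorial(2 ** n) // factorial(2 ** (n - m)) ** (2 ** m)

def max_k_ruled_out(n: int, m: int) -> int:
    """Largest k with 2^(k+1) - 1 < #permutations(h), so that F_k cannot be
    closed under permutation and the uniform-P NFL result cannot hold."""
    perms = num_permutations_of_h(n, m)
    return perms.bit_length() - 2        # largest k with 2^(k+1) <= perms

print(max_k_ruled_out(16, 1))    # about 6.55 * 10^4
print(max_k_ruled_out(16, 4))    # larger still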

Additional proofs
– NFL does not hold if P assigns probability that is strictly decreasing w.r.t. description length
– NFL does not hold if P is a Solomonoff-Levin distribution
– Same proof technique (counting argument, permutation closure)
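
For reference, a sketch of the standard forms of these priors (the slide does not spell them out); here K(f) denotes the description length (Kolmogorov complexity) of f and U a fixed universal prefix machine.

% One standard prior that is strictly decreasing in description length:
P(f) \;\propto\; 2^{-K(f)}

% The Solomonoff-Levin (universal) distribution, which dominates 2^{-K(f)}
% up to a multiplicative constant c:
m(f) \;=\; \sum_{p \,:\, U(p) = f} 2^{-|p|} \;\ge\; c \cdot 2^{-K(f)}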

No Free Lunch and infinite sets
– The original NFL proof depends on P assigning equal probability to every f in F; this can't be done for infinite F
– The Bayesian formulation still works for infinite F
– NFL only holds when P(f) can be expressed solely as a function of f's domain and range
– Unlike the case of finite F, there is no aesthetically pleasing argument for assuming P is such that NFL holds

Limitations of this work
– When we show that an NFL result does not hold for a certain (F, P), all this means is that for some performance measure, some algorithms perform better than others
– The fact that NFL does not hold for (F, P) does not necessarily mean (F, P) is "searchable"
– For stronger results under more restrictive assumptions, see Christensen and Oppacher 2001

Conclusions
Many statements have been made with NFL as the justification:
– All search algorithms are equally robust
– All search algorithms are equally specialized
– Performance of algorithms on benchmark problems is not predictive of performance in general

Conclusions
However, these statements are only true if you assume that:
– the (F, P) of interest to the GA community is such that parent and offspring fitness are uncorrelated, and
– there is no correlation between genotypic similarity and similarity of fitness

Conclusions
Moreover, there are theoretically motivated choices of (F, P) for which NFL does not hold:
– P(f) as a decreasing function of f's description length
– P(f) as a Solomonoff-Levin distribution

Conclusions If anything, proponents of NFL are on shakier ground when F is infinite