Two Broad Classes of Functions for Which a No Free Lunch Result Does Not Hold
Matthew J. Streeter
Genetic Programming, Inc., Mountain View, California
GECCO 2003, July, Chicago
Overview
– Definitions & prior work
– Bayesian formulation of No Free Lunch & its implications
– No Free Lunch and description length
– No Free Lunch and infinite sets
Definitions (search algorithm framework)
– X = set of points (individuals) in the search space
– Y = set of cost (fitness) values
– x_i = an individual; y_i = the fitness of x_i
– A search algorithm takes as input a list {(x_0, y_0), (x_1, y_1), …, (x_n, y_n)} of previously visited points and their cost values, and produces the next point x_{n+1} as output
– We consider only deterministic, non-retracing search algorithms, but the conclusions can be extended to stochastic, potentially retracing algorithms using the arguments of (Wolpert and Macready 1997)
Definitions (search algorithm framework)
– A = a search algorithm
– F = a set of functions to be optimized
– P = a probability distribution over F
– f = a particular function in F
– Performance vector V(A, f): the vector of cost (fitness) values that A receives when run against f
– Performance measure M(V(A, f)): a function that assigns scores to performance vectors
– Overall performance M_O(A)
Definitions (search algorithm framework)
A No Free Lunch result applies to (F, P) iff, for any M, A, and B, M_O(A) = M_O(B)
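To make the framework concrete, here is a minimal Python sketch; the names performance_vector, enumeration_search, and best_of_run are illustrative choices of mine, not from the paper. A deterministic, non-retracing algorithm maps the list of previously visited (point, cost) pairs to an unvisited point, and V(A, f) is the resulting list of cost values.

```python
def performance_vector(algorithm, f, X, n_steps):
    """Run `algorithm` against cost function `f` for n_steps and return
    V(A, f): the vector of cost values received, in order."""
    visited = []  # (x, y) pairs seen so far
    for _ in range(n_steps):
        x = algorithm(visited, X)               # next point to sample
        assert all(x != p for p, _ in visited)  # non-retracing
        visited.append((x, f(x)))
    return [y for _, y in visited]

def enumeration_search(visited, X):
    """A trivial deterministic algorithm: visit points of X in a fixed order."""
    seen = {p for p, _ in visited}
    return next(x for x in X if x not in seen)

# One possible performance measure M; overall performance M_O(A) would
# average M(V(A, f)) over f drawn from F under P.
def best_of_run(v):
    return max(v)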
Definitions (permutations of functions)
– A permutation f′ of a function f is a rearrangement of f’s y values with respect to its x values
– A set of functions F is closed under permutation iff, for any f ∈ F and any permutation f′ of f, f′ ∈ F
Some No Free Lunch results
– NFL holds for (F, P) where P is uniform and F is the set of all functions f: X → Y, where X and Y are finite (Wolpert and Macready 1997)
– NFL holds for (F, P) where P is uniform iff F is closed under permutation (Schumacher, Vose, and Whitley 2001)
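A brute-force closure check, feasible only for tiny search spaces, makes the Schumacher, Vose, and Whitley condition easy to experiment with. This is a sketch under the assumption that each function is represented as a tuple of y-values indexed by a fixed ordering of X; closed_under_permutation is my name for it, not standard.

```python
from itertools import permutations

def closed_under_permutation(F):
    """True iff every rearrangement of every f in F is also in F."""
    F = {tuple(f) for f in F}
    return all(tuple(p) in F for f in F for p in permutations(f))

# The set of needle-in-a-haystack functions on |X| = 3 is closed under
# permutation, so (F, uniform P) admits an NFL result; a singleton set
# containing one non-constant function is not closed.
needles = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}
assert closed_under_permutation(needles)
assert not closed_under_permutation({(0, 1, 2)})
```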
Known limitations of NFL
– Does not hold for some classes of extremely simple functions (Droste, Jansen, and Wegener 1999)
– Under certain restrictions on the “successor graph”, SUBMEDIAN-SEEKER outperforms random search (Christensen and Oppacher 2001)
– Does not hold when F has fewer than the maximum number of local optima or less than the maximum steepness (Igel and Toussaint 2001)
Bayesian formulation of NFL
– Let f be a function drawn at random from F under probability distribution P_F
– Consider running a search algorithm for n steps to obtain the set of (point, cost) pairs S = {(x_0, y_0), (x_1, y_1), …, (x_n, y_n)}
– Let P(f(x_i) = y | S) denote the conditional probability that f(x_i) = y, given our knowledge of S
Bayesian formulation of NFL
A No Free Lunch result holds for (F, P_F) if and only if:
P(f(x_i) = y | S) = P(f(x_j) = y | S)
for all x_i, x_j, S, and y, where x_i and x_j have not yet been visited
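For finite F, the conditional probabilities in this condition can be computed exactly by enumeration. The sketch below (illustrative names, uniform P_F assumed) verifies the condition on a small F that is closed under permutation:

```python
from fractions import Fraction
from itertools import product

def conditional_prob(F, S, x, y):
    """P(f(x) = y | f agrees with S), under uniform P over F.
    S is a dict {x_i: y_i} of observed (point, cost) pairs; each f in F
    is a tuple of y-values indexed by x."""
    consistent = [f for f in F if all(f[xi] == yi for xi, yi in S.items())]
    matching = [f for f in consistent if f[x] == y]
    return Fraction(len(matching), len(consistent))

# F = all functions from |X| = 3 points to Y = {0, 1} is closed under
# permutation; the conditional is the same for every unvisited point.
F = list(product([0, 1], repeat=3))
S = {0: 1}  # we have visited x_0 and observed cost 1
assert conditional_prob(F, S, 1, 0) == conditional_prob(F, S, 2, 0)
```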
Proof (If)
By induction:
– Base case: if S = {}, then P(f(x_i) = y) = P(f(x_j) = y), so the search algorithm has no influence on the first cost (fitness) value obtained
– Inductive step: if S = {(x_0, y_0), (x_1, y_1), …, (x_k, y_k)} and the search algorithm cannot influence the first k cost values, then P(f(x_i) = y | S) = P(f(x_j) = y | S), so the search algorithm cannot influence y_{k+1}
Because the list of cost values is independent of the search algorithm, performance is independent of the search algorithm
Proof (Only if)
– Assume for some x_i, x_j, S, y: P(f(x_i) = y | S) ≠ P(f(x_j) = y | S), where S = {(x_0, y_0), (x_1, y_1), …, (x_n, y_n)}
– Consider a performance measure M that only rewards performance vectors beginning with the prefix y_0, y_1, …, y_n, y
– We can construct search algorithms A and B that behave identically, except that A samples x_i after observing S whereas B samples x_j
– It follows that M_O(A) > M_O(B), so NFL does not hold
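The same machinery demonstrates the “only if” construction on a tiny example. Reusing performance_vector from the earlier sketch, and taking a single-function F (trivially not closed under permutation), algorithm A beats B under a prefix-rewarding measure M; all names below are illustrative.

```python
# F is not closed under permutation: after observing f(x_0) = 0,
# P(f(x_1) = 1 | S) = 1 but P(f(x_2) = 1 | S) = 0 under uniform P.
F = [(0, 1, 0)]

def A(visited, X):            # samples x_0, then x_1, then x_2
    return [0, 1, 2][len(visited)]

def B(visited, X):            # samples x_0, then x_2, then x_1
    return [0, 2, 1][len(visited)]

def M(v):                     # rewards only vectors with prefix (0, 1)
    return 1 if tuple(v[:2]) == (0, 1) else 0

def overall(alg):             # M_O: average of M(V(alg, f)) over F
    return sum(M(performance_vector(alg, lambda x, f=f: f[x], range(3), 3))
               for f in F) / len(F)

assert overall(A) == 1 and overall(B) == 0  # NFL fails for this (F, P)
```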
Analysis
P(f(x_i) = y | S) = P(f(x_j) = y | S) implies that:
– E[|f(x_i) − f(x_j)|] = E[|f(x_k) − f(x_l)|]: no correlation between genotypic similarity and similarity of fitness
– P(y_min ≤ f(x_i) ≤ y_max | S) = P(y_min ≤ f(x_j) ≤ y_max | S): we cannot use S to estimate the probability of unseen points lying in some fitness interval [y_min, y_max]
Contrast with Holland’s assumptions
– Holland assumed that by sampling a schema, you can extrapolate about the fitness of unseen instances of that schema; this implies P(y_min ≤ f(x_i) ≤ y_max | S) ≠ P(y_min ≤ f(x_j) ≤ y_max | S)
– The objection has been made before that NFL assumptions violate Holland’s assumption (Droste, Jansen, and Wegener 1999), but I have just shown that NFL holds only when this assumption is violated
No Free Lunch and Description Length
Definitions
– The description length of a function is the length of the shortest program that implements that function (a.k.a. Kolmogorov complexity)
– Description length depends on which Turing machine you use, but the differences between any two TMs are bounded by a constant (Compiler Theorem)
Previous work on NFL and problem description length
– Schumacher 2001: NFL can hold for sets of functions that are highly compressible (e.g., the set of needle-in-the-haystack functions)
– Droste, Jansen, and Wegener 1999: NFL does not hold for certain sets with highly restricted description length (“Perhaps Not a Free Lunch, but at Least a Free Appetizer”)
My results concerning description length
– NFL does not hold in general for sets with bounded description length
– For sets defined by a bound on description length, any reasonable bound rules out NFL
Proof outline
– Let F_k = all functions f: X → Y with description length k or less
– If k is sufficiently large, F_k will contain a hashing function such as h(x) = x mod |Y|
– There are at most 2^(k+1) − 1 programs of length k or less, so if the number of permutations of h(x) is more than 2^(k+1) − 1, F_k cannot be closed under permutation; therefore:
– An NFL result will not hold for (F_k, P) where P is uniform (Schumacher, Vose, and Whitley 2001)
Illustration
– Let X consist of all n-bit chromosomes, Y consist of all m-bit fitness values, and F_k be defined as before
– Each of the |Y| cost values of h(x) = x mod |Y| occurs |X|/|Y| times, so the number of permutations of h(x) is the multinomial coefficient:
|X|! / ((|X|/|Y|)!)^|Y|
Illustration
Chromosome length n | Fitness value length m | NFL does not hold when:
16                  | 1                      | k_mod ≤ k ≤ 6.55×10^…
8                   | 8                      | k_mod ≤ k ≤ 9.26×10^…
8                   | 32                     | k_mod ≤ k ≤ 4.22×10^…
where k_mod = the description length of h(x) = x mod |Y|
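A short sketch of the counting argument behind this table (function names are mine, and the exact tabulated bounds above are not reproduced here): it computes the multinomial count of permutations of h from the previous slide and compares it with 2^(k+1) − 1, an upper bound on the number of binary programs of length at most k.

```python
from math import factorial

def num_permutations(n, m):
    """Rearrangements of h(x) = x mod |Y| over n-bit chromosomes and
    m-bit fitness values: |X|! / ((|X|/|Y|)!)^|Y|, since each of the
    |Y| cost values occurs |X|/|Y| times (assuming m <= n)."""
    size_x, size_y = 2 ** n, 2 ** m
    return factorial(size_x) // factorial(size_x // size_y) ** size_y

def nfl_ruled_out(n, m, k):
    """True if F_k provably cannot be closed under permutation: more
    permutations of h than there are programs of length <= k."""
    return num_permutations(n, m) > 2 ** (k + 1) - 1

# For n = 8, m = 8 the count is 256!, so the upper end of the interval is
# roughly log2(256!) ~ 1684 bits.
print(num_permutations(8, 8).bit_length() - 1)
```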
Additional proofs
– NFL does not hold if P assigns probability that is strictly decreasing w.r.t. description length
– NFL does not hold if P is a Solomonoff-Levin distribution
– Same proof technique (counting argument, permutation closure)
No Free Lunch and infinite sets
– The original NFL proof depends on P assigning equal probability to every f in F; this cannot be done for infinite F
– The Bayesian formulation still works for infinite F
– NFL only holds when P(f) can be expressed solely as a function of f’s domain and range
– Unlike the case of finite F, there is no aesthetically pleasing argument for assuming P is such that NFL holds
Limitations of this work
– When we show that an NFL result does not hold for a certain (F, P), all this means is that for some performance measure, some algorithms perform better than others
– The fact that NFL does not hold for (F, P) does not necessarily mean (F, P) is “searchable”
– For stronger results under more restrictive assumptions, see Christensen and Oppacher 2001
Conclusions
Many statements have been made with NFL as the justification:
– All search algorithms are equally robust
– All search algorithms are equally specialized
– Performance of algorithms on benchmark problems is not predictive of performance in general
Conclusions
However, these statements are only true if you assume that:
– the (F, P) of interest to the GA community is such that parent and offspring fitness are uncorrelated, and
– there is no correlation between genotypic similarity and similarity of fitness
Conclusions
Moreover, there are theoretically motivated choices of (F, P) for which NFL does not hold:
– P(f) as a decreasing function of f’s description length
– P(f) as a Solomonoff-Levin distribution
Conclusions
If anything, proponents of NFL are on shakier ground when F is infinite