Symbolic Regression via Genetic Programming AI Project #2 Biointelligence lab Cho, Dong-Yeon
© 2005 SNU CSE Biointelligence Lab

Example (1/2)
  Data: relationship between A and P
  [Table: columns A and P; values not recoverable]
Example (2/2)
  Kepler's Third Law
    The square of any planet's (sidereal) orbital period is proportional to the cube of its mean distance (semi-major axis) from the Sun.

  Planet     A   P
  Mercury
  Venus
  Earth      1.00
  Mars
  Jupiter
  Saturn
  Uranus
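The proportionality stated above can be checked numerically. The sketch below is only an illustration: the planet values are approximate textbook figures (A in astronomical units, P in Earth years) and are an assumption, not the data from the slide. In these units the constant of proportionality is close to 1.

```python
# Approximate textbook values (assumed, not taken from the slides):
# A = semi-major axis in AU, P = sidereal orbital period in Earth years.
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.86),
    "Saturn": (9.537, 29.46),
    "Uranus": (19.19, 84.01),
}


def kepler_ratio(a, p):
    """Return P^2 / A^3, which Kepler's third law predicts to be constant."""
    return p ** 2 / a ** 3


ratios = {name: kepler_ratio(a, p) for name, (a, p) in PLANETS.items()}
for name, ratio in ratios.items():
    print(f"{name:8s} P^2/A^3 = {ratio:.4f}")
```

A symbolic regression run on such (A, P) pairs should ideally rediscover an expression equivalent to sqrt(A * A * A).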
Evolving the Programs (1/2)
  Koza's Algorithm
  1. Choose a set of possible functions and terminals for the program, e.g. F = {+, -, *, /, sqrt}, T = {A}.
  2. Generate an initial population of random trees (programs) using the set of possible functions and terminals.
  3. Calculate the fitness of each program in the population by running it on a set of "fitness cases" (a set of inputs for which the correct output is known).
  4. Apply selection, crossover, and mutation to the population to form a new population.
  5. Repeat steps 3 and 4 for some number of generations.
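Step 4 relies on variation operators that act on trees. The following is a minimal sketch of subtree crossover and a simplified mutation, assuming programs are stored as nested tuples of the form (function, child, ...) with strings as terminals. This representation and all function names are illustrative assumptions, not the project's required design. The mutation shown replaces a random subtree with a terminal, which is simpler than the usual variant that grows a fresh random subtree.

```python
import random


def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))


def replace(tree, path, new):
    """Return a copy of tree in which the node at path is replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]


def size(tree):
    """Number of nodes in the tree."""
    return sum(1 for _ in subtrees(tree))


def crossover(a, b, rng):
    """Subtree crossover: graft a random subtree of b onto a random point of a."""
    path_a, _ = rng.choice(list(subtrees(a)))
    _, sub_b = rng.choice(list(subtrees(b)))
    return replace(a, path_a, sub_b)


def mutate(tree, rng, terminals=("A",)):
    """Simplified mutation: replace a random subtree with a random terminal."""
    path, _ = rng.choice(list(subtrees(tree)))
    return replace(tree, path, rng.choice(terminals))


parent1 = ("+", "A", ("*", "A", "A"))
parent2 = ("sqrt", "A")
rng = random.Random(0)
child = crossover(parent1, parent2, rng)
mutant = mutate(parent1, rng)
print(child, mutant)
```

Because tuples are immutable, both operators return new trees and leave the parents untouched, which keeps the rest of a GP loop simple.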
Evolving Lisp Programs (2/2)
  Kepler's Third Law: $P^2 = cA^3$

  FORTRAN
      PROGRAM ORBITAL_PERIOD
C     # Mars #
      A = 1.52
      P = SQRT(A * A * A)
      PRINT P
      END ORBITAL_PERIOD

  LISP
      (defun orbital_period ()
        ; Mars ;
        (setf A 1.52)
        (sqrt (* A (* A A))))

  [Figure: parse tree of the LISP program]
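The parse tree of the LISP expression can be written down and evaluated directly. This is a small sketch, assuming a nested-tuple encoding (function, argument, ...) in which strings name variables. The encoding and the helper names are illustrative choices, not part of the original slides.

```python
import math

# Parse tree for (sqrt (* A (* A A))) as nested tuples: (function, arg, ...).
TREE = ("sqrt", ("*", "A", ("*", "A", "A")))

FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "sqrt": math.sqrt,
}


def evaluate(tree, env):
    """Recursively evaluate a tree; env maps variable names to values."""
    if isinstance(tree, tuple):
        op, *args = tree
        return FUNCTIONS[op](*(evaluate(arg, env) for arg in args))
    if isinstance(tree, str):
        return env[tree]
    return tree  # numeric constant


def to_lisp(tree):
    """Render a tree back into a LISP-style S-expression."""
    if isinstance(tree, tuple):
        return "(" + " ".join([tree[0]] + [to_lisp(arg) for arg in tree[1:]]) + ")"
    return str(tree)


print(to_lisp(TREE))
print(evaluate(TREE, {"A": 1.52}))  # orbital period of Mars in Earth years
```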
Symbolic Regression by GP
  Objective
    Find the function f for the given data (x, y).
  Data Sets
    Set 1 and 2: 11 pairs
    Set 3: 50 pairs
Functions and Terminals
  Functions
    Numerical operators {+, -, *, /, exp, log, sin, cos, sqrt}
    Some operators must be protected against illegal operations (for example, division by zero or the logarithm of a non-positive number).
  Terminals
    Input and constants {x, R}, where $R \in [a, b]$
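Protected operators can be sketched as thin wrappers around the standard math functions. The fallback values used here (1.0 for division by nearly zero, 0.0 for the logarithm near zero, absolute values for log and sqrt, a clamped argument for exp) follow one common convention and are assumptions. The slides do not prescribe specific choices, and the threshold and clamp constants are hypothetical.

```python
import math

EPSILON = 1e-9      # assumed threshold for "too close to zero"
EXP_LIMIT = 50.0    # assumed clamp that keeps exp() from overflowing


def protected_div(a, b):
    """Division that returns 1.0 when the denominator is (nearly) zero."""
    return a / b if abs(b) > EPSILON else 1.0


def protected_log(a):
    """Logarithm of |a|, returning 0.0 when a is (nearly) zero."""
    return math.log(abs(a)) if abs(a) > EPSILON else 0.0


def protected_sqrt(a):
    """Square root of |a|, so negative inputs do not raise an error."""
    return math.sqrt(abs(a))


def protected_exp(a):
    """Exponential with the argument clamped to avoid OverflowError."""
    return math.exp(min(a, EXP_LIMIT))


print(protected_div(1.0, 0.0), protected_log(0.0), protected_sqrt(-4.0))
```

Whatever convention is chosen, it should be reported, because the protected versions are the functions that actually appear in the evolved expression.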
Initialization
  A maximum initial tree depth $D_{\max}$ is set.
  Full method (each branch has depth $= D_{\max}$):
    nodes at depth $d < D_{\max}$ are randomly chosen from the function set $F$
    nodes at depth $d = D_{\max}$ are randomly chosen from the terminal set $T$
  Grow method (each branch has depth $\le D_{\max}$):
    nodes at depth $d < D_{\max}$ are randomly chosen from $F \cup T$
    nodes at depth $d = D_{\max}$ are randomly chosen from $T$
  Common GP initialisation: ramped half-and-half, where the grow and full methods each deliver half of the initial population.
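The full and grow methods and ramped half-and-half can be sketched as follows. The nested-tuple tree encoding, the reduced function set, the constant range [-1, 1], and the way depths are ramped (cycling through a depth range, alternating full and grow) are all assumptions made for illustration.

```python
import random

# Illustrative subset of the function set: name -> arity (assumed).
FUNCTIONS = {"+": 2, "-": 2, "*": 2, "sin": 1}


def random_terminal(rng, lo=-1.0, hi=1.0):
    """Return the input 'x' or a random constant R in [lo, hi] (bounds assumed)."""
    return "x" if rng.random() < 0.5 else rng.uniform(lo, hi)


def function_node(op, child_builder):
    """Build a node for op with as many children as its arity requires."""
    return (op,) + tuple(child_builder() for _ in range(FUNCTIONS[op]))


def full(d_max, rng, d=0):
    """Full method: functions at every depth d < d_max, terminals at d == d_max."""
    if d == d_max:
        return random_terminal(rng)
    op = rng.choice(list(FUNCTIONS))
    return function_node(op, lambda: full(d_max, rng, d + 1))


def grow(d_max, rng, d=0):
    """Grow method: draw from F and T at d < d_max, from T only at d == d_max."""
    symbols = list(FUNCTIONS) + ["x", "R"]  # "R" stands for a random constant
    choice = rng.choice(symbols) if d < d_max else "terminal"
    if choice in FUNCTIONS:
        return function_node(choice, lambda: grow(d_max, rng, d + 1))
    return random_terminal(rng)


def depth(tree):
    """Depth of a tree; a single terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(child) for child in tree[1:])


def ramped_half_and_half(pop_size, min_depth, max_depth, rng):
    """Alternate full and grow while cycling through the allowed depths."""
    depths = list(range(min_depth, max_depth + 1))
    population = []
    for i in range(pop_size):
        d_max = depths[(i // 2) % len(depths)]
        method = full if i % 2 == 0 else grow
        population.append(method(d_max, rng))
    return population


rng = random.Random(0)
population = ramped_half_and_half(12, 2, 4, rng)
print([depth(tree) for tree in population])
```

With an even population size, exactly half of the trees come from each method, and every depth in the range is used by both.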
Fitness Functions
  Relative squared error
  The number of outputs that are within a given percentage of the correct value
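Both fitness measures are short to compute. The exact formula for the relative squared error did not survive in this text, so the sketch below assumes one common definition: the sum of squared errors divided by the sum of squared deviations of the targets from their mean. The hit counter takes the tolerance as a percentage of each target value. The function and parameter names are hypothetical.

```python
def relative_squared_error(targets, outputs):
    """Sum of squared errors, normalised by the targets' spread about their mean.

    Assumed definition; it is undefined when all targets are identical.
    """
    mean = sum(targets) / len(targets)
    numerator = sum((y - f) ** 2 for y, f in zip(targets, outputs))
    denominator = sum((y - mean) ** 2 for y in targets)
    return numerator / denominator


def hits(targets, outputs, tolerance_percent):
    """Count outputs that lie within tolerance_percent of the correct value."""
    tolerance = tolerance_percent / 100.0
    return sum(
        1 for y, f in zip(targets, outputs) if abs(f - y) <= tolerance * abs(y)
    )


targets = [1.0, 2.0, 3.0]
print(relative_squared_error(targets, [2.0, 2.0, 2.0]))
print(hits([100.0, 200.0], [101.0, 230.0], 5))
```

Under this definition, a model that always predicts the mean of the targets scores 1.0 and a perfect model scores 0.0.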
Selection (1/2)
  Fitness proportional (roulette wheel) selection
  The roulette wheel can be constructed as follows.
    Calculate the total fitness of the population.
    Calculate the selection probability $p_k$ for each chromosome $v_k$.
    Calculate the cumulative probability $q_k$ for each chromosome $v_k$.
Procedure: Proportional_Selection
  Generate a random number $r$ from the range [0, 1].
  If $r \le q_1$, then select the first chromosome $v_1$; otherwise, select the $k$th chromosome $v_k$ ($2 \le k \le \text{pop\_size}$) such that $q_{k-1} < r \le q_k$.
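The wheel construction and the selection procedure can be sketched together. This assumes the fitness values are non-negative and that larger is better. An error-based fitness would first need a transformation, for example 1 / (1 + error), which is an assumption and not something the slides specify. Indices are 0-based here, so the slide's $v_1$ is index 0.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def build_wheel(fitnesses):
    """Return (p, q): selection probabilities p_k and cumulative probabilities q_k."""
    total = sum(fitnesses)
    probabilities = [f / total for f in fitnesses]
    cumulative = list(accumulate(probabilities))
    cumulative[-1] = 1.0  # guard against floating-point rounding
    return probabilities, cumulative


def spin(cumulative, r):
    """Return the 0-based index k with q_{k-1} < r <= q_k (index 0 if r <= q_0)."""
    return bisect_left(cumulative, r)


def proportional_selection(population, fitnesses, rng):
    """Select one individual with probability proportional to its fitness."""
    _, cumulative = build_wheel(fitnesses)
    return population[spin(cumulative, rng.random())]


p, q = build_wheel([1.0, 3.0, 6.0])
print(p, q)
print(proportional_selection(["a", "b", "c"], [1.0, 3.0, 6.0], random.Random(0)))
```

Using binary search over the cumulative probabilities gives the same result as scanning them from the first chromosome onward, but in logarithmic time.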
Selection (2/2)
  Tournament selection
    Tournament size $q$, with $2 \le q \le \text{POP\_SIZE}$
  Ranking-based selection
    $1 \le \eta^{+} \le 2$ and $\eta^{-} = 2 - \eta^{+}$
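Both schemes can be sketched in a few lines. The tournament below samples q distinct individuals and keeps the fittest, assuming larger fitness is better. The ranking probabilities use the commonly cited linear ranking formula with parameters eta_plus and eta_minus = 2 - eta_plus. The slide's own formula is not available in this text, so that choice is an assumption, as are the function names.

```python
import random


def tournament_select(population, fitnesses, q, rng):
    """Pick q distinct individuals at random and return the fittest of them."""
    contestants = rng.sample(range(len(population)), q)
    best = max(contestants, key=lambda i: fitnesses[i])
    return population[best]


def linear_ranking_probabilities(n, eta_plus):
    """Selection probabilities for ranks 0 (best) to n - 1 (worst), n >= 2.

    The best individual expects eta_plus copies and the worst eta_minus copies.
    """
    eta_minus = 2.0 - eta_plus
    return [
        (eta_plus - (eta_plus - eta_minus) * rank / (n - 1)) / n
        for rank in range(n)
    ]


population = ["a", "b", "c", "d", "e"]
fitnesses = [0.2, 0.9, 0.5, 0.1, 0.7]
rng = random.Random(0)
print(tournament_select(population, fitnesses, 2, rng))
ranking = linear_ranking_probabilities(5, 1.5)
print(ranking)
```

A larger q or a larger eta_plus increases the selection pressure, which is one of the parameters the experiments ask you to vary.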
GP Flowchart
  [Figure: flowcharts of the GA loop and the GP loop]
Bloat
  Bloat = "survival of the fattest", i.e., the tree sizes in the population increase over time.
  There is ongoing research and debate about the reasons.
  Bloat needs countermeasures, e.g.
    Prohibiting variation operators that would deliver "too big" children
    Parsimony pressure: a penalty for being oversized
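Both countermeasures reduce to a measurement of tree size. The sketch below assumes nested-tuple trees, an error-style fitness where lower is better, and a linear penalty with a hypothetical coefficient alpha. The form of the penalty and the size limit are illustrative assumptions.

```python
def tree_size(tree):
    """Number of nodes in a nested-tuple tree of the form (function, child, ...)."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(tree_size(child) for child in tree[1:])


def penalized_error(error, tree, alpha=0.01):
    """Parsimony pressure: add a penalty proportional to the tree size."""
    return error + alpha * tree_size(tree)


def accept_child(child, parent, max_size):
    """Size limit: keep the parent when the child would be too big."""
    return child if tree_size(child) <= max_size else parent


small = ("*", "x", "x")
large = ("+", ("*", "x", "x"), ("-", ("*", "x", "x"), ("*", "x", "x")))
print(penalized_error(0.10, small), penalized_error(0.10, large))
print(accept_child(large, small, max_size=5))
```

With equal raw errors, the smaller tree receives the better penalized score, so selection favours it.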
Experiments
  At least three problems (+ your own data)
  Various experimental setups
    Termination condition: maximum_generation
    2 models × 3 settings × 20 runs
    Polynomial and general
    Effects of the penalty term
    Selection methods and their parameters
    Crossover $p_c$ and mutation $p_m$
Results
  For each problem
    Result table and your analysis
    Present the optimal function.
      Readable form and predicted function graph with data
    Draw a learning curve for the run where the best solution was found.
      You can draw all learning curves in one plot.

               Polynomial                       General
               Average  SD  Best  Worst         Average  SD  Best  Worst
  Setting 1
  Setting 2
  Setting 3
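Each cell group of the result table summarises the final fitness of the runs for one setting. This is a minimal sketch, assuming the fitness is an error (so the best run is the minimum) and that SD means the sample standard deviation. The run values in the example are made up purely to exercise the function.

```python
import statistics


def summarize(errors):
    """Return the table entries for one setting from a list of per-run errors."""
    return {
        "average": statistics.mean(errors),
        "sd": statistics.stdev(errors),  # sample standard deviation
        "best": min(errors),             # lower error is better
        "worst": max(errors),
    }


example_errors = [1.0, 2.0, 3.0]  # placeholder values, not experimental results
row = summarize(example_errors)
print(row)
```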
[Figure: learning curve; x-axis: Generation, y-axis: Fitness (Error)]
References
  Source Codes
    GP libraries (C, C++, JAVA, …)
    MATLAB Tool box
  Web sites
    e.html
    e.html
    …
Pay Attention!
  Due: May 3, 2005
  Submission
    Source code and executable file(s)
      Proper comments in the source code
    Via
  Report: Hardcopy!!
    Running environments
    Results for many experiments with various parameter settings
    Analysis and explanation about the results in your own way