Project 2: Classification Using Genetic Programming Kim, MinHyeok Biointelligence laboratory Artificial Intelligence
Contents
- Project outline
  - Description of the data set
- Genetic Programming
  - Brief overview
  - Fitness function & selection methods
  - Classification with GP (in this project)
- Guide to writing reports
  - Style & contents
  - Submission guide / Marking scheme
(C) 2008, SNU Biointelligence Laboratory
Outline
- Goal
  - Gain a deeper understanding of Genetic Programming (GP)
  - Practice researching and writing a paper
- Forest Fires problem (classification)
  - Predict whether a fire occurs or not
  - Using Genetic Programming
  - Estimating several statistics on the dataset
- Data set
  - A variation of the ‘Forest Fires data set’
Forest Fires Data Set Description
- Database of 517 samples
  - You can use at most 500 samples for training
  - 17 samples for prediction
- 12 attributes plus a class label
  - X, Y, month, day, FFMC, DMC, DC, ISI, temp, RH, wind, rain, label
  - Integer or real value
- Label (Class)
  - Two classes
    - 0: a fire does not occur
    - 1: a fire occurs
Brief Summary of GP
- A kind of evolutionary algorithm
- Each individual is represented with a tree structure
- You need to set up the following elements for a GP run
  - The set of terminals (input attributes, the class variable, constants)
  - The set of functions (numerical / conditional operators)
  - The fitness measure
  - The algorithm parameters
    - population size, maximum number of generations
    - crossover rate and mutation rate
    - maximum depth of GP trees, etc.
  - The method for designating a result and the criterion for terminating a run
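The elements listed above can be gathered into a single configuration object before a run. The following is only a minimal sketch: the class name `GPConfig`, its field names, and every default value are illustrative assumptions, not settings prescribed by the project or by any particular GP library.

```python
from dataclasses import dataclass, field


@dataclass
class GPConfig:
    """Hypothetical container for the elements a GP run needs."""

    # Terminal set: the 12 input attributes of the data set.
    terminals: tuple = ("X", "Y", "month", "day", "FFMC", "DMC",
                        "DC", "ISI", "temp", "RH", "wind", "rain")
    # Function set: operator name -> arity (an illustrative choice).
    functions: dict = field(default_factory=lambda: {
        "+": 2, "-": 2, "*": 2, ">": 2, "<": 2, "IF": 3})
    # Algorithm parameters (placeholder values, to be tuned in experiments).
    population_size: int = 200
    max_generations: int = 50
    crossover_rate: float = 0.9
    mutation_rate: float = 0.05
    max_depth: int = 6

    def validate(self) -> None:
        """Raise ValueError if a parameter is outside a sensible range."""
        if self.population_size < 2:
            raise ValueError("population_size must be at least 2")
        if self.max_generations < 1:
            raise ValueError("max_generations must be at least 1")
        for name in ("crossover_rate", "mutation_rate"):
            rate = getattr(self, name)
            if not 0.0 <= rate <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1]")
        if self.max_depth < 1:
            raise ValueError("max_depth must be at least 1")
```

Keeping all parameters in one place makes it easier to report each experimental setting exactly as it was run.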
GP Flowchart
[Figure: flowcharts of the GA loop and the GP loop]
Initialization
- Maximum initial depth of trees D_max is set
- Full method (each branch has depth = D_max):
  - nodes at depth d < D_max are randomly chosen from the function set F
  - nodes at depth d = D_max are randomly chosen from the terminal set T
- Grow method (each branch has depth ≤ D_max):
  - nodes at depth d < D_max are randomly chosen from F ∪ T
  - nodes at depth d = D_max are randomly chosen from T
- Common GP initialisation: ramped half-and-half, where the grow and full methods each deliver half of the initial population
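The full and grow methods can be sketched in a few lines. This is an illustrative sketch, not a library implementation: trees are nested tuples `(function, child, ...)`, the small function and terminal sets are assumptions, and cycling the depth limit from `d_min` to `d_max` is one common reading of "ramped" that the slide itself does not spell out.

```python
import random

# Assumed toy sets for illustration: function name -> arity, and terminals.
FUNCTIONS = {"+": 2, "-": 2, "*": 2}
TERMINALS = ["temp", "RH", "wind", "rain", 1.0]


def full(depth, d_max, rng):
    """Full method: functions above d_max, terminals exactly at d_max."""
    if depth == d_max:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTIONS))
    return (name,) + tuple(full(depth + 1, d_max, rng)
                           for _ in range(FUNCTIONS[name]))


def grow(depth, d_max, rng):
    """Grow method: nodes above d_max come from F ∪ T, so branches may stop early."""
    if depth == d_max:
        return rng.choice(TERMINALS)
    node = rng.choice(list(FUNCTIONS) + TERMINALS)
    if node not in FUNCTIONS:
        return node  # a terminal was drawn: this branch ends here
    return (node,) + tuple(grow(depth + 1, d_max, rng)
                           for _ in range(FUNCTIONS[node]))


def tree_depth(tree):
    """Depth of a tree; a lone terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])


def ramped_half_and_half(pop_size, d_max, rng, d_min=1):
    """Alternate full and grow while cycling the depth limit d_min..d_max."""
    depths = list(range(d_min, d_max + 1))
    population = []
    for i in range(pop_size):
        limit = depths[(i // 2) % len(depths)]
        builder = full if i % 2 == 0 else grow
        population.append(builder(0, limit, rng))
    return population
```

For example, `ramped_half_and_half(20, 4, random.Random(0))` yields 20 trees, none deeper than 4, with the even-indexed ones built by the full method.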
Fitness Functions
- Relative squared error
- The number of outputs that are within a given % of the correct value
- You can also try other well-defined fitness functions that suit the problem
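Both fitness measures named above are short to compute. The sketch below assumes the usual definition of relative squared error (squared error of the predictions divided by the squared error of always predicting the mean of the targets). The function names and the `tolerance_percent` parameter are hypothetical, since the slide does not fix a tolerance value.

```python
def relative_squared_error(predicted, actual):
    """Sum of squared errors, relative to the error of predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    numerator = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    denominator = sum((a - mean_actual) ** 2 for a in actual)
    if denominator == 0:
        raise ValueError("all target values are identical; RSE is undefined")
    return numerator / denominator


def hits_within_tolerance(predicted, actual, tolerance_percent):
    """Count outputs within tolerance_percent % of the correct value."""
    return sum(
        1
        for p, a in zip(predicted, actual)
        if abs(p - a) <= abs(a) * tolerance_percent / 100.0
    )
```

A relative squared error of 0 means a perfect fit, and 1 means no better than predicting the mean. Note that when the correct value is 0, as for one of the two class labels here, a relative tolerance collapses to an exact match, so an absolute tolerance may be a better choice for 0/1 targets.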
Selection methods (1/2)
- Fitness proportional (roulette wheel) selection
- The roulette wheel can be constructed as follows:
  - Calculate the total fitness for the population (the sum of the fitness values of all chromosomes).
  - Calculate the selection probability p_k for each chromosome v_k: p_k = fitness(v_k) / total fitness.
  - Calculate the cumulative probability q_k for each chromosome v_k: q_k = p_1 + ... + p_k.
Procedure: Proportional_Selection
- Generate a random number r from the range [0, 1].
- If r ≤ q_1, then select the first chromosome v_1; else, select the kth chromosome v_k (2 ≤ k ≤ pop_size) such that q_{k-1} < r ≤ q_k.
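The wheel construction and the selection procedure can be sketched as follows. The function names are hypothetical, indices are 0-based (so index 0 corresponds to v_1), and the sketch assumes non-negative fitness values where larger is better.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def build_wheel(fitnesses):
    """Return the cumulative probabilities q_k for the given fitness values."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("total fitness must be positive")
    probabilities = [f / total for f in fitnesses]  # p_k
    cumulative = list(accumulate(probabilities))    # q_k
    cumulative[-1] = 1.0  # guard against floating-point rounding
    return cumulative


def spin(cumulative, r):
    """Return the index k with q_{k-1} < r <= q_k (index 0 if r <= q_1)."""
    return bisect_left(cumulative, r)


def proportional_selection(population, fitnesses, rng):
    """Select one chromosome with probability proportional to its fitness."""
    wheel = build_wheel(fitnesses)
    return population[spin(wheel, rng.random())]
```

With fitness values 1, 3 and 6 the cumulative probabilities are 0.1, 0.4 and 1.0, so r = 0.25 selects the second chromosome. The binary search in `spin` replaces the linear scan over k, which matters little for small populations but keeps repeated selection cheap.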
Selection methods (2/2)
- Tournament selection
  - Tournament size q
- Ranking-based selection
  - 2 ≤ … ≤ POP_SIZE
  - 1 ≤ η+ ≤ 2 and η- = 2 - η+
- Elitism
  - To carry n good solutions over to the next generation
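Tournament selection and elitism are both short to express. The sketch below makes two assumptions that the slide leaves open: higher fitness is better, and the q contenders of a tournament are drawn without replacement. The function names are hypothetical.

```python
import random


def tournament_select(population, fitnesses, q, rng):
    """Draw q distinct individuals at random and return the fittest one."""
    contenders = rng.sample(range(len(population)), q)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]


def elites(population, fitnesses, n):
    """Return the n fittest individuals, to be copied unchanged."""
    ranked = sorted(range(len(population)),
                    key=lambda i: fitnesses[i], reverse=True)
    return [population[i] for i in ranked[:n]]
```

A larger tournament size q raises the selection pressure: with q equal to the population size the best individual always wins, while q = 1 reduces to uniform random choice.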
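Classification with GP (in this project)
- Function regression
  - Search for a function f(x) such that
    - f(x) ≥ threshold t when y = 1
    - f(x) < threshold t when y = 0
- Converting to a Boolean value
  [Figure: example GP trees; one built from ∧, ∨, ¬, =, >, <, + over terminals such as rain, RH, wind, FFMC, ISI and the constants 0 and 50, and one of the form IF(f(x) > t, 1, 0)]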
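The thresholding rule above can be sketched on a tiny expression tree. The tuple-based tree encoding, the function set, the example tree and all sample values below are made up for illustration; they are not taken from the data set or from any GP library.

```python
import operator

# Assumed function set for this sketch: name -> binary Python operator.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, sample):
    """Evaluate a tree: tuples are function nodes, strings are attributes."""
    if isinstance(tree, tuple):
        name, *children = tree
        return FUNCTIONS[name](*(evaluate(child, sample) for child in children))
    if isinstance(tree, str):
        return sample[tree]  # look up the attribute value
    return tree              # numeric constant


def classify(tree, sample, threshold):
    """Return 1 (fire) if f(x) >= threshold, else 0 (no fire)."""
    return 1 if evaluate(tree, sample) >= threshold else 0


def accuracy(tree, samples, labels, threshold):
    """Fraction of samples whose predicted class matches the label."""
    correct = sum(classify(tree, s, threshold) == y
                  for s, y in zip(samples, labels))
    return correct / len(samples)
```

For instance, the made-up tree `("+", "wind", ("*", "ISI", 2.0))` evaluates to 11.0 on a sample with wind = 3.0 and ISI = 4.0, so it is classified as 1 for threshold 10 and as 0 for threshold 12. The accuracy computed this way can itself serve as a fitness measure.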
What to do for the experiment?
- Select a library that implements GP
  - You can find various libraries written in C++/Java/Matlab
  - See the list of recommended libraries on the next page
- Build your own code for the experiment
  - Check the sample codes and tutorials of the libraries for a quick start
  - Add comments to explain the flow of your program
- Caution
  - Running GP may take a long time
Recommended Libraries for GP
- C++
  - GPLib:
- Java
  - JGAP:
  - ECJ:
- Matlab toolbox
  - GPLAB:
- More references
  - Implementations section in Wiki – Genetic Programming:
Reports Style
- English only!!
- Scientific journal style
  - How to Write A Paper in Scientific Journal Style and Format

Experimental process          | Section of Paper
What did I do in a nutshell?  | Abstract
What is the problem?          | Introduction
How did I solve the problem?  | Materials and Methods
What did I find out?          | Results
What does it mean?            | Discussion
Who helped me out?            | Acknowledgments (optional)
Whose work did I refer to?    | Literature Cited
Extra information             | Appendices (optional)
Report Contents (1/3)
- System description
  - Programming language and running environment used
- Result tables
- Analysis & discussion (Very Important!!)

Training  | Average ± SD | Best | Worst
Setting 1 | %            | %    | %
Setting 2 | %            | %    | %
Setting 3 | %            | %    | %

Your prediction | 1 | 2 | … | 16 | 17 | Equation
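One row of the training table can be filled from the accuracies of repeated runs under a single setting. This is a minimal sketch: the function name `summarize` is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption the slide does not settle.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Summarize the training accuracies (in %) of repeated runs of one setting."""
    return {
        "average": mean(accuracies),
        "sd": stdev(accuracies),  # sample standard deviation; needs >= 2 runs
        "best": max(accuracies),
        "worst": min(accuracies),
    }
```

For example, three runs with made-up accuracies of 80.0, 90.0 and 100.0 give an average of 90.0, an SD of 10.0, a best of 100.0 and a worst of 80.0. Because GP is stochastic, reporting several runs per setting is what makes the average and SD meaningful.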
Report Contents (2/3)
- Graphs
  - Average and maximum fitness versus generation
  - Tree size versus generation
Report Contents (3/3)
- Basic experiments
  - Changing the parameters for crossover and mutation
  - Various function sets: arithmetic, numerical
- Optional experiments
  - Various selection methods
  - Depth limitation
  - Population size, number of generations
  - Comparison to a neural network
  - …
- References
Submission Guide
- Due date: Nov. 19 (Wed) 18:00
- Submit both ‘hardcopy’ and ‘e-mail’
  - Hardcopy submission to the office ( )
  - E-mail submission to
    - Subject: [AI Project1 Report] Student number, Name
    - Report + your source code with comments + executable file(s)
- Length: the report should be summarized within 12 pages.
- We are NOT interested in accuracy or your programming skill, but in your creativity and research ability.
- If your major is not C.S., a team project with a C.S. major student is possible (use the class board to find your partner and notify the TA of your team information by Nov.
Marking Scheme
- 5 points for programming
- 5 points for result prediction
- 30 points for experiment & analysis
  - 15 pts for experiments, 15 pts for analysis
- 10 points for report
- Late work
  - -10% per day
  - Maximum 7 days
QnA
Test Data
Columns: X, Y, month, day, FFMC, DMC, DC, ISI, temp, RH, wind, rain
[Table: 17 rows of test data; values not available]