Biological Data Mining by Genetic Programming
AI Project #2
Biointelligence Lab
Cho, Dong-Yeon
© 2006 SNU CSE Biointelligence Lab

Project Purpose
  Medical diagnosis
    To predict the presence or absence of a disease given the results of various medical tests carried out on a patient
    Human experts (M.D.) vs. machine (GP)
  Two data sets
    Heart Disease
    Diabetes
Heart Disease
  Data description
    Number of patients: 270
      Absence: 150
      Presence: 120
    13 attributes
      age
      sex
      chest pain type (4 values)
      resting blood pressure
      serum cholesterol in mg/dl
      fasting blood sugar > 120 mg/dl
      resting electrocardiographic results (values 0, 1, 2)
      maximum heart rate achieved
      exercise-induced angina
      oldpeak = ST depression induced by exercise relative to rest
      the slope of the peak exercise ST segment
      number of major vessels (0-3) colored by fluoroscopy
      thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
Learning a Classifier
  GP settings
    Functions
      Numerical and conditional operators
        {+, -, *, /, exp, log, sin, cos, sqrt, iflte, ifltz, …}
      Some operators should be protected against illegal operations (e.g., division by zero).
    Terminals
      Input attributes and constants
        {x1, x2, …, x13, R}, where R ∈ [a, b]
    Additional parameters
      Threshold value
      For preprocessing (normalization)
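One possible way to protect the operators in the function set is sketched below. The fallback values (1.0 for division, 0.0 for log), the clamping range for exp, and the exact semantics of iflte ("if a <= b") and ifltz ("if a < 0") are assumptions made for illustration, not definitions taken from the slides.

```python
import math

EPS = 1e-9  # assumed tolerance for "too close to zero"

def protected_div(a, b):
    """Return a / b, or 1.0 (assumed fallback) when b is near zero."""
    return a / b if abs(b) > EPS else 1.0

def protected_log(a):
    """Return log|a|, or 0.0 (assumed fallback) when a is near zero."""
    return math.log(abs(a)) if abs(a) > EPS else 0.0

def protected_sqrt(a):
    """Return the square root of |a| so negative inputs are legal."""
    return math.sqrt(abs(a))

def protected_exp(a):
    """Clamp the argument to [-50, 50] so exp cannot overflow."""
    return math.exp(max(-50.0, min(50.0, a)))

def iflte(a, b, c, d):
    """Assumed semantics: return c if a <= b, otherwise d."""
    return c if a <= b else d

def ifltz(a, b, c):
    """Assumed semantics: return b if a < 0, otherwise c."""
    return b if a < 0 else c
```

With wrappers like these, any randomly generated tree evaluates to a finite number on every patient record.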
Cross Validation (1/3)
  K-fold cross validation
    The data set is randomly divided into k subsets.
    One of the k subsets is used as the test set, and the other k-1 subsets are put together to form a training set.
  [Figure: the data set split into six subsets D1 to D6 of 45 patients each; a different subset is held out as the test set in each fold]
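The k-fold split described above can be sketched as follows. The function name `k_fold_splits` and the fixed seed are illustrative choices; records are represented only by their indices.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random division of the data
    folds = [indices[i::k] for i in range(k)]  # k nearly equal subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For the heart disease data, `k_fold_splits(270, 6)` produces six folds with 45 test patients and 225 training patients each.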
Cross Validation (2/3)
  Confusion matrix for test data sets
    Number of patients = p + q + r + s
    Accuracy = (p + s) / (p + q + r + s)

    True / Predict   Positive   Negative
    Positive         p          q
    Negative         r          s
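A minimal sketch of counting the four confusion-matrix cells and computing accuracy. It assumes labels are coded 1 (positive) and 0 (negative) and that q counts true positives predicted negative while r counts true negatives predicted positive; the accuracy value does not depend on that choice.

```python
def confusion_counts(y_true, y_pred):
    """Return (p, q, r, s) for labels coded 1 = positive, 0 = negative."""
    p = q = r = s = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            p += 1   # positive, predicted positive
        elif truth == 1 and pred == 0:
            q += 1   # positive, predicted negative (assumed cell)
        elif truth == 0 and pred == 1:
            r += 1   # negative, predicted positive (assumed cell)
        else:
            s += 1   # negative, predicted negative
    return p, q, r, s

def accuracy(p, q, r, s):
    """Fraction of patients on the diagonal of the confusion matrix."""
    return (p + s) / (p + q + r + s)
```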
Cross Validation (3/3)
  Cross validation and confusion matrix
    At least 10 runs for your k value.
    Show the confusion matrix for the best result of your experiments.

    Run       Accuracy
    1
    2
    …
    10
    Average
Initialization
  The maximum initial depth of trees, D_max, is set.
  Full method (each branch has depth = D_max):
    nodes at depth d < D_max are randomly chosen from the function set F
    nodes at depth d = D_max are randomly chosen from the terminal set T
  Grow method (each branch has depth ≤ D_max):
    nodes at depth d < D_max are randomly chosen from F ∪ T
    nodes at depth d = D_max are randomly chosen from T
  Common GP initialisation: ramped half-and-half, where the grow and full methods each deliver half of the initial population
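The full and grow methods can be sketched as below, with trees stored as nested tuples. The small function and terminal sets are hypothetical, and ramping the depth limit from 2 up to D_max is the usual convention for ramped half-and-half, assumed here rather than stated on the slide.

```python
import random

# Hypothetical function set (name -> arity) and terminal set.
FUNCTIONS = {"+": 2, "-": 2, "*": 2, "sin": 1}
TERMINALS = ["x1", "x2", "x3"]

def full(depth, max_depth, rng):
    """Full method: functions above max_depth, terminals exactly at it."""
    if depth == max_depth:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    children = tuple(full(depth + 1, max_depth, rng)
                     for _ in range(FUNCTIONS[name]))
    return (name,) + children

def grow(depth, max_depth, rng):
    """Grow method: any symbol above max_depth, terminals at max_depth."""
    if depth == max_depth:
        return rng.choice(TERMINALS)
    symbol = rng.choice(sorted(FUNCTIONS) + TERMINALS)
    if symbol not in FUNCTIONS:
        return symbol  # a terminal ends this branch early
    children = tuple(grow(depth + 1, max_depth, rng)
                     for _ in range(FUNCTIONS[symbol]))
    return (symbol,) + children

def ramped_half_and_half(pop_size, max_depth, rng):
    """Alternate full/grow while cycling depth limits 2..max_depth (>= 2)."""
    population = []
    for i in range(pop_size):
        depth_limit = 2 + (i // 2) % (max_depth - 1)
        method = full if i % 2 == 0 else grow
        population.append(method(0, depth_limit, rng))
    return population

def leaf_depths(tree, depth=0):
    """Return the depth of every leaf; useful for checking a tree's shape."""
    if isinstance(tree, str):
        return [depth]
    depths = []
    for child in tree[1:]:
        depths.extend(leaf_depths(child, depth + 1))
    return depths
```

A tree such as `("+", "x1", ("sin", "x2"))` then stands for x1 + sin(x2).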
Fitness Function
  Maximization problem
    Number of correctly classified patients
  Minimization problem
    Number of incorrectly classified patients
    Mean squared error (N: number of training data)
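The three fitness measures might be computed as in the sketch below. It assumes the GP tree returns one real-valued output per patient, that labels are coded 0/1, and that an output at or above a threshold (0.5 here, an arbitrary choice) is read as "disease present"; the mean squared error is taken in its usual form, the average of the squared differences between label and output over the N training records.

```python
def classify(output, threshold=0.5):
    """Map a real-valued tree output to a 0/1 class (assumed rule)."""
    return 1 if output >= threshold else 0

def num_correct(outputs, labels, threshold=0.5):
    """Fitness to maximize: number of correctly classified patients."""
    return sum(classify(o, threshold) == y for o, y in zip(outputs, labels))

def num_incorrect(outputs, labels, threshold=0.5):
    """Fitness to minimize: number of incorrectly classified patients."""
    return len(labels) - num_correct(outputs, labels, threshold)

def mean_squared_error(outputs, labels):
    """Fitness to minimize: average squared error over N training records."""
    n = len(labels)
    return sum((y - o) ** 2 for o, y in zip(outputs, labels)) / n
```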
Selection (1/2)
  Fitness proportional (roulette wheel) selection
    The roulette wheel can be constructed as follows.
      Calculate the total fitness of the population.
      Calculate the selection probability p_k for each chromosome v_k: the fitness of v_k divided by the total fitness.
      Calculate the cumulative probability q_k for each chromosome v_k: q_k = p_1 + p_2 + … + p_k.
Procedure: Proportional_Selection
  Generate a random number r from the range [0, 1].
  If r ≤ q_1, then select the first chromosome v_1; otherwise, select the kth chromosome v_k (2 ≤ k ≤ pop_size) such that q_{k-1} < r ≤ q_k.
  [Figure labels: p_k, q_k]
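Roulette wheel construction and the Proportional_Selection procedure can be sketched as follows. The sketch assumes a maximization problem with non-negative fitness values and a positive total; indices are 0-based, so index 0 corresponds to the first chromosome.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def cumulative_probabilities(fitnesses):
    """Return the cumulative probabilities q_k built from p_k = f_k / total."""
    total = sum(fitnesses)
    probabilities = [f / total for f in fitnesses]
    return list(accumulate(probabilities))

def select_index(cumulative, r):
    """Return the first index k with r <= q_k (so q_{k-1} < r <= q_k)."""
    k = bisect_left(cumulative, r)
    # Guard against rounding that leaves the last q_k slightly below 1.
    return min(k, len(cumulative) - 1)

def proportional_selection(fitnesses, rng):
    """Spin the wheel once and return the index of the selected chromosome."""
    return select_index(cumulative_probabilities(fitnesses), rng.random())
```

With fitness values 1, 3 and 6 the cumulative probabilities are roughly 0.1, 0.4 and 1.0, so a draw of r = 0.25 selects the second chromosome.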
Selection (2/2)
  Tournament selection
    Tournament size q, 2 ≤ q ≤ POP_SIZE
  Ranking-based selection
    1 ≤ η+ ≤ 2 and η- = 2 - η+
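Both selection schemes can be sketched as below. The ranking part assumes the standard linear ranking scheme, in which the best individual receives an expected count eta_plus between 1 and 2, the worst receives eta_minus = 2 - eta_plus, and probabilities fall linearly in between; the parameter names are illustrative. The tournament part assumes a maximization problem and contestants drawn without replacement.

```python
import random

def tournament_select(fitnesses, q, rng):
    """Draw q distinct contestants and return the index of the fittest."""
    contestants = rng.sample(range(len(fitnesses)), q)
    return max(contestants, key=lambda i: fitnesses[i])

def linear_ranking_probabilities(pop_size, eta_plus=1.5):
    """Selection probability per rank (rank 0 = best), assuming pop_size >= 2."""
    eta_minus = 2.0 - eta_plus
    return [
        (eta_plus - (eta_plus - eta_minus) * rank / (pop_size - 1)) / pop_size
        for rank in range(pop_size)
    ]
```

A larger q or a larger eta_plus increases selection pressure; with q equal to the population size the best individual always wins.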
GP Flowchart
  [Figure: flowcharts of the GA loop and the GP loop]
Bloat
  Bloat = "survival of the fattest", i.e., the tree sizes in the population increase over time
  Ongoing research and debate about the reasons
  Needs countermeasures, e.g.
    Prohibiting variation operators that would deliver "too big" children
    Parsimony pressure: penalty for being oversized
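The two countermeasures might look like the sketch below for trees stored as nested tuples such as `("+", "x1", ("*", "x2", "x3"))`. The additive penalty form and the weight `alpha` are assumptions for illustration; other penalty shapes are equally possible.

```python
def tree_size(tree):
    """Count the nodes of a tree given as a string leaf or nested tuple."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(tree_size(child) for child in tree[1:])

def penalized_error(error, tree, alpha=0.01):
    """Parsimony pressure: add a size penalty to an error being minimized."""
    return error + alpha * tree_size(tree)

def exceeds_limit(tree, max_size):
    """Size check for rejecting 'too big' children after variation."""
    return tree_size(tree) > max_size
```

An offspring for which `exceeds_limit` is true can be discarded and replaced by one of its parents.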
Experiments
  Two problems
    Heart Disease
    Pima Indian diabetes
  Various experimental setups
    Termination condition: maximum_generation
    Various settings
      Effects of the penalty term
      Different function and terminal sets
      Selection methods and their parameters
      Crossover and mutation probabilities
Results
  For each problem
    Result table and your analysis
    Present the optimal classifier
    Draw a learning curve for the run where the best solution was found.
    Compare with the results of neural networks (optional).
    Different k for cross validation (optional)

                 Training                         Test
                 Average ± SD   Best   Worst      Average ± SD   Best   Worst
    Setting 1
    Setting 2
    Setting 3
[Figure: learning curve; x-axis: Generation, y-axis: Fitness (Error)]
References
  Source codes
    GP libraries (C, C++, JAVA, …)
    MATLAB toolbox
  Web sites
    e.html
    e.html
    …
Pay Attention!
  Due: Nov. 16, 2006
  Submission
    Source code and executable file(s)
      Proper comments in the source code
    Via
  Report: Hardcopy!!
    Running environments and libraries (or packages) that you used
    Results of many experiments with various parameter settings
    Analysis and explanation of the results in your own way