Medical Diagnosis via Genetic Programming
AI Project #2 Biointelligence lab Cho, Dong-Yeon
© 2006 SNU CSE Biointelligence Lab
Project Purpose
- Medical diagnosis: predict the presence or absence of a disease given the results of various medical tests carried out on a patient
  - Human experts (M.D.) vs. machine (GP)
- Two data sets
  - Heart Disease
  - Diabetes
Heart Disease Data Description
- Number of patients: 270
  - Absence: 150
  - Presence: 120
- 13 attributes
  1. age
  2. sex
  3. chest pain type (4 values)
  4. resting blood pressure
  5. serum cholesterol in mg/dl
  6. fasting blood sugar > 120 mg/dl
  7. resting electrocardiographic results (values 0, 1, 2)
  8. maximum heart rate achieved
  9. exercise-induced angina
  10. oldpeak = ST depression induced by exercise relative to rest
  11. the slope of the peak exercise ST segment
  12. number of major vessels (0-3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
Learning a Classifier
- GP settings
  - Functions
    - Numerical and conditional operators: {+, -, *, /, exp, log, sin, cos, sqrt, iflte, ifltz, …}
    - Some operators must be protected against illegal operations (for example, division by zero).
  - Terminals
    - Input attributes and constants: {x1, x2, …, x13, R}, where R ∈ [a, b]
  - Additional parameters
    - Threshold value
    - For preprocessing (normalization)
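Protected operators can be sketched as follows. This is a minimal illustration, not the project's required implementation: the fallback values (1.0 for division by zero, 0.0 for the logarithm of zero), the clamping range for exp, and the argument conventions chosen for iflte and ifltz are all assumptions.

```python
import math

EPS = 1e-9  # assumed tolerance for "too close to zero"

def protected_div(a, b):
    """Divide a by b, returning 1.0 (assumed fallback) if b is near zero."""
    return a / b if abs(b) > EPS else 1.0

def protected_log(a):
    """Log of |a|, returning 0.0 (assumed fallback) if a is near zero."""
    return math.log(abs(a)) if abs(a) > EPS else 0.0

def protected_sqrt(a):
    """Square root of |a|, so negative inputs never raise an error."""
    return math.sqrt(abs(a))

def protected_exp(a):
    """Exponential with the argument clamped to [-50, 50] to avoid overflow."""
    return math.exp(max(-50.0, min(a, 50.0)))

def iflte(a, b, c, d):
    """Assumed convention: return c if a <= b, otherwise d."""
    return c if a <= b else d

def ifltz(a, b, c):
    """Assumed convention: return b if a < 0, otherwise c."""
    return b if a < 0 else c
```

Any consistent fallback value works, as long as every tree evaluation returns a finite number.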
Cross Validation (1/3)
- K-fold cross validation
  - The data set is randomly divided into k subsets.
  - One of the k subsets is used as the test set, and the other k-1 subsets are put together to form a training set.
- [Figure: example with k = 6 and 45 patients per subset (D1 to D6); each subset in turn is held out as the test set while the other five form the training set.]
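The split described above can be sketched in a few lines. The function name k_fold_splits and the seed parameter are hypothetical; the sketch works on patient indices rather than on the data itself.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random division of the data set
    folds = [indices[i::k] for i in range(k)]  # k subsets of near-equal size
    for i in range(k):
        test = folds[i]
        # The other k-1 subsets are put together to form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 270 patients and k = 6, each test set holds 45 patients and each training set 225.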
Cross Validation (2/3)
- Confusion matrix for test data sets

                       True Positive   True Negative
  Predicted Positive         p               q
  Predicted Negative         r               s

- Number of patients = p + q + r + s
- Accuracy = (p + s) / (p + q + r + s)
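A small sketch of how the four counts and the accuracy could be computed. It assumes labels are encoded as 1 (disease present) and 0 (absent), and that p and s are the correctly classified cells (true positives and true negatives) while q and r are false positives and false negatives; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (p, q, r, s) for binary labels encoded as 1 and 0."""
    p = sum(1 for t, y in zip(y_true, y_pred) if t == 1 and y == 1)  # true positives
    q = sum(1 for t, y in zip(y_true, y_pred) if t == 0 and y == 1)  # false positives
    r = sum(1 for t, y in zip(y_true, y_pred) if t == 1 and y == 0)  # false negatives
    s = sum(1 for t, y in zip(y_true, y_pred) if t == 0 and y == 0)  # true negatives
    return p, q, r, s

def accuracy(p, q, r, s):
    """Fraction of correctly classified patients."""
    return (p + s) / (p + q + r + s)
```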
Cross Validation (3/3)
- Cross validation and confusion matrix
  - Perform at least 10 runs for your k value.
  - Show the confusion matrix for the best result of your experiments.
- Report table: one row per run (1, 2, ..., 10) plus an Average row, with the accuracy of each run.
Initialization
- A maximum initial tree depth Dmax is set.
- Full method (each branch has depth = Dmax):
  - nodes at depth d < Dmax are randomly chosen from the function set F
  - nodes at depth d = Dmax are randomly chosen from the terminal set T
- Grow method (each branch has depth ≤ Dmax):
  - nodes at depth d < Dmax are randomly chosen from F ∪ T
  - nodes at depth d = Dmax are randomly chosen from T
- Common GP initialization: ramped half-and-half, where the grow and full methods each deliver half of the initial population
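The three initialization schemes can be sketched as below. Trees are represented as nested tuples, and the small function and terminal sets are hypothetical stand-ins for the project's real ones. Ramping the depth limit over 2..Dmax is the usual reading of "ramped" and is an assumption here, since the slide states only the half-and-half split.

```python
import random

# Hypothetical function set (name -> arity) and terminal set for illustration.
FUNCTIONS = {"+": 2, "-": 2, "*": 2, "/": 2}
TERMINALS = ["x1", "x2", "x3", "R"]

def make_tree(max_depth, method, rng, depth=0):
    """Build a random tree with the 'full' or 'grow' method."""
    if depth == max_depth:
        return rng.choice(TERMINALS)               # d = Dmax: terminals only
    if method == "grow":
        node = rng.choice(list(FUNCTIONS) + TERMINALS)  # d < Dmax: F ∪ T
        if node in TERMINALS:
            return node
    else:
        node = rng.choice(list(FUNCTIONS))         # full: d < Dmax uses F only
    children = [make_tree(max_depth, method, rng, depth + 1)
                for _ in range(FUNCTIONS[node])]
    return (node, *children)

def tree_depth(tree):
    """Depth of a tree; a lone terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

def ramped_half_and_half(pop_size, max_depth, rng):
    """Alternate full and grow while cycling the depth limit over 2..max_depth."""
    depths = list(range(2, max_depth + 1))
    population = []
    for i in range(pop_size):
        method = "full" if i % 2 == 0 else "grow"
        depth = depths[(i // 2) % len(depths)]
        population.append(make_tree(depth, method, rng))
    return population
```

For example, `ramped_half_and_half(20, 4, random.Random(1))` returns 20 trees, 10 built with each method.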
Fitness Function
- Maximization problem
  - Number of correctly classified patients
- Minimization problem
  - Number of incorrectly classified patients
  - Mean squared error (N: number of training data)
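The three fitness measures can be sketched as follows. The sketch assumes each GP tree produces one numeric output per patient, that an output above a threshold means "disease present" (label 1), and that the mean squared error is taken between the raw outputs and the 0/1 labels; the function names and the default threshold of 0.5 are hypothetical.

```python
def classify(output, threshold=0.5):
    """Map a raw tree output to a class label using a threshold."""
    return 1 if output > threshold else 0

def count_correct(outputs, labels, threshold=0.5):
    """Fitness to maximize: number of correctly classified patients."""
    return sum(1 for o, y in zip(outputs, labels) if classify(o, threshold) == y)

def count_errors(outputs, labels, threshold=0.5):
    """Fitness to minimize: number of incorrectly classified patients."""
    return len(labels) - count_correct(outputs, labels, threshold)

def mean_squared_error(outputs, labels):
    """Fitness to minimize: average squared difference over the N training cases."""
    n = len(labels)
    return sum((y - o) ** 2 for o, y in zip(outputs, labels)) / n
```

The error count changes only when a prediction crosses the threshold, whereas the mean squared error also rewards outputs that move closer to the target.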
Selection (1/2)
- Fitness proportional (roulette wheel) selection
- The roulette wheel can be constructed as follows.
  1. Calculate the total fitness for the population (the sum of the fitness values of all chromosomes).
  2. Calculate the selection probability pk for each chromosome vk (its fitness divided by the total fitness).
  3. Calculate the cumulative probability qk for each chromosome vk (the sum of p1 through pk).
Procedure: Proportional_Selection
- Generate a random number r from the range [0, 1].
- If r ≤ q1, select the first chromosome v1; otherwise, select the k-th chromosome vk (2 ≤ k ≤ pop_size) such that q(k-1) < r ≤ qk.
- [Figure: table of pk and qk for chromosomes 1 to 10]
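The wheel construction and the selection procedure above can be sketched as follows. The sketch assumes a maximization problem with non-negative fitness values and a positive total; the names build_wheel and spin are hypothetical, and indices are 0-based, so index 0 corresponds to v1.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def build_wheel(fitnesses):
    """Return (selection probabilities pk, cumulative probabilities qk)."""
    total = sum(fitnesses)                   # total fitness of the population
    probs = [f / total for f in fitnesses]   # pk
    cumulative = list(accumulate(probs))     # qk
    cumulative[-1] = 1.0                     # guard against rounding error
    return probs, cumulative

def spin(cumulative, r):
    """Return the 0-based index k of the first qk with r <= qk."""
    return bisect_left(cumulative, r)

def proportional_selection(population, fitnesses, rng):
    """Select one chromosome with probability proportional to its fitness."""
    _, cumulative = build_wheel(fitnesses)
    return population[spin(cumulative, rng.random())]
```

The binary search gives the same result as scanning q1, q2, ... in order, but in logarithmic time per selection.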
Selection (2/2)
- Tournament selection
  - Tournament size q
- Ranking-based selection
  - Parameter range: 2 to POP_SIZE
  - 1 ≤ η+ ≤ 2 and η− = 2 − η+
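Both schemes can be sketched briefly. The slide gives only the parameters, so the details are assumptions: the tournament draws q distinct individuals and keeps the fittest (maximization), and the ranking probabilities use the common linear ranking formula in which the best individual gets weight eta_plus, the worst gets eta_minus = 2 - eta_plus, and ranks in between are interpolated linearly. The names eta_plus and eta_minus are hypothetical.

```python
import random

def tournament_select(population, fitnesses, q, rng):
    """Pick q distinct individuals at random and return the fittest one."""
    contenders = rng.sample(range(len(population)), q)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]

def linear_ranking_probs(n, eta_plus=1.5):
    """Selection probability per rank (index 0 = best) for n >= 2 individuals.

    Assumed formula: p_i = (eta_plus - (eta_plus - eta_minus) * i / (n - 1)) / n,
    with 1 <= eta_plus <= 2 and eta_minus = 2 - eta_plus.
    """
    eta_minus = 2.0 - eta_plus
    return [(eta_plus - (eta_plus - eta_minus) * i / (n - 1)) / n
            for i in range(n)]
```

A larger q or eta_plus raises the selection pressure; eta_plus = 1 makes all ranks equally likely.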
GP Flowchart
- [Figure: flowcharts of the GA loop and the GP loop]
Bloat
- Bloat = "survival of the fattest", i.e., the tree sizes in the population increase over time
- Ongoing research and debate about the reasons
- Needs countermeasures, e.g.
  - Prohibiting variation operators that would deliver "too big" children
  - Parsimony pressure: penalty for being oversized
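Both countermeasures can be sketched for trees stored as nested tuples. The linear penalty, the weight alpha, and the size limit are assumptions chosen for illustration; the slide does not prescribe a particular penalty form.

```python
def tree_size(tree):
    """Count the nodes of a tree given as nested tuples (function, child, ...)."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(tree_size(child) for child in tree[1:])

def penalized_error(error, tree, alpha=0.01):
    """Parsimony pressure: add a size penalty to an error that is minimized."""
    return error + alpha * tree_size(tree)

def within_size_limit(tree, max_size):
    """Size check for rejecting children that a variation operator made too big."""
    return tree_size(tree) <= max_size
```

For the tree `("+", "x1", ("*", "x2", "x3"))`, which has 5 nodes, `penalized_error(10, tree, alpha=0.5)` gives 12.5.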
Experiments
- Two problems
  - Heart Disease
  - Pima Indian diabetes
- Various experimental setups
  - Termination condition: maximum_generation
  - Various settings
    - Effects of the penalty term
    - Different function and terminal sets
    - Selection methods and their parameters
    - Crossover and mutation probabilities
Results
- For each problem
  - Result table and your analysis
  - Present the optimal classifier.
  - Draw a learning curve for the run where the best solution was found.
  - Compare with the results of neural networks (optional).
  - Different k for cross validation (optional)
- Result table: for each setting (Setting 1, Setting 2, Setting 3), report the Average, SD, Best, and Worst for both the training and the test data.
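The four summary statistics for one setting can be computed from the per-run accuracies as sketched below. The function name summarize is hypothetical, and using the sample standard deviation (n - 1 denominator) is an assumption, since the slide does not say which variant to report.

```python
import statistics

def summarize(accuracies):
    """Summarize per-run accuracies (at least two runs) for one setting."""
    return {
        "average": statistics.mean(accuracies),
        "sd": statistics.stdev(accuracies),  # sample standard deviation
        "best": max(accuracies),
        "worst": min(accuracies),
    }
```

Calling it once on the training accuracies and once on the test accuracies of a setting fills one row of the table.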
[Figure: example learning curve, with generation on the horizontal axis and fitness (error) on the vertical axis]
References
- Source codes
  - GP libraries (C, C++, Java, …)
  - MATLAB toolbox
- Web sites
  - …
Pay Attention!
- Due: May 11, 2006
- Submission
  - Source code and executable file(s)
    - Proper comments in the source code
    - Via
  - Report: hardcopy!!
    - Running environments and libraries (or packages) which you used
    - Results for many experiments with various parameter settings
    - Analysis and explanation of the results in your own way