Breeding Decision Trees Using Evolutionary Techniques
Written by Athanasios Papagelis and Dimitris Kalles
Presented by Alexandre Temporel
Abstract
The idea is to use evolutionary techniques to evolve decision trees. Can the learner:
- search simple and complex hypothesis spaces efficiently?
- discover conditionally dependent attributes?
- ignore irrelevant attributes?
Tests use standard concept-learning tasks, and results are compared to two well-known algorithms:
- C4.5 (Quinlan, 1993)
- OneR (Holte, 1993)
GOAL: demonstrate the potential advantages of these evolutionary techniques compared to other classifiers.
Outline
1) Problem statement
2) Construction of the GATree system
   - Operators
   - Fitness function
   - Advanced features
3) Experiments
   - 1st exp: searching simple and complex hypothesis spaces efficiently
   - 2nd exp: conditionally dependent attributes; irrelevant attributes
   - 3rd exp: searching target concepts on standard databases
4) Discussion of GATree's search type
5) Conclusion
Related work
GAs have widely been used for classification and concept-learning tasks:
- J. Bala, J. Huang and H. Vafaie (1995), "Hybrid Learning Using Genetic Algorithms and Decision Trees for Pattern Classification", School of Information Technology and Engineering
Work on their ability as a tool to evolve decision trees:
- Nathan C. Burnett (May 2001), "Evolutionary Induction of Decision Trees"
- Kenneth A. De Jong, William M. Spears and Diana F. Gordon, "Using Genetic Algorithms for Concept Learning", Naval Research Laboratory
- John R. Koza, "Concept Formation and Decision Tree Induction Using the Genetic Programming Paradigm"
- Martijn C. J. Bot and William B. Langdon, "Application of Genetic Programming to Induction of Linear Classification Trees"
- H. Kennedy, C. Chinniah, P. Bradbeer and L. Morss, "The Construction and Evaluation of Decision Trees: a Comparison of Evolutionary and Concept Learning Methods", Napier University
Problem statement (1/2)
Also, the presence of irrelevant attributes in a data set can:
- mislead the impurity functions
- produce bigger, less comprehensible, lower-performance trees
Using evolutionary techniques allows us to:
- overcome the use of greedy heuristics
- search the hypothesis space in a natural way
- evolve simple, accurate and robust decision trees
Problem statement (2/2)
Decision trees are used in many domains (e.g. pattern classification).
Finding the optimal tree is proven NP-complete (Murthy, 1998).
Current inductive learning algorithms use:
- information gain, gain ratio (Quinlan, 1986)
- Gini index (Breiman, 1984)
Assumption: attributes are conditionally independent
» Poor performance on data sets with strong conditional dependence
GATree system (1/3): Operators
The GATree program uses GAlib (Wall, 1996), a robust C++ library of genetic algorithm components.
- Initial population: minimal binary decision trees
- Crossover operator
- Mutation operator
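As a rough illustration of how such operators can act on tree-shaped chromosomes, here is a minimal sketch. The `Node` class and operator details are hypothetical, not the actual GAlib/GATree implementation: crossover swaps randomly chosen subtrees between two parents, and mutation replaces the test at a random internal node.

```python
import random

class Node:
    """A binary decision-tree node: internal nodes carry a test, leaves a class label."""
    def __init__(self, test=None, left=None, right=None, label=None):
        self.test, self.left, self.right, self.label = test, left, right, label

    def is_leaf(self):
        return self.label is not None

def internal_nodes(tree):
    """Collect all internal nodes, depth-first."""
    if tree is None or tree.is_leaf():
        return []
    return [tree] + internal_nodes(tree.left) + internal_nodes(tree.right)

def crossover(a, b, rng=random):
    """Swap two randomly chosen subtrees between parents a and b, in place."""
    na, nb = rng.choice(internal_nodes(a)), rng.choice(internal_nodes(b))
    na.test, nb.test = nb.test, na.test
    na.left, nb.left = nb.left, na.left
    na.right, nb.right = nb.right, na.right

def mutate(tree, tests, rng=random):
    """Replace the test of a randomly chosen internal node with a random new test."""
    rng.choice(internal_nodes(tree)).test = rng.choice(tests)
```

Swapping the node's contents rather than parent pointers keeps both chromosomes valid binary trees after every operation, which is one reason a natural tree encoding is convenient here.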
GATree system (2/3): Fitness function
The size factor x is a constant (e.g. x = 1000).
If x is small, fitness decreases quickly for bigger trees.
» SMALLER TREES -> BETTER FITNESS
If x is large, we search a bigger search space.
» ONLY ACCURACY REALLY MATTERS
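A sketch of a size-penalised fitness of the kind described on this slide. The exact form, accuracy scaled by x / (size² + x), is an illustrative assumption chosen to reproduce the two behaviours above, not necessarily the paper's precise formula:

```python
def fitness(accuracy, size, x=1000.0):
    """Size-penalised fitness: accuracy scaled by x / (size**2 + x).

    Small x -> the size penalty dominates: smaller trees win.
    Large x -> the size term vanishes: only accuracy matters.
    """
    return accuracy * x / (size ** 2 + x)
```

With x = 1000, a 5-node tree at 90% accuracy scores well above a 50-node tree at the same accuracy, which is exactly the bias toward small trees the slide describes.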
GATree system (3/3): Advanced features
Overcrowding problem (Goldberg):
- use of a scaled payoff function (which reduces the fitness of similar trees in a population)
Use of alternative crossover and mutation operators:
- more accurate sub-trees have less chance of being selected for crossover or mutation
To speed up evolution, use of:
- Limited Error Fitness (LEF) (Gathercole & Ross, 1997): CPU time saving with insignificant accuracy losses
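Limited Error Fitness can be sketched as an early abort during fitness evaluation: once an individual has made more errors than a preset limit, stop scoring it and give it the worst fitness. The function and parameter names (`tree_predict`, `examples`, `error_limit`) are placeholders for illustration:

```python
def lef_accuracy(tree_predict, examples, error_limit):
    """Limited Error Fitness: stop evaluating once errors exceed error_limit.

    Hopeless individuals get the worst score without scanning the whole
    training set, which is where the CPU time saving comes from.
    """
    errors = correct = 0
    for x, y in examples:
        if tree_predict(x) == y:
            correct += 1
        else:
            errors += 1
            if errors > error_limit:
                return 0.0  # abort early: mark this individual as poor
    return correct / len(examples)
```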
Experiments: foreword (1/2)
1st experiment: we use DataGen (Melli, 1999) to generate artificial data sets using random rules (to ensure complexity variety). The goal is to reconstruct the underlying knowledge.
2nd experiment: we use more or less complicated target concepts (XOR, parity, ...) and see how GATree performs against C4.5.
3rd experiment: we use WEKA (Witten & Frank, 2000) to test the C4.5 and OneR algorithms.
==> Use of 5-fold cross-validation
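The 5-fold cross-validation protocol used here can be sketched as follows; `train` and `accuracy` are placeholder callables standing in for whatever learner and scoring function are being evaluated:

```python
import random

def cross_validate(data, train, accuracy, k=5, seed=0):
    """k-fold cross-validation: train on k-1 folds, test on the held-out
    fold, and return the mean test score over all k rotations."""
    data = data[:]                        # don't shuffle the caller's list
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(accuracy(model, held_out))
    return sum(scores) / k
```

Each example is tested exactly once, so the averaged score uses all the data while never testing on training examples.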
Experiments: foreword (2/2)
Evolution type: generational
Init. population: 200
Generations: 800
Generation gap: 25%
Mutation prob.:
Crossover prob.: 0.93
Size factor: 1000
Random seed:
1st exp: Simple concept
3 rules to extract:
(31.0%) c1 <- B=(f or g or j) & C=(a or g or j)
(28.0%) c2 <- C=(b or e)
(41.0%) c3 <- B=(b or i) & C=(d or i)
[Plot: fitness-accuracy and size of the best individual vs. number of generations]
1st exp: Complex concept
8 rules to extract:
(15.0%) c1 <- B=(f or l or s or w) & C=(c or e or f or k)
(14.0%) c2 <- A=(a or b or t) & B=(a or h or q or k)
Etc....
[Plot: fitness-accuracy and size of the best individual vs. number of generations]
2nd exp (irrelevant attribute)
A1  A2  A3  Class
T   F   T   T
T   F   F   T
F   T   F   T
F   T   T   T
F   F   T   F
F   F   F   F
T   T   T   F
T   T   F   F
Easy concept (Class = A1 xor A2; A3 is irrelevant), but C4.5 falsely estimates the contribution of A3.
GATree produces the optimal solution.
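The XOR table above shows why greedy splitting goes wrong: with Class = A1 xor A2, the information gain of every single attribute, including the irrelevant A3, is zero, so an impurity-based learner has no basis for a correct first split. A small check (encoding T/F as 1/0):

```python
from math import log2
from collections import Counter

rows = [  # (A1, A2, A3, Class) -- the truth table above, Class = A1 xor A2
    (1, 0, 1, 1), (1, 0, 0, 1), (0, 1, 0, 1), (0, 1, 1, 1),
    (0, 0, 1, 0), (0, 0, 0, 0), (1, 1, 1, 0), (1, 1, 0, 0),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Entropy of the class minus its expected entropy after splitting on attr."""
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for v in {r[attr] for r in rows}:
        subset = [r[-1] for r in rows if r[attr] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

print([info_gain(rows, a) for a in range(3)])  # prints [0.0, 0.0, 0.0]
```

All three gains are zero, so the relevant A1 and A2 look no better than the irrelevant A3 to a one-step-ahead heuristic; only a less greedy, global search can recover the concept.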
2nd exp (conditionally dependent attributes)
Name  Attribs  Class function                              Noise            Instances  Random attribs
XOR1  10       (A1 xor A2) or (A3 xor A4)                  No               100        6
XOR2  10       (A1 xor A2) xor (A3 xor A4)                 No               100        6
XOR3  10       (A1 xor A2) or (A3 and A4) or (A5 and A6)   10% class error  100        4
Par1  10       3-attribute parity problem                  No               100        7
Par2  10       4-attribute parity problem                  No               100        6
Greedy heuristics perform poorly when attributes are conditionally dependent.
GATree clearly outperformed C4.5 on these concepts.
GATree can be disturbed by class noise.
        C4.5       GATree
XOR1    67±        ±0
XOR2    53±        ±17.32
XOR3    79±6.52    78±8.37
Par1    70±        ±0
Par2    63±6.71    85±7.91
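Artificial data sets like Par1 above (an n-bit parity function plus extra random, irrelevant attributes, with optional class noise) could be generated along these lines; the function name and parameters are illustrative, not the generator actually used in the paper:

```python
import random

def make_parity(n_instances, n_parity, n_random, noise=0.0, seed=0):
    """Generate a parity data set: label = parity of the first n_parity bits;
    the remaining n_random bits are irrelevant; `noise` flips labels at random."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_instances):
        bits = [rng.randint(0, 1) for _ in range(n_parity + n_random)]
        label = sum(bits[:n_parity]) % 2      # parity of the relevant bits
        if rng.random() < noise:              # optional class noise
            label = 1 - label
        data.append((bits, label))
    return data

data = make_parity(100, 3, 7)  # Par1-style: 3-bit parity, 7 random attributes
```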
3rd exp (results on standard sets)
               Accuracy                       Size
Dataset        C4.5      OneR     GATree   C4.5   GATree
Colic          83.84±
Heart-Statlog  74.44±
Diabetes       66.27±
Credit         83.77±
Hepatitis      77.42±
Iris           92±
Labor          85.26±
Lymph          65.52±
Breast-Cancer  71.93±
Zoo            90±
Vote           96.09±    ±
Glass          55.24±
Balance-Scale  78.24±
AVERAGES
3rd exp (observations)
GATree performs as well as, or slightly better than, C4.5 on the standard data sets.
But the trees produced by GATree are about 6 times smaller than those produced by C4.5.
OneR is good for noisy data sets but performs substantially worse overall.
Discussion on the search type of GATree
GATree adopts a less greedy strategy than other learners.
It tries to minimise the size of the tree and maximise the accuracy.
GATree is neither a hill-climbing nor an exhaustive searcher; it behaves more like a beam search (balancing exploration and exploitation).
However, when tuned properly, these searches share the same characteristics.
Conclusion
The hypotheses derived by standard algorithms can deviate substantially from the optimum:
- due to their greedy strategy
- this can be addressed by using global metrics of tree quality
Compared to greedy induction, GATree produces trees that are:
- accurate
- small in size
- comprehensible
GAs adapt themselves dynamically to a variety of different target concepts.
Encoding
It is important to select the proper encoding for a problem. Encoding maps the problem being solved onto a representation space:
Value encoding
- chromosomes are strings of values; chromosome A:
Binary encoding
- chromosomes are strings of 0s and 1s; they can look like this: chromosome A:
Tree encoding
- GAs may also be used for program design and construction; in that case chromosome genes represent programming-language commands, mathematical operations and other components of a program.
We use a natural representation of the search space: actual decision trees, not binary strings.
Bias
Without bias, we have no basis for classifying unseen examples; the best we can do is:
- memorise the training samples
- classify new examples at random
The problem with a biased algorithm is that we shrink the hypothesis space and might miss the optimal hypothesis.
Preference bias is based on the learner's behaviour; procedural bias is based on the learner's design.
C4.5 is biased towards small and accurate trees (preference bias) but uses the gain-ratio metric and minimum-error pruning (procedural bias):
- desirable when it determines the characteristics of the produced tree
- if inadequate, it may affect the quality of the output
GAs have a new, weak procedural bias:
- they consider a relatively large number of hypotheses in a relatively efficient manner
- they employ global metrics of tree quality: a set of minimum numerical performance measurements related to a goal, the FITNESS