Intelligent Systems
On the efficacy of Occam's Razor as a model selection criterion for classification learning
Geoff Webb, Monash University
Occam's razor
Principle of parsimony: "Non sunt multiplicanda entia praeter necessitatem" (entities are not to be multiplied beyond necessity).
Modern interpretation: of multiple explanations that are equal in all other respects, prefer the least complex.
Pervasive in Western thought; frequently invoked in machine learning.
Some observations
Not propositional!
"Complex" can mean many things (Bunge, 1963):
- syntactic: number of words or other syntactic elements required to express the theory
- semantic: complexity of the meaning of the theory / number of presuppositions it requires
- epistemological: number of transcendent terms required by the theory
- pragmatic: complexity of applying the theory
The Occam Thesis
Blumer, Ehrenfeucht, Haussler and Warmuth (1987): to wield Occam's razor is to adopt the goal of discovering "the simplest hypothesis that is consistent with the sample data" in the expectation that the simplest hypothesis will "perform well on further observations taken from the same source".
Quinlan (1986): "Given a choice between two decision trees, each of which is correct over the training set, it seems sensible to prefer the simpler one on the grounds that it is more likely to capture structure inherent in the problem. The simpler tree would therefore be expected to classify correctly more objects outside the training set."
My personal de-Occamization
In the early nineties I lost faith in the Occam Thesis:
- I developed a rule learner that found substantially simpler rule sets but did not improve accuracy
- it worked with specific-to-general search, and hence was open to finding complex variants of rules
- it worked with disjunctive rules
Objections to the Occam Thesis
- There is no theoretical relationship between syntactic complexity and classifier accuracy.
- Equivalent classifiers expressed in different languages will have different levels of complexity.
- It is only possible to judge a selection criterion in the context of a performance objective.
- Conservation law of generalisation performance: there are no universal learning biases.
How to convince the community? Logic didn't work.
Murphy and Pazzani (1994): for a number of classification learning tasks, the simplest consistent decision trees have lower predictive accuracy than slightly more complex consistent trees
- but the most accurate trees were close to the simplest
- and had the same complexity as the 'true' class!
Boosting and bagging; Bayesian averaging of many simple models.
What about …
A systematic process for adding complexity to the dominant model (decision trees) while improving accuracy, without changing resubstitution performance!
Decision tree grafting
Outline
- Take a decision tree formed by conventional learning
- Look for regions of instance space that are not occupied by training examples
- Look for evidence supporting a change in class
- Graft tests and leaves that reclassify the regions appropriately
- To maximize the likelihood of improving performance, select only the best such graft for each leaf
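As a rough sketch of the final step, assuming candidate grafts for a leaf have already been enumerated as hypothetical (test, class, correct, total) tuples, keeping only the best graft per leaf might look like this; the tuple representation is an illustrative assumption, not the actual C4.5X data structure:

```python
def laplace(correct: int, total: int) -> float:
    # Laplace accuracy estimate, introduced later in the talk.
    return (correct + 1) / (total + 2)

def best_graft(candidates):
    """Select the single most promising graft for one leaf.

    `candidates` is a hypothetical list of (test, new_class, correct,
    total) tuples: `test` carves an empty region out of the leaf and
    reclassifies it as `new_class`, while correct/total count the
    training instances covered by the supporting cut projected across
    the region.  Returns None when the leaf has no candidates.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: laplace(c[2], c[3]))
```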
Example Instance Space
Guess the Class
C4.5's Partitions
Evidence supporting a change of class?
During learning there will often be multiple potential cuts, of which one is selected on a (fairly) arbitrary basis.
Look for how such a cut would have projected across the empty region, and the evidence it would have provided for a different classification.
Alternative cuts at root
Evidence for alternative classifications
Use the Laplace accuracy estimate for the alternative leaves that project through the empty region:
Laplace = (correct + 1) / (total + 2)
(4+1)/(5+2) vs (9+1)/(9+2)
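The Laplace estimate is simple enough to state directly in code; a minimal sketch, using the slide's two example leaves (4 of 5 instances supporting one alternative versus 9 of 9 supporting the other):

```python
def laplace_accuracy(correct: int, total: int) -> float:
    """Laplace accuracy estimate for a binary-class leaf:
    (correct + 1) / (total + 2)."""
    return (correct + 1) / (total + 2)

# The two alternative leaves from the slide.
weak = laplace_accuracy(4, 5)    # (4+1)/(5+2) = 5/7
strong = laplace_accuracy(9, 9)  # (9+1)/(9+2) = 10/11
```

The +1/+2 correction keeps a leaf covering few instances from claiming perfect accuracy, so the 9-of-9 alternative outscores the 4-of-5 one.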
Algorithm
Visit each leaf in turn
Consider each ancestor
Consider each cut that projects across empty regions of the leaf
[figure: candidate cuts A≤3 … A≤7, A>2 … A>6, B≤1 … B≤10, B>0 … B>4]
Next ancestor
Root
A stronger cut, which is selected in preference to the weaker
Final tree
Features
- All new partitions define regions with volume > zero containing no objects from the training set.
- New cuts are not simple duplications of existing cuts at ancestor nodes.
- Every modification adds non-redundant complexity to the tree.
Experiments
- 100 × 80%/20% holdout evaluations
- All 11 locally held UCI data sets containing continuous attributes
- 2 variants of hypothyroid subsequently added to examine why its results differed from the rest
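The repeated-holdout protocol can be sketched generically; everything here (the learner interface, the majority-class stand-in for a real tree learner) is illustrative, not the actual C4.5 setup:

```python
import random

def holdout_accuracy(data, learn, trials=100, test_frac=0.2, seed=0):
    """Mean accuracy over repeated random 80%/20% holdout splits.

    `data` is a list of (features, label) pairs; `learn` takes a
    training list and returns a predict(features) -> label function.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(trials):
        shuffled = list(data)
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_frac))
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = learn(train)
        accuracies.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(accuracies) / len(accuracies)

def majority_learner(train):
    # Trivial stand-in learner: always predict the most common class.
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top
```

On a data set that is 90% one class, the majority learner's repeated-holdout accuracy comes out near the 90% default accuracy, which is the baseline the UCI table reports.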
UCI data sets used for experimentation
[table: columns are number of attributes, % continuous, % missing, number of objects, default accuracy %, and number of classes]
Data sets: breast cancer Wisconsin, Cleveland heart disease, credit rating, discordant results, echocardiogram, glass type, hepatitis, Hungarian heart disease, hypothyroid, iris, new thyroid, Pima indians diabetes, sick euthyroid
Percentage predictive accuracy for unpruned decision trees
[table: C4.5 vs C4.5X accuracy with t and p. C4.5 means: breast cancer Wisconsin 94.1, Cleveland heart disease 72.8, credit rating 82.2, discordant results 98.6, echocardiogram 72.0, glass type 74.0, hepatitis 79.6, Hungarian heart disease 77.0, hypothyroid 99.5, iris 95.4, new thyroid 89.9, Pima indians diabetes 70.2, sick euthyroid 98.7]
Percentage accuracy for pruned decision trees
[table: C4.5 vs C4.5X accuracy with t and p. C4.5 means: breast cancer Wisconsin 95.1, Cleveland heart disease 74.1, credit rating 84.1, discordant results 98.8, echocardiogram 74.2, glass type 74.4, hepatitis 79.9, Hungarian heart disease 79.2, hypothyroid 99.5, iris 95.4, new thyroid 89.6, Pima indians diabetes 72.2, sick euthyroid 98.7]
Size of pruned trees
[table: C4.5 vs C4.5X node counts. C4.5 means: breast cancer Wisconsin 19.2, Cleveland heart disease 44.6, credit rating 51.2, discordant results 24.9, echocardiogram 10.4, glass type 36.6, hepatitis 13.7, Hungarian heart disease 26.8, hypothyroid 23.6, iris 8.2, new thyroid 14.1, Pima indians diabetes 112.0, sick euthyroid 46.5]
Vindication!
- Substantial increases in complexity
- No change in performance on training data
- Accuracy increased significantly more often than not
Hey, this might actually be useful! (IJCAI-97)
- Allow grafts to correct misclassifications
- Also graft discrete-valued attributes
- Add all grafts that pass a significance test
- Graft onto empty nodes by treating them as if occupied by items at the parent
Example
Summary of results
- Substantial increase in complexity
- Small increase in accuracy
- Prune+graft is more effective than graft alone
All-tests-but-one-partition (ATBOP)
- The original approach is computationally expensive: it must consider every value of every attribute for every ancestor of every leaf
- Instead, form a single partition and test grafts within it
- The partition contains all training instances that fail no more than one test on the path to the leaf
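Membership in the ATBOP reduces to a simple predicate; representing the path as a list of callables over an instance is an assumption for illustration:

```python
def in_atbop(instance, path_tests):
    """True if `instance` fails at most one of the tests on the path
    from the root to the leaf, i.e. it belongs to the leaf's
    all-tests-but-one partition."""
    failed = sum(1 for test in path_tests if not test(instance))
    return failed <= 1

# Hypothetical path to a leaf: A <= 5 and B > 2.
path = [lambda x: x["A"] <= 5, lambda x: x["B"] > 2]
```

Instances in the leaf itself fail zero tests, so the ATBOP always contains the leaf's own training instances plus those just across any single cut on the path.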
All-tests-but-one-partition
Resulting tree
Data Sets
Experimental treatments
- C4.5: C4.5 release 8 pruned trees
- C4.5X: C4.5 with grafting
- C4.5A: C4.5 with grafting from ATBOP
Grafting improves both pruned and unpruned trees; prune & graft provides the highest average accuracy.
Experimental design
- 10 unstratified 3-fold cross-validation experiments for each data set
- Allows estimation of Kohavi-Wolpert bias and variance
- Using a similar technique, all training objects are used 20 times for training and 10 times for testing
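One common form of the Kohavi-Wolpert decomposition, assuming a deterministic true label per test instance, can be computed from the predictions collected across the repeated runs; the 1/2 factors follow Kohavi and Wolpert (1996), and the per-instance bias² and variance sum to the average 0-1 error:

```python
from collections import Counter

def kw_bias_variance(predictions, true_label):
    """Kohavi-Wolpert squared bias and variance for one test instance.

    `predictions` lists the label predicted for this instance by the
    classifier from each repeated run; the true label is assumed
    deterministic.  bias2 + variance equals the average 0-1 error.
    """
    n = len(predictions)
    p = {y: c / n for y, c in Counter(predictions).items()}
    labels = set(p) | {true_label}
    bias2 = 0.5 * sum((float(y == true_label) - p.get(y, 0.0)) ** 2
                      for y in labels)
    variance = 0.5 * (1.0 - sum(q * q for q in p.values()))
    return bias2, variance
```

For example, a classifier that predicts the true label in half the runs and a wrong label in the other half gets bias² = 0.25 and variance = 0.25, together matching its 0.5 error rate.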
ATBOP Error
ATBOP Bias
ATBOP Variance
Compare Bagging t=10 (Error)
Compare Bag t=10 (nodes)
Conclusions
- Grafting provides strong evidence against the Occam Thesis
- Grafting achieves bagging-like variance reduction without forming a committee
- Grafting forms less complex classifiers than bagging: fewer nodes, and a single directly interpretable structure
Complexity
Merriam-Webster:
Main Entry: 2 com·plex
Pronunciation: käm-'pleks, k&m-', 'käm-
Function: adjective
Etymology: Latin complexus, past participle of complecti to embrace, comprise (a multitude of objects), from com- + plectere to braid -- more at PLY
1 a : composed of two or more parts : COMPOSITE
b (1) of a word : having a bound form as one or more of its immediate constituents (2) of a sentence : consisting of a main clause and one or more subordinate clauses
2 : hard to separate, analyze, or solve
3 : of, concerned with, being, or containing complex numbers