© Jude Shavlik 2006, David Page 2007. CS 760 – Machine Learning (UW-Madison), Lecture #12
Slide 1: 5-Slide Example: Gene Chip Data

Slide 2: Decision Trees in One Picture

Slide 3: Example: Gene Expression
Decision tree:
  AD_X57809_at <= : myeloma (74)
  AD_X57809_at > : normal (31)
Leave-one-out cross-validation accuracy estimate: 97.1%
X57809: IGL (immunoglobulin lambda locus)

Slide 4: Problem with Result
- Predicting accurately is easy with genes related to immune function, such as IGL, but this gives us no new insight.
- Eliminate these genes prior to training.
- Noticing the problem is possible because decision trees are comprehensible.

Slide 5: Ignoring Genes Associated with Immune Function
Decision tree:
  AD_X04898_rna1_at <= : normal (30)
  AD_X04898_rna1_at > : myeloma (74/1)
X04898: APOA2 (Apolipoprotein AII)
Leave-one-out accuracy estimate: 98.1%

Slide 6: Another Tree
AD_M15881_at > 992: normal (28)
AD_M15881_at <= 992:
  AC_D82348_at = A: normal (3)
  AC_D82348_at = P: myeloma (74)

Slide 7: A Measure of Node Purity
Let $f^+$ = fraction of positive examples and $f^-$ = fraction of negative examples:
  $f^+ = p / (p + n)$, $f^- = n / (p + n)$, where $p$ = #pos and $n$ = #neg
Under an optimal code, the information needed (expected number of bits) to label one example is
  $\mathrm{Info}(f^+, f^-) = -f^+ \lg f^+ - f^- \lg f^-$, where $\lg$ is the base-2 logarithm
This is also called the entropy of the set of examples (derived later).
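
The purity measure above can be sketched in a few lines. This is a minimal illustration, not the course's code; the function names `info` and `info_from_counts` are assumptions, and 0 lg 0 is taken to be 0.

```python
import math

def info(f_pos: float, f_neg: float) -> float:
    """Expected bits to label one example: -f+ lg f+ - f- lg f-."""
    total = 0.0
    for f in (f_pos, f_neg):
        if f > 0.0:  # 0 * lg(0) is taken to be 0 by definition
            total -= f * math.log2(f)
    return total

def info_from_counts(p: int, n: int) -> float:
    """Same measure from counts of positive (p) and negative (n) examples."""
    return info(p / (p + n), n / (p + n))
```

For instance, `info_from_counts(3, 2)` scores a node holding three positive and two negative examples.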

Slide 8: Another Commonly-Used Measure of Node Purity
- Gini index: $f^+ \cdot f^-$
- Used in CART (Classification and Regression Trees, Breiman et al., 1984)
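
A short sketch of the Gini index in the two-class form given on this slide, assuming $f^- = 1 - f^+$. The function name `gini` is an assumption for illustration.

```python
def gini(f_pos: float) -> float:
    """Two-class Gini index in the slide's form: f+ * f-, with f- = 1 - f+."""
    return f_pos * (1.0 - f_pos)
```

Like the entropy measure, it is zero for a pure node and largest for a 50-50 mixture.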

Slide 9: $\mathrm{Info}(f^+, f^-)$: Consider the Extreme Cases
- All same class (+, say): $\mathrm{Info}(1, 0) = -1 \lg(1) - 0 \lg(0) = 0$, taking $0 \lg(0) = 0$ (by def)
- 50-50 mixture: $\mathrm{Info}(½, ½) = 2[-½ \lg(½)] = 1$
[Figure: plot of $I(f^+, 1 - f^+)$ against $f^+$]

Slide 10: Evaluating a Feature
- How much does it help to know the value of attribute/feature A?
- Assume A divides the current set of examples into N groups. Let
    $q_i$ = fraction of data on branch i
    $f_i^+$ = fraction of +’s on branch i
    $f_i^-$ = fraction of –’s on branch i

Slide 11: Evaluating a Feature (cont.)
  $E(A) \equiv \sum_{i=1}^{N} q_i \times I(f_i^+, f_i^-)$
- Info needed after determining the value of attribute A
- Another expected-value calculation
[Figure: pictorially, a node testing A with information $I(f^+, f^-)$ has branches $v_1, \dots, v_N$ leading to children with information $I(f_1^+, f_1^-), \dots, I(f_N^+, f_N^-)$]

Slide 12: Info Gain
  $\mathrm{Gain}(A) \equiv I(f^+, f^-) - E(A)$
- This is our scoring function in our hill-climbing (greedy) algorithm.
- $I(f^+, f^-)$ is constant for all features, so pick the A with the smallest $E(A)$.
- That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution.

Slide 13: Example Info-Gain Calculation
Color   Shape      Size   Class
Blue    (picture)  BIG    +
Red     (picture)  SMALL  -
Yellow  (picture)  SMALL  -
Red     (picture)  BIG    +
Red     (picture)  BIG    +

Slide 14: Info-Gain Calculation (cont.)
Note that “Size” provides complete classification, so we are done.
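
The calculation for the five-example table can be sketched as follows. This is a minimal illustration under stated assumptions: the Shape column is left out because its values were drawn as pictures, and the helper names (`entropy`, `expected_info`, `gain`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Info(f+, f-) computed from a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def expected_info(examples, feature):
    """E(A): entropy of each branch, weighted by q_i (fraction on branch i)."""
    branches = defaultdict(list)
    for ex in examples:
        branches[ex[feature]].append(ex["Class"])
    return sum(len(b) / len(examples) * entropy(b) for b in branches.values())

def gain(examples, feature):
    """Gain(A) = I(f+, f-) - E(A)."""
    return entropy([ex["Class"] for ex in examples]) - expected_info(examples, feature)

# The five examples from the table (Shape omitted).
examples = [
    {"Color": "Red", "Size": "BIG", "Class": "+"},
    {"Color": "Red", "Size": "BIG", "Class": "+"},
    {"Color": "Yellow", "Size": "SMALL", "Class": "-"},
    {"Color": "Red", "Size": "SMALL", "Class": "-"},
    {"Color": "Blue", "Size": "BIG", "Class": "+"},
]

for feature in ("Color", "Size"):
    print(feature, round(gain(examples, feature), 3))
```

Size leaves every branch pure, so E(Size) = 0 and its gain equals the full starting information of about 0.971 bits; Color leaves a mixed Red branch and gains only about 0.420 bits.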

Slide 16: ID3 Info Gain Measure Justified
(Ref. C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)
Definition of Information
- Info conveyed by message M depends on its probability, i.e.,
    $\mathrm{info}(M) \equiv -\lg[\mathrm{Prob}(M)]$ (due to Shannon)
- Select an example from a set S and announce that it belongs to class C.
- The probability of this occurring is the fraction of C’s in S.
- Hence the info in this announcement is, by definition,
    $-\lg(\text{fraction of } C\text{’s in } S)$

Slide 17: ID3 Info Gain Measure (cont.)
- Let there be K different classes in set S; the classes are $C_1, C_2, \dots, C_K$.
- What is the expected info from a message about the class of an example in set S?
    $\mathrm{info}(S) = -\sum_{k=1}^{K} f_k \lg f_k$, where $f_k$ is the fraction of $C_k$’s in S
- $\mathrm{info}(S)$ is the average number of bits of information (by looking at feature values) needed to classify a member of set S.

Slide 18: Handling Hierarchical Features in ID3
Define a new feature for each level in the hierarchy, e.g.,
  Shape1 = {Circular, Polygonal}
  Shape2 = {,,,, }
Let ID3 choose the appropriate level of abstraction!
[Figure: hierarchy with Shape at the root and Circular and Polygonal below it]

Slide 19: Handling Numeric Features in ID3
On the fly, create binary features and choose the best.
Step 1: Plot the current examples along the value of the feature (green = pos, red = neg).
Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g., feature_new1 = F<8 and feature_new2 = F<10.
Step 3: Choose the split with the best info gain.
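
Step 2 can be sketched as below. The data values are hypothetical, chosen only so that the cut points come out at 8 and 10 as in the slide's example; the function name `candidate_thresholds` is also an assumption.

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted points whose categories differ."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts

# Hypothetical feature values F and their categories.
values = [5, 7, 9, 11, 13]
labels = ["+", "+", "-", "+", "+"]
print(candidate_thresholds(values, labels))
```

Each returned cut t defines one new binary feature F<t, which is then scored by info gain like any other feature (Step 3).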

Slide 20: Handling Numeric Features (cont.)
Note: cannot discard a numeric feature after using it in one portion of the d-tree.
[Figure: a tree that tests F<10 at the root and tests F against a second threshold further down, with T and F branches]

Slide 21: Characteristic Property of Using Info-Gain Measure
FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values)
Extreme case (e.g., splitting on Student ID): at most one example per leaf, so all I(.,.) scores for the leaves equal zero and the feature gets a perfect score! But it generalizes very poorly (i.e., it memorizes the data).

Slide 22: Fix: Method 1
Convert all features to binary, e.g., Color = {Red, Blue, Green}.
From one N-valued feature to N binary features:
  Color = Red? {True, False}
  Color = Blue? {True, False}
  Color = Green? {True, False}
- Used in neural nets and SVMs
- D-tree readability probably less, but not necessarily
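
The conversion can be sketched as follows. The function name `to_binary_features` and the dictionary representation of an example are assumptions made for illustration.

```python
def to_binary_features(example, feature, values):
    """Replace one N-valued feature with N True/False features."""
    out = {k: v for k, v in example.items() if k != feature}
    for v in values:
        out[f"{feature}={v}?"] = (example[feature] == v)
    return out

print(to_binary_features({"Color": "Red", "Size": "BIG"},
                         "Color", ["Red", "Blue", "Green"]))
```

Every resulting feature has a branching factor of two, so no single feature can score well merely by having many values.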

Slide 23: Fix: Method 2
Find the info content in the answer to: what is the value of feature A, ignoring the output category?
  $\mathrm{SplitInfo}(A) = -\sum_{i} q_i \lg q_i$, where $q_i$ = fraction of all examples with A = i
Choose the A that maximizes:
  $\mathrm{Gain}(A) / \mathrm{SplitInfo}(A)$
Read the text (Mitchell) for exact details!
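
A minimal sketch of this correction, usually called the gain ratio, applied to a Student-ID-like feature. The toy rows and the helper names (`entropy`, `gain`, `split_info`, `gain_ratio`) are assumptions for illustration; see Mitchell for the exact formulation.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())

def gain(rows, feature, target="Class"):
    """Info gain of splitting rows on feature."""
    branches = defaultdict(list)
    for r in rows:
        branches[r[feature]].append(r[target])
    remainder = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    return entropy([r[target] for r in rows]) - remainder

def split_info(rows, feature):
    """Info in the feature's own value, ignoring the output category."""
    return entropy([r[feature] for r in rows])

def gain_ratio(rows, feature, target="Class"):
    si = split_info(rows, feature)
    return gain(rows, feature, target) / si if si > 0 else 0.0

# Hypothetical data: ID is unique per example, Size is a genuine predictor.
rows = [
    {"ID": 1, "Size": "BIG", "Class": "+"},
    {"ID": 2, "Size": "BIG", "Class": "+"},
    {"ID": 3, "Size": "SMALL", "Class": "-"},
    {"ID": 4, "Size": "SMALL", "Class": "-"},
]
for feature in ("ID", "Size"):
    print(feature, gain(rows, feature), gain_ratio(rows, feature))
```

Both features reach the maximum gain of 1 bit here, but ID pays for its four-way split (2 bits of split info), so its ratio is 0.5 against 1.0 for Size.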

Slide 24: Fix: Method 3
Group values of nominal features.
- Done in CART (Breiman et al., 1984)
- Breiman et al. proved that, for the 2-category case, the optimal binary partition can be found by considering only O(N) possibilities instead of O(2^N).
[Figure: a four-way split Color? with branches R, B, G, Y, versus binary splits on grouped values (R vs B vs … and G vs Y vs …)]
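
The O(N) search is commonly described as: order the N values by their fraction of positive examples, then score only the N-1 splits that respect that ordering. The sketch below follows that description and scores splits with the size-weighted Gini index; the data and the name `best_binary_grouping` are hypothetical.

```python
from collections import defaultdict

def best_binary_grouping(rows, feature, target="Class", positive="+"):
    """Return (score, left_values, right_values) for the best ordered split."""
    counts = defaultdict(lambda: [0, 0])  # value -> [positives, total]
    for r in rows:
        counts[r[feature]][1] += 1
        if r[target] == positive:
            counts[r[feature]][0] += 1
    # Order the values by their fraction of positive examples.
    ordered = sorted(counts, key=lambda v: counts[v][0] / counts[v][1])

    def weighted_gini(group):
        pos = sum(counts[v][0] for v in group)
        tot = sum(counts[v][1] for v in group)
        f_pos = pos / tot
        return tot * f_pos * (1 - f_pos)

    best = None
    for k in range(1, len(ordered)):  # only N-1 candidate partitions
        left, right = ordered[:k], ordered[k:]
        score = weighted_gini(left) + weighted_gini(right)
        if best is None or score < best[0]:
            best = (score, set(left), set(right))
    return best

# Hypothetical counts per color: (value, number of +, number of -).
rows = []
for color, n_pos, n_neg in [("R", 3, 0), ("B", 2, 1), ("G", 0, 3), ("Y", 1, 2)]:
    rows += [{"Color": color, "Class": "+"}] * n_pos
    rows += [{"Color": color, "Class": "-"}] * n_neg

score, left, right = best_binary_grouping(rows, "Color")
print(left, right)
```

On this data the mostly negative colors G and Y end up on one side and the mostly positive colors B and R on the other.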

Slide 25: Multiple Category Classification – Method 1 (used in SVMs)
Approach 1: Learn one tree (i.e., model) per category.
Pass test examples through each tree.
What happens if a test example is predicted to lie in multiple categories? In none?

Slide 26: Multiple Category Classification – Method 2
Approach 2: Learn one tree in total.
It subdivides the full space such that every point belongs to one and only one category (the drawing is slightly misleading).

Slide 27: Noise - Major Issue in ML
Worst case of noise
- +, - at the same point in feature space
Causes of noise
1. Too few features (“hidden variables”) or too few possible values
2. Incorrectly reported/measured/judged feature values
3. Misclassified instances

Slide 28: Noise - Major Issue in ML (cont.)
Overfitting
- Producing an “awkward” concept because of a few “noisy” points
[Figure labels: Bad performance on future examples? Better performance?]

Slide 29: Overfitting Viewed in Terms of Function-Fitting
Data = Red Line + Noise
[Figure: plot of f(x) against x, with labels “Model” and “Underfitting?”]

Slide 30: Definition of Overfitting
Assuming a test set large enough to be representative, concept C overfits the training data if there exists a simpler concept S such that
  training-set accuracy of C > training-set accuracy of S
but
  test-set accuracy of C < test-set accuracy of S

Slide 31: Remember!
- It is easy to learn/fit the training data
- What’s hard is generalizing well to future (“test set”) data!
- Overfitting avoidance (reduction, really) is the key issue in ML

Slide 32: Can One Underfit?
- Sure, if not fully fitting the training set
  - E.g., just return the majority category (+ or -) in the train set as the learned model
- But also if there is not enough data to illustrate important distinctions
  - E.g., color may be important, but all examples seen are red, so there is no reason to include color and make a more complex model

Slide 33: ID3 & Noisy Data
To avoid overfitting, could allow splitting to stop before all examples are of one class.
- Early stopping was Quinlan’s original idea
- But post-pruning is now seen as better
  - More robust to weaknesses of greedy algorithms (e.g., benefits from seeing the full tree, if a node only temporarily looked bad)

Slide 34: ID3 & Noisy Data (cont.)
Build the complete tree, then use some “spare” (tuning) examples to decide which parts of the tree can be pruned
- called “Reduced [tuneset] Error Pruning”

Slide 35: ID3 & Noisy Data (cont.)
- See which subtree has the highest tune-set accuracy
- Repeat (i.e., another greedy algorithm)
[Figure labels: Discard (replace by leaf)? Better tuneset accuracy?]

Slide 36: Greedily Pruning D-Trees
Sample (Hill Climbing) Search Space
[Figure: search space of pruned trees, with the best candidate marked at each step]
Stop if best is not an improvement

Slide 37: Greedily Pruning D-trees - Pseudocode
1. Run ID3 to fully fit the TRAIN set; measure accuracy on TUNE
2. Consider all subtrees where ONE interior node is removed and replaced by a leaf
   - label the leaf with the majority category in the pruned subtree
   Choose the best subtree if it makes progress on TUNE
   If no improvement, quit
3. Go to 2
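
The pseudocode can be sketched in Python as below, starting from an already built tree. The nested-dictionary tree representation, the `majority` field (the majority training category stored at each interior node), and all function names are assumptions made for illustration, not the course's implementation.

```python
import copy

def classify(node, example):
    """Follow branches until a leaf; fall back to the node's majority label."""
    while "label" not in node:
        child = node["children"].get(example[node["feature"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]

def accuracy(tree, examples):
    return sum(classify(tree, ex) == ex["Class"] for ex in examples) / len(examples)

def interior_paths(node, path=()):
    """Yield the branch-value path to every interior node."""
    if "label" in node:
        return
    yield path
    for value, child in node["children"].items():
        yield from interior_paths(child, path + (value,))

def prune_at(tree, path):
    """Copy of tree with the interior node at path replaced by a leaf."""
    new_tree = copy.deepcopy(tree)
    if not path:
        return {"label": new_tree["majority"]}
    parent = new_tree
    for value in path[:-1]:
        parent = parent["children"][value]
    target = parent["children"][path[-1]]
    parent["children"][path[-1]] = {"label": target["majority"]}
    return new_tree

def greedy_prune(tree, tune):
    best_acc = accuracy(tree, tune)
    while True:
        # Step 2: every tree with ONE interior node replaced by a leaf.
        candidates = [prune_at(tree, p) for p in interior_paths(tree)]
        if not candidates:
            return tree
        cand = max(candidates, key=lambda t: accuracy(t, tune))
        cand_acc = accuracy(cand, tune)
        if cand_acc <= best_acc:  # no improvement on TUNE: quit
            return tree
        tree, best_acc = cand, cand_acc  # Step 3: go to 2

# Hypothetical fully grown tree and tuning set.
tree = {"feature": "Color", "majority": "+", "children": {
    "Red": {"feature": "Size", "majority": "+", "children": {
        "Big": {"label": "+"}, "Small": {"label": "-"}}},
    "Blue": {"label": "+"},
    "Green": {"label": "-"},
}}
tune = [
    {"Color": "Red", "Size": "Big", "Class": "+"},
    {"Color": "Red", "Size": "Small", "Class": "+"},
    {"Color": "Red", "Size": "Small", "Class": "+"},
    {"Color": "Blue", "Size": "Big", "Class": "+"},
    {"Color": "Green", "Size": "Small", "Class": "-"},
]
pruned = greedy_prune(tree, tune)
print(pruned)
```

In this example the Size test under Red is replaced by a + leaf, which raises tune-set accuracy from 3/5 to 5/5; no further cut improves on that, so the search stops.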

Slide 38: Train/Tune/Test Accuracies
(same sort of curves for other tuned parameters in other algorithms)
[Figure: accuracy (up to 100%) against amount of pruning, with Train, Tune, and Test curves; the chosen pruned tree and the ideal tree to choose are both marked]

Slide 39: The Tradeoff in Greedy Algorithms
Efficiency vs Optimality
[Figure: initial tree with root R and nodes A, B, C, D, E, F]
- True best cuts: discard C’s & F’s subtrees
- Single best cut: discard B’s subtrees - irrevocable
Greedy Search: Powerful, General-Purpose, Trick-of-the-Trade

Slide 40: Pruning in C4.5 (Successor to ID3)
- Works bottom-up in a single pass, so very fast
- Can replace a subtree rooted at a node with either a leaf or the best child of that node
- Does not use a tuning set, yet works surprisingly well

Slide 41: Decision “Stumps”
- Holte (MLJ) compared:
  - Decision trees with only one decision (decision stumps)
  VS
  - Trees produced by C4.5 (with pruning algorithm used)
- Decision “stumps” do remarkably well on UC Irvine data sets
  - Archive too easy? Some datasets seem to be.
- Decision stumps are a “quick and dirty” control for comparing to new algorithms
  - But C4.5 is easy to use and probably a better control
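
A decision stump of this kind can be sketched as below. This is a simplified version for nominal features only, with each value predicting its majority class and the feature chosen by training-set accuracy; Holte's 1R also handles numeric features, which is omitted here. The data and the name `one_r` are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, features, target="Class"):
    """Return (feature, rule) where rule maps each value to its majority class."""
    best = None
    for f in features:
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[f]][r[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(c[rule[v]] for v, c in by_value.items())
        if best is None or correct > best[0]:
            best = (correct, f, rule)
    return best[1], best[2]

rows = [
    {"Color": "Red", "Size": "BIG", "Class": "+"},
    {"Color": "Red", "Size": "BIG", "Class": "+"},
    {"Color": "Yellow", "Size": "SMALL", "Class": "-"},
    {"Color": "Red", "Size": "SMALL", "Class": "-"},
    {"Color": "Blue", "Size": "BIG", "Class": "+"},
]
print(one_r(rows, ["Color", "Size"]))
```

Here the stump on Size classifies all five training examples correctly, while the stump on Color gets four.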

Slide 42: C4.5 Compared to 1R (“Decision Stumps”)
See the Holte paper for the key (e.g., HD = heart disease)
Test-set accuracy:
Dataset  C4.5    1R
BC       72.0%   68.7%
CH       99.2%   68.7%
GL       63.2%   67.6%
G2       74.3%   53.8%
HD       73.6%   72.9%
HE       81.2%   76.3%
HO       83.6%   81.0%
HY       99.1%   97.2%
IR       93.8%   93.5%
LA       77.2%   71.5%
LY       77.5%   70.7%
MU       100.0%  98.4%
SE       97.7%   95.0%
SO       97.5%   81.0%
VO       95.6%   95.2%
V1       89.4%   86.8%

Slide 43: Generating IF-THEN Rules from Trees
- Antecedent: conjunction of all decisions leading to a terminal node
- Consequent: label of the terminal node
- Example
[Figure: a tree that tests COLOR? at the root with branches Red, Blue, and Green; the Red branch tests SIZE? with branches Big and Small, and the Green branch is a - leaf]

Slide 44: Generating Rules (cont.)
The previous example generates these rules:
  If Color=Green then Output = -
  If Color=Blue then Output = +
  If Color=Red and Size=Big then Output = +
  If Color=Red and Size=Small then Output = -
Note
1. Can “clean up” the rule set (next)
2. Decision trees learn disjunctive concepts
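
Reading the rules off a tree can be sketched as a depth-first walk that collects the decisions on the way to each leaf. The nested-dictionary tree representation and the name `tree_to_rules` are assumptions for illustration.

```python
def tree_to_rules(node, conditions=()):
    """Return one (antecedent, consequent) pair per leaf of the tree."""
    if "label" in node:
        return [(list(conditions), node["label"])]
    rules = []
    for value, child in node["children"].items():
        rules += tree_to_rules(child, conditions + ((node["feature"], value),))
    return rules

# The COLOR?/SIZE? tree behind the four rules on this slide.
tree = {"feature": "Color", "children": {
    "Green": {"label": "-"},
    "Blue": {"label": "+"},
    "Red": {"feature": "Size", "children": {
        "Big": {"label": "+"}, "Small": {"label": "-"}}},
}}

for conditions, label in tree_to_rules(tree):
    tests = " and ".join(f"{f}={v}" for f, v in conditions)
    print(f"If {tests} then Output = {label}")
```

Each antecedent is a conjunction, and the rule set as a whole is a disjunction of those conjunctions, which is why trees learn disjunctive concepts.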

Slide 45: Rule Post-Pruning (Another Greedy Algorithm)
1. Induce a decision tree
2. Convert to rules (see earlier slide)
3. Consider dropping any one rule antecedent
   - Delete the one that improves tuning-set accuracy the most
   - Repeat as long as progress is being made
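
Step 3 can be sketched as below. Several choices here are assumptions rather than part of the slide: rules are tried in order and the first match wins, a default category is returned when no rule matches, and the tuning set is hypothetical. The function names are also illustrative.

```python
def predict(rules, example, default):
    """First matching rule wins; fall back to the default category."""
    for conditions, label in rules:
        if all(example.get(f) == v for f, v in conditions):
            return label
    return default

def accuracy(rules, examples, default):
    hits = sum(predict(rules, ex, default) == ex["Class"] for ex in examples)
    return hits / len(examples)

def post_prune(rules, tune, default):
    rules = [(list(c), l) for c, l in rules]
    best_acc = accuracy(rules, tune, default)
    while True:
        best = None
        # Try dropping each single antecedent from each rule.
        for i, (conds, label) in enumerate(rules):
            for j in range(len(conds)):
                shorter = (conds[:j] + conds[j + 1:], label)
                trial = rules[:i] + [shorter] + rules[i + 1:]
                acc = accuracy(trial, tune, default)
                if acc > best_acc and (best is None or acc > best[0]):
                    best = (acc, trial)
        if best is None:  # no single deletion makes progress
            return rules
        best_acc, rules = best

rules = [
    ([("Color", "Green")], "-"),
    ([("Color", "Blue")], "+"),
    ([("Color", "Red"), ("Size", "Big")], "+"),
    ([("Color", "Red"), ("Size", "Small")], "-"),
]
tune = [
    {"Color": "Red", "Size": "Small", "Class": "+"},
    {"Color": "Red", "Size": "Small", "Class": "+"},
    {"Color": "Red", "Size": "Big", "Class": "+"},
    {"Color": "Green", "Size": "Big", "Class": "-"},
    {"Color": "Blue", "Size": "Big", "Class": "+"},
    {"Color": "Yellow", "Size": "Small", "Class": "-"},
]
pruned = post_prune(rules, tune, default="-")
print(pruned)
```

On this tuning set the antecedent Size=Big is dropped from the third rule, leaving "If Color=Red then +". That shortened rule now shadows the fourth rule, which a later clean-up pass could remove.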

Slide 46: Rule Post-Pruning (cont.)
Advantages
- Allows an intermediate node to be pruned from some rules but retained in others
- Can correct poor early decisions in tree construction
- Final concept more understandable