
1 6-Slide Example: Gene Chip Data © Jude Shavlik 2006, David Page 2010 CS 760 – Machine Learning (UW-Madison)

2 Decision Trees in One Picture

3 Example: Gene Expression
Decision tree:
  AD_X57809_at <= 20343.4: myeloma (74)
  AD_X57809_at >  20343.4: normal (31)
Leave-one-out cross-validation accuracy estimate: 97.1%
X57809: IGL (immunoglobulin lambda locus)

4 Problem with Result
It is easy to predict accurately with genes related to immune function, such as IGL, but this gives us no new insight.
Fix: eliminate these genes prior to training. This is possible only because decision trees are comprehensible enough to show which genes were used.

5 Ignoring Genes Associated with Immune Function
Decision tree:
  AD_X04898_rna1_at <= -1453.4: normal (30)
  AD_X04898_rna1_at >  -1453.4: myeloma (74/1)
X04898: APOA2 (Apolipoprotein AII)
Leave-one-out accuracy estimate: 98.1%

6 Another Tree
  AD_M15881_at >  992: normal (28)
  AD_M15881_at <= 992:
    AC_D82348_at = A: normal (3)
    AC_D82348_at = P: myeloma (74)

7 A Measure of Node Purity
Let f+ = fraction of positive examples and f- = fraction of negative examples:
  f+ = p / (p + n),  f- = n / (p + n),  where p = #pos and n = #neg.
Under an optimal code, the information needed (expected number of bits) to label one example is
  Info(f+, f-) = -f+ lg(f+) - f- lg(f-)
This is also called the entropy of the set of examples (derived later).
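A minimal sketch of this two-class entropy measure in plain Python (the function name `info` and the two-fraction interface are my own choices):

```python
import math

def info(f_pos, f_neg):
    """Entropy, in bits, of a node whose class fractions are f_pos and f_neg.
    Terms with a zero fraction contribute nothing (0 * lg 0 is taken to be 0)."""
    total = 0.0
    for f in (f_pos, f_neg):
        if f > 0:
            total -= f * math.log2(f)
    return total
```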

8 Another Commonly-Used Measure of Node Purity
  Gini index: (f+)(f-)
Used in CART (Classification and Regression Trees, Breiman et al., 1984).
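The Gini measure as written on this slide, for comparison (note that the product f+ · f- is half of the more common 1 - f+^2 - f-^2 form; the two rank splits identically in the two-class case):

```python
def gini(f_pos, f_neg):
    """Gini-style purity score as on the slide: zero for a pure node,
    maximized (at 0.25) for a 50-50 mixture."""
    return f_pos * f_neg
```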

9 Info(f+, f-): Consider the Extreme Cases
All same class (+, say):
  Info(1, 0) = -1 lg(1) - 0 lg(0) = 0   (taking 0 lg(0) = 0 by definition)
50-50 mixture:
  Info(1/2, 1/2) = 2 [ -1/2 lg(1/2) ] = 1
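Using the hypothetical `info` helper sketched above, the two extremes can be checked directly:

```python
print(info(1.0, 0.0))  # 0.0  -> a pure node needs no bits
print(info(0.5, 0.5))  # 1.0  -> a 50-50 node needs one full bit
```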

10 Evaluating a Feature
How much does it help to know the value of attribute/feature A?
Assume A divides the current set of examples into N groups. Let
  q_i  = fraction of the data on branch i
  f_i+ = fraction of +'s on branch i
  f_i- = fraction of -'s on branch i

11 Evaluating a Feature (cont.)
  E(A) ≡ Σ_{i=1..N} q_i × I(f_i+, f_i-)
E(A) is the info needed after determining the value of attribute A: another expected-value calculation.
(Pictorially: a node testing A with branches v_1 ... v_N; the parent set has purity I(f+, f-) and branch i has purity I(f_i+, f_i-).)
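A hedged sketch of E(A) computed from labeled examples; representing an example as a (feature-dict, label) pair and the `expected_info` name are my own assumptions, and `info` is the helper from the earlier sketch:

```python
def expected_info(examples, attr):
    """E(A): weighted average of branch entropies after splitting on attr.
    `examples` is a list of (features_dict, label) pairs with labels '+' or '-'."""
    branches = {}
    for features, label in examples:
        branches.setdefault(features[attr], []).append(label)
    total = len(examples)
    e = 0.0
    for labels in branches.values():
        q = len(labels) / total                    # q_i: fraction of data on this branch
        f_pos = labels.count('+') / len(labels)    # f_i+: fraction of +'s on this branch
        e += q * info(f_pos, 1.0 - f_pos)
    return e
```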

12 Info Gain
  Gain(A) ≡ I(f+, f-) - E(A)
This is our scoring function in our hill-climbing (greedy) algorithm. Since I(f+, f-) is constant for all features, we simply pick the A with the smallest E(A). That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution.
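Combining the two sketches above gives the gain score itself:

```python
def info_gain(examples, attr):
    """Gain(A) = I(f+, f-) of the parent set minus E(A) after splitting on attr."""
    labels = [label for _, label in examples]
    f_pos = labels.count('+') / len(labels)
    return info(f_pos, 1.0 - f_pos) - expected_info(examples, attr)
```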

13 Example Info-Gain Calculation
  Color   Shape    Size   Class
  Red     (icon)   BIG    +
  Red     (icon)   BIG    +
  Yellow  (icon)   SMALL  -
  Red     (icon)   SMALL  -
  Blue    (icon)   BIG    +
(Shape values appeared only as icons on the slide.)

14 Info-Gain Calculation (cont.)
Note that "Size" provides a complete classification, so we are done.
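Plugging the slide-13 table into the sketches above (Shape is omitted because its icon values are not in the transcript) reproduces this conclusion:

```python
examples = [  # rows from slide 13
    ({'Color': 'Red',    'Size': 'BIG'},   '+'),
    ({'Color': 'Red',    'Size': 'BIG'},   '+'),
    ({'Color': 'Yellow', 'Size': 'SMALL'}, '-'),
    ({'Color': 'Red',    'Size': 'SMALL'}, '-'),
    ({'Color': 'Blue',   'Size': 'BIG'},   '+'),
]

for attr in ('Color', 'Size'):
    print(attr, round(info_gain(examples, attr), 3))
# Color 0.42
# Size 0.971  -> Size has the larger gain and separates the classes completely
```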


16 ID3 Info Gain Measure Justified (Ref: C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)
Definition of information (due to Shannon): the info conveyed by a message M depends on its probability, i.e.,
  info(M) = -lg( prob(M) )
Select an example from a set S and announce that it belongs to class C. The probability of this occurring is freq(C, S), the fraction of C's in S. Hence the info in this announcement is, by definition,
  -lg( freq(C, S) )

17 ID3 Info Gain Measure (cont.)
Let there be K different classes in set S; call them C_1, ..., C_K.
What is the expected info from a message about the class of an example in set S?
  info(S) = - Σ_{j=1..K} freq(C_j, S) × lg( freq(C_j, S) )
info(S) is the average number of bits of information needed to classify a member of set S.
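A sketch of this K-class version, a straightforward generalization of the two-class `info` helper above (the function name is mine):

```python
from collections import Counter
import math

def info_multiclass(labels):
    """info(S): expected bits needed to announce the class of an example drawn from S."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```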

18 Handling Hierarchical Features in ID3
Define a new feature for each level in the hierarchy, e.g., for a Shape hierarchy whose top level splits into Circular and Polygonal:
  Shape1 = {Circular, Polygonal}
  Shape2 = {the specific shapes below each of those}   (shown as icons on the slide)
Let ID3 choose the appropriate level of abstraction!

19 Handling Numeric Features in ID3
Create binary features on the fly and choose the best.
Step 1: Plot the current examples along the feature's value (on the slide: points at values 5, 7, 9, 11, 13; green = pos, red = neg).
Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g., feature_new1 = (F < 8) and feature_new2 = (F < 10).
Step 3: Choose the split with the best info gain.
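A sketch of Step 2, generating candidate thresholds midway between consecutive differently-labeled values; the example labels below are an assumption chosen only to reproduce the slide's F < 8 and F < 10 candidates:

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, l1), (v2, l2) in zip(pairs, pairs[1:])
            if l1 != l2]

print(candidate_thresholds([5, 7, 9, 11, 13], ['+', '+', '-', '+', '+']))  # [8.0, 10.0]
```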

20 Handling Numeric Features (cont.)
(Figure: a tree that tests F < 10 at the root and tests F < 5 again further down, with + and - leaves.)
Note: we cannot discard a numeric feature after using it in one portion of the d-tree.

21 Characteristic Property of Using the Info-Gain Measure
FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values)
Extreme case: splitting on Student ID (branches 1, ..., 99, ..., 999 in the figure) leaves at most one example per leaf (1+ 0-, 0+ 0-, ..., 0+ 1-). Every leaf's I(.,.) score is zero, so the split gets a perfect score, but it generalizes very poorly (i.e., it memorizes the data).

22 Fix: Method 1
Convert all features to binary, e.g., Color = {Red, Blue, Green} goes from one N-valued feature to N binary features:
  Color = Red?   {True, False}
  Color = Blue?  {True, False}
  Color = Green? {True, False}
Used in neural nets and SVMs.
D-tree readability is probably reduced, but not necessarily.

23 Fix 2: Gain Ratio
Find the info content in the answer to: What is the value of feature A, ignoring the output category?
  SplitInfo(A) = - Σ_{i=1..N} q_i lg(q_i),   where q_i = fraction of all examples with A = i
Choose the A that maximizes:
  GainRatio(A) = Gain(A) / SplitInfo(A)
Read the text (Mitchell) for exact details!
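A sketch of the gain ratio built on the earlier helpers (it reuses `info_gain` from the slide-12 sketch; the SplitInfo and GainRatio formulas are the standard C4.5 definitions this slide appears to reference):

```python
import math
from collections import Counter

def gain_ratio(examples, attr):
    """Gain(A) divided by the split information of A's value distribution."""
    total = len(examples)
    counts = Counter(features[attr] for features, _ in examples)
    split_info = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return info_gain(examples, attr) / split_info if split_info > 0 else 0.0
```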

24 Fix: Method 3
Group the values of nominal features: instead of one branch per value of Color? (R, B, G, Y), consider binary groupings such as {R} vs {B, G, Y}, {G} vs {R, B, Y}, etc.
Done in CART (Breiman et al., 1984).
Breiman et al. proved that for the 2-category case, the optimal binary partition can be found by considering only O(N) possibilities instead of O(2^N).
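A sketch of the O(N) trick for the two-class case: sort A's values by their fraction of positive examples, then only the contiguous "prefix vs. suffix" partitions along that order need to be scored (example representation and helper names follow the earlier sketches, which are my own):

```python
def best_binary_grouping(examples, attr):
    """Return (gain, left_group, right_group) for the best binary grouping of attr's
    values, scanning only the N-1 contiguous partitions after sorting values by
    their fraction of '+' examples."""
    by_value = {}
    for features, label in examples:
        by_value.setdefault(features[attr], []).append(label)
    ordered = sorted(by_value, key=lambda v: by_value[v].count('+') / len(by_value[v]))
    best = None
    for k in range(1, len(ordered)):
        left = set(ordered[:k])
        # Re-express the grouping as a derived binary feature and score it with info gain.
        grouped = [({attr: features[attr] in left}, label) for features, label in examples]
        gain = info_gain(grouped, attr)
        if best is None or gain > best[0]:
            best = (gain, left, set(ordered[k:]))
    return best
```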

25 Multiple Category Classification – Method 1 (used in SVMs)
Approach 1: learn one tree (i.e., one model) per category, and pass each test example through every tree.
What happens if a test example is predicted to lie in multiple categories? In none?

26 Multiple Category Classification – Method 2
Approach 2: learn one tree in total. It subdivides the full space such that every point belongs to one and only one category (the drawing on the slide is slightly misleading).

27 Where DT Learners Fail
Correlation-immune functions like XOR (more in a later lecture).
Target concepts where all/most features are needed (so a large tree is required) but we have limited data.

28 Key Lessons from DT Research of the 1980s
In general, simpler trees are better.
Avoid overfitting.
Link to the MDL/MML principle and Occam's Razor.
Pruning is better than early stopping.

29 Advantages of Decision Tree Learning
1. Output is easily human-comprehensible.
Can convert the tree to rules: one rule per branch leading to a positive class.
Gives insight into the task and lets us find errors in our approach, e.g., relying on a "noise peak" in mass spec data, or on antibody (IG, HLA) genes in myeloma.

30 Advantages of DT Learning
2. Good at Junta-learning problems.
Tree learners focus on finding the smallest number of features that can be used to classify accurately, and they ignore all features they don't select.

31 Advantages of DT Learning
3. Really FAST!
Time to learn is fast: O(nm) to make a split, where nm is the size of the data set (n is the number of features, m is the number of data points).
Typically few splits are required, and the data analyzed for recursive splits is even smaller.
Time to label is linear in tree depth only.

32 Advantages of DT Learners
4. Easily extended.
Numerical prediction: put the average value at each leaf and score splits by the reduction in variance (regression trees).
Can instead fit linear regression models at the leaves (model trees).
Easy to do multi-class prediction and multiple-instance learning (later lecture).

33 Advantages of DT Learning
5. Complete representation language: DTs can represent any Boolean function.
Note: this does not guarantee they will effectively learn any such function.

34 Disadvantages of DT Learners
1. The greedy algorithm means they fail on correlation-immune functions.
2. They settle for the minimum feature set required for discrimination: they discriminate rather than characterize.
3. They tend to overfit when there are more features than data; little data is left as we get deeper in the tree.

35 Error
Assume some probability distribution D over possible examples (e.g., all truth assignments over x1, ..., xN); for a training set, assume D is uniform.
Assume a hypothesis space H (e.g., all decision trees over x1, ..., xN).
The error of a hypothesis h in H is the probability that h misclassifies an example drawn randomly according to D.

36 Overfitting
Given a hypothesis space H, a probability distribution D over examples, and a training set T:
A hypothesis h in H overfits T if there exists another h' in H such that
  h has smaller error than h' on T, and
  h' has smaller error than h over D.

37 How we usually observe overfitting
The accuracy (1 - error) of our hypothesis drops substantially on a new test set compared to the training set.
We observe this to a greater extent as the decision tree grows.
How can we combat overfitting in DTs?

38 Reduced Error Pruning
Hold aside a tuning set (or validation set) from the training set.
For each internal node in the tree:
  Try replacing the subtree rooted at that node with a leaf that predicts the majority class at that node in the training set.
  Keep this change if performance on the tuning set grows no worse than without the change.
Kearns & Mansour: do this bottom-up.
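A minimal sketch of bottom-up reduced-error pruning. It assumes a simple tree representation with a `children` dict, a stored `majority_class` from training, and a `predict` method that checks `children` before falling back to `prediction`; all of those names are illustrative, not the course's code:

```python
def accuracy(tree, data):
    """Fraction of (features, label) pairs in data that the tree predicts correctly."""
    return sum(tree.predict(x) == y for x, y in data) / len(data)

def prune(node, tune_set, tree_root):
    """Bottom-up reduced-error pruning, keeping a collapse only if the tuning set
    does no worse than before."""
    if not node.children:
        return
    for child in node.children.values():          # prune subtrees first (bottom-up)
        prune(child, tune_set, tree_root)
    before = accuracy(tree_root, tune_set)
    saved_children = node.children
    node.children = {}                            # tentatively replace the subtree with a leaf
    node.prediction = node.majority_class         # ... predicting the training-set majority class
    if accuracy(tree_root, tune_set) < before:
        node.children = saved_children            # revert: the tuning set got worse
```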

39 From Mitchell (p. 70)

40 Rule Post-Pruning
Build the full decision tree.
Convert each root-to-leaf path into an IF-THEN rule, where the tests along the path are the PRECONDITIONS and the leaf label is the CONSEQUENT.
For each rule, remove any preconditions whose removal improves the estimated accuracy.
Sort the rules by their accuracy estimates and consider them in that order: a decision list.
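A sketch of the rule-handling loop described here, assuming each rule is a (precondition-list, consequent) pair and that `estimate_accuracy` is some pessimistic estimator such as the binomial bound discussed on the next slides; every name here is illustrative:

```python
def post_prune_rules(rules, data, estimate_accuracy):
    """Greedily drop preconditions while the accuracy estimate improves, then
    sort the surviving rules into a decision list (best estimate first)."""
    pruned = []
    for preconditions, consequent in rules:
        preconditions = list(preconditions)
        improved = True
        while improved and preconditions:
            improved = False
            for test in list(preconditions):
                candidate = [t for t in preconditions if t is not test]
                if (estimate_accuracy(candidate, consequent, data) >
                        estimate_accuracy(preconditions, consequent, data)):
                    preconditions = candidate     # dropping this test helps; keep the drop
                    improved = True
                    break
        pruned.append((preconditions, consequent))
    pruned.sort(key=lambda r: estimate_accuracy(r[0], r[1], data), reverse=True)
    return pruned
```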

41 Estimating Accuracy
By "accuracy" here we really mean precision: the probability that the rule is correct when it fires.
Could use a validation (tuning) set if enough data is available.
C4.5: evaluate on the training set and be pessimistic, using the binomial distribution to get a 95% confidence interval.
Example to follow...

42 The Binomial Distribution
The distribution over the number of successes in a fixed number n of independent trials, each with the same probability of success p:
  P(X = k) = C(n, k) p^k (1 - p)^(n-k)
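For illustration only (not part of the lecture), SciPy exposes this distribution directly; the numbers echo the rule example two slides below:

```python
from scipy.stats import binom

# 50 independent firings of a rule whose true precision is 0.72:
print(binom.pmf(36, n=50, p=0.72))   # P(exactly 36 correct)
print(binom.sf(35, n=50, p=0.72))    # P(36 or more correct)
```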

43 Using the Binomial Here
Let each data point (example) be a trial, and let a success be a correct prediction.
We can then compute exactly the probability that our error-rate estimate p is off by more than some amount, say 0.025, in either direction.

44 Example
Rule: color = red & size = small -> class = positive (50/14)
  We want x such that Pr(precision < x) < 0.05; b(50, 0.72) says x is 60%.
Revised rule: color = red -> class = positive (100/20)
  b(100, 0.8) says x is 72%.
We win by raising precision, or by raising coverage without a big loss in precision.
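The slide does not say exactly how b(n, p) is turned into the bound; one standard choice that reproduces these numbers is the Clopper-Pearson (exact binomial) lower confidence limit, sketched here with SciPy:

```python
from scipy.stats import beta

def lower_bound(successes, trials, alpha=0.05):
    """Clopper-Pearson lower confidence bound on the success (precision) probability."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, trials - successes + 1)

print(lower_bound(36, 50))   # ~0.60  (50 covered, 14 errors: estimate 0.72, bound ~60%)
print(lower_bound(80, 100))  # ~0.72  (100 covered, 20 errors: estimate 0.80, bound ~72%)
```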

