1
Today’s Topics
– Learning Decision Trees (Chapter 18): we’ll use d-trees to introduce/motivate many general issues in ML (e.g., overfitting reduction)
– “Forests” of Decision Trees: a very successful ML approach, arguably the best on many tasks
– Expected-Value Calculations (a topic we’ll revisit a few times)
– Information Gain
– Advanced Topic: Regression Trees
– Coding Tips for HW1
2
Learning Decision Trees (Ch 18): The ID3 Algorithm (Quinlan 1979; Machine Learning 1:1, 1986)
Induction of Decision Trees (top-down)
– Based on Hunt’s CLS psych model (1963)
– Handles noisy & missing feature values
– C4.5 and C5.0 successors; CART very similar
(Figure: a small example d-tree that splits on COLOR (Red/Blue), then SIZE (Big/Small), with +/- leaves.)
3
Main Hypothesis of ID3
The simplest tree that classifies the training examples will work best on future examples (Occam’s Razor)
(Figure: a larger tree that splits on COLOR then SIZE vs. a smaller tree that splits on SIZE alone.)
NP-hard to find the smallest tree (Hyafil & Rivest, 1976)
(Photo: Ross Quinlan)
4
Why Occam’s Razor? (Occam lived 1285–1349)
– There are fewer short hypotheses (small trees in ID3) than long ones
– A short hypothesis that fits the training data is unlikely to be a coincidence
– A long hypothesis that fits the training data might be (since there are many more possibilities)
The COLT community formally addresses these issues (ML theory)
5
Finding Small Decision Trees
ID3 – generate small trees with a greedy algorithm:
– Find a feature that “best” divides the data
– Recur on each subset of the data that the feature creates
What does “best” mean?
– We’ll briefly postpone answering this
6
Overview of ID3 (Recursion!)
(Figure: ID3 applied to a small dataset of examples {+1 … +5, -1 … -3} described by attributes A1–A4. The chosen splitting attribute (aka feature) partitions the dataset, ID3 recurses on each subset, and the resulting d-tree is shown in red. An empty subset becomes a leaf labeled with the majority class at the parent node (+) – why?)
7
ID3 Algorithm (Figure 18.5 of textbook)
Given: E, a set of classified examples
       F, a set of features not yet in the decision tree
If |E| = 0 then return the majority class at the parent
Else if All_Examples_Same_Class, return that class
Else if |F| = 0, return the majority class (we have +/- examples with the same feature values)
Else
   Let bestF = FeatureThatGainsMostInfo(E, F)
   Let leftF = F – bestF
   Add node bestF to the decision tree
   For each possible value, v, of bestF do
      Add an arc (labeled v) to the decision tree,
      and connect it to the result of ID3({ex in E | ex has value v for feature bestF}, leftF)
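
A minimal Python sketch of this pseudocode, assuming each example is a dict of feature values plus a "class" label, and that features maps each feature name to its list of possible values. The helper names (majority_class, id3, random_feature) are illustrative only, not the homework’s required API.

    import random
    from collections import Counter

    def majority_class(examples):
        """Most common label among the examples (ties broken arbitrarily)."""
        return Counter(ex["class"] for ex in examples).most_common(1)[0][0]

    def id3(examples, features, parent_majority, choose_feature):
        """Return a tree: either a class label (leaf) or
        {"feature": name, "branches": {value: subtree, ...}}."""
        if not examples:                      # |E| = 0: use the majority class at the parent
            return parent_majority
        labels = {ex["class"] for ex in examples}
        if len(labels) == 1:                  # all examples have the same class
            return labels.pop()
        if not features:                      # |F| = 0: +/- examples with same feature values
            return majority_class(examples)
        best = choose_feature(examples, list(features))
        remaining = {f: vals for f, vals in features.items() if f != best}
        branches = {}
        for v in features[best]:              # every possible value of bestF, even if unseen here
            subset = [ex for ex in examples if ex[best] == v]
            branches[v] = id3(subset, remaining, majority_class(examples), choose_feature)
        return {"feature": best, "branches": branches}

    def random_feature(examples, features):
        """Placeholder chooser; swap in information gain (defined later) once it works."""
        return random.choice(features)

Passing the chooser in as a function lets you start with a random choice (as the Suggested Approach slide recommends) and plug in information gain later.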
8
Venn Diagram View of ID3
Question: How do decision trees divide feature space?
(Figure: + and - examples scattered in a 2-D feature space with axes F1 and F2.)
9
Venn Diagram View of ID3
Question: How do decision trees divide the feature space?
(Figure: the same F1–F2 space, now carved into rectangular regions by the tree’s ‘axis-parallel splits’.)
10
Use this as a guide on how to print d-trees in ASCII.
(Figure: an example d-tree printed in plain ASCII.)
11
Main Issue
How to choose the next feature to place in the decision tree?
– Random choice? [works better than you’d expect]
– The feature with the largest number of values?
– The feature with the fewest?
– An information-theoretic measure (Quinlan’s approach) – a general-purpose tool, e.g., often used for “feature selection”
12
Expected-Value Calculations: Sample Task
Imagine you invest $1 in a lottery ticket. It says the odds are:
– 1 in 10 times you’ll win $5
– 1 in 1,000,000 times you’ll win $100,000
How much do you expect to get back?
0.1 × $5 + 0.000001 × $100,000 = $0.60
13
More Generally
Assume event A has N discrete and disjoint random outcomes
Expected value(event A) = Σ (i = 1 to N) prob(outcome i occurs) × value(outcome i)
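
As a small illustration, a Python sketch of that probability-weighted sum; the helper name expected_value is made up for this example and is not part of the lecture or homework code.

    def expected_value(outcomes):
        """outcomes: (probability, value) pairs for the disjoint outcomes of an event."""
        return sum(p * v for p, v in outcomes)

    # The lottery ticket from the previous slide:
    print(expected_value([(0.1, 5.0), (0.000001, 100_000.0)]))  # 0.6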
14
Scoring the Features (so we can pick the best one)
Let f+ = fraction of positive examples
Let f- = fraction of negative examples
f+ = p / (p + n), f- = n / (p + n), where p = #pos, n = #neg
The expected information needed to determine the category of one of these examples is
InfoNeeded(f+, f-) = - f+ lg(f+) - f- lg(f-)
This is also called the entropy of the set of examples (derived later)
From where will we get this info?
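
A minimal Python sketch of InfoNeeded, assuming the convention 0 lg 0 = 0 used on the next slide; the function name info_needed is illustrative only.

    from math import log2

    def info_needed(f_pos, f_neg):
        """Entropy of a two-class set: -f+ lg(f+) - f- lg(f-), taking 0 lg 0 = 0."""
        return sum(-f * log2(f) for f in (f_pos, f_neg) if f > 0)

    print(info_needed(1.0, 0.0))   # 0.0 -> all examples in one class
    print(info_needed(0.5, 0.5))   # 1.0 -> a 50-50 mixture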
15
Consider the Extreme Cases of InfoNeeded(f+, f-)
(Figure: plot of InfoNeeded(f+, 1 - f+) as f+ runs from 0 to 1; it is 0 at both endpoints and peaks at 1 when f+ = 0.5.)
All same class (+, say): InfoNeeded(1, 0) = -1 lg(1) - 0 lg(0) = 0
50-50 mixture: InfoNeeded(½, ½) = 2 [ -½ lg(½) ] = 1
(0 lg 0 = 0 by definition)
16
Evaluating a Feature
How much does it help to know the value of attribute/feature A?
Assume A divides the current set of examples into N groups
Let q_i = fraction of data on branch i
    f_i+ = fraction of +’s on branch i
    f_i- = fraction of –’s on branch i
17
Evaluating a Feature (cont.)
InfoRemaining(A) = Σ (i = 1 to N) q_i × InfoNeeded(f_i+, f_i-)
– Info still needed after determining the value of attribute A
– Another expected-value calc
Pictorially: (Figure: attribute A at the root, with InfoNeeded(f+, f-) for the full set; branches v_1 … v_N lead to subsets with InfoNeeded(f_1+, f_1-) … InfoNeeded(f_N+, f_N-).)
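
A Python sketch of InfoRemaining, restating the info_needed helper and assuming each example is a dict with a "class" of "+" or "-"; names are illustrative only.

    from collections import Counter
    from math import log2

    def info_needed(f_pos, f_neg):
        return sum(-f * log2(f) for f in (f_pos, f_neg) if f > 0)

    def info_remaining(examples, feature):
        """Weighted sum of InfoNeeded over the branches created by splitting on `feature`."""
        total = len(examples)
        remaining = 0.0
        for value, count in Counter(ex[feature] for ex in examples).items():
            branch = [ex for ex in examples if ex[feature] == value]
            pos = sum(1 for ex in branch if ex["class"] == "+")
            q = count / total                 # q_i: fraction of the data on this branch
            remaining += q * info_needed(pos / count, (count - pos) / count)
        return remaining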
18
Info Gain
Gain(A) = InfoNeeded(f+, f-) – InfoRemaining(A)
This is our scoring function in our hill-climbing (greedy) algorithm.
Since InfoNeeded(f+, f-) is constant for all features, picking the A with the largest Gain(A) is the same as picking the A with the smallest InfoRemaining(A).
That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution.
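
Continuing the sketch: a possible info_gain and a chooser that could be plugged into the id3 sketch from earlier. It assumes the info_needed and info_remaining helpers defined just above; the names are illustrative, not the HW’s required API.

    def info_gain(examples, feature):
        """Gain(A) = InfoNeeded at this node - InfoRemaining(A)."""
        f_pos = sum(1 for ex in examples if ex["class"] == "+") / len(examples)
        return info_needed(f_pos, 1 - f_pos) - info_remaining(examples, feature)

    def best_feature(examples, features):
        """Chooser for id3: the feature with the largest information gain."""
        return max(features, key=lambda f: info_gain(examples, f))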
19
Sample Info-Gain Calculation
InfoNeeded(f+, f-) = - f+ lg(f+) - f- lg(f-)

Color    Size    Class
Red      BIG     +
Red      BIG     +
Yellow   SMALL   -
Red      SMALL   -
Blue     BIG     +
20
Info-Gain Calculation (cont.)
Note that “Size” provides a complete classification, so we are done.
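
For concreteness, a check of that claim using the info_gain sketch above on the table’s Color and Size columns (the numbers are computed here, not taken from the slides):

    examples = [
        {"Color": "Red",    "Size": "BIG",   "class": "+"},
        {"Color": "Red",    "Size": "BIG",   "class": "+"},
        {"Color": "Yellow", "Size": "SMALL", "class": "-"},
        {"Color": "Red",    "Size": "SMALL", "class": "-"},
        {"Color": "Blue",   "Size": "BIG",   "class": "+"},
    ]
    for feature in ("Color", "Size"):
        print(feature, round(info_gain(examples, feature), 3))
    # Color 0.42
    # Size 0.971   <- largest gain; both Size branches are pure, so the tree is complete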
21
Recursive Methods You’ll Need to Write
The d-tree learning algo (pseudocode appeared above)
Classifying a ‘testset’ example (see the sketch below)
– Leaf nodes: return the leaf’s label (ie, the predicted category)
– Interior nodes: determine which feature value to look up in the example, then return the result of the recursive call on the ‘left’ or ‘right’ branch
Printing the d-tree in ‘plain ASCII’ (you need not follow this verbatim)
– Tip: pass in ‘currentDepthOfRecursion’ (initially 0)
– Leaf nodes: print the LABEL (and maybe the # of training ex’s reaching here) + LINEFEED
– Interior nodes: for each outgoing arc
   print LINEFEED and 3 × currentDepthOfRecursion spaces
   print FEATURE NAME + “ = “ + the arc’s value + “: “
   make the recursive call on the arc, with currentDepthOfRecursion + 1
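
A minimal Python sketch of these two recursions, assuming the nested-dict tree representation used in the id3 sketch earlier; it is one way to follow the tips above, not the required HW format.

    def classify(tree, example):
        """Leaf nodes are plain labels; interior nodes look up the example's value
        for the node's feature and recurse down that branch."""
        if not isinstance(tree, dict):
            return tree
        return classify(tree["branches"][example[tree["feature"]]], example)

    def print_tree(tree, depth=0):
        """Plain-ASCII printout: each arc on its own line, indented 3 spaces per level."""
        if not isinstance(tree, dict):         # leaf: label + linefeed
            print(tree)
            return
        for value, subtree in tree["branches"].items():
            print(" " * (3 * depth) + f"{tree['feature']} = {value}: ", end="")
            if isinstance(subtree, dict):
                print()                        # interior child: start it on a fresh line
                print_tree(subtree, depth + 1)
            else:
                print(subtree)                 # leaf child fits on the same line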
22
Suggested Approach
Randomly choose a feature
– Get tree building to work
– Get tree printing to work
– Get tree traversal (for test ex’s) to work
Add in code for infoGain
– Test on simple, handcrafted datasets
– Train and test on the SAME file (why?)
   Should get ALL correct (except in cases of extreme noise)
Produce what the HW requests