
1 Decision trees and empirical methodology Sec 4.3, 5.1-5.4

2 Review...
- Goal: find/replicate the target function f()
- Candidates come from a hypothesis space, H
- "Best" candidate measured by accuracy (for the moment)
- Decision trees are built by greedy, recursive search
- Produces a piecewise-constant, axis-orthogonal, hyperrectangular model
- Can handle continuous or categorical attributes
- Only categorical class labels
- Learning bias toward small, well-balanced trees

3 Splitting criteria
What properties do we want our getBestSplitAttribute() function to have?
- Increase the purity of the data: after the split, the new sets should be closer to uniform labeling than before the split
- Want the subsets to have roughly the same purity
- Want the subsets to be as balanced as possible
These choices are designed to produce small trees.
Definition: learning bias == the tendency to find one class of solution out of H in preference to another

4 Entropy
We'll use entropy. Consider a set of true/false labels:
- Want our measure to be small when the set is pure (all true or all false), and large when the set is nearly evenly split between the classes
- It expresses the amount of information in the set
- (Later we'll use the negative of this function, so it'll be better if the set is almost pure)

5 Entropy, cont'd
Define: class fractions (a.k.a. class prior probabilities): p_i = N_i / N, where N_i is the number of instances with label i
Define: entropy of a set of true/false labels: H(Y) = -p_true log2(p_true) - p_false log2(p_false)
In general, for c classes: H(Y) = -sum_{i=1..c} p_i log2(p_i)
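
A minimal Python sketch of this definition (not from the slides; the function name and label representation are my own):

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy (in bits) of a collection of class labels."""
        n = len(labels)
        counts = Counter(labels)                       # class counts N_i
        fractions = [c / n for c in counts.values()]   # class fractions p_i
        return -sum(p * log2(p) for p in fractions)    # H = -sum p_i log2 p_i

    print(entropy([True, True, False, False]))  # 1.0  (evenly split set)
    print(entropy([True, True, True, True]))    # -0.0, i.e., zero (pure set)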

6 The entropy curve [figure: H as a function of the fraction of positive labels -- zero for a pure set, maximal (1 bit) at a 50/50 split]

7 Entropy of a split
- A split produces a number of sets (one for each branch)
- Need a corresponding entropy of a split (i.e., entropy of a collection of sets)
Definition: entropy of a split into label subsets Y_1, ..., Y_k: H_split(Y_1, ..., Y_k) = sum_j (|Y_j| / |Y|) * H(Y_j), i.e., the size-weighted average of the subset entropies
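
A sketch of that size-weighted average in Python, restating the entropy() helper so it runs on its own (names are my own, not from the slides):

    from collections import Counter
    from math import log2

    def entropy(labels):  # as in the earlier sketch
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def split_entropy(subsets):
        """Weighted average entropy of the label subsets produced by a split."""
        total = sum(len(s) for s in subsets)
        return sum(len(s) / total * entropy(s) for s in subsets)

    print(split_entropy([[True, True], [False, False]]))  # -0.0, i.e., zero (two pure subsets)
    print(split_entropy([[True, False], [True, False]]))  # 1.0 (both subsets evenly mixed)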

8 Information gain
The last, easy step:
- Want to pick the attribute that decreases the information content of the data as much as possible
- Q: Why decrease?
Define: gain of splitting data set [X, Y] on attribute a: Gain(X, Y, a) = H(Y) - H_split(Y_1, ..., Y_k), where Y_1, ..., Y_k are the label subsets produced by splitting on a
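
A small worked example (the numbers are my own, not from the slides): take 8 labels, 4 true and 4 false, so H(Y) = 1 bit. A split that yields two pure subsets (4 true | 4 false) has split entropy 0, so its gain is 1 - 0 = 1 bit. A split that yields two subsets each containing 2 true and 2 false has split entropy 0.5*1 + 0.5*1 = 1 bit, so its gain is 1 - 1 = 0: that split tells us nothing about the labels.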

9 Final algorithm
Now we have a complete algorithm for the getBestSplitAttribute() function:

    Input: InstanceSet X, LabelSet Y
    Output: Attribute

    baseInfo=entropy(Y);
    foreach a in (X.attributes) {
      [X1,...,Xk,Y1,...,Yk]=splitData(X,Y,a);
      gain[a]=baseInfo-splitEntropy(Y1,...,Yk);
    }
    return argmax(gain);
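
A runnable Python sketch of the same loop (my own code, not from the slides); it assumes categorical attributes and represents each instance as a dict from attribute name to value:

    from collections import Counter, defaultdict
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def get_best_split_attribute(X, Y, attributes):
        """Return the attribute with the largest information gain.

        X: list of dicts mapping attribute name -> (categorical) value
        Y: list of class labels, parallel to X
        """
        base_info = entropy(Y)
        gains = {}
        for a in attributes:
            # Partition the labels by the value each instance takes on attribute a.
            branches = defaultdict(list)
            for x, y in zip(X, Y):
                branches[x[a]].append(y)
            split_ent = sum(len(ys) / len(Y) * entropy(ys) for ys in branches.values())
            gains[a] = base_info - split_ent
        return max(gains, key=gains.get)

    # Toy usage: "outlook" separates the labels perfectly, so it should win.
    X = [{"outlook": "sun", "windy": "y"}, {"outlook": "sun", "windy": "n"},
         {"outlook": "rain", "windy": "y"}, {"outlook": "rain", "windy": "n"}]
    Y = [True, True, False, False]
    print(get_best_split_attribute(X, Y, ["outlook", "windy"]))  # outlook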

10 DTs in practice... Growing to purity is bad (overfitting)

11 DTs in practice... Growing to purity is bad (overfitting) [figure: data plotted with x1 = petal length, x2 = sepal width]

12 DTs in practice... Growing to purity is bad (overfitting) [figure: same axes, x1 = petal length, x2 = sepal width]

13 DTs in practice...
Growing to purity is bad (overfitting)
- Terminate growth early
- Grow to purity, then prune back
Multiway splits are a pain
- Entropy is biased in favor of more splits
- Correct w/ gain ratio (see the sketch below)
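
Gain ratio divides the information gain by the "split information" (the entropy of the branch sizes), which penalizes attributes that shatter the data into many small branches. A minimal Python sketch, assuming the same dict-per-instance representation as the earlier sketches (names are my own):

    from collections import Counter, defaultdict
    from math import log2

    def entropy(labels):  # as in the earlier sketch
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain_ratio(X, Y, a):
        """Information gain of attribute a, normalized by the split information."""
        branches = defaultdict(list)
        for x, y in zip(X, Y):
            branches[x[a]].append(y)   # group labels by the value of attribute a
        n = len(Y)
        gain = entropy(Y) - sum(len(ys) / n * entropy(ys) for ys in branches.values())
        # Split information: entropy of the branch *sizes*, not of the labels.
        split_info = -sum(len(ys) / n * log2(len(ys) / n) for ys in branches.values())
        return gain / split_info if split_info > 0 else 0.0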

14 DTs in practice...
Growing to purity is bad (overfitting)
- Terminate growth early
- Grow to purity, then prune back
Multiway splits are a pain
- Entropy is biased in favor of more splits
- Correct w/ gain ratio
Real-valued attributes
- Rules of the form if (x1<3.4) { ... }
- How to pick the "3.4"? (see the threshold sketch below)
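
One standard recipe (the slide doesn't spell it out, so treat the details as an assumption): sort the data by the attribute's value, consider thresholds halfway between consecutive distinct values, and keep the one with the highest information gain. A Python sketch with my own names:

    from collections import Counter
    from math import log2

    def entropy(labels):  # as in the earlier sketch
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Pick a threshold t for a rule of the form (x < t), by information gain.

        Candidate thresholds are midpoints between consecutive distinct values.
        """
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best_t, best_gain = None, -1.0
        for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
            if v1 == v2:
                continue
            t = (v1 + v2) / 2.0
            left = [y for v, y in pairs if v < t]
            right = [y for v, y in pairs if v >= t]
            gain = base - (len(left) / len(pairs) * entropy(left)
                           + len(right) / len(pairs) * entropy(right))
            if gain > best_gain:
                best_t, best_gain = t, gain
        return best_t, best_gain

    # e.g. attribute values 1.3, 1.4, 4.5, 4.7 with labels A, A, B, B
    print(best_threshold([1.4, 1.3, 4.7, 4.5], ["A", "A", "B", "B"]))
    # -> (2.95, 1.0): splitting at 2.95 separates the classes perfectly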

15 Measuring accuracy
So now you have a DT -- what now?
- Usually, you want to use it to classify new (previously unseen) data
- Want to know how well you should expect it to perform
- How do you estimate such a thing?

16 Measuring accuracy
So now you have a DT -- what now?
- Usually, you want to use it to classify new (previously unseen) data
- Want to know how well you should expect it to perform
- How do you estimate such a thing?
- Theoretically -- prove that you have the "right" tree. Very, very hard in practice
- Measure it. Trickier than it sounds!...

17 Testing with training data
So you have a data set X = {x_1, ..., x_N} and corresponding labels Y = {y_1, ..., y_N}. You build your decision tree:

    tree=buildDecisionTree(X,Y)

What happens if you just do this?

    acc=0.0;
    for (i=1;i<=N;++i) {
      acc+=(tree.classify(X[i])==Y[i]);
    }
    acc/=N;
    return acc;

18 Testing with training data Answer: you tend to overestimate real accuracy (possibly drastically) [figure: x2 = sepal width; points marked "?"]

19 Separation of train & test Fundamental principle (1st amendment of ML): Don’t evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!

20 Holdout data
It is usual to "hold out" a separate set of data for testing, not used to train the classifier. A.k.a. the test set, holdout set, evaluation set, etc.
E.g., accuracy measured on the training data is the training-set accuracy; accuracy measured on the held-out data is the test-set (or generalization) accuracy.
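
A minimal sketch of holdout evaluation in Python (my own code, not from the slides; buildDecisionTree and tree.classify stand in for the slides' hypothetical learner API):

    import random

    def holdout_split(X, Y, test_fraction=0.3, seed=0):
        """Shuffle the data, then carve off a test set the learner never sees."""
        idx = list(range(len(X)))
        random.Random(seed).shuffle(idx)
        n_test = int(len(X) * test_fraction)
        test, train = idx[:n_test], idx[n_test:]
        return ([X[i] for i in train], [Y[i] for i in train],
                [X[i] for i in test],  [Y[i] for i in test])

    def accuracy(classify, X, Y):
        """Fraction of instances whose predicted label matches the true label."""
        return sum(classify(x) == y for x, y in zip(X, Y)) / len(Y)

    # Usage sketch:
    #   X_train, Y_train, X_test, Y_test = holdout_split(X, Y)
    #   tree = buildDecisionTree(X_train, Y_train)      # train on the training set only
    #   print(accuracy(tree.classify, X_test, Y_test))  # evaluate on held-out data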

21 Gotchas...
What if you're unlucky when you split data into train/test?
- E.g., all train data are class A and all test data are class B?
- Or no "red" things show up in the training data
Best answer: stratification (see the sketch below)
- Try to make sure class (+ feature) ratios are the same in the train/test sets (and the same as in the original data)
- Why does this work?
Almost as good: randomization
- Shuffle the data randomly before the split
- Why does this work?
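
A hedged sketch of a stratified holdout split (my own code, not from the slides): split each class's instances separately so the train/test class ratios match the original data.

    import random
    from collections import defaultdict

    def stratified_split(X, Y, test_fraction=0.3, seed=0):
        """Holdout split that preserves the class ratios of the original data."""
        rng = random.Random(seed)
        by_class = defaultdict(list)           # indices of each class
        for i, y in enumerate(Y):
            by_class[y].append(i)

        train_idx, test_idx = [], []
        for indices in by_class.values():
            rng.shuffle(indices)               # randomize within each class
            n_test = int(len(indices) * test_fraction)
            test_idx.extend(indices[:n_test])
            train_idx.extend(indices[n_test:])

        return ([X[i] for i in train_idx], [Y[i] for i in train_idx],
                [X[i] for i in test_idx],  [Y[i] for i in test_idx])

Because every class contributes the same fraction to the test set, a rare class cannot end up entirely in the training data or entirely in the test data, which is exactly the failure mode described above.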

