1 The joy of Entropy

2 Administrivia Reminder: HW 1 due next week
No other news. No noose is good noose...

3 Time wings on... Last time: hypothesis spaces; intro to decision trees.
This time: loss matrices; learning bias; the getBestSplitFeature function; entropy.

4 Loss For problem 8.11, you need cost values A.k.a. loss values
Introduced in DH&S ch. 2.2 Basic idea: some mistakes are more expensive than others

5 Loss Example: classifying computer network traffic
Traffic is either normal or intrusive There’s way more normal traffic than intrusive Data is normal, but classifier says “intrusive”? Data is intrusive, but classifier says “normal”?

6 Cost of mistakes
                            True class
                         Normal     Intrusion
Predicted   Normal         $0        $5,000
class       Intrusion      $5          $0

7 Cost of mistakes
                            True class
                          ω1        ω2
Predicted   ω1            λ11       λ12
class       ω2            λ21       λ22

8 In general...
                            True class
                          ω1     ω2     ...    ωk
Predicted   ω1            λ11    λ12    ...    λ1k
class       ω2            λ21    λ22    ...    λ2k
            ...           ...    ...    ...    ...
            ωk            λk1    λk2    ...    λkk

9 Cost-based criterion For the misclassification error case, we wrote the risk of a classifier f as the probability of error; for the cost-based case, this becomes an expected loss.
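The equations on this slide were images and do not survive in the transcript; a standard reconstruction, using the λij notation from the previous slides (the exact notation is an assumption), is:

For misclassification error:

$$R(f) \;=\; \mathbb{E}_{(x,y)}\big[\mathbf{1}\{f(x)\neq y\}\big] \;=\; \Pr\big[f(x)\neq y\big]$$

For the cost-based case:

$$R(f) \;=\; \mathbb{E}_{(x,y)}\big[\lambda_{f(x),\,y}\big] \;=\; \sum_{i}\sum_{j}\lambda_{ij}\,\Pr\big[f(x)=\omega_i,\ y=\omega_j\big]$$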

10 Back to decision trees... Reminders: Hypothesis space for DT:
Data-structure view: all trees with a single test per internal node and a constant leaf value. Geometric view: sets of axis-orthogonal hyper-rectangles; a piecewise-constant approximation. Open question: the getBestSplitFeature function

11 Splitting criteria What properties do we want our getBestSplitFeature() function to have? It should increase the purity of the data: after the split, the new sets should be closer to uniform labeling than before the split. We also want the subsets to have roughly the same purity, and to be as balanced as possible.

12 Bias These choices are designed to produce small trees
They may miss other, better trees that are larger or that require a non-greedy split at the root. Definition: learning bias == the tendency of an algorithm to prefer one class of solutions in H over another

13 Bias: the pretty picture
[Figure: the hypothesis space H drawn as a subset of the space of all functions on the input space]

14 Bias: the algebra Bias also seen as expected difference between true concept and induced concept: Note: expectation taken over all possible data sets Don’t actually know that distribution either :-P Can (sometimes) make a prior assumption
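The equation itself does not survive in the transcript; one common way to write "expected difference between the true concept and the induced concept", taking the expectation over training data sets D (this notation is an assumption, not the deck's), is:

$$\mathrm{Bias}(x) \;=\; \mathbb{E}_{D}\big[\, f^{*}(x) \;-\; \hat{f}_{D}(x) \,\big]$$

where f* is the true concept and \hat{f}_D is the hypothesis induced from data set D.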

15 More on Bias Bias can be a property of: the risk/loss function
(how you measure “distance” to the best solution), or the search strategy (how you move through H to find a hypothesis)

16 Back to splitting... Consider a set of true/false labels
We want our measure to be small when the set is pure (all true or all false), and large when the set is almost evenly divided between the classes. In general, we call such a function an impurity measure, i(y). We’ll use entropy, which expresses the amount of information in the set. (Later we’ll use the negative of this function, so higher values will correspond to purer sets.)

17-19 Entropy, cont’d Define: class fractions (a.k.a. class prior probabilities). Define: entropy of a set, and its general form for k classes.
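The defining equations were slide images; the standard definitions these slides refer to are:

$$p_i \;=\; \frac{N_i}{N} \qquad \text{(the fraction of the set carrying class label } \omega_i\text{)}$$

For a two-class (true/false) set,

$$H(y) \;=\; -\,p_{\mathrm{true}}\log_2 p_{\mathrm{true}} \;-\; p_{\mathrm{false}}\log_2 p_{\mathrm{false}}$$

and in general, for k classes,

$$H(y) \;=\; -\sum_{i=1}^{k} p_i \log_2 p_i$$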

20 The entropy curve
[Figure: binary entropy as a function of the class fraction, zero for a pure set and maximal when the two classes are equally represented]

21 Properties of entropy Maximum when the class fractions are equal
Minimum when the data is pure. Smooth: differentiable and continuous. Concave. Intuitively: the entropy of a distribution tells you how “predictable” that distribution is.
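A small sketch of this computation, in the same Java-ish style as the pseudocode later in the deck; the EntropyDemo class and its method names are illustrative, not from the slides:

import java.util.HashMap;
import java.util.Map;

public class EntropyDemo {
    // Entropy (in bits) of a set of discrete class labels: -sum_i p_i log2 p_i
    static double entropy(int[] labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int y : labels) {
            counts.merge(y, 1, Integer::sum);        // count each class
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.length;   // class fraction p_i
            h -= p * Math.log(p) / Math.log(2);      // accumulate -p_i log2 p_i
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new int[]{0, 0, 0, 0}));  // pure set -> 0.0
        System.out.println(entropy(new int[]{0, 0, 1, 1}));  // evenly split -> 1.0
        System.out.println(entropy(new int[]{0, 0, 0, 1}));  // mostly pure -> ~0.81
    }
}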

22 Entropy in a nutshell (from Andrew Moore’s tutorial on information gain)

23 Entropy in a nutshell Low entropy distribution
data values (location of soup) sampled from tight distribution (bowl) -- highly predictable

24 Entropy in a nutshell High entropy distribution
data values (location of soup) sampled from loose distribution (uniformly around dining room) -- highly unpredictable

25 Entropy of a split A split produces a number of sets (one for each branch). We need a corresponding entropy of a split (i.e., the entropy of a collection of sets). Definition: the entropy of a split is given below.
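The formula itself is not in the transcript; the usual definition is the size-weighted average of the subset entropies, where y_j is the label set sent down branch j and |y| is the total number of labels:

$$H_{\mathrm{split}} \;=\; \sum_{j} \frac{|y_j|}{|y|}\, H(y_j)$$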

26 Information gain The last, easy step:
Want to pick the attribute that decreases the information content of the data as much as possible Q: Why decrease? Define: gain of splitting data set [X,y] on attribute a:
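The gain equation is likewise not in the transcript; the standard form, consistent with the split entropy above, is:

$$\mathrm{Gain}(X, y; a) \;=\; H(y) \;-\; \sum_{j} \frac{|y_j|}{|y|}\, H(y_j)$$

and getBestSplitFeature picks the attribute with the largest gain, i.e., the largest drop in entropy.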

27 The splitting method
Feature getBestSplitFeature(X, Y) {
  // Input: instance set X, label set Y
  double baseInfo = entropy(Y);                 // entropy of the labels before any split
  double[] gain = new double[X.getFeatureSet().size()];
  for (Feature a : X.getFeatureSet()) {
    // partition the instances and labels by the values of feature a
    [X0,...,Xk, Y0,...,Yk] = a.splitData(X, Y);
    // gain = entropy before the split minus weighted entropy after
    gain[a] = baseInfo - splitEntropy(Y0,...,Yk);
  }
  return argmax(gain);                          // the feature with the largest information gain
}

28-30 DTs in practice... Growing to purity is bad (overfitting)
[Figures: iris data, x1: petal length vs. x2: sepal width, showing the tree's decision regions as the tree is grown to purity]

31 DTs in practice... Growing to purity is bad (overfitting)
Remedies: terminate growth early, or grow to purity and then prune back

32 DTs in practice... Growing to purity is bad (overfitting)
[Figure: iris data, x1: petal length vs. x2: sepal width; a leaf that is not statistically supportable is pruned by removing its split and merging the leaves]

33 DTs in practice... Multiway splits are a pain
Entropy is biased in favor of more splits Correct w/ gain ratio (DH&S Ch , Eqn 7)
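The slide defers the exact equation to DH&S; the commonly used C4.5-style correction (shown here as an assumption, since the transcript omits the formula) divides the gain by the entropy of the split proportions themselves, penalizing attributes that shatter the data into many small branches:

$$\mathrm{GainRatio}(X, y; a) \;=\; \frac{\mathrm{Gain}(X, y; a)}{-\sum_{j} \frac{|y_j|}{|y|} \log_2 \frac{|y_j|}{|y|}}$$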

34 DTs in practice... Real-valued attributes
rules of form if (x1<3.4) { ... } How to pick the “3.4”?
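The slide leaves the question open; a common answer (a sketch of the standard trick, not necessarily this deck's method) is to sort the observed values of the feature and take the midpoints between consecutive distinct values as candidate thresholds, scoring each candidate with the same information-gain criterion used for discrete splits:

import java.util.Arrays;

public class ThresholdCandidates {
    // Candidate thresholds for a real-valued feature: midpoints between
    // consecutive distinct sorted values. Each candidate would then be
    // scored by information gain, just like a discrete split.
    static double[] candidates(double[] values) {
        double[] v = values.clone();
        Arrays.sort(v);
        return java.util.stream.IntStream.range(0, v.length - 1)
                .filter(i -> v[i] != v[i + 1])            // skip tied values
                .mapToDouble(i -> (v[i] + v[i + 1]) / 2)  // midpoint threshold
                .toArray();
    }

    public static void main(String[] args) {
        // illustrative x1 (petal length) values; 3.4 shows up as one midpoint
        System.out.println(Arrays.toString(candidates(new double[]{1.4, 3.0, 3.8, 5.1})));
    }
}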

35-36 Measuring accuracy So now you have a DT -- what now?
Usually, you want to use it to classify new (previously unseen) data, and you want to know how well you should expect it to perform. How do you estimate such a thing? Theoretically: prove that you have the “right” tree (very, very hard in practice). Empirically: measure it (trickier than it sounds!...)

37 Testing with training data
So you have a data set X = {x1, ..., xN} and corresponding labels y = {y1, ..., yN}. You build your decision tree: tree = buildDecisionTree(X, y). What happens if you just do this:
double acc = 0.0;
for (int i = 0; i < N; ++i) {
  // count a hit whenever the tree reproduces the training label
  if (tree.classify(X[i]) == y[i]) { acc += 1.0; }
}
acc /= N;      // fraction of the training set classified correctly
return acc;

38 Testing with training data
Answer: you tend to overestimate real accuracy (possibly drastically)
[Figure: iris data (x2: sepal width) with several previously unseen query points marked "?"]

39 Separation of train & test
Fundamental principle (1st amendment of ML): Don’t evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!
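A minimal sketch of the principle, assuming a simple shuffle-then-holdout scheme (the 80/20 split and the helper names in the comments are illustrative, not from the deck):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class HoldoutSplit {
    public static void main(String[] args) {
        int n = 150;                        // e.g. number of labeled examples
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx);           // randomize before splitting

        int nTrain = (int) (0.8 * n);       // illustrative 80/20 holdout
        List<Integer> trainIdx = idx.subList(0, nTrain);  // build the tree on these rows only
        List<Integer> testIdx  = idx.subList(nTrain, n);  // measure accuracy on these rows only

        // tree = buildDecisionTree(X[trainIdx], y[trainIdx]);
        // acc  = measureAccuracy(tree, X[testIdx], y[testIdx]);   // never reuse training rows
        System.out.println("train=" + trainIdx.size() + " test=" + testIdx.size());
    }
}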

