Information Theory, Classification & Decision Trees
Ling 572: Advanced Statistical Methods in NLP
January 5, 2012
Information Theory
Entropy
An information-theoretic measure: it measures the information in a model and is, conceptually, a lower bound on the number of bits needed to encode the outcomes.
For a random variable X with probability function p:
H(X) = -Σ_x p(x) log2 p(x)
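As a minimal sketch of this definition (not from the original slides), entropy in bits can be computed directly from a probability distribution:

import math

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x); terms with p(x) == 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0: a fair coin needs 1 bit
print(entropy([0.25, 0.75]))  # ~0.811: a biased coin needs fewer bits on average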
Cross-Entropy
Used to compare models: the actual distribution p is unknown, so we use a simplified model m to estimate it. The closer m matches p, the lower the cross-entropy.
H(p, m) = -Σ_x p(x) log2 m(x)
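A small illustration of the idea, with made-up distributions, showing that a closer model gives lower cross-entropy:

import math

def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) * log2 m(x); m must give nonzero mass wherever p does."""
    return -sum(px * math.log2(mx) for px, mx in zip(p, m) if px > 0)

p = [0.5, 0.25, 0.25]        # "true" distribution (illustrative values)
good_m = [0.4, 0.3, 0.3]     # closer model
poor_m = [0.1, 0.1, 0.8]     # poorer model
print(cross_entropy(p, good_m))  # lower
print(cross_entropy(p, poor_m))  # higher
print(cross_entropy(p, p))       # equals H(p): cross-entropy is minimized when m == p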
Relative Entropy
Commonly known as the Kullback-Leibler (KL) divergence; expresses the difference between two probability distributions:
KL(p || q) = Σ_x p(x) log2 ( p(x) / q(x) )
Not a proper distance metric: it is asymmetric, KL(p || q) != KL(q || p).
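A short sketch (illustrative distributions only) that computes the KL divergence and shows the asymmetry:

import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x)); q must be nonzero wherever p is."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.3]
q = [0.4, 0.6]
print(kl_divergence(p, q))  # != kl_divergence(q, p): KL is asymmetric
print(kl_divergence(q, p))
print(kl_divergence(p, p))  # 0.0: identical distributions have zero divergence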
Joint & Conditional Entropy
Joint entropy:
H(X, Y) = -Σ_x Σ_y p(x, y) log2 p(x, y)
Conditional entropy:
H(Y | X) = -Σ_x Σ_y p(x, y) log2 p(y | x) = H(X, Y) - H(X)
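A minimal sketch, assuming the joint distribution is given as a table of p(x, y) values, that computes joint and conditional entropy via the chain rule:

import math

def joint_entropy(joint):
    """H(X, Y) over a dict {(x, y): p(x, y)}."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

def conditional_entropy(joint):
    """H(Y | X) = H(X, Y) - H(X), using the chain rule."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    h_x = -sum(p * math.log2(p) for p in px.values() if p > 0)
    return joint_entropy(joint) - h_x

# Illustrative joint distribution over (X, Y)
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1}
print(joint_entropy(joint))
print(conditional_entropy(joint))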
Perplexity and Entropy
Given that H(L, P) ≈ -(1/N) log2 P(W), consider the perplexity equation:
PP(W) = P(W)^(-1/N) = 2^(-(1/N) log2 P(W)) = 2^(H(L, P))
where H(L, P) is the (cross-)entropy of the language L under model P.
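As a small worked example (made-up per-word probabilities), perplexity can be computed from the average log probability:

import math

def perplexity(word_probs):
    """PP(W) = P(W)^(-1/N) = 2^(-(1/N) * sum_i log2 p(w_i))."""
    n = len(word_probs)
    avg_log_prob = sum(math.log2(p) for p in word_probs) / n
    return 2 ** (-avg_log_prob)

# Illustrative per-word model probabilities for a 4-word sentence
print(perplexity([0.1, 0.2, 0.05, 0.1]))  # 10.0; higher probabilities -> lower perplexity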
Mutual Information
A measure of the information in common between two distributions:
I(X; Y) = Σ_x Σ_y p(x, y) log2 ( p(x, y) / (p(x) p(y)) ) = H(X) - H(X | Y) = H(Y) - H(Y | X)
Symmetric: I(X; Y) = I(Y; X)
I(X; Y) = KL(p(x, y) || p(x)p(y))
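A minimal sketch computing mutual information as KL(p(x, y) || p(x)p(y)) from a joint probability table (illustrative values):

import math

def mutual_information(joint):
    """I(X; Y) = KL(p(x, y) || p(x)p(y)) over a dict {(x, y): p(x, y)}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(mutual_information(joint))  # > 0: X and Y share information

indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(indep))  # 0.0: independent variables share none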
Decision Trees

Classification Task
Task: C is a finite set of labels (aka categories, classes); given x, determine its category y in C.
Instance: (x, y), where x is the thing to be labeled/classified and y is its label/class.
Data: a set of instances. Labeled data: y is known; unlabeled data: y is unknown.
Training data, test data.
Two Stages
Training: Learner: training data -> classifier.
Classifier: f(x) = y, where x is the input and y is in C.
Testing: Decoder: test data + classifier -> classification output.
Also: preprocessing, postprocessing, evaluation.
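As a toy sketch of this two-stage view (hypothetical names and a trivial majority-class learner, not from the slides):

def train(training_data):
    """Learner: training data -> classifier. Here, a trivial majority-class classifier."""
    labels = [y for _, y in training_data]
    majority = max(set(labels), key=labels.count)
    def classify(x):        # f(x) = y, with y in C
        return majority
    return classify

def test(test_data, classify):
    """Decoder: test data + classifier -> classification output (and accuracy)."""
    predictions = [classify(x) for x, _ in test_data]
    correct = sum(p == y for p, (_, y) in zip(predictions, test_data))
    return predictions, correct / len(test_data)

classifier = train([("x1", "Burn"), ("x2", "None"), ("x3", "None")])
print(test([("x4", "None")], classifier))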
Roadmap
Decision Trees:
Sunburn example
Decision tree basics
From trees to rules
Key questions: Training procedure? Decoding procedure? Overfitting? Different feature types?
Analysis: pros & cons

Sunburn Example
Learning about Sunburn
Goal: train on labeled examples; predict Burn/None for new instances.
Solution?
Exact match: same features, same output. Problem: 2 * 3^3 = 54 feature combinations (and it could be much worse).
Same label as the 'most similar' instance. Problem: what counts as close? Which features matter? Many instances match on two features but differ on the result.

Learning about Sunburn
Better solution: a decision tree.
Training: divide the examples into subsets based on feature tests; the sets of samples at the leaves define the classification.
Prediction: route a NEW instance through the tree to a leaf based on the feature tests; assign it the same value as the samples at that leaf.
Sunburn Decision Tree
Hair Color?
  Blonde -> Lotion Used?
    No  -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red   -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None
Decision Tree Structure
Internal nodes: each node is a test; generally a test of a single feature, e.g. Hair == ? (theoretically, it could test multiple features).
Branches: each branch corresponds to an outcome of the test, e.g. Hair == Red, Hair != Blond.
Leaves: each leaf corresponds to a decision. Discrete class: classification/decision tree; real value: regression.

From Trees to Rules
Tree: each branch from the root to a leaf = a sequence of tests => a classification.
The tests are the 'if' antecedents; the leaf label is the consequent.
All decision trees can be converted to rules; not all rule sets can be expressed as trees.
From ID Trees to Rules
From the sunburn decision tree above:
(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))
Which Tree?
Many possible decision trees exist for any problem. How can we select among them? What would be the 'best' tree? The smallest? The shallowest? The most accurate on unseen data?

Simplicity
Occam's Razor: the simplest explanation that covers the data is best.
Occam's Razor for decision trees: the smallest tree consistent with the samples will be the best predictor for new data.
Problem: finding all trees and finding the smallest is expensive!
Solution: greedily build a small tree.
Building Trees: Basic Algorithm
Goal: build a small tree such that all samples at the leaves have the same class.
Greedy solution (sketched in code below):
At each node, pick a test using the 'best' feature.
Split into subsets based on the outcomes of the feature test.
Repeat the process until the stopping criterion is met, i.e. until the leaves have the same class.
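A minimal recursive sketch of this greedy procedure (hypothetical names; the `score` function is left abstract here and would be instantiated with information gain, introduced below):

def build_tree(instances, features, score):
    """Greedy tree building: instances are (feature_dict, label) pairs;
    score(instances, feature) rates how good a split on `feature` is (higher = better)."""
    labels = [y for _, y in instances]
    if len(set(labels)) == 1 or not features:   # stopping criterion: pure leaf (or no tests left)
        return {"leaf": max(set(labels), key=labels.count)}
    best = max(features, key=lambda f: score(instances, f))   # pick the 'best' feature
    node = {"test": best, "branches": {}}
    for value in {x[best] for x, _ in instances}:             # split on the test's outcomes
        subset = [(x, y) for x, y in instances if x[best] == value]
        remaining = [f for f in features if f != best]
        node["branches"][value] = build_tree(subset, remaining, score)  # recurse on each subset
    return node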
Key Questions
Splitting: how do we select the 'best' feature?
Stopping: when do we stop splitting to avoid overfitting?
Features: how do we split different types of features? Binary? Discrete? Continuous?

Building Decision Trees: I
Goal: build a small tree such that all samples at the leaves have the same class.
Greedy solution: at each node, pick the test such that the branches are closest to having a uniform class; split into subsets where most instances are in a uniform class.
Picking a Test
Hair Color: Blonde -> Sarah:B, Dana:N, Annie:B, Katie:N; Red -> Emily:B; Brown -> Alex:N, Pete:N, John:N
Height: Short -> Alex:N, Annie:B, Katie:N; Average -> Sarah:B, Emily:B, John:N; Tall -> Dana:N, Pete:N
Weight: Light -> Sarah:B, Katie:N; Average -> Dana:N, Alex:N, Annie:B; Heavy -> Emily:B, Pete:N, John:N
Lotion: No -> Sarah:B, Annie:B, Emily:B, Pete:N, John:N; Yes -> Dana:N, Alex:N, Katie:N

Picking a Test (among the blonde-haired instances)
Height: Short -> Annie:B, Katie:N; Average -> Sarah:B; Tall -> Dana:N
Weight: Light -> Sarah:B, Katie:N; Average -> Dana:N, Annie:B; Heavy -> (none)
Lotion: No -> Sarah:B, Annie:B; Yes -> Dana:N, Katie:N
Measuring Disorder
Problem: in general, tests on large databases don't yield homogeneous subsets.
Solution: a general information-theoretic measure of disorder.
Desired features: a homogeneous set has least disorder = 0; an even split has most disorder = 1.

Measuring Entropy
If we split m objects into 2 bins of sizes m1 and m2, what is the entropy?
H = -(m1/m) log2(m1/m) - (m2/m) log2(m2/m)
Measuring Disorder: Entropy
Let p_i be the probability of being in bin i. The entropy (disorder) of a split is -Σ_i p_i log2 p_i (taking 0 log2 0 = 0).
p1 = 1/2, p2 = 1/2: -1/2 log2 1/2 - 1/2 log2 1/2 = 1/2 + 1/2 = 1
p1 = 1/4, p2 = 3/4: -1/4 log2 1/4 - 3/4 log2 3/4 = 0.5 + 0.311 = 0.811
p1 = 1,   p2 = 0:   -1 log2 1 - 0 log2 0 = 0 - 0 = 0
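A tiny check of these values (a sketch, not course code):

import math

def split_entropy(p1, p2):
    """-p1*log2(p1) - p2*log2(p2), with 0*log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in (p1, p2) if p > 0)

print(split_entropy(0.5, 0.5))    # 1.0   (even split: most disorder)
print(split_entropy(0.25, 0.75))  # ~0.811
print(split_entropy(1.0, 0.0))    # 0.0   (homogeneous: least disorder)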
Information Gain
InfoGain(Y | X): how many bits can we save if we know X?
InfoGain(Y | X) = H(Y) - H(Y | X) (equivalently written InfoGain(Y, X))
InfoGain(S, A): the expected reduction in entropy of sample set S due to splitting on attribute A:
InfoGain(S, A) = H(S) - Σ_a (|S_a| / |S|) H(S_a), where S_a = elements of S with value a for A
Select the attribute A with maximum InfoGain, i.e. the split resulting in the lowest average entropy.
Computing Average Entropy
The average entropy of a split over |S| instances weights the disorder of the class distribution on each branch i by the fraction of samples that go down branch i:
AvgEntropy(S, A) = Σ_i (|S_i| / |S|) H(S_i)
[Diagram: two branches, each annotated with its counts of class-a and class-b samples.]
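A minimal sketch of average entropy and information gain, checked against the hair-color split from the sunburn data:

from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def average_entropy(branches):
    """Weighted average entropy of a split, given the label lists on each branch."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * entropy(b) for b in branches)

def info_gain(parent_labels, branches):
    """InfoGain(S, A) = H(S) - average entropy after splitting on A."""
    return entropy(parent_labels) - average_entropy(branches)

# Hair-color split from the sunburn data: Blonde [2B, 2N], Red [1B], Brown [3N]
parent = ["B"] * 3 + ["N"] * 5
branches = [["B", "N", "B", "N"], ["B"], ["N", "N", "N"]]
print(info_gain(parent, branches))  # ~0.454, matching the slide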
Entropy in Sunburn Example
S = [3B, 5N], H(S) = 0.954
Hair color = 0.954 - [4/8 (-2/4 log2 2/4 - 2/4 log2 2/4) + 1/8 * 0 + 3/8 * 0] = 0.954 - 0.5 = 0.454
Height = 0.954 - 0.69 = 0.264
Weight = 0.954 - 0.94 = 0.014
Lotion = 0.954 - 0.61 = 0.344
Hair color gives the highest gain, so it is chosen as the root test.

Entropy in Sunburn Example (second level: the blonde-haired subset)
S = [2B, 2N], H(S) = 1
Height = 1 - [2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 1/4 * 0 + 1/4 * 0] = 1 - 0.5 = 0.5
Weight = 1 - [2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 2/4 (-1/2 log2 1/2 - 1/2 log2 1/2)] = 1 - 1 = 0
Lotion = 1 - 0 = 1
Lotion gives the highest gain, so it is chosen as the second test.
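To reproduce these numbers, here is a sketch that computes the gains from the raw sunburn data (feature values read off the 'Picking a Test' slide above; exact values differ slightly from the slide's rounded intermediates):

from collections import Counter
import math

# (name, hair, height, weight, lotion, label) from the sunburn example
DATA = [
    ("Sarah", "blonde", "average", "light",   "no",  "B"),
    ("Dana",  "blonde", "tall",    "average", "yes", "N"),
    ("Alex",  "brown",  "short",   "average", "yes", "N"),
    ("Annie", "blonde", "short",   "average", "no",  "B"),
    ("Emily", "red",    "average", "heavy",   "no",  "B"),
    ("Pete",  "brown",  "tall",    "heavy",   "no",  "N"),
    ("John",  "brown",  "average", "heavy",   "no",  "N"),
    ("Katie", "blonde", "short",   "light",   "yes", "N"),
]
FEATURES = {"hair": 1, "height": 2, "weight": 3, "lotion": 4}

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    labels = [r[-1] for r in rows]
    branches = {}
    for r in rows:
        branches.setdefault(r[col], []).append(r[-1])
    avg = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    return entropy(labels) - avg

for name, col in FEATURES.items():
    print(name, round(info_gain(DATA, col), 3))
# hair 0.454, height 0.266, weight 0.016, lotion 0.348

blondes = [r for r in DATA if r[1] == "blonde"]
for name, col in FEATURES.items():
    print(name, round(info_gain(blondes, col), 3))
# hair 0.0, height 0.5, weight 0.0, lotion 1.0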
Building Decision Trees with Information Gain
Until there are no inhomogeneous leaves:
Select an inhomogeneous leaf node.
Replace that leaf node by a test node, creating the subsets that yield the highest information gain.
This effectively creates a set of rectangular regions, repeatedly drawing lines along different axes.
Alternate Measures
Issue with information gain: it favors features with more values.
Option: Gain Ratio
GainRatio(S, A) = InfoGain(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = -Σ_a (|S_a| / |S|) log2(|S_a| / |S|) and S_a = elements of S with value a for A.
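A minimal sketch of the gain-ratio computation under the formula above (illustrative numbers from the hair-color split):

import math

def split_info(branch_sizes):
    """SplitInfo(S, A) = -sum_a (|S_a|/|S|) log2(|S_a|/|S|): entropy of the split itself."""
    total = sum(branch_sizes)
    return -sum(s / total * math.log2(s / total) for s in branch_sizes if s > 0)

def gain_ratio(info_gain, branch_sizes):
    """GainRatio(S, A) = InfoGain(S, A) / SplitInfo(S, A)."""
    return info_gain / split_info(branch_sizes)

# Hair-color split from the sunburn data: branches of size 4 (blonde), 1 (red), 3 (brown)
print(gain_ratio(0.454, [4, 1, 3]))  # the gain is normalized by how finely the feature splits the data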
Overfitting
Overfitting: the model fits the training data TOO well, fitting noise and irrelevant details.
Why is this bad? It harms generalization: the model fits the training data too well but fits new data badly.
Formally, for a model m, consider training_error(m) and D_error(m), where D is all data. m overfits if there is another model m' with training_error(m) < training_error(m') but D_error(m) > D_error(m').
Avoiding Overfitting
Strategies to avoid overfitting:
Early stopping: stop when InfoGain < threshold, when the number of instances < threshold, or when tree depth > threshold.
Post-pruning: grow the full tree and then remove branches.
Which is better? Unclear; both are used. For some applications, post-pruning is better.
Post-Pruning
Divide the data into a training set (used to build the original tree) and a validation set (used to perform pruning).
Build the decision tree based on the training data.
Until further pruning reduces validation-set performance: compute validation-set performance for pruning each node (and its children), and greedily remove nodes whose removal does not reduce validation-set performance.
This yields a smaller tree with the best validation performance.
Performance Measures
Compute accuracy on a validation set, or via k-fold cross-validation.
Weighted classification error cost: weight some types of errors more heavily.
Minimum description length: favor good accuracy on compact models; MDL = error(tree) + model_size(tree).
Rule Post-Pruning
Convert the tree to rules, prune the rules independently, then sort the final rule set.
Probably the most widely used method (e.g. in toolkits).
Modeling Features
Different types of features need different tests:
Binary: the test branches on true/false.
Discrete: one branch for each discrete value.
Continuous: need to discretize; enumerating all values is not possible or desirable. Pick a value x and branch on value < x vs. value >= x. How can we pick the split points?
Picking Splits
Need useful, sufficient split points. What's a good strategy?
Approach (see the sketch below):
Sort all values for the feature in the training data.
Identify adjacent instances of different classes; candidate split points lie between those instances.
Select the candidate with the highest information gain.
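A small sketch of this strategy (hypothetical values; only the candidate-generation step is shown, scoring by information gain would follow):

def candidate_splits(values, labels):
    """Sort (value, label) pairs and propose a split point midway between
    adjacent instances whose classes differ."""
    pairs = sorted(zip(values, labels))
    candidates = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            candidates.append((v1 + v2) / 2)
    return candidates

# Hypothetical continuous feature (e.g. height in cm) with Burn/None labels
print(candidate_splits([150, 160, 165, 172, 180], ["B", "B", "N", "N", "B"]))
# -> [162.5, 176.0]; each candidate would then be scored by information gain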
Features in Decision Trees: Pros
Feature selection: tests the features that yield low disorder, i.e. it selects the features that are important and ignores irrelevant ones.
Feature type handling: discrete type: 1 branch per value; continuous type: branch on >= value.
Absent features: distribute uniformly.
Features in Decision Trees: Cons
Features are assumed independent; if you want a group effect, you must model it explicitly, e.g. by making a new feature AorB.
Feature tests are conjunctive.
Decision Trees
Train: build the tree by forming subsets of least disorder.
Predict: traverse the tree based on feature tests; assign the label of the samples at the leaf node.
Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rule reading.
Cons: poor at feature combination and dependency; building the optimal tree is intractable.