Information Theory, Classification & Decision Trees
Ling 572: Advanced Statistical Methods in NLP
January 5, 2012

Information Theory

Entropy
An information-theoretic measure of the information in a model; conceptually, a lower bound on the number of bits needed to encode outcomes.
Entropy: H(X) = -sum_x p(x) log2 p(x), where X is a random variable and p is its probability function.
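
As a concrete illustration (not from the original slides), here is a minimal Python sketch of the entropy computation, assuming the distribution is given as a list of probabilities:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x); terms with p(x) == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit: a fair coin needs one bit per outcome
print(entropy([0.25, 0.75]))   # ~0.811 bits: a biased coin is more predictable
print(entropy([1.0, 0.0]))     # 0.0 bits: no uncertainty at all
```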

Cross-Entropy
Used for comparing models: the actual distribution p is unknown, so we use a simplified model m to estimate it. A closer match has lower cross-entropy.
Cross-entropy: H(p, m) = -sum_x p(x) log2 m(x).
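
A short illustrative sketch (added here, not in the slides) of cross-entropy between a true distribution p and a model m defined over the same outcomes:

```python
import math

def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) * log2 m(x); assumes m(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(mx) for px, mx in zip(p, m) if px > 0)

p = [0.5, 0.25, 0.25]
print(cross_entropy(p, p))                 # equals H(p): 1.5 bits
print(cross_entropy(p, [1/3, 1/3, 1/3]))   # ~1.585 bits: a worse model has higher cross-entropy
```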

Relative Entropy
Commonly known as Kullback-Leibler (KL) divergence; expresses the difference between two probability distributions:
KL(p||q) = sum_x p(x) log2 ( p(x) / q(x) )
Not a proper distance metric: it is asymmetric, KL(p||q) != KL(q||p).
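
An illustrative sketch of the KL computation and of its asymmetry; the helper and the toy distributions are assumptions, not from the slides:

```python
import math

def kl_divergence(p, q):
    """KL(p||q) = sum_x p(x) * log2( p(x) / q(x) ); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_divergence(p, q))  # ~0.531
print(kl_divergence(q, p))  # ~0.737: not equal, so KL is asymmetric
```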

Joint & Conditional Entropy
Joint entropy: H(X,Y) = -sum_x sum_y p(x,y) log2 p(x,y)
Conditional entropy: H(Y|X) = -sum_x sum_y p(x,y) log2 p(y|x) = H(X,Y) - H(X)

Perplexity and Entropy
Given that H(L,P) = -(1/N) log2 P(W), consider the perplexity equation:
PP(W) = P(W)^(-1/N) = 2^(H(L,P))
where H is the (cross-)entropy of the language L under the model P.
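
A small sketch of the perplexity relation, under the assumption that we are given the total log2 probability of an N-word test set:

```python
def perplexity(log2_prob, num_words):
    """PP(W) = P(W)^(-1/N) = 2^H, where H = -(1/N) * log2 P(W) is the per-word entropy in bits."""
    h = -log2_prob / num_words
    return 2 ** h

# e.g. a 1000-word test set whose total log2-probability under the model is -7000:
print(perplexity(-7000.0, 1000))   # 2^7 = 128.0
```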

Mutual Information
A measure of the information in common between two distributions:
I(X;Y) = sum_x sum_y p(x,y) log2 ( p(x,y) / (p(x) p(y)) )
Symmetric: I(X;Y) = I(Y;X)
I(X;Y) = KL(p(x,y)||p(x)p(y))
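
A sketch (toy numbers, not from the slides) of the mutual information sum, which is exactly the KL divergence between p(x,y) and p(x)p(y):

```python
import math

def mutual_information(joint, p_x, p_y):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    return sum(pxy * math.log2(pxy / (p_x[x] * p_y[y]))
               for (x, y), pxy in joint.items() if pxy > 0)

joint = {('a', 0): 0.4, ('a', 1): 0.1, ('b', 0): 0.1, ('b', 1): 0.4}
p_x = {'a': 0.5, 'b': 0.5}
p_y = {0: 0.5, 1: 0.5}
print(mutual_information(joint, p_x, p_y))   # about 0.278 bits: X and Y are dependent
# If the joint equaled the product of the marginals, I(X;Y) would be 0.
```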

Decision Trees

Classification Task
Task: C is a finite set of labels (a.k.a. categories or classes); given x, determine its category y in C.
Instance: (x, y), where x is the thing to be labeled/classified and y is its label/class.
Data: a set of instances. Labeled data: y is known; unlabeled data: y is unknown.
Training data vs. test data.

Two Stages
Training: Learner: training data -> classifier, where the classifier is f(x) = y with x the input and y in C.
Testing: Decoder: test data + classifier -> classification output.
Also: preprocessing, postprocessing, evaluation.


Roadmap
Decision Trees: the Sunburn example; decision tree basics; from trees to rules.
Key questions: Training procedure? Decoding procedure? Overfitting? Different feature types?
Analysis: pros & cons.

Sunburn Example

Learning about Sunburn
Goal: train on labeled examples and predict Burn/None for new instances.
Solution? Exact match: same features, same output. Problem: there are 2*3^3 feature combinations, and it could be much worse.
Same label as the 'most similar' instance? Problem: what counts as close, and which features matter? Many instances match on two features but differ in the result.

Learning about Sunburn
Better solution: a decision tree.
Training: divide the examples into subsets based on feature tests; the sets of samples at the leaves define the classification.
Prediction: route a new instance through the tree to a leaf based on the feature tests, and assign it the same value as the samples at that leaf.

Sunburn Decision Tree
Hair Color?
  Blonde -> Lotion Used?
    No -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None
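
For concreteness (not part of the original slides), one hypothetical way to encode this tree as a nested dict and route an instance through it:

```python
# Internal nodes map a feature name to {value: subtree}; leaves are class labels.
sunburn_tree = {'hair': {'blonde': {'lotion': {'no': 'Burn', 'yes': 'None'}},
                         'red': 'Burn',
                         'brown': 'None'}}

def predict(tree, instance):
    """Route an instance (a dict of feature values) down the tree to a leaf label."""
    while isinstance(tree, dict):
        feature = next(iter(tree))              # the feature tested at this node
        tree = tree[feature][instance[feature]] # follow the branch for this value
    return tree

print(predict(sunburn_tree, {'hair': 'blonde', 'lotion': 'no'}))  # Burn
```

Any equivalent representation works; the only requirement is that each internal node records which feature it tests and where each outcome leads.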

Decision Tree Structure
Internal nodes: each node is a test, generally on a single feature (e.g., Hair == ?); theoretically a node could test multiple features.
Branches: each branch corresponds to an outcome of the test (e.g., Hair == Red; Hair != Blonde).
Leaves: each leaf corresponds to a decision. Discrete class: classification/decision tree; real value: regression tree.

From Trees to Rules
Tree: each path of branches from root to leaf is a sequence of tests leading to a classification. The tests form the if-antecedents; the leaf label is the consequent.
All decision trees can be converted to rules; not all rule sets can be expressed as trees.

From ID Trees to Rules
Reading the sunburn tree above off as rules:
(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))

Which Tree?
There are many possible decision trees for any problem. How can we select among them? What would be the 'best' tree: the smallest? the shallowest? the most accurate on unseen data?

Simplicity
Occam's Razor: the simplest explanation that covers the data is best.
Occam's Razor for decision trees: the smallest tree consistent with the samples will be the best predictor for new data.
Problem: finding all trees and then the smallest one is expensive! Solution: greedily build a small tree.

Building Trees: Basic Algorithm
Goal: build a small tree such that all samples at the leaves have the same class.
Greedy solution: at each node, pick a test using the 'best' feature; split into subsets based on the outcomes of the feature test; repeat until the stopping criterion is met, i.e., until each leaf has a single class.
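
A compact, illustrative Python sketch of this greedy loop, assuming categorical features and instances given as (feature-dict, label) pairs; it produces the same nested-dict tree form used in the prediction sketch above. The helper names are ours, not from the slides:

```python
import math
from collections import Counter

def entropy_of(counts):
    """Entropy of a class distribution given as raw counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def avg_entropy(data, feature):
    """Weighted average entropy of the subsets produced by splitting on `feature`."""
    n = len(data)
    return sum(cnt / n * entropy_of(Counter(y for x, y in data if x[feature] == v).values())
               for v, cnt in Counter(x[feature] for x, _ in data).items())

def build_tree(data, features):
    """Greedy top-down induction: split on the feature with lowest average entropy (max gain)."""
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not features:            # stopping criterion
        return Counter(labels).most_common(1)[0][0]      # leaf: (majority) class label
    best = min(features, key=lambda f: avg_entropy(data, f))
    children = {v: build_tree([(x, y) for x, y in data if x[best] == v],
                              [f for f in features if f != best])
                for v in set(x[best] for x, _ in data)}
    return {best: children}
```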

Key Questions
Splitting: how do we select the 'best' feature?
Stopping: when do we stop splitting, to avoid overfitting?
Features: how do we split on different types of features: binary? discrete? continuous?

Building Decision Trees: I
Goal: build a small tree such that all samples at the leaves have the same class.
Greedy solution: at each node, pick the test whose branches come closest to having a single class; split into subsets in which most instances share a uniform class.

Picking a Test
Hair Color: Blonde -> Sarah:B, Dana:N, Annie:B, Katie:N; Red -> Emily:B; Brown -> Alex:N, Pete:N, John:N
Height: Short -> Alex:N, Annie:B, Katie:N; Average -> Sarah:B, Emily:B, John:N; Tall -> Dana:N, Pete:N
Weight: Light -> Sarah:B, Katie:N; Average -> Dana:N, Alex:N, Annie:B; Heavy -> Emily:B, Pete:N, John:N
Lotion: No -> Sarah:B, Annie:B, Emily:B, Pete:N, John:N; Yes -> Dana:N, Alex:N, Katie:N

Picking a Test (blonde branch)
Height: Short -> Annie:B, Katie:N; Average -> Sarah:B; Tall -> Dana:N
Weight: Light -> Sarah:B, Katie:N; Average -> Dana:N, Annie:B; Heavy -> (none)
Lotion: No -> Sarah:B, Annie:B; Yes -> Dana:N, Katie:N

Measuring Disorder
Problem: in general, tests on large databases don't yield homogeneous subsets.
Solution: a general information-theoretic measure of disorder.
Desired properties: a homogeneous set has the least disorder (= 0); an even split has the most disorder (= 1).

Measuring Entropy
If we split m objects into 2 bins of sizes m1 and m2, what is the entropy?
H = -(m1/m) log2 (m1/m) - (m2/m) log2 (m2/m)

Measuring Disorder: Entropy
Let p_i be the probability of being in bin i; the entropy (disorder) of a split is H = -sum_i p_i log2 p_i (assume 0 log2 0 = 0).
p1 = 1/2, p2 = 1/2:  -1/2 log2 1/2 - 1/2 log2 1/2 = 1/2 + 1/2 = 1
p1 = 3/4, p2 = 1/4:  -1/4 log2 1/4 - 3/4 log2 3/4 = 0.5 + 0.311 = 0.811
p1 = 1,   p2 = 0:    -1 log2 1 - 0 log2 0 = 0 + 0 = 0

Information Gain
InfoGain(Y|X): how many bits can we save if we know X?
InfoGain(Y|X) = H(Y) - H(Y|X)  (equivalently written InfoGain(Y,X))

Information Gain
InfoGain(S,A): the expected reduction in the entropy of S due to splitting on attribute A:
InfoGain(S,A) = H(S) - sum_a (|S_a| / |S|) H(S_a)
Select the A with maximum InfoGain, i.e., the one resulting in the lowest average entropy.

Computing Average Entropy
Split the |S| instances by an attribute into Branch 1 and Branch 2, with class counts S_a1a, S_a1b on branch 1 and S_a2a, S_a2b on branch 2.
The average entropy weights the disorder of the class distribution on branch i by the fraction of samples going down branch i:
AvgEntropy(S,A) = sum_i (|S_i| / |S|) H(S_i)

Entropy in Sunburn Example
S = [3B, 5N]
Hair color = 4/8(-2/4 log 2/4 - 2/4 log 2/4) + 1/8*0 + 3/8*0 = 0.5
Height = 3/8(-1/3 log 1/3 - 2/3 log 2/3) + 3/8(-2/3 log 2/3 - 1/3 log 1/3) + 2/8*0 = 0.69
Weight = 2/8*1 + 3/8(-1/3 log 1/3 - 2/3 log 2/3) + 3/8(-2/3 log 2/3 - 1/3 log 1/3) = 0.94
Lotion = 5/8(-3/5 log 3/5 - 2/5 log 2/5) + 3/8*0 = 0.61
Hair color yields the lowest average entropy (highest information gain), so it is tested first.

Entropy in Sunburn Example (blonde branch)
S = [2B, 2N], so H(S) = 1.
Height: gain = 1 - [2/4(-1/2 log 1/2 - 1/2 log 1/2) + 1/4*0 + 1/4*0] = 1 - 0.5 = 0.5
Weight: gain = 1 - [2/4(-1/2 log 1/2 - 1/2 log 1/2) + 2/4(-1/2 log 1/2 - 1/2 log 1/2)] = 1 - 1 = 0
Lotion: gain = 1 - 0 = 1
Lotion yields the highest gain, so the blonde branch splits on Lotion.
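
The first-level numbers above can be reproduced with a short script; the dataset below is reconstructed from the split tables in the earlier slides, and the helper names are ours:

```python
import math
from collections import Counter

def entropy_of(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def avg_entropy(data, feature):
    """Weighted average entropy of the subsets produced by splitting on `feature`."""
    n = len(data)
    return sum(cnt / n * entropy_of(Counter(y for x, y in data if x[feature] == v).values())
               for v, cnt in Counter(x[feature] for x, _ in data).items())

# The eight sunburn instances, as (features, label) pairs:
data = [
    ({'hair': 'blonde', 'height': 'average', 'weight': 'light',   'lotion': 'no'},  'Burn'),   # Sarah
    ({'hair': 'blonde', 'height': 'tall',    'weight': 'average', 'lotion': 'yes'}, 'None'),   # Dana
    ({'hair': 'brown',  'height': 'short',   'weight': 'average', 'lotion': 'yes'}, 'None'),   # Alex
    ({'hair': 'blonde', 'height': 'short',   'weight': 'average', 'lotion': 'no'},  'Burn'),   # Annie
    ({'hair': 'red',    'height': 'average', 'weight': 'heavy',   'lotion': 'no'},  'Burn'),   # Emily
    ({'hair': 'brown',  'height': 'tall',    'weight': 'heavy',   'lotion': 'no'},  'None'),   # Pete
    ({'hair': 'brown',  'height': 'average', 'weight': 'heavy',   'lotion': 'no'},  'None'),   # John
    ({'hair': 'blonde', 'height': 'short',   'weight': 'light',   'lotion': 'yes'}, 'None'),   # Katie
]

for f in ['hair', 'height', 'weight', 'lotion']:
    print(f, round(avg_entropy(data, f), 2))   # hair 0.5, height 0.69, weight 0.94, lotion 0.61
```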

Building Decision Trees with Information Gain
Until there are no inhomogeneous leaves: select an inhomogeneous leaf node and replace it with a test node, creating the subsets that yield the highest information gain.
Effectively, this creates a set of rectangular regions in feature space, repeatedly drawing lines along different axes.

Alternate Measures
Issue with information gain: it favors features with more values.
Option: Gain Ratio, which normalizes the gain by the entropy of the split itself:
GainRatio(S,A) = InfoGain(S,A) / SplitInfo(S,A), with SplitInfo(S,A) = -sum_a (|S_a|/|S|) log2 (|S_a|/|S|), where S_a is the set of elements of S with value A = a.
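
A sketch of a C4.5-style gain-ratio computation (the standard normalization by split entropy; the helper names and data layout are assumptions, matching the earlier sketches):

```python
import math
from collections import Counter

def entropy_of(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(data, feature):
    """GainRatio(S,A) = InfoGain(S,A) / SplitInfo(S,A)."""
    n = len(data)
    base = entropy_of(Counter(y for _, y in data).values())          # H(S)
    by_value = Counter(x[feature] for x, _ in data)
    avg = sum(cnt / n * entropy_of(Counter(y for x, y in data if x[feature] == v).values())
              for v, cnt in by_value.items())
    split_info = entropy_of(by_value.values())   # entropy of the split itself
    info_gain = base - avg
    return info_gain / split_info if split_info > 0 else 0.0
```

A feature that fragments the data into many small subsets gets a large SplitInfo, so its gain is discounted.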

Overfitting
Overfitting: the model fits the training data TOO well, fitting noise and irrelevant details.
Why is this bad? It harms generalization: the model fits the training data too well and fits new data badly.
For a model m, consider training_error(m) and D_error(m), where D is all the data. If m overfits, then for some other model m', training_error(m) < training_error(m') but D_error(m) > D_error(m').

Avoiding Overfitting
Strategies to avoid overfitting:
Early stopping: stop when InfoGain < threshold, when the number of instances < threshold, or when tree depth > threshold.
Post-pruning: grow the full tree and then remove branches.
Which is better? Unclear; both are used. For some applications post-pruning works better.

Post-Pruning
Divide the data into a training set (used to build the original tree) and a validation set (used to perform pruning).
Build the decision tree on the training data. Then, until pruning no longer helps: compute validation-set performance for pruning each node (and its children), and greedily remove the nodes whose removal does not reduce validation-set performance.
This yields a smaller tree with the best validation performance.

Performance Measures
Compute accuracy on the validation set, or via k-fold cross-validation.
Weighted classification error cost: weight some types of errors more heavily.
Minimum description length: favor good accuracy on compact models; MDL = error(tree) + model_size(tree).

Rule Post-Pruning
Convert the tree to rules, prune the rules independently, then sort the final rule set.
Probably the most widely used method (in toolkits).

Modeling Features
Different types of features need different tests:
Binary: branch on true/false.
Discrete: one branch for each discrete value.
Continuous? We need to discretize: enumerating all values is not possible or desirable, so pick a value x and branch on whether the feature value is below or at least x. How can we pick split points?

Picking Splits
We need a useful, sufficient set of split points. What's a good strategy?
Approach: sort all values of the feature in the training data; identify adjacent instances with different classes; candidate split points lie between those instances; select the candidate with the highest information gain.
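
An illustrative sketch of generating candidate split points for a continuous feature, assuming (value, label) pairs from the training data (each candidate would then be scored by information gain):

```python
def candidate_splits(values_and_labels):
    """Midpoints between adjacent (sorted) values whose class labels differ."""
    pairs = sorted(values_and_labels)            # sort by feature value
    splits = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            splits.append((v1 + v2) / 2)         # candidate threshold between the two
    return splits

print(candidate_splits([(1.0, 'N'), (2.0, 'N'), (3.0, 'B'), (5.0, 'B'), (6.0, 'N')]))
# [2.5, 5.5]
```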

Features in Decision Trees: Pros
Feature selection: the tree tests features that yield low disorder, i.e., it selects the features that are important and ignores irrelevant ones.
Feature type handling: discrete types get one branch per value; continuous types branch on >= some value; absent features can be distributed uniformly.

Features in Decision Trees: Cons
Features are assumed independent; to capture a group effect, it must be modeled explicitly (e.g., by making a new feature AorB).
Feature tests are conjunctive.

Decision Trees
Train: build the tree by forming subsets of least disorder.
Predict: traverse the tree based on feature tests and assign the label of the samples at the leaf node.
Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rules can be read off the tree.
Cons: poor at modeling feature combinations and dependencies; building the optimal tree is intractable.