The joy of Entropy

Administrivia. Reminder: HW 1 due next week. No other news. No noose is good noose...

Time wings on... Last time: hypothesis spaces; intro to decision trees. This time: loss matrices; learning bias; the getBestSplitFeature function; entropy.

Loss. For problem 8.11, you need cost values, a.k.a. loss values (introduced in DH&S Ch. 2.2). Basic idea: some mistakes are more expensive than others.

Loss. Example: classifying computer network traffic. Traffic is either normal or intrusive, and there's way more normal traffic than intrusive. What if the data is normal, but the classifier says "intrusive"? What if the data is intrusive, but the classifier says "normal"?

Cost of mistakes:

                           True: Normal    True: Intrusion
    Predicted: Normal           $0             $5,000
    Predicted: Intrusion        $5                 $0
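
To see why the asymmetry matters, here is a quick worked example with made-up rates (the traffic mix and error rates below are illustrative assumptions, not numbers from the lecture). Suppose 99% of connections are normal and 1% are intrusions. A classifier that just says "normal" every time has expected cost 0.01 x $5,000 = $50 per connection. A classifier that falsely flags 2% of normal traffic but misses only 10% of intrusions costs 0.99 x 0.02 x $5 + 0.01 x 0.10 x $5,000 ≈ $0.10 + $5.00 ≈ $5.10 per connection -- far cheaper, even though it makes more mistakes in raw counts.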

Cost of mistakes, in symbols (λij = cost of predicting class ωi when the true class is ωj):

                      True: ω1    True: ω2
    Predicted: ω1       λ11         λ12
    Predicted: ω2       λ21         λ22

In general, for k classes:

                      True: ω1    True: ω2    ...    True: ωk
    Predicted: ω1       λ11         λ12       ...      λ1k
    Predicted: ω2       λ21         λ22       ...      λ2k
    ...
    Predicted: ωk       λk1         λk2       ...      λkk

Cost-based criterion. For the misclassification-error case, we wrote the risk of a classifier f as the expected 0/1 loss, R(f) = E[ 1(f(x) ≠ y) ]. For the cost-based case, this becomes the expected cost under the loss matrix, R(f) = E[ λ_{f(x), y} ].
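
As an illustrative sketch (not code from the lecture; the class name and array layout are assumptions), the empirical version of the cost-based risk just averages the loss-matrix entries picked out by each (predicted, true) pair:

    // Empirical cost-based risk of a classifier on a labeled sample.
    // lambda[i][j] = cost of predicting class i when the true class is j.
    public class CostRisk {
        public static double empiricalRisk(int[] predicted, int[] trueLabels, double[][] lambda) {
            double total = 0.0;
            for (int i = 0; i < trueLabels.length; ++i) {
                total += lambda[predicted[i]][trueLabels[i]];   // cost of this prediction
            }
            return total / trueLabels.length;                   // average cost per example
        }

        public static void main(String[] args) {
            // Two-class intrusion example: class 0 = normal, class 1 = intrusion.
            double[][] lambda = { { 0.0, 5000.0 },   // predicted normal
                                  { 5.0,    0.0 } }; // predicted intrusion
            int[] yTrue = { 0, 0, 0, 1 };
            int[] yHat  = { 0, 1, 0, 0 };            // one false alarm, one missed intrusion
            System.out.println(empiricalRisk(yHat, yTrue, lambda));  // (5 + 5000) / 4 = 1251.25
        }
    }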

Back to decision trees... Reminders: the hypothesis space for DTs. Data structure view: all trees with a single test per internal node and a constant leaf value. Geometric view: sets of axis-orthogonal hyper-rectangles; piecewise-constant approximation. Open question: the getBestSplitFeature function.

Splitting criteria. What properties do we want our getBestSplitFeature() function to have? It should increase the purity of the data: after the split, the new sets should be closer to uniform labeling (all one class) than before the split. We also want the subsets to have roughly the same purity, and to be as balanced as possible.

Bias. These choices are designed to produce small trees, but may miss some other, better trees that are larger or require a non-greedy split at the root. Definition: learning bias == the tendency of an algorithm to find one class of solution out of H in preference to another.

Bias: the pretty picture. (Figure: the space of all functions on the input space, with the hypothesis space H drawn as a subset of it.)

Bias: the algebra. Bias can also be seen as the expected difference between the true concept and the induced concept -- schematically, E_D[ d(c, f_D) ], where c is the true concept, f_D is the concept induced from data set D, and d is some measure of distance between them. Note: the expectation is taken over all possible data sets. We don't actually know that distribution either :-P, but we can (sometimes) make a prior assumption.

More on bias. Bias can be a property of the risk/loss function (how you measure "distance" to the best solution) or of the search strategy (how you move through H to find it).

Back to splitting... Consider a set of true/false labels. We want our measure to be small when the set is pure (all true or all false), and large when the set is almost evenly divided between the classes. In general, we call such a function an impurity measure, i(y). We'll use entropy, which expresses the amount of information in the set. (Later we'll use the negative of this function, so it'll be better if the set is almost pure.)

Entropy, cont'd. Define the class fractions (a.k.a. the class prior probabilities): for a label set Y, p_i = (number of examples in Y with class i) / |Y|. Define the entropy of a true/false label set: H(Y) = -p_true log2(p_true) - p_false log2(p_false). In general, for k classes: H(Y) = -Σ_{i=1..k} p_i log2(p_i).
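
A minimal sketch of that computation (not code from the lecture; it assumes labels arrive as an array of class indices 0..k-1):

    import java.util.HashMap;
    import java.util.Map;

    public class Entropy {
        public static double entropy(int[] labels) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (int y : labels) {
                counts.merge(y, 1, Integer::sum);          // count examples per class
            }
            double h = 0.0;
            for (int c : counts.values()) {
                double p = (double) c / labels.length;     // class fraction p_i
                h -= p * (Math.log(p) / Math.log(2.0));    // -p_i * log2(p_i)
            }
            return h;
        }

        public static void main(String[] args) {
            System.out.println(entropy(new int[] { 1, 1, 0, 0 }));  // evenly mixed: 1.0 bit
            System.out.println(entropy(new int[] { 1, 1, 1, 1 }));  // pure: 0.0 bits
        }
    }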

The entropy curve. (Figure: a plot of H as a function of the fraction of positive examples in a two-class set; it is 0 at the pure ends and peaks at an even 50/50 split.)

Properties of entropy: maximum when the class fractions are equal; minimum (zero) when the data is pure; smooth (differentiable and continuous); concave. Intuitively, the entropy of a distribution tells you how "predictable" that distribution is.
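
A quick worked check of the first two properties, using the two-class formula above (the numbers are illustrative, not from the slides): H(0.5, 0.5) = -(0.5)log2(0.5) - (0.5)log2(0.5) = 1 bit, the maximum; H(1, 0) = -(1)log2(1) - 0 = 0 bits for a pure set (taking 0*log2(0) = 0); and a lightly mixed set gives H(0.9, 0.1) = -(0.9)log2(0.9) - (0.1)log2(0.1) ≈ 0.47 bits.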

Entropy in a nutshell (from Andrew Moore's tutorial on information gain: http://www.cs.cmu.edu/~awm/tutorials)

Entropy in a nutshell: a low-entropy distribution. The data values (location of soup) are sampled from a tight distribution (the bowl) -- highly predictable.

Entropy in a nutshell: a high-entropy distribution. The data values (location of soup) are sampled from a loose distribution (spread uniformly around the dining room) -- highly unpredictable.

Entropy of a split. A split produces a number of sets (one for each branch), so we need a corresponding entropy of a split (i.e., an entropy for a collection of sets). Definition: if splitting Y on some attribute produces the subsets Y_1, ..., Y_k, the entropy of the split is H_split(Y_1, ..., Y_k) = Σ_j (|Y_j| / |Y|) H(Y_j), the size-weighted average of the branch entropies.
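
A minimal sketch of that weighted average (not code from the lecture; it reuses the Entropy class from the earlier sketch and takes each branch's labels as one int[]):

    public class SplitEntropy {
        public static double splitEntropy(int[][] branches) {
            int total = 0;
            for (int[] branch : branches) {
                total += branch.length;                         // |Y| = sum of branch sizes
            }
            double h = 0.0;
            for (int[] branch : branches) {
                double weight = (double) branch.length / total; // |Y_j| / |Y|
                h += weight * Entropy.entropy(branch);          // weighted branch entropy
            }
            return h;
        }
    }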

Information gain. The last, easy step: we want to pick the attribute that decreases the information content of the data as much as possible. (Q: why decrease?) Define the gain of splitting data set [X, y] on attribute a: Gain([X, y], a) = H(y) - H_split(y_1, ..., y_k), where y_1, ..., y_k are the label subsets produced by splitting on a.
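
A small illustrative calculation (made-up labels, not from the slides): start with 10 labels, 5 true and 5 false, so H(y) = 1 bit. A binary attribute that splits them into {4 true, 1 false} and {1 true, 4 false} gives each branch H(0.8, 0.2) ≈ 0.72 bits, so H_split ≈ (5/10)(0.72) + (5/10)(0.72) ≈ 0.72 and the gain is about 0.28 bits. An attribute that splits them into {3 true, 3 false} and {2 true, 2 false} leaves both branches at 1 bit, so its gain is 0 -- it tells us nothing about the labels.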

The splitting method (pseudocode):

    Feature getBestSplitFeature(X, Y) {
        // Input: instance set X, label set Y
        double baseInfo = entropy(Y);
        Map<Feature, Double> gain = new HashMap<>();
        for (Feature a : X.getFeatureSet()) {
            // split the data on attribute a, one subset per value of a
            [X0, ..., Xk, Y0, ..., Yk] = a.splitData(X, Y);
            gain.put(a, baseInfo - splitEntropy(Y0, ..., Yk));
        }
        return argmax(gain);   // the feature with the largest information gain
    }

DTs in practice... Growing to purity is bad (overfitting). (Figures: the data plotted with axes x1: petal length and x2: sepal width.)

DTs in practice... Growing to purity is bad (overfitting). Two fixes: terminate growth early, or grow to purity and then prune back.
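
One common way to realize the first fix, sketched here as an assumption rather than as the lecture's prescription (the constants and names are hypothetical): stop splitting a node when it is already pure, has too few examples, or when the best available gain is negligible.

    public class StoppingRule {
        static final int MIN_EXAMPLES = 5;       // hypothetical tuning constants,
        static final double MIN_GAIN = 0.01;     // not values from the lecture

        public static boolean shouldStop(int numExamples, double bestGain, double nodeEntropy) {
            if (nodeEntropy == 0.0) return true;           // node is already pure
            if (numExamples < MIN_EXAMPLES) return true;   // too little data to trust any split
            return bestGain < MIN_GAIN;                    // best split barely helps
        }
    }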

DTs in practice... Growing to purity is bad (overfitting). If a leaf is not statistically supportable, remove the split and merge the leaves. (Figure: the x1: petal length vs. x2: sepal width plot again, highlighting such a leaf.)

DTs in practice... Multiway splits are a pain: entropy is biased in favor of more splits. Correct for this with the gain ratio (DH&S Ch. 8.3.2, Eqn. 7).
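
As a hedged sketch of the standard C4.5-style correction (check it against DH&S Eqn. 7), the gain is divided by the "split information", i.e., the entropy of the branch sizes themselves:

    gain ratio(a) = Gain([X, y], a) / ( -Σ_j (|Y_j| / |Y|) log2(|Y_j| / |Y|) )

An attribute that shatters the data into many tiny branches gets a large denominator, so it is penalized relative to an attribute that achieves the same gain with fewer, larger branches.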

DTs in practice... Real-valued attributes: rules of the form if (x1 < 3.4) { ... }. How do we pick the "3.4"?
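
A common answer, sketched here as an assumption rather than as the lecture's prescription: sort the training values of the attribute, consider candidate thresholds midway between consecutive distinct values, and keep the one with the highest information gain (this sketch reuses the Entropy and SplitEntropy classes from above):

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class ThresholdSearch {
        // values[i] is the attribute value of example i; labels[i] is its class index.
        public static double bestThreshold(double[] values, int[] labels) {
            Integer[] order = IntStream.range(0, values.length).boxed().toArray(Integer[]::new);
            Arrays.sort(order, (i, j) -> Double.compare(values[i], values[j]));

            double baseInfo = Entropy.entropy(labels);
            double bestGain = -1.0;
            double best = Double.NaN;

            for (int k = 0; k + 1 < order.length; ++k) {
                double lo = values[order[k]];
                double hi = values[order[k + 1]];
                if (lo == hi) continue;                  // no boundary between equal values
                double threshold = (lo + hi) / 2.0;      // candidate: midpoint of adjacent values
                int[] left = IntStream.range(0, values.length)
                                      .filter(i -> values[i] < threshold)
                                      .map(i -> labels[i]).toArray();
                int[] right = IntStream.range(0, values.length)
                                       .filter(i -> values[i] >= threshold)
                                       .map(i -> labels[i]).toArray();
                double gain = baseInfo - SplitEntropy.splitEntropy(new int[][] { left, right });
                if (gain > bestGain) {
                    bestGain = gain;
                    best = threshold;
                }
            }
            return best;
        }
    }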

Measuring accuracy. So now you have a DT -- what now? Usually, you want to use it to classify new (previously unseen) data, and you want to know how well you should expect it to perform. How do you estimate such a thing? Theoretically, by proving that you have the "right" tree? Very, very hard in practice. Measure it? Trickier than it sounds!....

Testing with training data. So you have a data set X = [x1, ..., xN] and corresponding labels y = [y1, ..., yN]. You build your decision tree: tree = buildDecisionTree(X, y). What happens if you just do this:

    double acc = 0.0;
    for (int i = 0; i < N; ++i) {
        if (tree.classify(X[i]) == y[i]) { acc += 1.0; }  // count correct predictions
    }
    acc /= N;
    return acc;

Testing with training data. Answer: you tend to overestimate the real accuracy (possibly drastically). (Figure: the earlier scatter plot (x2: sepal width), with several new, unseen points marked "?".)

Separation of train & test Fundamental principle (1st amendment of ML): Don’t evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!
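
A minimal holdout sketch consistent with that principle (the split fraction, seed handling, and class name are assumptions, not from the lecture): shuffle the example indices, build the tree from one portion, and run the accuracy loop from the previous slide only over the portion the tree never saw.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class Holdout {
        // Returns { trainIndices, testIndices }; trainFraction is e.g. 0.7.
        public static int[][] split(int n, double trainFraction, long seed) {
            List<Integer> idx = new ArrayList<>();
            for (int i = 0; i < n; ++i) {
                idx.add(i);
            }
            Collections.shuffle(idx, new Random(seed));        // random but reproducible

            int nTrain = (int) Math.round(n * trainFraction);
            int[] train = new int[nTrain];
            int[] test = new int[n - nTrain];
            for (int i = 0; i < n; ++i) {
                if (i < nTrain) { train[i] = idx.get(i); }
                else            { test[i - nTrain] = idx.get(i); }
            }
            return new int[][] { train, test };
        }
    }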