The joy of Entropy

Administrivia Reminder: HW 1 due next week No other news. No noose is good noose...

Time wings on... Last time: hypothesis spaces; intro to decision trees. This time: learning bias; the getBestSplitFeature function; entropy.

Back to decision trees... Reminders: Hypothesis space for DTs: Data-structure view: all trees with a single test per internal node and a constant value at each leaf. Geometric view: sets of axis-orthogonal hyper-rectangles, giving a piecewise-constant approximation. Open question: the getBestSplitFeature function

Splitting criteria What properties do we want our getBestSplitFeature() function to have? Increase the purity of the data: after the split, each new subset should be closer to uniformly labeled than the data was before the split. Want the subsets to have roughly the same purity. Want the subsets to be as balanced in size as possible.

Bias These choices are designed to produce small trees. They may miss some other, better trees that are larger or that require a non-greedy split at the root. Definition: learning bias == the tendency of an algorithm to find one class of solution out of H in preference to another

Bias: the pretty picture [figure; surviving caption fragment: "Space of all functions on ..."]

Bias: the algebra Bias also seen as expected difference between true concept and induced concept: Note: expectation taken over all possible data sets Don’t actually know that distribution either :-P Can (sometimes) make a prior assumption
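The formula on this slide was an image and did not survive transcription. A plausible reconstruction (an assumption about notation, not the original slide), writing f for the true concept and \hat{f}_D for the concept induced from a particular data set D:

\mathrm{Bias}(x) = \mathbb{E}_D\big[\, f(x) - \hat{f}_D(x) \,\big]

with the expectation taken over all possible data sets, as the slide notes.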

More on Bias Bias can be a property of: the risk/loss function (how you measure "distance" to the best solution), or the search strategy (how you move through H to find a hypothesis).

Back to splitting... Consider a set of true/false labels. We want our measure to be small when the set is pure (all true or all false), and large when the set is almost evenly divided between the classes. In general, we call such a function an impurity measure, i(y). We'll use entropy, which expresses the amount of information in the set. (Later we'll subtract this quantity in the gain, so a nearly pure set will score well.)

Entropy, cont’d Define: class fractions (a.k.a., class prior probabilities)

Entropy, cont’d Define: class fractions (a.k.a., class prior probabilities) Define: entropy of a set

Entropy, cont’d Define: class fractions (a.k.a., class prior probabilities) Define: entropy of a set In general, for k classes:
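The definitions on these slides were images; standard reconstructions, assuming the usual notation (Y is the set of labels, k the number of classes):

p_i = \frac{|\{\, y \in Y : y = i \,\}|}{|Y|} \qquad \text{(class fraction for class } i\text{)}

H(Y) = -\sum_{i=1}^{k} p_i \log_2 p_i \qquad \text{(entropy of the set, in bits)}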

The entropy curve
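For two classes this is the binary entropy curve; writing p for the fraction of positive labels:

H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

so H(0.5) = 1 bit (maximally mixed), H(0.9) \approx 0.47 bits, and H(0) = H(1) = 0 bits (pure).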

Properties of entropy Maximum when the class fractions are equal; minimum (zero) when the data is pure. Smooth: differentiable and continuous. Concave. Intuitively: the entropy of a distribution tells you how "predictable" that distribution is.

Entropy in a nutshell From: Andrew Moore’s tutorial on information gain:

Entropy in a nutshell data values (location of soup) sampled from tight distribution (bowl) -- highly predictable Low entropy distribution

Entropy in a nutshell data values (location of soup) sampled from loose distribution (uniformly around dining room) -- highly unpredictable High entropy distribution

Entropy of a split A split produces a number of sets (one for each branch). We need a corresponding entropy of a split (i.e., the entropy of a collection of sets). Definition: the entropy of a B-way split (see below).
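The definition itself was an image; the standard reconstruction, writing Y_b for the labels sent down branch b and |Y| for the total number of examples:

H_{\mathrm{split}}(Y_1, \ldots, Y_B) = \sum_{b=1}^{B} \frac{|Y_b|}{|Y|} \, H(Y_b)

i.e., the weighted average of the branch entropies.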

Information gain The last, easy step: we want to pick the attribute that decreases the information content of the data as much as possible. Q: Why decrease? Define: the gain of splitting data set [X,y] on attribute a (see below).
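The gain formula was also an image; the usual definition, consistent with the splitEntropy() call in the code on the next slide:

\mathrm{Gain}(X, y, a) = H(y) - H_{\mathrm{split}}(Y_1, \ldots, Y_B)

where Y_1, ..., Y_B are the label subsets produced by splitting on a. And the answer to "why decrease?": the gain is the drop in entropy, so decreasing the remaining information content as much as possible means choosing \arg\max_a \mathrm{Gain}(X, y, a).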

The splitting method Feature getBestSplitFeature(X, Y) {
  // Input: instance set X, label set Y
  double baseInfo = entropy(Y);                 // entropy of the unsplit label set
  double[] gain = new double[X.getFeatureSet().size()];
  for (a : X.getFeatureSet()) {
    // Partition the data on attribute a: one (Xi, Yi) pair per value of a
    [X0,...,Xk, Y0,...,Yk] = a.splitData(X, Y);
    // Gain = entropy before the split minus weighted entropy after it
    gain[a] = baseInfo - splitEntropy(Y0,...,Yk);
  }
  return argmax(gain);                          // feature with the largest gain
}
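The entropy() and splitEntropy() helpers are not shown on the slide. A minimal runnable sketch in Java (the list-of-class-indices representation and the class name EntropyHelpers are illustrative assumptions, not from the original):

import java.util.*;

class EntropyHelpers {
    // H(Y) = -sum_i p_i log2 p_i, where p_i is the fraction of labels in class i
    static double entropy(List<Integer> Y) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int label : Y) counts.merge(label, 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / Y.size();
            h -= p * Math.log(p) / Math.log(2);  // convert natural log to base 2
        }
        return h;
    }

    // Entropy of a B-way split: weighted average of the branch entropies
    static double splitEntropy(List<List<Integer>> branches) {
        int total = 0;
        for (List<Integer> Yb : branches) total += Yb.size();
        double h = 0.0;
        for (List<Integer> Yb : branches) {
            h += ((double) Yb.size() / total) * entropy(Yb);
        }
        return h;
    }
}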

DTs in practice... Growing to purity is bad (overfitting)

DTs in practice... Growing to purity is bad (overfitting) [figure; axes x1: petal length, x2: sepal width]

DTs in practice... Growing to purity is bad (overfitting) Terminate growth early Grow to purity, then prune back

DTs in practice... Growing to purity is bad (overfitting) [figure; axes x1: petal length, x2: sepal width] A leaf that is not statistically supportable: remove the split and merge the leaves.

DTs in practice... Multiway splits are a pain Entropy is biased in favor of more splits Correct w/ gain ratio (DH&S Ch , Eqn 7)
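For reference, the gain-ratio correction (as used in C4.5) divides the gain by the entropy of the split proportions themselves:

\mathrm{GainRatio}(X, y, a) = \frac{\mathrm{Gain}(X, y, a)}{\mathrm{SplitInfo}(a)}, \qquad \mathrm{SplitInfo}(a) = -\sum_{b=1}^{B} \frac{|Y_b|}{|Y|} \log_2 \frac{|Y_b|}{|Y|}

SplitInfo grows with the number of branches, so attributes that shatter the data into many tiny subsets are penalized.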

DTs in practice... Real-valued attributes give rules of the form if (x1 < 3.4) { ... }. How do we pick the "3.4"?
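The usual answer, not spelled out on this slide: sort the examples by the attribute's value, consider candidate thresholds at midpoints between consecutive examples with different labels, and keep the one with the best information gain. A sketch in the same Java style (bestThreshold and gainOfBinarySplit are illustrative names, reusing the entropy helpers sketched earlier; none of this is from the original code):

import java.util.*;

class ThresholdPicker {
    // Returns the cut point t for a rule of the form "if (x < t) ..."
    static double bestThreshold(double[] x, int[] y) {
        Integer[] idx = new Integer[x.length];
        for (int i = 0; i < x.length; i++) idx[i] = i;
        // Sort example indices by attribute value, keeping labels aligned
        Arrays.sort(idx, (i, j) -> Double.compare(x[i], x[j]));

        double bestGain = Double.NEGATIVE_INFINITY, bestThresh = Double.NaN;
        for (int k = 1; k < idx.length; k++) {
            int i = idx[k - 1], j = idx[k];
            // Candidate cuts: midpoints between neighbors with different labels
            if (y[i] != y[j] && x[i] != x[j]) {
                double t = (x[i] + x[j]) / 2.0;     // e.g., the "3.4" above
                double g = gainOfBinarySplit(x, y, t);
                if (g > bestGain) { bestGain = g; bestThresh = t; }
            }
        }
        return bestThresh;
    }

    // Information gain of the binary split (x < t) vs. (x >= t)
    static double gainOfBinarySplit(double[] x, int[] y, double t) {
        List<Integer> left = new ArrayList<>(), right = new ArrayList<>(), all = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            all.add(y[i]);
            (x[i] < t ? left : right).add(y[i]);
        }
        return EntropyHelpers.entropy(all)
             - EntropyHelpers.splitEntropy(Arrays.asList(left, right));
    }
}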