Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan.

Outline Tree representation Brief information theory Learning decision trees Bagging Random forests

Decision trees Non-linear classifier Easy to use Easy to interpret Susceptible to overfitting, but this can be avoided.

Anatomy of a decision tree [figure: a tree with Outlook at the root branching on sunny / overcast / rain, a Humidity node (high / normal) under sunny, a Windy node (false / true) under rain, and Yes / No leaves] Each internal node is a test on one attribute; the branches out of a node are the possible values of that attribute; the leaves are the decisions. As you go down the tree, the sample reaching each node gets smaller.

To ‘play tennis’ or not [the same tree as above] A new test example: (Outlook == rain) and (Windy == false). Passing it down the tree, the decision is yes.

To ‘play tennis’ or not [the same tree] The paths that lead to yes: (Outlook == overcast) -> yes; (Outlook == rain) and (Windy == false) -> yes; (Outlook == sunny) and (Humidity == normal) -> yes

Decision trees Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances: (Outlook == overcast) OR ((Outlook == rain) AND (Windy == false)) OR ((Outlook == sunny) AND (Humidity == normal)) => yes, play tennis
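To make the rule view concrete, here is a small Python sketch (my own illustration, not part of the original slides) that encodes the tree above as nested conditionals; the function name classify and the boolean windy argument are made-up choices.

```python
def classify(outlook, humidity, windy):
    """Hypothetical encoding of the play-tennis tree as nested attribute tests."""
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":
        return "yes" if not windy else "no"
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    raise ValueError("unknown outlook value: %r" % outlook)

# The new test example from the slide: Outlook == rain and Windy == false.
print(classify(outlook="rain", humidity="high", windy=False))  # -> yes
```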

Representation [figure: a tree with A at the root; the A = 1 branch tests B, the A = 0 branch tests C, and the leaves are true / false] Y = ((A and B) or ((not A) and C))

Same concept, different representation [figure: two different trees, one rooted at A and one rooted at C, both computing Y = ((A and B) or ((not A) and C))] The same concept can be represented by more than one tree.

Which attribute to select for splitting? [figure: a candidate split whose children keep roughly the same class distribution as the parent; the bars show the distribution of each class (not of the attribute)] This is bad splitting: it barely separates the classes.

How do we choose the test? Which attribute should be used as the test? Intuitively, you would prefer the one that separates the training examples as much as possible.

Information Gain Information gain is one criterion for deciding which attribute to split on.

Information Imagine: 1. Someone is about to tell you your own name 2. You are about to observe the outcome of a dice roll 3. You are about to observe the outcome of a coin flip 4. You are about to observe the outcome of a biased coin flip Each situation has a different amount of uncertainty as to what outcome you will observe.

Information Information: reduction in uncertainty (the amount of surprise in the outcome). If the probability of an event is small and it happens, the information is large. Examples: 1. Observing that the outcome of a coin flip is heads: information = -log2(1/2) = 1 bit. 2. Observing that the outcome of a dice roll is 6: information = -log2(1/6) ≈ 2.58 bits.

Entropy The expected amount of information when observing the output of a random variable X: H(X) = - Σ_i P(X = x_i) log2 P(X = x_i). If X can have 8 outcomes and all are equally likely, H(X) = - Σ (1/8) log2(1/8) = 3 bits.

Entropy If there are k possible outcomes, H(X) ≤ log2 k. Equality holds when all outcomes are equally likely. The more the probability distribution deviates from uniformity, the lower the entropy.

Entropy, purity Entropy measures the purity of a node: the less uniform the class distribution, the lower the entropy and the purer the node.

Conditional entropy H(X|Y) = Σ_y P(Y = y) H(X|Y = y), where H(X|Y = y) = - Σ_x P(X = x|Y = y) log2 P(X = x|Y = y): the expected entropy of X after observing Y.

Information gain IG(X,Y)=H(X)-H(X|Y) Reduction in uncertainty by knowing Y Information gain: (information before split) – (information after split)

Information Gain Information gain: (information before split) – (information after split)

Example
X1  X2  Y  Count
T   T   +  2
T   F   +  2
F   T   -  5
F   F   +  1
(X1 and X2 are attributes, Y is the label.)
IG(X1,Y) = H(Y) - H(Y|X1)
H(Y) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1
H(Y|X1) = P(X1=T) H(Y|X1=T) + P(X1=F) H(Y|X1=F) = (4/10)(-1 log2 1 - 0 log2 0) + (6/10)(-(5/6) log2(5/6) - (1/6) log2(1/6)) = 0.39
IG(X1,Y) = 1 - 0.39 = 0.61
Which one do we choose, X1 or X2?

Which one do we choose?
X1  X2  Y  Count
T   T   +  2
T   F   +  2
F   T   -  5
F   F   +  1
Information gain (X1,Y) = 0.61, Information gain (X2,Y) = 0.12. Pick X1: pick the variable which provides the most information gain about Y.
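The arithmetic can be checked with a short Python sketch (my addition, not from the recitation); the helper names entropy, conditional_entropy and information_gain are arbitrary, and the data is the table from the slide.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i log2 p_i over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X) = sum_x P(X=x) H(Y|X=x)."""
    n = len(ys)
    return sum(
        (len(sub) / n) * entropy(sub)
        for x in set(xs)
        for sub in [[y for xi, y in zip(xs, ys) if xi == x]]
    )

def information_gain(xs, ys):
    """IG(X,Y) = H(Y) - H(Y|X)."""
    return entropy(ys) - conditional_entropy(xs, ys)

# The table from the slide, expanded row by row as (X1, X2, Y).
rows = [("T", "T", "+")] * 2 + [("T", "F", "+")] * 2 + \
       [("F", "T", "-")] * 5 + [("F", "F", "+")] * 1
x1, x2, y = zip(*rows)

print(entropy(y))               # H(Y) = 1.0
print(information_gain(x1, y))  # IG(X1,Y) ≈ 0.61
print(information_gain(x2, y))  # IG(X2,Y) is smaller, so X1 is picked
```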

Recurse on branches [figure: the table split on X1 into two branches] After splitting on X1, recurse on each branch separately: one branch receives the X1 = T rows, the other branch the X1 = F rows, and the same attribute-selection procedure is applied to each.
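A compact, self-contained sketch of this recursive, top-down (ID3-style) construction, assuming categorical attributes; all names are illustrative and the stopping rules are the simplest possible ones.

```python
import math
from collections import Counter

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting `rows` (list of dicts) on `attr`."""
    n = len(labels)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Pick the highest-gain attribute, then recurse on each branch."""
    if len(set(labels)) == 1:           # pure node -> leaf
        return labels[0]
    if not attrs:                       # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [y for r, y in zip(rows, labels) if r[best] == value]
        tree[best][value] = build_tree(sub_rows, sub_labels, attrs - {best})
    return tree

# The slide's table: the tree splits on X1, then recurses on the X1 = F branch.
rows = [{"X1": "T", "X2": "T"}] * 2 + [{"X1": "T", "X2": "F"}] * 2 + \
       [{"X1": "F", "X2": "T"}] * 5 + [{"X1": "F", "X2": "F"}] * 1
labels = ["+"] * 4 + ["-"] * 5 + ["+"]
print(build_tree(rows, labels, {"X1", "X2"}))
# e.g. {'X1': {'T': '+', 'F': {'X2': {'T': '-', 'F': '+'}}}}
```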

Caveats The number of possible values influences the information gain. The more possible values, the higher the gain (the more likely it is to form small, but pure partitions)

Purity (diversity) measures: Gini (population diversity), Information Gain, Chi-square Test.

Overfitting You can perfectly fit any training data: zero bias, high variance. Two approaches to avoid this: 1. Stop growing the tree when further splitting the data does not yield an improvement 2. Grow a full tree, then prune it by eliminating nodes.
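As a hedged illustration with scikit-learn (a library choice of mine, not something the recitation prescribes): the first approach corresponds to early-stopping parameters such as max_depth and min_samples_leaf, the second to cost-complexity post-pruning via ccp_alpha.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Approach 1: stop growing the tree early (pre-pruning).
stunted = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Approach 2: grow a full tree, then prune it (cost-complexity pruning).
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(stunted.get_depth(), pruned.get_depth())
```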

Bagging Bagging (bootstrap aggregation) is a technique for reducing the variance of an estimated prediction function. For classification, a committee of trees is grown and each tree casts a vote for the predicted class.

Bootstrap The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set
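A tiny NumPy sketch (my own illustration, with made-up toy data) of drawing one bootstrap sample: draw N row indices with replacement and index into the training set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))        # toy training set: N = 30 examples, M = 5 features
y = rng.integers(0, 2, size=30)     # toy binary labels

idx = rng.integers(0, len(X), size=len(X))   # N draws with replacement
X_boot, y_boot = X[idx], y[idx]              # bootstrap sample, same size as the original
```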

Bagging [figure: a training set with N examples and M features, from which several bootstrap samples are drawn] Create bootstrap samples from the training data.

Random Forest Classifier [figure: a decision tree grown from a bootstrap sample of the N examples and M features] Construct a decision tree from each bootstrap sample.

Random Forest Classifier [figure: the collection of trees] Take the majority vote of the trees' predictions.

Bagging Given a training set Z = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, draw bootstrap samples Z^{*b}, b = 1, ..., B. Let f^{*b}(x) be the prediction at input x when bootstrap sample b is used for training; the bagged estimate averages these: f_bag(x) = (1/B) Σ_{b=1..B} f^{*b}(x). (Hastie et al., Chapter 8.7)
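To make the committee explicit, here is a short sketch of bagging for classification (assumptions: scikit-learn decision trees as the base learner, synthetic data from make_classification, and a simple majority vote in place of the average used for regression).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
B = 25                                   # number of bootstrap samples / trees

trees = []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample Z^{*b}
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Committee prediction: each tree votes, the majority wins.
votes = np.stack([t.predict(X) for t in trees])    # shape (B, N)
y_bag = (votes.mean(axis=0) > 0.5).astype(int)     # majority vote for 0/1 labels
print((y_bag == y).mean())
```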

Bagging: a simulated example Generated a sample of size N = 30, with two classes and p = 5 features, each having a standard Gaussian distribution with pairwise correlation 0.95. The response Y was generated according to Pr(Y = 1 | x1 ≤ 0.5) = 0.2, Pr(Y = 1 | x1 > 0.5) = 0.8.
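A sketch of how such data could be generated (assuming, as in Hastie et al., the pairwise correlation 0.95 and the label probabilities above; NumPy is my choice of tool, not the recitation's).

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, rho = 30, 5, 0.95                              # sizes and pairwise correlation

cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # 1 on the diagonal, rho elsewhere
X = rng.multivariate_normal(np.zeros(p), cov, size=N)

# Pr(Y = 1 | x1 <= 0.5) = 0.2,  Pr(Y = 1 | x1 > 0.5) = 0.8
p1 = np.where(X[:, 0] <= 0.5, 0.2, 0.8)
y = rng.binomial(1, p1)
```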

Bagging [figure: the original tree and trees grown on bootstrap samples] Notice that the bootstrap trees are different from the original tree.

Bagging (Hastie et al., Example 8.7.1) Treat the voting proportions as class probabilities rather than taking a strict majority vote.

Random forest classifier Random forest classifier: an extension of bagging which uses de-correlated trees.

Random Forest Classifier [figure: the training data, N examples with M features]

Random Forest Classifier [figure] Create bootstrap samples from the training data.

Random Forest Classifier [figure] Construct a decision tree from each bootstrap sample.

Random Forest Classifier [figure] At each node, when choosing the split feature, consider only a random subset of m < M features.

Random Forest Classifier [figure] Create a decision tree from each bootstrap sample in this way.

Random Forest Classifier [figure] Take the majority vote of the trees' predictions.
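The recipe above is, up to details, what scikit-learn's RandomForestClassifier implements (the library is my example, not the recitation's); its max_features parameter plays the role of m < M.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# n_estimators trees, each grown on a bootstrap sample; at every split only
# max_features randomly chosen features are considered (m < M).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))   # class of each example = majority vote of the trees
```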

Random forest Available package: To read more: