CS 4700: Foundations of Artificial Intelligence
Prof. Carla P. Gomes
Module: Extensions of Decision Trees (Reading: Chapter 18)

Extensions of the Decision Tree Learning Algorithm (Briefly)
– Noisy data and overfitting
– Cross-validation
– Missing data (see exercise 18.12)
– Using gain ratios (see exercise 18.13)
– Real-valued data
– Generation of rules

Noisy Data and Overfitting

Noisy data
Many kinds of "noise" can occur in the examples:
– Two examples have the same attribute/value pairs but different classifications.
→ Report the majority classification for the examples corresponding to the node (deterministic hypothesis), or
→ Report the estimated probability of each classification, using the relative frequencies (if considering stochastic hypotheses).
– Some attribute values are incorrect because of errors in data acquisition or in the preprocessing phase.
– The classification is wrong (e.g., + instead of −) because of some error.
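
As a minimal illustration (not part of the original slides), here is one way to label a leaf whose examples conflict: take the majority class for a deterministic hypothesis, or the relative class frequencies for a stochastic one. The function name and data layout are assumptions of this sketch.

from collections import Counter

def leaf_label(labels, stochastic=False):
    """Label a leaf from the (possibly conflicting) class labels of the examples reaching it.

    stochastic=False -> majority class (deterministic hypothesis)
    stochastic=True  -> relative frequency of each class (stochastic hypothesis)
    """
    counts = Counter(labels)
    if stochastic:
        total = sum(counts.values())
        return {cls: n / total for cls, n in counts.items()}
    return counts.most_common(1)[0][0]

# Three conflicting examples reach the same leaf:
print(leaf_label(['+', '+', '-']))                   # '+'
print(leaf_label(['+', '+', '-'], stochastic=True))  # {'+': 0.67, '-': 0.33} (approximately)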

Overfitting
Consider the problem of trying to predict the roll of a die. The experiment data include: (1) day of the week; (2) month of the year; (3) color of the die; ...
As long as no two examples have identical descriptions, DTL will find an exact hypothesis that fits all the training examples, even if it has to use irrelevant attributes.
Some attributes are irrelevant to the decision-making process (e.g., the color of a die is irrelevant to its outcome), but they are still used to differentiate examples → overfitting.
Overfitting means fitting the training set "too well" → performance on the test set degrades.

Overfitting
If the hypothesis space has many dimensions, because of a large number of attributes, we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features.
Fix: prune lower nodes in the decision tree. For example, if the Gain of the best attribute at a node is below a threshold, stop and make this node a leaf rather than generating child nodes.
Overfitting is a key problem in learning. A full mathematical treatment of overfitting is beyond this course.
→ A flavor of "decision tree pruning"

Overfitting
Let D be the entire distribution of data and T the training set.
A hypothesis h ∈ H overfits D if
∃ h′ ∈ H, h′ ≠ h, such that error_T(h) < error_T(h′) but error_D(h) > error_D(h′).

Dealing with Overfitting
– Decision tree pruning: prevent splitting on features that are not clearly relevant.
– Testing the relevance of features: statistical tests, cross-validation.
– Grow the full tree, then post-prune: rule post-pruning.

Pruning Decision Trees
Idea: prevent splitting on attributes that are not relevant.
How do we decide that an attribute is irrelevant? If an attribute is not really relevant, the resulting subsets have roughly the same proportion of each class (pos/neg) as the original set → the Gain is close to zero.
How large should the Gain be to be considered significant? → Statistical significance test.

Statistical Significance Test: Pruning Decision Trees
How large should the Gain be to be considered significant? → Statistical significance test.
Null hypothesis (H0): the attribute we are considering splitting on is irrelevant, i.e., its Gain is zero for a sufficiently large sample.
We measure the deviation D by comparing the actual numbers of positive and negative examples in each of the v new subsets (p_i and n_i) with the expected numbers, computed under the assumption that the attribute is truly irrelevant, i.e., by applying the proportions of positive and negative examples in the original set (p and n) to each new subset:

Carla P. Gomes CS4700  2 Test D ~  2 Distribution with (v-1) degrees of freedom v = size of domain of attribute A p i – actual number of positive examples n i – actual number of negative examples – number of positive examples using the original set proportion p of positive examples (therefore assuming Ho, i.e., irrelevant attribute) – number of negative examples using the original set proportion n of positive examples (therefore assuming Ho, i.e., irrelevant attribute)

χ² Test: Summary
H0: attribute A is irrelevant.
D = Σ (observed − expected)² / expected  ~  χ² distribution with (v − 1) degrees of freedom
α = probability of making a mistake (i.e., rejecting H0 when it is true)
T_α(v−1) = threshold value, i.e., the theoretical value of the χ² distribution with (v − 1) degrees of freedom that leaves probability α in the tail
– D < T_α(v−1) → do not reject H0 (therefore do not split on A, since we cannot reject that it is irrelevant)
– D > T_α(v−1) → reject H0 (therefore split on A, since we reject that it is irrelevant)
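
A minimal sketch of this pre-pruning test (not part of the original slides): given the per-subset counts (p_i, n_i) produced by splitting a node on a candidate attribute, it computes D and compares it with T_α(v−1) taken from scipy's χ² distribution. All names are assumptions of the sketch.

from scipy.stats import chi2

def split_is_significant(subset_counts, alpha=0.05):
    """Chi-square pre-pruning test.

    subset_counts: list of (p_i, n_i) pairs, one per value of candidate attribute A.
    Returns True if H0 ("A is irrelevant") is rejected, i.e., the split is worth making.
    """
    p = sum(pi for pi, ni in subset_counts)   # positives at this node
    n = sum(ni for pi, ni in subset_counts)   # negatives at this node
    v = len(subset_counts)

    D = 0.0
    for pi, ni in subset_counts:
        size = pi + ni
        p_hat = p * size / (p + n)            # expected positives if A is irrelevant
        n_hat = n * size / (p + n)            # expected negatives if A is irrelevant
        if p_hat > 0:
            D += (pi - p_hat) ** 2 / p_hat
        if n_hat > 0:
            D += (ni - n_hat) ** 2 / n_hat

    threshold = chi2.ppf(1 - alpha, df=v - 1)  # T_alpha(v-1)
    return D > threshold

# Example: a 3-valued attribute splitting 8 positives / 4 negatives into three subsets
print(split_is_significant([(4, 0), (4, 0), (0, 4)]))  # D = 12.0 > 5.99 -> True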

χ² Table

Cross Validation
Another technique to reduce overfitting; it can be applied to any learning algorithm.
Key idea: estimate how well each hypothesis will predict unseen data. Set aside some fraction of the training set and use it to test the prediction performance of a hypothesis induced from the remaining data.
K-fold cross-validation: run k experiments, each time setting aside 1/k of the data to test on, and average the results. Popular values for k: 5 and 10.
Leave-one-out cross-validation: the extreme case k = n.
Note: in order to avoid peeking, a separate test set should still be used.

Cross Validation
A method for estimating the accuracy (or error) of a learner.
CV(data S, alg L, int k):
  Divide S into k disjoint sets {S_1, S_2, ..., S_k}
  For i = 1..k do
    Run L on S_{−i} = S − S_i, obtaining L(S_{−i}) = h_i
    Evaluate h_i on S_i:  err_{S_i}(h_i) = 1/|S_i| · Σ_{(x,y) ∈ S_i} I(h_i(x) ≠ y)
  Return the average: 1/k · Σ_i err_{S_i}(h_i)
Note: in order to avoid peeking, a separate test set should still be used.
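
A minimal runnable version of this procedure (an illustrative sketch, not the course's code); the learner is represented as a function fit(train) that returns a hypothesis h, with h(x) giving a prediction, and it assumes len(data) >= k.

import random

def cross_validation(data, fit, k=10, seed=0):
    """Estimate a learner's error by k-fold cross-validation.

    data: list of (x, y) examples; fit(train) -> hypothesis h, with h(x) -> prediction.
    Returns the average held-out error over the k folds.
    """
    examples = data[:]
    random.Random(seed).shuffle(examples)
    folds = [examples[i::k] for i in range(k)]        # k disjoint subsets S_1..S_k

    errors = []
    for i in range(k):
        test = folds[i]                               # S_i
        train = [ex for j in range(k) if j != i for ex in folds[j]]   # S - S_i
        h = fit(train)                                # h_i = L(S_-i)
        errors.append(sum(1 for x, y in test if h(x) != y) / len(test))
    return sum(errors) / k

# Example with a trivial majority-class learner (a hypothetical stand-in for DTL):
def majority_learner(train):
    pos = sum(1 for x, y in train if y == '+')
    label = '+' if pos >= len(train) / 2 else '-'
    return lambda x: label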

Validation Set
Sometimes an additional validation set is used, independent of the training and test sets. The validation set can be used, for example, to watch out for overfitting or to choose the best input parameters for a classifier model. In that case the data is split into three parts: training, validation, and test.
1) Use the training data in a cross-validation scheme, such as 10-fold or 2/3 - 1/3, simply to estimate the average quality of a classifier (e.g., its error rate or accuracy).

Validation Set
2) Set aside an additional subset of the data, the validation set, to adjust additional parameters or the structure of the model (such as the number of layers or neurons in a neural network, or the number of nodes in a decision tree). For example, you can use the validation set to decide when to stop growing the decision tree, i.e., to test for overfitting. Repeat this for many parameter/structure choices and select the choice with the best quality on the validation set.
3) Finally, take the best choice of parameters + model from step 2 and use the test data to estimate its quality.

Validation Set
So:
– Training set: used to compute the model.
– Validation set: used to choose the best parameters of this model (in case there are "additional" parameters that cannot be computed from the training data alone).
– Test set: the final "judge", used to get an estimate of the quality on new data that was used neither to train the model nor to determine its parameters, structure, or complexity.

How to pick the best tree?
– Measure performance over the training data.
– Measure performance over a separate validation data set.
– MDL (minimum description length): minimize size(tree) + size(misclassifications(tree)).

Selecting Algorithm Parameters
[Diagram] Data D is drawn randomly from the real-world process and split randomly into a training set D_train (50%), a test set D_test (30%), and a validation set D_val (20%). Each parameter setting (Learner 1 ... Learner p) is trained on D_train to produce hypotheses h_1 ... h_p; the final hypothesis is h_final = argmin_{h_i} {Err_val(h_i)}, i.e., the one with the lowest error on D_val.
The optimal choice of algorithm parameters depends on the learning task (e.g., number of nodes in decision trees, nodes in a neural network, k in k-nearest neighbors, etc.).
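
A minimal sketch of this protocol (illustrative only; the split fractions follow the diagram, and the candidate parameter here is a hypothetical max_depth handled by a caller-supplied fit function):

import random

def train_val_test_split(data, seed=0):
    """Randomly split (x, y) examples into 50% train, 20% validation, 30% test."""
    examples = data[:]
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train, n_val = int(0.5 * n), int(0.2 * n)
    train = examples[:n_train]
    val = examples[n_train:n_train + n_val]
    test = examples[n_train + n_val:]
    return train, val, test

def error(h, examples):
    return sum(1 for x, y in examples if h(x) != y) / len(examples)

def select_parameters(train, val, fit, candidate_depths):
    """Return (h_final, best_depth): the hypothesis with the lowest validation error.

    fit(train, depth) is assumed to return a hypothesis h with h(x) -> prediction.
    """
    best_depth = min(candidate_depths, key=lambda d: error(fit(train, d), val))
    return fit(train, best_depth), best_depth

# The test set is used exactly once at the end: report error(h_final, test).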

Extensions of Decision Tree Learning (only a flavor, briefly)

Missing Data
Given an example with a missing value for an attribute:
– Assume that the example has all possible values for that attribute.
– Weight each value according to its frequency among all of the examples that reach that node in the decision tree.
(Note: the classification algorithm should follow all the branches at any node for which a value is missing and should multiply the weights along each path.)
See exercise 18.12.
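
A minimal sketch of this classification rule (the tree representation is an assumption of the sketch: an internal node stores the tested attribute, its child subtrees, and branch weights estimated from training frequencies; a leaf stores a class label):

from collections import defaultdict

class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, attribute, children, branch_weights):
        self.attribute = attribute            # attribute tested at this node
        self.children = children              # value -> subtree
        self.branch_weights = branch_weights  # value -> fraction of training examples with that value here

def classify_with_missing(tree, example, weight=1.0, scores=None):
    """Return {class: total weight}; missing values follow all branches, multiplying the weights."""
    if scores is None:
        scores = defaultdict(float)
    if isinstance(tree, Leaf):
        scores[tree.label] += weight
        return scores
    value = example.get(tree.attribute)       # None if the value is missing
    if value is not None:
        classify_with_missing(tree.children[value], example, weight, scores)
    else:
        for v, child in tree.children.items():
            classify_with_missing(child, example, weight * tree.branch_weights[v], scores)
    return scores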

Multivalued Attributes: Using Gain Ratios
The notion of information gain favors attributes that have a large number of values.
– If we have an attribute A with a distinct value for each example, then each subset of examples is a singleton.
– → The average entropy of the subsets induced by A, Entropy(S, A), is 0, so Gain(S, A) is maximal.
Nevertheless, the attribute could be bad for generalization, or even irrelevant.
Example: the name of the restaurant does not generalize well!

Using Gain Ratios
To compensate for this, Quinlan suggests using the gain ratio instead of the gain:
GainRatio(S, A) = Gain(S, A) / SplitInfo(A, S)
SplitInfo(A, S) (the information content, or intrinsic information, of an attribute) is the information due to the split of S on the basis of the values of the categorical attribute A:
SplitInfo(A, S) = I(|S1|/|S|, |S2|/|S|, ..., |Sm|/|S|), where {S1, S2, ..., Sm} is the partition of S induced by the values of A.
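
A minimal sketch of these quantities (the helper names are assumptions; values is the attribute column and labels the class column for the same examples):

import math
from collections import Counter

def entropy(items):
    """I(p_1, ..., p_c): entropy of a list of discrete items, in bits."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def gain(values, labels):
    """Information gain of splitting the examples on the attribute values."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def split_info(values):
    """Intrinsic information of the attribute: entropy of the split proportions |S_j|/|S|."""
    return entropy(values)

def gain_ratio(values, labels):
    si = split_info(values)
    return gain(values, labels) / si if si > 0 else 0.0

# A many-valued attribute (e.g., a restaurant's name) gets a high gain but also a high
# split info, so its gain ratio is penalized relative to a genuinely informative attribute.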

Real-valued data
Select a set of thresholds defining intervals; each interval becomes a discrete value of the attribute.
– We can use simple heuristics, e.g., always divide into quartiles.
– We can use domain knowledge, e.g., divide age into infant (0-2), toddler (3-5), and school-aged (5-8).
– Or we can treat this as another learning problem: try a range of ways to discretize the continuous variable and see which yields "better results" w.r.t. some metric (e.g., information gain, gain ratio, etc.).
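
A minimal sketch of the quartile heuristic mentioned above (the function names are assumptions); it maps a continuous attribute to one of four interval labels:

def quartile_thresholds(values):
    """Return the three cut points that split the sorted values into quartiles."""
    s = sorted(values)
    return [s[len(s) * q // 4] for q in (1, 2, 3)]

def discretize(value, thresholds):
    """Map a continuous value to an interval index 0..len(thresholds)."""
    return sum(value > t for t in thresholds)

ages = [1, 2, 3, 4, 5, 6, 7, 8, 30, 41, 52, 63]
cuts = quartile_thresholds(ages)                 # [4, 7, 41]
print([discretize(a, cuts) for a in ages])       # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3]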

Converting Trees to Rules
Every decision tree corresponds to a set of rules:
– IF (Patrons = None) THEN WillWait = No
– IF (Patrons = Full) & (Hungry = No) & (Type = French) THEN WillWait = Yes
– ...

Decision Trees to Rules
It is easy to derive a rule set from a decision tree: write a rule for each path in the decision tree from the root to a leaf. In each rule, the left-hand side is easily built from the labels of the nodes and the labels of the arcs along the path.
The resulting rule set can be simplified:
– Let LHS be the left-hand side of a rule.
– Let LHS' be obtained from LHS by eliminating some of its conditions.
– We can certainly replace LHS by LHS' in this rule if the subsets of the training set that satisfy LHS and LHS', respectively, are equal.
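
A minimal sketch of the rule-extraction step (the tuple-based tree representation is an assumption of the sketch); it emits one (conditions, class) rule per root-to-leaf path:

def tree_to_rules(tree, conditions=()):
    """Return a list of (conditions, label) rules, one per root-to-leaf path.

    Assumed representation: a leaf is ('leaf', label); an internal node is
    ('node', attribute, {value: subtree, ...}).
    """
    if tree[0] == 'leaf':
        return [(list(conditions), tree[1])]
    _, attribute, children = tree
    rules = []
    for value, child in children.items():
        rules.extend(tree_to_rules(child, conditions + ((attribute, value),)))
    return rules

# A tiny hypothetical tree in the spirit of the WillWait example:
tree = ('node', 'Patrons', {
    'None': ('leaf', 'No'),
    'Some': ('leaf', 'Yes'),
    'Full': ('node', 'Hungry', {'No': ('leaf', 'No'), 'Yes': ('leaf', 'Yes')}),
})
for conds, label in tree_to_rules(tree):
    print('IF ' + ' & '.join(f'({a} = {v})' for a, v in conds) + f' THEN WillWait = {label}')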

Fighting Overfitting: Using Rule Post-Pruning

C4.5
This is the strategy used by the most successful commercial decision-tree learning method, C4.5. C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on.
Reference: J. Ross Quinlan, C4.5: Programs for Machine Learning, The Morgan Kaufmann Series in Machine Learning (series editor: Pat Langley), 1993.

Summary: When to Use Decision Trees
– Instances are presented as attribute-value pairs.
– A method for approximating discrete-valued functions.
– The target function has discrete values: classification problems.
– Disjunctive descriptions may be required (propositional logic).
– Robust to noisy data: the training data may contain errors or missing attribute values.
– Typical bias: prefer smaller trees (Ockham's razor).
– Widely used, practical, and the results are easy to interpret.

Summary of Decision Tree Learning
Inducing decision trees is one of the most widely used learning methods in practice, and it can outperform human experts in many problems.
Strengths:
– Fast.
– Simple to implement.
– The result can be converted to a set of easily interpretable rules.
– Empirically validated in many commercial products.
– Handles noisy data.
Weaknesses:
– "Univariate" splits/partitioning use only one attribute at a time, which limits the types of possible trees.
– Large decision trees may be hard to understand.
– Requires fixed-length feature vectors.
– Non-incremental (i.e., a batch method).