Carla P. Gomes CS4700 CS 4700: Foundations of Artificial Intelligence Prof. Carla P. Gomes Module: Extensions of Decision Trees (Reading: Chapter 18)
Carla P. Gomes CS4700 Extensions of the Decision Tree Learning Algorithm (Briefly) Noisy data and overfitting Cross Validation Missing Data (See exercise ) Using gain ratios (See exercise 18.13) Real-valued data Generation of rules
Carla P. Gomes CS4700 Noisy Data and Overfitting
Carla P. Gomes CS4700 Noisy data Many kinds of "noise" that could occur in the examples: –Two examples have same attribute/value pairs, but different classifications report majority classification for the examples corresponding to the node deterministic hypothesis. report estimated probabilities of each classification using the relative frequency (if considering stochastic hypotheses) –Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase –The classification is wrong (e.g., + instead of -) because of some error
Carla P. Gomes CS4700 Overfitting Problem of trying to predict the roll of a die. The experiment data include: –Day of the week; (2) Month of the week; (3) Color of the die; …. As long as two examples have identical descriptions, DTL will find an exact hypothesis, with irrelevant attributes. Some attributes are irrelevant to the decision-making process, e.g., color of a die is irrelevant to its outcome but they are used to differentiate examples Overfitting. Overfitting means fitting the training set “too well” performance on the test set degrades.
Carla P. Gomes CS4700 Overfitting If the hypothesis space has many dimensions because of a large number of attributes, we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features. Fix by pruning lower nodes in the decision tree. For example, if Gain of the best attribute at a node is below a threshold, stop and make this node a leaf rather than generating children nodes. Overfitting is a key problem in learning. Mathematical treatment of overfitting beyond this course. Flavor of “decision tree pruning”
Carla P. Gomes CS4700 Overfitting Let’s consider D, the entire distribution of data, and T, the training set. Hypothesis h H overfits D if – h’ h H such that error T (h) < error T (h’) but error D (h) > error D (h’)
Carla P. Gomes CS4700 Dealing with Overfitting Decision tree pruning prevent splitting on features that are not clearly relevant. Testing of relevance of features: statistical tests cross-validation Grow full tree, then post-prune rule post-pruning
Carla P. Gomes CS4700 Pruning Decision Trees Idea: prevent splitting on attributes that are not relevant. How do we decide that an attribute is irrelevant? If an attribute is not really relevant, the resulting subsets would have roughly the same proportion of each class (pos/neg) as the original set Gain would be close to zero. How large should the Gain be, to be considered significant? Statistical significance test:
Carla P. Gomes CS4700 Statistical Significance Test: Pruning Decision Trees How large should the Gain be to be considered significant? statistical significance test: Null hypothesis (H0) --- the attribute we are considering to split on is irrelevant, i.e., Gain is zero for a sufficiently large sample. We can measure the deviation (D) by comparing the actual number of pos and negative examples in each of the new v subsets (p i and n i ) with the expected numbers and, assuming true irrelevance of the attribute, therefore using the original set proportion (p pos and n neg examples) applied to the new subsets:
Carla P. Gomes CS4700 2 Test D ~ 2 Distribution with (v-1) degrees of freedom v = size of domain of attribute A p i – actual number of positive examples n i – actual number of negative examples – number of positive examples using the original set proportion p of positive examples (therefore assuming Ho, i.e., irrelevant attribute) – number of negative examples using the original set proportion n of positive examples (therefore assuming Ho, i.e., irrelevant attribute)
Carla P. Gomes CS4700 Ho: A is irrelevant: 2 Test: Summary D ~ 2 Distribution with (v-1) degrees of freedom (exp – obs) 2 / exp - probability of making a mistake (i.e., rejecting H0 when it is true) T (v-1) – threshold value (i.e., theoretical value for 2 distribution with (v-1) degrees of freedom, probability in the tail) D < T (v-1) do not reject H0 (therefore do not split on A since we do not reject that it is irrelevant) D > T (v-1) reject H0; therefore split on A since we reject that it is irrelevant)
2 Table
Carla P. Gomes CS4700 Cross Validation Another technique to reduce overfitting it can be applied to any learning algorithm. Key idea: Estimate how well each hypothesis will predict unseen data. Set aside some fraction of the training set and use it to test the prediction performance of a hypothesis induced from the remaining data. K-fold cross-validation – run k experiments by setting each time 1/k of the data to test on, and average the results. Popular values for k: 5 and 10. Leave-one-out cross-validation – extreme case k=n. Note: in order to avoid peeking, another test set should be used.
Cross Validation CV( data S, alg L, int k ) Divide S into k disjoint sets { S 1, S 2, …, S k } For i = 1..k do Run L on S -i = S – S i obtain L(S -i ) = h i Evaluate h i on S i err S i (h i ) = 1/|S i | x,y S i I(h i (x) y) Return Average 1/k i err S i (h i ) A method for estimating the accuracy (or error) of a learner Note: in order to avoid peeking, another test set should be used.
Carla P. Gomes CS4700 Validation Set Sometimes an additional validation set is used, independent from the training and testing sets. The validation set can be used for example to watch out for overfitting, or to choose the best input parameters for a classifier model. In that case the data is split in three parts: training, validation, and test. 1) Use the training data in a cross-validation scheme like 10-fold or 2/3 - 1/3 to simply estimate the average quality (e.g. error rate (or accuracy) of a classifier)
Carla P. Gomes CS4700 Validation Set 2) Leave an additional subset of data, the validation set, to adjust additional parameters or adjust the structure (such as number of layers or neurons in a neural network, or number of nodes in a decision tree) of the model. For example: you can use the validation set to decide when to stop growing the decision tree, thus to test for overfitting. Repeat this for many parameter/structure choices and select the choice with best quality on the validation set 3) Finally take this best choice of parameters + model from step 2 and use it to estimate the quality on the test data.
Carla P. Gomes CS4700 Validation Set So Training set to compute the model, Validation set to choose the best parameters of this model (in case there are "additional" parameters that cannot be computed based on training) Test set as the final "judge" to get an estimate of the quality on new data that was used neither to train the model, nor to determine its underlying parameters or structure or complexity of this model
Carla P. Gomes CS4700 How to pick best tree? Measure performance over training data Measure performance over separate validation data set MDL (minimal description length): minimize size(tree) + size(misclassifications(tree))
Selecting Algorithm Parameters Real-world Process (x 1,y 1 ), …, (x n,y n ) Learner 1 (x 1,y 1 ),…(x k,y k ) Train Data D train Test Data D test split randomly (50%) split randomly (30%) h1h1 D train Data D drawn randomly Optimal choice of algorithm parameters depends on learning task: (e.g., nodes in decision trees, nodes in neural network, k in k in k-nearest neighbor, etc) (x 1,y 1 ), …, (x l,y l ) Validation Data D val Learner p hphp D val … argmin h i {Err val (h i )} split randomly (20%) h final D train
Carla P. Gomes CS4700 Extensions of Decision Tree Learning Only a flavor, briefly
Carla P. Gomes CS4700 Missing Data Given an example with a missing value to an attribute: Assume that the example has all possible values for the attribute Weigh each value according to its frequency among all of the examples that reach that node in the decision tree. (Note: the classification algorithm should follow all the branches at any node for which a value is missing and should multiply the weights along each path) See exercise 18.12
Carla P. Gomes CS4700 Mulivalued Attributes: Using Gain Ratios The notion of Information Gain favors attributes that have a large number of values. –If we have an attribute A that has a distinct value for each example, then each subset of examples would be a singleton – Entropy(S,A) is 0, thus Gain(S,A) is maximal. Nevertheless the attribute could be bad for generalization or even irrelevant. Average entropy of each subset induced by attribute A Example: name of restaurant does not generalize well!
Carla P. Gomes CS4700 Using Gain Ratios To compensate for this Quinlan suggests using a Gain ratio instead of Gain: GainRatio(S,A) = Gain(S,A) / SplitInfo(A,S) SplitInfo(A,S) (Information Content of an attribute or intrinsic information of an attribute) information due to the split of S on the basis of value of categorical attribute A. SplitInfo(A,S) = I(|S1|/|S|, |S2|/|S|,.., |Sm|/|S|) where {S1, S2,.. Sm} is the partition of S induced by values of A
Carla P. Gomes CS4700 Real-valued data Select a set of thresholds defining intervals; –each interval becomes a discrete value of the attribute We can use some simple heuristics –always divide into quartiles We can use domain knowledge –divide age into infant (0-2), toddler (3 - 5), and school aged (5-8) or treat this as another learning problem –try a range of ways to discretize the continuous variable and see which yield “better results” w.r.t. some metric –(e.g., information gain, gain ratio, etc).
Carla P. Gomes CS4700 Converting Trees to Rules Every decision tree corresponds to set of rules: –IF (Patrons = None) THEN WillWait = No –IF (Patrons = Full) & (Hungry = No) &(Type = French) THEN WillWait = Yes –...
Carla P. Gomes CS4700 Decision Trees to Rules It is easy to derive a rule set from a decision tree: write a rule for each path in the decision tree from the root to a leaf. In that rule the left-hand side is easily built from the label of the nodes and the labels of the arcs. The resulting rules set can be simplified: –Let LHS be the left hand side of a rule. –Let LHS' be obtained from LHS by eliminating some conditions. –We can certainly replace LHS by LHS' in this rule if the subsets of the training set that satisfy respectively LHS and LHS' are equal.
Carla P. Gomes CS4700 Fighting Overfitting: Using Rule Post-Pruning
Carla P. Gomes CS4700 C4.5 This is the strategy used by the most successful commercial decision learning method C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on. C4.5: Programs for Machine Learning J. Ross Quinlan, The Morgan Kaufmann Series in Machine Learning, Pat Langley,
Carla P. Gomes CS4700 Summary: When to use Decision Trees Instances presented as attribute-value pairs Method of approximating discrete-valued functions Target function has discrete values: classification problems Disjunctive descriptions required (propositional logic) Robust to noisy data: Training data may contain –errors –missing attribute values Typical bias: prefer smaller trees (Ockham's razor ) Widely used, practical and easy to interpret results
Carla P. Gomes CS4700 Summary of Decision Tree Learning Inducing decision trees is one of the most widely used learning methods in practice Can outperform human experts in many problems Strengths include –Fast –simple to implement –can convert result to a set of easily interpretable rules –empirically valid in many commercial products –handles noisy data Weaknesses include: –"Univariate" splits/partitioning using only one attribute at a time so limits types of possible trees –large decision trees may be hard to understand –requires fixed-length feature vectors –non-incremental (i.e., batch method)