1
Overfitting and Evaluation
Truong-Huy Nguyen Overfitting and Evaluation
2
Outline
Overfitting: examples, causes of overfitting, techniques to avoid overfitting
Evaluation: performance metrics (scalar and visualization), metric estimation (cross-validation)
3
Overfitting
4
Overfitting Example
The issue of overfitting had been known long before decision trees and data mining.
In electrical circuits, Ohm's law states that current (I) is directly proportional to the voltage (V) and inversely proportional to the resistance (R): I = V/R, or V = IR.
Experimentally measure 10 points, then fit a curve to the resulting data.
A 9th-degree polynomial fits the training data perfectly (an (n-1)-degree polynomial can fit n points exactly).
(Figure: current (I) vs. voltage (V) with the wiggly 9th-degree fit.)
"Ohm was wrong, we have found a more accurate function!"
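To make the slide's point concrete, here is a minimal Python sketch (not from the slides): it fits a 9th-degree and a 1st-degree polynomial to 10 noisy Ohm's-law measurements. The resistance value, voltage range, and noise level are made-up illustration choices.

import numpy as np

rng = np.random.default_rng(0)
R = 2.0                                          # assumed "true" resistance (ohms)
voltage = np.linspace(1.0, 10.0, 10)             # 10 experimental points (illustrative)
current = voltage / R + rng.normal(0.0, 0.05, size=10)   # I = V/R plus measurement noise

wiggly = np.polyfit(voltage, current, deg=9)     # 9th degree: passes (almost) exactly through the 10 points
linear = np.polyfit(voltage, current, deg=1)     # Ohm's-law shape: slope close to 1/R

v_new = np.array([12.0, 15.0])                   # unseen voltages outside the training range
print(np.polyval(wiggly, v_new))                 # wild values: the overfit curve generalizes badly
print(np.polyval(linear, v_new), v_new / R)      # close to the true I = V/R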
5
Overfitting Example
Testing Ohm's Law: I = V/R (or V = IR)
(Figure: current (I) vs. voltage (V) with a straight-line fit.)
Better generalization is achieved with a linear function that fits the training data less accurately.
6
Underfitting and Overfitting
Decision Tree: error rate versus model complexity (number of decision tree nodes on the x-axis). How many nodes would you use?
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, the training error is low but the test error rate is high.
7
The Right Fit
Best generalization performance seems to be achieved with around 130 nodes.
8
Causes of Overfitting
1. Noise
2. Insufficient data
3. Model complexity
9
Cause 1: Noise
The decision boundary is distorted by a noise point.
10
Cause 2: Insufficient Examples
(Figure: hollow red circles are test data.)
A lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region.
An insufficient number of training records in the region leads to a wrong model due to sampling error (i.e., the training subset is not representative of the population).
11
Cause 3: Model complexity
Decision Tree: growing to purity is bad (overfitting).
(Figure: decision regions over x1 = petal length and x2 = sepal width.)
13
Cause 3: Model complexity
Decision Tree: growing to purity is bad (overfitting).
A leaf that is not statistically supportable: remove the split and merge the leaves.
(Figure: decision regions over x1 = petal length and x2 = sepal width.)
14
Avoid Overfitting
Split the labeled data into different sets.
Training set: build the model. The resubstitution error (error rate on the training set) is a bad indicator of performance on new data; overfitting the training data gives a good resubstitution error but bad predictive accuracy.
Test set: evaluate the model. The model did not see the test data, so the evaluation is fair.
Validation set: tune a model or choose between alternative models; often used for overfitting avoidance.
All three data sets may be generated from a single labeled data set.
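As a concrete illustration, here is a minimal Python sketch of carving one labeled data set into training, validation, and test sets; the 60/20/20 proportions and the function name are my own choices, not prescribed by the slides.

import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then slice into train / validation / test index sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]                 # remaining ~60% for training
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])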
15
Avoid Overfitting
Idea: Occam's Razor (William of Ockham): among competing hypotheses, the one with the fewest assumptions should be selected.
In other words, simpler models are preferred.
For a complex model, there is a greater chance that it was fitted accidentally by errors in the data (such as noise).
One should therefore include model complexity when evaluating a model.
16
How to Avoid Overfitting?
Pre-pruning: stop growing the tree before it reaches the point where it perfectly classifies the training data. Correctly estimating when to stop is difficult.
Post-pruning (more popular): allow the tree to overfit the data, then prune the tree back.
Both need a way to determine a satisfactory tree size.
17
Pre-pruning Early Stopping Rule
Stop the algorithm before it becomes a fully-grown tree.
Typical stopping conditions for a node:
Stop if all instances belong to the same class.
Stop if all the attribute values are the same.
More restrictive conditions (see the sketch below):
Stop if the number of instances is less than some user-specified threshold.
Stop if the class distribution of instances is independent of the available features (e.g., using a chi-squared test).
Stop if expanding the current node does not significantly improve impurity measures (e.g., Gini or information gain).
Assign some penalty for model complexity when deciding whether to continue refining the model (e.g., a penalty for each leaf node in a decision tree).
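A hypothetical sketch of what such checks might look like at a candidate node; the threshold values and the function name are illustrative assumptions, not settings from the slides.

def should_stop(labels, best_gain, n_min=20, min_gain=1e-3):
    """Pre-pruning: return True if this node should not be split further."""
    if len(set(labels)) == 1:      # all instances belong to the same class
        return True
    if len(labels) < n_min:        # fewer instances than the user-specified threshold
        return True
    if best_gain < min_gain:       # best split does not improve impurity enough
        return True
    return False

print(should_stop(labels=[0, 1, 0, 1], best_gain=0.25))   # True: only 4 instances at the node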
18
Post-pruning Steps
Grow the decision tree to its entirety.
Trim the nodes of the decision tree in bottom-up fashion. Two approaches:
Optimizing Minimum Description Length (MDL): represents the combination of the model's accuracy and complexity; does not use a validation set.
Reduced Error Pruning: if the generalization error (on a validation set) improves after trimming, replace the sub-tree with a leaf node whose class label is the majority class of the instances in the sub-tree.
19
Minimum Description Length (MDL)
Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
Cost(Data | Model) encodes the misclassification errors: if you have the model, you only need to remember the examples that do not agree with it.
Cost(Model) is the cost of encoding the model (in bits).
The general idea is to trade off model complexity against the number of errors while assigning objective costs to both; costs are based on bit encoding.
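A toy numeric illustration of the trade-off. The specific encodings used here (a fixed number of bits per tree node, log2(n) bits to identify each misclassified example) are assumptions for the sketch, not the slides' encoding scheme.

import math

def mdl_cost(n_errors, n_examples, n_nodes, bits_per_node=8):
    cost_data_given_model = n_errors * math.log2(n_examples)   # remember the exceptions
    cost_model = n_nodes * bits_per_node                        # encode the tree itself
    return cost_data_given_model + cost_model

# A large tree with few training errors vs. a small tree with more errors:
print(mdl_cost(n_errors=2,  n_examples=1000, n_nodes=50))   # ~419.9 bits
print(mdl_cost(n_errors=15, n_examples=1000, n_nodes=5))    # ~189.5 bits -> preferred by MDL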
20
Reduced Error Pruning Properties
When pruning begins, the tree is at its maximum size and lowest accuracy over the validation set.
As pruning proceeds, the number of nodes is reduced and accuracy over the validation set increases.
At some point, the validation accuracy starts to decrease, which means the model is starting to underfit; pruning should stop there.
Disadvantage of using a validation set: wastage of training data. When data is limited, the number of samples available for training is further reduced.
21
Wastage of Training data
The problem with a validation set is that it potentially "wastes" training data on the validation set.
The severity of this problem depends on where we are on the learning curve (test accuracy vs. number of training examples).
22
Model Evaluation
23
Why?
Performance evaluation: evaluate the performance of a model.
Model comparison: compare the relative performance among competing models.
24
How to Evaluate Performance?
Focus on the predictive capability of a model, rather than how long it takes to classify or build models, scalability, etc.
Scalar measures make comparisons easy, since only a single number is involved: accuracy, expected cost.
Visualization techniques: ROC curves (with the area under the ROC curve as a scalar summary), lift chart.
25
Metrics for Performance Evaluation
Confusion Matrix

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
26
Scalar Metric 1: Accuracy
Using the confusion-matrix counts a (TP), b (FN), c (FP), d (TN):
Accuracy = (a + d) / (a + b + c + d)
Error Rate = 1 - Accuracy
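A one-function sketch of these formulas; the counts in the call are taken from the cost example later in the deck.

def accuracy(a, b, c, d):                 # a = TP, b = FN, c = FP, d = TN
    return (a + d) / (a + b + c + d)

acc = accuracy(a=150, b=40, c=60, d=250)  # model M1's counts from the cost slide
print(acc, 1 - acc)                       # 0.8 accuracy, 0.2 error rate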
27
Limitation of Accuracy
Consider a 2-class problem with 9990 examples of class 0 and 10 examples of class 1.
If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9%.
Accuracy is misleading here because the model does not detect any class 1 example.
28
Scalar Metric 2: F-Measure
Using the confusion-matrix counts a (TP), b (FN), c (FP), d (TN):
Precision (positive predictive value) = a / (a + c)
Recall = True Positive Rate = a / (a + b)
F-measure = 2 * Precision * Recall / (Precision + Recall)
False Positive Rate = c / (c + d)
Unlike accuracy, the F-measure is not dominated by the majority-class ratio.
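A short sketch of these definitions; the counts in the call are made up for illustration.

def precision_recall_f1(a, b, c, d):      # a = TP, b = FN, c = FP, d = TN
    precision = a / (a + c)               # positive predictive value
    recall = a / (a + b)                  # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# On the imbalanced 9990/10 example, a model that finds all 10 positives but also
# flags 10 negatives gets precision 0.5, recall 1.0, F-measure ~0.67:
print(precision_recall_f1(a=10, b=0, c=10, d=9980))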
29
Cost-sensitive scenarios
The error rate is an inadequate measure of the performance of an algorithm: it doesn't take into account the cost of making wrong decisions.
Example: based on a chemical analysis of the water, try to detect an oil slick in the sea.
False positive: wrongly identifying an oil slick when there is none.
False negative: failing to identify an oil slick when there is one.
Here, false negatives (environmental disasters) are much more costly than false positives (false alarms). We have to take that into account when we evaluate our model.
30
Cost Matrix

                        PREDICTED CLASS
C(i|j)                  Class=Yes    Class=No
ACTUAL    Class=Yes     C(Yes|Yes)   C(No|Yes)
CLASS     Class=No      C(Yes|No)    C(No|No)

C(i|j): cost of misclassifying a class j example as class i
31
Computing Cost of Classification
Cost Matrix C(i|j):
                     PREDICTED +   PREDICTED -
ACTUAL +                 -1           100
ACTUAL -                  1             0

Expected Cost = weighted sum of costs = sum over i,j of C(i|j) * A(i|j)

Confusion Matrix A1(i|j), Model M1:
                     PREDICTED +   PREDICTED -
ACTUAL +                 150           40
ACTUAL -                  60          250
Accuracy = 80%, Cost = 3910

Confusion Matrix A2(i|j), Model M2:
                     PREDICTED +   PREDICTED -
ACTUAL +                 250           45
ACTUAL -                   5          200
Accuracy = 90%, Cost = 4255
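The expected-cost computation is easy to reproduce; this sketch multiplies the cost matrix by each confusion matrix element-wise and sums, recovering the 3910 and 4255 figures from the slide.

import numpy as np

C  = np.array([[-1, 100],     # rows: actual +, actual -; columns: predicted +, predicted -
               [ 1,   0]])
A1 = np.array([[150,  40],    # model M1 confusion matrix
               [ 60, 250]])
A2 = np.array([[250,  45],    # model M2 confusion matrix
               [  5, 200]])

print((C * A1).sum(), (C * A2).sum())   # 3910 and 4255: M1 is cheaper despite lower accuracy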
32
Cost-Sensitive Learning
Cost-sensitive learning algorithms can use the cost matrix to try to find an optimal classifier given those costs.
In practice this can be implemented in several ways:
Simulate the costs by modifying the training distribution.
Modify the probability threshold for making a decision: if the costs are 2:1, you can move the classification threshold from 0.5 to 0.33 (see the sketch below).
Weka supports both of these methods for cost-sensitive learning.
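A sketch of the threshold-adjustment idea (not Weka's implementation): predicting positive is worthwhile when the expected cost of a false negative outweighs that of a false positive, which gives a threshold of C_FP / (C_FP + C_FN); with 2:1 costs this is 1/3 ≈ 0.33, as on the slide. The function names and example probability are illustrative.

def decision_threshold(cost_fp, cost_fn):
    """Probability of the positive class above which predicting positive is cheaper."""
    return cost_fp / (cost_fp + cost_fn)

def predict_positive(p_positive, cost_fp=1.0, cost_fn=2.0):
    return p_positive >= decision_threshold(cost_fp, cost_fn)

print(decision_threshold(1.0, 2.0))   # 0.333...
print(predict_positive(0.4))          # True: 0.4 >= 0.33, so classify as positive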
33
What’s Wrong with Scalars?
A scalar does not tell the whole story.
There are fundamentally two numbers of interest (FP and TP); a single number invariably loses some information.
How are errors distributed across the classes?
How will each classifier perform under different testing conditions (costs or class ratios other than those measured in the experiment)?
A scalar imposes a linear ordering on classifiers, which is inflexible: what we want is to identify the conditions under which each is better.
This is why visualization techniques are useful: the shape of a curve is more informative than a single number.
34
Receiver Operating Characteristic (ROC) Curves
Summarize and present the performance of any binary classification model.
Based on the model's ability to distinguish between false and true positives.
35
ROC Curve Analysis Signal Detection Technique
Traditionally used to evaluate diagnostic tests.
Now employed to identify subgroups of a population at differential risk for a specific outcome (clinical decline, treatment response).
36
ROC Analysis: Historical Development (1)
Derived from early radar in WW2 (the Battle of Britain).
Problem at hand: accurately identifying the signals on the radar scan that predict the outcome of interest (enemy planes) when there were many extraneous signals (e.g., geese).
Speaker notes: ROC is a signal detection technique (exploratory or hypothesis generating). Traditionally most of its use has been in evaluating medical tests, where the diagnostic test plays the role of the predictor and the disease is the binary outcome of interest; it is easy to understand ROC analysis through how well a diagnostic test identifies an illness, but the same technique applies to outcomes of many sorts and is not restricted to test evaluation.
37
ROC Analysis: Historical Development (2)
True positives: the radar operator interpreted the signal as enemy planes and there were enemy planes. Good result: no wasted resources.
True negatives: the radar operator said no planes and there were none. Good result: no wasted resources.
False positives: the radar operator said planes, but there were none (geese). Wasted resources.
False negatives: the radar operator said no planes, but there were planes. Bombs dropped: a very bad outcome.
38
Example: 3 classifiers

Classifier 1 (TP rate = 0.4, FP rate = 0.3):
                Predicted pos   Predicted neg
True pos             40              60
True neg             30              70

Classifier 2 (TP rate = 0.7, FP rate = 0.5):
                Predicted pos   Predicted neg
True pos             70              30
True neg             50              50

Classifier 3 (TP rate = 0.6, FP rate = 0.2):
                Predicted pos   Predicted neg
True pos             60              40
True neg             20              80
39
ROC plot for the 3 Classifiers
(ROC plot of the 3 classifiers, marking the ideal classifier at (0,1), the "always positive" point at (1,1), the chance diagonal, and the "always negative" point at (0,0).)
40
ROC Curves: more generally, ranking models produce a range of possible (FP, TP) trade-offs.
An ROC curve plots the model's true positive rate as the decision threshold is adjusted to increase the false positive rate.
It is generated by starting with the best "rule" and progressively adding more rules.
The last case is when we always predict the positive class: TP rate = 1 and FP rate = 1. Why?
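A minimal sketch of generating ROC points from ranked scores and estimating the area under the curve with the trapezoid rule; the scores and labels below are made up for illustration.

import numpy as np

def roc_points(scores, labels):
    """Sweep the threshold over the ranked scores; return (FPR, TPR) arrays."""
    order = np.argsort(-np.asarray(scores))            # best-scored examples first
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels)                            # positives captured in the top k
    fps = np.cumsum(1 - labels)                        # negatives captured in the top k
    tpr = np.concatenate(([0.0], tps / labels.sum()))
    fpr = np.concatenate(([0.0], fps / (1 - labels).sum()))
    return fpr, tpr                                    # ends at (1, 1): always predict positive

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    1,   0,   0]       # 1 = positive class
fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoid rule
print(auc)                                             # 0.8125 for this toy ranking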
41
Using ROC for Model Comparison
Neither model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
Area Under the ROC Curve (AUC): an ideal classifier has area = 1; random guessing has area = 0.5.
42
Cumulative Response Curve
More intuitive than the ROC curve.
Plots the TP rate (% of positives targeted) on the y-axis vs. the percentage of the population targeted on the x-axis.
Formed by ranking the classification "rules" from most to least accurate: start with the most accurate and plot a point, add the next most accurate, and so on, until all rules are included and all examples are covered.
Common in marketing applications.
43
Lift Chart
Generated by dividing the cumulative response curve by the baseline curve at each x-value.
A lift of 3 means that your prediction is 3x better than the baseline (guessing).
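A one-line sketch of the arithmetic: assuming the cumulative response curve says the top 20% of the ranked population contains 60% of the positives, the lift at x = 0.2 is 0.6 / 0.2 = 3. The numbers are illustrative.

def lift(cumulative_response, fraction_targeted):
    """Lift = % of positives captured / % of population targeted (the baseline)."""
    return cumulative_response / fraction_targeted

print(lift(cumulative_response=0.60, fraction_targeted=0.20))   # 3.0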
44
Learning Curve
A learning curve shows how accuracy changes with varying sample size.
It requires a sampling schedule S = {n0, n1, …, nk}, where ni is the size of a sample:
Arithmetic sampling: linear increments.
Geometric sampling: exponential increments.
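For instance (the increment sizes below are illustrative, not from the slides):

n0, k = 100, 5
arithmetic = [n0 + 200 * i for i in range(k)]    # linear increments: 100, 300, 500, 700, 900
geometric  = [n0 * 2 ** i for i in range(k)]     # exponential increments: 100, 200, 400, 800, 1600
print(arithmetic, geometric)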
45
Methods of Estimation
Problem: how to estimate the performance metrics reliably?
Holdout: reserve 2/3 for training and 1/3 for testing.
Random subsampling: repeated holdout.
Cross-validation: partition the data into k disjoint subsets.
k-fold: train on k-1 partitions, test on the remaining one.
Leave-one-out: k = n.
46
Holdout validation: Cross-validation (CV)
Partition the data into k "folds" (randomly), then run the training/test evaluation k times.
47
Cross Validation
Example: a data set with 20 instances (d1, ..., d20), 5-fold cross-validation.
In each of the 5 runs, one fold of 4 instances (d1-d4, d5-d8, d9-d12, d13-d16, d17-d20) is held out as the test set and the other 16 instances are used for training.
Compute the error rate for each fold, then compute the average error rate.
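A sketch of the whole procedure with a pluggable train/predict pair; `fit` and `predict` are placeholder callables, not functions from the slides. Setting k = len(X) gives leave-one-out, as on the next slide.

import numpy as np

def cross_val_error(X, y, fit, predict, k=5, seed=0):
    """k-fold cross-validation: average test error rate over the k held-out folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)                      # k disjoint folds
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])         # train on the other k-1 folds
        y_hat = predict(model, X[test_idx])             # evaluate on the held-out fold
        errors.append(np.mean(y_hat != y[test_idx]))
    return float(np.mean(errors))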
48
Leave-one-out Cross Validation
Leave-one-out cross-validation is simply k-fold cross-validation with k set to n, the number of instances in the data set.
The test set consists of only a single instance, which will be classified either correctly or incorrectly.
Advantages: maximal use of training data (training on n-1 instances); the procedure is deterministic, with no sampling involved.
Disadvantage: infeasible for large data sets, since the large number of training runs required carries a high computational cost.
49
Multiple Comparisons
Beware the multiple comparisons problem.
If you flip 1000 fair coins many times each, one of them will have come up with more heads than tails, but it's still a fair coin.
The example in "Data Science for Business" is telling (see the simulation below):
Create 1000 stock funds by randomly choosing stocks.
See how they do and liquidate all but the top 3.
Now you can report that these top 3 funds perform very well (and hence you might infer they will in the future). But the stocks were randomly picked!
If you generate large numbers of models, the ones that do really well may just be due to luck or statistical variation.
If you picked the top fund after this weeding-out process, then evaluated it over the next year and reported that performance, that would be fair.
Note: stock funds actually use this trick. If a stock fund does poorly at the start it is likely to be terminated, while good ones will not be.
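A small simulation (my own illustration, not from the book) of the random-funds story: 1000 "funds" that are pure coin flips, where the best few look impressive in the selection period but revert to chance on fresh data.

import numpy as np

rng = np.random.default_rng(0)
selection = rng.integers(0, 2, size=(1000, 250)).mean(axis=1)   # 250 coin-flip "days" per fund
top3 = np.argsort(selection)[-3:]                               # liquidate all but the top 3
holdout = rng.integers(0, 2, size=(1000, 250)).mean(axis=1)     # fair evaluation on new data
print(selection[top3])   # well above 0.5 -- purely by luck
print(holdout[top3])     # back near 0.5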