CSE 4705 Artificial Intelligence


CSE 4705 Artificial Intelligence Jinbo Bi Department of Computer Science & Engineering http://www.engr.uconn.edu/~jinbo

Machine learning (1) Supervised learning algorithms

A bit review of last class Fisher's discriminant analysis. Assuming each class follows a Gaussian distribution, X|Y = k ~ N(μ_k, Σ_k), we can use the maximum a posteriori rule to derive a quadratic classification rule (quadratic in terms of x): C(X) = arg min_k { (X − μ_k)' Σ_k^{-1} (X − μ_k) + log|Σ_k| }. If we further assume that the classes share the same covariance matrix, Σ_k = Σ, the discriminant rule becomes linear (linear discriminant analysis, or LDA; Fisher's LDA in the two-class case). We then gave a separate derivation of the linear discriminant rule using a geometric interpretation.

A bit review of last class Using the geometric interpretation, we maximize the signal-to-noise ratio (between-class separation over within-class cohesion), which gives exactly the same discriminant rule. Hence, we can conclude that Fisher's linear discriminant rule finds the direction along which the data shows the best separability.
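
As a concrete companion to this geometric view, here is a minimal sketch (not from the lecture; the function name, toy data, and labels are assumptions for illustration) of computing the Fisher direction w ∝ S_W^{-1}(μ_1 − μ_0) for two classes:

```python
import numpy as np

def fisher_direction(X, y):
    """Fisher discriminant direction w ~ S_W^{-1} (mu1 - mu0) for labels 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices
    S_w = np.cov(X0, rowvar=False) * (len(X0) - 1) \
        + np.cov(X1, rowvar=False) * (len(X1) - 1)
    return np.linalg.solve(S_w, mu1 - mu0)

# Toy example: two Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([3, 1], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
w = fisher_direction(X, y)
print("Fisher direction:", w / np.linalg.norm(w))
```

Projecting the data onto w and thresholding the projections reproduces the linear discriminant rule discussed above.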

Geometric interpretation (two-class) The geometric interpretation version of linear discriminant analysis can be extended to multi-class classification (not covered in class, and interested students can see additional slides in last lecture)

Supervised learning: classification Underfitting or overfitting can also happen in classification approaches. We now illustrate these practical issues on classification problems. We will cover neural networks next week.

Underfitting and overfitting 500 circular and 500 triangular data points. Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1. Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5.

Overfitting 500 circular and 500 triangular data points. Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1. Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5.

Overfitting due to noise The decision boundary is distorted by noise points.

Overfitting due to insufficient examples Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region. The insufficient number of training records in the region causes the neural net to predict the test examples using other training records that are irrelevant to the classification task.

Notes on over-fitting Overfitting results in classifiers (a neural net, or a support vector machine) that are more complex than necessary. Even if the training error is small, the test error can be large. The training error no longer provides a good estimate of how well the classifier will perform on previously unseen records. We need new ways of estimating errors; again, regularized errors are one kind of solution.

Regularization and Occam's razor Regularization means adding a penalty on the model complexity, such as a 2-norm or a 1-norm of the parameter vector. It is fundamentally similar to Occam's razor: given two models with similar test errors, one should prefer the simpler model over the more complex one. For complex models, there is a greater chance that the fit is accidental, driven by noise in the data. Therefore, one should account for model complexity when selecting the best model.
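
To make the penalty idea concrete, here is a minimal sketch (an illustration under my own naming, not the slides' notation) of regularized least squares, where lam weights model complexity against training error:

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Regularized least squares: training error plus an L2 penalty
    on the weight vector (larger lam -> simpler, smoother models)."""
    residual = X @ w - y
    return residual @ residual + lam * (w @ w)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of the objective above: (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Setting lam = 0 recovers ordinary least squares; increasing lam shrinks the weights towards zero, trading a little training error for a simpler model.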


How to address over-fitting Minimizing the training error no longer guarantees a good model (a classifier or a regressor). We need a better estimate of the error on the true population, the generalization error P_population( f(x) ≠ y ). In practice, we design a procedure that gives a better estimate of the error than the training error. In theoretical analysis, we find an analytical bound on the generalization error or use Bayesian formulas.

How to evaluate a model or an algorithm Even if we use regularization or other schemes to overcome overfitting, how do we know whether the resulting model is better or worse? This motivates us to study model evaluation.

Model evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models?

Metric for performance evaluation Regression (accuracy-related performance): sum of squared errors, root mean square error (RMSE), sum of (absolute) deviations, exponential functions of the deviation.
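
For illustration (a minimal sketch, not from the slides; y_true and y_pred are assumed arrays of targets and predictions), two of these regression metrics look like:

```python
import numpy as np

def sum_of_squared_errors(y_true, y_pred):
    # SSE = sum_i (y_i - yhat_i)^2
    return float(np.sum((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    # RMSE = sqrt( mean_i (y_i - yhat_i)^2 )
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```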

Metric for performance evaluation Classification Focus on the predictive capability of a model, rather than how long it takes to classify or build models, scalability, etc. Confusion matrix (for two-class classification):

                           PREDICTED CLASS
                           Class=Yes    Class=No
ACTUAL CLASS   Class=Yes   a            b
               Class=No    c            d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metric for performance evaluation Most widely-used metric: accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN), computed from the confusion matrix

                           PREDICTED CLASS
                           Class=Yes    Class=No
ACTUAL CLASS   Class=Yes   a (TP)       b (FN)
               Class=No    c (FP)       d (TN)
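
A minimal sketch (illustration only; the argument names follow the table above) of computing accuracy from the four counts:

```python
def accuracy(a, b, c, d):
    """Accuracy from confusion-matrix counts: TP=a, FN=b, FP=c, TN=d."""
    return (a + d) / (a + b + c + d)

# The imbalanced example on the next slide: predict everything as the majority class
print(accuracy(a=0, b=10, c=0, d=9990))  # 0.999, yet no minority-class example is detected
```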

Limitation of accuracy Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10. If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%. Accuracy is misleading because the model does not detect any Class 1 example.

Cost matrix C(i|j): cost of misclassifying a class j example as class i

                           PREDICTED CLASS
C(i|j)                     Class=Yes     Class=No
ACTUAL CLASS   Class=Yes   C(Yes|Yes)    C(No|Yes)
               Class=No    C(Yes|No)     C(No|No)

Cost function of classification

Cost matrix:
                       PREDICTED CLASS
C(i|j)                 +       -
ACTUAL CLASS   +       -1      100
               -       1       0

Model M1:
                       PREDICTED CLASS
                       +       -
ACTUAL CLASS   +       150     40
               -       60      250
Accuracy = 80%, Cost = 3910

Model M2:
                       PREDICTED CLASS
                       +       -
ACTUAL CLASS   +       250     45
               -       5       200
Accuracy = 90%, Cost = 4255
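
As a sketch (the helper below is an assumption for illustration), the costs above can be reproduced by summing the element-wise product of the confusion-matrix counts and the per-cell costs:

```python
import numpy as np

def total_cost(confusion, costs):
    """Weighted cost of a model: rows are actual classes (+, -),
    columns are predicted classes (+, -)."""
    return int(np.sum(confusion * costs))

costs = np.array([[-1, 100],    # actual +: cost of predicting +, predicting -
                  [ 1,   0]])   # actual -: cost of predicting +, predicting -
m1 = np.array([[150, 40], [60, 250]])
m2 = np.array([[250, 45], [ 5, 200]])
print(total_cost(m1, costs), total_cost(m2, costs))  # 3910 4255
```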

Cost versus accuracy

Count:
                           PREDICTED CLASS
                           Class=Yes    Class=No
ACTUAL CLASS   Class=Yes   a            b
               Class=No    c            d

Cost:
                           PREDICTED CLASS
                           Class=Yes    Class=No
ACTUAL CLASS   Class=Yes   p            q
               Class=No    q            p

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

N = a + b + c + d
Accuracy = (a + d) / N
Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N − a − d)
     = q N − (q − p)(a + d)
     = N [q − (q − p) × Accuracy]

Cost-sensitive measures

Count:
                           PREDICTED CLASS
                           Class=Yes    Class=No
ACTUAL CLASS   Class=Yes   a            b
               Class=No    c            d

Precision p = a / (a + c); Recall r = a / (a + b). Precision is biased towards C(Yes|Yes) & C(Yes|No); recall is biased towards C(Yes|Yes) & C(No|Yes). A model that declares every record to be the positive class has b = d = 0, so its recall is high. A model that assigns the positive class only to the (sure) test records has small c, so its precision is high.

Cost-sensitive measures

Count:
                           PREDICTED CLASS
                           Class=Yes    Class=No
ACTUAL CLASS   Class=Yes   a (TP)       b (FN)
               Class=No    c (FP)       d (TN)

Precision p = a / (a + c)
Recall r = a / (a + b)
F-measure F = 2 r p / (r + p) = 2a / (2a + b + c)
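
A minimal sketch (illustration only, not the slides' code) computing the three measures from the confusion-matrix counts; the example numbers are model M1 from the cost slide above:

```python
def precision_recall_f(a, b, c, d):
    """Precision, recall, and F-measure from counts TP=a, FN=b, FP=c, TN=d."""
    p = a / (a + c)          # fraction of predicted positives that are correct
    r = a / (a + b)          # fraction of actual positives that are found
    f = 2 * r * p / (r + p)  # harmonic mean of precision and recall
    return p, r, f

print(precision_recall_f(a=150, b=40, c=60, d=250))
```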

A more comprehensive metric When a classifier outputs a membership probability (or a numerical score) of a class for each example (and most classifiers do), another, more comprehensive metric can be used: the receiver operating characteristic curve, or ROC curve for short.

ROC curve

                           PREDICTED CLASS
                           Class=Yes    Class=No
ACTUAL CLASS   Class=Yes   a (TP)       b (FN)
               Class=No    c (FP)       d (TN)

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)

At threshold t: TP = 50, FN = 50, FP = 12, TN = 88, so TPR = 0.50 and FPR = 0.12.

ROC curve Points (TPR, FPR) on the curve: (0,0): declare everything to be the negative class (TP = 0, FP = 0); (1,1): declare everything to be the positive class (FN = 0, TN = 0); (1,0): ideal (FN = 0, FP = 0). As before, with counts a (TP), b (FN), c (FP), d (TN), TPR = TP/(TP+FN) and FPR = FP/(FP+TN).

ROC curve (TPR,FPR): (0,0): declare everything to be negative class (1,1): declare everything to be positive class (1,0): ideal Diagonal line: Random guessing Below diagonal line: prediction is opposite of the true class

How to construct an ROC curve Use a classifier that produces a posterior probability P(+|X) for each test instance X. Sort the instances according to P(+|X) in decreasing order, apply a threshold at each unique value of P(+|X), and count the number of TP, FP, TN, FN at each threshold. TP rate TPR = TP/(TP+FN); FP rate FPR = FP/(FP+TN).

Instance   P(+|X)   True Class
1          0.95     +
2          0.93     +
3          0.87     -
4          0.85     -
5          0.85     -
6          0.85     +
7          0.76     -
8          0.53     +
9          0.43     -
10         0.25     +

How to construct an ROC curve Use a classifier that produces a posterior probability P(+|X) for each test instance X, and sort the instances according to P(+|X) in decreasing order (table as on the previous slide). Pick a threshold, e.g. 0.85: instances with p >= 0.85 are predicted positive (P), instances with p < 0.85 are predicted negative (N). Then TP = 3, FP = 3, TN = 2, FN = 2, so TP rate TPR = 3/5 = 60% and FP rate FPR = 3/5 = 60%.
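
A minimal sketch (illustration only; function and variable names are assumptions) of this construction, sweeping a threshold over each unique P(+|X) and recording one (FPR, TPR) point per threshold; the data is the ten-instance example above:

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs from thresholding at each unique score.
    labels: 1 for the positive class, 0 for the negative class."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    P = labels.sum()                     # number of actual positives
    N = len(labels) - P                  # number of actual negatives
    points = [(0.0, 0.0)]                # declare everything negative
    for t in sorted(np.unique(scores), reverse=True):
        pred = scores >= t               # predict positive when P(+|X) >= threshold
        tp = int(np.sum(pred & (labels == 1)))
        fp = int(np.sum(pred & (labels == 0)))
        points.append((fp / N, tp / P))
    return points

p_plus = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
truth  = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]
for fpr, tpr in roc_points(p_plus, truth):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")   # includes (0.60, 0.60) at threshold 0.85
```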

How to construct an ROC curve Applying the threshold at each unique value of P(+|X) yields one (FPR, TPR) pair per threshold; plotting these pairs, from (0,0) up to (1,1), traces out the ROC curve.

Model evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models?

Methods for performance evaluation How to obtain a reliable estimate of performance? Performance of a model may depend on other factors besides the learning algorithm: Class distribution Cost of misclassification Size of training and test sets


Methods of estimation Holdout: for instance, reserve 2/3 for training and 1/3 for testing. Random subsampling: repeated holdout. Cross validation: partition the data into k disjoint subsets; k-fold: train on k−1 partitions, test on the remaining one; leave-one-out: k = n. Stratified sampling: oversampling vs. undersampling. Bootstrap: sampling with replacement.

Methods of estimation Holdout method: the given data is randomly partitioned into two independent sets, a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation. Random subsampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained. Cross-validation (k-fold, where k = 10 is most popular): randomly partition the data into k mutually exclusive subsets D_1, ..., D_k, each of approximately equal size; at the i-th iteration, use D_i as the test set and the others as the training set. Leave-one-out: k folds where k = number of tuples, for small data sets. Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
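
As a sketch (the train_and_score helper and its signature are assumptions for illustration), k-fold cross-validation can be written by shuffling the indices once and rotating which fold is held out:

```python
import numpy as np

def k_fold_accuracy(X, y, train_and_score, k=10, seed=0):
    """Average test accuracy over k folds.
    train_and_score(X_tr, y_tr, X_te, y_te) is assumed to fit a model on the
    training split and return its accuracy on the test split."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)               # k mutually exclusive subsets
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```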

Methods of estimation Bootstrap: works well with small data sets. It samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. There are several bootstrap methods; a common one is the .632 bootstrap. Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 − 1/d)^d ≈ e^{-1} ≈ 0.368). Repeat the sampling procedure k times; the overall accuracy of the model is Acc(M) = (1/k) Σ_{i=1..k} [ 0.632 × Acc(M_i) on the test set + 0.368 × Acc(M_i) on the training set ].
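
A minimal sketch of the .632 bootstrap estimate (illustration only; it reuses the assumed train_and_score helper from the cross-validation sketch):

```python
import numpy as np

def bootstrap_632(X, y, train_and_score, k=100, seed=0):
    """.632 bootstrap accuracy estimate: sample d points with replacement as the
    training set, use the left-out points as the test set, repeat k times."""
    rng = np.random.default_rng(seed)
    d = len(y)
    accs = []
    for _ in range(k):
        train_idx = rng.integers(0, d, size=d)             # sample with replacement
        test_idx = np.setdiff1d(np.arange(d), train_idx)   # ~36.8% left out
        acc_test = train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx])
        acc_train = train_and_score(X[train_idx], y[train_idx], X[train_idx], y[train_idx])
        accs.append(0.632 * acc_test + 0.368 * acc_train)
    return float(np.mean(accs))
```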

Model evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models?

Methods for model comparison Compute a specific accuracy metric for all available models using the same test dataset, and compare the values. For instance, compute RMSE for two regression models: RMSE(M1) = 0.5, RMSE(M2) = 0.3. Which one is better? For instance (classification), compute sensitivity and specificity: Sen(M1) = 0.7, Spec(M1) = 0.8; Sen(M2) = 0.6, Spec(M2) = 0.65.

Using ROC for model comparison No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR. Area Under the ROC Curve (AUC): ideal classifier has area = 1; random guessing has area = 0.5.
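
As a short sketch (illustration only; the function name is an assumption, and the input can be the output of the roc_points sketch earlier), AUC can be approximated with the trapezoid rule over the ROC points:

```python
import numpy as np

def auc(points):
    """Area under the ROC curve via the trapezoid rule.
    points: (FPR, TPR) pairs; (1, 1) is added so the curve ends at the top-right."""
    pts = sorted(set(points) | {(1.0, 1.0)})
    fpr = np.array([p[0] for p in pts])
    tpr = np.array([p[1] for p in pts])
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

print(auc([(0.0, 0.0), (0.0, 1.0)]))   # ideal classifier: 1.0
print(auc([(0.0, 0.0), (0.5, 0.5)]))   # random guessing (diagonal): 0.5
```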

Questions?