CSE 4705 Artificial Intelligence

CSE 4705 Artificial Intelligence Jinbo Bi Department of Computer Science & Engineering http://www.engr.uconn.edu/~jinbo

Tasks in Machine Learning/Data Mining
Prediction tasks (supervised learning problem): classification, regression, ranking. Use some variables to predict unknown or future values of other variables.
Description tasks (unsupervised learning problem): cluster analysis, novelty detection. Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Classification: Definition Given a collection of examples (training set ) Each example contains a set of attributes, one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen examples should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.

Classification Example [Figure: a training set with two categorical attributes, one continuous attribute, and a class label is used to learn a classifier model, which is then applied to a test set.]

Classification: Application 1 High-Risk Patient Detection Goal: Predict whether a patient will suffer a major complication after a surgical procedure. Approach: Use patients' vital signs before and after the operation (heart rate, respiratory rate, etc.). Have expert medical professionals monitor the patients and label which patients had complications and which did not. Learn a model for the class of after-surgery risk. Use this model to detect potential high-risk patients for a particular surgical procedure.

Classification: Application 2 Face recognition Goal: Predict the identity of a face image Approach: Align all images to derive the features Model the class (identity) based on these features

Classification: Application 3 Cancer Detection Goal: To predict class (cancer or normal) of a sample (person), based on the microarray gene expression data Approach: Use expression levels of all genes as the features Label each example as cancer or normal Learn a model for the class of all samples

Classification: Application 4 Alzheimer's Disease Detection Goal: To predict class (AD or normal) of a sample (person), based on neuroimaging data such as MRI and PET Approach: Extract features from neuroimages Label each example as AD or normal Learn a model for the class of all samples Reduced gray matter volume (colored areas) detected by MRI voxel-based morphometry in AD patients compared to normal healthy controls.

Regression Predict a value of a real-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Extensively studied in statistics, neural network fields. Find a model to predict the dependent variable as a function of the values of independent variables. Goal: previously unseen examples should be predicted as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.

Regression application 1
Goal: predict the possible loss from a customer (continuous target). Past transaction records are labeled and used as the training set to learn a regressor model; the model is then applied to current data (test set).
Attributes: Refund (categorical), Marital Status (categorical), Taxable Income (continuous). Target: Loss (continuous).

Tid  Refund  Marital Status  Taxable Income  Loss
1    Yes     Single          125K            100
2    No      Married         100K            120
3                            70K             -200
4                            120K            -300
5            Divorced        95K             -400
6                            60K             -500
7                            220K            -190
8                            85K             300
9                            75K             -240
10                           90K             90

Regression applications Examples: Predicting sales amounts of new product based on advertising expenditure. Predicting wind velocities as a function of temperature, humidity, air pressure, etc. Time series prediction of stock market indices.

Clustering Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance if attributes are continuous. Other Problem-specific Measures

Illustrating Clustering Euclidean Distance Based Clustering in 3-D space. Intracluster distances are minimized Intercluster distances are maximized
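As a small illustration of the Euclidean similarity measure, the sketch below compares an intracluster distance with an intercluster distance in 3-D. The points and the names `euclidean`, `cluster_a`, and `cluster_b` are made up for this example and do not come from the slides.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical 3-D points forming two well-separated clusters.
cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cluster_b = [(10.0, 10.0, 10.0), (10.0, 11.0, 10.0)]

intra = euclidean(cluster_a[0], cluster_a[1])  # distance inside one cluster
inter = euclidean(cluster_a[0], cluster_b[0])  # distance across clusters

print(intra, inter)
```

A good clustering keeps distances like `intra` small and distances like `inter` large.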

Clustering: Application 1 High Risky Patient Detection Goal: Predict if a patient will suffer major complication after a surgery procedure Approach: Use patients vital signs before and after surgical operation. Heart Rate, Respiratory Rate, etc. Find patients whose symptoms are dissimilar from most of other patients.

Clustering: Application 2 Document Clustering: Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Illustrating Document Clustering Clustering Points: 3204 Articles of Los Angeles Times. Similarity Measure: How many words are common in these documents (after some word filtering).

Algorithms to solve these problems

Classification algorithms K-Nearest-Neighbor classifiers Naïve Bayes classifier Neural Networks Linear Discriminant Analysis (LDA) Support Vector Machines (SVM) Decision Trees Logistic Regression Graphical models

Regression methods Linear Regression Ridge Regression LASSO – Least Absolute Shrinkage and Selection Operator Neural Networks

Clustering algorithms K-Means Hierarchical clustering Graph-based clustering (Spectral clustering) Semi-supervised clustering Others

Challenges of Big Data Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation

Our Focus Supervised learning Classification (support vector machine) Regression (backpropagation neural networks) Before talking about the techniques, let us first understand how a learning model is evaluated.

Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models?

Metrics for Performance Evaluation Regression Sum of squares Sum of deviation Exponential function of the deviation
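The first two regression metrics can be sketched as follows. This assumes "sum of deviation" means the sum of absolute deviations between targets and predictions; the function names and the sample predictions are hypothetical.

```python
def sum_of_squares(y_true, y_pred):
    """Sum of squared deviations between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def sum_of_absolute_deviation(y_true, y_pred):
    """Sum of absolute deviations (assumed reading of 'sum of deviation')."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical targets and model predictions.
y_true = [100, 120, -200]
y_pred = [90, 125, -180]

print(sum_of_squares(y_true, y_pred))             # 100 + 25 + 400 = 525
print(sum_of_absolute_deviation(y_true, y_pred))  # 10 + 5 + 20 = 35
```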

Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
Confusion Matrix:

                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a (TP)                b (FN)
ACTUAL Class=No     c (FP)                d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metrics for Performance Evaluation…
Most widely-used metric: Accuracy = (a + d)/(a + b + c + d) = (TP + TN)/(TP + TN + FP + FN)

                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a (TP)                b (FN)
ACTUAL Class=No     c (FP)                d (TN)

Limitation of Accuracy Consider a 2-class problem Number of Class 0 examples = 9990 Number of Class 1 examples = 10 If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % Accuracy is misleading because model does not detect any class 1 example
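The imbalance example can be checked directly. This is a minimal sketch using the class counts from the slide; the variable names are chosen for illustration.

```python
# 9990 examples of class 0 and 10 examples of class 1, as on the slide.
y_true = [0] * 9990 + [1] * 10
# A trivial model that predicts class 0 for every example.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)  # 0.999
print(detected)  # 0 class 1 examples detected
```

High accuracy and zero detected class 1 examples occur together, which is why accuracy alone is misleading here.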

Cost Matrix

C(i|j)              PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    C(Yes|Yes)            C(No|Yes)
ACTUAL Class=No     C(Yes|No)             C(No|No)

C(i|j): cost of misclassifying a class j example as class i

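Computing Cost of Classification

Cost Matrix C(i|j)   PREDICTED +   PREDICTED -
ACTUAL +             -1            100
ACTUAL -             1             0

Model M1             PREDICTED +   PREDICTED -
ACTUAL +             150           40
ACTUAL -             60            250
Accuracy = 80%, Cost = 3910

Model M2             PREDICTED +   PREDICTED -
ACTUAL +             250           45
ACTUAL -             5             200
Accuracy = 90%, Cost = 4255

M2 is more accurate than M1 but has the higher cost, because it makes more of the expensive false negative errors.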
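The totals for M1 and M2 can be reproduced with a few lines. This sketch assumes C(-|-) = 0, an entry that is not legible in the transcript but that reproduces the costs stated on the slide; the dictionary layout and function names are illustrative.

```python
# Keys are (actual class, predicted class).
# C(-|-) = 0 is an assumption; it reproduces the slide's totals.
COST = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}

M1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
M2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}


def total_cost(counts, cost):
    """Sum of count * cost over all confusion matrix cells."""
    return sum(n * cost[cell] for cell, n in counts.items())


def accuracy(counts):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(n for (actual, pred), n in counts.items() if actual == pred)
    return correct / sum(counts.values())


print(accuracy(M1), total_cost(M1, COST))  # 0.8 3910
print(accuracy(M2), total_cost(M2, COST))  # 0.9 4255
```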

Cost vs Accuracy

Count               PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a                     b
ACTUAL Class=No     c                     d

Cost                PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    p                     q
ACTUAL Class=No     q                     p

N = a + b + c + d
Accuracy = (a + d)/N
Cost is a linear function of accuracy if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p
Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N – a – d)
     = q N – (q – p)(a + d)
     = N [q – (q – p) × Accuracy]

Cost-Sensitive Measures

Count               PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a                     b
ACTUAL Class=No     c                     d

Precision (p) = a/(a + c)
Recall (r) = a/(a + b)
Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
A model that declares every record to be the positive class: b = d = 0, so recall is high
A model that assigns the positive class only to the (sure) test records: c is small, so precision is high

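Cost-Sensitive Measures (Cont’d)

Count               PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a                     b
ACTUAL Class=No     c                     d

F-measure (F) = 2rp/(r + p) = 2a/(2a + b + c), where p is precision and r is recall
F-measure is biased towards all except C(No|No)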
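Precision, recall, and F-measure follow directly from the counts a, b, and c. The sketch below uses the standard definitions and borrows the counts of model M1 from the cost example (a = 150, b = 40, c = 60) purely as sample input; the function names are illustrative.

```python
def precision(a, c):
    """a = true positives, c = false positives."""
    return a / (a + c)


def recall(a, b):
    """a = true positives, b = false negatives."""
    return a / (a + b)


def f_measure(a, b, c):
    """Harmonic mean of precision and recall; d (true negatives) is unused."""
    return 2 * a / (2 * a + b + c)


a, b, c = 150, 40, 60  # sample counts (model M1 from the cost example)
p, r = precision(a, c), recall(a, b)
f = f_measure(a, b, c)

print(round(p, 3), round(r, 3), f)  # 0.714 0.789 0.75
```

The count d never appears in any of the three functions, which matches the remark that the F-measure ignores C(No|No).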

Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models?

Methods for Performance Evaluation How to obtain a reliable estimate of performance? Performance of a model may depend on other factors besides the learning algorithm: Class distribution Cost of misclassification Size of training and test sets

Learning Curve Learning curve shows how accuracy changes with varying sample size Requires a sampling schedule for creating learning curve: Arithmetic sampling (Langley, et al) Geometric sampling (Provost et al) Effect of small sample size: Bias in the estimate Variance of estimate

Methods of Estimation
Holdout: reserve 2/3 for training and 1/3 for testing
Random subsampling: repeated holdout
Cross validation: partition data into k disjoint subsets
  k-fold: train on k-1 partitions, test on the remaining one
  Leave-one-out: k = n
Stratified sampling: oversampling vs undersampling
Bootstrap: sampling with replacement

A Useful Link http://dlib.net/ml_guide.svg

Methods of Estimation (Cont’d)
Holdout method: the given data is randomly partitioned into two independent sets
  Training set (e.g., 2/3) for model construction
  Test set (e.g., 1/3) for accuracy estimation
  Random sampling, a variation of holdout: repeat holdout k times, accuracy = average of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
  Randomly partition the data into k mutually exclusive subsets D_1, ..., D_k, each of approximately equal size
  At the i-th iteration, use D_i as the test set and the others as the training set
  Leave-one-out: k folds where k = number of tuples, for small data sets
  Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as in the initial data
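The k-fold partitioning step can be sketched with the standard library alone. The helper name `k_fold_indices` and the fixed seed are assumptions for this illustration; stratification is not handled here.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation on n examples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)        # random partition
    folds = [indices[i::k] for i in range(k)]   # k disjoint, near-equal subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print(sorted(test), len(train))
```

Setting k equal to n gives leave-one-out, since every fold then holds a single example.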

Methods of Estimation (Cont’d)
Bootstrap
  Works well with small data sets
  Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
  Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set form the test set.
  About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set, since (1 – 1/d)^d ≈ e^(–1) ≈ 0.368
  Repeat the sampling procedure k times; the overall accuracy of the model is
  Acc(M) = (1/k) × Σ_{i=1..k} [0.632 × Acc(M_i) on the test set + 0.368 × Acc(M_i) on the training set]
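The 63.2% figure is easy to check empirically. The sketch below draws one bootstrap sample from d = 10000 indices and compares the fraction of distinct indices drawn with 1 – (1 – 1/d)^d; the helper name `bootstrap_split`, the seed, and the value of d are arbitrary choices for illustration.

```python
import random


def bootstrap_split(d, rng):
    """Sample d indices with replacement; unsampled indices form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test


rng = random.Random(0)
d = 10000
train, test = bootstrap_split(d, rng)

frac_in_bootstrap = len(set(train)) / d  # expected to be close to 0.632
expected = 1 - (1 - 1 / d) ** d          # 1 - e^(-1) in the limit

print(round(frac_in_bootstrap, 3), round(expected, 3))
```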

Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models?

ROC (Receiver Operating Characteristic) Developed in 1950s for signal detection theory to analyze noisy signals Characterize the trade-off between positive hits and false alarms ROC curve plots TPR (on the y-axis) against FPR (on the x-axis) Performance of each classifier represented as a point on the ROC curve If the classifier returns a real-valued prediction, changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point

ROC Curve

                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a (TP)                b (FN)
ACTUAL Class=No     c (FP)                d (TN)

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)
At threshold t: TP=50, FN=50, FP=12, TN=88, so TPR = 50/100 = 0.5 and FPR = 12/100 = 0.12

ROC Curve
Points written as (TPR,FPR):
(0,0): declare everything to be negative class (TP = 0, FP = 0)
(1,1): declare everything to be positive class (FN = 0, TN = 0)
(1,0): ideal (FN = 0, FP = 0)

                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a (TP)                b (FN)
ACTUAL Class=No     c (FP)                d (TN)

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)

ROC Curve (TPR,FPR): (0,0): declare everything to be negative class (1,1): declare everything to be positive class (1,0): ideal Diagonal line: Random guessing Below diagonal line: prediction is opposite of the true class

How to Construct an ROC curve
Use a classifier that produces a posterior probability P(+|A) for each test instance
Sort the instances according to P(+|A) in decreasing order
Apply a threshold at each unique value of P(+|A)
Count the number of TP, FP, TN, FN at each threshold
TP rate, TPR = TP/(TP+FN)
FP rate, FPR = FP/(FP+TN)

Instance  P(+|A)  True Class
1         0.95    +
2         0.93
3         0.87    -
4         0.85
5
6
7         0.76
8         0.53
9         0.43
10        0.25

How to Construct an ROC curve
Use a classifier that produces a posterior probability P(+|A) for each test instance
Sort the instances according to P(+|A) in decreasing order
Pick a threshold of 0.85: instances with P(+|A) >= 0.85 are predicted positive, instances with P(+|A) < 0.85 are predicted negative
TP = 3, FP = 3, TN = 2, FN = 2
TP rate, TPR = TP/(TP+FN) = 3/5 = 60%
FP rate, FPR = FP/(FP+TN) = 3/5 = 60%

How to construct an ROC curve [Figure: counts at each threshold (Threshold >=) and the resulting ROC curve]
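The construction steps can be sketched as below. Several scores and class labels are missing from the transcript, so the `scores` and `labels` lists are assumed values, chosen only to agree with the counts given for threshold 0.85 (TP = 3, FP = 3, TN = 2, FN = 2); the function name `roc_points` is also illustrative.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per unique threshold, starting at (0, 0)."""
    pos = labels.count("+")
    neg = labels.count("-")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted +
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
        points.append((fp / neg, tp / pos))
    return points


# Assumed data: values not legible in the transcript are filled in for illustration.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

for fpr, tpr in roc_points(scores, labels):
    print(fpr, tpr)
```

Each point is stored as (FPR, TPR) so that it can be plotted directly with FPR on the x-axis and TPR on the y-axis.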

Using ROC for Model Comparison No model consistently outperforms the other M1 is better for small FPR M2 is better for large FPR Area Under the ROC curve (AUC) Ideal: Area = 1 Random guess: Area = 0.5
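The area under an ROC curve can be approximated with the trapezoidal rule over its (FPR, TPR) points. This is a minimal sketch; the function name `auc` and the two reference curves are written for illustration.

```python
def auc(points):
    """Trapezoidal area under a curve given as (FPR, TPR) points sorted by FPR."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


ideal = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # passes through TPR = 1, FPR = 0
random_guess = [(0.0, 0.0), (1.0, 1.0)]        # the diagonal line

print(auc(ideal))         # 1.0
print(auc(random_guess))  # 0.5
```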

Data normalization
Example-wise normalization: each example is normalized and mapped to the unit sphere
Feature-wise normalization
  [0,1]-normalization: normalize each feature into a unit space
  Standard normalization: normalize each feature to have mean 0 and standard deviation 1
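The three normalizations can be sketched with plain lists. The function names are illustrative, the standard deviation is assumed to be the population one (dividing by n), and the sample income values are taken from the regression example table.

```python
import math


def example_wise(x):
    """Scale one example (a feature vector) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm else list(x)


def min_max(column):
    """[0,1]-normalization of one feature across all examples."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]


def standardize(column):
    """Shift and scale one feature to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std if std else 0.0 for v in column]


print(example_wise([3.0, 4.0]))  # [0.6, 0.8]
incomes = [125.0, 100.0, 70.0, 220.0]  # taxable income in thousands
print(min_max(incomes))
print(standardize(incomes))
```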