CSE 4705 Artificial Intelligence Jinbo Bi Department of Computer Science & Engineering http://www.engr.uconn.edu/~jinbo
Tasks in Machine Learning/Data Mining
Prediction tasks (supervised learning problems): classification, regression, ranking. Use some variables to predict unknown or future values of other variables.
Description tasks (unsupervised learning problems): cluster analysis, novelty detection. Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Classification: Definition Given a collection of examples (training set ) Each example contains a set of attributes, one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen examples should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
Classification Example [Figure: a training set with two categorical attributes, one continuous attribute, and a class attribute is used to learn a classifier model, which is then applied to a test set.]
Classification: Application 1 High-Risk Patient Detection Goal: Predict whether a patient will suffer a major complication after a surgical procedure. Approach: Use patients' vital signs before and after the operation (heart rate, respiratory rate, etc.). Have expert medical professionals monitor patients and label which patients had complications and which did not. Learn a model for the class of after-surgery risk. Use this model to detect potential high-risk patients for a particular surgical procedure.
Classification: Application 2 Face recognition Goal: Predict the identity of a face image Approach: Align all images to derive the features Model the class (identity) based on these features
Classification: Application 3 Cancer Detection Goal: To predict class (cancer or normal) of a sample (person), based on the microarray gene expression data Approach: Use expression levels of all genes as the features Label each example as cancer or normal Learn a model for the class of all samples
Classification: Application 4 Alzheimer's Disease Detection Goal: To predict class (AD or normal) of a sample (person), based on neuroimaging data such as MRI and PET Approach: Extract features from neuroimages Label each example as AD or normal Learn a model for the class of all samples Reduced gray matter volume (colored areas) detected by MRI voxel-based morphometry in AD patients compared to normal healthy controls.
Regression Predict the value of a real-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Extensively studied in statistics and in the neural network field. Find a model to predict the dependent variable as a function of the values of the independent variables. Goal: previously unseen examples should be predicted as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Regression application 1
Goal: Predict the possible loss from a customer.
Training set: past transaction records, labeled with the loss. Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Loss (continuous target).

Tid  Refund  Marital Status  Taxable Income  Loss
1    Yes     Single          125K            100
2    No      Married         100K            120
3                            70K             -200
4                            120K            -300
5            Divorced        95K             -400
6                            60K             -500
7                            220K            -190
8                            85K             300
9                            75K             -240
10                           90K             90

[Figure: the training set is used to learn a regressor model, which is then applied to a test set of current data for which the loss is to be predicted.]
Regression applications Examples: Predicting sales amounts of new product based on advertising expenditure. Predicting wind velocities as a function of temperature, humidity, air pressure, etc. Time series prediction of stock market indices.
Clustering Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance if attributes are continuous. Other Problem-specific Measures
Illustrating Clustering Euclidean Distance Based Clustering in 3-D space. Intracluster distances are minimized Intercluster distances are maximized
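A minimal sketch of the Euclidean similarity measure described above, assuming continuous attributes. The helper names `euclidean` and `assign_to_nearest`, the sample points, and the two cluster centers are hypothetical and chosen only for illustration; this is not a full clustering algorithm, only the distance-based assignment step.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_nearest(points, centers):
    """Assign each point to the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
        for p in points
    ]

# Illustrative data: two tight groups far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
centers = [(0.0, 0.5), (5.5, 5.0)]
labels = assign_to_nearest(points, centers)
print(labels)
```

Points that share a label are close to the same center, which is the sense in which intracluster distances are small and intercluster distances are large.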
Clustering: Application 1 High-Risk Patient Detection Goal: Predict whether a patient will suffer a major complication after a surgical procedure. Approach: Use patients' vital signs before and after the operation (heart rate, respiratory rate, etc.). Find patients whose symptoms are dissimilar from those of most other patients.
Clustering: Application 2 Document Clustering: Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
Illustrating Document Clustering Clustering Points: 3204 Articles of Los Angeles Times. Similarity Measure: How many words are common in these documents (after some word filtering).
Algorithms to solve these problems
Classification algorithms K-Nearest-Neighbor classifiers Naïve Bayes classifier Neural Networks Linear Discriminant Analysis (LDA) Support Vector Machines (SVM) Decision Trees Logistic Regression Graphical models
Regression methods Linear Regression Ridge Regression LASSO – Least Absolute Shrinkage and Selection Operator Neural Networks
Clustering algorithms K-Means Hierarchical clustering Graph-based clustering (Spectral clustering) Semi-supervised clustering Others
Challenges of Big Data Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation
Our Focus Supervised learning Classification (support vector machine) Regression (backpropagation neural networks) Before talking about the techniques, let us first understand how a learning model is evaluated.
Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models?
Metrics for Performance Evaluation Regression Sum of squares Sum of deviation Exponential function of the deviation
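The first two regression metrics can be sketched as follows. This assumes that "sum of deviation" means the sum of absolute deviations between targets and predictions, which the slide does not state explicitly; the function names and the sample values are hypothetical.

```python
def sum_of_squares(y_true, y_pred):
    """Sum of squared deviations between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def sum_of_abs_deviation(y_true, y_pred):
    """Sum of absolute deviations (assumed reading of 'sum of deviation')."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

# Illustrative targets and predictions.
y_true = [100.0, 120.0, -200.0]
y_pred = [110.0, 100.0, -150.0]

sse = sum_of_squares(y_true, y_pred)
sad = sum_of_abs_deviation(y_true, y_pred)
print(sse, sad)
```

Squaring penalizes large deviations more heavily than the absolute value does, so the two metrics can rank models differently when a few predictions are far off.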
Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than how long it takes to classify or build models, scalability, etc.
Confusion Matrix:

                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a                     b
ACTUAL Class=No     c                     d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
Most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a (TP)                b (FN)
ACTUAL Class=No     c (FP)                d (TN)
Limitation of Accuracy Consider a 2-class problem Number of Class 0 examples = 9990 Number of Class 1 examples = 10 If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % Accuracy is misleading because model does not detect any class 1 example
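The imbalanced example above can be reproduced with a short sketch. The helper `confusion_counts` is a hypothetical name; the class sizes (9990 and 10) come from the slide, and class 1 is treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# 9990 examples of class 0 and 10 examples of class 1, as on the slide.
y_true = [0] * 9990 + [1] * 10
# A model that predicts class 0 for everything.
y_pred = [0] * 10000

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(tp, fn, fp, tn, accuracy)
```

The accuracy is very high even though TP is zero, which is why accuracy alone is a poor summary when classes are heavily imbalanced.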
Cost Matrix

C(i|j)              PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    C(Yes|Yes)            C(No|Yes)
ACTUAL Class=No     C(Yes|No)             C(No|No)

C(i|j): Cost of misclassifying class j example as class i
Computing Cost of Classification

Cost Matrix C(i|j)
              PREDICTED +   PREDICTED -
ACTUAL +      -1            100
ACTUAL -      1             0

Model M1
              PREDICTED +   PREDICTED -
ACTUAL +      150           40
ACTUAL -      60            250
Accuracy = 80%, Cost = 3910

Model M2
              PREDICTED +   PREDICTED -
ACTUAL +      250           45
ACTUAL -      5             200
Accuracy = 90%, Cost = 4255
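The accuracy and cost figures for M1 and M2 can be checked with a short sketch. It assumes the cost of a correctly predicted negative, C(-|-), is 0, which is the value that makes the slide's totals (3910 and 4255) come out; the dictionary layout and function names are hypothetical.

```python
# Cost matrix keyed by (actual, predicted).
# C(-|-) = 0 is an assumption consistent with the totals on the slide.
COST = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}

# Confusion-matrix counts for the two models, keyed by (actual, predicted).
M1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
M2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}

def total_cost(counts, cost=COST):
    """Sum of count times cost over all four confusion-matrix cells."""
    return sum(counts[key] * cost[key] for key in counts)

def accuracy(counts):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = counts[("+", "+")] + counts[("-", "-")]
    return correct / sum(counts.values())

for name, model in (("M1", M1), ("M2", M2)):
    print(name, accuracy(model), total_cost(model))
```

M2 has the higher accuracy but also the higher cost, because it makes more of the expensive errors (actual + predicted as -).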
Cost vs Accuracy

Count               PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a                     b
ACTUAL Class=No     c                     d

Cost                PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    p                     q
ACTUAL Class=No     q                     p

Cost is a linear function of accuracy if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

N = a + b + c + d
Accuracy = (a + d)/N
Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N - a - d)
     = q N - (q - p)(a + d)
     = N [q - (q - p) Accuracy]
Cost-Sensitive Measures

Count               PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a                     b
ACTUAL Class=No     c                     d

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
Precision is biased towards C(Yes|Yes) & C(Yes|No).
Recall is biased towards C(Yes|Yes) & C(No|Yes).
A model that declares every record to be the positive class: b = d = 0, so recall is high.
A model that assigns the positive class only to the (sure) test records: c is small, so precision is high.
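Cost-Sensitive Measures (Cont'd)

Count               PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a                     b
ACTUAL Class=No     c                     d

F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c), where p is precision and r is recall.
F-measure is biased towards all except C(No|No).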
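A minimal sketch of the three cost-sensitive measures, using the standard definitions precision = a/(a+c), recall = a/(a+b), and F = 2rp/(r+p). The function name is hypothetical, and the counts are those of model M1 from the cost example (a = 150, b = 40, c = 60).

```python
def precision_recall_f(a, b, c):
    """Precision, recall and F-measure from TP (a), FN (b) and FP (c)."""
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Counts for model M1: TP = 150, FN = 40, FP = 60.
precision, recall, f_measure = precision_recall_f(150, 40, 60)
print(precision, recall, f_measure)
```

None of the three quantities uses d (the true negatives), which matches the remark that the F-measure ignores C(No|No).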
Methods for Performance Evaluation How to obtain a reliable estimate of performance? Performance of a model may depend on other factors besides the learning algorithm: Class distribution Cost of misclassification Size of training and test sets
Learning Curve Learning curve shows how accuracy changes with varying sample size Requires a sampling schedule for creating learning curve: Arithmetic sampling (Langley, et al) Geometric sampling (Provost et al) Effect of small sample size: Bias in the estimate Variance of estimate
Methods of Estimation
Holdout: reserve 2/3 for training and 1/3 for testing
Random subsampling: repeated holdout
Cross validation: partition data into k disjoint subsets
  k-fold: train on k-1 partitions, test on the remaining one
  Leave-one-out: k = n
Stratified sampling: oversampling vs undersampling
Bootstrap: sampling with replacement
A Useful Link http://dlib.net/ml_guide.svg
Methods of Estimation (Cont'd)
Holdout method
  Given data is randomly partitioned into two independent sets
  Training set (e.g., 2/3) for model construction
  Test set (e.g., 1/3) for accuracy estimation
  Random sampling: a variation of holdout. Repeat holdout k times, accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
  Randomly partition the data into k mutually exclusive subsets D_1, ..., D_k, each of approximately equal size
  At the i-th iteration, use D_i as the test set and the others as the training set
  Leave-one-out: k folds where k = # of tuples, for small sized data
  Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
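The k-fold partitioning described above can be sketched as follows. This is a plain (non-stratified) version under simple assumptions; the function name `k_fold_indices` and the fixed seed are hypothetical choices for the illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)      # random partition
    folds = [indices[i::k] for i in range(k)] # k disjoint, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print(sorted(test), len(train))
```

Each example appears in exactly one test fold, so every example is used for testing once and for training k-1 times. Setting k equal to n gives leave-one-out.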
Methods of Estimation (Cont'd)
Bootstrap
  Works well with small data sets
  Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
  Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set form the test set.
  About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set, since (1 - 1/d)^d ≈ e^{-1} ≈ 0.368
  Repeat the sampling procedure k times; overall accuracy of the model:
  Acc(M) = (1/k) Σ_{i=1}^{k} (0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set)
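A small sketch of bootstrap sampling that checks the 36.8% figure empirically. The helper names are hypothetical, and `acc_632` implements the commonly cited .632 weighting (0.632 on test-set accuracy, 0.368 on training-set accuracy, averaged over the k repetitions), stated here as an assumption about the intended formula.

```python
import random

def bootstrap_split(d, rng):
    """Sample d indices with replacement; unsampled indices form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test

def acc_632(test_accs, train_accs):
    """Average of 0.632 * test accuracy + 0.368 * training accuracy."""
    pairs = list(zip(test_accs, train_accs))
    return sum(0.632 * te + 0.368 * tr for te, tr in pairs) / len(pairs)

rng = random.Random(0)
d = 10000
train, test = bootstrap_split(d, rng)
oob_fraction = len(test) / d  # fraction of examples never sampled
print(oob_fraction)
```

For large d the fraction of examples left out of the bootstrap sample should be close to e^{-1}, which is where the 0.632 and 0.368 weights come from.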
ROC (Receiver Operating Characteristic) Developed in the 1950s in signal detection theory to analyze noisy signals. Characterizes the trade-off between positive hits and false alarms. The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis). The performance of each classifier is represented as a point on the ROC curve. If the classifier returns a real-valued prediction, changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.
ROC Curve

                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a (TP)                b (FN)
ACTUAL Class=No     c (FP)                d (TN)

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)
At threshold t: TP=50, FN=50, FP=12, TN=88, so TPR = 50/100 = 0.5 and FPR = 12/100 = 0.12
ROC Curve
(TPR,FPR):
  (0,0): declare everything to be negative class (TP = 0, FP = 0)
  (1,1): declare everything to be positive class (FN = 0, TN = 0)
  (1,0): ideal (FN = 0, FP = 0)

                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a (TP)                b (FN)
ACTUAL Class=No     c (FP)                d (TN)

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)
ROC Curve (TPR,FPR): (0,0): declare everything to be negative class (1,1): declare everything to be positive class (1,0): ideal Diagonal line: Random guessing Below diagonal line: prediction is opposite of the true class
How to Construct an ROC curve
Use a classifier that produces a posterior probability P(+|A) for each test instance
Sort the instances according to P(+|A) in decreasing order
Apply a threshold at each unique value of P(+|A)
Count the number of TP, FP, TN, FN at each threshold
TP rate, TPR = TP/(TP+FN)
FP rate, FPR = FP/(FP+TN)

Instance  P(+|A)  True Class
1         0.95    +
2         0.93
3         0.87    -
4         0.85
5
6
7         0.76
8         0.53
9         0.43
10        0.25
How to Construct an ROC curve
Use a classifier that produces a posterior probability P(+|A) for each test instance
Sort the instances according to P(+|A) in decreasing order
Pick a threshold of 0.85 for the ten instances listed on the previous slide
  p >= 0.85: predicted P
  p < 0.85: predicted N
TP = 3, FP = 3, TN = 2, FN = 2
TP rate, TPR = 3/5 = 60%
FP rate, FPR = 3/5 = 60%
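The threshold sweep can be sketched as below. The scores follow the table where it is legible; the scores for instances 5 and 6 and most of the class labels are not recoverable from the table, so the values used here are assumptions chosen to agree with the counts at threshold 0.85 (TP = 3, FP = 3, TN = 2, FN = 2). The function name `roc_points` is hypothetical.

```python
def roc_points(scores, labels, positive="+"):
    """Return (FPR, TPR) pairs, one per unique threshold, highest threshold first."""
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == positive)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != positive)
        points.append((fp / neg, tp / pos))
    return points

# Scores from the table; missing scores and labels are ASSUMED for illustration.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(fpr, tpr)
```

Each unique score is used once as a threshold, so tied scores (here 0.85) move the curve diagonally in a single step rather than as separate points.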
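How to construct an ROC curve [Figure: table of TP, FP, TN, FN, TPR and FPR for each threshold (predict + when P(+|A) >= threshold), and the resulting ROC curve.]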
Using ROC for Model Comparison No model consistently outperforms the other M1 is better for small FPR M2 is better for large FPR Area Under the ROC curve (AUC) Ideal: Area = 1 Random guess: Area = 0.5
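The two reference areas can be checked with a trapezoidal-rule sketch. The function name `auc_trapezoid` is hypothetical; it expects a list of (FPR, TPR) points and assumes the curve is piecewise linear between them.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve given (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Ideal classifier: TPR reaches 1 while FPR is still 0.
ideal = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Random guessing: the diagonal from (0, 0) to (1, 1).
random_guess = [(0.0, 0.0), (1.0, 1.0)]

print(auc_trapezoid(ideal), auc_trapezoid(random_guess))
```

AUC collapses the whole curve into one number, so two models whose curves cross (as M1 and M2 do) can have similar areas while behaving differently at small or large FPR.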
Data normalization
Example-wise normalization: each example is normalized and mapped to the unit sphere
Feature-wise normalization
  [0,1]-normalization: normalize each feature into the range [0, 1]
  Standard normalization: normalize each feature to have mean 0 and standard deviation 1
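The three normalizations can be sketched as follows, assuming the unit sphere refers to the Euclidean (L2) norm and that standard normalization uses the population standard deviation. The function names and sample values are hypothetical.

```python
import math

def unit_norm(example):
    """Example-wise: scale one example (a row) to Euclidean length 1."""
    norm = math.sqrt(sum(v * v for v in example))
    return [v / norm for v in example] if norm else list(example)

def min_max(column):
    """Feature-wise: map one feature (a column) linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]

def standardize(column):
    """Feature-wise: shift and scale one feature to mean 0, std 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    if std == 0:
        return [0.0] * len(column)
    return [(v - mean) / std for v in column]

row = unit_norm([3.0, 4.0])
scaled = min_max([60.0, 125.0, 220.0])
z = standardize([60.0, 125.0, 220.0])
print(row, scaled, z)
```

Example-wise normalization operates on rows and removes differences in overall magnitude between examples, while the feature-wise variants operate on columns and put features measured in different units on a comparable scale.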