Classification and Prediction
Classification, Regression, and Prediction
- Classification: predicts categorical class labels
  - Constructs a model from a training set and the values (class labels) of a classifying attribute, then uses that model to classify new data
- Regression: models continuous-valued functions, i.e., predicts unknown or missing values
- Prediction: classification + regression
  - Sometimes refers only to regression (e.g., in the textbook)
Classification: A Two-Step Process
Step 1. Model construction: describing a set of predetermined classes
- The set of tuples used for model construction is the training set
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The model is represented as classification rules, decision trees, or mathematical formulae
- Example rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
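As a rough illustration, the example rule can be written as a small Python function. This is only a sketch: the function name `is_tenured`, the default answer 'no' when the rule does not fire, and the tuples for Ann and Bob are assumptions made for illustration; the slides give only the IF part of the rule and the tuple (Jeff, Professor, 4).

```python
def is_tenured(rank: str, years: int) -> str:
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    # Assumption: the rule has no ELSE branch in the slides; 'no' is used as the default.
    return "no"

# (name, rank, years) tuples; Ann and Bob are hypothetical examples.
samples = [
    ("Jeff", "Professor", 4),
    ("Ann", "Assistant Prof", 3),
    ("Bob", "Associate Prof", 7),
]

# Classify each tuple with the rule.
predictions = {name: is_tenured(rank, years) for name, rank, years in samples}
print(predictions)
```

Jeff is classified by the rank condition and Bob by the years condition; Ann satisfies neither condition, so she receives the assumed default.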
Classification: A Two-Step Process
Step 2. Model usage: classifying future or unknown objects
- Estimate the predictive accuracy of the model
  - The known label of each test sample is compared with the model's classification
  - The accuracy rate is the percentage of test set samples that the model classifies correctly
  - The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
- Example: the rule IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ classifies the unseen tuple (Jeff, Professor, 4) as tenured = ‘yes’
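The accuracy estimate described in Step 2 can be sketched as follows. The labelled test tuples are hypothetical and chosen so that the rule misclassifies one of them; the helper names `classify` and `accuracy` are likewise assumptions, not part of the slides.

```python
def classify(rank, years):
    # The example rule; 'no' as the fallback is an assumption.
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Hypothetical labelled test set, kept separate from any training data:
# (rank, years, known tenured label)
test_set = [
    ("Professor", 4, "yes"),
    ("Assistant Prof", 2, "no"),
    ("Associate Prof", 7, "yes"),
    ("Assistant Prof", 7, "no"),  # the rule predicts 'yes' here, so this one is wrong
]

def accuracy(model, labelled):
    """Fraction of labelled samples whose predicted label matches the known label."""
    correct = sum(1 for rank, years, label in labelled if model(rank, years) == label)
    return correct / len(labelled)

acc = accuracy(classify, test_set)
print(f"accuracy = {acc:.0%}")  # accuracy = 75%
```

Three of the four hypothetical test tuples are classified correctly, giving an accuracy rate of 75%.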
Classification Process (1): Model Construction
[Figure: Training Data is fed to the classification Algorithms, which produce the Classifier (Model), e.g., IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’]
Classification Process (2): Use Model in Prediction
[Figure: Test Data and Unseen Data are passed to the Classifier (Model), IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’. For the unseen tuple (Jeff, Professor, 4), the question "Tenured?" is answered "Yes".]
Supervised versus Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
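The contrast can be sketched on a toy one-dimensional data set. Everything here is an assumption chosen for illustration: the measurements, the labels "low" and "high", the nearest-centroid classifier for the supervised case, and the simple two-cluster k-means loop for the unsupervised case. The slides do not prescribe any of these methods.

```python
from statistics import mean

# Hypothetical 1-D measurements.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
# Class labels, available only in the supervised setting.
labels = ["low", "low", "low", "high", "high", "high"]

# Supervised: use the labels to compute one centroid per class.
centroids = {
    c: mean([v for v, l in zip(values, labels) if l == c]) for c in set(labels)
}

def classify(x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

# Unsupervised: no labels; look for two groups in the data (2-means).
def two_means(data, iterations=10):
    c0, c1 = min(data), max(data)  # start from the two extreme points
    for _ in range(iterations):
        g0 = [x for x in data if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in data if abs(x - c0) > abs(x - c1)]
        c0, c1 = mean(g0), mean(g1)  # assumes both groups stay non-empty
    return sorted(g0), sorted(g1)

print(classify(4.5))      # labelled classes exist, so the answer is a class name
print(two_means(values))  # only groupings are found; they carry no names
```

The supervised classifier returns a named class because the labels were given, whereas the clustering step can only report that two groups exist.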
Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary
Issues (1): Data Preparation
- Data cleaning
  - Preprocess the data to reduce noise (e.g., by smoothing) and handle missing values (e.g., use the most commonly occurring value)
  - Helps to reduce confusion during learning
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize the data (to higher-level concepts) and/or normalize it (scale values so that they fall within a specified range)
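Two of the preparation steps above can be sketched in a few lines: filling missing values with the most commonly occurring value, and min-max normalization into a specified range. The function names, the use of `None` to mark a missing value, and the sample columns are assumptions for illustration only.

```python
from collections import Counter

def fill_missing_with_mode(column):
    """Replace None entries with the most commonly occurring non-missing value."""
    mode = Counter(v for v in column if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in column]

def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Scale values linearly so that they fall within [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [new_min for _ in column]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in column]

# Hypothetical attribute columns.
ranks = ["professor", None, "assistant", "professor"]
years = [2, 4, 6, 10]

print(fill_missing_with_mode(ranks))  # ['professor', 'professor', 'assistant', 'professor']
print(min_max_normalize(years))       # [0.0, 0.25, 0.5, 1.0]
```

Min-max scaling is only one way to normalize; it is used here because it maps directly onto the "specified range" wording.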
Issues (2): Evaluating Classification Methods
- Predictive accuracy
  - Ability to predict the class label
- Speed
  - Time to construct the model
  - Time to use the model
- Robustness
  - Making correct predictions given noise and missing values
- Scalability
  - Constructing the model efficiently given the data size
- Interpretability
  - Understanding and insight provided by the model
- Goodness of rules
  - Decision tree size
  - Compactness of classification rules