2011 Data Mining
Pilsung Kang
Industrial & Information Systems Engineering
Seoul National University of Science & Technology
Chapter 7: Multiple Linear Regression
Data Mining, IISE, SNUT
Steps in Data Mining revisited
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model
Prediction revisited
Predict the selling price of a Toyota Corolla…
Independent variables (attributes, features)
Dependent variable (target)
Multiple Linear Regression
Goal
Fit a linear relationship between a quantitative dependent variable Y and a set of predictors X1, X2, …, Xp:
Y = β_0 + β_1·X1 + β_2·X2 + … + β_p·Xp + ε
β_0, β_1, …, β_p are the coefficients; ε is the noise (the unexplained part).
Explanatory vs. Predictive

Explanatory Regression
- Explain the relationship between the predictors (explanatory variables) and the target.
- The familiar use of regression in data analysis.
- Model goal: fit the data well and understand the contribution of the explanatory variables to the model.
- “Goodness-of-fit”: R², residual analysis, p-values.

Predictive Regression
- Predict target values in other data where we have predictor values, but not target values.
- The classic data mining context.
- Model goal: optimize predictive accuracy.
- Train the model on training data; assess performance on validation (hold-out) data.
- Explaining the role of predictors is not the primary purpose (but still useful).
Estimating the coefficients
Ordinary least squares (OLS)
Actual target: y_i = β_0 + β_1·x_i1 + … + β_p·x_ip + ε_i
Predicted target: ŷ_i = b_0 + b_1·x_i1 + … + b_p·x_ip
Goal: minimize the sum of squared differences between the actual and predicted targets, Σ_i (y_i − ŷ_i)²
Ordinary least squares: matrix solution
X: n-by-p matrix, y: n-by-1 vector, β: p-by-1 vector
β̂ = (XᵀX)⁻¹ Xᵀ y
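The matrix solution can be sketched in NumPy on synthetic data (the data, true coefficients, and seed below are illustrative, not from the Corolla example):

```python
import numpy as np

# Illustrative sketch of the OLS matrix solution beta = (X'X)^(-1) X'y.
# The design matrix gets a leading column of ones for the intercept.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=(n, 2))                    # two predictors
y = 3.0 + 1.5 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), x])                   # n-by-(p+1) design matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)           # solve the normal equations

print(beta_hat)   # close to the true coefficients [3.0, 1.5, -2.0]
```

In practice `np.linalg.lstsq` (or a regression library) is preferred over forming and inverting XᵀX explicitly, for numerical stability.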
Ordinary least squares
Finds the best estimates of β when the following conditions are satisfied:
- The noise ε follows a normal distribution.
- The linear relationship is correct.
- The cases are independent of each other.
- The variability in the Y values for a given set of predictors is the same regardless of the values of the predictors (homoskedasticity).
Example: predict the selling price of a Toyota Corolla
Data preprocessing
Create dummy variables for fuel types:

Fuel type | Fuel_type=Diesel | Fuel_type=Petrol | Fuel_type=CNG
Diesel    | 1 | 0 | 0
Petrol    | 0 | 1 | 0
CNG       | 0 | 0 | 1

Data partitioning: 60% training data / 40% validation data
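Both preprocessing steps can be sketched in pandas (the toy frame below is made up; only the fuel-type categories match the slide):

```python
import pandas as pd

# Toy data standing in for the Toyota Corolla set (values are made up).
df = pd.DataFrame({
    "Fuel_type": ["Diesel", "Petrol", "CNG", "Petrol", "Diesel"],
    "Price":     [13500, 13750, 11950, 14950, 13950],
})

# One 0/1 dummy column per fuel type, as in the table above.
dummies = pd.get_dummies(df["Fuel_type"], prefix="Fuel_type", dtype=int)
df = pd.concat([df.drop(columns="Fuel_type"), dummies], axis=1)

# 60% / 40% random partition into training and validation sets.
train = df.sample(frac=0.6, random_state=1)
valid = df.drop(train.index)
```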
Fitted linear regression model
(Regression output: the estimated coefficient β and the p-value for each predictor.)
Actual & predicted targets
Prediction Performance
Example: predict a baby's weight (kg) based on age.
(Table: Age, Actual Weight (y), Predicted Weight (y′))
Average error
Indicates whether the predictions are, on average, over- or under-predictions.
Average error = (1/n) Σ (y_i − ŷ_i)
Mean absolute error (MAE)
Gives the magnitude of the average error.
MAE = (1/n) Σ |y_i − ŷ_i|
Mean absolute percentage error (MAPE)
Gives a percentage score of how much the predictions deviate, on average, from the actual values.
MAPE = (100%/n) Σ |y_i − ŷ_i| / y_i
(Root) mean squared error ((R)MSE)
The standard error of estimate; RMSE is in the same units as the predicted variable.
MSE = (1/n) Σ (y_i − ŷ_i)²,  RMSE = √MSE
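All four measures can be computed directly from actual and predicted values; the numbers below are made up (the baby-weight table from the slides is not reproduced here):

```python
import math

# Made-up actual/predicted values to illustrate the four error measures.
actual    = [3.2, 4.1, 5.0, 5.8, 6.4]
predicted = [3.0, 4.3, 4.9, 6.0, 6.5]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

avg_error = sum(errors) / n                                       # over/under on average
mae  = sum(abs(e) for e in errors) / n                            # mean absolute error
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n  # percentage deviation
rmse = math.sqrt(sum(e * e for e in errors) / n)                  # same units as y

print(avg_error, mae, mape, rmse)
```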
Performance evaluation
Residual distribution
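The check behind this kind of plot can be sketched on synthetic data (all values below are illustrative): under the OLS assumptions the residuals should look roughly normal, and with an intercept in the model they average to numerically zero.

```python
import numpy as np

# Sketch of a residual-distribution check on synthetic data
# (not the Corolla example).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x])        # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta                    # centred at zero by construction

# Histogram counts stand in for the residual-distribution plot.
counts, edges = np.histogram(residuals, bins=10)
print(residuals.mean(), residuals.std())
```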
Variable Selection in Linear Regression
Why should we select a subset of variables?
- It may be expensive or infeasible to collect the full complement of predictors for future predictions.
- We may be able to measure fewer predictors more accurately (e.g., in surveys).
- The more predictors, the more missing values.
- Parsimony (a.k.a. Occam's razor): the simpler, the better.
- Multicollinearity: the presence of two or more predictors sharing the same linear relationship with the outcome variable.
Goal
Find a parsimonious model (the simplest model that performs sufficiently well): it is more robust and tends to have higher predictive accuracy.
Methods
- Exhaustive search
- Partial search: forward selection, backward elimination, stepwise selection
Exhaustive search
All possible subsets of predictors are assessed (singles, pairs, triplets, etc.).
Example: for three variables X1, X2, X3, all seven subsets are evaluated: {X1}, {X2}, {X3}, {X1, X2}, {X1, X3}, {X2, X3}, {X1, X2, X3}.
Adjusted R² is used as the performance criterion:
Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of records and p is the number of predictors.
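The subset search scored by adjusted R² can be sketched as follows (toy data with three candidate predictors, where X3 is irrelevant by construction; not the Corolla data):

```python
import itertools
import numpy as np

# Toy data: y depends on X1 and X2; X3 is pure noise.
rng = np.random.default_rng(0)
n = 80
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, n)

def adjusted_r2(X_sub, y):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    n, p = X_sub.shape
    D = np.column_stack([np.ones(n), X_sub])      # add intercept
    beta = np.linalg.lstsq(D, y, rcond=None)[0]
    resid = y - D @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Evaluate every non-empty subset of the three predictors.
best = None
for k in range(1, 4):
    for subset in itertools.combinations(range(3), k):
        score = adjusted_r2(X[:, subset], y)
        if best is None or score > best[1]:
            best = (subset, score)
print(best)   # the best subset contains X1 and X2
```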
Forward selection
Start with no predictors and add them one by one (at each step, add the predictor with the largest contribution).
Stop when the addition is not statistically significant.
Example: for three variables, start from the empty model; if X1 contributes most, add it; then compare {X1, X2} and {X1, X3}; and so on.
Backward elimination
Start with all predictors and successively eliminate the least useful one at a time.
Stop when all remaining predictors have statistically significant contributions.
Example: for three variables, start with {X1, X2, X3}; drop the least useful predictor; repeat on the reduced set.
Stepwise selection
Like forward selection, except that at each step we also consider dropping predictors that are no longer significant.
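Forward selection can be sketched as a greedy loop. The slides stop when the added predictor is not statistically significant; as a simpler stand-in, this sketch stops when adjusted R² no longer improves (toy data where X1 and X2 matter and X3 is irrelevant by construction):

```python
import numpy as np

# Toy data: y depends on X1 and X2; X3 is pure noise.
rng = np.random.default_rng(1)
n = 80
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, n)

def adjusted_r2(cols):
    """Adjusted R^2 of the model using the predictors in `cols`."""
    D = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta = np.linalg.lstsq(D, y, rcond=None)[0]
    resid = y - D @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    p = len(cols)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Greedy forward pass: at each step add the predictor that helps most.
selected, remaining = [], [0, 1, 2]
current = adjusted_r2(selected)
while remaining:
    scores = {c: adjusted_r2(selected + [c]) for c in remaining}
    best_c = max(scores, key=scores.get)
    if scores[best_c] <= current:   # no improvement: stop
        break
    selected.append(best_c)
    remaining.remove(best_c)
    current = scores[best_c]
print(selected)
```

Backward elimination is the mirror image: start from all predictors and greedily drop the one whose removal hurts the criterion least.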
Exhaustive search results
Backward elimination results
Prediction performance evaluation
With six variables:
Model fit
Predictive performance (compare to the 12-predictor model!)
Summary
- Linear regression models are very popular tools, not only for explanatory modeling but also for prediction.
- A good predictive model has high predictive accuracy (to a practically useful level).
- Predictive models are built on a training data set and evaluated on a separate validation data set.
- Removing redundant predictors is key to achieving predictive accuracy and robustness.
- Subset selection methods help find “good” candidate models, which should then be run and assessed for predictive performance.