1 2011 Data Mining — Pilsung Kang, Industrial & Information Systems Engineering, Seoul National University of Science & Technology. Chapter 7: Multiple Linear Regression

2 Steps in Data Mining revisited
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model

3 Prediction revisited — Predict the selling price of a Toyota Corolla: use the independent variables (attributes, features) to estimate the dependent variable (target).

4 Multiple Linear Regression — Goal: fit a linear relationship between a quantitative dependent variable Y and a set of predictors X1, X2, …, Xp:
Y = β0 + β1·X1 + β2·X2 + … + βp·Xp + ε,
where β0, β1, …, βp are the coefficients and ε is the unexplained (noise) term.

5 Multiple Linear Regression — Explanatory vs. Predictive
Explanatory regression:
- Explains the relationship between the predictors (explanatory variables) and the target.
- The familiar use of regression in data analysis.
- Model goal: fit the data well and understand the contribution of the explanatory variables to the model.
- "Goodness-of-fit": R², residual analysis, p-values.
Predictive regression:
- Predicts target values in other data where we have predictor values but not target values.
- The classic data mining context.
- Model goal: optimize predictive accuracy.
- The model is trained on training data and its performance assessed on validation (hold-out) data.
- Explaining the role of the predictors is not the primary purpose (but still useful).

6 Multiple Linear Regression — Estimating the coefficients: ordinary least squares (OLS)
Actual target: y_i = β0 + β1·x_i1 + … + βp·x_ip + ε_i
Predicted target: ŷ_i = β̂0 + β̂1·x_i1 + … + β̂p·x_ip
Goal: minimize the difference between the actual and predicted targets, i.e., choose the β̂'s to minimize Σ_i (y_i − ŷ_i)².

7 Multiple Linear Regression — Ordinary least squares: matrix solution
With X an n-by-p matrix of predictor values, y an n-by-1 target vector, and β a p-by-1 coefficient vector, the least-squares estimate is
β̂ = (XᵀX)⁻¹ Xᵀ y.
(A minimal sketch of this computation follows.)
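The sketch below shows the closed-form OLS computation in Python/NumPy; the data is synthetic and the variable names are illustrative assumptions, not part of the original slides.

```python
import numpy as np

# Minimal OLS sketch: beta_hat = (X^T X)^(-1) X^T y, solved via the normal
# equations. The data below is synthetic, for illustration only.
rng = np.random.default_rng(0)
n, p = 100, 3
X_raw = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = 4.0 + X_raw @ true_beta + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), X_raw])        # prepend an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # solve (X^T X) beta = X^T y
print(beta_hat)                                 # roughly [4.0, 2.0, -1.0, 0.5]
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to explicitly inverting XᵀX, while giving the same estimate.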

8 Multiple Linear Regression — Ordinary least squares finds the best estimates of β when the following conditions are satisfied:
- The noise ε follows a normal distribution.
- The linear relationship is correct.
- The cases are independent of each other.
- The variability in the Y values for a given set of predictor values is the same regardless of those values (homoskedasticity).

9 Multiple Linear Regression — Example: predict the selling price of a Toyota Corolla (Y: selling price; X: the remaining attributes used as predictors).

10 Multiple Linear Regression — Data preprocessing
- Create dummy variables for the fuel types:
  Fuel type | Fuel_Type=Diesel | Fuel_Type=Petrol | Fuel_Type=CNG
  Diesel    | 1 | 0 | 0
  Petrol    | 0 | 1 | 0
- Data partitioning: 60% training data / 40% validation data.
(A minimal preprocessing sketch follows.)
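A minimal sketch of these two preprocessing steps with pandas and scikit-learn; the tiny inline data frame stands in for the actual Toyota Corolla data set and its values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up sample standing in for the Corolla data set.
cars = pd.DataFrame({
    "Price":     [13500, 13750, 13950, 14950, 13250, 12950],
    "Age":       [23, 23, 24, 26, 30, 32],
    "KM":        [46986, 72937, 41711, 48000, 38500, 61000],
    "Fuel_Type": ["Diesel", "Diesel", "Petrol", "Petrol", "Petrol", "CNG"],
})

# Create one 0/1 dummy column per fuel type.
cars = pd.get_dummies(cars, columns=["Fuel_Type"])

# Partition the records: 60% training data, 40% validation data.
train, valid = train_test_split(cars, train_size=0.6, random_state=1)
print(train.shape, valid.shape)
```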

11 Multiple Linear Regression — Fitted linear regression model: the regression output reports the estimated coefficients (β) and their p-values. (A minimal sketch of producing such output follows.)
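A minimal sketch, using statsmodels on synthetic data (an assumption in place of the Corolla data), of fitting a linear regression and reading off the coefficient estimates and p-values.

```python
import numpy as np
import statsmodels.api as sm

# Fit an OLS model and print the coefficient table.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 10 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())   # "coef" column = beta estimates, "P>|t|" column = p-values
```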

12 Multiple Linear Regression — Actual & predicted targets.

13 Prediction Performance — Example: predict a baby's weight (kg) based on age.
Age | Actual weight (y) | Predicted weight (y′)
 1  |  5.6 |  6.0
 2  |  6.9 |  6.4
 3  | 10.4 | 10.9
 4  | 13.7 | 12.4
 5  | 17.4 | 15.6
 6  | 20.7 | 21.5
 7  | 23.5 | 23.0

14 Prediction Performance — Average error
Indicates whether the predictions are, on average, over- or under-predictions (see the baby-weight table on slide 13):
Average error = (1/n) Σ (y_i − y′_i)

15 Prediction Performance — Mean absolute error (MAE)
Gives the magnitude of the average error:
MAE = (1/n) Σ |y_i − y′_i|

16 Prediction Performance — Mean absolute percentage error (MAPE)
Gives a percentage score of how much the predictions deviate from the actual values on average:
MAPE = (100%/n) Σ |y_i − y′_i| / y_i

17 Prediction Performance — (Root) mean squared error ((R)MSE)
MSE = (1/n) Σ (y_i − y′_i)²,  RMSE = √MSE
RMSE is the standard error of estimate and has the same units as the predicted variable. (A sketch computing all four measures on the baby-weight example follows.)
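A minimal sketch computing the four performance measures on the baby-weight table from slide 13; the array names are illustrative.

```python
import numpy as np

# Actual (y) and predicted (y_pred) weights from the slide-13 table.
y      = np.array([5.6, 6.9, 10.4, 13.7, 17.4, 20.7, 23.5])
y_pred = np.array([6.0, 6.4, 10.9, 12.4, 15.6, 21.5, 23.0])

err = y - y_pred
average_error = err.mean()                   # sign shows over-/under-prediction
mae  = np.abs(err).mean()                    # mean absolute error
mape = (np.abs(err) / y).mean() * 100        # mean absolute percentage error, in %
rmse = np.sqrt((err ** 2).mean())            # root mean squared error

print(f"AE={average_error:.3f}  MAE={mae:.3f}  MAPE={mape:.2f}%  RMSE={rmse:.3f}")
```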

18 Multiple Linear Regression — Performance evaluation.

19 Multiple Linear Regression — Residual distribution. (A minimal sketch of plotting the validation residuals follows.)
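A minimal sketch, on synthetic data (an assumption in place of the Corolla data), of how the residual distribution on the validation data can be inspected.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit on training data, then look at residuals (actual - predicted) on
# held-out validation data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=500)

X_train, X_valid = X[:300], X[300:]
y_train, y_valid = y[:300], y[300:]

model = LinearRegression().fit(X_train, y_train)
residuals = y_valid - model.predict(X_valid)

plt.hist(residuals, bins=20)
plt.xlabel("Residual")
plt.ylabel("Count")
plt.title("Residual distribution (validation data)")
plt.show()
```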

20 Variable Selection in Linear Regression — Why should we select a subset of variables?
- It may be expensive or infeasible to collect the full complement of predictors for future predictions.
- We may be able to measure fewer predictors more accurately (e.g., in surveys).
- The more predictors, the more missing values.
- Parsimony (a.k.a. Occam's razor): the simpler, the better.
- Multicollinearity: the presence of two or more predictors sharing the same linear relationship with the outcome variable.

21 Variable Selection in Linear Regression — Goal: find a parsimonious model, i.e., the simplest model that performs sufficiently well (more robust, higher predictive accuracy).
Methods:
- Exhaustive search
- Partial search: forward selection, backward elimination, stepwise selection

22 Variable Selection in Linear Regression — Exhaustive search
- All possible subsets of predictors are assessed (singles, pairs, triplets, etc.).
- Example: for three variables X1, X2, X3, all seven non-empty subsets are evaluated: {X1}, {X2}, {X3}, {X1, X2}, {X1, X3}, {X2, X3}, {X1, X2, X3}.
- Adjusted R² is used as the performance criterion: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of records and p is the number of predictors.
(A minimal sketch of an exhaustive search follows.)
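A minimal sketch of an exhaustive subset search scored by adjusted R², on synthetic data (an assumption); with the real data, the candidate columns would come from the preprocessed Corolla table.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

# Synthetic data: three candidate predictors, of which the third is irrelevant.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

def adjusted_r2(X_sub, y):
    p = X_sub.shape[1]
    r2 = LinearRegression().fit(X_sub, y).score(X_sub, y)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Evaluate every non-empty subset of the three predictors.
results = []
for k in range(1, 4):
    for subset in combinations(range(3), k):
        results.append((adjusted_r2(X[:, list(subset)], y), subset))

for score, subset in sorted(results, reverse=True):
    print(subset, round(score, 4))
```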

23 Variable Selection in Linear Regression — Forward selection
- Start with no predictors.
- Add predictors one by one (at each step, add the one with the largest contribution).
- Stop when the addition is not statistically significant.
- Example: for three variables, first add the best single predictor (say X1), then evaluate adding X2 or X3 to it (say X1 and X3 are kept), and so on.

24 Variable Selection in Linear Regression — Backward elimination
- Start with all predictors.
- Successively eliminate the least useful predictor, one at a time.
- Stop when all remaining predictors make statistically significant contributions.
- Example: for three variables, start with {X1, X2, X3} and drop the least useful predictor (say X2), continuing until every remaining predictor is significant.
Stepwise selection
- Like forward selection, except that at each step we also consider dropping predictors that are no longer statistically significant.
(A minimal sketch of forward selection follows; backward elimination and stepwise selection work analogously.)
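A minimal sketch of forward selection driven by p-values, on synthetic data (an assumption); the significance threshold and variable names are illustrative choices, not taken from the slides.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: five candidate predictors, only the 1st and 3rd matter.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 5))
y = 4 + 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)

remaining, selected = list(range(X.shape[1])), []
alpha = 0.05                                              # significance threshold

while remaining:
    # p-value of each candidate when added to the current model.
    pvals = {}
    for j in remaining:
        cols = selected + [j]
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals[j] = fit.pvalues[-1]                        # p-value of the newly added term
    best = min(pvals, key=pvals.get)
    if pvals[best] >= alpha:                              # no significant addition left
        break
    selected.append(best)
    remaining.remove(best)

print("Selected predictors:", selected)                   # expected: [0, 2]
```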

25 Variable Selection in Linear Regression — Exhaustive search results.

26 Variable Selection in Linear Regression — Backward elimination results.

27 Prediction performance evaluation — With six variables: model fit and predictive performance (compare to the 12-predictor model!).

28 Linear Regression — Summary
- Linear regression models are very popular tools, not only for explanatory modeling but also for prediction.
- A good predictive model has high predictive accuracy (to a useful practical level).
- Predictive models are built using a training data set and evaluated on a separate validation data set.
- Removing redundant predictors is key to achieving predictive accuracy and robustness.
- Subset selection methods help find "good" candidate models; these should then be run and assessed.

