1 Chapter 6: Multiple Linear Regression
Data Mining for Business Analytics
Shmueli, Patel & Bruce
© Galit Shmueli and Peter Bruce 2010

2 Topics
- Explanatory vs. predictive modeling with regression
- Example: prices of Toyota Corollas
- Fitting a predictive model
- Assessing predictive accuracy
- Selecting a subset of predictors

3 Explanatory Modeling
- Goal: explain the relationship between the predictors (explanatory variables) and the target
- The familiar use of regression in data analysis
- Model goal: fit the data well and understand the contribution of the explanatory variables to the model
- "Goodness of fit": R², residual analysis, p-values

4 Predictive Modeling
- Goal: predict target values in other data where we have predictor values but not target values
- The classic data mining context
- Model goal: optimize predictive accuracy
- Train the model on training data; assess performance on validation (hold-out) data
- Explaining the role of the predictors is not the primary purpose (though still useful)

5 Example: Prices of Toyota Corollas
- Data file: ToyotaCorolla.xls
- Goal: predict the price of a used Toyota Corolla from its specifications
- Data: prices of 1442 used Toyota Corollas, with their specification information (a loading sketch follows below)
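A minimal loading sketch (not part of the original slides), assuming the ToyotaCorolla.xls file named above sits in the working directory and that pandas with Excel support (xlrd) is installed:

```python
# Load the Toyota Corolla data into a pandas DataFrame.
import pandas as pd

df = pd.read_excel("ToyotaCorolla.xls")
print(df.shape)   # number of records and variables
print(df.head())  # first few rows, as in the data sample on the next slide
```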

6 Data Sample (showing only the variables to be used in the analysis)

Price  Age  KM     Fuel_Type  HP   Metallic  Automatic  cc    Doors  Quart_Tax  Weight
13500  23   46986  Diesel     90   1         0          2000  3      210        1165
13750  23   72937  Diesel     90   1         0          2000  3      210        1165
13950  24   41711  Diesel     90   1         0          2000  3      210        1165
14950  26   48000  Diesel     90   0         0          2000  3      210        1165
13750  30   38500  Diesel     90   0         0          2000  3      210        1170
12950  32   61000  Diesel     90   0         0          2000  3      210        1170
16900  27   94612  Diesel     90   1         0          2000  3      210        1245
18600  30   75889  Diesel     90   1         0          2000  3      210        1245
21500  27   19700  Petrol     192  0         0          1800  3      100        1185
12950  23   71138  Diesel     69   0         0          1900  3      185        1105
20950  25   31461  Petrol     192  0         0          1800  3      100        1185

7 Variables Used
- Price (in euros)
- Age (in months, as of 8/04)
- KM (kilometers)
- Fuel_Type (diesel, petrol, CNG)
- HP (horsepower)
- Metallic color (1 = yes, 0 = no)
- Automatic transmission (1 = yes, 0 = no)
- CC (cylinder volume)
- Doors
- Quart_Tax (road tax)
- Weight (in kg)

8 Preprocessing
- Fuel type is categorical and must be transformed into binary (dummy) variables:
  - Diesel (1 = yes, 0 = no)
  - CNG (1 = yes, 0 = no)
- None needed for "Petrol" (the reference category; see the sketch below)
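A minimal sketch of this step, assuming the DataFrame `df` from the loading sketch above and a categorical column named Fuel_Type, as in the data sample:

```python
# Turn the categorical Fuel_Type column into 0/1 dummy variables.
# Fuel_Type_Petrol is created as well, but it will simply be left out of the
# predictor set later, making Petrol the reference category.
import pandas as pd

dummies = pd.get_dummies(df["Fuel_Type"], prefix="Fuel_Type", dtype=int)
df = pd.concat([df.drop(columns="Fuel_Type"), dummies], axis=1)
```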

9 Subset of the Records Selected for the Training Partition (limited number of variables shown)
60% training data / 40% validation data (a partitioning sketch follows the table)

Price  Age  KM     HP   Metallic  Automatic  cc    Doors  Quart_Tax  Weight  Fuel_Type_CNG  Fuel_Type_Diesel  Fuel_Type_Petrol
13500  23   46986  90   1         0          2000  3      210        1165    0              1                 0
13750  30   38500  90   0         0          2000  3      210        1170    0              1                 0
18600  30   75889  90   1         0          2000  3      210        1245    0              1                 0
20950  25   31461  192  0         0          1800  3      100        1185    0              0                 1
19950  22   43610  192  0         0          1800  3      100        1185    0              0                 1
22500  32   34131  192  1         0          1800  3      100        1185    0              0                 1
22000  28   18739  192  0         0          1800  3      100        1185    0              0                 1
17950  24   21716  110  1         0          1600  3      85         1105    0              0                 1
16950  30   64359  110  1         0          1600  3      85         1105    0              0                 1
15950  30   67660  110  1         0          1600  3      85         1105    0              0                 1
16950  29   43905  110  0         1          1600  3      100        1170    0              0                 1
15950  28   56349  110  1         0          1600  3      85         1120    0              0                 1
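A sketch of the 60%/40% partition, continuing from the preprocessing sketch. The predictor names follow the slides and may differ slightly from the raw data file (e.g. Age vs. Age_08_04); the random seed is arbitrary, since the slides do not specify one:

```python
# Split the records into a 60% training partition and a 40% validation partition.
from sklearn.model_selection import train_test_split

predictors = ["Age", "KM", "HP", "Metallic", "Automatic", "cc", "Doors",
              "Quart_Tax", "Weight", "Fuel_Type_CNG", "Fuel_Type_Diesel"]
X = df[predictors]
y = df["Price"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)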

10 The Fitted Regression Model (a fitting sketch follows the table)

Input variables     Coefficient   Std. Error    t-Statistic   p-value
Intercept           -1833.3       1363.468725   -1.34458531   0.179118
Age                 -121.607      3.320627169   -36.6218141   5.8E-177
KM                  -0.01928      0.001699871   -11.3431708   7.2E-28
HP                  29.75675      4.224060494   7.044584823   3.84E-12
Metallic            101.1609      97.8079189    1.034280977   0.301299
Automatic           456.6505      199.4455759   2.289599302   0.022289
cc                  -0.0223       0.091510026   -0.2437426    0.807489
Doors               -118.378      50.63479867   -2.33788041   0.019625
Quart_Tax           11.75023      2.201777534   5.336700883   1.21E-07
Weight              16.06755      1.414469773   11.35941787   6.13E-28
Fuel_Type_CNG       -2367.26      448.7481365   -5.27524242   1.68E-07
Fuel_Type_Diesel    -1077.4       354.0357916   -3.04319608   0.002413
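The slides show XLMiner output; a sketch of producing the same kind of table in Python, continuing from the partition sketch above:

```python
# Fit ordinary least squares on the training partition and print the
# coefficient / std. error / t-statistic / p-value table.
import statsmodels.api as sm

X_train_const = sm.add_constant(X_train)      # adds the intercept term
model = sm.OLS(y_train, X_train_const).fit()
print(model.summary())
```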

11 Error reports
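A sketch of the kind of error report this slide refers to: accuracy measures on both partitions, continuing from the fitted model above. RMSE and MAE are shown as representative measures; the slide's exact output is not transcribed here:

```python
# Compare prediction error on the training and validation partitions.
import numpy as np
import statsmodels.api as sm

def error_report(X, y, label):
    pred = model.predict(sm.add_constant(X, has_constant="add"))
    resid = y - pred
    rmse = np.sqrt(np.mean(resid ** 2))  # root mean squared error
    mae = np.mean(np.abs(resid))         # mean absolute error
    print(f"{label}: RMSE = {rmse:.1f}, MAE = {mae:.1f}")

error_report(X_train, y_train, "Training")
error_report(X_valid, y_valid, "Validation")
```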

12 Predicted Values
- Predicted price computed using the regression coefficients
- Residual = difference between the actual and predicted price (see the sketch below the table)

Predicted Value   Actual Value   Residual
16952.1212        13500          -3452.12118
16243.6727        13750          -2493.67273
16828.9683        18600          1771.03168
20052.9734        20950          897.026598
20183.5395        19950          -233.539507
19251.3998        22500          3248.60019
19933.4558        22000          2066.54416
16566.3936        17950          1383.60641
15014.5103        16950          1935.48975
14950.8606        15950          999.139372
17106.644         16950          -156.644035
15653.1865        15950          296.813474
16118.44          16950          831.559982
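A brief sketch of producing such a table, continuing from the fitted model (the slide does not say which partition is shown; the validation partition is used here for illustration):

```python
# Predicted price, actual price, and residual (actual minus predicted).
import pandas as pd
import statsmodels.api as sm

pred = model.predict(sm.add_constant(X_valid, has_constant="add"))
table = pd.DataFrame({"Predicted": pred, "Actual": y_valid,
                      "Residual": y_valid - pred})
print(table.head(13))
```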

13 Distribution of Residuals
- Symmetric distribution
- Some outliers (see the plotting sketch below)
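The slide shows a histogram of the residuals; a sketch with matplotlib (an assumption, not named in the slides), continuing from the previous sketch:

```python
# Plot the distribution of the residuals to check symmetry and spot outliers.
import matplotlib.pyplot as plt

residuals = y_valid - pred
plt.hist(residuals, bins=25)
plt.xlabel("Residual (actual - predicted price)")
plt.ylabel("Frequency")
plt.title("Distribution of residuals")
plt.show()
```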

14 Selecting Subsets of Predictors
- Goal: find a parsimonious model (the simplest model that performs sufficiently well)
  - More robust
  - Higher predictive accuracy
- Exhaustive search
- Partial search algorithms:
  - Forward selection
  - Backward elimination
  - Stepwise

15 Exhaustive Search
- All possible subsets of predictors are assessed (singles, pairs, triplets, etc.)
- Computationally intensive
- Judge by "adjusted R²", which includes a penalty for the number of predictors (see the sketch below)
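A minimal sketch of exhaustive search, scored by adjusted R², continuing from the earlier sketches (with 11 predictors this fits 2^11 - 1 = 2047 models, which is feasible here but grows quickly with more predictors):

```python
# For each subset size k, keep the subset with the highest adjusted R-squared:
# adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where p is the number of predictors.
from itertools import combinations
import statsmodels.api as sm

best = {}
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        fit = sm.OLS(y_train, sm.add_constant(X_train[list(subset)])).fit()
        if k not in best or fit.rsquared_adj > best[k][0]:
            best[k] = (fit.rsquared_adj, subset)

for k, (adj_r2, subset) in sorted(best.items()):
    print(f"{k} predictors: adjusted R2 = {adj_r2:.4f}  {subset}")
```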

16 Forward Selection
- Start with no predictors
- Add predictors one by one (at each step, add the one with the largest contribution)
- Stop when the addition is not statistically significant (see the sketch below)
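A sketch of forward selection using p-values as the entry criterion; the 0.05 threshold is an assumption, since the slides do not give one. Backward elimination and stepwise follow the same pattern, removing instead of (or as well as) adding a predictor at each step:

```python
# Add, at each step, the candidate predictor with the smallest p-value,
# stopping once no remaining candidate is statistically significant.
import statsmodels.api as sm

selected, remaining = [], list(predictors)
while remaining:
    pvals = {}
    for cand in remaining:
        fit = sm.OLS(y_train, sm.add_constant(X_train[selected + [cand]])).fit()
        pvals[cand] = fit.pvalues[cand]
    best_cand = min(pvals, key=pvals.get)
    if pvals[best_cand] >= 0.05:  # addition not statistically significant
        break
    selected.append(best_cand)
    remaining.remove(best_cand)

print("Selected predictors:", selected)
```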

17 Backward Elimination
- Start with all predictors
- Successively eliminate the least useful predictor, one at a time
- Stop when all remaining predictors have a statistically significant contribution

18 Stepwise
- Like forward selection
- Except at each step, also consider dropping non-significant predictors

19 Backward Elimination (showing the last 7 models)
- The top model has a single predictor (Age_08_04)
- The second model has two predictors, etc.

20 All 12 Models

21 Diagnostics for the 12 Models
A good model has:
- High adjusted R²
- Cp close to the number of predictors (formula below)
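For reference, one common definition of Mallows' Cp (the software behind the slides may count the intercept slightly differently):

$$C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2_{\mathrm{full}}} - n + 2(p + 1)$$

where SSE_p is the sum of squared errors of the subset model with p predictors, sigma-hat squared (full) is the mean squared error of the full model, and n is the number of training records. A subset that fits about as well as the full model has Cp close to p + 1, the number of coefficients it estimates.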

22 Next Step
- Subset selection methods give candidate models that might be "good models"
- They do not guarantee that the "best" model is indeed best
- Also, the "best" model can still have insufficient predictive accuracy
- Must run the candidate models and assess their predictive accuracy (click "choose subset")

23 Model with Only 6 Predictors
- Model fit
- Predictive performance (compare to the 12-predictor model!)

24 Summary
- Linear regression models are very popular tools, not only for explanatory modeling but also for prediction
- A good predictive model has high predictive accuracy (to a useful practical level)
- Predictive models are built using a training data set and evaluated on a separate validation data set
- Removing redundant predictors is key to achieving predictive accuracy and robustness
- Subset selection methods help find "good" candidate models; these should then be run and assessed

