© Galit Shmueli and Peter Bruce 2010 Chapter 6: Multiple Linear Regression Data Mining for Business Analytics Shmueli, Patel & Bruce.

Slides:



Advertisements
Similar presentations
Chapter 5 Multiple Linear Regression
Advertisements

1 1 Chapter 5: Multiple Regression 5.1 Fitting a Multiple Regression Model 5.2 Fitting a Multiple Regression Model with Interactions 5.3 Generating and.
Automated Regression Modeling Descriptive vs. Predictive Regression Models Four common automated modeling procedures Forward Modeling Backward Modeling.
Chapter 3 – Data Exploration and Dimension Reduction © Galit Shmueli and Peter Bruce 2008 Data Mining for Business Intelligence Shmueli, Patel & Bruce.
Stat 112: Lecture 7 Notes Homework 2: Due next Thursday The Multiple Linear Regression model (Chapter 4.1) Inferences from multiple regression analysis.
Chapter 8 – Logistic Regression
6-1 Introduction To Empirical Models 6-1 Introduction To Empirical Models.
Multiple Linear Regression
Chapter 7 – Classification and Regression Trees
Chapter 7 – Classification and Regression Trees
MULTIPLE REGRESSION. OVERVIEW What Makes it Multiple? What Makes it Multiple? Additional Assumptions Additional Assumptions Methods of Entering Variables.
Statistics for Managers Using Microsoft® Excel 5th Edition
Statistics for Managers Using Microsoft® Excel 5th Edition
Part I – MULTIVARIATE ANALYSIS C3 Multiple Linear Regression II © Angel A. Juan & Carles Serrat - UPC 2007/2008.
Multiple Regression Involves the use of more than one independent variable. Multivariate analysis involves more than one dependent variable - OMS 633 Adding.
© 2003 Prentice-Hall, Inc.Chap 14-1 Basic Business Statistics (9 th Edition) Chapter 14 Introduction to Multiple Regression.
Lecture 6: Multiple Regression
1 Chapter 9 Variable Selection and Model building Ray-Bing Chen Institute of Statistics National University of Kaohsiung.
Stat 112: Lecture 8 Notes Homework 2: Due on Thursday Assessing Quality of Prediction (Chapter 3.5.3) Comparing Two Regression Models (Chapter 4.4) Prediction.
Lecture 11 Multivariate Regression A Case Study. Other topics: Multicollinearity  Assuming that all the regression assumptions hold how good are our.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 15-1 Chapter 15 Multiple Regression Model Building Basic Business Statistics 11 th Edition.
Chapter 15: Model Building
Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation.
Copyright ©2011 Pearson Education 15-1 Chapter 15 Multiple Regression Model Building Statistics for Managers using Microsoft Excel 6 th Global Edition.
Regression Model Building
Quantitative Business Analysis for Decision Making Multiple Linear RegressionAnalysis.
Multiple Linear Regression Response Variable: Y Explanatory Variables: X 1,...,X k Model (Extension of Simple Regression): E(Y) =  +  1 X 1 +  +  k.
Inference for regression - Simple linear regression
Copyright ©2011 Pearson Education, Inc. publishing as Prentice Hall 15-1 Chapter 15 Multiple Regression Model Building Statistics for Managers using Microsoft.
Lecture 12 Model Building BMTRY 701 Biostatistical Methods II.
Variable selection and model building Part II. Statement of situation A common situation is that there is a large set of candidate predictor variables.
Business Statistics, 4e, by Ken Black. © 2003 John Wiley & Sons Business Statistics, 4e by Ken Black Chapter 15 Building Multiple Regression Models.
Topic 14: Inference in Multiple Regression. Outline Review multiple linear regression Inference of regression coefficients –Application to book example.
Shonda Kuiper Grinnell College. Statistical techniques taught in introductory statistics courses typically have one response variable and one explanatory.
Week 6: Model selection Overview Questions from last week Model selection in multivariable analysis -bivariate significance -interaction and confounding.
1 Model Selection Response: Highway MPG Explanatory: 13 explanatory variables Indicator variables for types of car – Sports Car, SUV, Wagon, Minivan There.
Review of Building Multiple Regression Models Generalization of univariate linear regression models. One unit of data with a value of dependent variable.
Lesson Multiple Regression Models. Objectives Obtain the correlation matrix Use technology to find a multiple regression equation Interpret the.
Model Selection and Validation. Model-Building Process 1. Data collection and preparation 2. Reduction of explanatory or predictor variables (for exploratory.
Chapter 22: Building Multiple Regression Models Generalization of univariate linear regression models. One unit of data with a value of dependent variable.
1 Experimental Statistics - week 12 Chapter 12: Multiple Regression Chapter 13: Variable Selection Model Checking.
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 14-1 Chapter 14 Multiple Regression Model Building Statistics for Managers.
1 Chapter 4: Introduction to Predictive Modeling: Regressions 4.1 Introduction 4.2 Selecting Regression Inputs 4.3 Optimizing Regression Complexity 4.4.
Lesson 14 - R Chapter 14 Review. Objectives Summarize the chapter Define the vocabulary used Complete all objectives Successfully answer any of the review.
Basic Business Statistics, 10e © 2006 Prentice-Hall, Inc. Chap 15-1 Chapter 15 Multiple Regression Model Building Basic Business Statistics 10 th Edition.
1 Building the Regression Model –I Selection and Validation KNN Ch. 9 (pp )
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 14-1 Chapter 14 Multiple Regression Model Building Statistics for Managers.
Using SPSS Note: The use of another statistical package such as Minitab is similar to using SPSS.
Chapter 5 – Evaluating Predictive Performance Data Mining for Business Analytics Shmueli, Patel & Bruce.
DATA ANALYSIS AND MODEL BUILDING LECTURE 9 Prof. Roland Craigwell Department of Economics University of the West Indies Cave Hill Campus and Rebecca Gookool.
2011 Data Mining Industrial & Information Systems Engineering Pilsung Kang Industrial & Information Systems Engineering Seoul National University of Science.
Chapter 11 – Neural Nets © Galit Shmueli and Peter Bruce 2010 Data Mining for Business Intelligence Shmueli, Patel & Bruce.
Canadian Bioinformatics Workshops
Model selection and model building. Model selection Selection of predictor variables.
Yandell – Econ 216 Chap 15-1 Chapter 15 Multiple Regression Model Building.
Chapter 15 Multiple Regression Model Building
Chapter 9 Multiple Linear Regression
Multiple Linear Regression
Statistics in MSmcDESPOT
BUSI 410 Business Analytics
Business Statistics, 4e by Ken Black
S519: Evaluation of Information Systems
Chapter 6: Multiple Linear Regression
Prepared by Lee Revere and John Large
Regression Model Building
Regression Model Building
Linear Model Selection and regularization
Multiple Linear Regression
Regression and Categorical Predictors
Business Statistics, 4e by Ken Black
Presentation transcript:

© Galit Shmueli and Peter Bruce 2010 Chapter 6: Multiple Linear Regression Data Mining for Business Analytics Shmueli, Patel & Bruce

Topics Explanatory vs. predictive modeling with regression Example: prices of Toyota Corollas Fitting a predictive model Assessing predictive accuracy Selecting a subset of predictors

Explanatory Modeling Goal: Explain relationship between predictors (explanatory variables) and target Familiar use of regression in data analysis Model Goal: Fit the data well and understand the contribution of explanatory variables to the model “goodness-of-fit”: R 2, residual analysis, p-values

Predictive Modeling Goal: predict target values in other data where we have predictor values, but not target values Classic data mining context Model Goal: Optimize predictive accuracy Train model on training data Assess performance on validation (hold-out) data Explaining role of predictors is not primary purpose (but useful)

Example: Prices of Toyota Corolla ToyotaCorolla.xls Goal: predict prices of used Toyota Corollas based on their specification Data: Prices of 1442 used Toyota Corollas, with their specification information

Data Sample (showing only the variables to be used in analysis) PriceAgeKMFuel_TypeHP Metalli c Autom aticccDoorsQuart_TaxWeight Diesel Diesel Diesel Diesel Diesel Diesel Diesel Diesel Petrol Diesel Petrol

Variables Used Price in Euros Age in months as of 8/04 KM (kilometers) Fuel Type (diesel, petrol, CNG) HP (horsepower) Metallic color (1=yes, 0=no) Automatic transmission (1=yes, 0=no) CC (cylinder volume) Doors Quart_Tax (road tax) Weight (in kg)

Preprocessing Fuel type is categorical, must be transformed into binary variables Diesel (1=yes, 0=no) CNG (1=yes, 0=no) None needed for “Petrol” (reference category)

Subset of the records selected for training partition (limited # of variables shown) 60% training data / 40% validation data PriceAgeKMHPMetallic Automati cccDoorsQuart_TaxWeight Fuel_Typ e_CNG Fuel_Typ e_Diesel Fuel_Typ e_Petrol

The Fitted Regression Model Input Variables CoefficientStd. Errort-StatisticP-Value Intercept Age E-177 KM E-28 HP E-12 Metallic Automatic cc Doors Quart_Tax E-07 Weight E-28 Fuel_Type_CNG E-07 Fuel_Type_Diesel

Error reports

Predicted Values Predicted price computed using regression coefficients Residuals = difference between actual and predicted prices Predicted Value Actual Value Residual

Distribution of Residuals Symmetric distribution Some outliers

Selecting Subsets of Predictors Goal: Find parsimonious model (the simplest model that performs sufficiently well) More robust Higher predictive accuracy Exhaustive Search Partial Search Algorithms Forward Backward Stepwise

Exhaustive Search All possible subsets of predictors assessed (single, pairs, triplets, etc.) Computationally intensive Judge by “adjusted R 2 ” Penalty for number of predictors

Forward Selection Start with no predictors Add them one by one (add the one with largest contribution) Stop when the addition is not statistically significant

Backward Elimination Start with all predictors Successively eliminate least useful predictors one by one Stop when all remaining predictors have statistically significant contribution

Stepwise Like Forward Selection Except at each step, also consider dropping non- significant predictors

Backward elimination (showing last 7 models) Top model has a single predictor (Age_08_04) Second model has two predictors, etc.

All 12 Models

Diagnostics for the 12 models Good model has: High adj-R 2, Cp = # predictors

Next step Subset selection methods give candidate models that might be “good models” Do not guarantee that “best” model is indeed best Also, “best” model can still have insufficient predictive accuracy Must run the candidates and assess predictive accuracy (click “choose subset”)

Model with only 6 predictors Model Fit Predictive performance (compare to 12-predictor model!)

Summary Linear regression models are very popular tools, not only for explanatory modeling, but also for prediction A good predictive model has high predictive accuracy (to a useful practical level) Predictive models are built using a training data set, and evaluated on a separate validation data set Removing redundant predictors is key to achieving predictive accuracy and robustness Subset selection methods help find “good” candidate models. These should then be run and assessed.