2011 Data Mining Industrial & Information Systems Engineering Pilsung Kang Industrial & Information Systems Engineering Seoul National University of Science.


2011 Data Mining
Pilsung Kang, Industrial & Information Systems Engineering, Seoul National University of Science & Technology
Chapter 7: Multiple Linear Regression

Data Mining, IISE, SNUT
Steps in Data Mining revisited
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model

Prediction revisited
- Predict the selling price of a Toyota Corolla.
[Figure: data table with the independent variables (attributes, features) and the dependent variable (target) marked]

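Multiple Linear Regression
- Goal: fit a linear relationship between a quantitative dependent variable $Y$ and a set of predictors $X_1, X_2, \ldots, X_p$:
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$
  where $\beta_0, \ldots, \beta_p$ are the coefficients and $\epsilon$ is the unexplained part (noise).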

Multiple Linear Regression: Explanatory vs. Predictive
Explanatory Regression
- Explain the relationship between predictors (explanatory variables) and target.
- Familiar use of regression in data analysis.
- Model goal: fit the data well and understand the contribution of explanatory variables to the model.
- "Goodness-of-fit": $R^2$, residual analysis, p-values.
Predictive Regression
- Predict target values in other data where we have predictor values, but not target values.
- Classic data mining context.
- Model goal: optimize predictive accuracy.
- Train the model on training data.
- Assess performance on validation (hold-out) data.
- Explaining the role of predictors is not the primary purpose (but is useful).

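Multiple Linear Regression: Estimating the coefficients
- Ordinary least squares (OLS)
- Actual target: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon$
- Predicted target: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$
- Goal: minimize the difference between the actual and predicted targets, measured as the sum of squared differences $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.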

Multiple Linear Regression: Ordinary least squares, matrix solution
- $X$: n by p matrix, $y$: n by 1 vector, $\beta$: p by 1 vector
- Minimizing $(y - X\beta)^{T}(y - X\beta)$ gives $\hat{\beta} = (X^{T} X)^{-1} X^{T} y$.
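
The matrix solution can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's own code: the function name `fit_ols` and the data are made up, and a column of ones is prepended so that the intercept is part of β. The normal equations are solved directly rather than forming the inverse explicitly.

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients (intercept first) for predictors X and target y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    X1 = np.column_stack([np.ones(len(X)), X])
    # Solve the normal equations (X'X) beta = X'y without an explicit inverse.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

# Made-up, noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]
beta_hat = fit_ols(X, y)
print(np.round(beta_hat, 6))  # approximately [1. 2. 3.]
```

Because the example data contain no noise, the estimates recover the generating coefficients up to floating-point error.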

Multiple Linear Regression: Ordinary least squares
OLS finds the best estimates of β when the following conditions are satisfied:
- The noise ε follows a normal distribution.
- The linear relationship is correct.
- The cases are independent of each other.
- The variability in Y values for a given set of predictors is the same regardless of the values of the predictors (homoskedasticity).

Multiple Linear Regression: Example
- Predict the selling price of a Toyota Corolla.
[Figure: data table with the target column marked Y and the predictor columns marked X]

Multiple Linear Regression: Example
- Data preprocessing: create dummy variables for fuel types.

            Fuel_type = Diesel   Fuel_type = Petrol   Fuel_type = CNG
  Diesel            1                    0                    0
  Petrol            0                    1                    0

- Data partitioning: 60% training data / 40% validation data.
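
The two preprocessing steps can be sketched with the standard library alone. The helper names `fuel_dummies` and `partition`, the random seed, and the car records are all hypothetical; only the three fuel types and the 60/40 split come from the slide.

```python
import random

FUEL_TYPES = ["Diesel", "Petrol", "CNG"]

def fuel_dummies(fuel_type):
    """Return one 0/1 indicator per fuel type, in the order of FUEL_TYPES."""
    return [1 if fuel_type == level else 0 for level in FUEL_TYPES]

def partition(records, train_fraction=0.6, seed=0):
    """Shuffle the records and split them into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records: (fuel type, price). The values are made up.
cars = [("Diesel", 13500), ("Petrol", 9900), ("CNG", 8750), ("Petrol", 11250),
        ("Petrol", 10500), ("Diesel", 12950), ("Petrol", 9500), ("CNG", 8250),
        ("Petrol", 10950), ("Diesel", 14250)]
encoded = [fuel_dummies(fuel) + [price] for fuel, price in cars]
train, valid = partition(encoded)
print(fuel_dummies("Diesel"), len(train), len(valid))  # [1, 0, 0] 6 4
```

When the regression includes an intercept, one of the three indicators is usually dropped, because the three columns always sum to 1.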

Multiple Linear Regression: Fitted linear regression model
[Table: estimated coefficient (β) and p-value for each predictor]

Multiple Linear Regression: Actual & predicted targets

Prediction Performance: Example
- Predict a baby's weight (kg) based on his age.
[Table with columns: Age, Actual Weight (y), Predicted Weight (y')]

Prediction Performance: Average error
- Indicates whether the predictions are, on average, over- or under-predicted.

Prediction Performance: Mean absolute error (MAE)
- Gives the magnitude of the average error.

Prediction Performance: Mean absolute percentage error (MAPE)
- Gives a percentage score of how predictions deviate (on average) from the actual values.

Prediction Performance: (Root) mean squared error ((R)MSE)
- Standard error of estimate.
- RMSE is in the same units as the variable predicted.
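
The four measures above can be computed as follows. This is a sketch under stated assumptions: the error is taken as actual minus predicted (so a positive average error means under-prediction), the function name `performance` is hypothetical, and the weights are made-up numbers, not the baby-weight data from the slides.

```python
import math

def performance(actual, predicted):
    """Return average error, MAE, MAPE (in percent) and RMSE."""
    n = len(actual)
    # Assumed sign convention: error = actual - predicted.
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    return {
        "average_error": sum(errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100.0 * sum(abs(e / y) for e, y in zip(errors, actual)) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }

# Made-up actual and predicted weights (kg).
actual = [4.0, 5.0, 8.0, 10.0]
predicted = [5.0, 4.0, 10.0, 8.0]
scores = performance(actual, predicted)
print(scores)
```

In this example the over- and under-predictions cancel, so the average error is 0 while the MAE is 1.5 kg, which is why the average error alone is not enough to judge accuracy.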

Multiple Linear Regression: Performance evaluation

Multiple Linear Regression: Residual distribution

Variable Selection in Linear Regression
Why should we select a subset of variables?
- It may be expensive or not feasible to collect a full complement of predictors for future prediction.
- We may be able to measure fewer predictors more accurately (e.g. in surveys).
- More predictors, more missing values.
- Parsimony (a.k.a. Occam's Razor): the simpler, the better.
- Multicollinearity: presence of two or more predictors sharing the same linear relationship with the outcome variable.

Variable Selection in Linear Regression
Goal
- Find a parsimonious model (the simplest model that performs sufficiently well).
- More robust.
- Higher predictive accuracy.
Methods
- Exhaustive search
- Partial search: forward, backward, stepwise

Variable Selection in Linear Regression: Exhaustive search
- All possible subsets of predictors are assessed (single, pairs, triplets, etc.).
- Example: for three variables $X_1, X_2, X_3$, a total of seven combinations are evaluated: $\{X_1\}$, $\{X_2\}$, $\{X_3\}$, $\{X_1, X_2\}$, $\{X_2, X_3\}$, $\{X_1, X_3\}$, $\{X_1, X_2, X_3\}$.
- Adjusted $R^2$ is used as the performance criterion:
  $R^2_{adj} = 1 - \frac{n - 1}{n - p - 1} (1 - R^2)$
  where $n$ is the number of records and $p$ is the number of predictors.
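
An exhaustive search over three predictors can be sketched as below, scoring each subset by adjusted R². The helper names and the synthetic data (where only the first and third predictors matter) are assumptions made for illustration.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept and return adjusted R^2."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)

def exhaustive_search(X, y):
    """Score every non-empty subset of columns; return (best_subset, scores)."""
    scores = {}
    for size in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), size):
            scores[subset] = adjusted_r2(X[:, list(subset)], y)
    return max(scores, key=scores.get), scores

# Synthetic data: the target depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
best, scores = exhaustive_search(X, y)
print(len(scores), best)
```

With p predictors the search fits 2^p - 1 models, which is why partial search methods are used when p is large.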

Variable Selection in Linear Regression: Forward Selection
- Start with no predictors.
- Add them one by one (at each step, add the one with the largest contribution).
- Stop when the addition is not statistically significant.
- Example: for three variables [diagram of the candidate models built from X_1, X_2, X_3 at each step].
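
A forward selection loop can be sketched as follows. Note the substitution: instead of the significance test named on the slide, this sketch stops when adjusted R² no longer improves, which keeps the example dependent on NumPy only. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept and return adjusted R^2."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)

def forward_selection(X, y):
    """Greedily add the predictor that raises adjusted R^2 the most."""
    selected = []
    best_score = 0.0  # adjusted R^2 of the intercept-only model
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score each model obtained by adding one remaining predictor.
        scored = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scored)
        if score <= best_score:
            break  # assumed stopping rule: no further improvement
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score

# Synthetic data: the target depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
selected, score = forward_selection(X, y)
print(selected)
```

The predictors are returned in the order they entered the model, so the first entry is the one with the largest single contribution.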

Variable Selection in Linear Regression: Backward Elimination
- Start with all predictors.
- Successively eliminate the least useful predictors one by one.
- Stop when all remaining predictors have a statistically significant contribution.
- Example: for three variables [diagram of the candidate models obtained by removing predictors from X_1, X_2, X_3].
Variable Selection in Linear Regression: Stepwise Selection
- Like Forward Selection.
- Except that at each step, dropping non-significant predictors is also considered.
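
Backward elimination can be sketched in the same spirit. As a stand-in for the significance test on the slide, this version drops a predictor only if doing so does not lower adjusted R²; the function names and the synthetic data are assumptions.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept and return adjusted R^2."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)

def backward_elimination(X, y):
    """Start from all predictors and drop the least useful one at a time."""
    selected = list(range(X.shape[1]))
    best_score = adjusted_r2(X[:, selected], y)
    while len(selected) > 1:
        # Score each model obtained by removing one current predictor.
        scored = [(adjusted_r2(X[:, [k for k in selected if k != j]], y), j)
                  for j in selected]
        score, j = max(scored)
        if score < best_score:
            break  # assumed stopping rule: every removal hurts the fit
        selected.remove(j)
        best_score = score
    return selected, best_score

# Synthetic data: the target depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
kept, score = backward_elimination(X, y)
print(kept)
```

A stepwise variant would combine the two loops, checking after each addition whether any previously added predictor can now be dropped.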

Variable Selection in Linear Regression: Exhaustive search results

Variable Selection in Linear Regression: Backward elimination results

Prediction performance evaluation
- With six variables: model fit and predictive performance (compare to the 12-predictor model!)

Linear Regression: Summary
- Linear regression models are very popular tools, not only for explanatory modeling, but also for prediction.
- A good predictive model has high predictive accuracy (to a useful practical level).
- Predictive models are built using a training data set, and evaluated on a separate validation data set.
- Removing redundant predictors is key to achieving predictive accuracy and robustness.
- Subset selection methods help find "good" candidate models. These should then be run and assessed.