Lecture 25: Regression diagnostics for the multiple linear regression model; dealing with influential observations for multiple linear regression; interaction variables



Assumptions of Multiple Linear Regression Model Assumptions of multiple linear regression: –For each subpopulation defined by the explanatory variables X1, …, Xp: (A-1A) the mean of Y is a linear function of the explanatory variables, μ{Y|X1,…,Xp} = β0 + β1X1 + … + βpXp; (A-1B) the standard deviation of Y is constant; (A-1C) the distribution of Y is normal [the distribution of the residuals should not depend on X1, …, Xp] –(A-2) The observations are independent of one another

Checking/Refining Model Tools for checking (A-1A) and (A-1B) –Residual plots versus predicted (fitted) values –Residual plots versus explanatory variables –If model is correct, there should be no pattern in the residual plots Tool for checking (A-1C) – Histogram of residuals Tool for checking (A-2) –Residual plot versus time or spatial order of observations

Model Building (Display 9.9) 1.Make scatterplot matrix of variables (using analyze, multivariate). Decide on whether to transform any of the explanatory variables. Check for obvious outliers. 2.Fit tentative model. 3.Check residual plots for whether assumptions of multiple regression model are satisfied. Look for outliers and influential points. 4.Consider fitting richer model with interactions or curvature. See if extra terms can be dropped. 5.Make changes to model and repeat steps 2-4 until an adequate model is found.

Multiple regression, modeling and outliers, leverage and influential points Pollution Example The data set pollutionhc.JMP provides information about the relationship between pollution and mortality for 60 cities. The variables are y (MORT) = total age-adjusted mortality in deaths per 100,000 population; PRECIP = mean annual precipitation (in inches); EDUC = median number of school years completed for persons 25 and older; NONWHITE = percentage of the 1960 population that is nonwhite; HC = relative pollution potential of hydrocarbons (product of tons emitted per day per square kilometer and a factor correcting for SMSA dimension and exposure)

Transformations for Explanatory Variables In deciding whether to transform an explanatory variable x, we consider two features of the plot of the response y vs. the explanatory variable x. 1.Is there curvature in the relationship between y and x? This suggests a transformation chosen by Tukey’s Bulging rule. 2.Are most of the x values “crunched together” and a few very spread apart? This will lead to several points being very influential. When this is the case, it is best to transform x to make the x values more evenly spaced and less influential. If the x values are positive, the log transformation is a good idea. For the pollution data, reason 2 suggests transforming HC to log HC.
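Reason 2 above can be checked numerically. The following is a minimal numpy sketch (with simulated skewed data standing in for HC, not the actual pollution data) showing that a log transformation of a positive, "crunched together" predictor reduces the largest leverage:

```python
import numpy as np

def leverages(x):
    # Hat-matrix diagonals for a simple regression of y on x (with intercept).
    X = np.column_stack([np.ones_like(x), x])
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Skewed positive predictor: most values crunched together, a few far out
# (hypothetical data imitating the shape of HC in the pollution example).
rng = np.random.default_rng(4)
x = np.exp(rng.normal(0.0, 1.5, 60))

# After the log transformation, the values are more evenly spaced,
# so no single observation dominates the fit.
print(leverages(np.log(x)).max() < leverages(x).max())
```

The largest raw-scale observation carries most of the leverage; on the log scale the leverages are far more balanced, which is exactly why the slides recommend log HC.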

Residual vs. Predicted Plot Useful for detecting nonconstant variance; look for a fan or funnel pattern in the plot of residuals versus predicted values. For the pollution data, there is no strong indication of nonconstant variance.
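The quantities behind this plot can be computed directly. A minimal numpy sketch (using simulated data, not the pollution data set) of fitting by least squares and forming the residuals and fitted values:

```python
import numpy as np

# Hypothetical data: one predictor plus noise.
rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.uniform(10, 60, n)])  # intercept + predictor
y = 900 + 2.0 * X[:, 1] + rng.normal(0, 30, n)

# Least-squares fit, fitted values, and residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# With an intercept in the model, the residuals sum to (numerically) zero and
# are uncorrelated with the fitted values, so any fan shape in the
# residual-vs-fitted plot reflects nonconstant variance, not a bad centering.
print(abs(resid.sum()) < 1e-6)
```

Plotting `resid` against `fitted` reproduces JMP's residual-by-predicted plot.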

Residual plots vs. each explanatory variable Make plot of residuals vs. an explanatory variable by using Fit Model, clicking red triangle next to response, selecting Save Columns and selecting save residuals. This creates a column of residuals. Then click Analyze, Fit Y by X and put residuals in Y and the explanatory variable in X. Use these residual plots to check for pattern in the mean of residuals (suggests that we need to transform x or use a polynomial in x) or pattern in the variance of the residuals.

Residual plots look fine. No strong indication of nonlinearity or nonconstant variance.

Check of normality/outliers Normality looks okay. One residual outlier, Lancaster.

Influential Observations As in simple linear regression, one or two observations can strongly influence the estimates. It is harder to immediately see the influential observations in multiple regression. Use Cook's distances (Cook's D influence) to look for influential observations. An observation has large influence if its Cook's distance is greater than 1. Can use Table, Sort to sort observations by Cook's distance or leverage. For the pollution data: no observation has high influence.
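Cook's distance can be computed from the residuals and leverages with the standard formula D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2, where p is the number of fitted coefficients. A numpy sketch on simulated data (with one deliberately doctored observation, not the pollution data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
x = rng.normal(0, 1, n)
x[0] = 8.0                       # give the first observation high leverage...
y = 1 + 2 * x + rng.normal(0, 1, n)
y[0] += 10                       # ...and a large residual, making it influential
X = np.column_stack([np.ones(n), x])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
p = X.shape[1]                                    # coefficients, incl. intercept
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages (hat diagonals)
mse = e @ e / (n - p)
cooks_d = e**2 / (p * mse) * h / (1 - h)**2

# The doctored first observation dominates, with Cook's distance well above 1.
print(int(np.argmax(cooks_d)), cooks_d[0] > 1)
```

Sorting `cooks_d` in descending order mirrors the Table, Sort step described above.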

Strategy for dealing with influential observations Use Display 11.8. Leverage of a point: a measure of the distance between the point's explanatory variable values and the explanatory variable values in the entire data set. Two sources of influence: leverage and magnitude of residual. General approach: If an influential point has high leverage, omit the point and report conclusions for the reduced range of explanatory variables. If an influential point does not have high leverage, the point cannot simply be removed; we can report results with and without the point.

Leverage Obtaining leverages from JMP: After Fit Model, click red triangle next to Response, select Save Columns, Hats. Leverages are between 1/n and 1. Average leverage is p/n. An observation is considered to have high leverage if the leverage is greater than 2p/n where p=# of explanatory variables. For pollution data, 2p/n = (2*4)/60=.133
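The leverages JMP saves as "Hats" are the diagonals of the hat matrix H = X(X'X)^{-1}X'. A minimal numpy sketch with simulated data sized like the pollution example (n = 60, four explanatory variables):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 60, 4      # 60 cities, k = 4 explanatory variables, as in the slides
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design with intercept

# Hat-matrix diagonals (what JMP saves as "Hats").
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Each leverage lies between 1/n and 1; note the leverages sum to the number
# of fitted coefficients (here k + 1, since the intercept is counted too).
print(round(h.sum(), 6))

# Slides' rule of thumb: flag observations with leverage > 2p/n = 2*4/60 = .133.
high = np.flatnonzero(h > 2 * k / n)
```

With roughly balanced simulated predictors, few if any observations exceed the 2p/n cutoff, matching the pollution-data conclusion that no observation has high influence.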

Specially Constructed Explanatory Variables Interaction variables Squared and higher polynomial terms for curvature Dummy variables for categorical variables.

Interaction Interaction is a three-variable concept. One of these is the response variable (Y) and the other two are explanatory variables (X1 and X2). There is an interaction between X1 and X2 if the impact of an increase in X2 on Y depends on the level of X1. To incorporate interaction in the multiple regression model, we add the explanatory variable X1*X2 (the product of X1 and X2). There is evidence of an interaction if the coefficient on X1*X2 is significant (t-test has p-value < .05).
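The product-term idea can be sketched in numpy on simulated data (hypothetical, not the pollution data): add a column for X1*X2, fit by least squares, and form the usual t-statistic for that coefficient.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated truth includes an interaction: the effect of x2 depends on x1.
y = 1 + 2*x1 + 3*x2 + 1.5*x1*x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])   # add the product column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
p = X.shape[1]
mse = e @ e / (n - p)
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))  # coefficient standard errors
t_interaction = beta[3] / se[3]                      # t-statistic for x1*x2

# |t| well above ~2 corresponds to a p-value below .05: evidence of interaction.
print(abs(t_interaction) > 2)
```

This is the same test JMP reports for the crossed term after the Cross step described on the next slide.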

Interaction variables in JMP To add an interaction variable in Fit Model in JMP, add the usual explanatory variables first, then highlight X1 in the Select Columns box and X2 in the Construct Model Effects box. Then click Cross in the Construct Model Effects box. JMP creates the explanatory variable X1*X2.

Interaction Model for Pollution Data