Class 6: Tuesday, Sep. 28
Section 2.4. Checking the assumptions of the simple linear regression model:
– Residual plots
– Normal quantile plots
Outliers and influential observations

Checking the model
The simple linear regression model is a great tool, but its answers will only be useful if it is the right model for the data. We need to check the assumptions before using the model.
Assumptions of the simple linear regression model:
1. Linearity: The mean of Y|X is a straight line.
2. Constant variance: The standard deviation of Y|X is constant.
3. Normality: The distribution of Y|X is normal.
4. Independence: The observations are independent.

Checking that the mean of Y|X is a straight line
1. Scatterplot: Look at whether the mean of Y given X appears to increase or decrease in a straight line.

Residual Plot
Residuals: The prediction error from using the regression to predict Y_i for observation i: e_i = Y_i − Ŷ_i, where Ŷ_i = b_0 + b_1*X_i.
Residual plot: A plot with the residuals on the y-axis and the explanatory variable (or some other variable) on the x-axis.
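(The course uses JMP, but the same computation is easy to sketch outside it. A minimal Python example on synthetic stand-in data, since the course data are not reproduced here:)

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic stand-in data: X = explanatory variable, Y = response.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, 50)
    Y = 3 + 2 * X + rng.normal(0, 1, 50)

    # Least-squares fit; polyfit returns the highest-degree coefficient first.
    b1, b0 = np.polyfit(X, Y, 1)
    Y_hat = b0 + b1 * X

    # Residuals: prediction errors e_i = Y_i - Yhat_i.
    residuals = Y - Y_hat

    # Residual plot: residuals on the y-axis, X on the x-axis.
    plt.scatter(X, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("X")
    plt.ylabel("Residual")
    plt.show()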

Residual Plot in JMP: After doing Fit Line, click the red triangle next to Linear Fit and then click Plot Residuals.
What should the residual plot look like if the simple linear regression model holds? Under the simple linear regression model, the residuals should have approximately a normal distribution with mean zero and a standard deviation that is the same for all X. The residuals should appear as a "swarm" of points randomly scattered about zero; ideally, you should not be able to detect any patterns. (Try not to read too much into these plots – you're looking for gross departures from a random scatter.) A pattern in which the residuals tend to be greater than zero, or tend to be less than zero, over a certain range of X indicates that the mean of Y|X is not a straight line.

Checking Constant Variance
Use the residual plot of residuals vs. X to check the constant variance assumption.
Constant variance: The spread of the residuals is similar for all ranges of X.
Nonconstant variance: The spread of the residuals differs for different ranges of X.
– Fan-shaped plot: The residuals increase in spread as X increases.
– Horn-shaped plot: The residuals decrease in spread as X increases.
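(To see what a fan shape looks like, here is a small Python sketch that simulates data whose error standard deviation grows with X – synthetic data, not from the course:)

    import numpy as np
    import matplotlib.pyplot as plt

    # Simulated nonconstant variance: the error sd grows with X.
    rng = np.random.default_rng(1)
    X = rng.uniform(1, 10, 200)
    Y = 3 + 2 * X + rng.normal(0, 0.5 * X)   # sd proportional to X

    b1, b0 = np.polyfit(X, Y, 1)
    residuals = Y - (b0 + b1 * X)

    # The residual plot fans out: spread increases as X increases.
    plt.scatter(X, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("X")
    plt.ylabel("Residual")
    plt.show()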

Checking Normality
If the distribution of Y|X is normal, then the residuals should have approximately a normal distribution. To check normality, make a histogram and a normal quantile plot of the residuals.
In JMP, after using Fit Line, click the red triangle next to Linear Fit and click Save Residuals. Click Analyze, Distribution, put Residuals in Y, click OK, and then, after the histogram appears, click the red triangle next to Residuals and click Normal Quantile Plot.

Normal Quantile Plot (Section 1.3)
The most useful tool for assessing normality. A plot of the residuals (or whatever variable is being checked for normality) on the y-axis versus the z-score of each data point's percentile on the x-axis. If the true distribution is normal, the normal quantile plot will be approximately a straight line; deviations from a straight line indicate that the distribution is not normal.
The dotted red lines are "confidence bands." If all the points lie inside the confidence bands, then we feel that the normality assumption is reasonable.
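(For readers working outside JMP, a minimal Python sketch of the construction just described – sorted values vs. z-scores of their percentiles. The residuals here are a stand-in normal sample so the sketch runs on its own:)

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    # Stand-in residuals (a normal sample) so the sketch is self-contained;
    # in practice these would be the saved residuals from the fitted line.
    rng = np.random.default_rng(2)
    residuals = rng.normal(0, 1, 50)

    ordered = np.sort(residuals)
    n = len(ordered)
    pctl = (np.arange(1, n + 1) - 0.5) / n   # percentile of each data point
    z = stats.norm.ppf(pctl)                 # z-score of that percentile

    # Points fall near a straight line if the residuals are normal.
    plt.scatter(z, ordered)
    plt.xlabel("z-score of percentile")
    plt.ylabel("Residual")
    plt.show()

    # scipy can draw essentially the same plot in one call:
    # stats.probplot(residuals, dist="norm", plot=plt)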

Independence
In a problem where the data are collected over time, plot the residuals vs. time. If the simple linear regression model holds, there should be no pattern in the residuals over time. A pattern in which the residuals are higher (or lower) in the early part of the data than in the later part indicates that the relationship between Y and X is changing over time, and might indicate that there is a lurking variable.
Lurking variable: A variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.

Residual vs. Time Example
The mathematics department at a large state university must plan the number of instructors required for large elementary courses, and wants to predict enrollment in elementary math courses (y) from the number of first-year students (x). Data in mathenroll.JMP.
Residual plot vs. time in JMP: After Fit Y by X and Fit Line, click the red triangle next to Linear Fit and click Save Residuals. Then use Fit Y by X with y = residuals and x = year.
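(The same diagnostic as a Python sketch. The mathenroll.JMP values are not reproduced here, so the arrays below are hypothetical stand-ins in time order:)

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical stand-ins for mathenroll.JMP, in time order:
    # x = number of first-year students, y = elementary math enrollment.
    year = np.arange(1993, 2001)
    x = np.array([4400., 4300., 4500., 4600., 4500., 4700., 4800., 4900.])
    y = np.array([7100., 7000., 7300., 7400., 7300., 8100., 8300., 8500.])

    b1, b0 = np.polyfit(x, y, 1)
    residuals = y - (b0 + b1 * x)

    # Residuals vs. time: residuals shifting from mostly negative to mostly
    # positive partway through suggest the Y-X relationship changed over time.
    plt.scatter(year, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Year")
    plt.ylabel("Residual")
    plt.show()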

Residual Plots

Analysis of Math Enrollment
The residual plot versus time order indicates that there must be a lurking variable associated with time; in particular, there is a change in the relationship between y and x between 1997 and 1998. In fact, one of the schools in the university changed its program to require that entering students take another mathematics course beginning in 1998, increasing enrollment.
Implication: Data from before 1998 should not be used to predict future math enrollment.

What to Do About Violations of the Simple Linear Regression Model
Coming up in the future:
– Nonlinearity: Transformations (Chapter 2.6), polynomial regression (Chapter 11)
– Nonconstant variance: Transformations (Chapter 2.6)
– Nonnormality: Transformations (Chapter 2.6)
– Lack of independence: Incorporate time into multiple regression (Chapter 11), time series techniques (Stat 202)

Outliers and Influential Observations
Outlier: Any really unusual observation.
– Outlier in the X direction (called a high leverage point): Has the potential to influence the regression line.
– Outlier in the direction of the scatterplot: An observation that deviates from the overall pattern of the relationship between Y and X. Typically has a residual that is large in absolute value.
Influential observation: A point that, if removed, would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential.

Housing Prices and Crime Rates
A community in the Philadelphia area is interested in how crime rates are associated with property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection with gains in tax revenue from higher property values. The town council looked at a recent issue of Philadelphia Magazine (April 1996) and found data for itself and 109 other communities in Pennsylvania near Philadelphia. The data are in philacrimerate.JMP. House price = average house price for sales during the most recent year; crime rate = rate of crimes per 1,000 population.

Which points are influential? Center City Philadelphia is influential; Gladwyne is not. In general, points that have high leverage are more likely to be influential.

Formal measures of leverage and influence
Leverage: "Hat values" (JMP calls them Hats).
Influence: Cook's distance (JMP calls them Cook's D Influence).
To obtain them in JMP, click Analyze, Fit Model, put the Y variable in Y and the X variable in the Model Effects box. Click Run Model. After the model is fit, click the red triangle next to Response, click Save Columns, and then click Hats for leverages and Cook's D Influences for Cook's distances. To sort observations by Cook's distance or leverage, click Tables, Sort, and then put the variable you want to sort by in the By box.

Center City Philadelphia has both high influence (Cook's distance much greater than 1) and high leverage (hat value > 3*2/99 = 0.06). No other observations have high influence or high leverage.

Rules of Thumb for High Leverage and High Influence
High leverage: Any observation with a leverage (hat value) greater than (3 × number of coefficients in the regression model)/n has high leverage, where the number of coefficients is 2 for simple linear regression and n is the number of observations.
High influence: Any observation with a Cook's distance greater than 1 has high influence.
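(Both rules of thumb are easy to apply outside JMP as well. A Python sketch using statsmodels, with synthetic data standing in for philacrimerate.JMP since the real data are not reproduced here:)

    import numpy as np
    import statsmodels.api as sm

    # Synthetic stand-in for philacrimerate.JMP; one extreme X value
    # plays the Center City role (an outlier in the X direction).
    rng = np.random.default_rng(3)
    crime_rate = np.append(rng.uniform(10, 70, 98), 360.0)
    house_price = 225000 - 600 * crime_rate + rng.normal(0, 20000, 99)

    X = sm.add_constant(crime_rate)      # design matrix with intercept
    fit = sm.OLS(house_price, X).fit()

    infl = fit.get_influence()
    leverage = infl.hat_matrix_diag      # JMP's "Hats"
    cooks_d = infl.cooks_distance[0]     # JMP's Cook's D Influence

    n, k = len(house_price), 2           # k = # coefficients (simple linear regression)
    print("high leverage:", np.where(leverage > 3 * k / n)[0])
    print("high influence:", np.where(cooks_d > 1)[0])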

What to Do About Suspected Influential Observations? See the flowchart handout.
Does removing the observation change the substantive conclusions? If not, you can say something like "Observation x has high influence relative to all other observations, but we tried refitting the regression without Observation x and our main conclusions didn't change."

If removing the observation does change the substantive conclusions, is there any reason to believe the observation belongs to a population other than the one under investigation?
– If yes, omit the observation and proceed.
– If no, does the observation have high leverage (i.e., is it an outlier in the explanatory variable)?
– If yes, omit the observation and proceed. Report that the conclusions apply only to a limited range of the explanatory variable.
– If no, not much can be said. More data (or clarification of the influential observation) are needed to resolve the questions.
General principle: Delete observations from the analysis sparingly – only when there is good cause (the observation does not belong to the population being investigated, or it is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.

Summary
– Before using the simple linear regression model, we need to check its assumptions. Check linearity, constant variance, normality, and independence using the scatterplot, residual plot, and normal quantile plot.
– Influential observations: observations that, if removed, would have a large influence on the fitted regression model. Examine influential observations, remove them only with cause (the observation belongs to a different population than the one being studied, or has high leverage), and explain why you deleted them.
– Next class: Lurking variables, causation (Sections 2.4, 2.5).