Diagnostics – Part I: Using plots to check whether the assumptions we made about the model are realistic.

Presentation transcript:

Diagnostics – Part I: Using plots to check whether the assumptions we made about the model are realistic

Diagnostic methods: Some simple (but subjective) plots (now). Formal statistical tests (next).

Review of some simple plots … while checking scope of model

Dot Plot

Summarizes quantitative data. Horizontal axis represents measurement scale. Plot one dot for each data point.

Stem-and-Leaf Plot [Figure: Minitab stem-and-leaf display of Shoes, N = 139]

Stem-and-Leaf Plot Summarizes quantitative data. Each data point is broken down into a “stem” and a “leaf.” First, “stems” are aligned in a column. Then, “leaves” are attached to the stems.
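The stem/leaf split described above can be sketched in a few lines. This is a minimal illustration, not Minitab's algorithm: the function name `stem_and_leaf`, the `leaf_unit` parameter, and the data values are all hypothetical, and the sketch assumes non-negative values.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return (stem, leaves) rows for non-negative values (illustrative sketch)."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)      # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)   # last remaining digit is the leaf
        rows[stem].append(str(leaf))
    # List every stem between the smallest and largest so gaps stay visible
    lo, hi = min(rows), max(rows)
    return [(s, "".join(rows.get(s, []))) for s in range(lo, hi + 1)]

data = [12, 15, 21, 23, 23, 38, 40, 41]   # hypothetical measurements
plot_rows = stem_and_leaf(data)
for stem, leaves in plot_rows:
    print(f"{stem:>3} | {leaves}")
```

Stems are aligned in a column first and the sorted leaves are attached afterwards, which mirrors the two steps on the slide.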

Box Plot

Summarizes quantitative data. Vertical (or horizontal) axis represents measurement scale. Lines in box represent the 25th percentile (“first quartile”), the 50th percentile (“median”), and the 75th percentile (“third quartile”), respectively.

Box Plot (cont’d) “Whiskers” are drawn to the most extreme data points that are not more than 1.5 times the length of the box beyond either quartile. Whiskers are useful for identifying outliers. “Outliers,” or extreme observations, are denoted by asterisks. Generally, data points falling beyond the whiskers are considered outliers.
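The whisker rule can be checked numerically. The sketch below is an assumption-laden illustration: `boxplot_summary` and the data are hypothetical, and the quartiles come from Python's `statistics.quantiles` with the "inclusive" method, which may differ slightly from the quartile definition a given package (including Minitab) uses.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 x box-length rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1
    lo_fence = q1 - 1.5 * box_length
    hi_fence = q3 + 1.5 * box_length
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    outliers = sorted(v for v in values if v < lo_fence or v > hi_fence)
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers stop at the most extreme points still inside the fences
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

data = [2, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical data with one extreme value
summary = boxplot_summary(data)
print(summary)
```

Points beyond the whiskers are only candidates for investigation; the rule is a convention, not a test.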

Okay, now the really new stuff…

Simple linear regression model The response $Y_i$ is a function of a systematic linear component and a random error component: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, with assumptions that: Error terms have mean 0, i.e., $E(\varepsilon_i) = 0$. $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated (independent). Error terms have the same variance, i.e., $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Error terms $\varepsilon_i$ are normally distributed.

Why should we keep nagging ourselves about the model? All of the estimates, confidence intervals, prediction intervals, hypothesis tests, etc. have been developed assuming that the model is correct. If the model is incorrect, then the formulas and methods we use are at risk of being incorrect. (Some are more forgiving than others.)

Things that can go wrong with the model Regression function is not linear. Error terms do not have constant variance. Error terms are not independent. The model fits all but one or a few outlier observations. Error terms are not normally distributed. Important predictor variable(s) has been left out of the model.

Residual analysis … the basic idea We would expect the observed residuals, $e_i = Y_i - \hat{Y}_i$, to reflect the properties assumed for the unknown true error terms, $\varepsilon_i = Y_i - E(Y_i)$. So, investigate the observed residuals to see if they behave “properly.”

Some points of clarification about residuals The mean of the residuals, e-bar, is 0. So, no need to check that the mean of the residuals is 0 – the LS estimation method has made it so. The residuals are not independent, since they are all a function of the same estimated regression function.
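Both points can be verified on a small example. The sketch below fits a least squares line by hand (the helper `fit_line` and the data are hypothetical) and checks the two linear constraints that the residuals satisfy: they sum to 0, and they are orthogonal to the predictor. Those constraints are why the residuals cannot be independent.

```python
def fit_line(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1, 2, 3, 4, 5]                  # hypothetical predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]       # hypothetical responses

b0, b1 = fit_line(x, y)
fits = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fits)]

mean_resid = sum(residuals) / len(residuals)            # 0 up to rounding
cross_sum = sum(e * xi for e, xi in zip(residuals, x))  # also 0 up to rounding
print(mean_resid, cross_sum)
```

With n observations and two constraints, only n - 2 residuals are free to vary, which matches the n - 2 degrees of freedom used for the error variance estimate.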

Example: Alcohol consumption (X) and Arm muscle strength (Y)

A well-behaved “residuals vs. fits” plot

Characteristics of a well-behaved “residuals versus fits” plot The residuals “bounce randomly” around the 0 line. (A linear relationship is reasonable.) No one residual “stands out” from the basic random pattern of residuals. (No outliers.) The residuals roughly form a “horizontal band” around the 0 line. (Constant variance.)

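In simple linear regression, a “residuals versus predictor” plot offers nothing different, because the fitted values are a linear function of the single predictor.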

Example: Is tire tread wear linearly related to mileage? Data columns: mileage, groove. X = mileage in 1000 miles; Y = groove depth in mils (0.001 inches).

Example: Is tire tread wear linearly related to mileage?

A “residual versus fits” plot suggesting relationship is not linear

How a non-linear function shows up on a “residual versus fits” plot The residuals depart from 0 in a systematic fashion, such as being positive for small X values, negative for medium X values, and positive again for large X values.
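The systematic sign pattern is easy to reproduce. In this sketch a straight line is fitted to data that are actually quadratic (hypothetical values, not the tire data), and the signs of the residuals are printed in order of X. The helper names `fit_line` and `sign_pattern` are assumptions for illustration.

```python
def fit_line(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def sign_pattern(residuals, tol=1e-9):
    """One character per residual: '+', '-', or '0' (within tol of zero)."""
    return "".join("+" if e > tol else "-" if e < -tol else "0" for e in residuals)

x = [0, 1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]            # curved relationship, fitted with a line anyway

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
pattern = sign_pattern(residuals)
print(pattern)
```

The residuals are positive at both ends and negative in the middle rather than bouncing randomly around 0, which is the signature described on the slide.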

Example: How is plutonium activity related to alpha particle counts?

A residual versus fits plot suggesting non-constant error variance

How non-constant error variance shows up on a “residual vs. fits” plot The plot has a “fanning” effect, such as the residuals being close to 0 for small X values and being much more spread out for large X values. The “fanning” effect can also be in the reverse direction. Or, the spread of the residuals can vary in some complex fashion.
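As an informal numeric companion to the plot (not a formal test, which Part II covers), one could compare the spread of the residuals for small and large fitted values. The function `spread_by_half` and the residuals below are hypothetical.

```python
import statistics

def spread_by_half(fits, residuals):
    """Sample standard deviation of residuals in the lower and upper half of the fits."""
    pairs = sorted(zip(fits, residuals))   # order the residuals by fitted value
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[-half:]]
    return statistics.stdev(low), statistics.stdev(high)

fits = [1, 2, 3, 4, 5, 6, 7, 8]                              # hypothetical fitted values
residuals = [0.1, -0.1, 0.2, -0.2, 1.5, -1.8, 2.2, -1.9]     # spread grows with the fit

sd_low, sd_high = spread_by_half(fits, residuals)
print(sd_low, sd_high)
```

A much larger spread in one half than in the other is the numeric counterpart of the “fanning” effect; the plot itself remains the better tool for spotting more complex patterns.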

Example: Relationship between tobacco use and alcohol use? Data from the Family Expenditure Survey of the British Department of Employment, with columns Region, Alcohol, and Tobacco for 11 regions: North, Yorkshire, Northeast, East Midlands, West Midlands, East Anglia, Southeast, Southwest, Wales, Scotland, Northern Ireland. X = average weekly expenditure on tobacco; Y = average weekly expenditure on alcohol.

Example: Relationship between tobacco use and alcohol use?

A “residual versus fits” plot suggesting an outlier exists (one point is labeled “outlier”).

How large does a residual need to be before being flagged? The magnitude of the residuals depends on the units of the response variable. Make the residuals “unitless” by dividing by their standard deviation. That is, use “standardized residuals.” Then, an observation with a standardized residual greater than 2 or smaller than -2 should be flagged for further investigation.
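A minimal sketch of the calculation, assuming the usual definition for simple linear regression: each residual is divided by its estimated standard deviation, $s\sqrt{1 - h_{ii}}$, where $s^2 = SSE/(n-2)$ and $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The function name and the data are hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Residuals divided by their estimated standard deviations (simple linear regression)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))   # estimate of sigma
    out = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx               # leverage of observation i
        out.append(e / (s * math.sqrt(1 - h)))
    return out

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                                # hypothetical data
y = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 25.0, 16.0, 18.1, 19.9]        # observation 7 is unusual

std_resid = standardized_residuals(x, y)
flagged = [i for i, r in enumerate(std_resid, start=1) if abs(r) > 2]
print(flagged)
```

Because the result is unitless, the same cutoff of 2 in absolute value applies whatever the units of the response.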

Standardized residuals versus fits plot

Minitab identifies observations with large standardized residuals in its “Unusual Observations” table (columns: Obs, Tobacco, Alcohol, Fit, SE Fit, Resid, St Resid), where R denotes an observation with a large standardized residual.

Anscombe data set #3

A “residual versus fits” plot suggesting an outlier exists.

How an outlier shows up on a “residuals vs. fits” plot The observation’s residual stands apart from the basic random pattern of the rest of the residuals. The random pattern of the residual plot can even disappear if one outlier really deviates from the straight line of the rest of the data.

Other simple plots that might help spot an outlier: boxplots, stem-and-leaf plots, dotplots.

Boxplot of residuals for Alcohol (Y) and Tobacco (X) example

Dotplot of residuals for Alcohol (Y) and Tobacco (X) example

“Residuals vs. order” plots to assess non-independence of error terms If the data are obtained in a time (or space) sequence, a “residuals vs. order” plot helps to see if there is any correlation between error terms that are near each other in the sequence. A horizontal band bouncing randomly around 0 suggests errors are independent, while a systematic pattern suggests not.
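One informal way to put a number on what the plot shows is the lag-1 correlation between each residual and the one before it in the sequence. This is a sketch only: the function name and both residual sequences are hypothetical, and the statistic is a rough companion to the plot, not one of the formal tests.

```python
def lag1_autocorrelation(residuals):
    """Correlation between residuals that are adjacent in time (or space) order."""
    n = len(residuals)
    mean = sum(residuals) / n
    num = sum((residuals[t] - mean) * (residuals[t - 1] - mean) for t in range(1, n))
    den = sum((e - mean) ** 2 for e in residuals)
    return num / den

trending = [-3, -2, -1, 0, 1, 2, 3]      # neighbors resemble each other
alternating = [1, -1, 1, -1, 1, -1]      # neighbors have opposite signs

r_trend = lag1_autocorrelation(trending)
r_alt = lag1_autocorrelation(alternating)
print(r_trend, r_alt)
```

Values near 0 are consistent with a random horizontal band, while clearly positive or negative values correspond to the systematic patterns that suggest non-independence.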

“Residuals vs order” plots suggesting non-independence of error terms

Normal probability plot to assess normality of error terms Plot of residuals on horizontal axis against expected values of the residuals under normality (normal scores) on vertical axis. Plot that is nearly linear suggests normality of error terms.
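The plotting positions can be computed without any plotting library. The sketch below pairs the ordered residuals with normal scores obtained from the approximation $(i - 0.375)/(n + 0.25)$, which is one common choice and an assumption here; other packages may use slightly different positions. Function names and residuals are hypothetical.

```python
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected standard normal order statistics for a sample of size n."""
    std_normal = NormalDist()            # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def probability_plot_points(residuals):
    """(residual, normal score) pairs: horizontal and vertical coordinates."""
    ordered = sorted(residuals)
    return list(zip(ordered, normal_scores(len(ordered))))

residuals = [0.4, -1.2, 0.1, 1.5, -0.6]   # hypothetical residuals
points = probability_plot_points(residuals)
for e, z in points:
    print(f"{e:6.2f}  {z:6.3f}")
```

If the points fall close to a straight line, the normality assumption for the error terms is reasonable.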

Normal probability plot interpretation [Figure: three example panels labeled skewed right, skewed left, and normal]

Normal probability plot for Alcohol (X) and Strength (Y) example

Normal probability plot for Tree diameter (X) and C-dating Age (Y)

“Residuals vs omitted predictors” plots To determine whether there are any other key variables that could provide additional predictive power to the response. Look for systematic patterns. If the plot reveals that the residuals vary systematically, we don’t say the original model is wrong. It’s just that it can be improved.

“Residuals vs omitted” plot

In summary, …

Nonlinearity of regression function: Scatter plot of response versus predictor. (Standardized) residuals versus fits plot. (Standardized) residuals versus predictor plot.

Nonconstancy of error variance: (Standardized) residuals versus fits plot. (Standardized) residuals versus predictor plot.

Presence of outliers: (Standardized) residuals versus fits plot. (Standardized) residuals versus predictor plot. Box plots, stem-and-leaf plots, dot plots of (standardized) residuals.

Non-independence of error terms: (Standardized) residuals versus order plot.

Non-normality of error terms: Normal probability plots. Box plots, dot plots, stem-and-leaf plots. Mean far from median?

Residual vs … plots in Minitab Stat >> Regression >> Regression. Specify predictor and response. Under Graphs…, specify whether regular or standardized residuals desired. Select which residual plots are desired. If residual versus predictor plot desired, specify predictor in box. Select OK.

Boxplots, dotplots, etc. of residuals Stat >> Regression >> Regression … Specify predictor and response. Under Storage…, select residuals and/or standardized residuals. They will be stored in the worksheet. Then … Graph >> Boxplot… or Graph >> Dotplot… or Graph >> Stemleaf…