Diagnostics and Transformation for SLR

Diagnostics and Transformation for SLR

In general, it makes sense to base inference and conclusions only on valid models, so we need to make sure we are fitting an appropriate model. For this we need to plot the data. Example…

Influential Points, Outliers and Leverage Points

Observations whose inclusion or exclusion results in substantial changes in the fitted model are said to be influential. A point can be outlying in the value of its explanatory variable, its response value, or its residual (or any combination of these). An outlier with respect to the residual represents model failure, i.e., the line does not fit that point adequately; such points are typically also outliers with respect to the response variable. Outliers with respect to the explanatory variable are called leverage points. They may be influential: they can largely determine the regression coefficients and can cause the standard errors of the coefficients to be smaller than they would be if the point were removed. The textbook distinguishes between "good" leverage points, which follow the pattern of the data, and "bad" leverage points, which are influential.

Quantifying Leverage

To determine whether a point is a leverage point we calculate its leverage h_ii, the ith diagonal element of the hat matrix H = X(XᵀX)⁻¹Xᵀ. In simple linear regression this reduces to h_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)², so leverage grows as x_i moves away from x̄. The average leverage is 2/n; a common rule of thumb flags observations whose leverage is more than about twice this average.
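A minimal sketch of computing the leverages from the formula above (NumPy assumed available; the data values are made up for illustration, with one x far from the rest):

```python
import numpy as np

# Made-up data; the last x value is deliberately far from the others
x = np.array([26, 28, 30, 32, 34, 36, 38, 60], dtype=float)
y = np.array([5.8, 3.9, 3.2, 2.1, 1.4, 1.0, 0.5, 0.3])

n = len(x)
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverage of each point

avg_leverage = 2.0 / n                     # average of the h_ii is 2/n in SLR
print(np.round(h, 3))
print("possible leverage points:", np.where(h > 2 * avg_leverage)[0])
```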

Measuring Influence of the ith Observation

There are three main measures for assessing the influence of an observation. Each uses a different aspect of the fitted model. Notation: the subscript (i) indicates that the ith observation has been deleted from the data and the regression re-fit using the remaining n − 1 data points.

Measurement I for Influence – DFBETAS

This measure examines how the estimates of β0 and β1 change with and without the ith observation. For coefficient k it is the standardized difference in the estimates, DFBETAS_k(i) = (β̂_k − β̂_k(i)) / sqrt(MSE_(i) · c_kk), where c_kk is the kth diagonal element of (XᵀX)⁻¹. Interpretation: a large absolute value means that deleting observation i noticeably shifts that coefficient; common cutoffs are 1 for small data sets and 2/√n for large ones.
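A rough leave-one-out sketch of DFBETAS for simple linear regression (NumPy assumed; the data values are hypothetical):

```python
import numpy as np

x = np.array([26, 28, 30, 32, 34, 36, 38], dtype=float)
y = np.array([5.8, 3.9, 3.2, 2.1, 1.4, 1.0, 0.5])   # made-up responses

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
beta = np.linalg.lstsq(X, y, rcond=None)[0]          # full-data estimates (b0, b1)
c = np.diag(np.linalg.inv(X.T @ X))                  # diagonal of (X'X)^(-1)

dfbetas = np.zeros((n, p))
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    resid_i = y[keep] - X[keep] @ beta_i
    mse_i = resid_i @ resid_i / (n - 1 - p)          # MSE from the fit without point i
    dfbetas[i] = (beta - beta_i) / np.sqrt(mse_i * c)

print(np.round(dfbetas, 3))   # column 0: effect on intercept, column 1: on slope
```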

Measurement II for Influence – DFFITS

This measure examines how the ith predicted value changes with and without the ith observation in the model. It is the standardized difference in predicted values, DFFITS_i = (ŷ_i − ŷ_i(i)) / sqrt(MSE_(i) · h_ii). Interpretation: a large absolute value means that observation i pulls its own fitted value strongly toward itself; a common cutoff is |DFFITS_i| > 2√(p/n), where p is the number of regression parameters (p = 2 in simple linear regression).
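A sketch of DFFITS using the standard identity DFFITS_i = t_i · sqrt(h_ii / (1 − h_ii)), where t_i is the externally studentized residual (NumPy assumed; data values are made up):

```python
import numpy as np

x = np.array([26, 28, 30, 32, 34, 36, 38], dtype=float)
y = np.array([5.8, 3.9, 3.2, 2.1, 1.4, 1.0, 0.5])   # made-up responses

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T                 # hat matrix
h = np.diag(H)
e = y - H @ y                                        # residuals
mse = e @ e / (n - p)
mse_del = ((n - p) * mse - e ** 2 / (1 - h)) / (n - p - 1)   # leave-one-out MSE
t = e / np.sqrt(mse_del * (1 - h))                           # externally studentized residuals
dffits = t * np.sqrt(h / (1 - h))

print(np.round(dffits, 3))
print("flagged:", np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0])
```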

Measurement III for Influence – Cook's Distance

Cook's distance measures how much the fit at all n points changes with and without the ith observation in the data. It is defined by D_i = Σ_j (ŷ_j − ŷ_j(i))² / (p · MSE), which can also be written as D_i = e_i² h_ii / (p · MSE · (1 − h_ii)²). Interpretation: D_i combines the size of the residual with the leverage of the point; values near 1 or larger (or above the median of an F(p, n − p) distribution) indicate an observation worth investigating.
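A sketch of Cook's distance from the same ingredients, using the second form above (NumPy assumed; made-up data):

```python
import numpy as np

x = np.array([26, 28, 30, 32, 34, 36, 38], dtype=float)
y = np.array([5.8, 3.9, 3.2, 2.1, 1.4, 1.0, 0.5])   # made-up responses

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
mse = e @ e / (n - p)

cooks_d = (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
print(np.round(cooks_d, 3))   # values near 1 or larger usually deserve a closer look
```

In practice, libraries such as statsmodels expose these diagnostics directly (for example via get_influence() on a fitted OLS model), which is less error-prone than hand-rolled versions like the sketches above.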

Residuals

The residuals e_i are estimates of the error terms ε_i in the model. What do we know about the e_i? They sum to zero and, since they are linear functions of the Y_i, they are random variables with mean 0 and variance σ²(1 − h_ii). They have a Normal distribution, but they are NOT uncorrelated. However, as n → ∞ with the number of predictors held constant, the correlations among the e_i go to 0 and their variances become essentially constant, so we ignore these problems when using the e_i as estimates of the ε_i.
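In matrix form (standard results restated here for completeness, with H = X(XᵀX)⁻¹Xᵀ the hat matrix):

```latex
% Residual vector and its first two moments
e = Y - \hat{Y} = (I - H)Y, \qquad E(e) = 0, \qquad \mathrm{Var}(e) = \sigma^2 (I - H)
% Hence Var(e_i) = \sigma^2 (1 - h_{ii}) and Cov(e_i, e_j) = -\sigma^2 h_{ij} for i \neq j,
% so the residuals are not exactly uncorrelated, though the correlations vanish as n grows.
```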

Possible Departures from Model Assumptions

We will use the residuals to examine the following possible departures from the simple linear regression model with normal errors:
- The regression function is not linear, i.e., the straight-line model is not appropriate.
- The error terms do not have constant variance.
- The error terms are not normally distributed.
- There are outliers and/or influential points.

Residual Plots

Residual plots are used to check the model assumptions; we look for evidence of any of the possible departures described above. The recommended plots are: residuals versus the predicted (fitted) values, residuals versus the x_i, and a normal quantile plot of the residuals.
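A minimal plotting sketch producing the three recommended plots (matplotlib and SciPy assumed available; the data values are made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([26, 28, 30, 32, 34, 36, 38], dtype=float)
y = np.array([5.8, 3.9, 3.2, 2.1, 1.4, 1.0, 0.5])        # made-up responses

slope, intercept = np.polyfit(x, y, 1)                    # least-squares line
fitted = intercept + slope * x
resid = y - fitted

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(fitted, resid); axes[0].axhline(0, color="grey")
axes[0].set(xlabel="Fitted values", ylabel="Residuals")
axes[1].scatter(x, resid); axes[1].axhline(0, color="grey")
axes[1].set(xlabel="x", ylabel="Residuals")
stats.probplot(resid, dist="norm", plot=axes[2])          # normal quantile plot
plt.tight_layout()
plt.show()
```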

Other Diagnostic Tools

- Univariate displays of the standardized residuals, such as a stem-and-leaf plot, box plot or histogram, are useful for examining departures from the normal distribution.
- A plot of the absolute residuals versus the predicted (fitted) values is useful for examining whether the error variance is constant; this plot shows non-constant variance more sharply.
- Standardized residuals versus time, or another spatial or temporal ordering of the observations, helps detect correlation among the observations.
- Standardized residuals versus other potential predictors helps us decide whether another predictor should be included in the model.

What to do if Assumptions are Violated?

Abandon simple linear regression for something else (usually more complicated). Some examples of alternative models:
- weighted least squares – appropriate when the error variance is non-constant;
- methods that allow for non-normal errors (STA303);
- methods that allow for correlated errors (e.g., time series or longitudinal models);
- polynomial or other non-linear models.

Dealing with Outliers / Influential Points

- First, check for clerical or measurement errors.
- Consider a transformation if the points come from a skewed or long-tailed distribution.
- Use robust regression, which is appropriate when the errors come from a heavy-tailed distribution.
- Consider reporting results both with and without the outliers.
- Think about whether an outlier lies beyond the region where the linear model holds; if so, fit the model on a restricted range of the independent variable that excludes the unusual points.
- For gross outliers that are probably mistakes, consider deleting them, but be cautious if there is no evidence of a mistake.

Transformations

Transformations are used as a remedy for non-linearity, non-constant variance and non-normality.
- If the relationship is non-linear but the variance of Y is approximately constant, try to find a transformation of X that results in a linear relationship (see the sketch after this slide). The most common monotonic transformations are log X, √X and 1/X (for positive X).
- If the relationship is not a straight line and the variance is non-constant, transform Y.
- If the relationship is a straight line but the variance is non-constant, transform both X and Y, or use weighted least squares.
Note that transforming changes the relative spacing of the observations.
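A small sketch of the first recommendation (made-up data, NumPy assumed): when the scatterplot bends but the spread is roughly constant, try a monotone transformation of X and check whether the fit straightens out.

```python
import numpy as np

x = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 5.8, 7.1])   # made up, roughly linear in log(x)

def sse(u, y):
    b1, b0 = np.polyfit(u, y, 1)                     # slope, intercept of least-squares line
    return np.sum((y - (b0 + b1 * u)) ** 2)

print("SSE of y on x:      ", round(sse(x, y), 3))
print("SSE of y on log(x): ", round(sse(np.log(x), y), 3))   # much smaller here
```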

Transformation to Stabilize the Variance

Suppose Y has a distribution with mean μ and variance σ². Then the mean and variance of Z = f(Y) are approximately E(Z) ≈ f(μ) and Var(Z) ≈ [f′(μ)]² σ². Proof: this follows from a first-order Taylor expansion of f about μ (the delta method). This result is used to derive variance-stabilizing transformations.
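The Taylor argument behind this approximation, plus one standard variance-stabilizing example (a reconstruction of the usual derivation, not the slide's own working):

```latex
% First-order Taylor expansion of f about \mu:
Z = f(Y) \approx f(\mu) + f'(\mu)(Y - \mu)
% Taking the mean and variance of this linear approximation:
E(Z) \approx f(\mu), \qquad \mathrm{Var}(Z) \approx [f'(\mu)]^2 \,\sigma^2
% Example: if Var(Y) = c\,\mu^2, then taking f(y) = \log y gives
% Var(\log Y) \approx (1/\mu)^2 \cdot c\,\mu^2 = c, a constant,
% so the log transformation stabilizes the variance in that case.
```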

Examples

SAS Example

In an industrial laboratory, under uniform conditions, batches of electrical insulating fluid were subjected to constant voltages until the insulating property of the fluid broke down. Seven voltage levels, spaced 2 kV apart from 26 to 38 kV, were studied. The measured responses were the times, in minutes, until breakdown.
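The slides carry out this analysis in SAS. As a rough Python stand-in (the breakdown times below are invented placeholders, not the actual lab data), one might regress the log of breakdown time on voltage, since the raw times are strongly right-skewed with spread that grows with the mean:

```python
import numpy as np

voltage = np.array([26, 28, 30, 32, 34, 36, 38], dtype=float)   # kV, as in the design
time = np.array([1200.0, 420.0, 95.0, 32.0, 10.0, 3.5, 1.2])    # minutes, invented values

b1, b0 = np.polyfit(voltage, np.log(time), 1)                   # fit log(time) on voltage
print(f"fitted: log(time) = {b0:.2f} + {b1:.2f} * voltage")
```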

Interpreting Log-Transformed Data

If log Y = β0 + β1X + ε, then Y = e^{β0 + β1X} · e^{ε}, so the errors are multiplicative. An increase in X of 1 unit is associated with a multiplicative change in Y by a factor of e^{β1}. If instead Y = β0 + β1 log X + ε, then for each k-fold change in X, Y changes on average by β1 log k. Example: if X is cut in half, Y changes, on average, by β1 log(½).
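Two quick numeric illustrations of these interpretations, with assumed (made-up) coefficient values:

```python
import numpy as np

# log Y = b0 + 0.05 X: one extra unit of X multiplies Y by exp(0.05) ~ 1.051 (about +5%)
print(np.exp(0.05))

# Y = b0 + 2 log X: cutting X in half changes Y by 2 * log(1/2) ~ -1.386 on average
print(2 * np.log(0.5))
```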

Violation of Normality of the ε's

The least-squares estimators are linear combinations of the Y_i, so by the Central Limit Theorem they are approximately normally distributed in large samples, whatever the error distribution. Consequently, confidence intervals and tests for β0, β1 and E(Y | X) are robust against non-normality (i.e., they have approximately the correct coverage or approximately the correct P-value). Prediction intervals are not robust against departures from normality, because they concern a single future observation rather than an average.

Relative Importance of Assumptions

The most important assumption is that the form of the model is appropriate and E(ε) = 0. The second most important is independence of the observations. The third is constant variance. The least important is normality of the errors, because of the CLT; it is, however, a necessary assumption for prediction intervals.