STATS 330: Lecture 11 (12/17/2015)

Outliers and high-leverage points
- An outlier is a point that has a larger or smaller y value than the model would suggest
  - Can be due to a genuine large error
  - Can be caused by typographical errors in recording the data
- A high leverage point is a point with extreme values of the explanatory variables

Outliers
- The effect of an outlier depends on whether it is also a high leverage point
- A "high leverage" outlier
  - Can attract the fitted plane, distorting the fit, sometimes extremely
  - In extreme cases may not have a big residual
  - In extreme cases can increase $R^2$
- A "low leverage" outlier
  - Does not distort the fit to the same extent
  - Usually has a big residual
  - Inflates standard errors, decreases $R^2$

[Figure: four panels: (1) no outliers, no high-leverage points; (2) low-leverage outlier: big residual; (3) high leverage, not an outlier; (4) high-leverage outlier]

Example: the education data (ignoring urban)
[Figure: the high leverage point is marked]

An outlier also? The residual is somewhat extreme.

Measuring leverage
It can be shown (see e.g. STATS 310) that the fitted value of case i is related to the response data $y_1, \ldots, y_n$ by the equation

$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{in} y_n$

The $h_{ij}$ depend on the explanatory variables. The quantities $h_{ii}$ are called "hat matrix diagonals" (HMDs) and measure the influence $y_i$ has on the ith fitted value. They can also be interpreted as a measure of the distance between the x-data for the ith case and the average x-data for all the cases. Thus, they directly measure how extreme the x-values of each point are.
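The relationship between the fitted values and the hat matrix can be checked numerically. The following is a minimal sketch, not the lecture's own code: it assumes the built-in `mtcars` data as a stand-in for `educ.df` (which is not available here), and the variable names are illustrative.

```r
# Assumption: mtcars (shipped with R) stands in for the lecture's educ.df.
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)                    # n x p design matrix, intercept column included
H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix H = X (X'X)^{-1} X'

# Each fitted value is a weighted sum of all the responses: yhat_i = sum_j h_ij * y_j
yhat_manual <- drop(H %*% mtcars$mpg)
all.equal(unname(yhat_manual), unname(fitted(fit)))

# The diagonal entries h_ii are the hat matrix diagonals (HMDs)
hmd_manual <- diag(H)
all.equal(unname(hmd_manual), unname(hatvalues(fit)))
```

Forming the full n x n matrix is only sensible for small data sets; `hatvalues()` returns the diagonal directly.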

Interpreting the HMDs
- Each HMD lies between 0 and 1
- The average HMD is p/n (p = number of regression coefficients, p = k+1)
- An HMD of more than 3p/n is considered extreme
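These three properties are easy to verify on a fitted model. A short sketch, again assuming `mtcars` in place of the education data (the model and names are illustrative only):

```r
# Assumption: mtcars stands in for the lecture's educ.df.
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)
n <- nobs(fit)             # number of cases
p <- length(coef(fit))     # number of regression coefficients (k + 1)

range(h)                   # every HMD lies between 0 and 1
c(mean_hmd = mean(h), p_over_n = p / n)   # the average HMD equals p/n

cutoff <- 3 * p / n        # rule of thumb from the slide
h[h > cutoff]              # cases considered high leverage, if any
```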

Example: the education data
    > educ.lm <- lm(educ ~ percapita + under18, data = educ.df)
    > hatvalues(educ.lm)[50]
    > 9/50
    [1] 0.18
With n = 50 and p = 3, the cutoff is 3p/n = 9/50 = 0.18. The HMD for point 50 is clearly extreme!

Studentized residuals
- How can we recognize a big residual? How big is big?
- The actual size depends on the units in which the y-variable is measured, so we need to standardize the residuals.
- We can divide them by their standard deviations.
- The variance of a typical residual e is $\mathrm{var}(e) = (1-h)\sigma^2$, where h is the hat matrix diagonal for the point.

Studentized residuals (2)
- "Internally studentised" (called "standardised" in R): $r_i = \dfrac{e_i}{s\sqrt{1-h_{ii}}}$
- "Externally studentised" (called "studentised" in R): $t_i = \dfrac{e_i}{s_i\sqrt{1-h_{ii}}}$
Here $s^2$ is the usual estimate of $\sigma^2$, and $s_i^2$ is the estimate of $\sigma^2$ after deleting the ith data point.

Studentized residuals (3)
- How big is big?
- Both types of studentised residual are approximately distributed as standard normals when the model is OK and there are no outliers (in fact, the externally studentised one has a t-distribution).
- Thus, studentised residuals should be between -2 and 2 with approximately 95% probability.

Studentized residuals (4)
Calculating in R:
    library(MASS)      # load the MASS library
    stdres(educ.lm)    # internally studentised (standardised in R)
    studres(educ.lm)   # externally studentised (studentised in R)
    > stdres(educ.lm)[50]
    > studres(educ.lm)[50]
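Both kinds of studentised residual can also be built by hand from the residuals, the HMDs, and the two estimates of sigma. The sketch below assumes `mtcars` in place of the education data and compares the results with base R's `rstandard()` and `rstudent()`, which compute the same two quantities without loading MASS.

```r
# Assumption: mtcars stands in for the lecture's educ.df.
fit <- lm(mpg ~ wt + hp, data = mtcars)

e     <- residuals(fit)          # raw residuals e_i
h     <- hatvalues(fit)          # hat matrix diagonals h_ii
s     <- summary(fit)$sigma      # usual estimate of sigma
s_del <- influence(fit)$sigma    # estimate of sigma with case i deleted, one per case

internal <- e / (s * sqrt(1 - h))       # internally studentised ("standardised" in R)
external <- e / (s_del * sqrt(1 - h))   # externally studentised ("studentised" in R)

all.equal(internal, rstandard(fit))
all.equal(external, rstudent(fit))

# Cases outside the rough (-2, 2) band
which(abs(external) > 2)
```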

What does studentised mean?

Recognizing outliers
- If a point is a low-leverage outlier, the residual will usually be large, so a large residual and a low HMD indicate an outlier.
- If a point is a high-leverage outlier, then a large error will usually cause a large residual.
- However, in extreme cases, a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and the residual is not particularly big, we can't always tell whether the point is an outlier or not.

[Figure: a high-leverage outlier with a small residual]

Leverage-residual plot
We can plot standardised residuals versus leverage (HMDs): the leverage-residual plot (LR plot).
    plot(educ.lm, which=5)
Point 50 has high leverage and a big residual, so it is an outlier.
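To see what goes into an LR plot, one can be drawn by hand from the HMDs and the standardised residuals. This is an illustrative sketch only, assuming `mtcars` in place of the education data; the reference lines at 3p/n and at -2 and 2 follow the rules of thumb given earlier in the lecture, not the contours that `plot(..., which=5)` draws.

```r
# Assumption: mtcars stands in for the lecture's educ.df.
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)        # leverage (HMDs)
r <- rstandard(fit)        # standardised (internally studentised) residuals
n <- nobs(fit)
p <- length(coef(fit))

plot(h, r, xlab = "Leverage (HMD)", ylab = "Standardised residual",
     main = "Leverage-residual plot")
abline(v = 3 * p / n, lty = 2)     # high-leverage cutoff
abline(h = c(-2, 2), lty = 2)      # rough band for "big" residuals

# Label any case beyond either cutoff
flagged <- which(h > 3 * p / n | abs(r) > 2)
if (length(flagged) > 0) {
  text(h[flagged], r[flagged], labels = names(flagged), pos = 2, cex = 0.7)
}
```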

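Interpreting LR plots
[Figure: standardised residual (vertical axis) against leverage (horizontal axis), with a reference line at leverage 3p/n. Regions: OK (low leverage, small residual); low leverage outlier (low leverage, big residual); possible high leverage outlier (high leverage, small residual); high leverage outlier (high leverage, big residual)]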

Residuals and HMDs
No big studentized residuals, no big HMDs (3p/n = 0.2 for this example).

Residuals and HMDs (2)
One big studentized residual (point 24), no big HMDs (3p/n = 0.2 for this example). The line moves a bit.

Residuals and HMDs (3)
No big studentized residual, one big HMD at point 1 (3p/n = 0.2 for this example). The line hardly moves. Point 1 is high leverage but not influential.

Residuals and HMDs (4)
One big studentized residual, one big HMD, both at point 1 (3p/n = 0.2 for this example). The line moves but the residual is large. Point 1 is influential.

Residuals and HMDs (5)
No big studentized residuals, one big HMD at point 1 (3p/n = 0.2 for this example). Point 1 is high leverage and influential.

Influential points
- How can we tell if a high-leverage point or outlier is affecting the regression?
- By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression
- Such points are called influential points
- We don't want the analysis to be driven by one or two points
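The delete-and-refit check can be sketched in a few lines. The example assumes `mtcars` in place of the education data and, purely for illustration, deletes the case with the largest HMD; any case of interest could be substituted.

```r
# Assumption: mtcars stands in for the lecture's educ.df.
fit_all <- lm(mpg ~ wt + hp, data = mtcars)

# Illustrative choice: the case with the largest hat matrix diagonal
i <- which.max(hatvalues(fit_all))

# Refit without case i
fit_drop <- lm(mpg ~ wt + hp, data = mtcars[-i, ])

# Side-by-side coefficients: a large change suggests an influential point
comparison <- cbind(with_point    = coef(fit_all),
                    without_point = coef(fit_drop))
comparison
```

Comparing the two columns against the standard errors from `summary(fit_all)` gives a sense of whether the change is large.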

"Leave-one-out" measures
- We can calculate a variety of measures by leaving out each data point in turn and looking at the change in key regression quantities such as
  - Coefficients
  - Fitted values
  - Standard errors
- We discuss each in turn

Example: education data
[Table: estimated coefficients (Const, percap, under) with point 50 and without point 50]

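Standardized difference in coefficients: DFBETAS
Formula: $\mathrm{DFBETAS}_{ij} = \dfrac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\mathrm{s.e.}(\hat{\beta}_j)}$, where $\hat{\beta}_{j(i)}$ is the estimate of the jth coefficient with case i deleted.
Problem when: greater than 1 in absolute value. This is the criterion coded into R.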

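Standardized difference in fitted values: DFFITS
Formula: $\mathrm{DFFITS}_i = \dfrac{\hat{y}_i - \hat{y}_{i(i)}}{s_i\sqrt{h_{ii}}}$, where $\hat{y}_{i(i)}$ is the fitted value for case i from the regression with case i deleted.
Problem when: greater than $3\sqrt{p/(n-p)}$ in absolute value (p = number of regression coefficients).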

Cov Ratio and Cook's D
- Cov Ratio: measures the change in the standard errors of the estimated coefficients
  - Problem indicated when: Cov Ratio is more than 1+3p/n or less than 1-3p/n
- Cook's D: measures the overall change in the coefficients
  - Problem when: more than qf(0.50, p, n-p) (the median of the F distribution), roughly 1 in most cases
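The leave-one-out measures and their cutoffs can be applied directly with base R functions. The sketch below assumes `mtcars` in place of the education data and uses the thresholds exactly as stated on these slides; `influence.measures()` may apply slightly different cutoffs internally, so its `inf` flags need not match this table case for case.

```r
# Assumption: mtcars stands in for the lecture's educ.df.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)
p <- length(coef(fit))

# One logical column per measure, using the cutoffs quoted on the slides
flags <- data.frame(
  dfbetas  = apply(abs(dfbetas(fit)) > 1, 1, any),      # any coefficient shifts by > 1 s.e.
  dffits   = abs(dffits(fit)) > 3 * sqrt(p / (n - p)),
  covratio = abs(covratio(fit) - 1) > 3 * p / n,
  cooksD   = cooks.distance(fit) > qf(0.5, p, n - p),
  hat      = hatvalues(fit) > 3 * p / n
)

# Show only the cases flagged by at least one measure
flags[rowSums(flags) > 0, ]
```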

Calculations
    > influence.measures(educ.lm)
    Influence measures of
      lm(formula = educ ~ under18 + percap, data = educ.df)
      dfb.1.  dfb.un18  dfb.prcp  dffit  cov.r  cook.d  hat  inf
Influential cases are marked with * in the inf column.
Cutoffs: p = 3, n = 50, 3p/n = 0.18, $3\sqrt{p/(n-p)}$ = 0.758, and qf(0.5, 3, 47) for Cook's D.

Plotting influence
    # set up plot window with 2 x 4 array of plots
    par(mfrow=c(2,4))
    # plot dfbetas, dffits, cov ratio,
    # Cook's D, HMDs
    influenceplots(educ.lm)

Remedies for outliers
- Correct typographical errors in the data
- Delete a small number of points and refit (we don't want the fitted regression to be determined by one or two influential points)
- Report the existence of outliers separately: they are often of scientific interest
- Don't delete too many points (1 or 2 at most)

Summary: Doing it in R
- LR plot: plot(educ.lm, which=5)
- Full diagnostic display: plot(educ.lm)
- Influence measures: influence.measures(educ.lm)
- Plots of influence measures: par(mfrow=c(2,4)); influenceplots(educ.lm)

HMD Summary
- Hat matrix diagonals
  - Measure the effect of a point on its fitted value
  - Measure how outlying the x-values are (how "high-leverage" a point is)
  - Are always between 0 and 1, with bigger values indicating high leverage
  - Points with HMDs of more than 3p/n are considered "high leverage"