1 Outliers and Influential Observations KNN Ch. 10 (pp. 390-406)


1 Outliers and Influential Observations KNN Ch. 10 (pp. 390-406)

2 Outlying Observations
• At times, data sets contain observations that are outlying or extreme.
• These outliers usually have a strong effect on the regression analysis.
• We have to identify such observations and then decide whether they need to be eliminated or whether their influence needs to be reduced.
• When dealing with more than one variable, simple plots (boxplots, scatterplots, etc.) may not be enough to identify outliers, and we have to use the residuals or functions of the residuals.
• We will now look at some of these functions.

3 Residuals and Semistudentized Residuals
Previously, we examined:
• Residuals
• Semistudentized residuals
We will now introduce a few refinements that are more effective in identifying Y outliers. First we need to recall the hat matrix.

4 Leverages
• We previously defined the hat matrix as H = X(X'X)⁻¹X'.
• Using the hat matrix, Ŷ = HY and e = (I − H)Y.
• The diagonal elements of the hat matrix, h_ii, with 0 ≤ h_ii ≤ 1, are called leverages.
• These are used to detect influential X observations. Leverage values are useful for detecting hidden extrapolations when p > 3.
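As a numerical illustration (a minimal sketch — the data set and variable names here are synthetic, not from the slides), the hat matrix and its leverages can be computed directly with NumPy, and their defining properties checked:

```python
import numpy as np

# Synthetic data: n cases, p = 3 parameters (intercept + 2 predictors).
rng = np.random.default_rng(42)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X(X'X)^{-1}X'
h = np.diag(H)                         # leverages h_ii
e = (np.eye(n) - H) @ y                # residuals e = (I - H)Y

# Properties of a projection matrix: 0 <= h_ii <= 1 and trace(H) = p.
assert np.all((h >= 0) & (h <= 1))
assert np.isclose(h.sum(), p)
```

Because H is a projection, the residuals are orthogonal to every column of X, which is a quick sanity check on the fit.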

5 Measures for Y-Outlier Detection
• An estimator of the standard deviation of the i-th residual is s{e_i} = √(MSE(1 − h_ii)).
• Therefore, dividing each residual by its estimated standard deviation, we obtain the studentized residuals: r_i = e_i / √(MSE(1 − h_ii)).
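The studentized residuals follow in a few lines (again a sketch on synthetic data of my own, with MSE = SSE/(n − p)):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
mse = e @ e / (n - p)              # MSE = SSE / (n - p)
r = e / np.sqrt(mse * (1 - h))     # internally studentized residuals

# Each r_i is its residual rescaled by that residual's estimated st. deviation.
assert np.allclose(r * np.sqrt(mse * (1 - h)), e)
```

Unlike the semistudentized residuals (which divide every e_i by the same √MSE), each r_i is scaled by its own standard deviation, which accounts for the leverage of the case.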

6 Measures for Y-Outlier Detection
Another effective measure for Y-outlier identification is obtained when we delete observation i, fit the regression function to the remaining n − 1 observations, and obtain the predicted value for that observation given its X levels. The difference between the observed and the predicted value is the deleted residual, which can also be expressed using a leverage value.
• Deleted residuals: d_i = Y_i − Ŷ_i(i) = e_i / (1 − h_ii)
• Studentized deleted residuals: t_i = d_i / s{d_i} = e_i √[(n − p − 1) / (SSE(1 − h_ii) − e_i²)]
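The leverage identity d_i = e_i/(1 − h_ii) means no refitting is actually needed; the sketch below (synthetic data, names mine) verifies it by brute force against an actual leave-one-out fit for one case:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
d = e / (1 - h)                    # deleted residuals, via leverages only

# Brute-force check for case 0: refit on the other n - 1 cases.
mask = np.arange(n) != 0
b_del, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
assert np.isclose(d[0], y[0] - X[0] @ b_del)

sse = e @ e
t_del = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))  # studentized deleted residuals
```

The identity is exact, so all n deleted residuals come from a single fit of the full model.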

7 Detection of Outlying Y Observations
Criterion for outliers:
• To establish that the i-th observation is an outlier, we compare the value of t_i with t(1 − α/(2n); n − p − 1), the 100(1 − α/(2n))-th percentile of the t distribution with n − p − 1 degrees of freedom (a Bonferroni adjustment, since n cases are tested).
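The Bonferroni test can be sketched as follows (synthetic data and names of my choosing; SciPy supplies the t percentile):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
sse = e @ e
t_del = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))

alpha = 0.10
# Bonferroni critical value: t(1 - alpha/(2n); n - p - 1).
crit = stats.t.ppf(1 - alpha / (2 * n), df=n - p - 1)
outliers = np.flatnonzero(np.abs(t_del) > crit)
```

Dividing α by 2n rather than 2 makes the critical value much larger than an unadjusted one, guarding against flagging spurious outliers just because n cases were screened.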

8 Outlying X Observations
• The average leverage value is h̄ = Σ h_ii / n = p/n.
Criterion for X outliers:
• If h_ii > 2p/n, then observation i is an X outlier.
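Since leverages depend only on X, this check needs no response values at all; a sketch (synthetic X of my own) confirming the mean leverage and applying the 2p/n rule:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# The average leverage is exactly p/n, since trace(H) = p.
assert np.isclose(h.mean(), p / n)

x_outliers = np.flatnonzero(h > 2 * p / n)  # cases flagged as X outliers
```

A case flagged here is extreme in the predictor space, regardless of how well its Y value fits the model.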

9 A Simple Example
[Scatterplot matrix of the response Y against the predictors X1, X2, X3.]

10 A Simple Example (continued)
[Minitab regression output for Y on X1, X2, X3: the fitted regression equation, a predictor table (Coef, StDev, T, P) for the constant, X1, X2, and X3, S, R-Sq = 95.7%, R-Sq(adj) = 95.6%, an analysis-of-variance table (Regression, Residual Error, Total), and case diagnostics listing Y, Pred Y, Resid., Stud. Res., Del. Stud. Res., and h_ii.]

11 Influence of Outlying X/Y Observations
• Influence on a single fitted value: the influence that case i has on its own fitted value. Omission is the test: if excluding the case causes major changes in the fitted regression function, the case is indeed influential.
• DFFITS_i = (Ŷ_i − Ŷ_i(i)) / √(MSE_(i) h_ii) = t_i √(h_ii / (1 − h_ii))
Criteria for influential observations:
• |DFFITS_i| > 1 (small to medium data sets)
• |DFFITS_i| > 2√(p/n) (large data sets)
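DFFITS also needs no refitting thanks to the leverage identity; the sketch below (synthetic data, names mine) checks the closed form against an explicit leave-one-out fit for one case:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
sse = e @ e
t_del = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))
dffits = t_del * np.sqrt(h / (1 - h))        # DFFITS_i = t_i * sqrt(h_ii/(1-h_ii))

# Brute-force check for case 0: (yhat_0 - yhat_0(0)) / sqrt(MSE_(0) * h_00).
mask = np.arange(n) != 0
b_del, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
r_del = y[mask] - X[mask] @ b_del
mse_del = r_del @ r_del / (n - 1 - p)        # MSE_(0), df = (n - 1) - p
assert np.isclose(dffits[0], ((H @ y)[0] - X[0] @ b_del) / np.sqrt(mse_del * h[0]))

flagged = np.flatnonzero(np.abs(dffits) > 1.0)  # small/medium-data cutoff
```

For a large data set, the cutoff `1.0` would be replaced by `2 * np.sqrt(p / n)`.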

12 Influence of Outlying X/Y Observations
• An aggregate measure is also required: one that measures the effect of omitting case i on all n fitted values, not just the i-th fitted value.
• This statistic is Cook's distance: D_i = [e_i² / (p · MSE)] · [h_ii / (1 − h_ii)²]
Criterion for influential observations:
• Compare D_i with the F distribution with (p, n − p) degrees of freedom. If the percentile that D_i cuts off from the left side of the distribution is around 10 or 20, the observation has little influence; if the percentile is 50 or more, the influence is large.
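Cook's distance can likewise be computed from a single fit; this sketch (synthetic data, names mine) verifies for one case that the closed form equals the aggregate shift in all n fitted values, then reads off the F percentile via SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
mse = e @ e / (n - p)
D = (e**2 / (p * mse)) * (h / (1 - h)**2)    # Cook's distance

# Brute-force check for case 0: D_0 = sum_j (yhat_j - yhat_j(0))^2 / (p * MSE).
mask = np.arange(n) != 0
b_del, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
assert np.isclose(D[0], np.sum((H @ y - X @ b_del) ** 2) / (p * mse))

pct = 100 * stats.f.cdf(D, p, n - p)         # percentile each D_i cuts off
influential = np.flatnonzero(pct >= 50)      # large influence per the slide's rule
```

Cases with `pct` around 10–20 have little influence; `pct >= 50` signals large influence.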

13 Influence of Outliers on Betas
• Another measure is required: one that measures the effect of omitting case i on the OLS estimates of the regression coefficients (betas):
• DFBETAS_k(i) = (b_k − b_k(i)) / √(MSE_(i) c_kk)
• Here, c_kk is the k-th diagonal element of (X'X)⁻¹.
Criteria for influential observations:
• |DFBETAS_k(i)| > 1 for small data sets, or
• |DFBETAS_k(i)| > 2/√n for large data sets.
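DFBETAS can be computed by refitting without each case in turn; a brute-force sketch (synthetic data, names mine), giving one value per case and per coefficient:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
c = np.diag(XtX_inv)                    # c_kk, k-th diagonal of (X'X)^{-1}
b = XtX_inv @ X.T @ y                   # full-data OLS coefficients

dfbetas = np.empty((n, p))
for i in range(n):
    mask = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    r_i = y[mask] - X[mask] @ b_i
    mse_i = r_i @ r_i / (n - 1 - p)     # MSE_(i) from the fit without case i
    dfbetas[i] = (b - b_i) / np.sqrt(mse_i * c)

flagged = np.abs(dfbetas) > 1.0         # small-data cutoff; 2/sqrt(n) for large n
```

Row i of `dfbetas` shows, in standard-error units, how much each coefficient moves when case i is dropped, so a single case can be influential for some betas and not others.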

14 A Simple Example (continued)
[The same Minitab output for Y on X1, X2, X3 (R-Sq = 95.7%, R-Sq(adj) = 95.6%), with the case diagnostics extended to include DFFITS and Cook's D alongside Y, Resid., Stud. Res., Del. Stud. Res., and h_ii.]