Individual observations need to be checked to see if they are:
–outliers; or
–influential observations
Outliers are defined as observations that differ from the majority of cases in the data set. They can be outliers in either
–the covariate (x) direction; or
–the response (y) direction
Depending upon the dimension, it may be easy or difficult to find outliers in the covariate (x) direction:
–one x: easy; do a univariate plot (a boxplot shows outliers)
–two x's: do a scatterplot of one against the other
–multiple x's: more difficult… (see the sketch below)
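As a quick illustration of the one-x and two-x checks, here is a minimal Python sketch on synthetic data; the variables, the planted outlier, and the use of numpy/matplotlib are illustrative assumptions, not part of the original slides:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)        # synthetic data for illustration only
x1 = rng.normal(50, 10, 100)
x2 = 0.5 * x1 + rng.normal(0, 5, 100)
x1[0], x2[0] = 95, 20                 # plant an outlier in the x direction

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.boxplot(x1)                       # one x: a boxplot flags outliers
ax1.set_title("one x: boxplot")
ax2.scatter(x1, x2)                   # two x's: scatterplot of one against the other
ax2.set_title("two x's: scatterplot")
plt.show()
```

With many x's, pairwise scatterplots miss jointly unusual cases, which is why the leverage diagnostics on the later slides are the more reliable tool.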

Outliers in the y direction could be due to:
–an unusually large error term (epsilon);
–recording errors (in either the x's or y);
–missing covariate(s)
Outliers can often be found, but their causes and the remedies (should they be excluded?) are often difficult. Try fitting the model with and without the outliers, as sketched below: if there is no substantive change in the results, then remove them; if there is a change in the results, then be careful! Can additional data be collected? Outliers can often be the most interesting cases…
–see Figures 6.6 (a–c) on page 185
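A minimal sketch of the fit-with-and-without check, using plain least squares on synthetic data (the data, the suspect case, and all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
y[-1] += 15                               # plant a y-direction outlier

X = np.column_stack([np.ones(n), x])      # design matrix with intercept

def ols(X, y):
    """Least-squares coefficients via numpy's lstsq."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_all = ols(X, y)
keep = np.arange(n) != n - 1              # drop the suspect case
beta_wo = ols(X[keep], y[keep])

print("with outlier:   ", beta_all)
print("without outlier:", beta_wo)        # substantive change in the coefficients?
```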

An easy check for possible outliers in the y direction is to use the Studentized residuals
$$d_i = \frac{e_i}{s\sqrt{1 - h_{ii}}},$$
where $e_i$ is the ith raw residual, $s$ is the residual standard error, and $h_{ii}$ is the ith leverage (defined on the next slide). They are approximately N(0,1), so if $|d_i| > 2.5$ or so, then the ith observation is a possible outlier in the response direction. Plots of these residuals will usually show these points clearly… try normal quantile plots of the $d_i$. But be careful: the outlier in Figure 6.6(a) would clearly show up, but the one in 6.6(b) would not…
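The rule of thumb is easy to compute directly. A sketch, assuming synthetic data and the internally Studentized form of $d_i$ given above:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
y[5] += 6                                  # plant a response outlier

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
h = np.diag(H)                             # leverages h_ii
e = y - H @ y                              # raw residuals
p = X.shape[1] - 1                         # number of explanatory variables
s = np.sqrt(e @ e / (n - p - 1))           # residual standard error
d = e / (s * np.sqrt(1 - h))               # Studentized residuals

print(np.where(np.abs(d) > 2.5)[0])        # flag possible y-direction outliers
```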

An individual observation is influential if the conclusions of the analysis done without the observation are vastly different from the conclusions with the observation included; see Fig. 6.7 (a–b) on page 187. The "hat" matrix $H = X(X^TX)^{-1}X^T$ gives us information about the leverage that individual points have, since $\hat{y} = Hy$, i.e. $\hat{y}_i = \sum_j h_{ij} y_j$. So large values of $h_{ii}$ (close to 1), relative to the other h's, mean that the ith observation has high leverage in the sense that the ith fitted value is "attracted" to the response of the ith observation. (Think about the simple linear regression case…)
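A small sketch of this "attraction" effect, assuming a synthetic data set with one case placed far out in x:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.append(rng.uniform(0, 5, 20), 20.0)   # one case far out in x
y = 1.0 + 0.5 * x + rng.normal(0, 1, x.size)

X = np.column_stack([np.ones(x.size), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix: y_hat = H y
y_hat = H @ y

print(np.diag(H)[-1])                        # leverage of the far-out case (near 1)
print(y[-1], y_hat[-1])                      # its fitted value is pulled toward y
```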

Some properties of the leverage $h_{ii}$:
–it is a function of the explanatory variables but not of y
–$1/n \le h_{ii} \le 1$
–it is small for cases near the centroid $(\bar{x}_1, \ldots, \bar{x}_p)$ of the X space and large for cases far away, where p = # of explanatory variables
–so, the average leverage is $\bar{h} = \sum_i h_{ii}/n = (p+1)/n$
Thus, one way to check for large leverage is to compare $h_{ii}$ with this mean: if $h_{ii}$ is bigger than $2\bar{h}$, it is usually considered a high-leverage observation (sketched below). Your author says: "Cases with high leverage need to be identified and examined carefully."
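A sketch of the $2\bar{h}$ screen on synthetic data (the planted far-from-centroid case is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 2                                    # n cases, p explanatory variables
X1 = rng.normal(0, 1, (n, p))
X1[0] = [6.0, -6.0]                             # a case far from the centroid
X = np.column_stack([np.ones(n), X1])           # add the intercept column

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_ii
h_bar = (p + 1) / n                             # average leverage = trace(H)/n

print(np.where(h > 2 * h_bar)[0])               # high-leverage cases
```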

Another way to check for an influential point is to see what happens when that point is removed and the regression is done without it… there are several statistics we can compute that take this particular idea and use it. The first is called Cook's D, defined as
$$D_i = \frac{d_i^2}{p+1}\cdot\frac{h_{ii}}{1 - h_{ii}},$$
where $h_{ii}$ is the ith leverage and $d_i$ is the ith Studentized residual. Note that both a large leverage and a large $d_i$ are required to make Cook's D large. How big does it have to be? Values > 1 (or even > 0.5) are given in the literature as influential…
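Cook's D combines the leverages and Studentized residuals already computed above; a sketch under the same synthetic-data assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x = np.append(rng.uniform(0, 5, n - 1), 15.0)   # high-leverage x value...
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
y[-1] -= 20                                     # ...with a discrepant response

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
p = 1                                           # one explanatory variable
s = np.sqrt(e @ e / (n - p - 1))
d = e / (s * np.sqrt(1 - h))                    # Studentized residuals

D = d**2 / (p + 1) * h / (1 - h)                # Cook's D
print(np.where(D > 1)[0])                       # influential by the D > 1 rule
```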

Another quantity of interest is called the ith PRESS residual:
$$e_{(i)} = y_i - \hat{y}_{(i)} = \frac{e_i}{1 - h_{ii}}$$
Here the (i) indicates that the ith case is removed before fitting; notice that a large leverage makes these PRESS residuals large. Only the original residuals and the leverages from the regression on the full data set are needed to compute these statistics. They are called PRESS residuals because their sum of squares is the "prediction error sum of squares" (PRESS). Let's go through the forestry example in section on page …
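A sketch verifying numerically, on synthetic data, that the full-fit shortcut $e_i/(1-h_{ii})$ matches an actual leave-one-out refit (case 0 and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.uniform(0, 10, n)
y = 3.0 + 0.8 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                                   # residuals from the full fit

e_press = e / (1 - h)                           # PRESS residuals, no refitting needed
PRESS = np.sum(e_press**2)                      # prediction error sum of squares

# sanity check: delete case 0, refit, and predict it back
idx = np.arange(n) != 0
beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
print(e_press[0], y[0] - X[0] @ beta)           # the two agree
```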