Outliers and influential data points

Presentation transcript:

Outliers and influential data points

No outliers?

An outlier? Influential?

Impact on regression analyses
Not every outlier strongly influences the estimated regression function.
Always determine whether the estimated regression function is unduly influenced by one or a few cases.
Simple plots suffice for simple linear regression.
Summary measures are needed for multiple linear regression.

The hat matrix H

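Least squares estimates: $b = (X'X)^{-1}X'y$
The regression model: $y = X\beta + \varepsilon$
Fitted values: $\hat{y} = Xb = X(X'X)^{-1}X'y = Hy$, where $H = X(X'X)^{-1}X'$ is the hat matrix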
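A minimal NumPy sketch of these three quantities, assuming the standard definitions $b = (X'X)^{-1}X'y$ and $H = X(X'X)^{-1}X'$. The data are made up for illustration and do not come from the slides.

```python
import numpy as np

# Made-up illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0])

# Design matrix with an intercept column: n = 10 cases, p = 2 parameters.
X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)

b = XtX_inv @ X.T @ y      # least squares estimates
H = X @ XtX_inv @ X.T      # hat matrix
y_hat = H @ y              # fitted values: H "puts the hat on" y

print("b =", b)
print("Hy equals Xb:", np.allclose(y_hat, X @ b))
print("H is symmetric and idempotent:",
      np.allclose(H, H.T) and np.allclose(H @ H, H))
```

Forming the explicit inverse is fine for a small teaching example; for larger or ill-conditioned problems a QR-based solver would usually be preferred.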

Identifying outlying Y values

Residuals
Standardized residuals
– also called internally studentized residuals
Deleted residuals
Deleted t residuals
– also called studentized deleted residuals
– also called externally studentized residuals

Residuals
Ordinary residuals defined for each observation, i = 1, …, n: $e_i = y_i - \hat{y}_i$
Using matrix notation: $e = y - \hat{y} = y - Hy = (I - H)y$

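Variance of the residuals
Residual vector: $e = (I - H)y$
Variance matrix: $\mathrm{Var}(e) = \sigma^2 (I - H)$
Variance of the ith residual: $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$
Estimated variance of the ith residual: $s^2(e_i) = MSE\,(1 - h_{ii})$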
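The variance matrix of the residuals follows in a few lines from $e = (I - H)y$. The sketch below assumes the usual model condition $\mathrm{Var}(y) = \sigma^2 I$ and uses the fact that $H$ is symmetric and idempotent.

```latex
\begin{aligned}
\operatorname{Var}(e)
  &= (I - H)\,\operatorname{Var}(y)\,(I - H)' \\
  &= \sigma^2 (I - H)(I - H)' \\
  &= \sigma^2 (I - 2H + HH) \\
  &= \sigma^2 (I - H), \qquad \text{since } H' = H \text{ and } HH = H .
\end{aligned}
```

The ith diagonal element gives $\mathrm{Var}(e_i) = \sigma^2(1 - h_{ii})$, and replacing $\sigma^2$ with $MSE$ gives the estimated variance.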

Standardized residuals
Standardized residuals defined for each observation, i = 1, …, n: $e_i^* = e_i / s(e_i) = e_i / \sqrt{MSE\,(1 - h_{ii})}$
Standardized residuals quantify how large the residuals are in standard deviation units.
Standardized residuals larger than 2 or smaller than -2 suggest that the y values are unusual.
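As a rough illustration of the rule of thumb, the sketch below computes standardized residuals as $e_i / \sqrt{MSE\,(1 - h_{ii})}$ and flags cases beyond ±2. The function name and the data are hypothetical, chosen so that the fifth case has an unusual y value.

```python
import numpy as np

def standardized_residuals(X, y):
    """Internally studentized residuals e_i / sqrt(MSE * (1 - h_ii))."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y                 # ordinary residuals
    mse = e @ e / (n - p)         # SSE / (n - p)
    h = np.diag(H)                # leverages
    return e / np.sqrt(mse * (1.0 - h))

# Made-up data: y is close to 2x except for the fifth case.
x = np.arange(1.0, 11.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 20.0, 12.2, 13.8, 16.1, 18.0, 19.9])
X = np.column_stack([np.ones_like(x), x])

r = standardized_residuals(X, y)
flagged = np.flatnonzero(np.abs(r) > 2)   # |standardized residual| > 2
print("standardized residuals:", np.round(r, 2))
print("flagged cases (0-based index):", flagged)
```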

An outlying y value?

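[Minitab output, values not preserved: data columns x, y, FITS1, HI1, s(e), RESI1, SRES; the regression standard error S; and the Unusual Observations table with columns Obs, x, y, Fit, SE Fit, Residual, St Resid.]
R denotes an observation with a large standardized residual.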

Deleted residuals
If the observed $y_i$ is extreme, it may "pull" the fitted equation towards itself, thereby yielding a small ordinary residual.
Delete the ith case, estimate the regression function using the remaining n-1 cases, and use the x values of the ith case to predict its response, $\hat{y}_{i(i)}$.
Deleted residual: $d_i = y_i - \hat{y}_{i(i)}$
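The delete-and-refit procedure can be sketched directly. The function name and data below are hypothetical. The last case is made up so that it pulls the line towards itself, which makes its ordinary residual much smaller in magnitude than its deleted residual. The final check uses the known identity $d_i = e_i / (1 - h_{ii})$, which avoids refitting.

```python
import numpy as np

def deleted_residuals(X, y):
    """d_i = y_i minus the prediction for case i from a fit without case i."""
    n = X.shape[0]
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        d[i] = y[i] - X[i] @ b_i
    return d

# Made-up data: the last case sits far from the others in x and off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y                      # ordinary residuals
d = deleted_residuals(X, y)        # deleted residuals

print("ordinary residual, last case:", round(e[-1], 2))
print("deleted residual, last case: ", round(d[-1], 2))
print("matches e / (1 - h):", np.allclose(d, e / (1.0 - np.diag(H))))
```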

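Deleted t residuals
A deleted t residual is just a standardized deleted residual: $t_i = d_i / s(d_i) = e_i / \sqrt{MSE_{(i)}\,(1 - h_{ii})}$, where $MSE_{(i)}$ is the mean squared error of the fit without the ith case.
The deleted t residuals follow a t distribution with (n-1)-p degrees of freedom.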
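A sketch of two ways to get deleted t residuals, assuming the standard results $s^2(d_i) = MSE_{(i)}\,(1 + x_i'(X_{(i)}'X_{(i)})^{-1}x_i)$ and $MSE_{(i)} = (SSE - e_i^2/(1 - h_{ii}))/(n - p - 1)$. The first function refits n times; the second uses only the full fit. Function names and data are hypothetical.

```python
import numpy as np

def deleted_t_by_refitting(X, y):
    """t_i = d_i / s(d_i), refitting the model without case i each time."""
    n, p = X.shape
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xk, yk = X[keep], y[keep]
        XtX_inv = np.linalg.inv(Xk.T @ Xk)
        b = XtX_inv @ Xk.T @ yk
        resid = yk - Xk @ b
        mse_del = resid @ resid / (n - 1 - p)   # (n-1)-p degrees of freedom
        d = y[i] - X[i] @ b                     # deleted residual
        s_d = np.sqrt(mse_del * (1.0 + X[i] @ XtX_inv @ X[i]))
        t[i] = d / s_d
    return t

def deleted_t_shortcut(X, y):
    """Same quantity from the full fit only (no refitting)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    mse_del = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(mse_del * (1.0 - h))

# Made-up data: y is close to 2x except for the fifth case.
x = np.arange(1.0, 11.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 20.0, 12.2, 13.8, 16.1, 18.0, 19.9])
X = np.column_stack([np.ones_like(x), x])

t_refit = deleted_t_by_refitting(X, y)
t_short = deleted_t_shortcut(X, y)
print("methods agree:", np.allclose(t_refit, t_short))
print("largest |t| at 0-based index:", int(np.argmax(np.abs(t_short))))
```

Each $t_i$ can then be compared with a t distribution on (n-1)-p degrees of freedom, as the slide states.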

[Data table, values not preserved: columns x, y, RESI1, TRES.]

[Data table, values not preserved: columns Row, x, y, RESI1, SRES1, TRES.]

Identifying outlying X values

Use the diagonal elements, $h_{ii}$, of the hat matrix H to identify outlying X values. The $h_{ii}$ are called leverages.

Properties of the leverages ($h_{ii}$)
The $h_{ii}$ is a measure of the distance between the X values for the ith case and the means of the X values for all n cases.
The $h_{ii}$ is a number between 0 and 1, inclusive.
The sum of the $h_{ii}$ equals p, the number of parameters.
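These properties are easy to check numerically. The sketch below uses made-up data and, for the distance interpretation, the simple linear regression closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which applies only to the one-predictor case.

```python
import numpy as np

# Made-up data: the last x value is far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Leverages are the diagonal of the hat matrix.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Closed form for simple linear regression: distance of x_i from the mean.
h_closed = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print("leverages:", np.round(h, 3))
print("between 0 and 1:", bool(np.all((h >= 0) & (h <= 1))))
print("sum of leverages:", round(h.sum(), 6), "with p =", p)
print("matches closed form:", np.allclose(h, h_closed))
```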

[Minitab output, values not preserved: the HI1 column of leverages and its total, Sum of HI1.]

Properties of the leverages ($h_{ii}$)
If the ith case is outlying in terms of its X values, it has a large leverage value $h_{ii}$, and therefore exercises substantial leverage in determining the fitted value.

Using leverages to identify outlying X values
Minitab flags any observation whose leverage value, $h_{ii}$, is more than 3 times the mean leverage value, that is, $h_{ii} > 3(p/n)$, or whose leverage is greater than 0.99.
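A sketch of this flagging rule as read from the slide: the mean leverage is p/n because the leverages sum to p. The function and parameter names are hypothetical and are not Minitab's interface; the data are made up.

```python
import numpy as np

def flag_high_leverage(X, multiplier=3.0, absolute_cutoff=0.99):
    """Flag cases with h_ii > multiplier * (p / n) or h_ii > absolute_cutoff."""
    n, p = X.shape
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    mean_leverage = p / n            # leverages sum to p over n cases
    flagged = (h > multiplier * mean_leverage) | (h > absolute_cutoff)
    return h, flagged

# Made-up data: the last x value is far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
X = np.column_stack([np.ones_like(x), x])

h, flagged = flag_high_leverage(X)
print("cutoff 3p/n =", 3 * X.shape[1] / X.shape[0])
print("flagged cases (0-based index):", np.flatnonzero(flagged))
```

A flagged case is only outlying in its X values; whether it is also influential depends on its y value.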

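[Minitab output, values not preserved: Unusual Observations table with columns Obs, x, y, Fit, SE Fit, Residual, St Resid and an X flag, followed by data columns x, y, HI.]
X denotes an observation whose X value gives it large influence.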

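[Minitab output, values not preserved: data columns x, y, HI, followed by the Unusual Observations table with columns Obs, x, y, Fit, SE Fit, Residual, St Resid and an RX flag.]
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.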

Identifying influential cases

Influence
A case is influential if its exclusion causes major changes in the estimated regression function.

Identifying influential cases
Difference in fits, DFITS
Cook's distance measure

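DFITS
The difference in fits, $DFITS_i = (\hat{y}_i - \hat{y}_{i(i)}) / \sqrt{MSE_{(i)}\,h_{ii}}$, represents the number of estimated standard deviations by which the fitted value increases or decreases when the ith case is included. Here $\hat{y}_{i(i)}$ and $MSE_{(i)}$ come from the fit without the ith case.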

DFITS
A case is influential if the absolute value of its DFITS value is greater than 1 for small to medium data sets, or greater than $2\sqrt{p/n}$ for large data sets.
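A sketch of the difference in fits computed two ways: from the definition, by refitting without each case, and from the full fit, using the known identity $DFITS_i = t_i \sqrt{h_{ii}/(1 - h_{ii})}$ with $t_i$ the deleted t residual. The large-data cutoff $2\sqrt{p/n}$ in the code is the guideline given in Kutner et al. and is an assumption here, since the slide's own formula did not survive. Function names and data are hypothetical.

```python
import numpy as np

def dfits_by_refitting(X, y):
    """(y_hat_i - y_hat_i(i)) / sqrt(MSE_(i) * h_ii), refitting for each i."""
    n, p = X.shape
    b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = y[keep] - X[keep] @ b_i
        mse_del = resid @ resid / (n - 1 - p)
        out[i] = (X[i] @ b_full - X[i] @ b_i) / np.sqrt(mse_del * h[i])
    return out

def dfits_shortcut(X, y):
    """Deleted t residual times sqrt(h / (1 - h)), from the full fit only."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    mse_del = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(mse_del * (1.0 - h))
    return t * np.sqrt(h / (1.0 - h))

# Made-up data: the last case has high leverage and is off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

dfits = dfits_shortcut(X, y)
print("methods agree:", np.allclose(dfits, dfits_by_refitting(X, y)))
print("cases with |DFITS| > 1:", np.flatnonzero(np.abs(dfits) > 1))
# Assumed large-data-set guideline (Kutner et al.), shown for comparison only.
print("cases with |DFITS| > 2*sqrt(p/n):",
      np.flatnonzero(np.abs(dfits) > 2 * np.sqrt(p / n)))
```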

[Two data tables, values not preserved: columns x, y, DFIT.]

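Cook’s distance
Cook’s distance measure, $D_i = \sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2 / (p \cdot MSE)$, considers the influence of the ith case on all n fitted values. Here $\hat{y}_{j(i)}$ is the fitted value for the jth case when the ith case is excluded from the fit.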
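A sketch of Cook's distance computed from its definition, $D_i = \sum_j (\hat{y}_j - \hat{y}_{j(i)})^2 / (p \cdot MSE)$, by refitting without each case and comparing all n fitted values. The result is checked against the known shortcut $D_i = e_i^2 h_{ii} / (p \cdot MSE\,(1 - h_{ii})^2)$. The function name and data are hypothetical.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D from the definition: change in all n fitted values."""
    n, p = X.shape
    b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b_full
    mse = np.sum((y - y_hat) ** 2) / (n - p)
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        y_hat_without_i = X @ b_i          # all n fitted values, case i excluded
        D[i] = np.sum((y_hat - y_hat_without_i) ** 2) / (p * mse)
    return D

# Made-up data: the last case has high leverage and is off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

D = cooks_distance(X, y)

# Shortcut from the full fit only, used here as a cross-check.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
mse = e @ e / (n - p)
D_shortcut = e**2 * h / (p * mse * (1.0 - h) ** 2)

print("Cook's D:", np.round(D, 3))
print("matches shortcut:", np.allclose(D, D_shortcut))
```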

Cook’s distance
Relate $D_i$ to the F(p, n-p) distribution.
If $D_i$ is greater than the 50th percentile, F(0.50, p, n-p), then the ith case has substantial influence.
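A sketch of this comparison, assuming SciPy is available for the F percentile (`scipy.stats.f.ppf`). Cook's distance is computed with the shortcut $D_i = e_i^2 h_{ii} / (p \cdot MSE\,(1 - h_{ii})^2)$. The function name, its `percentile` parameter, and the data are hypothetical.

```python
import numpy as np
from scipy import stats

def influential_by_cooks(X, y, percentile=0.50):
    """Return Cook's D, the F(percentile; p, n-p) cutoff, and a flag per case."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    mse = e @ e / (n - p)
    D = e**2 * h / (p * mse * (1.0 - h) ** 2)
    cutoff = stats.f.ppf(percentile, p, n - p)   # 50th percentile by default
    return D, cutoff, D > cutoff

# Made-up data: the last case has high leverage and is off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0])
X = np.column_stack([np.ones_like(x), x])

D, cutoff, flagged = influential_by_cooks(X, y)
print("F(0.50, p, n-p) cutoff:", round(float(cutoff), 4))
print("influential cases (0-based index):", np.flatnonzero(flagged))
```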

[Two data tables, values not preserved: columns x, y, COOK.]