Copyright (c) Bani K. Mallick. STAT 651 Lecture #20.


Slide 2: Topics in Lecture #20. Outliers and leverage; Cook's distance.

Slide 3: Book Chapters in Lecture #20. Small part of Chapter 11.2.

Slide 4: Relevant SPSS Tutorials. Regression diagnostics; diagnostics for problem points.

Slide 5: Lecture 19 Review, Population Slope and Intercept. If the slope β₁ ≠ 0, then we have a graph like this: [sketch of a sloped regression line against X]. The line gives the mean of Y for those whose independent variable is X.

Slide 6: Lecture 19 Review, Population Slope and Intercept. If β₁ = 0, then we have a graph like this: [sketch of a flat line against X]. Note how the mean of Y does not depend on X: Y and X are independent.

Slide 7: Lecture 19 Review, Linear Regression. If β₁ = 0, then Y and X are independent. So, we can test the null hypothesis that Y and X are independent by testing H₀: β₁ = 0. The p-value in regression tables tests this hypothesis.
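As a Python sketch of this test (hypothetical simulated data, not the lecture's SPSS output), scipy's `linregress` reports the two-sided p-value for H₀: β₁ = 0:

```python
import numpy as np
from scipy import stats

# Hypothetical data: Y is generated independently of X,
# so the true slope is 0 and H0: beta_1 = 0 holds
rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = rng.normal(size=40)

res = stats.linregress(x, y)
# res.pvalue is the two-sided p-value for H0: beta_1 = 0,
# i.e. the regression-table test that Y and X are independent
print(f"slope = {res.slope:.3f}, p-value = {res.pvalue:.3f}")
```

A small p-value here would lead us to reject independence; with data simulated under the null, we usually do not.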

Slide 8: Lecture 19 Review, Regression. The standard deviation of the errors ε is called σ_ε. This means that every subpopulation sharing the same value of X has mean β₀ + β₁X and standard deviation σ_ε.

Slide 9: Lecture 19 Review, Regression. The least squares estimate β̂₁ is a random variable. Its estimated standard deviation is se(β̂₁) = σ̂_ε / √(Σᵢ (Xᵢ − X̄)²).

Slide 10: Lecture 19 Review, Regression. The (1 − α)100% confidence interval for the population slope is β̂₁ ± t_{α/2, n−2} se(β̂₁).
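A minimal Python sketch of the last three review slides (simulated data of my own, not the lecture's): estimate the slope, σ̂_ε, se(β̂₁), and the 95% confidence interval.

```python
import numpy as np
from scipy import stats

# Hypothetical data with a true slope of 0.5
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)

n = x.size
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_eps = np.sqrt(np.sum(resid ** 2) / (n - 2))          # estimate of sigma_eps
se_b1 = s_eps / np.sqrt(np.sum((x - x.mean()) ** 2))   # se of the slope

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)          # t quantile, n-2 df
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(f"slope = {b1:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval is exactly the slide's β̂₁ ± t_{α/2, n−2} se(β̂₁), built from scratch rather than read off an SPSS table.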

Slide 11: Lecture 19 Review, Residuals. You can check the assumption that the errors are normally distributed by constructing a q-q plot of the residuals.
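As a sketch (simulated residuals, not the lecture's), scipy's `probplot` computes the q-q points together with the correlation of the plot, which should be close to 1 when the errors are normal:

```python
import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted regression
rng = np.random.default_rng(1)
residuals = rng.normal(0, 2, size=50)

# probplot returns the q-q points plus the fitted line's slope/intercept/r
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# r close to 1 means the q-q plot is nearly a straight line,
# consistent with normally distributed errors
print(f"q-q correlation r = {r:.3f}")
```

Passing `plot=plt` (with matplotlib) would draw the plot itself; the correlation alone is already a rough numerical check.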

Slide 12: Leverage and Outliers. Outliers in linear regression are difficult to diagnose: they depend crucially on where X is. [scatterplot sketch] A boxplot of Y would flag this point as an outlier, when in reality it fits the line quite well.

Slide 13: Outliers and Leverage. It's also the case that one observation can have a dramatic impact on the fit. [scatterplot sketch] This is called a leverage value because its X is so far from the rest and, as we'll see, it exerts a lot of leverage in determining the line.

Slide 16: Outliers and Leverage. It's also the case that one observation can have a dramatic impact on the fit. [scatterplot sketch with an added point far to the right] The slope of the line depends crucially on the value far to the right.

Slide 17: Outliers and Leverage. But outliers can occur. [scatterplot sketch showing the fitted line with and without the outlier] This point is simply too high for its value of X.

Slide 18: Outliers and Leverage. A leverage point is an observation whose value of X is outlying among the X values. An outlier is an observation of Y that does not agree with the main trend of the data. Outliers and leverage values can distort the fitted least squares line, so it is important to have diagnostics that detect when disaster might strike.

Slide 19: Outliers and Leverage. We have three methods for diagnosing high leverage values and outliers: leverage plots (for a single X, these are basically the same as boxplots of the X-space), Cook's distance (which measures how much the fitted line changes if an observation is deleted), and residual plots.

Slide 20: Outliers and Leverage. Leverage plots: you plot the leverage against the observation number (first observation in your data file = #1, second = #2, etc.). Leverage for observation j is defined as h_j = 1/n + (X_j − X̄)² / Σᵢ (Xᵢ − X̄)². In effect, you measure the distance of an observation from the mean of the X's relative to their total spread.
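The leverage formula can be computed directly in NumPy. A sketch with hypothetical X values, the last of which is deliberately far from the rest:

```python
import numpy as np

# Hypothetical X values; the last one is far from the rest (a leverage point)
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 10.0])
n = x.size

# Leverage for simple linear regression:
#   h_j = 1/n + (x_j - xbar)^2 / sum_i (x_i - xbar)^2
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
print(h.round(3))
```

Two useful facts for reading the plot: the leverages always sum to the number of regression parameters (2 in simple regression), and the isolated point at x = 10 gets leverage near 1, dominating the fit.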

Slide 21: Outliers and Leverage. Remember the GPA and height example. Are there any obvious outliers/leverage points?

Slide 22: Outliers and Leverage. Remember the GPA and height example. Are there any obvious outliers/leverage points? Not really!

Slide 23: Outliers and Leverage. The leverage plot should show nothing really dramatic. This is just normal scatter; it takes experience to read.

Slide 24: Outliers and Leverage. The Cook's distance for observation j is computed as follows: compute the fitted values with all the data; compute the fitted values with observation j deleted; then compute the sum of the squared differences. It measures how much the line changes when an observation is deleted.
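A leave-one-out sketch of this recipe on hypothetical data (I scale the sum of squared differences by p·MSE, the usual normalization, which the slide leaves implicit):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares intercept and slope."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

# Hypothetical data; the last point is an outlier (too low for its x)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 1.0])

b0, b1 = fit_line(x, y)
p = 2                                       # number of regression parameters
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (x.size - p)

cooks = np.empty(x.size)
for j in range(x.size):
    keep = np.arange(x.size) != j
    b0j, b1j = fit_line(x[keep], y[keep])   # refit with observation j deleted
    # sum of squared changes in the fitted values, scaled by p * MSE
    cooks[j] = np.sum(((b0 + b1 * x) - (b0j + b1j * x)) ** 2) / (p * mse)

print(cooks.round(2))
```

The planted outlier has by far the largest Cook's distance, which is exactly what the Cook's distance plot is meant to reveal.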

Slide 25: Outliers and Leverage. The Cook's distance plot should show nothing really dramatic. This is just normal scatter; it takes experience to read.

Slide 26: Outliers and Leverage. The residual plot is a plot of the residuals (on the y-axis) against the predicted values (on the x-axis). You should look for values that seem quite extreme.

Slide 27: Outliers and Leverage. The residual plot should show nothing really dramatic. This is just normal scatter with no massive outliers; it takes experience to read.

Slide 28: Outliers and Leverage. A much more difficult example occurs with the stenotic kids. [SPSS Coefficients table: intercept and Body Surface Area, with unstandardized and standardized coefficients, t, significance, and the 95% confidence interval for B; dependent variable: Log(1 + Aortic Valve Area)]

Slide 29: Outliers and Leverage. A much more difficult example occurs with the stenotic kids. [diagnostic plot, annotated: outlier?]

Slide 30: Outliers and Leverage. This makes sense, since the data show no unusual X-values. Scatterplot comes next.

Slide 31: Outliers and Leverage. A much more difficult example occurs with the stenotic kids. [scatterplot, annotated: outlier?]

Slide 32: Outliers and Leverage. Wow! This is a case where there is a noticeable outlier, but not too high leverage.

Slide 33: Outliers and Leverage. Wow!

Slide 34: Outliers and Leverage, Low Leverage Outliers. [Two SPSS Coefficients tables: one for all stenotic kids, one for the stenotic kids with the outlier removed; predictor: Body Surface Area; dependent variable: Log(1 + Aortic Valve Area)]

Slide 35: Remember, Outliers Inflate Variance! [Two SPSS ANOVA tables, with regression, residual, and total sums of squares, F, and significance, for the fits with and without the outlier; predictor: Body Surface Area; dependent variable: Log(1 + Aortic Valve Area)]

Slide 36: Outliers and Leverage. The effect of a high leverage outlier is often to inflate your estimate of σ_ε: the MSE (mean squared residual) is noticeably larger with the outlier than without it. So, a single outlier in 56 observations increases your estimate of σ_ε by over 50%! This becomes important later!
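The inflation effect is easy to reproduce on simulated data. This sketch uses 56 observations, as on the slide, but the numbers are my own rather than the lecture's MSE values:

```python
import numpy as np

def mse_of_fit(x, y):
    """MSE (mean squared residual) from a least-squares line fit."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return np.sum((y - (b0 + b1 * x)) ** 2) / (x.size - 2)

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 56)                 # 56 observations, as in the slide
y = 1.0 + 0.8 * x + rng.normal(0, 1, 56)  # clean data, error sd = 1
y_out = y.copy()
y_out[-1] += 8.0                           # plant a single large outlier

ratio = mse_of_fit(x, y_out) / mse_of_fit(x, y)
print(f"MSE inflation factor from one outlier: {ratio:.2f}")
```

One planted point inflates the MSE, and hence σ̂_ε, for the entire fit, which is the slide's point about why outliers matter for later inference.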

Slide 37: Base Pay and Age in Construction. No outliers. Not a strong trend, but in the expected direction.

Slide 38: Base Pay and Age in Construction. Not even close to normally distributed; cries out for a transformation.

Slide 39: Log(Base Pay) and Age in Construction. Expected trend, but weak. Odd data structure: salaries were rounded in clumps of $5,000.

Slide 40: Log(Base Pay) and Age in Construction. Much better residual plot. A good time to remember why we want data to be normally distributed.

Slide 41: Log(Base Pay) and Age in Construction. No real massive influential points, according to Cook's distances.

Slide 42: Log(Base Pay) and Age in Construction. Note the statistically significant effect: do we have 99% confidence?

Slide 43: Log(Base Pay − $30,000) and Age in Construction.