Multiple Linear Regression


Multiple Linear Regression Regression Diagnostics

Find Scores That:
- contribute to violation of assumptions;
- are suspect because they are far removed from the centroid (the multidimensional mean); or
- have undue influence on the solution.

Outliers Among the Predictors Leverage, hi (the hat diagonal). The larger this statistic, the greater the distance between the data point and the centroid in p-dimensional space. Investigate cases with hi greater than 2p/N, where p is the number of parameters in the model, including the intercept.
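The hat diagonals can be computed directly from the design matrix. A minimal NumPy sketch (the data here are made up purely for illustration; only the formulas come from the slide above):

```python
import numpy as np

# Toy design matrix: 11 cases, intercept + 2 predictors (made-up data)
rng = np.random.default_rng(0)
N, p = 11, 3
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])

# Hat matrix H = X (X'X)^{-1} X'; its diagonal gives each case's leverage
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Rule of thumb from the slide: investigate h_i > 2p/N
cutoff = 2 * p / N
flagged = np.where(leverage > cutoff)[0]
print(f"cutoff = {cutoff:.3f}, flagged cases: {flagged}")
```

A useful sanity check: the leverages always sum to p (the trace of the hat matrix), so their average is p/N and the cutoff flags cases at more than twice the average leverage.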

Distance from the Regression Surface Standardized Residual (aka studentized residual): the difference between the actual Y and the predicted Y, divided by an appropriate standard error. RStudent (aka studentized deleted residual): the same, except that for each case the regression surface is the one obtained when that case is removed. Investigate cases with absolute values greater than 2.
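Both kinds of residual can be computed from one fit; RStudent has a closed form that avoids refitting the model N times. A sketch on made-up data (the dataset and coefficients are invented for illustration):

```python
import numpy as np

# Toy regression: 11 cases, intercept + 2 predictors (made-up data)
rng = np.random.default_rng(1)
n, p = 11, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                              # ordinary residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages
s2 = e @ e / (n - p)                          # residual variance estimate

# Internally studentized residual: e_i / (s * sqrt(1 - h_i))
r = e / np.sqrt(s2 * (1 - h))

# Externally studentized residual (RStudent): sigma re-estimated with
# case i deleted; this closed form is equivalent to refitting n times
t = r * np.sqrt((n - p - 1) / (n - p - r**2))

print("investigate:", np.where(np.abs(t) > 2)[0])
```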

Influence on the Solution Cook’s D: how much the regression surface would change if this case were removed. Investigate cases with D > 1. Dfbetas: how much a single parameter (a slope or the intercept) would change if this case were removed. Investigate cases with absolute values > 2.
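Both statistics can be computed by literally deleting each case and refitting, which makes their meaning concrete. A sketch on made-up data (dataset invented for illustration; the cutoffs are the ones given above):

```python
import numpy as np

# Toy fit: 11 cases, intercept + 2 predictors (made-up data)
rng = np.random.default_rng(2)
n, p = 11, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages
s2 = e @ e / (n - p)

cooks_d = np.empty(n)
dfbetas = np.empty((n, p))
for i in range(n):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    beta_i = np.linalg.inv(Xi.T @ Xi) @ Xi.T @ yi
    # Cook's D: overall shift in the fitted values when case i is removed
    diff = X @ (beta - beta_i)
    cooks_d[i] = diff @ diff / (p * s2)
    # DFBETAS: per-coefficient shift, scaled by the deleted-case std. error
    s2_i = (yi - Xi @ beta_i) @ (yi - Xi @ beta_i) / (n - 1 - p)
    dfbetas[i] = (beta - beta_i) / np.sqrt(s2_i * np.diag(XtX_inv))

print("Cook's D > 1:", np.where(cooks_d > 1)[0])
print("|DFBETAS| > 2:", np.where(np.abs(dfbetas) > 2))
```

The brute-force loop agrees with the usual closed form D_i = r_i² · h_i / (p(1 − h_i)), where r_i is the internally studentized residual, which is a handy check.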

SAS Code

data regdiag;
  input SpermCount Together LastEjac @@;
  SR_LastEjac = sqrt(1 + LastEjac);
cards;
*<data here>;
proc univariate plot;
  var SpermCount -- SR_LastEjac;
run;
proc reg;
  model SpermCount = Together SR_LastEjac / influence r;
run;
*<nonsignificant results>;

data culled;
  set regdiag;
  if SpermCount < 700;
run;
proc reg;
  model SpermCount = Together SR_LastEjac / influence r;
title 'One Outlier Culled';
run;

Simple Example Y = sperm count; X1 = % time recently spent with mate; X2 = time since last ejaculation.

Output Statistics
Obs  Student Residual  Cook's D  RStudent  Hat Diag H  DFBETAS Intercept  Together  SR_LastEjac
  5       1.012          0.426    1.0139     0.5551         -0.0160        0.0288      -0.0236
  8      -0.183          0.006   -0.1715     0.3605         -0.0959        0.1083       0.0437
  9      -1.240          0.098   -1.2906     0.1600         -0.0398       -0.2265       0.0999
 10      -1.270          0.261   -1.3296     0.3270         -0.2614       -0.2321       0.4657
 11       2.643          1.183    6.9409     0.3369          1.6194        1.0137      -2.6903

Leverage Investigate cases with leverage greater than 2(3)/11 = .55. Case 5 is above this cutoff. It is a univariate outlier on the LastEjac variable. Further investigation indicates that the case is valid, so we retain it.

Residuals Case 11 has large residuals and should be investigated. Notice that its RStudent (6.9409) is much larger than its standardized residual (2.643), which indicates that removing this case has a large effect on the solution.

Obs  Student Residual  Cook's D  RStudent  Hat Diag H  DFBETAS Intercept  Together  SR_LastEjac
 11       2.643          1.183    6.9409     0.3369          1.6194        1.0137      -2.6903

Influence Case 11 has a high value of Cook’s D. It also has a high Dfbeta for the time-since-last-ejaculation predictor, even after I transformed that variable to reduce skewness. Upon investigation, it was found that this subject did not follow the instructions for gathering the data, so his scores were deleted.

Plots of Residuals These can also be useful, but it takes some practice to get good at detecting problems from such plots. Plot the residuals versus the predicted Y.

Heteroscedasticity

Try Squaring One Predictor

Residuals not Normal and Variance not Constant