Model Adequacy: Testing Assumptions, Checking for Outliers, and More

Normal distribution of residuals
The normality assumption applies to the residuals.
One can simply save them and plot a density curve or histogram.
A quantile-quantile plot is often readily available; there we hope to find most of the points along the 45-degree line.
*After fitting the model: Models → Graphs → Basic diagnostic plots in R-commander
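For readers working outside the R-commander menus, a minimal base-R sketch of the same checks follows; the data frame `dat` and the model formula are hypothetical stand-ins.

```r
# Hypothetical data and model, just to make the sketch runnable
set.seed(123)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(100)

fit <- lm(y ~ x1 + x2, data = dat)
res <- residuals(fit)        # save the residuals

hist(res, freq = FALSE)      # histogram with a density overlay
lines(density(res))

qqnorm(res)                  # quantile-quantile plot
qqline(res)                  # points should fall near this reference line
```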

Homoscedasticity
Check a plot of the residuals versus the predicted values to get a sense of the spread along the regression line.
We prefer to see a blob about the zero line (the residual mean), with no readily discernible pattern.
This would mean the residuals do not get overly large for some regions of the regression line relative to others.
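Continuing with the hypothetical fit from the previous sketch, this plot is one line of base R:

```r
# Residuals vs. fitted values, reusing 'fit' from the normality sketch
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # hope for a patternless blob around this zero line
# plot(fit, which = 1) gives the same view with a smoother added
```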

Collinearity
Multiple regression is capable of analyzing data with correlated predictor variables.
However, problems arise when two or more predictors are highly intercorrelated.
Perfect collinearity
– Occurs if predictors are exact linear functions of each other (e.g., age and year of birth), when the researcher creates dummy variables for all values of a categorical variable rather than leaving one out, or when there are fewer observations than variables
– No unique regression solution exists
Less than perfect collinearity (the usual problem)
– Inflates standard errors and makes assessment of the relative importance of the predictors unreliable
– Also means that a small number of cases can potentially affect the results strongly
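A small hypothetical demonstration of the perfect case: when one predictor is an exact linear function of another (the age/year-of-birth example), there is no unique solution, and R reports NA for the aliased term.

```r
# Perfect collinearity demo with made-up data
set.seed(1)
age        <- round(rnorm(50, mean = 40, sd = 10))
year_birth <- 2005 - age                 # exact linear function of age
y          <- 5 + 0.3 * age + rnorm(50)
coef(lm(y ~ age + year_birth))           # year_birth comes back NA
```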

Collinearity
Simple and multicollinearity
– Arise when two or more predictors are highly correlated
– Can be detected by looking at the zero-order correlations
– Better: regress each IV on all the other predictors and look for large R²s
Although the estimates of the coefficients are not biased, they become inefficient
– They jump around a lot from sample to sample

Collinearity diagnostics
Tolerance
– Proportion of a predictor's variance not accounted for by the other predictors
– Look for tolerance values that are small, close to zero → the predictor contributes little new to the model
– Tolerance = 1/VIF
VIF (variance inflation factor)
– Look for large VIF values → e.g., an individual VIF greater than 10 should be inspected
– VIF = 1/tolerance
Other indicators of collinearity
– Eigenvalues: small values, close to zero
– Condition index: large values (15+)
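A sketch of these numeric checks using the car package's vif() (the package is assumed to be installed); the data and model are again hypothetical.

```r
library(car)                         # assumes car is installed
fit <- lm(y ~ x1 + x2, data = dat)   # hypothetical model as before
v <- vif(fit)   # variance inflation factors; values well above 10 warrant a look
1 / v           # tolerances; values near zero are the concern
```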

Dealing with collinearity
Collinearity is not necessarily a problem if the goal is only to predict, not to explain
– The inefficiency of the coefficients may not pose a real problem
A larger N might help reduce the standard errors of the coefficients
Combine variables into a composite, or remove a variable
– Must be theoretically feasible
Center the data (subtract the mean from each predictor)
– Interpretation of the coefficients changes, as the variables are now centered on zero
Or recognize its presence and live with the consequences
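A brief centering sketch under the same hypothetical data; centering chiefly eases the collinearity between predictors and their products or powers.

```r
# Centering: subtract each predictor's mean
dat$x1c <- dat$x1 - mean(dat$x1)         # or scale(dat$x1, scale = FALSE)
dat$x2c <- dat$x2 - mean(dat$x2)
fit_c <- lm(y ~ x1c * x2c, data = dat)   # slopes now refer to the other
                                         # predictor held at its mean
```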

Regression Diagnostics
Of course, all of the previous information is relatively useless if we are not meeting our assumptions and/or have overly influential data points
– In fact, you shouldn't really be looking at the results until you have tested assumptions and looked for outliers, even though this requires running the analysis to begin with
Various tools are available for the detection of outliers
Classical methods
– Standardized residuals (ZRESID)
– Studentized residuals (SRESID)
– Studentized deleted residuals (SDRESID)
Ways to think about outliers
– Leverage
– Discrepancy
– Influence
Thinking 'robustly'

Regression Diagnostics
Standardized residuals (ZRESID)
– Standardized errors in prediction: mean 0, SD = standard error of estimate
– To standardize, divide each residual by the standard error of estimate
– At best an initial indicator (e.g., the ±2 rule of thumb); because the case itself contributes to the estimate, it is almost useless on its own
Studentized residuals (SRESID)
– Same idea, but the studentized residual recognizes that the error in predicting values far from the mean of X is larger than for values closer to the mean of X
– The standard error is multiplied by a value that takes this into account
Studentized deleted residuals (SDRESID)
– Studentized residuals in which the standard error is calculated with the case in question removed from the others
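These three flavors map roughly onto base-R functions as follows, again for a hypothetical fit:

```r
fit   <- lm(y ~ x1 + x2, data = dat)          # hypothetical model as before
zres  <- residuals(fit) / summary(fit)$sigma  # ZRESID: residual / s.e.e.
sres  <- rstandard(fit)                       # SRESID: adjusts for leverage
sdres <- rstudent(fit)                        # SDRESID: case deleted before scaling
which(abs(sdres) > 2)                         # rough +/-2 rule-of-thumb flag
```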

Regression Diagnostics
Mahalanobis distance
– The distance of a case from the centroid of the remaining points (the point where the predictor means meet in n-dimensional space)
Cook's distance
– Identifies an influential data point, whether outlying on the predictors or the DV
– A measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients
– With larger (relative) values, excluding a case would change the coefficients substantially
DfBeta
– The change in a regression coefficient that results from the exclusion of a particular case
– Note that you get a DfBeta for each coefficient associated with the predictors
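Base-R equivalents for these measures, for the same hypothetical fit and data frame:

```r
cd  <- cooks.distance(fit)   # influence of each case on the coefficients
dfb <- dfbeta(fit)           # raw change in each coefficient if a case is dropped
                             # (dfbetas(fit) gives the standardized version)
X  <- dat[, c("x1", "x2")]   # Mahalanobis distance from the predictor centroid
md <- mahalanobis(X, center = colMeans(X), cov = cov(X))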

Regression Diagnostics
Leverage assesses outliers among the predictors
– Mahalanobis distance: a relatively high value suggests an outlier on one or more predictors
Discrepancy
– Measures the extent to which a case is out of line with the others
Influence
– A product of leverage and discrepancy
– How much would the coefficients change if the case were deleted? → Cook's distance, DfBetas

Outliers
Influence plots
With a couple of measures of 'outlierness' we can construct a scatterplot to flag especially problematic cases
– After fitting a regression model in R-commander, i.e., running the analysis, this graph is available via point and click
The result is effectively a 3-D plot: two outlier measures on the x and y axes (studentized residuals and 'hat' values, a measure of leverage) and a third as the size of each circle (Cook's distance)
For this example, case 35 appears to be a problem
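One way to draw this kind of plot outside the menus is car::influencePlot(), sketched here for a hypothetical model:

```r
library(car)                         # assumes car is installed
fit <- lm(y ~ x1 + x2, data = dat)   # hypothetical model as before
influencePlot(fit)   # studentized residuals vs. hat values, with circle
                     # area proportional to Cook's distance
```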

Outliers
It should be clear to interested readers whatever has been done to deal with outliers
Use appropriate software to perform robust regression (e.g., least trimmed squares) and compare and contrast the results with the classical approaches
– Applications such as S-PLUS, R, and even SAS and Stata provide methods for robust regression analysis
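A sketch of such a comparison using least trimmed squares from the MASS package (which ships with R); the model and data remain hypothetical.

```r
library(MASS)
fit_ols <- lm(y ~ x1 + x2, data = dat)
fit_lts <- lqs(y ~ x1 + x2, data = dat, method = "lts")  # least trimmed squares
cbind(OLS = coef(fit_ols), LTS = coef(fit_lts))  # large gaps suggest
                                                 # influential cases matter
```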

Summary: Outliers
No matter the analysis, some cases will be the 'most extreme'; however, none may really qualify as overly influential.
Whatever you do, always run some diagnostic analysis, and do not ignore influential cases.
It should be clear to interested readers whatever has been done to deal with outliers.
As noted before, the best approach when outliers do occur is to run a robust regression with capable software.

Suppressor variables
There are a couple of ways in which suppression can occur or be talked about, but the gist is that a third variable masks the impact the predictor would have on the dependent variable if that third variable did not exist.
In general, suppression occurs when $\beta_i$ falls outside the range $0 \le \beta_i \le r_{yi}$.
Suppression in MR can entail several different relationships among the IVs
– For example, one suppressor relationship would be where two variables, $X_1$ and $X_2$, are both positively related to Y, but the estimated equation comes out as $\hat{Y} = b_1 X_1 - b_2 X_2 + a$
Three kinds to be discussed
– Classical
– Net
– Cooperative

Suppression
When dealing with standardized regression coefficients in the two-predictor case, note that

$$\beta_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^{2}}, \qquad \beta_2 = \frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^{2}}$$

Suppression
Consider the following relationships among $r_{y1}$, $r_{y2}$, and $r_{12}$:
a. Complete independence: $R^2_{Y.12} = 0$
b. Partial independence: $R^2_{Y.12} = 0$ but $r_{12} \neq 0$
d. Partial independence again: both $r_{y1}$ and $r_{y2} \neq 0$, but $r_{12} = 0$

Suppression
e. Normal situation, redundancy: no simple correlation equals 0
– Each semipartial correlation, and the corresponding beta, will be less than the simple correlation between $X_i$ and Y, because the variables share variance and influence
f. Classical suppression: $r_{y2} = 0$

Suppression
Recall from previously that

$$\beta_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^{2}}$$

If $r_{y2} = 0$, then

$$\beta_1 = \frac{r_{y1}}{1 - r_{12}^{2}}$$

With increasingly shared variance between $X_1$ and $X_2$ we will have an inflated beta coefficient for $X_1$: $X_2$ is suppressing the error variance in $X_1$.
In other words, even though $X_2$ is not correlated with Y, having it in the equation raises the $R^2$ from what it would have been with just $X_1$.

Suppression
Other suppression situations
Net suppression
– All rs positive
– $\beta_2$ ends up with a sign opposite that of its simple correlation with Y
– It is always the X with the smaller $r_{yi}$ that ends up with a $\beta$ of opposite sign
– $\beta$ falls outside the range $0 \le \beta \le r_{yi}$, which is always true with any sort of suppression
Cooperative suppression
– Predictors negatively correlated with one another, both positively correlated with the DV (or positively with one another and negatively with Y)
– Example: correlation between social aggressiveness ($X_1$) and sales success (Y) = .29; correlation between record keeping ($X_2$) and sales success (Y) = .24; $r_{12}$ = -.30
– Regression coefficients for the IVs = .398 and .359, respectively
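As a quick arithmetic check, plugging these correlations into the two-predictor formula from earlier reproduces the reported coefficients; both betas exceed their simple correlations with Y, which is the signature of cooperative suppression:

$$\beta_1 = \frac{.29 - (.24)(-.30)}{1 - (-.30)^{2}} = \frac{.362}{.91} \approx .398, \qquad
\beta_2 = \frac{.24 - (.29)(-.30)}{1 - (-.30)^{2}} = \frac{.327}{.91} \approx .359$$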

Suppression
Gist: weird stuff can happen in MR, so take note of the relationships among the IVs and how they may affect your overall interpretation.
Compare the simple correlation of each IV with the DV to its respective beta coefficient*
– If a coefficient is noticeably larger than the simple correlation (in absolute value), or of opposite sign, one should suspect possible suppression
*For statistically significant IVs

Model Validation
Overfitting
Validation
Bootstrapping

Overfitting
External validity
In some cases, some of the variation the chosen parameters explain is idiosyncratic to the sample
– We would not see this variability in the population
So the fit of the model is good, but it does not generalize as well as one would think
Capitalization on chance

Overfitting
Example from Lattin, Carroll, and Green
Thirty variables were randomly generated to predict an outcome variable
Using a best-subsets approach, 3 variables were found that produce an $R^2$ of .33, i.e., 33% of the variance accounted for
As one can see, even random data can appear to be a decent fit
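A sketch that reproduces the flavor of this demonstration with the leaps package (assumed installed); the exact $R^2$ will vary with the random seed.

```r
library(leaps)
set.seed(42)
n <- 50
rand <- as.data.frame(matrix(rnorm(n * 30), n, 30))  # 30 random predictors
rand$y <- rnorm(n)                 # outcome unrelated to every predictor
best <- regsubsets(y ~ ., data = rand, nvmax = 3)    # best-subsets search
summary(best)$rsq                  # in-sample R^2 of the best 1-3 variable models
```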

Validation
One way to deal with such a problem is a simple random split
With large datasets one can randomly split the sample into two sets
– Calibration sample: used to estimate the coefficients
– Holdout sample: used to validate the model
Some suggest a 2:1 or 4:1 split
Using the coefficients from the calibration set, one can create predicted values for the holdout set
The squared correlation between the predicted and observed values can then be compared with the $R^2$ of the calibration set
In the previous example of randomly generated data, the $R^2$ for the holdout set was 0
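A minimal sketch of the 2:1 split, assuming a hypothetical data frame `dat` with outcome y:

```r
set.seed(7)
idx   <- sample(nrow(dat), size = round(2 / 3 * nrow(dat)))
calib <- dat[idx, ]                     # calibration sample
hold  <- dat[-idx, ]                    # holdout sample

fit  <- lm(y ~ x1 + x2, data = calib)   # estimate on the calibration set
pred <- predict(fit, newdata = hold)    # score the holdout set
cor(pred, hold$y)^2                     # compare with summary(fit)$r.squared
```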

Other approaches
Jackknife validation
– Create estimates with a particular case removed
– Use the coefficients obtained from analyzing the n - 1 remaining cases to create a predicted value for the removed case
– Do this for all cases, then compare the jackknifed $R^2$ to the original
Subsets approach
– Create several samples of the data of roughly equal size
– Use the holdout approach with one sample, obtaining estimates from the others
– Do this for each sample and average the estimates
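A leave-one-out (jackknife) validation sketch for the same hypothetical model:

```r
n    <- nrow(dat)
pred <- numeric(n)
for (i in seq_len(n)) {
  fit_i   <- lm(y ~ x1 + x2, data = dat[-i, ])   # refit without case i
  pred[i] <- predict(fit_i, newdata = dat[i, ])  # predict the held-out case
}
cor(pred, dat$y)^2   # jackknifed R^2, to set against the original
```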

Bootstrap
With relatively smaller samples*, cross-validation may not be as feasible
One may instead resample (with replacement) from the original data to obtain estimates for the coefficients
– Use what is available to create a sampling distribution for the values of interest
*but still large enough that the bootstrap estimates would be viable
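A sketch using the boot package, which ships with R; the model and data frame are hypothetical as before.

```r
library(boot)
# Refit the model on a resample of cases drawn with replacement
coef_fun <- function(data, index) coef(lm(y ~ x1 + x2, data = data[index, ]))
bs <- boot(dat, statistic = coef_fun, R = 1000)   # 1000 bootstrap resamples
bs   # bootstrap standard errors for the intercept and slopes
```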

Summary
There is a lot to consider when performing multiple regression analysis.
Actually running the analysis is just the first step; if that's all we do, we haven't done much.
A lot of work is necessary to make sure the conclusions drawn are worthwhile.
And that's OK; you can do it!