Unit 9: Dealing with Messy Data I: Case Analysis

Anscombe's Quartet

lm(y1 ~ x, data = Quartet)
Coefficients:
             Estimate     SE t-statistic Pr(>|t|)
(Intercept)    3.0001 1.1247       2.667  0.02573 *
x              0.5001 0.1179       4.241  0.00217 **
---
Sum of squared errors (SSE): 13.8, Error df: 9
R-squared: 0.6665

lm(y2 ~ x, data = Quartet)
             Estimate     SE t-statistic Pr(>|t|)
(Intercept)     3.001  1.125       2.667  0.02576 *
x               0.500  0.118       4.239  0.00218 **
R-squared: 0.6662

Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21.
See the Quartet data frame in the car package.
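The point is easiest to see by fitting and plotting all four pairs yourself. A minimal sketch using R's built-in anscombe data frame (which holds the same values as the Quartet data frame in car that the slide references):

data(anscombe)
# fit the same simple regression to each of the four x-y pairs
fits <- lapply(1:4, function(i)
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe))
sapply(fits, function(m) round(coef(m), 3))   # near-identical coefficients
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits[[i]])   # same fitted line, four very different data patterns
}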

Case Analysis
The goal is to identify any unusual or excessively influential observations. Such data points may bias results and/or reduce power to detect effects (by inflating standard errors and/or decreasing R2).
Three aspects of individual observations we attend to:
Leverage
Regression outliers
Influence
Case analysis also provides an important first step as you get to "know" your data.

Case Analysis: Unusual and Influential Data

setwd('P:\\CourseWebsites\\PSY710\\Data\\Diagnostics')
d1 = dfReadDat('DOSE2.dat')
d1$Sex = as.numeric(d1$Sex) - 1.5
m1 = lm(FPS ~ BAC + TA + Sex, data = d1)
modelSummary(m1)

Coefficients:
              Estimate        SE t-statistic Pr(>|t|)
(Intercept)   21.85097   7.38361       2.959  0.00392 **
BAC         -196.07232  83.21315      -2.356  0.02058 *
TA             0.14553   0.03119       4.666 1.04e-05 ***
Sex          -20.01198   6.57956      -3.042  0.00307 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sum of squared errors (SSE): 94386.3, Error df: 92
R-squared: 0.2950

Univariate Statistics and Graphs 1: Univariate Statistics (n's, means, SDs, min/max, shape)

varDescribe(d1)
    var  n   mean     sd median   min    max  skew kurtosis
BAC   1 96   0.06   0.04   0.06   0.0   0.14 -0.09    -1.09
TA    2 96 147.61 105.73 119.00  10.0 445.00  0.89    -0.06
Sex   3 96   0.01   0.50   0.50  -0.5   0.50 -0.04    -2.02
FPS   4 96  32.19  37.54  19.46 -98.1 162.74  0.62     1.93
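varDescribe() comes from the course's lmSupport package. If you don't have it, a sketch of an equivalent table with psych::describe() (assuming d1 is loaded as above):

library(psych)
describe(d1)   # n, mean, sd, median, min, max, skew, kurtosis per variable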

Univariate Statistics and Graphs 2: Univariate Plots (histograms, rug, and density plots)

varPlot(d1$FPS, 'FPS')
See also: hist(), rug(), density()

"Descriptive statistics: FPS"
 n  mean    sd median   min    max skew kurtosis
96 32.19 37.54  19.46 -98.1 162.74 0.62     1.93

Bivariate Correlations

corr.test(d1)
Correlation matrix
      BAC    TA   Sex   FPS
BAC  1.00 -0.02 -0.07 -0.19
TA  -0.02  1.00 -0.08  0.44
Sex -0.07 -0.08  1.00 -0.29
FPS -0.19  0.44 -0.29  1.00
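corr.test() is from the psych package; base R's cor() reproduces the matrix itself (without the significance tests):

round(cor(d1), 2)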

Statistics and Graphs 3: Bivariate Plots (Scatterplot, Rug, & Density)

spm(~ FPS + BAC + TA + Sex, data = d1)
(spm() is the scatterplot matrix function, scatterplotMatrix(), from the car package.)

Leverage (Cartoon Data)
4. Check for high-leverage points.
Leverage is a property of the predictors only (the DV is not considered when assessing leverage). An observation has increasing "leverage" on the results as its distance from the mean of the predictors increases.
Which points have the most leverage in the one-predictor example below?

Leverage
Hat values (h_i) provide an index of leverage. In the one-predictor case:
$h_i = \frac{1}{N} + \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}$
With multiple predictors, h_i measures the distance from the centroid (point of means) of the Xs.
Hat values are bounded between 1/N and 1. The mean hat value is $\bar{h} = P/N$.
Rules of thumb:
$h_i > 3\bar{h}$ for small samples (N < 100)
$h_i > 2\bar{h}$ for large samples
Do NOT blindly apply these rules of thumb. Look instead for hat values that are separated from the rest of the distribution of h_i: view a histogram of h_i.
NOTE: Mahalanobis distance $= (N - 1)(h_i - 1/N)$. SPSS reports centered leverage $(h_i - 1/N)$.
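These quantities are easy to inspect directly in base R. A sketch, assuming the m1 model fit earlier:

h <- hatvalues(m1)
P <- length(coef(m1))   # number of parameters, including the intercept
N <- nobs(m1)
mean(h)                 # equals P/N
hist(h)                 # look for values separated from the rest
which(h > 3 * mean(h))  # small-sample rule of thumb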

Leverage (Cartoon Data)
High leverage values are not always bad; in some cases they are good. You must also consider whether the observations are regression outliers. WHY?
$R^2 = \frac{SSE(\text{Mean-only}) - SSE(A)}{SSE(\text{Mean-only})}$
$SE_{b_i} = \frac{s_Y}{s_i}\sqrt{\frac{1 - R_Y^2}{(N - k - 1)(1 - R_i^2)}}$
High-leverage points that are fit well by the model increase the difference between SSE(Mean-only) and SSE(A), which increases R2.
High-leverage points that are fit well also increase the variance of the predictor (s_i). This reduces the SE for that predictor and yields more power.
Well-fit, high-leverage points do NOT alter the b's.
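A small simulation makes this concrete (a sketch with made-up data; all names are illustrative):

set.seed(1)
x <- rnorm(20)
y <- 2 + 3 * x + rnorm(20)
m.a <- lm(y ~ x)
# add one high-leverage point that falls exactly on the fitted line
x.hi <- c(x, 5)
y.hi <- c(y, predict(m.a, data.frame(x = 5)))
m.b <- lm(y.hi ~ x.hi)
coef(m.a); coef(m.b)   # slope and intercept essentially unchanged
summary(m.a)$coefficients["x", "Std. Error"]
summary(m.b)$coefficients["x.hi", "Std. Error"]   # smaller SE: more power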

Leverage (Real Data)

modelCaseAnalysis(m1, Type='hatvalues')

Regression Outliers (Cartoon Data)
5. Check for regression outliers.
A regression outlier is an observation that is not adequately fit by the regression model (i.e., it falls very far from the prediction line). In essence, it is a discrepant score with a large residual (e_i).
Which point(s) are regression outliers?

Regression Outliers
There are multiple quantitative indices for identifying regression outliers, including raw residuals (e_i), standardized residuals (e'_i), and studentized residuals (t'_i). The preferred index is the studentized residual:
$t_i' = \frac{e_i}{SE_{e(-i)}\sqrt{1 - h_i}}$
t'_i follows a t-distribution with N − P − 1 degrees of freedom.
You can apply a Bonferroni correction when testing the studentized residuals, but again, not blindly: also view a histogram of t'_i.
NOTE: SPSS calls these Studentized Deleted Residuals; Cohen calls them Externally Studentized Residuals.
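In base R, rstudent() returns these directly. A sketch of the Bonferroni-corrected test, again assuming m1 from above:

t.prime <- rstudent(m1)
hist(t.prime)
# two-tailed p-values, Bonferroni-corrected across all N tests
N <- nobs(m1); P <- length(coef(m1))
p <- 2 * pt(abs(t.prime), df = N - P - 1, lower.tail = FALSE)
head(sort(pmin(p * N, 1)))   # smallest corrected p-values first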

Regression Outliers (Cartoon Data)
Regression outliers are always bad, but they can have two different types of bad effects. WHY?
$R^2 = \frac{SSE(\text{Mean-only}) - SSE(A)}{SSE(\text{Mean-only})}$
$SE_{b_i} = \frac{s_Y}{s_i}\sqrt{\frac{1 - R_Y^2}{(N - k - 1)(1 - R_i^2)}}$
Regression outliers increase SSE(A), which decreases R2. A decreased R2 leads to increased SEs for the b's.
If the outlier also has leverage, it can alter (increase or decrease) the b's.
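Continuing the simulation sketch from above: an outlier placed at the center of x (large residual, minimal leverage) leaves the slope nearly alone but inflates its SE:

x.out <- c(x, mean(x))
y.out <- c(y, mean(y) + 15)
m.c <- lm(y.out ~ x.out)
coef(m.a); coef(m.c)   # slope barely moves
summary(m.a)$coefficients["x", "Std. Error"]
summary(m.c)$coefficients["x.out", "Std. Error"]   # larger SE: less power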

Regression Outliers (Real Data)

modelCaseAnalysis(m1, Type='residuals')

Regression Outliers (Real Data)

outlierTest(m1, cutoff = .05)
     rstudent unadjusted p-value Bonferroni p
0125 -4.39553         2.9872e-05    0.0028677

Influence (Cartoon Data)
An observation is "influential" if it substantially alters the fitted regression model (i.e., the coefficients and/or intercept).
Two commonly used assessment methods:
Cook's distance
dfBetas
Which point(s) have the most influence?

Cook's Distance
Cook's distance (D_i) provides a single summary statistic indexing how much influence each score has on the overall model. It is based on both the "outlierness" (standardized residual) and the leverage of the observation:
$D_i = \frac{e_i'^2}{P} \cdot \frac{h_i}{1 - h_i}$
$D_i > 4/(N - P)$ has been proposed as a very liberal cutoff (it identifies a lot of influential points).
$D_i > qf(.5, P, N - P)$ (the median of the corresponding F distribution) has also been employed as a very conservative cutoff.
Identification of problematic scores should be considered in the context of the overall distribution of D_i.
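Base R computes these directly. A sketch using m1 and both cutoffs:

d.cook <- cooks.distance(m1)
N <- nobs(m1); P <- length(coef(m1))
hist(d.cook)
which(d.cook > 4 / (N - P))        # liberal cutoff
which(d.cook > qf(.5, P, N - P))   # conservative cutoff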

Cook's Distance (Real Data)

modelCaseAnalysis(m1, Type='cooksd')

Influence Bubble Plot (Real Data)

modelCaseAnalysis(m1, Type='influenceplot')
What are the expected effects of each of these points on the model?
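If you are not using the course package, car provides an equivalent bubble plot (studentized residuals against hat values, with circle area proportional to Cook's distance):

library(car)
influencePlot(m1)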

dfBetas
dfBeta_ij is an index of how much each regression coefficient (j = 0, ..., k) would change if the i-th score were deleted:
$dfBeta_{ij} = b_j - b_{j(-i)}$
dfBetas (preferred) is the standardized form of the index:
$dfBetas_{ij} = \frac{dfBeta_{ij}}{SE_{b_{j(-i)}}}$
|dfBetas| > 2 may be problematic; |dfBetas| > $2/\sqrt{N}$ in larger samples (Belsley et al., 1980).
Consider the distribution with a histogram! You can also visualize influence with an added variable plot.
One problem is that there can be many dfBetas (a set for each predictor and the intercept), so they are most helpful when there is one critical/focal effect.
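In base R (a sketch; singling out the BAC coefficient is just for illustration):

db <- dfbetas(m1)            # one column per coefficient, including the intercept
head(round(db, 3))
cutoff <- 2 / sqrt(nobs(m1)) # Belsley et al. large-sample rule
which(abs(db[, "BAC"]) > cutoff)   # flag cases for the focal predictor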

dfBetas (Real Data)

modelCaseAnalysis(m1, Type='dfbetas')

Added Variable Plot (Real Data)
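The slide's figure is an added-variable (partial regression) plot; in car these are produced with:

avPlots(m1)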

Impact on SEs
In addition to altering regression coefficients (and reducing R2), problematic scores can increase the SEs of the regression coefficients (i.e., reduce the precision of estimation).
COVRATIO is an index of how individual scores affect the overall precision of estimation (the joint confidence region for the set of coefficients).
Observations that decrease the precision of estimation have COVRATIOs < 1.0.
Belsley et al. (1980) proposed the cutoff: $|COVRATIO_i - 1| > 3P/N$.
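A base-R sketch using stats::covratio() and the Belsley cutoff:

cr <- covratio(m1)
N <- nobs(m1); P <- length(coef(m1))
which(abs(cr - 1) > 3 * P / N)   # flag cases outside the Belsley band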

Impact on SEs (Real Data)

modelCaseAnalysis(m1, Type='covratio')

Enter the Real World
So what do you do?

Overall Impact of Problematic Scores: Real Data

Coefficients:
              Estimate        SE t-statistic Pr(>|t|)
(Intercept)   21.85097   7.38361       2.959  0.00392 **
BAC         -196.07232  83.21315      -2.356  0.02058 *
TA             0.14553   0.03119       4.666 1.04e-05 ***
Sex          -20.01198   6.57956      -3.042  0.00307 **
---
Sum of squared errors (SSE): 94386.3, Error df: 92
R-squared: 0.2950

d2 = dfRemoveCases(d1, c('0125'))
m2 = lm(FPS ~ BAC + TA + Sex, data = d2)
modelSummary(m2)
              Estimate        SE t-statistic Pr(>|t|)
(Intercept)    26.4196    6.8223       3.873 0.000203 ***
BAC          -243.1829   76.7423      -3.169 0.002085 **
TA              0.1415    0.0285       4.964  3.2e-06 ***
Sex           -17.6754    6.0319      -2.930 0.004281 **
Sum of squared errors (SSE): 77856.2, Error df: 91
R-squared: 0.3330
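The same sensitivity check can be done without the course package. A base-R sketch, where '0125' is the row name of the flagged case:

m2.check <- update(m1, data = d1[rownames(d1) != '0125', ])
coef(m1); coef(m2.check)   # compare coefficients with and without the case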

Four Examples with Fake Data