© Willett, Harvard University Graduate School of Education, 11/13/2015, S052/I.1(c)


Slide 1: S052/§I.1(c): Applied Data Analysis
Roadmap of the Course – What Is Today's Topic?

More details can be found in the “Course Objectives and Content” handout on the course webpage.

Multiple Regression Analysis (MRA)
- Do your residuals meet the required assumptions?
    - Test for residual normality.
    - Use influence statistics to detect atypical datapoints. (Today's Topic Area)
- If your residuals are not independent, replace OLS by GLS regression analysis:
    - Use individual growth modeling.
    - Specify a multi-level model.
- If your sole predictor is continuous, MRA is identical to correlational analysis.
- If your sole predictor is dichotomous, MRA is identical to a t-test.
- If your several predictors are categorical, MRA is identical to ANOVA.
- If time is a predictor, you need discrete-time survival analysis …
- If your outcome is categorical, you need to use:
    - Binomial logistic regression analysis (dichotomous outcome).
    - Multinomial logistic regression analysis (polytomous outcome).
- If you have more predictors than you can deal with:
    - Create taxonomies of fitted models and compare them.
    - Form composites of the indicators of any common construct.
    - Conduct a Principal Components Analysis.
    - Use Cluster Analysis.
- If your outcome vs. predictor relationship is non-linear:
    - Use non-linear regression analysis.
    - Transform the outcome or predictor.
- How do you deal with missing data?

Slide 2: Where Does Today's Topic Appear in the Printed Syllabus?

Please check the interconnections among the Roadmap, the Daily Topic Area, the Printed Syllabus, and the content of today's class when you pre-read the day's materials.

Syllabus Section I.1(c), on Detecting Influential Data-Points, and Assessing Their Impact On Model Fit, includes:
- The Story So Far – Anything We Still Need to Consider? (Slides 3-4).
- Data-Points Can Be Atypical In Two Important Ways (Slide 5).
- Three Useful “Influence” Statistics (Slides 6-8).
- Estimating And Inspecting The Influence Statistics (Slides 9-12).
- Looking for Interesting Groupings Among the Atypical Data-Points (Slides 13-15).
- Conducting A Reasonable Sensitivity Analysis (Slides 16-17).
- Appendix 1: Technical Definitions of the PRESS, HAT & COOK'S D Statistics (Slides 18-20).

Slide 3: S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
The Story So Far – What Have We Still To Consider?

Is there any reason we might not trust the parameter estimates, statistical inference and goodness-of-fit statistics obtained in this “final” model?

Two Issues Have Gone Unexamined:
- Atypical data points may be present in the point-cloud and driving the findings.
    - Need to check this.
    - Make sure all is well.
- Need to check that the usual regression assumptions are met!

Why Wait Until The Final Model To Check These Issues Out?
- Probably should have checked them earlier, but we must certainly check them here!
- If we find anything strange, we can always go back and refit the earlier models too!

Slide 4: Could There Be A Problem With Atypical Data Points in the ILLCAUSE Analysis?

Think back to our initial exploratory analyses … “Houston, we have a problem … ?”

Three Important Questions:
1. How do we locate atypical data-points (and, what do we mean by “atypical”)?
2. How do we evaluate, and compare, the impact of each atypical data-point on the fitted regression model?
3. Does it matter if the atypical data-points occur in groups or families?

And things may be worse than we think …
- This is a simple example with only a few predictors: atypical data-points show up clearly.
- With many predictors in an analysis, it's not so easy to spot atypical data-points: detection usually depends on how you look at the data.
- And, all atypical data-points are not created equal: the impact of each point on the fit depends on where it sits in the point-cloud.

Slide 5: Data-Points Can Be “Atypical” In Two Important Ways

“Outliers” are “Extreme-on-Y”. How Do Outliers Affect The Fit?
- May not affect parameter estimates much:
    - May impact the estimated intercept, a little.
    - May leave the estimated slope unchanged.
- Will usually inflate the SSError, causing:
    - Big reduction in goodness-of-fit statistics (R²).
    - Big impact on inferential statistics: bigger standard errors, smaller t-statistics, bigger p-values, and less power for the analysis (i.e., harder to reject the null hypothesis).

“High-Leverage” data-points are “Extreme-on-X”. How Do High-Leverage Points Affect The Fit?
- May affect parameter estimates a lot:
    - Particularly the estimated slope.
    - May lead to big changes in the estimated intercept, because of the “see-saw” effect.
- May cause unpredictable fluctuations in SSError, with contingent impact on:
    - Goodness-of-fit (R²).
    - Statistical inference: standard errors, t-statistics, p-values, hypothesis testing.
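The contrast between the two kinds of atypicality can be sketched with a one-predictor regression. The example below is in Python rather than the SAS used on the slides, and the data, the `fit_line` helper and all variable names are hypothetical, chosen only to make the effect visible: an outlier placed at the mean of X leaves the slope alone but inflates SSError, while a point far out on X that is off the trend pulls the slope toward itself.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope, SSE)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse


# Hypothetical toy data lying exactly on the line Y = 2X.
base_x = [1, 2, 3, 4, 5]
base_y = [2, 4, 6, 8, 10]

base_fit = fit_line(base_x, base_y)

# Outlier: extreme on Y, but sitting at the mean of X.
outlier_fit = fit_line(base_x + [3], base_y + [16])

# High-leverage point: extreme on X, and off the trend of the other cases.
leverage_fit = fit_line(base_x + [20], base_y + [10])

for label, fit in [("base", base_fit),
                   ("outlier", outlier_fit),
                   ("high leverage", leverage_fit)]:
    print(f"{label:14s} intercept={fit[0]:7.3f} slope={fit[1]:6.3f} SSE={fit[2]:8.3f}")
```

In this toy case the outlier shifts only the intercept, whereas the high-leverage point changes both the slope and the intercept, which is the “see-saw” effect described above.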

Slide 6: Three Useful “Influence” Statistics, Each Responsible for a Different Job!

How do you detect atypical data-points in a large multi-dimensional point-cloud?

You need two statistics to identify an atypical data-point's location!
- To detect one that is “Extreme-on-Y” use the PRESS Residual.
- To detect one that is “Extreme-on-X” use the HAT Statistic.

You need one statistic to identify an atypical data-point's impact!
- To assess the overall impact of the point on the regression fit use Cook's D Statistic.

Decision-Rule: To locate atypical cases seek those with “large” values (relatively speaking) of the Influence Statistics – see Handout I.1(c).1
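As a rough illustration of how the three statistics divide the work, the sketch below computes all of them for a one-predictor regression using the standard textbook formulas (leverage h = 1/n + (x - x̄)²/Sxx, PRESS residual e/(1 - h), and Cook's D = e²h / (p · MSE · (1 - h)²) with p = 2 parameters). It is written in Python, not the SAS of the slides; the function name `influence_stats` and the ten data points are assumptions made for the example.

```python
def influence_stats(xs, ys):
    """Return (hat, press, cookd) lists for a simple linear regression of ys on xs."""
    n, p = len(xs), 2                      # p = intercept + one slope
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar

    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in resid) / (n - p)

    hat = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]          # Extreme-on-X
    press = [e / (1 - h) for e, h in zip(resid, hat)]           # Extreme-on-Y
    cookd = [e * e * h / (p * mse * (1 - h) ** 2)               # overall impact
             for e, h in zip(resid, hat)]
    return hat, press, cookd


# Hypothetical data: case 4 is off the trend on Y; case 9 is far out on X but on the trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
ys = [1.1, 1.9, 3.2, 3.8, 9.0, 6.1, 6.9, 8.2, 8.8, 30.5]

hat, press, cookd = influence_stats(xs, ys)
most_leverage = max(range(len(xs)), key=lambda i: hat[i])
biggest_press = max(range(len(xs)), key=lambda i: abs(press[i]))
biggest_cookd = max(range(len(xs)), key=lambda i: cookd[i])
print(most_leverage, biggest_press, biggest_cookd)
```

In this made-up dataset the case with the most leverage is not the case with the largest Cook's D, because it happens to sit on the trend line; location and impact are separate questions, which is why more than one statistic is needed.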

Slide 7: Estimating and Inspecting the Influence Statistics is Easy!

In Handout I.1(c).1, I estimate, display and summarize influence statistics for Final Model M6:

    *---------------------------------------------------------------------------------*
      Estimate and output influence statistics for the final regression model
    *---------------------------------------------------------------------------------*;
    PROC REG DATA=ILLCAUSE;
      VAR ILLCAUSE ILL AGE SES;
      M6: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
      * Output influence statistics into temporary dataset for further diagnosis;
      OUTPUT OUT=DIAGNOSE PREDICTED=PRED PRESS=PRESS RSTUDENT=STDPRESS
             COOKD=COOKD H=HAT;

What the OUTPUT statement does:
- Create a new (temporary) dataset, called DIAGNOSE.
- Put PREDICTED values of the outcome into the new DIAGNOSE dataset, and call them PRED.
- OUTput influence statistics into it:

    Influence Statistic            SAS Command    Name of New Variable
    Raw PRESS Residual             PRESS          PRESS
    Standardized PRESS Residual    RSTUDENT       STDPRESS
    Cook's D Statistic             COOKD          COOKD
    HAT Statistic                  H              HAT

Note: All other variables in the dataset, including the ID variable, outcome and predictors, automatically enter the DIAGNOSE dataset.

Slide 8: Estimating and Inspecting the Influence Statistics

Inspect the influence statistics using exploratory analysis …

    *--------------------------------------------------------------*
      Identify the most influential data points
    *--------------------------------------------------------------*;
    * Get some sense of the magnitudes of the influence statistics;
    PROC PLOT DATA=DIAGNOSE;
      PLOT (STDPRESS HAT COOKD)*ID = '+';
    * Identify the extreme and influential cases, by ID;
    PROC UNIVARIATE DATA=DIAGNOSE;
      ID ID;
      VAR STDPRESS HAT COOKD;

- PROC PLOT: Plot each influence statistic versus Subject ID in order to identify which children have the largest values of each statistic.
- PROC UNIVARIATE: Obtain univariate descriptive summaries of the distribution of each influence statistic, with extreme values labeled by the case ID value (to easily identify the atypical cases).

Notice, in these displays, I've used the standardized version of the PRESS residual:
- This is because I have some sense of what its magnitude actually means (i.e., a value beyond ±2 is big!!!).
- In subsequent analyses, however, it's probably best to use the raw PRESS residual when you test for residual normality.

Slide 9: Are There Any Cases With Extreme Values of the PRESS Residual?

[Plot of Standardized PRESS Residuals vs. Subject ID. Vertical axis: Studentized Residual without Current Obs, from -3 to 3. Horizontal axis: ID, from 300 to 750.]

STDPRESS

    Moments
    N                194         Sum Weights         194
    Mean             -0.00076    Sum Observations    -0.1470
    Std Deviation    1.011123    Variance            1.02237
    Skewness         0.178864    Kurtosis            0.80517

    Location                 Variability
    Mean      -0.00076       Std Deviation          1.01112
    Median    -0.11796       Variance               1.02237
    Mode      -0.33345       Range                  5.98343
                             Interquartile Range    1.21688

    Quantile      Estimate
    100% Max       3.004771
    99%            2.842520
    95%            1.689922
    90%            1.360723
    75% Q3         0.530923
    50% Median    -0.117958
    25% Q1        -0.685952
    10%           -1.118615
    5%            -1.461300
    1%            -2.796148
    0% Min        -2.978657

    Extreme Observations
    --------Lowest--------      -------Highest-------
    Value       ID     Obs      Value      ID     Obs
    -2.97866    444     49      2.36606    726    183
    -2.79615    307      7      2.48018    702    158
    -2.56942    441     46      2.69661    621    129
    -2.49632    617    125      2.84252    423     39
    -2.23499    424     40      3.00477    553     86
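The “beyond ±2 is big” reading of the standardized PRESS residual can be applied mechanically to the extreme observations listed above. The Python sketch below (not the course's SAS) hard-codes only the ten values that appear in that listing; the cutoff name `CUTOFF` and the ranking by absolute size are choices made for this illustration.

```python
# Standardized PRESS residuals for the ten extreme cases in the listing above,
# keyed by case ID. The remaining cases are not shown on the slide.
stdpress = {
    444: -2.97866, 307: -2.79615, 441: -2.56942, 617: -2.49632, 424: -2.23499,
    726: 2.36606, 702: 2.48018, 621: 2.69661, 423: 2.84252, 553: 3.00477,
}

CUTOFF = 2.0  # "a value beyond +/-2 is big"

# Keep the cases beyond the cutoff, ranked from most to least extreme.
flagged = sorted(
    ((case_id, value) for case_id, value in stdpress.items() if abs(value) > CUTOFF),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

for case_id, value in flagged:
    print(f"ID {case_id}: STDPRESS = {value:+.5f}")
```

All ten listed cases lie beyond ±2, so a stricter working cutoff may be needed when only the most extreme cases are to be carried forward.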

Slide 10: Are There Any Cases With Extreme Values of the HAT Statistic?

[Plot of HAT Statistic vs. Subject ID. Vertical axis: Leverage, from 0.01 to 0.08. Horizontal axis: ID, from 300 to 780.]

HAT

    Moments
    N                205       Sum Weights         205
    Mean             0.0259    Sum Observations    5.3046
    Std Deviation    0.0118    Variance            0.0001
    Skewness         0.8435    Kurtosis            0.6955

    Location               Variability
    Mean      0.0259       Std Deviation          0.0118
    Median    0.0236       Variance               0.0001
    Mode      0.0112       Range                  0.0636
                           Interquartile Range    0.01870

    Quantile      Estimate
    100% Max      0.0739455
    99%           0.0537198
    95%           0.0449016
    90%           0.0414564
    75% Q3        0.0342913
    50% Median    0.0236324
    25% Q1        0.0155929
    10%           0.0132485
    5%            0.0111869
    1%            0.0104275
    0% Min        0.0103506

    Extreme Observations
    -------Lowest-------         -------Highest-------
    Value        ID     Obs      Value        ID     Obs
    0.0103506    332     32      0.0514490    636    144
    0.0103872    512     61      0.0525675    336     36
    0.0104275    327     27      0.0537198    568    101
    0.0105420    504     53      0.0667425    322     22
    0.0108084    508     57      0.0739455    700    156

Slide 11: Are There Any Cases With Extreme Values of the Cook's D Statistic?

[Plot of Cook's D Statistic vs. Subject ID. Vertical axis: Cook's D Influence Statistic, from 0.00 to 0.08. Horizontal axis: ID, from 300 to 780.]

COOKD

    Moments
    N                194       Sum Weights         194
    Mean             0.0058    Sum Observations    1.1307
    Std Deviation    0.0116    Variance            0.0001
    Skewness         3.4726    Kurtosis            13.63

    Location               Variability
    Mean      0.0058       Std Deviation          0.0116
    Median    0.0014       Variance               0.0001
    Mode      0.0001       Range                  0.0712
                           Interquartile Range    0.0046

    Quantile      Estimate
    100% Max      7.11460E-02
    99%           6.99197E-02
    95%           2.86361E-02
    90%           1.70598E-02
    75% Q3        5.09389E-03
    50% Median    1.41318E-03
    25% Q1        4.49886E-04
    10%           1.22396E-04
    5%            4.67553E-05
    1%            2.59383E-06
    0% Min        6.91849E-09

    Extreme Observations
    -------Lowest---------         ------Highest------
    Value          ID     Obs      Value        ID     Obs
    6.91849E-09    614    122      0.0427484    424     40
    2.59383E-06    426     41      0.0439993    322     22
    2.59699E-06    569    102      0.0636503    444     49
    1.67677E-05    304      4      0.0699197    553     86
    1.67807E-05    510     59      0.0711460    307      7

Slide 12: So, Which Cases Are Individually Atypical, and How?

It's useful to summarize the influence statistics, by case …

Table 1.1(c).1. Estimates of influence statistics, by case ID, from fitted Model M6 describing the relationship between children's understanding of illness causality and health status, controlling for child age and family SES.

    ID      PRESS    HAT      Cook's D    Preliminary Statistical Conclusion
    #553     3.00             0.070       Extreme-on-Y, impacts fit
    #423     2.84                         Extreme-on-Y
    #621     2.70                         Extreme-on-Y
    #702     2.48                         Extreme-on-Y
    #617    -2.49                         Extreme-on-Y
    #441    -2.57                         Extreme-on-Y
    #307    -2.80             0.071       Extreme-on-Y, impacts fit
    #444    -2.98             0.064       Extreme-on-Y, impacts fit
    #700             0.074                Extreme-on-X
    #322             0.067                Extreme-on-X

Three potentially important individual atypical cases? (#553, #307 and #444 are the three flagged as “impacts fit”.)

But, it's always better to examine the cases substantively in order to understand who they are – as follows, in Handout I.1(c).2.

Slide 13: Describing Each Atypical Case in More Detail

In Handout I.1(c).2, I begin to parse the substantive identity of atypical cases. Here's one approach …

    *-----------------------------------------------------------------------*
      Identify & label influential cases by ID#, and prune dataset
      leaving only them
    *-----------------------------------------------------------------------*;
    DATA HICASES;
      SET ILLCAUSE;
      * Identify and label the potential outliers;
      IF ID=307 OR ID=423 OR ID=441 OR ID=444 OR
         ID=553 OR ID=617 OR ID=621 OR ID=702
         THEN HIPRESS=1; ELSE HIPRESS=0;
      * Identify and label the potential high leverage cases;
      IF ID=322 OR ID=700 THEN HIHAT=1; ELSE HIHAT=0;
      * Identify and label the potential cases with overall influence;
      IF ID=307 OR ID=444 OR ID=553 THEN HICOOKSD=1; ELSE HICOOKSD=0;
      * Drop all cases except those that were identified as atypical;
      IF HIPRESS=0 AND HIHAT=0 AND HICOOKSD=0 THEN DELETE;

    *-----------------------------------------------------------------------------*
      Sort atypical cases and list their selected characteristics
    *-----------------------------------------------------------------------------*;
    * Sort and print out the details of the atypical cases;
    PROC SORT DATA=HICASES;
      BY DESCENDING ILLCAUSE HEALTH AGE SES;
    PROC PRINT DATA=HICASES;
      VAR ID ILLCAUSE HEALTH AGE SES HICOOKSD HIPRESS HIHAT;
      FORMAT HEALTH HFMT.;
    * Seek out interesting clusters of influential cases;
    PROC G3D DATA=HICASES;
      SCATTER AGE*SES=ILLCAUSE / GRID ROTATE=75;

What the code does:
- Create a new SAS dataset, called HICASES, to contain only the atypical cases.
- Classify the atypical cases by creating new categorical variables, HIPRESS, HIHAT & HICOOKSD.
- Drop all cases, except the atypical ones, from the new dataset.
- Print a list of atypical cases, accompanied by outcome and predictor values.
- Create a three-dimensional scatterplot of the atypical cases, to check if there are any natural groupings.

There are many other, equally reasonable, ways of doing the same thing … use your imagination, and come up with something better!
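For readers who do not use SAS, the labeling-and-pruning logic of the DATA step can be sketched in a few lines of Python. The three ID sets are copied from the IF statements above; the `classify` function and the `sample_ids` list (a handful of case IDs that appear in the output on the earlier slides, standing in for the full dataset, which is not reproduced here) are assumptions made for this illustration.

```python
# Case IDs copied from the IF statements in the DATA step above.
HIPRESS_IDS = {307, 423, 441, 444, 553, 617, 621, 702}   # potential outliers
HIHAT_IDS = {322, 700}                                    # potential high-leverage cases
HICOOKSD_IDS = {307, 444, 553}                            # potential overall influence


def classify(case_id):
    """Attach the three 0/1 indicator variables to a case, as the DATA step does."""
    return {
        "ID": case_id,
        "HIPRESS": int(case_id in HIPRESS_IDS),
        "HIHAT": int(case_id in HIHAT_IDS),
        "HICOOKSD": int(case_id in HICOOKSD_IDS),
    }


# Stand-in for the ID column of the full dataset (an assumption, not the real file).
sample_ids = [304, 307, 322, 423, 426, 441, 444, 510, 553, 569, 614, 617, 621, 700, 702]

# Drop every case that carries none of the three labels.
hicases = [
    row for row in map(classify, sample_ids)
    if row["HIPRESS"] or row["HIHAT"] or row["HICOOKSD"]
]

for row in hicases:
    print(row)
```

Using set membership keeps each list of IDs in one place, which may be easier to maintain than a chain of OR conditions when the lists change.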

Slide 14: Are There Substantively-Interesting Groupings of Atypical Points?

Because the Cook's D statistic only summarizes the impact of individual atypical datapoints on the fitted regression model, it's useful to ask whether neighboring atypical datapoints can be clustered into interesting sub-groups for examination in a responsible sensitivity analysis …

Computer output, not an APA-Style Table:

    Obs    ID     ILLCAUSE    HEALTH       AGE    SES    HICOOKSD    HIPRESS    HIHAT
      1    702    6.000       Healthy      135    3      0           1          0
      2    621    5.857       Healthy      112    1      0           1          0
      3    423    5.143       Asthmatic    112    2      0           1          0
      4    553    4.571       Asthmatic     80    4      1           1          0
      5    322    3.286       Diabetic     190    5      0           0          1
      6    700    3.000       Healthy       68    4      0           0          1
      7    307    2.857       Diabetic     194    4      1           1          0
      8    444    2.571       Asthmatic    181    4      1           1          0
      9    617    2.286       Healthy       83    1      0           1          0
     10    441    1.571       Asthmatic     78    2      0           1          0

At this point, I favor ignoring the values of the influence statistics – they were needed to identify the atypical data-points. Instead, I favor inspecting the characteristics of the atypical cases in order to assess whether pairs, or groups, of cases sit close together in the point cloud.

It's usually easier to make these kinds of decisions using a thoughtful graphical display of some kind …

Slide 15: Are There Substantively-Interesting Groupings Of Atypical Points?

[Figure 1.1(c). Three-dimensional scatterplot of ILLCAUSE by AGE and SES for 10 atypical cases identified with influence statistics in M6. Points are labeled by case ID (307, 322, 423, 441, 444, 553, 617, 621, 700, 702) and by health status: D = Diabetic, A = Asthmatic, H = Healthy.]

Slide 16: Conducting A Reasonable Sensitivity Analysis

It may be neither ethical nor reasonable to eliminate atypical children entirely – it's better to conduct sensible sensitivity analyses:
- Refit final model, M6, repeatedly, while temporarily omitting atypical data points either singly or in sensible sub-groups.
- Obtain parameter estimates, products of statistical inference, and goodness-of-fit statistics in each fit.
- Compare and contrast the impact of each temporary omission on the regression analysis … Handout I.1(c).3

    *------------------------------------------------------------------------------*
      Create reduced datasets with atypical data points temporarily omitted
      individually and in small groups, based on the previous analyses and graph
    *------------------------------------------------------------------------------*;
    * Datasets with healthy children temporarily omitted;
    DATA TEMP_IC1; SET ILLCAUSE; IF ID=617 THEN DELETE;
    DATA TEMP_IC2; SET ILLCAUSE; IF ID=700 THEN DELETE;
    DATA TEMP_IC3; SET ILLCAUSE; IF ID=621 OR ID=702 THEN DELETE;
    * Datasets with asthmatic children temporarily omitted;
    DATA TEMP_IC4; SET ILLCAUSE; IF ID=441 THEN DELETE;
    DATA TEMP_IC5; SET ILLCAUSE; IF ID=553 THEN DELETE;
    DATA TEMP_IC6; SET ILLCAUSE; IF ID=423 THEN DELETE;
    DATA TEMP_IC7; SET ILLCAUSE; IF ID=444 THEN DELETE;
    * Datasets with diabetic children temporarily omitted;
    DATA TEMP_IC8; SET ILLCAUSE; IF ID=307 OR ID=322 THEN DELETE;

    *------------------------------------------------------------------------------*
      Refit the final model in each of the temporarily reduced datasets
    *------------------------------------------------------------------------------*;
    * Regression analyses with healthy children temporarily omitted;
    PROC REG DATA=TEMP_IC1; M6_1: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
    PROC REG DATA=TEMP_IC2; M6_2: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
    PROC REG DATA=TEMP_IC3; M6_3: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
    * Regression analyses with asthmatic children temporarily omitted;
    PROC REG DATA=TEMP_IC4; M6_4: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
    PROC REG DATA=TEMP_IC5; M6_5: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
    PROC REG DATA=TEMP_IC6; M6_6: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
    PROC REG DATA=TEMP_IC7; M6_7: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
    * Regression analyses with diabetic children temporarily omitted;
    PROC REG DATA=TEMP_IC8; M6_8: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;

What the code does:
- Create supplemental SAS datasets (TEMP_IC1, TEMP_IC2, etc.), each with different sensible permutations of the atypical data points omitted.
- Repeatedly fit final model M6 in each of the temporary datasets.
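The omit-and-refit loop can be written once and driven by a list of omission groups, rather than repeated for each dataset. The Python sketch below illustrates the idea with a one-predictor model and ten made-up cases; it is not the M6 model or the ILLCAUSE data, and the names `cases`, `omissions` and `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Simple OLS fit; returns the estimates and fit statistics as a dict."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return {"intercept": intercept, "slope": slope, "sse": sse, "r2": 1 - sse / sst}


# Hypothetical cases: (ID, predictor, outcome). Case 105 is off the trend on Y;
# case 110 is far out on X but close to the trend.
cases = [(101, 1, 1.1), (102, 2, 1.9), (103, 3, 3.2), (104, 4, 3.8), (105, 5, 9.0),
         (106, 6, 6.1), (107, 7, 6.9), (108, 8, 8.2), (109, 9, 8.8), (110, 30, 30.5)]

# Each entry is a group of IDs to omit temporarily; () means "omit nothing".
omissions = [(), (105,), (110,), (105, 110)]

results = {}
for omitted in omissions:
    kept = [(x, y) for case_id, x, y in cases if case_id not in omitted]
    results[omitted] = fit_line([x for x, _ in kept], [y for _, y in kept])

for omitted, fit in results.items():
    label = ",".join(f"#{i}" for i in omitted) or "None"
    print(f"{label:10s} slope={fit['slope']:.3f} R2={fit['r2']:.3f} SSE={fit['sse']:.2f}")
```

Because the original list of cases is never modified, every omission is temporary by construction, in the same spirit as the TEMP_IC datasets above.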

Slide 17: How Is The Regression Fit Affected By the Omission of Atypical Datapoints?

Table 1.1(c).2. Parameter estimates, approximate p-values, and associated goodness-of-fit statistics for a sensitivity analysis of Model M6, the final model in an earlier taxonomy of fitted multiple regression models describing the relationship between children's understanding of illness causality and their health status, controlling for child age and family SES.

                                                       ---------------- Parameter Estimates ----------------
    Cases      Nature of Omitted Point(s)              Intercept   ILL     AGE       ILL × AGE   SES       R²     SSE
    Omitted
    None       ---                                     2.14***     .030    .020***   -.0065**    -.100*    .674   65.8
    #617       IC=2, Healthy, Age 7, SES=1             2.22***     -.034   .020***   -.0061**    -.108*    .679   63.7
    #700       IC=3, Healthy, Age 6, SES=4             2.14***     .020    .020***   -.0064**    -.098*    .672   65.8
    #621,702   IC=6, Healthy, Age 9/11, SES=1-3        2.09***     .086    .020***   -.0066**    -.104*    .686   61.2
    #441       IC=2, Asthmatic, Age 7, SES=2           2.15***     .120    .020***   -.0070**    -.108*    .674   63.5
    #553       IC=5, Asthmatic, Age 7, SES=4           2.16***     -.045   .020***   -.0059**    -.117*    .688   62.8
    #423       IC=5, Asthmatic, Age 9, SES=2           2.12***     -.022   .020***   -.0063**    -.091*    .686   63.1
    #444       IC=3, Asthmatic, Age 15, SES=4          2.11***     -.041   .020***   -.0059**    -.085*    .684   62.8
    #307,322   IC=3, Diabetic, Age 16, SES=4-5         2.08***     -.123   .020***   -.0053**    -.066*    .689   61.9

Reading the table, column by column:
- Intercept: Estimates of the intercept fluctuate a little, but this is not important as the intercept is the fitted value of ILLCAUSE for a child of zero AGE and SES (a mythical child).
- ILL: The main effect of HEALTH status fluctuates a lot, but is n.s., and there is an interaction with AGE present in the model.
- AGE: The statistically significant main effect of AGE is maintained in all models.
- ILL × AGE: The statistically significant interaction of ILL and AGE is maintained in all models.
- SES: The statistically significant main effect of SES is maintained across all models.
- R² and SSE: Values of the R² and SSE statistics are maintained across all models.
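One simple way to check the “maintained across all models” reading is to compute the range of each column of the sensitivity table. The Python sketch below transcribes the estimates from Table 1.1(c).2 (significance stars dropped); the tuple layout, the `COLUMNS` list and the `ranges` dictionary are conventions invented for this example.

```python
# Rows of Table 1.1(c).2: cases omitted -> (intercept, ILL, AGE, ILLxAGE, SES, R2, SSE).
fits = {
    "None":     (2.14,  .030, .020, -.0065, -.100, .674, 65.8),
    "#617":     (2.22, -.034, .020, -.0061, -.108, .679, 63.7),
    "#700":     (2.14,  .020, .020, -.0064, -.098, .672, 65.8),
    "#621,702": (2.09,  .086, .020, -.0066, -.104, .686, 61.2),
    "#441":     (2.15,  .120, .020, -.0070, -.108, .674, 63.5),
    "#553":     (2.16, -.045, .020, -.0059, -.117, .688, 62.8),
    "#423":     (2.12, -.022, .020, -.0063, -.091, .686, 63.1),
    "#444":     (2.11, -.041, .020, -.0059, -.085, .684, 62.8),
    "#307,322": (2.08, -.123, .020, -.0053, -.066, .689, 61.9),
}

COLUMNS = ["Intercept", "ILL", "AGE", "ILLxAGE", "SES", "R2", "SSE"]

# Smallest and largest value of each column across the nine fits.
ranges = {
    name: (min(row[i] for row in fits.values()), max(row[i] for row in fits.values()))
    for i, name in enumerate(COLUMNS)
}

for name, (low, high) in ranges.items():
    print(f"{name:10s} min={low:8.4f} max={high:8.4f}")
```

The ILL column is the only one whose estimates change sign across the fits, which matches the note that the main effect of health status fluctuates a lot.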

S052/I.1(c) – Slide 18
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Appendix 1: Definition of the PRESS Residual
The PRESS (or DELETED) residual is defined to better detect the location of atypical data points that are Extreme-on-Y.
[Figure: Y plotted against X, with labels for the i th case, the fitted line (all data), the fitted line (all data minus current case), the regular residual, and the PRESS residual.]
PRESS Residual
• Each person in the dataset has a PRESS residual.
• Almost identical to the regular raw residual … formed by subtracting a “predicted” value from an “observed” value.
• But the definition of “predicted value” is modified: it is the predicted value obtained after a subsidiary regression fit that includes all of the data except the particular person in question.
• The person is then put back into the dataset before PRESS is estimated for the next case.
Why make this modification?
• Offers some protection against the impact that each case has on its own residual.
• Is a less biased estimate of the case’s residual.
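The leave-one-out definition above translates directly into a loop. The following is a minimal sketch in plain Python, assuming a single predictor and made-up toy data; the function names `fit_line`, `regular_residuals` and `press_residuals` are hypothetical, not part of the course materials.

```python
def fit_line(xs, ys):
    """OLS fit of y on one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regular_residuals(xs, ys):
    """Observed minus predicted, with the prediction taken from the all-data fit."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def press_residuals(xs, ys):
    """Observed minus predicted, with the prediction taken from a fit that excludes the case."""
    residuals = []
    for i in range(len(xs)):
        # Subsidiary fit on all data except case i; xs and ys themselves are untouched,
        # so the case is automatically "put back" for the next iteration.
        intercept, slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        residuals.append(ys[i] - (intercept + slope * xs[i]))
    return residuals


# Toy data, assumed for illustration; the last case is extreme on Y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 12.0]
raw = regular_residuals(xs, ys)
press = press_residuals(xs, ys)
for case, (r, p) in enumerate(zip(raw, press), start=1):
    print(f"case {case}: regular = {r:.3f}, PRESS = {p:.3f}")
```

Because the omitted case cannot pull the fitted line toward itself, its PRESS residual is never smaller in magnitude than its regular residual.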

S052/I.1(c) – Slide 19
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Appendix 1: Definition of the HAT Statistic
The HAT statistic has been defined to better detect the location of atypical data points that are Extreme-on-X.
[Figure: Y plotted against X, with labels for the i th case, the fitted line (all data), and the HAT statistic, in standard deviation units.]
HAT Statistic
• Each case in the dataset has a value for the HAT statistic.
• Measures the “distance” of the case from the center of the “see-saw,” in the plane of all the predictors.
• Sometimes called the “Leverage” statistic.
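For a regression with one predictor, the leverage of case i has the standard closed form h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², which makes the "distance from the center of the see-saw" idea concrete. The sketch below assumes that single-predictor case and toy data; with several predictors the same quantity comes from the diagonal of the hat matrix instead. The function name `leverage` is hypothetical.

```python
def leverage(xs):
    """HAT (leverage) values for an OLS fit with an intercept and one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # Grows with the squared distance of each case from the mean of the predictor.
    return [1.0 / n + (x - mean_x) ** 2 / sxx for x in xs]


# Toy data, assumed for illustration; the last case is extreme on X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
hat = leverage(xs)
for case, (x, h) in enumerate(zip(xs, hat), start=1):
    print(f"case {case}: x = {x:5.1f}, HAT = {h:.3f}")
```

Note that the outcome values do not enter the calculation at all: leverage depends only on where a case sits among the predictor values.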

S052/I.1(c) – Slide 20
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Appendix 1: Definition of the Cook’s D Statistic
The Cook’s D statistic has been defined to summarize the overall impact of each data point on the regression fit.
[Figure: Y plotted against X, with labels for the i th case, the fitted line (all data), and the fitted line (all data minus current case).]
Cook’s D Statistic
• Each person in the dataset has a value of the Cook’s D statistic.
• It summarizes how much all the regression parameters (the intercept and the slopes) differ as a whole, when you remove that person’s data from the fit.
• Each person is returned to the dataset before estimating D for the next case.
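One common way to compute Cook’s D follows the description above literally: refit without the case, measure how far all the fitted values move, and scale by the number of parameters times the residual variance of the all-data fit. The sketch below does this in plain Python for a single predictor, using made-up toy data; the function names `fit_line` and `cooks_distance` are hypothetical.

```python
def fit_line(xs, ys):
    """OLS fit of y on one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cooks_distance(xs, ys):
    """Cook's D per case, computed by refitting with that case temporarily removed."""
    n = len(xs)
    p = 2  # number of fitted parameters: intercept and one slope
    intercept, slope = fit_line(xs, ys)
    fitted_all = [intercept + slope * x for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted_all))
    s2 = sse / (n - p)  # residual variance from the all-data fit
    distances = []
    for i in range(n):
        a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Total squared shift in the fitted values of ALL cases when case i is dropped.
        shift = sum((f - (a_i + b_i * x)) ** 2 for x, f in zip(xs, fitted_all))
        distances.append(shift / (p * s2))
    return distances


# Toy data, assumed for illustration; the last case is atypical.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 12.0]
cooks_d = cooks_distance(xs, ys)
for case, d in enumerate(cooks_d, start=1):
    print(f"case {case}: Cook's D = {d:.3f}")
```

A case earns a large D only when it combines a sizeable residual with enough leverage to move the fitted line, which is why D serves as an overall summary of influence.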

