© Willett, Harvard University Graduate School of Education, 11/13/2015, S052/I.1(c)

Presentation transcript:

S052/I.1(c) – Slide 1
S052/§I.1(c): Applied Data Analysis
Roadmap of the Course – What Is Today’s Topic?

More details can be found in the “Course Objectives and Content” handout on the course webpage.

Multiple Regression Analysis (MRA)
• Do your residuals meet the required assumptions?
  – Test for residual normality.
  – Use influence statistics to detect atypical datapoints.
• If your residuals are not independent, replace OLS by GLS regression analysis.
  – Use individual growth modeling.
  – Specify a multi-level model.
• If your sole predictor is continuous, MRA is identical to correlational analysis.
• If your sole predictor is dichotomous, MRA is identical to a t-test.
• If your several predictors are categorical, MRA is identical to ANOVA.
• If time is a predictor, you need discrete-time survival analysis …
• If your outcome is categorical, you need to use …
  – Binomial logistic regression analysis (dichotomous outcome).
  – Multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with,
  – Create taxonomies of fitted models and compare them.
  – Form composites of the indicators of any common construct.
  – Conduct a Principal Components Analysis.
  – Use Cluster Analysis.
• If your outcome vs. predictor relationship is non-linear,
  – Use non-linear regression analysis.
  – Transform the outcome or predictor.
• How do you deal with missing data?

Today’s Topic Area

S052/I.1(c) – Slide 2
S052/§I.1(c): Applied Data Analysis
Where Does Today’s Topic Appear in the Printed Syllabus?

Please check inter-connections among the Roadmap, the Daily Topic Area, the Printed Syllabus, and the content of today’s class when you pre-read the day’s materials.

Syllabus Section I.1(c), on Detecting Influential Data-Points, and Assessing Their Impact On Model Fit, includes:
• Story So Far – Anything We Still Need to Consider? (Slides 3-4).
• Data-Points Can Be Atypical In Two Important Ways (Slide 5).
• Three Useful “Influence” Statistics (Slides 6-8).
• Estimating And Inspecting The Influence Statistics (Slides 9-12).
• Looking for Interesting Groupings Among the Atypical Data-Points (Slides 13-15).
• Conducting A Reasonable Sensitivity Analysis (Slides 16-17).
• Appendix 1: Technical Definitions of the PRESS, HAT & COOK’S D Statistics (Slides 18-20).

S052/I.1(c) – Slide 3
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
The Story So Far – What Have We Still To Consider?

Is there any reason we might not trust the parameter estimates, statistical inference and goodness-of-fit statistics obtained in this “final” model?

Two Issues Have Gone Unexamined:
• Atypical data points may be present in the point-cloud and driving the findings. Need to check this. Make sure all is well.
• Need to check that the usual regression assumptions are met!

Why Wait Until The Final Model To Check These Issues Out?
• Probably should have checked them earlier, but we must certainly check them here!
• If we find anything strange, we can always go back and refit the earlier models too!

S052/I.1(c) – Slide 4
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Could There Be A Problem With Atypical Data Points in the ILLCAUSE Analysis?

Think back to our initial exploratory analyses … “Houston, we have a problem … ?”

And things may be worse than we think …
• This is a simple example with only a few predictors: atypical data-points show up clearly …
• With many predictors in an analysis, it’s not so easy to spot atypical data-points: detection usually depends on how you look at the data.
• And, not all atypical data-points are created equal: the impact of each point on the fit depends on where it sits in the point-cloud.

Three Important Questions:
1. How do we locate atypical data-points (and, what do we mean by “atypical”)?
2. How do we evaluate, and compare, the impact of each atypical data-point on the fitted regression model?
3. Does it matter if the atypical data-points occur in groups or families?

S052/I.1(c) – Slide 5
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Data-Points Can Be “Atypical” In Two Important Ways

“Atypicality”

“Outliers” are “Extreme-on-Y”
How Do Outliers Affect The Fit?
• May not affect parameter estimates much:
  – May impact the estimated intercept, a little.
  – May leave the estimated slope unchanged.
• Will usually inflate the SSError, causing:
  – Big reduction in goodness-of-fit statistics (R²),
  – Big impact on inferential statistics: bigger standard errors, smaller t-statistics, bigger p-values, less power for the analysis (i.e., harder to reject the null hypothesis).

“High-Leverage” data-points are “Extreme-on-X”
How Do High-Leverage Points Affect The Fit?
• May affect parameter estimates a lot:
  – Particularly the estimated slope.
  – May lead to big changes in the estimated intercept, because of the “see-saw” effect.
• May cause unpredictable fluctuations in SSError, with contingent impact on:
  – Goodness-of-fit (R²),
  – Statistical inference: standard errors, t-statistics, p-values, hypothesis testing.
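
The contrast between the two kinds of atypical point can be checked numerically. The sketch below is only an illustration, not part of the course materials: it is written in Python rather than SAS, the helper `fit_line` is a hypothetical name, and the toy data are made up so that they lie exactly on the line y = 1 + 2x. One extreme-on-Y point and one extreme-on-X point are then added in turn.

```python
def fit_line(xs, ys):
    """OLS fit of y = b0 + b1*x. Returns (b0, b1, sse)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b0, b1, sse


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
base_x = [1.0, 2.0, 3.0, 4.0, 5.0]
base_y = [3.0, 5.0, 7.0, 9.0, 11.0]
base_fit = fit_line(base_x, base_y)

# Outlier: extreme on Y, but sitting at the mean of X.
out_fit = fit_line(base_x + [3.0], base_y + [17.0])

# High-leverage point: extreme on X, and off the line.
lev_fit = fit_line(base_x + [15.0], base_y + [11.0])

for label, (b0, b1, sse) in [("base", base_fit),
                             ("outlier", out_fit),
                             ("leverage", lev_fit)]:
    print(f"{label:9s} intercept={b0:.3f} slope={b1:.3f} SSE={sse:.3f}")
```

With these toy values, adding the outlier leaves the slope at 2, moves the intercept from 1 to about 2.67, and raises SSError from 0 to about 83.3. Adding the high-leverage point pulls the slope down to about 0.46.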

S052/I.1(c) – Slide 6
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Three Useful “Influence” Statistics, Each Responsible for a Different Job!

How do you detect atypical data-points in a large multi-dimensional point-cloud?

You need two statistics to identify an atypical data-point’s location!
• To detect one that is “Extreme-on-Y” use the PRESS Residual.
• To detect one that is “Extreme-on-X” use the HAT Statistic.

You need one statistic to identify an atypical data-point’s impact!
• To assess the overall impact of the point on the regression fit use Cook’s D Statistic.

Decision-Rule: To locate atypical cases seek those with “large” values (relatively speaking) of the Influence Statistics – see Handout I.1(c).1.
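
The technical definitions belong to Appendix 1 (Slides 18-20). As a hedged reminder, the standard textbook forms of the three statistics are sketched below, assuming an OLS fit with n cases and p fitted parameters (including the intercept). The symbols are notation introduced here, not taken from the slides: X is the design matrix, x_i the predictor row for case i, e_i the ordinary residual, s^2 the residual variance, and s_(i)^2 the residual variance with case i deleted.

```latex
\begin{align*}
h_{ii} &= x_i^{\top}\,(X^{\top}X)^{-1}\,x_i
  && \text{(HAT statistic: extremity on } X\text{)} \\
e_i &= y_i - \hat{y}_i, \qquad s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-p}
  && \text{(ordinary residual and residual variance)} \\
\mathrm{PRESS}_i &= y_i - \hat{y}_{i(i)} = \frac{e_i}{1-h_{ii}}
  && \text{(raw PRESS residual: case } i \text{ predicted without itself)} \\
\mathrm{RSTUDENT}_i &= \frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}
  && \text{(standardized PRESS residual)} \\
D_i &= \frac{e_i^{2}}{p\,s^{2}}\cdot\frac{h_{ii}}{(1-h_{ii})^{2}}
  && \text{(Cook's D: overall impact on the fit)}
\end{align*}
```

Cook's D combines the other two ingredients: it grows with the size of the residual and with the leverage, so a case can have a large D by being extreme on Y, extreme on X, or both.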

S052/I.1(c) – Slide 7
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Estimating and Inspecting the Influence Statistics is Easy!

In Handout I.1(c).1, I estimate, display and summarize influence statistics for Final Model M6:

    * Estimate and output influence statistics for the final regression model;
    PROC REG DATA=ILLCAUSE;
      VAR ILLCAUSE ILL AGE SES;
      M6: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
      * Output influence statistics into temporary dataset for further diagnosis;
      OUTPUT OUT=DIAGNOSE PREDICTED=PRED PRESS=PRESS
             RSTUDENT=STDPRESS COOKD=COOKD H=HAT;

• Create a new (temporary) dataset, called DIAGNOSE.
• Put PREDICTED values of the outcome into the new DIAGNOSE dataset, & call them PRED.
• OUTput influence statistics into it:

Influence Statistic | SAS Command | Name of New Variable
PRESS Residual, Raw | PRESS | PRESS
PRESS Residual, Standardized | RSTUDENT | STDPRESS
HAT Statistic | H | HAT
Cook’s D Statistic | COOKD | COOKD

Note: All other variables in the dataset, including the ID variable, outcome and predictors, automatically enter the DIAGNOSE dataset.

S052/I.1(c) – Slide 8
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Estimating and Inspecting the Influence Statistics

Inspect the influence statistics using exploratory analysis …

    * Identify the most influential data points;
    * Get some sense of the magnitudes of the influence statistics;
    PROC PLOT DATA=DIAGNOSE;
      PLOT (STDPRESS HAT COOKD)*ID = '+';
    * Identify the extreme and influential cases, by ID;
    PROC UNIVARIATE DATA=DIAGNOSE;
      ID ID;
      VAR STDPRESS HAT COOKD;

• PROC PLOT: Plot each influence statistic versus Subject ID in order to identify which children have the largest values of each statistic.
• PROC UNIVARIATE: Obtain univariate descriptive summaries of the distribution of each influence statistic, with extreme values labeled by the case ID value (to easily identify the atypical cases).

Notice, in these displays, I’ve used the standardized version of the PRESS residual: This is because I have some sense of what its magnitude actually means (i.e., a value beyond ±2 is big!!!). However, in subsequent analyses, it’s probably best to use the raw PRESS residual when you test for residual normality.
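
To make the decision rule concrete, here is a minimal sketch of how the three influence statistics could be computed and screened outside SAS. It is an illustration under stated assumptions, not the course's method: it uses Python, a single-predictor regression (so p = 2), hypothetical function names `fit_line` and `influence_stats`, and made-up data in which case 4 is far off the line and case 7 is far out on X. The ±2 cutoff for the standardized PRESS residual is the rule of thumb mentioned above.

```python
import math


def fit_line(xs, ys):
    """OLS fit of y = b0 + b1*x. Returns (b0, b1)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1


def influence_stats(xs, ys):
    """Per-case HAT, raw PRESS, standardized PRESS and Cook's D."""
    n, p = len(xs), 2  # p counts the intercept and the slope
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b0, b1 = fit_line(xs, ys)
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in resid)
    mse = sse / (n - p)
    rows = []
    for x, e in zip(xs, resid):
        hat = 1.0 / n + (x - xbar) ** 2 / sxx        # leverage
        press = e / (1.0 - hat)                      # raw PRESS residual
        # Residual variance with this case deleted.
        s2_deleted = (sse - e * e / (1.0 - hat)) / (n - p - 1)
        stdpress = e / math.sqrt(s2_deleted * (1.0 - hat))
        cookd = e * e * hat / (p * mse * (1.0 - hat) ** 2)
        rows.append({"hat": hat, "press": press,
                     "stdpress": stdpress, "cookd": cookd})
    return rows


# Hypothetical data: roughly y = 1 + 2x, with one odd Y and one far-out X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
ys = [3.1, 4.9, 7.2, 8.8, 18.0, 13.1, 14.9, 41.2]

stats = influence_stats(xs, ys)
flagged = [i for i, r in enumerate(stats) if abs(r["stdpress"]) > 2]
most_leverage = max(range(len(xs)), key=lambda i: stats[i]["hat"])
print("Extreme-on-Y (|STDPRESS| > 2):", flagged)
print("Largest HAT value at case:", most_leverage)
```

In this toy dataset only case 4 exceeds the ±2 cutoff, and case 7 has the largest HAT value. The raw PRESS residual computed by the shortcut formula matches what you would get by actually refitting the line without each case in turn.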

S052/I.1(c) – Slide 9
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Are There Any Cases With Extreme Values of the PRESS Residual?

[Figure: Plot of Standardized PRESS Residuals vs. Subject ID. Vertical axis: Studentized Residual without Current Obs (about -3 to 3); horizontal axis: ID.]

[PROC UNIVARIATE output for STDPRESS (N = 194): Moments, Location and Variability, Quantiles, and Extreme Observations (lowest and highest values, with ID and Obs); numeric values omitted.]

S052/I.1(c) – Slide 10
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Are There Any Cases With Extreme Values of the HAT Statistic?

[Figure: Plot of HAT Statistic vs. Subject ID. Vertical axis: Leverage (up to about 0.08); horizontal axis: ID.]

[PROC UNIVARIATE output for HAT (N = 205): Moments, Location and Variability, Quantiles, and Extreme Observations (lowest and highest values, with ID and Obs); numeric values omitted.]

S052/I.1(c) – Slide 11
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Are There Any Cases With Extreme Values of the Cook’s D Statistic?

[Figure: Plot of Cook’s D Statistic vs. Subject ID. Vertical axis: Cook’s D Influence Statistic (up to about 0.08); horizontal axis: ID.]

[PROC UNIVARIATE output for COOKD (N = 194): Moments, Location and Variability, Quantiles, and Extreme Observations (lowest and highest values, with ID and Obs); numeric values omitted.]

S052/I.1(c) – Slide 12
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
So, Which Cases Are Individually Atypical, and How?

It’s useful to summarize the influence statistics, by case …

Table 1.1(c).1. Estimates of influence statistics, by case ID, from fitted Model M6 describing the relationship between children’s understanding of illness causality and health status, controlling for child age and family SES.

ID | PRESS | HAT | Cook’s D | Preliminary Statistical Conclusion
# | | | | Extreme-on-Y, impacts fit
# | | | | Extreme-on-Y
# | | | | Extreme-on-Y
# | | | | Extreme-on-Y
# | | | | Extreme-on-Y
# | | | | Extreme-on-Y
# | | | | Extreme-on-Y, impacts fit
# | | | | Extreme-on-Y, impacts fit
# | | | | Extreme-on-X
# | | | | Extreme-on-X

Three potentially important individual atypical cases?

But, it’s always better to examine the cases substantively in order to understand who they are – as follows, in Handout I.1(c).2.

S052/I.1(c) – Slide 13
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Describing Each Atypical Case in More Detail

In Handout I.1(c).2, I begin to parse the substantive identity of atypical cases -- here’s one approach …

    * Identify and label influential cases by ID#, and prune dataset leaving only them;
    DATA HICASES;
      SET ILLCAUSE;
      * Identify and label the potential outliers;
      IF ID=307 OR ID=423 OR ID=441 OR ID=444 OR
         ID=553 OR ID=617 OR ID=621 OR ID=702
        THEN HIPRESS=1; ELSE HIPRESS=0;
      * Identify and label the potential high leverage cases;
      IF ID=322 OR ID=700 THEN HIHAT=1; ELSE HIHAT=0;
      * Identify and label the potential cases with overall influence;
      IF ID=307 OR ID=444 OR ID=553 THEN HICOOKSD=1; ELSE HICOOKSD=0;
      * Drop all cases except those that were identified as atypical;
      IF HIPRESS=0 AND HIHAT=0 AND HICOOKSD=0 THEN DELETE;

    * Sort atypical cases and list their selected characteristics;
    * Sort and print out the details of the atypical cases;
    PROC SORT DATA=HICASES;
      BY DESCENDING ILLCAUSE HEALTH AGE SES;
    PROC PRINT DATA=HICASES;
      VAR ID ILLCAUSE HEALTH AGE SES HICOOKSD HIPRESS HIHAT;
      FORMAT HEALTH HFMT.;
    * Seek out interesting clusters of influential cases;
    PROC G3D DATA=HICASES;
      SCATTER AGE*SES=ILLCAUSE / GRID ROTATE=75;

• Create a new SAS dataset, called HICASES, to contain only the atypical cases.
• Classify the atypical cases by creating new categorical variables, HIPRESS, HIHAT & HICOOKSD.
• Drop all cases, except the atypical ones, from the new dataset.
• Print a list of atypical cases, accompanied by outcome and predictor values.
• Create a three-dimensional scatterplot of the atypical cases, to check if there are any natural groupings.

There are many other, equally reasonable, ways of doing the same thing … use your imagination, and come up with something better!

S052/I.1(c) – Slide 14
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Are There Substantively-Interesting Groupings of Atypical Points?

Because the Cook’s D statistic only summarizes the impact of individual atypical datapoints on the fitted regression model, it’s useful to ask whether neighboring atypical datapoints can be clustered into interesting sub-groups for examination in a responsible sensitivity analysis …

Computer output, not an APA-Style Table:

Obs | ID | ILLCAUSE | HEALTH | AGE | SES | HICOOKSD | HIPRESS | HIHAT

[PROC PRINT listing of the 10 atypical cases; numeric values omitted. HEALTH values, in listing order: Healthy, Healthy, Asthmatic, Asthmatic, Diabetic, Healthy, Diabetic, Asthmatic, Healthy, Asthmatic.]

At this point, I favor ignoring the values of the influence statistics – they were needed to identify the atypical data-points. Instead, I favor inspecting the characteristics of the atypical cases in order to assess whether pairs, or groups, of cases sit close together in the point cloud.

It’s usually easier to make these kinds of decisions using a thoughtful graphical display of some kind …

S052/I.1(c) – Slide 15
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Are There Substantively-Interesting Groupings Of Atypical Points?

[Figure 1.1(c). Three-dimensional scatterplot of ILLCAUSE by AGE and SES for 10 atypical cases identified with influence statistics in M6. Plotting symbols: D = Diabetic, A = Asthmatic, H = Healthy.]

© Willett, Harvard University Graduate School of Education, 11/13/2015 S052/I.1(c) – Slide 16
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Conducting A Reasonable Sensitivity Analysis

It may be neither ethical nor reasonable to eliminate atypical children entirely; it is better to conduct a sensible sensitivity analysis:
• Refit the final model, M6, repeatedly, while temporarily omitting atypical data points either singly or in sensible sub-groups.
• Obtain parameter estimates, products of statistical inference, and goodness-of-fit statistics in each fit.
• Compare and contrast the impact of each temporary omission on the regression analysis …

Handout I.1(c).3

Create supplemental SAS datasets (TEMP_IC1, TEMP_IC2, etc.), each with a different sensible subset of the atypical data points omitted:

    * Create reduced datasets with atypical data points temporarily omitted,
      individually and in small groups, based on the previous analyses and graph;

    * Datasets with healthy children temporarily omitted;
    DATA TEMP_IC1; SET ILLCAUSE; IF ID=617 THEN DELETE; RUN;
    DATA TEMP_IC2; SET ILLCAUSE; IF ID=700 THEN DELETE; RUN;
    DATA TEMP_IC3; SET ILLCAUSE; IF ID=621 OR ID=702 THEN DELETE; RUN;

    * Datasets with asthmatic children temporarily omitted;
    DATA TEMP_IC4; SET ILLCAUSE; IF ID=441 THEN DELETE; RUN;
    DATA TEMP_IC5; SET ILLCAUSE; IF ID=553 THEN DELETE; RUN;
    DATA TEMP_IC6; SET ILLCAUSE; IF ID=423 THEN DELETE; RUN;
    DATA TEMP_IC7; SET ILLCAUSE; IF ID=444 THEN DELETE; RUN;

    * Datasets with diabetic children temporarily omitted;
    DATA TEMP_IC8; SET ILLCAUSE; IF ID=307 OR ID=322 THEN DELETE; RUN;

Repeatedly fit final model M6 in each of the temporary datasets:

    * Refit the final model in each of the temporarily reduced datasets;

    * Regression analyses with healthy children temporarily omitted;
    PROC REG DATA=TEMP_IC1; M6_1: MODEL ILLCAUSE = ILL AGE ILLxAGE SES; RUN;
    PROC REG DATA=TEMP_IC2; M6_2: MODEL ILLCAUSE = ILL AGE ILLxAGE SES; RUN;
    PROC REG DATA=TEMP_IC3; M6_3: MODEL ILLCAUSE = ILL AGE ILLxAGE SES; RUN;

    * Regression analyses with asthmatic children temporarily omitted;
    PROC REG DATA=TEMP_IC4; M6_4: MODEL ILLCAUSE = ILL AGE ILLxAGE SES; RUN;
    PROC REG DATA=TEMP_IC5; M6_5: MODEL ILLCAUSE = ILL AGE ILLxAGE SES; RUN;
    PROC REG DATA=TEMP_IC6; M6_6: MODEL ILLCAUSE = ILL AGE ILLxAGE SES; RUN;
    PROC REG DATA=TEMP_IC7; M6_7: MODEL ILLCAUSE = ILL AGE ILLxAGE SES; RUN;

    * Regression analyses with diabetic children temporarily omitted;
    PROC REG DATA=TEMP_IC8; M6_8: MODEL ILLCAUSE = ILL AGE ILLxAGE SES; RUN;
    QUIT;
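The same refits can be sketched more compactly with a small macro that filters cases at fit time rather than creating a separate dataset for each omission. This is only an illustration under stated assumptions: the macro name REFIT_M6 and its parameters LABEL and OMIT are hypothetical and not part of the handout, and the sketch assumes the ILLCAUSE dataset already contains the ID variable and the ILLxAGE product term used in the handout code.

```sas
/* Hypothetical helper macro (not in the original handout):            */
/*   LABEL = model label to show in the output                         */
/*   OMIT  = blank-separated list of ID values to omit for this fit    */
%MACRO REFIT_M6(LABEL, OMIT);
  PROC REG DATA=ILLCAUSE;
    /* The WHERE statement drops the listed cases for this fit only;   */
    /* the ILLCAUSE dataset itself is left unchanged.                  */
    WHERE ID NOT IN (&OMIT);
    &LABEL: MODEL ILLCAUSE = ILL AGE ILLxAGE SES;
  RUN;
  QUIT;
%MEND REFIT_M6;

/* Omissions taken from the handout: one healthy child, the two        */
/* healthy children together, and the two diabetic children together.  */
%REFIT_M6(M6_1, 617);
%REFIT_M6(M6_3, 621 702);
%REFIT_M6(M6_8, 307 322);
```

Because nothing is deleted permanently, each omitted child is automatically back in the data for the next refit, which matches the "temporary omission" idea of the sensitivity analysis.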

© Willett, Harvard University Graduate School of Education, 11/13/2015 S052/I.1(c) – Slide 17
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
How Is The Regression Fit Affected By the Omission of Atypical Datapoints?

Table 1.1(c).2. Parameter estimates, approximate p-values, and associated goodness-of-fit statistics for a sensitivity analysis of Model M6, the final model in an earlier taxonomy of fitted multiple regression models describing the relationship between children’s understanding of illness causality and their health status, controlling for child age and family SES.

Cases Omitted | Nature of Omitted Point(s) | Parameter Estimates (Intercept, ILL, AGE, ILL × AGE, SES), R², SSE
None | | *** *** ** *
#617 | IC=2, Healthy, Age 7, SES=1 | 2.22 *** *** ** *
#700 | IC=3, Healthy, Age 6, SES=4 | 2.14 *** *** ** *
#621, 702 | IC=6, Healthy, Age 9/11, SES= | *** *** ** *
#441 | IC=2, Asthmatic, Age 7, SES=2 | 2.15 *** *** ** *
#553 | IC=5, Asthmatic, Age 7, SES=4 | 2.16 *** *** ** *
#423 | IC=5, Asthmatic, Age 9, SES=2 | 2.12 *** *** ** *
#444 | IC=3, Asthmatic, Age 15, SES=4 | 2.11 *** *** ** *
#307, 322 | IC=3, Diabetic, Age 16, SES= | *** *** ** *

Estimates of intercept fluctuate a little, but this is not important as intercept is fitted value of ILLCAUSE for a child of zero AGE and SES (a mythical child).
Main effect of HEALTH status fluctuates a lot, but is n.s., and there is an interaction with AGE present in the model.
Stat. sig. main effect of AGE is maintained in all models.
Stat. sig. interaction of ILL & AGE is maintained in all models.
Stat. sig. main effect of SES is maintained across all models.
Values of R² & SSE statistics maintained across all models.

© Willett, Harvard University Graduate School of Education, 11/13/2015 S052/I.1(c) – Slide 18
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Appendix 1: Definition of the PRESS Residual

PRESS (or DELETED) residual is defined to better detect the location of atypical data points that are Extreme-on-Y.

[Figure: plot of Y against X showing the fitted line for all data, the fitted line for all data minus the current case, and the regular residual and PRESS residual for the i-th case.]

PRESS Residual
Each person in the dataset has a PRESS residual. Almost identical to regular raw residual … formed by subtracting a “predicted” value from an “observed” value. But, the definition of “predicted value” is modified: It is the predicted value obtained after a subsidiary regression fit that includes all of the data except the particular person in question!! The person is then put back into the dataset before PRESS is estimated for the next case.

Why make this modification?
Offers some protection against the impact that each case has on its own residual.
Is a less biased estimate of the case’s residual.
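The verbal definition of the PRESS residual can be written out as a short sketch in standard least-squares notation. The symbols are assumptions introduced here, not taken from the slides: $Y_i$ is the observed outcome for case $i$, $\hat{Y}_i$ is its fitted value from the regression on all cases, $\hat{Y}_{i(i)}$ is its predicted value from the subsidiary fit that omits case $i$, and $h_{ii}$ is the case's HAT (leverage) value.

```latex
% Regular raw residual: observed minus fitted, using all cases
e_i = Y_i - \hat{Y}_i

% PRESS (deleted) residual: observed minus the prediction
% from the fit that excludes case i
e_{(i)} = Y_i - \hat{Y}_{i(i)}

% For ordinary least squares the two are linked by the leverage h_{ii},
% so no actual refitting is needed to obtain the PRESS residual
e_{(i)} = \frac{e_i}{1 - h_{ii}}
```

Since $0 \le h_{ii} < 1$ for a case that can be deleted, the PRESS residual is never smaller in magnitude than the regular residual, and the gap grows for high-leverage cases that pull the fitted line toward themselves.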

© Willett, Harvard University Graduate School of Education, 11/13/2015 S052/I.1(c) – Slide 19
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Appendix 1: Definition of the HAT Statistic

HAT statistic has been defined to better detect the location of atypical data points that are Extreme-on-X.

[Figure: plot of Y against X showing the fitted line for all data and, for the i-th case, the HAT statistic, in standard deviation units.]

HAT Statistic
• Each case in the dataset has a value for the HAT statistic.
• Measures the “distance” of the case from the center of the “see-saw,” in the plane of all the predictors.
• Sometimes called the “Leverage” statistic.
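The “distance from the center of the see-saw” idea can be made concrete with the usual least-squares formulas. This is a sketch in notation assumed here rather than taken from the slides: $X$ is the $n \times p$ matrix of predictor values including a column of ones for the intercept, $\mathbf{x}_i^{\top}$ is its $i$-th row, and $\bar{x}$ is the sample mean of a single predictor.

```latex
% HAT matrix: maps observed outcomes to fitted values, \hat{Y} = H Y
H = X (X^{\top} X)^{-1} X^{\top}

% HAT (leverage) statistic for case i: the i-th diagonal element of H
h_{ii} = \mathbf{x}_i^{\top} (X^{\top} X)^{-1} \mathbf{x}_i

% Special case of one predictor plus an intercept: leverage grows with
% the squared distance of x_i from the predictor mean
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}

% Useful bounds and total when the model includes an intercept
\frac{1}{n} \le h_{ii} \le 1, \qquad \sum_{i=1}^{n} h_{ii} = p
```

The last line implies that the average HAT value is $p/n$, which is why cases with values well above that average are the ones flagged as Extreme-on-X.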

© Willett, Harvard University Graduate School of Education, 11/13/2015 S052/I.1(c) – Slide 20
S052/I.1(c): Detecting Influential Data-Points, and Assessing Their Impact
Appendix 1: Definition of the Cook’s D Statistic

The Cook’s D Statistic has been defined to summarize the overall impact of each data point on the regression fit.

[Figure: plot of Y against X showing the fitted line for all data, the fitted line for all data minus the current case, and the i-th case.]

Cook’s D Statistic
Each person in the dataset has a value of Cook’s D statistic. It summarizes how much all the regression parameters (the intercept and the slopes) differ as a whole, when you remove that person’s data from the fit. Each person is returned to the dataset before estimating D for the next case.
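As a sketch of how “differ as a whole” is usually quantified, Cook’s D can be written in standard least-squares notation. The symbols are assumptions introduced here, not taken from the slides: $\hat{\boldsymbol{\beta}}$ is the vector of estimated intercept and slopes from all cases, $\hat{\boldsymbol{\beta}}_{(i)}$ is the same vector with case $i$ omitted, $X$ is the $n \times p$ predictor matrix including the intercept column, $s^2$ is the mean squared error of the full fit, $e_i$ is the regular raw residual of case $i$, and $h_{ii}$ is its HAT (leverage) value.

```latex
% Cook's D: scaled squared distance between the two sets of estimates
D_i = \frac{(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)})^{\top}
            X^{\top} X \,
            (\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)})}
           {p \, s^2}

% Equivalent form that needs only the full fit: a large D requires
% a sizeable residual (Extreme-on-Y) and/or high leverage (Extreme-on-X)
D_i = \frac{e_i^2}{p \, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}
```

The second form shows why Cook’s D works as an overall summary: it combines the two separate signals, residual size and leverage, into a single number per case.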