Screening the Data: Tedious but essential!

Presentation transcript:

Screening the Data Tedious but essential!

Missing Data: Missing Not at Random (MNAR), Missing at Random (MAR), Missing Completely at Random (MCAR)

Missing Not at Random (MNAR) Some cases are missing on Y. Missingness is related to the value of Y. Faculty salaries: those with high salaries may be reluctant to reveal them. Estimates of mean Y will be biased if we use just the available data.

Missing at Random (MAR) Missingness on Y is not related to the value of Y, or is related only through other variables on which we have data. Faculty salary is related to rank: higher rank = higher salary. If missingness is random within each rank, within-rank estimates will be unbiased. Overall mean = weighted sum of within-rank estimates.
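The weighted-sum idea on the MAR slide can be sketched in a few lines. This is a minimal Python illustration, not part of the original SAS program; the salary figures, rank sizes, and the function name `weighted_overall_mean` are all hypothetical.

```python
from statistics import mean

# Hypothetical salaries (in $1000s) by rank; None marks a missing value.
salaries = {
    "assistant": [60, 62, None, 64],
    "associate": [80, None, 84],
    "full": [110, None, None, 120, 130],
}

def weighted_overall_mean(groups):
    """Weight each within-rank mean by the rank's full size (missing cases included)."""
    total_n = sum(len(values) for values in groups.values())
    overall = 0.0
    for values in groups.values():
        observed = [v for v in values if v is not None]
        overall += mean(observed) * len(values) / total_n
    return overall

# Naive estimate: pool whatever values happen to be available.
observed_all = [v for values in salaries.values() for v in values if v is not None]
naive_mean = mean(observed_all)

# MAR-style estimate: within-rank means, weighted by rank size.
mar_mean = weighted_overall_mean(salaries)
```

In this made-up data the full professors are missing most often, so the naive pooled mean understates the overall mean relative to the rank-weighted estimate.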

Missing Completely at Random (MCAR) There is no variable, observed or not, that is related to missingness of Y. Ideal, not likely ever absolutely true.

Finding Patterns of Missingness There is specialized software, but you do not have it. Can use SAS, or SPSS with a home license code. Create a missingness dummy variable (0 = not missing, 1 = missing). Relate missingness to other variables.
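As a rough illustration of the dummy-variable approach, the following Python sketch codes missingness as 0/1 and correlates it with another variable. The records and the helper names `missing_flag` and `pearson_r` are assumptions made for this example; the slides do the equivalent step in SAS with proc corr.

```python
import math

# Hypothetical records: (id, y, x); y is None when it is missing.
records = [(1, 10, 2), (2, None, 5), (3, 12, 1), (4, None, 6), (5, 9, 2), (6, 11, 3)]

def missing_flag(value):
    """Missingness dummy: 0 = not missing, 1 = missing."""
    return 1 if value is None else 0

def pearson_r(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / math.sqrt(ss_a * ss_b)

flags = [missing_flag(y) for _, y, _ in records]
x_values = [x for _, _, x in records]

# A correlation far from zero suggests missingness on y depends on x.
r_missing_x = pearson_r(flags, x_values)
```

A sizeable correlation between the dummy and an observed variable is evidence against MCAR, though it cannot distinguish MAR from MNAR.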

Dealing with MCAR Data Delete Cases: Will create no bias, but will lower power and precision. Mean Substitution: For each missing value, substitute the group mean on that variable. No bias for means, but will reduce standard deviations.
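The claim that mean substitution preserves the mean but shrinks the standard deviation can be checked with a small Python sketch. The scores are invented, and a single group is assumed for simplicity.

```python
from statistics import mean, stdev

# Hypothetical scores from one group; None marks a missing value.
scores = [4, 8, None, 6, None, 10, 2]

observed = [s for s in scores if s is not None]
fill_value = mean(observed)

# Mean substitution: every missing score becomes the group mean.
imputed = [fill_value if s is None else s for s in scores]

mean_before, mean_after = mean(observed), mean(imputed)
sd_before, sd_after = stdev(observed), stdev(imputed)
```

The imputed cases add nothing to the sum of squared deviations but do add to the case count, which is why the standard deviation drops.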

Dealing with MCAR Data Regression: For each missing score, develop a multiple regression to predict score from other variables. Impute that predicted score. Regression towards mean will reduce variability.

Dealing with MAR Data Deletion of Variables: If another variable can serve as a proxy. Multiple Imputation: Requires specialized software; may eliminate bias. Involves resampling techniques to generate several sets of predictions of missing scores. Analyze each set and then average the results across sets.

Dealing with MNAR Data Sophisticated methods may reduce, but not eliminate, bias. Pairwise Correlation Matrix – use as input to multivariate procedures. Different correlations will be based on different subsets of the data. Can produce very strange results, not recommended.

Missing Item Data Within Unidimensional Scale Assume each item measures the same construct. For each subject, compute the mean of the items that do have data. Set to missing the scale scores for subjects who have answered fewer than a threshold number of items.
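A minimal Python sketch of this scoring rule follows. The function name `scale_score` and the parameter `min_answered` are hypothetical, and the threshold itself is a judgment call the slides leave open.

```python
from statistics import mean

def scale_score(items, min_answered):
    """Mean of the answered items, or None if too few items were answered.

    items: item responses for one subject, with None for unanswered items.
    min_answered: smallest number of answered items needed for a score.
    """
    answered = [v for v in items if v is not None]
    if len(answered) < min_answered:
        return None
    return mean(answered)

# One unanswered item out of four: still scored.
score_ok = scale_score([3, 4, None, 5], min_answered=3)

# Two unanswered items: the scale score is set to missing.
score_missing = scale_score([3, None, None, 5], min_answered=3)
```

Using the mean rather than the sum keeps scores on the same metric regardless of how many items a subject answered.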

Identifying Outliers Univariate: Box-and-whiskers plots. Multivariate: Compute Mahalanobis Distance or Leverage. Investigate cases with high values. Use outlier dummy variable to compare outliers with inliers. Regression Diagnostics: Leverage: Cases with unusual values on the predictor variables.

Outliers Standardized Residuals: Cases whose actual Y is far from predicted Y. Cook’s D: Cases with values that make them have great influence on the regression solution.

Dealing with Outliers Investigate: May be bad data. May be able to correct the data, may not. May represent cases not properly considered part of the population of interest. Out-of-Range Values: Even if not outliers, these are bad data that need correction.

Dealing with Outliers Set to Missing: If all else fails. Delete the Case: For example, if convinced the respondent was not even reading the questions. “I frequently visit planets outside of our solar system.” “I make all of my own clothes.” Delete the Variable: Last resort when it has many cases with missing data.

Dealing with Outliers Transform the Variable: If outliers are valid but contributing to skewness. Change the Score: For example, reduce a very high score to a value slightly higher than the highest remaining score. See Howell’s discussion of “Winsorizing.”
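The "change the score" option can be sketched as below. This Python function follows the slide's wording (pull a very high score in to just above the highest remaining score); it is a simplification and not necessarily the exact Winsorizing procedure Howell describes. The function name, the `n_extreme` and `step` parameters, and the data are all hypothetical.

```python
def pull_in_high_scores(scores, n_extreme=1, step=1):
    """Replace the n_extreme highest scores with (highest remaining score + step).

    Assumes len(scores) > n_extreme.
    """
    ordered = sorted(scores)
    ceiling = ordered[-(n_extreme + 1)] + step
    return [min(s, ceiling) for s in scores]

# The score of 40 is far above the rest; it becomes 7 (one above the next highest, 6).
adjusted = pull_in_high_scores([3, 5, 4, 6, 40])
```

The adjusted case keeps its rank as the highest score, which is the point of changing the score rather than deleting it.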

Assumptions of the Analysis Check Outliers First: Dealing with outliers may resolve the problems below. Normality: Look at plots and measures of skewness and kurtosis. Ignore tests of significance, like Kolmogorov-Smirnov. May need to use a different analysis. Homogeneity of Variance: Does the variance differ considerably across groups? May need to transform or use a different analysis.

Assumptions of the Analysis Homoscedasticity: Carefully inspect the residuals. May need to transform data or use a different analysis. Homogeneity of Variance/Covariance Matrices (across groups): Box’s M. Sphericity: For univariate-approach related samples ANOVA. Check with Mauchly’s Test. Correct the df or use a multivariate approach instead.

Assumptions of the Analysis Homogeneity of Regression: In ANCOVA, we assume the relationship between Y and the predictors is constant across groups. Test the Groups x Predictor(s) interactions. Linear Relationships: Look at plots. If necessary, transform variables or use curvilinear techniques.

Multicollinearity One predictor is nearly perfectly predictable from the other predictors. Makes the regression coefficients unstable across random samples from the same population. Complicates the interpretation of unique effects.

Detecting Multicollinearity For each predictor, compute the R² between it and the other predictors. If very high (.9 or more), there is a problem. SAS will compute tolerance = (1 – that R²). If very low, there is a problem. If R² = 1, the correlation matrix is singular and cannot be inverted, so the analysis crashes. Example: predictors = Verbal SAT, Math SAT, Total SAT.

Variance Inflation Factor VIF = 1/tolerance. If high, there is a problem. How high? Some say 10, some say 5, a few say 2.5. If R² = .9, tolerance = .1, VIF = 10.
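The arithmetic linking R², tolerance, and VIF can be written out as a short Python sketch. The function names are hypothetical; only the formulas tolerance = 1 – R² and VIF = 1/tolerance come from the slides.

```python
def tolerance(r_squared):
    """Tolerance of a predictor, given its R-squared with the other predictors."""
    return 1.0 - r_squared

def vif(r_squared):
    """Variance inflation factor: the reciprocal of tolerance."""
    tol = tolerance(r_squared)
    if tol == 0:
        # R-squared of exactly 1 means the predictor is perfectly collinear.
        raise ZeroDivisionError("R-squared = 1: VIF is undefined")
    return 1.0 / tol

# The three rule-of-thumb cutoffs (10, 5, 2.5) correspond to
# R-squared values of .9, .8, and .6 respectively.
cutoffs = {r2: vif(r2) for r2 in (0.9, 0.8, 0.6)}
```

Seen this way, the choice among the cutoffs is really a choice about how much shared variance among predictors is tolerable.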

Dealing with Multicollinearity Drop a Predictor – may resolve the problem. Combine Predictors – into a composite variable. Principal Components Analysis – conduct the analysis on the resulting weighted linear combinations of the variables. Can then transform the results back to the original variables.

SAS 1 Look at the command lines in the SAS program. Always give every case a unique ID number, so you can locate it later. Label variables if their SAS name is not informative. input ID 1-3 @5 (Q1-Q138) (1.); label Q1='Sex' Q3 = 'Age';

SAS 2 Recode values that represent missing data. On several variables, such as “number of biological brothers,” response 5 was “do not know.” if Q15 = 5 then Q15 = . ; if Q16 = 5 then q16 = . ;

SAS 3 & 4 Transform variable to reduce positive skewness age_sr = sqrt(Q3); age_log = log10(Q3); age_inv = -1/(Q3); Dichotomize variable – transformation of last resort. if q3 = 1 then age_di = 1; else if q3 > 1 then age_di = 2;
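To see why these three transformations are tried in this order, the Python sketch below applies the same square root, log10, and negative reciprocal to a made-up positively skewed age variable and compares skewness. The ages are hypothetical years, not the coded Q3 values, and `skewness` here is the simple moment-based coefficient, which will differ slightly from the sample-adjusted statistic that SAS proc means reports.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical ages with a long right tail.
ages = [18, 18, 19, 19, 19, 20, 20, 21, 22, 25, 31, 44]

skew_raw = skewness(ages)
skew_sr = skewness([math.sqrt(a) for a in ages])    # mirrors age_sr
skew_log = skewness([math.log10(a) for a in ages])  # mirrors age_log
skew_inv = skewness([-1 / a for a in ages])         # mirrors age_inv
```

Each step is a stronger correction than the last, so the usual practice is to stop at the mildest transformation that brings skewness into an acceptable range.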

SAS 5 & 6 Create composite variable SIBS = Q15 + Q16; Transform to reduce positive skewness sibs_sr = sqrt(sibs); sibs_log = log10(sibs); sibs_in = -1/sibs;

SAS 7 Create mental variable and associated missingness variable. MENTAL = Q62 + Q65 + Q67; MentalMiss = 0; if Mental = . then MentalMiss = 1;

SAS 8 Transform to reduce negative skewness Mental2 = Mental*Mental; Mental3 = Mental**3; Ment_exp = EXP(Mental); R_Ment = 13 - Mental; R_Ment_sr = sqrt(R_Ment); R_Ment_log = log10(R_Ment);

SAS 9 Dichotomize Mental if 0 LE Mental LE 9 then Ment_di=1; else if Mental > 9 then Ment_di=2; Be careful: SAS treats a missing value as lower than any number, so a test such as Mental LE 9 without the lower bound would also be true for cases with missing data.

SAS 10 Check for missing data and out-of-range values. proc means min max n nmiss; var q1-q10 q50-q70; run;

SAS 11 Check for skewness & kurtosis proc means min max n nmiss skewness kurtosis; var Q3 age_sr -- Mental Mental2 -- R_Ment_log; run;

SAS 12 Check distributions of variables with few values proc freq; tables q3 age_di sibs mental ment_di; run;

SAS 13 Locate cases with bad data data duh; set delmel; if q9 > 3; proc print; var q9; id id; run; Case 159 has out-of-range on item Q9.

SAS 14 Check correlates of missingness. proc corr nosimple data=delmel; var MentalMiss; with Q1 Q3 Q5 Q6 sibs; run; MentalMiss negatively correlated with sibs. Duh, some subjects have missing data on number of brothers or number of sisters. Instead of Mental = Q62+Q65+Q67, use Mental = Mean(of Q62 Q65 Q67);

Multidimensional Outliers Investigate observations with leverage greater than 2p/n, “where n is the number of observations used to fit the model, and p is the number of parameters in the model.” Here: 4 predictors (Q1, Q3, Q6, mental) + intercept = 5 parameters; 193 observations; 2*5/193 = .052.
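The 2p/n cutoff can be reproduced, and the idea of leverage illustrated, with a short Python sketch. The one-predictor leverage formula used here is the standard simple-regression special case, not the four-predictor model from the slides; the function names and the age data are hypothetical.

```python
def leverage_cutoff(n_obs, n_params):
    """Rule-of-thumb cutoff 2p/n for flagging high-leverage cases."""
    return 2 * n_params / n_obs

def simple_regression_leverage(x):
    """Leverage of each case in a one-predictor regression with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    return [1 / n + (v - x_bar) ** 2 / sxx for v in x]

# The cutoff from the slide: 5 parameters, 193 observations.
slide_cutoff = leverage_cutoff(193, 5)

# Hypothetical ages: one student is much older than the rest.
ages = [18, 19, 19, 20, 21, 22, 45]
hats = simple_regression_leverage(ages)
cutoff = leverage_cutoff(len(ages), 2)  # slope + intercept = 2 parameters
flagged = [age for age, h in zip(ages, hats) if h > cutoff]
```

Leverage depends only on the predictor values, which is why the SAS step on the following slide can use the ID number as a stand-in dependent variable.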

SAS 15 Identify multivariate outliers proc reg data=delmel; model id = Q1 Q3 Q6 mental; output out=hat H=Leverage; run; data outliers; set hat; if leverage > .052;

SAS 15 Identify multivariate outliers proc print; var id Q1 Q3 Q6 mental leverage; run; proc means mean; var Q1 Q3 Q6 mental; run; As a group, the outliers are older than the overall sample. All three students aged 25 or older are included among the outliers.

Survey Scoundrels These sloths do not even read the questions, they just answer randomly to get whatever incentive is available for completing the survey. My daughter’s shock upon discovering this. Monitor how long it takes respondents to complete the survey.

Items to Help Detect Scoundrels Repeat the same item and compare responses. “I frequently visit with aliens from other planets.” “I make all of my own clothes.”
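One way these checks might be combined is sketched below in Python. Everything here is an assumption for illustration: a 1-to-5 agreement scale, the function name `looks_careless`, and the cutoffs `max_gap` and `agree_at`. The slides do not specify a scoring rule.

```python
def looks_careless(first, repeat, bogus_items, max_gap=1, agree_at=4):
    """Flag a respondent as possibly answering without reading.

    first, repeat: responses to the same item asked twice (1-5 scale assumed).
    bogus_items: responses to items almost nobody should endorse,
                 such as visiting with aliens or making all of one's clothes.
    """
    inconsistent = abs(first - repeat) > max_gap
    endorsed_bogus = any(response >= agree_at for response in bogus_items)
    return inconsistent or endorsed_bogus

careful = looks_careless(2, 2, [1, 1])        # consistent, rejects bogus items
inconsistent = looks_careless(1, 5, [1, 1])   # repeated item answered very differently
alien_visitor = looks_careless(3, 3, [5, 1])  # agrees with a bogus item
```

A flag is a reason to investigate the case, not proof of careless responding, in keeping with the advice to investigate outliers before deleting them.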