Basics of Data Cleaning
Why Examine Your Data?
- Basic understanding of the data set
- Ensure the statistical and theoretical underpinnings of a given multivariate (m.v.) technique are met
- Concerns about the data:
  - Departures from distributional assumptions (e.g., normality)
  - Outliers
  - Missing data
Testing Assumptions
- Multivariate (MV) normality assumption
- If MV normality is violated, finding a solution is better than ignoring the violation
- Common violations of MV normality:
  - Skewness (symmetry)
  - Kurtosis (peakedness)
  - Heteroscedasticity
  - Non-linearity
Negative Skew
Positive Skew
Kurtosis
- Mesokurtic (normal peakedness)
- Leptokurtic (too peaked)
- Platykurtic (too flat)
Skewness & Kurtosis: SPSS Syntax
FREQUENCIES VARIABLES=age
  /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT
  /ORDER=ANALYSIS.
z = Statistic / Std. Error
  Skewness: z = .354/.205 = 1.73
  Kurtosis: z = -.266/.407 = -.654
Critical values for z: .05 → +/- 1.96; .01 → +/- 2.58
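The same z-test can be sketched in Python (an illustration, not from the slides; scipy's bias-corrected skewness and kurtosis, together with the usual standard-error formulas, approximate what SPSS reports):

```python
import numpy as np
from scipy import stats

def skew_kurtosis_z(x):
    """Return (skewness, z_skew, kurtosis, z_kurt), dividing each
    bias-corrected statistic by its standard error as on the slide."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    skew = stats.skew(x, bias=False)      # adjusted Fisher-Pearson G1
    kurt = stats.kurtosis(x, bias=False)  # excess kurtosis G2
    # standard errors SPSS uses for SESKEW / SEKURT
    se_skew = np.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2.0 * se_skew * np.sqrt((n * n - 1.0) / ((n - 3) * (n + 5)))
    return skew, skew / se_skew, kurt, kurt / se_kurt

rng = np.random.default_rng(0)
ages = rng.normal(35, 10, size=200)       # hypothetical age data
s, z_s, k, z_k = skew_kurtosis_z(ages)
# |z| > 1.96 flags non-normality at .05, |z| > 2.58 at .01
print(f"skew={s:.3f} (z={z_s:.2f})  kurtosis={k:.3f} (z={z_k:.2f})")
```

The `ages` sample is made up for the demo; in practice you would pass in your own variable.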
Homoscedasticity
s²1 = s²2 = s²3 = s²4 = s²e
When there are multiple groups, each group has a similar level of variance (similar standard deviation).
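One common way to check this (not shown in the slides) is Levene's test for equality of variances across groups; a minimal Python sketch with made-up group data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# three hypothetical groups with similar spread (SD = 10)
g1 = rng.normal(50, 10, 40)
g2 = rng.normal(55, 10, 40)
g3 = rng.normal(60, 10, 40)

# Levene's test; center="median" is the robust Brown-Forsythe variant
stat, p = stats.levene(g1, g2, g3, center="median")
print(f"Levene W = {stat:.2f}, p = {p:.3f}")
# p > .05: no evidence that the group variances differ
```

A significant result (p < .05) would suggest heteroscedasticity, i.e., a violation of the assumption.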
Linearity
Testing the Assumption of Absence of Correlated Errors
- Correlated errors mean there is an unmeasured variable affecting the analysis
- The key is to identify the unmeasured variable and include it in the analysis
- How often do we meet this assumption?
Data Cleaning
Examine:
- Individual items/scales (e.g., reliability)
- Bivariate relationships
- Multivariate relationships
Techniques to use:
- Graphs: non-normality, heteroscedasticity
- Frequencies: missing data, out-of-bounds values
- Univariate outliers: +/- 3 SD from the mean
- Mahalanobis distance (p < .001)
Graphical Examination
Single variable: shape of the distribution
- Histogram
- Stem-and-leaf plot
Relationships between two or more variables
- Scatterplot
Histogram
Scatterplot
Frequencies
Outliers
Where do outliers come from?
- Inclusion of subjects not part of the population (e.g., an ESL speaker's responses to a vocabulary test)
- Legitimate data points*
- Extreme values of random error (X = t + e)
- Error in observation
- Error in data preparation
Univariate Outliers
Criterion: mean +/- 3 SD
Example (age): out-of-range values are > 64.83 or < 4.53
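The mean +/- 3 SD rule is easy to apply directly; a minimal Python sketch with a hypothetical age variable (one deliberately extreme value):

```python
import numpy as np

def flag_univariate_outliers(x, criterion=3.0):
    """Flag values more than `criterion` SDs from the mean;
    returns (boolean mask, lower bound, upper bound)."""
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)                 # sample SD, as SPSS reports
    lo = x.mean() - criterion * sd
    hi = x.mean() + criterion * sd
    return (x < lo) | (x > hi), lo, hi

ages = np.array([22, 25, 31, 28, 35, 30, 27, 29, 33, 26, 24, 95.0])
mask, lo, hi = flag_univariate_outliers(ages)
print(f"bounds: {lo:.2f} to {hi:.2f}; outliers: {ages[mask]}")
```

Note that an extreme value inflates the mean and SD used to compute the bounds, so very large outliers can mask smaller ones; screening is often repeated after handling the worst cases.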
Univariate Outliers
Multivariate Outliers: Mahalanobis Distance
SPSS syntax:
REGRESSION VARIABLES=case VAR1 VAR2
  /STATISTICS COLLIN
  /DEPENDENT=case
  /METHOD=ENTER
  /RESIDUALS=OUTLIERS(MAHAL).
Critical values (chi-square at p = .001; a case with D > the critical value is a m.v. outlier):
- two variables: 13.82
- three variables: 16.27
- four variables: 18.47
- five variables: 20.52
- six variables: 22.46
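Outside SPSS, the same screening can be sketched in Python: compute each case's squared Mahalanobis distance from the centroid and compare it to the chi-square critical value at p = .001 with df = number of variables (the data below are simulated, with one planted outlier):

```python
import numpy as np
from scipy import stats

def mahalanobis_d2(X):
    """Squared Mahalanobis distance of each row from the centroid."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # d2_i = diff_i @ inv_cov @ diff_i for every case i
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# chi-square critical values at p = .001 (cf. the table above)
for df in range(2, 7):
    print(df, "variables:", round(stats.chi2.ppf(0.999, df), 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))        # 100 cases, 2 variables
X[0] = [8.0, -8.0]                   # plant one obvious m.v. outlier
d2 = mahalanobis_d2(X)
critical = stats.chi2.ppf(0.999, X.shape[1])
outliers = np.where(d2 > critical)[0]
print("outlier cases:", outliers)
```

The SPSS trick of regressing a case-ID variable on the predictors produces the same Mahalanobis distances as this direct computation.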
Approaches to Outliers
- Leave them alone
- Delete the entire case (listwise)
- Delete only the relevant variables (pairwise)
- Trim to the highest legitimate value
- Mean substitution
- Imputation
Bottom line: you can do any of the above as long as you tell the reader what you did and the reviewers accept that approach. Often the choice will be driven by the results of your analyses.
Most important point: ethics. You *must* tell the reader what approach you took. The Orr article suggests why this is important: different outlier-detection strategies yield different outliers.
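Two of these options, trimming to the highest legitimate value and mean substitution, can be sketched in a few lines of Python (the cap of 64.83 reuses the +/- 3 SD age example earlier in the slides; the data are hypothetical):

```python
import numpy as np

ages = np.array([22.0, 25.0, 31.0, 28.0, 35.0, 95.0])
cap = 64.83  # highest legitimate value from the +/- 3 SD age example

# Trim: pull the extreme score down to the highest legitimate value
trimmed = np.clip(ages, None, cap)

# Mean substitution: replace the outlier with the mean of the valid scores
mean_sub = ages.copy()
mean_sub[ages > cap] = ages[ages <= cap].mean()

print("trimmed:", trimmed)
print("mean substituted:", mean_sub)
```

Whichever option you choose, the ethical requirement above still applies: report what you did.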
Effects of Outliers
[Scatterplots comparing r = .50 vs. r = .32]
Effects of Outliers
Major Problems: Missing Data
- Generalizability issues
- Reduces power (smaller sample size)
- Impacts accuracy of results
  - Accuracy = dispersion around the true score (can be under- or over-estimation)
  - Varies with the missing-data technique (MDT) used
Dealing with Missing Data
- Listwise deletion
- Pairwise deletion
- Mean substitution
- Regression imputation
- Hot-deck imputation
- Multiple imputation
Dealing with Missing Data
In order of accuracy:
1. Pairwise deletion
2. Listwise deletion
3. Regression imputation
4. Mean substitution
5. Hot-deck imputation
Dealing with Missing Data

MDT                   | Pros                               | Cons
Listwise deletion     | Easy to use; high accuracy         | Reduces sample size
Pairwise deletion     | Highest accuracy                   | Problematic in MV analyses; non-positive-definite correlation matrix
Mean substitution     | Saves data; preserves sample size; moderate accuracy | Attenuation of findings
Regression imputation | (no error-term adjustment)         | Difficult to use; can't use when all predictors are missing
Hot-deck imputation   |                                    | Lots of bias & error
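Three of these MDTs are one-liners in pandas; a minimal sketch with a toy data set (two variables, one missing value each) showing listwise deletion, pairwise use of the data, and mean substitution:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "y": [2.0, np.nan, 6.0, 8.0, 10.0]})

# Listwise deletion: drop any case with a missing value (n shrinks)
listwise = df.dropna()

# Pairwise: each statistic uses every complete pair it can
pairwise_r = df["x"].corr(df["y"])

# Mean substitution: fill each gap with that variable's mean
mean_sub = df.fillna(df.mean())

print(len(listwise), round(pairwise_r, 3), mean_sub["x"].tolist())
```

The mean-substituted column keeps the sample size but shrinks the variance, which is exactly the "attenuation of findings" listed as a con above.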
Transformations

Distribution                             | Best transformation to try
Moderate deviation from normality        | Square root
Substantial deviation from normality     | Log
Severe deviation, esp. j-shaped          | Inverse
Negative skew                            | "Reflect" (mirror image), then transform

Interpretation of transformed variables?
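The table above maps directly onto numpy one-liners; a sketch with made-up skewed data (the +1 shifts guard against log or division at zero):

```python
import numpy as np

x_pos = np.array([1.0, 2.0, 3.0, 10.0, 50.0])  # positively skewed

sqrt_x = np.sqrt(x_pos)        # moderate deviation: square root
log_x = np.log10(x_pos + 1)    # substantial deviation: log
inv_x = 1.0 / (x_pos + 1)      # severe / j-shaped: inverse

x_neg = np.array([1.0, 8.0, 9.0, 10.0])        # negatively skewed
# Reflect: subtract each score from (max + 1), turning negative skew
# into positive skew, then apply one of the transforms above
reflected = (x_neg.max() + 1) - x_neg
sqrt_reflected = np.sqrt(reflected)
```

Reflection reverses the direction of scoring, which is part of why interpreting transformed variables takes care: a high reflected score now means a low original score.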