1
Basics of Data Cleaning
2
Why Examine Your Data?
Basic understanding of the data set
Ensure statistical and theoretical underpinnings of a given multivariate (m.v.) technique are met
Concerns about the data:
  Departures from distribution assumptions (i.e., normality)
  Outliers
  Missing data
3
Testing Assumptions
MV normality assumption: when it holds, the solution is better
Violation of MV normality:
  Skewness (symmetry)
  Kurtosis (peakedness)
  Heteroscedasticity
  Non-linearity
4
Negative Skew
5
Positive Skew
6
Kurtosis
Mesokurtic
Leptokurtic
Platykurtic
7
Skewness & Kurtosis
SPSS Syntax
  FREQUENCIES VARIABLES=age
    /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT
    /ORDER=ANALYSIS.
Z value = Statistic / Std Error
  Skewness = .354/.205 = 1.73
  Kurtosis = -.266/.407 = -.654
Critical values for z score
  .05: +/- 1.96
  .01: +/- 2.58
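The z ratios on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original deck: the function names are hypothetical, and the standard-error formulas are the commonly documented ones for sample skewness and excess kurtosis, included on the assumption that the output comes from a package that uses them.

```python
import math

def se_skewness(n):
    """Standard error of sample skewness for n cases (assumes n > 2)."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Standard error of sample excess kurtosis for n cases (assumes n > 3)."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

def z_ratio(statistic, std_error):
    """Z value = statistic divided by its standard error."""
    return statistic / std_error

def exceeds(z, critical=1.96):
    """True when |z| is beyond the two-tailed critical value."""
    return abs(z) > critical

# Statistics and standard errors taken from the slide (variable: age).
z_skew = z_ratio(0.354, 0.205)
z_kurt = z_ratio(-0.266, 0.407)

print("skewness z:", round(z_skew, 2), "beyond +/-1.96?", exceeds(z_skew))
print("kurtosis z:", round(z_kurt, 3), "beyond +/-1.96?", exceeds(z_kurt))
```

Neither ratio is beyond +/- 1.96, so by this criterion neither skewness nor kurtosis departs significantly from normality at the .05 level.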
8
Homoscedasticity
$s_1^2 = s_2^2 = s_3^2 = s_4^2 = s_e^2$
When there are multiple groups, each group has similar levels of variance (similar standard deviation)
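A quick descriptive check of this assumption is to compute each group's variance and compare the largest with the smallest. The sketch below uses made-up scores and hypothetical function names; it only describes the spread and is not a formal test of equal variances.

```python
from statistics import variance

def group_variances(groups):
    """Sample variance (n - 1 denominator) for each named group."""
    return {name: variance(values) for name, values in groups.items()}

def variance_ratio(groups):
    """Largest group variance divided by the smallest."""
    variances = list(group_variances(groups).values())
    return max(variances) / min(variances)

# Hypothetical scores for three groups (illustration only).
scores = {
    "g1": [4, 5, 6, 5, 4, 6],
    "g2": [3, 5, 7, 5, 3, 7],
    "g3": [5, 5, 6, 4, 5, 5],
}

for name, var in group_variances(scores).items():
    print(name, "variance:", round(var, 2))
print("largest / smallest variance:", round(variance_ratio(scores), 2))
```

A ratio near 1 is what homoscedasticity predicts; the further the ratio is from 1, the less similar the group variances are.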
9
Linearity
10
Testing the Assumption of Absence of Correlated Errors
Correlated errors mean there is an unmeasured variable affecting the analysis
The key is to identify the unmeasured variable and include it in the analysis
How often do we meet this assumption?
11
Data Cleaning
Examine:
  Individual items/scales (i.e., reliability)
  Bivariate relationships
  Multivariate relationships
Techniques to use:
  Graphs: non-normality, heteroscedasticity
  Frequencies: missing data, out-of-bounds values
  Univariate outliers (+/- 3 SD from mean)
  Mahalanobis distance (.001)
12
Graphical Examination
Single variable: shape of distribution
  Histogram
  Stem and leaf
Relationships between two or more variables
  Scatterplot
13
Histogram
14
Scatterplot
15
Frequencies
16
Outliers
Where do outliers come from?
  Inclusion of subjects not part of the population (e.g., ESL response to vocabulary test)
  Legitimate data points*
  Extreme values of random error (X = t + e)
  Error in observation
  Error in data preparation
17
Univariate Outliers
Criteria: Mean +/- 3 SD
Example: Age
Out of range values > or < 4.53
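The mean +/- 3 SD criterion is simple to apply in code. The following is a minimal sketch with made-up ages and a hypothetical function name, using the sample standard deviation; it is not the procedure that produced the slide's numbers.

```python
from statistics import mean, stdev

def univariate_outliers(values, k=3.0):
    """Return (lower, upper, flagged) using mean +/- k sample SDs."""
    m = mean(values)
    sd = stdev(values)
    lower, upper = m - k * sd, m + k * sd
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged

# Hypothetical ages; 200 stands in for a data-entry error.
ages = list(range(20, 50)) + [200]

lower, upper, flagged = univariate_outliers(ages)
print("bounds:", round(lower, 2), "to", round(upper, 2))
print("flagged values:", flagged)
```

Note that the outlier itself inflates both the mean and the SD, so in small samples an extreme value can hide behind the bounds it helped widen.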
18
Univariate Outliers
19
Multivariate Outliers
Mahalanobis Distance
SPSS Syntax
  REGRESSION VARIABLES=case VAR1 VAR2
    /STATISTICS=COLLIN
    /DEPENDENT=case
    /METHOD=ENTER
    /RESIDUALS=OUTLIERS(MAHAL).
Critical values (a case with D > c.v. is an m.v. outlier) are tabulated for two, three, four, five, and six variables
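For intuition, the squared Mahalanobis distance can be computed by hand in the two-variable case. The sketch below is an illustration under stated assumptions, not the SPSS procedure itself: the data and function names are hypothetical, the sample covariance matrix (n - 1 denominator) is used, and the cutoff assumes the distances are referred to a chi-square distribution with 2 degrees of freedom, for which the critical value has the closed form -2 ln(alpha).

```python
import math

def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    det = sxx * syy - sxy * sxy
    # d' S^-1 d written out for a 2 x 2 covariance matrix.
    return [(a * a * syy - 2 * a * b * sxy + b * b * sxx) / det
            for a, b in zip(dx, dy)]

def chi2_critical_df2(alpha=0.001):
    """Upper-tail chi-square critical value for df = 2 only."""
    return -2.0 * math.log(alpha)

# Hypothetical data: 40 cases near the line y = x, plus one case far off it.
xs = [float(i) for i in range(40)] + [20.0]
ys = [i + (0.5 if i % 2 == 0 else -0.5) for i in range(40)] + [-20.0]

d2 = mahalanobis_sq_2d(xs, ys)
cutoff = chi2_critical_df2(0.001)
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print("cutoff:", round(cutoff, 3))
print("flagged case indices:", flagged)
```

The last case is unremarkable on either variable alone (x = 20 is mid-range) but is far from the joint pattern, which is exactly what a multivariate outlier check is meant to catch. With more than two variables, a matrix library and a chi-square table for the matching degrees of freedom would be needed.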
20
Approaches to Outliers
Leave them alone
Delete entire case (listwise)
Delete only relevant variables (pairwise)
Trim: recode to the highest legitimate value
Mean substitution
Imputation
Bottom line: You can do any of the above as long as you tell the reader what you did and the reviewers are OK with that approach. Often the choice will be driven by the results of your analyses.
Most important point: Ethics. You *must* tell the reader what approach you took.
The Orr article suggests why this is important. What does it say that makes reporting your approach important? It says that different outlier detection strategies yield different outliers.
21
Effects of Outliers r = .50 r = .32
22
Effects of Outliers
23
Major Problems: Missing Data
Generalizability issues
Reduces power (sample size)
Impacts accuracy of results
  Accuracy = dispersion around true score (can be under- or over-estimation)
  Varies with the missing data technique (MDT) used
24
Dealing with Missing Data
Listwise deletion
Pairwise deletion
Mean substitution
Regression imputation
Hot-deck imputation
Multiple imputation
25
Dealing with Missing Data
In order of accuracy:
  1. Pairwise deletion
  2. Listwise deletion
  3. Regression imputation
  4. Mean substitution
  5. Hot-deck imputation
26
Dealing with Missing Data
MDT: Pros / Cons
Listwise deletion
  Pros: Easy to use; high accuracy
  Cons: Reduces sample size
Pairwise deletion
  Pros: Highest accuracy
  Cons: Problematic in MV analyses; non-positive definite correlation matrix
Mean substitution
  Pros: Saves data; preserves sample size
  Cons: Moderate accuracy; attenuation of findings
Regression imputation (no error term adjustment)
  Cons: Difficult to use; can't use when all predictors are missing
Hot-deck imputation
  Cons: Lots of bias & error
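The trade-off between listwise deletion and mean substitution can be seen on a toy data set. This is a minimal sketch with made-up data and hypothetical function names, with missing values coded as None; it illustrates the sample-size and attenuation points only, not the accuracy rankings.

```python
from statistics import mean, variance

def listwise_delete(rows):
    """Keep only cases with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r)]

def mean_substitute(rows):
    """Replace each missing value with the mean of its variable."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [tuple(col_means[j] if v is None else v for j, v in enumerate(r))
            for r in rows]

# Hypothetical two-variable data set; None marks a missing value.
rows = [(1, 10), (2, None), (3, 30), (None, 40), (5, 50), (6, None)]

complete = listwise_delete(rows)
filled = mean_substitute(rows)

observed_y = [r[1] for r in rows if r[1] is not None]
filled_y = [r[1] for r in filled]

print("cases kept by listwise deletion:", len(complete), "of", len(rows))
print("variance of observed values:", round(variance(observed_y), 2))
print("variance after mean substitution:", round(variance(filled_y), 2))
```

Listwise deletion shrinks the sample, while mean substitution keeps every case but adds values with zero deviation from the mean, which lowers the variance and is one source of the attenuation noted above.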
27
Transformations
Distribution: Best transformation to try
  Moderate deviation from normality: Square root
  Substantial deviation from normality: Log
  Severe deviation, esp. J-shape: Inverse
  Negative skew: "Reflect" (mirror image), then transform
Interpretation of transformed variables?
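The transformations listed here are one-liners in code. The sketch below is illustrative only: the function names and data are hypothetical, all scores are assumed to be positive (a constant would have to be added first otherwise), and reflection is implemented in the common way of subtracting each score from the largest score plus one.

```python
import math

def sqrt_transform(values):
    """Square root: for moderate positive skew (assumes values >= 0)."""
    return [math.sqrt(v) for v in values]

def log_transform(values):
    """Base-10 log: for substantial positive skew (assumes values > 0)."""
    return [math.log10(v) for v in values]

def inverse_transform(values):
    """Inverse: for severe positive skew (assumes values != 0)."""
    return [1.0 / v for v in values]

def reflect(values):
    """Mirror-image the scores so negative skew becomes positive skew."""
    k = max(values) + 1
    return [k - v for v in values]

# Hypothetical negatively skewed scores (most cases near the top).
scores = [1, 85, 92, 97, 99, 100]

reflected = reflect(scores)
transformed = sqrt_transform(reflected)
print("reflected:", reflected)
print("reflected, then square root:", [round(t, 2) for t in transformed])
```

On interpretation: after reflection the direction of the scale is reversed, so a high transformed score corresponds to a low original score, and the sign of any relationship with other variables flips accordingly.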