Basics of Data Cleaning

Name: Basics of Data Cleaning
Uploaded: 2017-08-21T21:15:25+00:00
Duration: PTM7S26
Channel: Eleanor Sparks
Description: Basics of Data Cleaning

Why Examine Your Data?
- Gain a basic understanding of the data set
- Ensure the statistical and theoretical underpinnings of a given multivariate (m.v.) technique are met
- Concerns about the data:
  - Departures from distributional assumptions (e.g., normality)
  - Outliers
  - Missing data

Testing Assumptions
- MV normality assumption: when it is met, the solution is better
- Violations of MV normality:
  - Skewness (symmetry)
  - Kurtosis (peakedness)
  - Heteroscedasticity
  - Non-linearity

Negative Skew

Positive Skew

Kurtosis
- Mesokurtic
- Leptokurtic
- Platykurtic

Skewness & Kurtosis
SPSS Syntax:
  FREQUENCIES VARIABLES=age
    /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT
    /ORDER=ANALYSIS.
z value = Statistic / Std. Error
- Skewness: .354 / .205 = 1.73
- Kurtosis: -.266 / .407 = -.654
Critical values for z score:
- .05: +/- 1.96
- .01: +/- 2.58
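The z-value arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch rather than part of the original SPSS workflow; the function names `z_ratio` and `departs_from_normality` are made up for illustration, and the inputs are the statistics and standard errors quoted on the slide.

```python
def z_ratio(statistic, std_error):
    """Divide a skewness or kurtosis statistic by its standard error."""
    return statistic / std_error


def departs_from_normality(z, critical=1.96):
    """True when |z| exceeds the two-tailed critical value.

    Use 1.96 for the .05 level and 2.58 for the .01 level.
    """
    return abs(z) > critical


# Values taken from the slide (variable: age).
skew_z = z_ratio(0.354, 0.205)
kurt_z = z_ratio(-0.266, 0.407)

print(round(skew_z, 2), departs_from_normality(skew_z))
print(round(kurt_z, 3), departs_from_normality(kurt_z))
```

Neither ratio exceeds 1.96 in absolute value, so by this criterion neither skewness nor kurtosis departs significantly from normality at the .05 level.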

Homoscedasticity
$\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 = \sigma_e^2$
When there are multiple groups, each group has a similar level of variance (similar standard deviation).
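One informal way to look at this is to compute each group's variance and compare the largest with the smallest. The sketch below assumes hypothetical groups `g1` to `g3` with made-up scores; the helper names are illustrative, and a formal test (such as Levene's test) is not shown.

```python
from statistics import variance


def group_variances(groups):
    """Sample variance (n - 1 denominator) for each named group."""
    return {name: variance(values) for name, values in groups.items()}


def variance_ratio(groups):
    """Largest group variance divided by the smallest."""
    variances = group_variances(groups).values()
    return max(variances) / min(variances)


# Hypothetical scores for three groups (illustrative data only).
groups = {
    "g1": [4, 5, 6, 5, 4],
    "g2": [3, 5, 7, 5, 5],
    "g3": [1, 5, 9, 5, 5],
}

print(group_variances(groups))
print(variance_ratio(groups))
```

A ratio close to 1 is consistent with homoscedasticity; the further it is from 1, the more the groups differ in spread.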

Linearity

Testing the Assumption of Absence of Correlated Errors
- Correlated errors mean there is an unmeasured variable affecting the analysis
- The key is to identify the unmeasured variable and include it in the analysis
- How often do we meet this assumption?

Data Cleaning
Examine:
- Individual items/scales (e.g., reliability)
- Bivariate relationships
- Multivariate relationships
Techniques to use:
- Graphs: non-normality, heteroscedasticity
- Frequencies: missing data, out-of-bounds values
- Univariate outliers (+/- 3 SD from the mean)
- Mahalanobis distance (.001 criterion)
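The frequencies check for missing data and out-of-bounds values can be sketched in Python as follows. The item is assumed to be a hypothetical 1 to 5 rating with missing responses coded as `None`; the function name and data are illustrative only.

```python
from collections import Counter


def frequency_check(values, low, high):
    """Tabulate values, count missing entries, and list out-of-bounds values."""
    counts = Counter(values)
    missing = counts.pop(None, 0)  # None marks a missing response
    out_of_bounds = sorted(v for v in counts if not low <= v <= high)
    return counts, missing, out_of_bounds


# Hypothetical responses to a 1-5 item; 0 and 7 are data-entry errors.
responses = [1, 3, 5, None, 2, 7, 4, None, 0, 3]
counts, missing, out_of_bounds = frequency_check(responses, low=1, high=5)

print(dict(counts))
print("missing:", missing)
print("out of bounds:", out_of_bounds)
```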

Graphical Examination
- Single variable: shape of distribution
  - Histogram
  - Stem and leaf
- Relationships between two or more variables
  - Scatterplot
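A stem-and-leaf display is simple enough to build by hand, which makes it a convenient text-only way to see the shape of a distribution. The sketch below assumes non-negative integer data (hypothetical ages), with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers.

    Stem = value // 10, leaf = value % 10. Stems with no data are kept
    so that gaps in the distribution stay visible.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    stems = range(min(rows), max(rows) + 1)
    return [
        f"{stem:>2} | {''.join(str(leaf) for leaf in rows.get(stem, []))}"
        for stem in stems
    ]


# Hypothetical ages (illustrative data only).
ages = [23, 25, 31, 34, 34, 38, 41, 47, 52, 68]
lines = stem_and_leaf(ages)
print("\n".join(lines))
```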

Histogram

Scatterplot

Frequencies

Outliers
Where do outliers come from?
- Inclusion of subjects not part of the population (e.g., ESL response to vocabulary test)
- Legitimate data points*
- Extreme values of random error (X = t + e)
- Error in observation
- Error in data preparation

Univariate Outliers
- Criterion: Mean +/- 3 SD
- Example: Age
- Out of range values > or < 4.53
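The mean +/- 3 SD criterion can be applied directly. The following is a minimal sketch with made-up ages, not the data behind the slide; `univariate_outliers` is an illustrative name.

```python
from statistics import mean, stdev


def univariate_outliers(values, n_sd=3.0):
    """Return (low, high, flagged) using the mean +/- n_sd * SD criterion."""
    m, s = mean(values), stdev(values)  # stdev uses the n - 1 denominator
    low, high = m - n_sd * s, m + n_sd * s
    flagged = [v for v in values if v < low or v > high]
    return low, high, flagged


# Hypothetical ages; 95 is an implausible entry (illustrative data only).
ages = [28, 31, 29, 33, 30, 27, 32, 30, 29, 31,
        28, 34, 30, 29, 31, 30, 32, 28, 30, 95]
low, high, flagged = univariate_outliers(ages)

print(round(low, 2), round(high, 2), flagged)
```

Note that the outlier itself inflates the SD used to set the cutoffs, so in small samples an extreme value may fail to cross the 3 SD line.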

Univariate Outliers

Multivariate Outliers
Mahalanobis Distance
SPSS Syntax:
  Regression Var = case VAR1 VAR2
    /statistics collin
    /dependent = case
    /enter
    /residuals = outliers(mahal).
Critical values (a case with D > c.v. is an m.v. outlier) are tabulated by number of variables: two, three, four, five, and six variables.
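To make the idea concrete, here is a minimal Python sketch for the two-variable case only. It assumes the usual convention that the squared Mahalanobis distance is compared with a chi-square critical value whose degrees of freedom equal the number of variables; with two variables that critical value has the closed form -2 ln(alpha). The function names and data are hypothetical, and the sketch is not a reproduction of the SPSS procedure.

```python
import math
from statistics import mean


def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d' S^-1 d written out for a 2 x 2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def chi2_critical_2df(alpha=0.001):
    """Chi-square critical value for 2 degrees of freedom: -2 * ln(alpha)."""
    return -2.0 * math.log(alpha)


# Hypothetical data: 29 cases near the line y = x, plus one case far off it.
xs = list(range(1, 30)) + [15]
ys = [x + (1 if x % 2 else -1) for x in range(1, 30)] + [45]

d2 = mahalanobis_sq_2d(xs, ys)
critical = chi2_critical_2df(0.001)
flagged = [i for i, d in enumerate(d2) if d > critical]

print(round(critical, 3), flagged)
```

The last case is unremarkable on x alone (15 is the mean of x) but is flagged because its combination of x and y is far from the pattern of the other cases.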

Approaches to Outliers
- Leave them alone
- Delete entire case (listwise)
- Delete only relevant variables (pairwise)
- Trim: highest legitimate value
- Mean substitution
- Imputation
Bottom line: You can do any of the above as long as you tell the reader what you did and the reviewers are OK with that approach. Often the choice will be driven by the results of your analyses.
Most important point: Ethics. You *must* tell the reader what approach you took.
The Orr article suggests why this is important. What does it say that makes reporting your approach important? It says that different outlier detection strategies yield different outliers.

Effects of Outliers r = .50 r = .32
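How much a single case can move a correlation is easy to demonstrate. The sketch below uses made-up data and does not reproduce the r values on the slide; `pearson_r` is a plain implementation of the Pearson correlation.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a strong positive relationship.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 1, 4, 3, 6, 5, 8, 7]

r_clean = pearson_r(xs, ys)
# Add one outlying case (x = 9, y = 0) and recompute.
r_with_outlier = pearson_r(xs + [9], ys + [0])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

In this example one added case drops r from about .90 to about .33.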

Effects of Outliers

Major Problems: Missing Data
- Generalizability issues
- Reduces power (sample size)
- Impacts accuracy of results
  - Accuracy = dispersion around true score (can be under- or over-estimation)
  - Varies with the missing data technique (MDT) used

Dealing with Missing Data
- Listwise deletion
- Pairwise deletion
- Mean substitution
- Regression imputation
- Hot-deck imputation
- Multiple imputation
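The first three techniques differ only in which cases or values are kept, which a small Python sketch can show. The variables `x`, `y`, `z` and the data are hypothetical, missing values are coded as `None`, and the imputation-based techniques are not shown.

```python
from statistics import mean

# Hypothetical cases; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": None, "z": 4.0},
    {"x": None, "y": 5.0, "z": 6.0},
    {"x": 4.0, "y": 6.0, "z": None},
    {"x": 5.0, "y": 7.0, "z": 9.0},
]


def listwise(rows):
    """Keep only cases with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, a, b):
    """Keep (a, b) pairs from cases that have both variables present."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]


def mean_substitute(rows):
    """Replace each missing value with the mean of the observed values."""
    keys = list(rows[0].keys())
    means = {k: mean(r[k] for r in rows if r[k] is not None) for k in keys}
    return [{k: means[k] if r[k] is None else r[k] for k in keys} for r in rows]


print(len(listwise(rows)))            # cases left after listwise deletion
print(len(pairwise(rows, "x", "y")))  # cases available for the x-y pair
print(mean_substitute(rows)[2]["x"])  # imputed x for the third case
```

Listwise deletion keeps 2 of the 5 cases, pairwise deletion keeps 3 cases for the x-y pair, and mean substitution keeps all 5 at the cost of shrinking the variance of the filled-in variables.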

In Order of Accuracy:
1. Pairwise deletion
2. Listwise deletion
3. Regression imputation
4. Mean substitution
5. Hot-deck imputation

MDT: Pros and Cons
- Listwise deletion
  - Pros: Easy to use; high accuracy
  - Cons: Reduces sample size
- Pairwise deletion
  - Pros: Highest accuracy
  - Cons: Problematic in MV analyses; non-positive definite correlation matrix
- Mean substitution
  - Pros: Saves data; preserves sample size; moderate accuracy
  - Cons: Attenuation of findings
- Regression imputation (no error term adjustment)
  - Cons: Difficult to use; can't use when all predictors are missing
- Hot-deck imputation
  - Cons: Lots of bias & error

Transformations
Distribution: Best transformation to try
- Moderate deviation from normality: Square root
- Substantial deviation from normality: Log
- Severe deviation, esp. J-shape: Inverse
- Negative skew: "Reflect" (mirror image), then transform
Interpretation of transformed variables?
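The transformations in this table can be sketched in Python as below. Two choices are assumptions not stated on the slide: the log is taken to base 10, and reflection subtracts each score from the maximum plus 1 so that the smallest reflected score is 1. The square root, log, and inverse functions assume positive scores.

```python
import math


def sqrt_transform(values):
    """Square root; for moderate positive skew (values must be >= 0)."""
    return [math.sqrt(v) for v in values]


def log_transform(values):
    """Base-10 log; for substantial positive skew (values must be > 0)."""
    return [math.log10(v) for v in values]


def inverse_transform(values):
    """Reciprocal; for severe positive skew (values must be nonzero)."""
    return [1.0 / v for v in values]


def reflect(values):
    """Mirror the distribution: subtract each score from (max + 1).

    Turns negative skew into positive skew so that one of the
    transformations above can then be applied.
    """
    k = max(values) + 1
    return [k - v for v in values]


# Hypothetical negatively skewed scores (illustrative data only).
scores = [1, 6, 8, 9, 9, 10, 10, 10]
reflected = reflect(scores)
print(reflected)
print([round(v, 3) for v in sqrt_transform(reflected)])
```

Reflection reverses the direction of the scale, so high values on the transformed variable correspond to low values on the original; this bears on the slide's closing question about interpreting transformed variables.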
