Basics of Data Cleaning

Why Examine Your Data?
- Gain a basic understanding of the data set
- Ensure the statistical and theoretical underpinnings of a given multivariate (m.v.) technique are met
- Concerns about the data:
  - Departures from distributional assumptions (e.g., normality)
  - Outliers
  - Missing data

Testing Assumptions
- MV normality assumption: the solution is better when it holds
- Violations of MV normality:
  - Skewness (symmetry)
  - Kurtosis (peakedness)
  - Heteroscedasticity
  - Non-linearity

Negative Skew

Positive Skew

Kurtosis
- Mesokurtic
- Leptokurtic
- Platykurtic

Skewness & Kurtosis
SPSS syntax:
    FREQUENCIES VARIABLES=age
      /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT
      /ORDER=ANALYSIS.
z value = statistic / standard error
- Skewness: .354 / .205 = 1.73
- Kurtosis: -.266 / .407 = -.654
Critical values for the z score:
- .05 level: +/- 1.96
- .01 level: +/- 2.58
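The z computation above can be sketched in a few lines of Python. This is only an illustration using the skewness and kurtosis figures quoted on the slide; the function names `z_ratio` and `is_significant` are hypothetical and not part of SPSS.

```python
def z_ratio(statistic, std_error):
    """Return statistic / standard error, the z value described on the slide."""
    return statistic / std_error


def is_significant(z, critical=1.96):
    """True when |z| exceeds the two-tailed critical value (1.96 for .05, 2.58 for .01)."""
    return abs(z) > critical


# Figures quoted on the slide for the variable `age`.
skew_z = z_ratio(0.354, 0.205)
kurt_z = z_ratio(-0.266, 0.407)

print(round(skew_z, 2), is_significant(skew_z))
print(round(kurt_z, 3), is_significant(kurt_z))
```

Neither z value exceeds 1.96 in absolute value, so by this criterion neither skewness nor kurtosis departs significantly from zero at the .05 level.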

Homoscedasticity
$s_1^2 = s_2^2 = s_3^2 = s_4^2 = s_e^2$
When there are multiple groups, each group has a similar level of variance (similar standard deviation).
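One informal way to look at the equal-variance condition is to compute each group's sample variance and compare the largest to the smallest. The sketch below assumes made-up scores for four hypothetical groups; the largest-to-smallest ratio is offered as a rough screen, not as the test the slides prescribe.

```python
import statistics


def group_variances(groups):
    """Sample variance (n - 1 denominator) for each named group."""
    return {name: statistics.variance(scores) for name, scores in groups.items()}


def variance_ratio(groups):
    """Largest group variance divided by the smallest; near 1 suggests similar spread."""
    variances = list(group_variances(groups).values())
    return max(variances) / min(variances)


# Hypothetical data for illustration only.
groups = {
    "group1": [4, 5, 6, 5, 5],
    "group2": [3, 5, 7, 5, 5],
    "group3": [4, 5, 6, 5, 5],
    "group4": [3, 5, 7, 5, 5],
}

print(group_variances(groups))
print(variance_ratio(groups))
```

A ratio close to 1 is consistent with homoscedasticity; how large a ratio is tolerable depends on the analysis and group sizes.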

Linearity

Testing the Assumption of Absence of Correlated Errors
- Correlated errors mean that an unmeasured variable is affecting the analysis
- The key is to identify the unmeasured variable and include it in the analysis
- How often do we meet this assumption?

Data Cleaning
Examine:
- Individual items/scales (i.e., reliability)
- Bivariate relationships
- Multivariate relationships
Techniques to use:
- Graphs: non-normality, heteroscedasticity
- Frequencies: missing data, out-of-bounds values
- Univariate outliers (+/- 3 SD from the mean)
- Mahalanobis distance (p < .001 criterion)
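As a rough illustration of how a frequency check surfaces missing data and out-of-bounds values, the sketch below screens a single item. It assumes a hypothetical 1 to 5 rating item with `None` marking a missing response; the data and function names are invented for the example.

```python
from collections import Counter


def frequency_table(values):
    """Count how often each distinct value (including None) occurs."""
    return Counter(values)


def screen_item(values, valid_min, valid_max):
    """Return (number of missing responses, sorted list of out-of-bounds values)."""
    missing = sum(1 for v in values if v is None)
    out_of_bounds = sorted(
        v for v in values if v is not None and not (valid_min <= v <= valid_max)
    )
    return missing, out_of_bounds


# Hypothetical responses to a 1-5 item; 55 and 0 look like data entry errors.
responses = [1, 3, 5, None, 4, 55, 2, 3, None, 0]

table = frequency_table(responses)
missing, out_of_bounds = screen_item(responses, valid_min=1, valid_max=5)

print(table)
print("missing:", missing)
print("out of bounds:", out_of_bounds)
```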

Graphical Examination
- Single variable: shape of distribution
  - Histogram
  - Stem and leaf
- Relationships between two or more variables
  - Scatterplot

Histogram

Scatterplot

Frequencies

Outliers
Where do outliers come from?
- Inclusion of subjects who are not part of the population (e.g., an ESL respondent on a vocabulary test)
- Legitimate data points*
- Extreme values of random error (X = t + e)
- Error in observation
- Error in data preparation

Univariate Outliers
- Criterion: mean +/- 3 SD
- Example: age
- Out-of-range values: > 64.83 or < 4.53
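The mean +/- 3 SD criterion can be sketched as follows. The ages are hypothetical (the slide's data set is not available), and the helper names are invented for illustration.

```python
import statistics


def outlier_bounds(values, k=3.0):
    """Return (lower, upper) cutoffs at mean - k*SD and mean + k*SD (sample SD)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - k * sd, mean + k * sd


def flag_univariate_outliers(values, k=3.0):
    """Return the values that fall outside the mean +/- k*SD cutoffs."""
    low, high = outlier_bounds(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical ages; 95 is planted as an extreme value.
ages = [28, 31, 35, 29, 40, 33, 37, 30, 36, 34,
        32, 38, 29, 41, 35, 33, 31, 36, 39, 95]

print(outlier_bounds(ages))
print(flag_univariate_outliers(ages))
```

Note that the extreme case itself inflates the SD, so in small samples a single outlier can widen the cutoffs enough to hide itself.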

Univariate Outliers

Multivariate Outliers
Mahalanobis distance, SPSS syntax:
    Regression Var = case VAR1 VAR2
      /statistics collin
      /dependent = case
      /enter
      /residuals = outliers(mahal).
Critical values (a case with D > c.v. is a multivariate outlier; chi-square at p = .001):
- two variables: 13.82
- three variables: 16.27
- four variables: 18.47
- five variables: 20.52
- six variables: 22.46
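For readers without SPSS, the squared Mahalanobis distance of each case from the centroid can be sketched in plain Python. This is a minimal illustration, not the SPSS procedure itself: the data are hypothetical, the helper names are invented, and the cutoffs are the chi-square critical values at p = .001 rounded to two decimals.

```python
def mean_vector(rows):
    """Column means of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator)."""
    n = len(rows)
    means = mean_vector(rows)
    p = len(means)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - m for x, m in zip(row, means)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += dev[i] * dev[j]
    return [[c / (n - 1) for c in r] for r in cov]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (a[i][n] - known) / a[i][i]
    return x


def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of every case from the centroid."""
    means = mean_vector(rows)
    cov = covariance_matrix(rows)
    distances = []
    for row in rows:
        dev = [x - m for x, m in zip(row, means)]
        weighted = solve(cov, dev)  # equals inverse(cov) @ dev
        distances.append(sum(d * w for d, w in zip(dev, weighted)))
    return distances


# Chi-square critical values at p = .001, keyed by number of variables.
CRITICAL = {2: 13.82, 3: 16.27, 4: 18.47, 5: 20.52, 6: 22.46}

# Hypothetical data: 19 cases close to the line VAR2 = VAR1, plus one case
# (index 19) that is ordinary on each variable but breaks the joint pattern.
rows = [(i, i + 0.5 if i % 2 else i - 0.5) for i in range(1, 20)] + [(10, 20)]

d2 = mahalanobis_sq(rows)
flagged = [i for i, d in enumerate(d2) if d > CRITICAL[len(rows[0])]]
print(flagged, round(max(d2), 2))
```

The planted case would not be caught by a univariate screen on either variable alone, which is the point of checking the multivariate distance.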

Approaches to Outliers
- Leave them alone
- Delete the entire case (listwise)
- Delete only the relevant variables (pairwise)
- Trim (replace with the highest legitimate value)
- Mean substitution
- Imputation
Bottom line: you can do any of the above as long as you tell the reader what you did and the reviewers are OK with that approach. Often the choice will be driven by the results of your analyses.
Most important point (ethics): you *must* tell the reader what approach you took.
The Orr article suggests why this is important. What does it say that makes it important to report what you did? It says that different outlier detection strategies yield different outliers.
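The trimming option can be read as replacing each out-of-range score with the most extreme legitimate value in the data. The sketch below follows that reading, which is an assumption about what the slide intends; the ages are hypothetical and the cutoffs are borrowed from the age example (4.53 and 64.83).

```python
def trim_to_legitimate(values, low, high):
    """Replace values outside [low, high] with the nearest observed legitimate value."""
    legitimate = [v for v in values if low <= v <= high]
    if not legitimate:
        raise ValueError("no values fall inside the legitimate range")
    floor, ceiling = min(legitimate), max(legitimate)
    trimmed = []
    for v in values:
        if v > high:
            trimmed.append(ceiling)  # pull the high outlier down
        elif v < low:
            trimmed.append(floor)    # pull the low outlier up
        else:
            trimmed.append(v)
    return trimmed


# Hypothetical ages; 95 and 2 fall outside the cutoffs.
ages = [28, 31, 35, 29, 40, 95, 2]
print(trim_to_legitimate(ages, low=4.53, high=64.83))
```

Whichever approach is used, the number of cases changed and the rule applied should be reported alongside the results.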

Effects of Outliers r = .50 r = .32
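How much a single case can move a correlation is easy to demonstrate. The sketch below uses invented data, not the data behind the r values on the slide, and computes Pearson's r with and without one planted outlier.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a strong positive relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r_clean = pearson_r(x, y)
# Add one case that is high on x but low on y.
r_with_outlier = pearson_r(x + [11], y + [1])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

In this example the outlier lowers r, but an outlier that falls in line with the trend can just as easily inflate it.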

Effects of Outliers

Major Problems: Missing Data
- Generalizability issues
- Reduces power (sample size)
- Impacts accuracy of results
  - Accuracy = dispersion around the true score (can be under- or over-estimation)
  - Varies with the missing data technique (MDT) used

Dealing with Missing Data
- Listwise deletion
- Pairwise deletion
- Mean substitution
- Regression imputation
- Hot-deck imputation
- Multiple imputation
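Two of the simpler techniques in this list, listwise deletion and mean substitution, can be sketched in plain Python. The example assumes missing values are coded as `None` and uses an invented two-variable data set (age, income); it is an illustration rather than a recommended workflow.

```python
def listwise_delete(rows):
    """Keep only the cases with no missing values on any variable."""
    return [list(row) for row in rows if all(v is not None for v in row)]


def column_means(rows):
    """Mean of each variable, computed from its observed (non-missing) values."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return means


def mean_substitute(rows):
    """Replace each missing value with the mean of its variable."""
    means = column_means(rows)
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


# Hypothetical cases: (age, income), with None marking missing data.
rows = [(25, 50000), (31, None), (None, 42000), (40, 61000)]

print(listwise_delete(rows))   # two complete cases remain
print(mean_substitute(rows))   # all four cases remain
```

Listwise deletion halves this sample, while mean substitution keeps every case but pulls the filled-in scores toward the center of the distribution.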

Dealing with Missing Data
In order of accuracy:
1. Pairwise deletion
2. Listwise deletion
3. Regression imputation
4. Mean substitution
5. Hot-deck imputation

Dealing with Missing Data
Pros and cons by missing data technique (MDT):
- Listwise deletion
  Pros: easy to use; high accuracy
  Cons: reduces sample size
- Pairwise deletion
  Pros: highest accuracy
  Cons: problematic in MV analyses; non-positive definite correlation matrix
- Mean substitution
  Pros: saves data; preserves sample size; moderate accuracy
  Cons: attenuation of findings
- Regression imputation (no error term adjustment)
  Cons: difficult to use; can't use when all predictors are missing
- Hot-deck imputation
  Cons: lots of bias and error

Transformations
Best transformation to try, by distribution:
- Moderate deviation from normality: square root
- Substantial deviation from normality: log
- Severe deviation, especially J-shape: inverse
- Negative skew: "reflect" (mirror image), then transform
Interpretation of transformed variables?
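The transformations named above can be sketched as small Python helpers. The choice of base-10 logs and of max + 1 as the reflection constant are assumptions made for the example; the scores are hypothetical.

```python
import math


def sqrt_transform(values):
    """Square root; requires non-negative values."""
    return [math.sqrt(v) for v in values]


def log_transform(values):
    """Base-10 log; requires strictly positive values."""
    return [math.log10(v) for v in values]


def inverse_transform(values):
    """Reciprocal; requires non-zero values. Note that it reverses the score order."""
    return [1.0 / v for v in values]


def reflect(values):
    """Mirror the distribution: subtract each score from (largest score + 1)."""
    pivot = max(values) + 1
    return [pivot - v for v in values]


# Hypothetical negatively skewed scores: most cases are high, a few are low.
scores = [1, 6, 8, 9, 9, 10, 10, 10]

reflected = reflect(scores)              # now positively skewed, minimum is 1
transformed = sqrt_transform(reflected)  # then apply the chosen transformation

print(reflected)
print([round(v, 2) for v in transformed])
```

After reflection, high transformed values correspond to low original scores, which is one reason the interpretation of transformed variables needs care.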