PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008.

Presentation transcript:

On Suspecting Fishiness
  Looking for outliers, gaps, and dips
    – e.g., tests of clairvoyance
  When gaps or dips are hypothesized
    – e.g., is dyslexia a distinct entity
  Cliffs
    – e.g., differences between rating of ingroup and outgroup
  Peaks
    – e.g., the blackout and baby boom
  The occurrence of impossible scores

Visualize your data!
  “make friends with your data” –Rosenthal
  “don’t become lovers with your data” –Me
  Statistics condense data
  View raw data graphically
    – Frequency distribution graphs
    – Scatter plots

Outliers
  Extreme scores
  Come from samples other than those of interest
  Can lead to Type I and II errors

Outlier Detection
  Graph
    – Box plots
    – Scatter plots
  Numerical criterion
    – Extremity (central tendency +/- spread)
        Outside fences
          – lower: Q1 - 3(Q3 - Q1)
          – upper: Q3 + 3(Q3 - Q1)
        z-score
    – Probability (extremity + # measurements)
        Chauvenet’s/Peirce’s criterion, Grubbs’ test
    – Absolute cutoff
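
The two extremity criteria above can be sketched in a few lines of Python. This is only an illustration: the function names, the quartile method, the |z| > 3 cutoff, and the digit-span scores are assumptions, not taken from the slides.

```python
import statistics

def outer_fences(values, k=3.0):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def z_scores(values):
    """Return each value's distance from the mean in sample-SD units."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def flag_outliers(values, k=3.0, z_cutoff=3.0):
    """Return the values flagged by the fence rule and by the z-score rule."""
    lower, upper = outer_fences(values, k)
    by_fence = [v for v in values if v < lower or v > upper]
    by_z = [v for v, z in zip(values, z_scores(values)) if abs(z) > z_cutoff]
    return by_fence, by_z

# Hypothetical digit-span scores; 107 is an impossible value.
spans = [5, 6, 6, 7, 7, 7, 8, 8, 9, 107]
by_fence, by_z = flag_outliers(spans)
print(by_fence, by_z)
```

In this made-up sample the fence rule flags 107 while the |z| > 3 rule flags nothing, because the extreme score inflates the standard deviation it is judged against. That is one reason to look at more than one criterion.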

Outlier Analysis
  Determine nature of impact
    – Quantitative
        Changes numbers, not inferences
    – Qualitative
        Changes numbers and inferences
  Consider source of outlier
    – Quantitative
        Same underlying mechanism/sample
    – Qualitative
        Different underlying mechanisms/samples
    – e.g., digit span = 107, simple RT = 1200 ms

Outlier Coping
  Options
    – Retain
    – Remove
    – Reduce
        Winsorize
        Normalizing transformation
  Considerations
    – Impact/Source
    – Convention
    – Believability
        Justification
        Replication
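
As a rough sketch of the "reduce" option, the function below winsorizes by replacing the k most extreme scores in each tail with the nearest retained score. The count-based rule, the function name, and the data are assumptions for illustration; percentile-based and SD-based variants also exist.

```python
def winsorize(values, k=1):
    """Replace the k smallest and k largest values with their nearest retained neighbours."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value untouched")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Clamp every value into [low, high]; the original order is preserved.
    return [min(max(v, low), high) for v in values]

# Hypothetical digit-span scores with one impossible value.
spans = [5, 6, 6, 7, 7, 7, 8, 8, 9, 107]
print(winsorize(spans, k=1))
```

Unlike removal, this keeps the sample size intact and only pulls the extreme scores inward.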

Transformations
  Linear “rescaling”
    – unit conversion
        e.g., # items correct, # items wrong
        e.g., standardization
  Curvilinear “reexpression”
    – variable conversion
        e.g., time (sec/trial) to speed (trials/sec)
        e.g., normalization
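
One way to see the difference between rescaling and reexpression is to compare z-scores before and after each transformation. In this sketch the scores, the 20-item test length, and the trial times are all made up for illustration.

```python
import math
import statistics

def standardize(values):
    """Convert values to z-scores using the sample mean and SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Linear rescaling: items correct -> items wrong on a hypothetical 20-item test.
correct = [12, 15, 18, 20]
wrong = [20 - c for c in correct]

# Curvilinear reexpression: time (sec/trial) -> speed (trials/sec).
times = [0.5, 1.0, 2.0, 4.0]
speeds = [1.0 / t for t in times]

# A linear change only flips the sign of the z-scores here ...
linear_preserved = all(
    math.isclose(zc, -zw, abs_tol=1e-9)
    for zc, zw in zip(standardize(correct), standardize(wrong))
)
# ... whereas the inverse changes the shape of the distribution.
curvilinear_preserved = all(
    math.isclose(zt, -zs, abs_tol=1e-9)
    for zt, zs in zip(standardize(times), standardize(speeds))
)
print(linear_preserved, curvilinear_preserved)
```

The linear conversion leaves every score's relative standing untouched, while the inverse compresses the slow end and stretches the fast end, which is why it can change skew.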

Standardization
  Why standardize data?
    – Intra-distribution statistics
        You got 8 questions wrong on one exam
        You were one standard deviation below the mean
    – Inter-distribution statistics
        You got 8 questions wrong on the midterm and 5 questions wrong on the final
        Aggregation: Overall, you were one standard deviation below the mean
        Comparison: You did better on the midterm than the final
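
The aggregation and comparison above can be worked through numerically. The slides do not give class means or standard deviations, so the values below are invented purely so that the arithmetic reproduces the two statements; treat them as assumptions.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical class statistics for number of questions wrong (not from the slides).
midterm = {"wrong": 8, "class_mean": 10.0, "class_sd": 4.0}
final = {"wrong": 5, "class_mean": 2.5, "class_sd": 1.0}

# Flip the sign so that a higher z means better performance (fewer wrong answers).
z_midterm = -z_score(midterm["wrong"], midterm["class_mean"], midterm["class_sd"])
z_final = -z_score(final["wrong"], final["class_mean"], final["class_sd"])

# Aggregation: average the two standardized scores.
overall = (z_midterm + z_final) / 2

# Comparison: which exam went better relative to the class?
better_on_midterm = z_midterm > z_final
print(z_midterm, z_final, overall, better_on_midterm)
```

With these assumed class statistics, 8 wrong on the midterm is better than average and 5 wrong on the final is well below average, so the raw counts alone would have pointed the wrong way.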

z-score
  Number of standard deviations above/below the mean
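
Written out, the definition is the usual one. The symbols are assumptions, since the slide text gives none: X is a raw score, μ and σ are the population mean and standard deviation, and X̄ and s are their sample counterparts.

```latex
z = \frac{X - \mu}{\sigma}
\qquad \text{or, using sample estimates,} \qquad
z = \frac{X - \bar{X}}{s}
```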

Normal Distributions
  “…normality is a myth; there never was, and never will be, a normal distribution.” –Geary (1947)
  “Experimentalists think that it is a mathematical theorem while the mathematicians believe it to be an experimental fact.” –Lippman (1917)

Normalization
  Why normalize DV?
    – Meet statistical assumption of normality in situations when it matters
        Small n
        Unequal n
        One-sample t and z tests
    – Increase power
  Why NOT normalize DV?
    – Interpretability
    – Affects measurement scale

Tests of Normality
  Frequency distribution
  Skew/kurtosis statistics
  Kolmogorov-Smirnov test
  Probability plots (e.g., P-P plot)
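
A minimal sketch of the numerical checks, using only the Python standard library. The moment-based formulas, the function names, and the simulated samples are assumptions for illustration. The last function returns only the Kolmogorov-Smirnov distance, not a p-value: when the normal's mean and SD are estimated from the same sample, the standard KS critical values no longer apply.

```python
import random
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for a symmetric distribution)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

def ks_distance(values):
    """Largest gap between the empirical CDF and a normal fitted to the sample."""
    n = len(values)
    fitted = statistics.NormalDist(statistics.fmean(values), statistics.stdev(values))
    gap = 0.0
    for i, v in enumerate(sorted(values), start=1):
        f = fitted.cdf(v)
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap

rng = random.Random(0)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(5000)]

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    print(name, skewness(sample), excess_kurtosis(sample), ks_distance(sample))
```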

Types of Curvilinear Transformations

Does normalization help?
  Games & Lucas (1966): Normalizing transformations hurt
    – Reduce interpretability, power
  Levin & Dunlap (1982): Transformations help
    – Increase power
  Games (1983): It depends, Levin and Dunlap are stupid
  Levin & Dunlap (1984): It depends, Games is stupid
  Games (1984): This debate is stupid

Does non-normality hurt?

Normalize If and Only If
  It matters
    – In theory: Got robust?
    – In practice: Got change?
  Must assume normality (i.e., no non-parametric test available)

Missing Data

Why are they missing?
  – MCAR (missing completely at random)
      Variable’s missingness unrelated to both its value and other variables’ values
      e.g., equipment malfunction
      No bias
  – MAR (missing at random)
      Variable’s missingness unrelated to its value after controlling for its relation to other variables
      e.g., depression and income
      Bias
  – MNAR (missing not at random)
      Variable’s missingness related to its value after controlling for its relation to other variables
      e.g., income reporting
      Bias
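
A small simulation can make the bias claims concrete for the two extreme cases. Everything here is hypothetical: the income distribution, the missingness probabilities, and the choice to make low incomes the ones that go unreported. MAR is left out because it needs a second observed variable.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical incomes in thousands of dollars.
incomes = [rng.gauss(50.0, 10.0) for _ in range(20000)]

# MCAR: every value has the same 30% chance of being missing.
mcar_observed = [x for x in incomes if rng.random() > 0.30]

# MNAR: missingness depends on the value itself.
# Here incomes below 50 go unreported 60% of the time, others 10% of the time.
mnar_observed = [x for x in incomes if rng.random() > (0.60 if x < 50.0 else 0.10)]

true_mean = statistics.fmean(incomes)
mcar_bias = statistics.fmean(mcar_observed) - true_mean
mnar_bias = statistics.fmean(mnar_observed) - true_mean
print(round(true_mean, 2), round(mcar_bias, 2), round(mnar_bias, 2))
```

Under MCAR the observed mean stays close to the full-sample mean and only the sample size shrinks. Under this MNAR rule the observed mean is pushed upward, because the missing values are disproportionately the low ones.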

Diagnosing Missing Data
  How much?
  How concentrated?
  How essential?
  MCAR, MAR, MNAR?
  How influential?

Dealing with Missing Data
  – Treat missing data as data
  – Note bias
      “lower income individuals are underrepresented”
  – Delete variables
  – Delete cases
      Listwise
      Casewise
  – Estimation
      Prior knowledge
      Mean substitution
      Regression substitution
      Expectation-maximization (EM)
      Hot decking
      Multiple imputation (MI)
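
Two of the simpler options, listwise deletion and mean substitution, can be sketched as follows. The record layout, field names, and values are hypothetical, and missing values are represented by None.

```python
import statistics

# Hypothetical records; None marks a missing income.
rows = [
    {"age": 21, "income": 30.0},
    {"age": 35, "income": None},
    {"age": 42, "income": 40.0},
    {"age": 29, "income": 50.0},
    {"age": 58, "income": None},
    {"age": 47, "income": 60.0},
]

def listwise_delete(records):
    """Keep only the records with no missing value on any variable."""
    return [r for r in records if all(v is not None for v in r.values())]

def mean_substitute(records, key):
    """Fill missing values of one variable with the mean of its observed values."""
    observed = [r[key] for r in records if r[key] is not None]
    fill = statistics.fmean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in records]

complete = listwise_delete(rows)
filled = mean_substitute(rows, "income")

sd_complete = statistics.stdev(r["income"] for r in complete)
sd_filled = statistics.stdev(r["income"] for r in filled)
print(len(complete), len(filled), round(sd_complete, 2), round(sd_filled, 2))
```

Mean substitution keeps every case and leaves the mean unchanged, but it shrinks the standard deviation, since the filled-in values sit exactly at the center of the distribution.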

Missing Data: Conclusions
  Avoid missing data!
  If rare (<5%), MCAR, nonessential, concentrated, or impotent, delete appropriately
  If frequent, patterned, essential, diffuse, influential, use MI
  If MNAR, treat missingness as DV

Question: What’s the best method for identifying and removing RT outliers?
  Alternatives
    – RT cutoff (5 values)
    – z-score cutoff (1, 1.5)
    – Transformation (log, inverse)
    – Trimming
    – Medians
    – Winsorizing (2 SD)
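
Several of these alternatives are easy to express in code. The sketch below is illustrative only: the reaction times, the 1500 ms cutoff, and the function names are assumptions, and the inverse transformation is summarized by converting the mean of 1/RT back to milliseconds.

```python
import statistics

def absolute_cutoff(rts, upper_ms):
    """Drop reaction times above a fixed cutoff."""
    return [rt for rt in rts if rt <= upper_ms]

def sd_cutoff(rts, n_sd):
    """Drop reaction times more than n_sd sample SDs from the mean."""
    mean = statistics.fmean(rts)
    sd = statistics.stdev(rts)
    return [rt for rt in rts if abs(rt - mean) <= n_sd * sd]

def winsorize_sd(rts, n_sd=2.0):
    """Pull reaction times beyond n_sd SDs back to the n_sd boundary."""
    mean = statistics.fmean(rts)
    sd = statistics.stdev(rts)
    low, high = mean - n_sd * sd, mean + n_sd * sd
    return [min(max(rt, low), high) for rt in rts]

def inverse_mean_ms(rts):
    """Average on the 1/RT scale, then convert back to milliseconds."""
    return 1.0 / statistics.fmean([1.0 / rt for rt in rts])

# Hypothetical reaction times in ms with one slow outlier.
rts = [420, 450, 470, 480, 500, 510, 530, 2400]

print(statistics.fmean(rts))            # raw mean, pulled up by the outlier
print(statistics.median(rts))           # median
print(absolute_cutoff(rts, 1500))       # fixed cutoff
print(sd_cutoff(rts, 1.5))              # z-score cutoff
print(max(winsorize_sd(rts, 2.0)))      # largest value after winsorizing
print(inverse_mean_ms(rts))             # inverse-transformed summary
```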

Method
  Conduct series of simulations
    – DV: power (# sig simulations/1000)
  2 x 2 ANOVA
    – One main effect (20, 30, 40 ms)
  7 observations/condition
    – 10% outlier probability
    – Outliers ms
  32 participants
  Between-participants variability

[Figure: ex-Gaussian distribution; panel labels “Spread” and “Drift”]
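
For readers unfamiliar with it, an ex-Gaussian variate is the sum of a normal component and an independent exponential component, which gives the right-skewed shape typical of reaction times. The sketch below draws such a sample; the parameter values (mu = 400, sigma = 40, tau = 100 ms) are hypothetical and are not the ones used in the simulations described here.

```python
import random
import statistics

def ex_gaussian(mu, sigma, tau, rng):
    """Draw one value: a normal(mu, sigma) component plus an exponential with mean tau."""
    return rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau)

rng = random.Random(42)
sample = [ex_gaussian(400.0, 40.0, 100.0, rng) for _ in range(20000)]

sample_mean = statistics.fmean(sample)      # expected value is mu + tau
sample_median = statistics.median(sample)   # below the mean for a right-skewed distribution
print(round(sample_mean, 1), round(sample_median, 1))
```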

Inferences
  Absolute cutoffs resulted in greatest power
  Best cutoff values depended on type of effect
    – Shift: 10-15% cutoff
    – Spread: 5% cutoff
  Inverse transformation good, too
  With high between-participant variability, SD cutoff becomes effective

Recommendations
  Try range of cutoffs to examine robustness
  Replicate with inverse transformation (or SD cutoff)
  Replicate novel, unexpected, or important effects
  Choose method before analyzing data