PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008.


1 PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008

2 On Suspecting Fishiness Looking for outliers, gaps, and dips –e.g., tests of clairvoyance When gaps or dips are hypothesized –e.g., is dyslexia a distinct entity? Cliffs –e.g., differences between ratings of ingroup and outgroup Peaks –e.g., the blackout and baby boom The occurrence of impossible scores

3 Visualize your data! “make friends with your data” –Rosenthal “don’t become lovers with your data” –Me Statistics condense data View raw data graphically –Frequency distribution graphs –Scatter plots

4

5 Outliers Extreme scores Come from samples other than those of interest Can lead to Type I and II errors

6 Outlier Detection Graph –Box plots –Scatter plots Numerical criterion –Extremity (central tendency +/- spread) Outside fences –lower: Q1 - 3(Q3 - Q1) –upper: Q3 + 3(Q3 - Q1) z-score –Probability (Extremity + # measurements) Chauvenet’s/Peirce’s criterion, Grubbs’ test –Absolute cutoff
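
The two extremity criteria on this slide can be sketched in a few lines. This is a minimal illustration, not part of the original deck: the function names and the sample reaction times are made up, and the quartiles use Python's "inclusive" interpolation, which is only one of several quartile conventions.

```python
import statistics

def tukey_fences(values, k=3.0):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Quartile convention is an assumption; other packages interpolate differently.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def fence_outliers(values, k=3.0):
    """Values falling outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

def z_outliers(values, cutoff=3.0):
    """Values whose absolute z-score exceeds the cutoff."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return [v for v in values if abs(v - mean) / sd > cutoff]

# Hypothetical simple RTs in ms, with one suspicious value.
rts = [420, 450, 455, 470, 480, 495, 500, 510, 530, 1200]
print(fence_outliers(rts))
print(z_outliers(rts))
```

With this sample the fences flag the 1200 ms score, but the |z| > 3 rule flags nothing: the outlier inflates the SD, and in a sample of n scores no |z| can exceed (n - 1)/sqrt(n), about 2.85 for n = 10. That is one reason to prefer quartile-based criteria in small samples.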

7 Outlier Analysis Determine nature of impact –Quantitative Changes numbers, not inferences –Qualitative Changes numbers and inferences Consider source of outlier –Quantitative Same underlying mechanism/sample –Qualitative Different underlying mechanisms/samples –e.g., digit span = 107, simple RT = 1200 ms

8 Outlier Coping Options –Retain –Remove –Reduce Winsorize Normalizing transformation Considerations –Impact/Source –Convention –Believability Justification Replication
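
As a rough sketch of the "Reduce" option, Winsorizing pulls extreme scores in to a boundary instead of deleting them. The version below clamps at mean +/- k SD, following the "2 SD" variant mentioned later in the deck; Winsorizing is more often defined on percentiles, so treat the boundary rule, the function name, and the data as assumptions.

```python
import statistics

def winsorize_sd(values, k=2.0):
    """Clamp each value to the interval mean +/- k sample SDs."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    low, high = mean - k * sd, mean + k * sd
    return [min(max(v, low), high) for v in values]

# Hypothetical RTs in ms; only the last value lies beyond 2 SD.
rts = [420, 450, 455, 470, 480, 495, 500, 510, 530, 1200]
winsorized = winsorize_sd(rts)
print(winsorized[-1])  # the extreme score is pulled in, not removed
```

The sample size is unchanged, which is the main practical difference from removal.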

9 Transformations Linear “rescaling” –unit conversion e.g., # items correct, # items wrong e.g., standardization Curvilinear “reexpression” –variable conversion e.g., time (sec/trial) to speed (trials/sec) e.g., normalization

10 Standardization Why standardize data? –Intra-distribution statistics You got 8 questions wrong on one exam You were one standard deviation below the mean –Inter-distribution statistics You got 8 questions wrong on the midterm and 5 questions wrong on the final Aggregation: Overall, you were one standard deviation below the mean Comparison: You did better on the midterm than the final

11 z-score # standard deviations above/below the mean
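
A small sketch of the z-score, z = (x - mean) / SD, applied to the exam example from the Standardization slide. The class scores are invented for illustration, and the sample SD (n - 1 denominator) is an arbitrary choice.

```python
import statistics

def z_score(x, values):
    """Number of sample SDs that x lies above (+) or below (-) the mean."""
    return (x - statistics.mean(values)) / statistics.stdev(values)

# Hypothetical numbers of questions wrong for a class of five.
midterm_wrong = [2, 4, 6, 8, 10]
final_wrong = [1, 2, 3, 4, 5]

z_mid = z_score(8, midterm_wrong)  # 8 wrong on the midterm
z_fin = z_score(5, final_wrong)    # 5 wrong on the final
print(z_mid, z_fin)

# Linear rescaling (# wrong -> # correct out of 20) only flips the sign.
midterm_correct = [20 - w for w in midterm_wrong]
print(z_score(20 - 8, midterm_correct))
```

In these made-up data, 5 wrong on the final is further above the class mean (in SD units) than 8 wrong on the midterm, so the student did relatively better on the midterm despite making more raw errors.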

12

13

14 Normal Distributions “…normality is a myth; there never was, and never will be, a normal distribution.” –Geary (1947) “Experimentalists think that it is a mathematical theorem while the mathematicians believe it to be an experimental fact.” –Lippmann (1917)

15 Normalization Why normalize DV? –Meet statistical assumption of normality in situations when it matters Small n Unequal n One-sample t and z tests –Increase power Why NOT normalize DV? –Interpretability –Affects measurement scale

16 Tests of Normality Frequency distribution Skew/kurtosis statistics Kolmogorov-Smirnov test Probability plots (e.g., P-P plot)
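
The skew and kurtosis statistics can be computed directly from central moments. The sketch below uses the simple moment-based (population) formulas; statistical packages often report bias-corrected versions, so values will differ slightly. The function names and RT sample are hypothetical. It also shows, for this one sample, how a log reexpression changes skew.

```python
import math
import statistics

def central_moment(values, order):
    """Average of (x - mean)**order over the sample."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical positively skewed RTs in ms.
rts = [300, 320, 350, 400, 480, 600, 900, 1500]
log_rts = [math.log(rt) for rt in rts]

print(skewness(rts), skewness(log_rts))
print(excess_kurtosis(rts))
```

For this sample the log transformation reduces the positive skew without eliminating it.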

17 Types of Curvilinear Transformations

18 Does normalization help? Games & Lucas (1966): Normalizing transformations hurt –Reduce interpretability, power Levin & Dunlap (1982): Transformations help –Increase power Games (1983): It depends, Levin and Dunlap are stupid Levin & Dunlap (1984): It depends, Games is stupid Games (1984): This debate is stupid

19 Does non-normality hurt?

20 Normalize If and Only If It matters –In theory: Got robust? –In practice: Got change? Must assume normality (i.e., no non-parametric test available)

21 Missing Data

22 Why are they missing? –MCAR Variable’s missingness unrelated to both its value and other variables’ values e.g., equipment malfunction No bias –MAR Variable’s missingness unrelated to its value after controlling for its relation to other variables e.g., depression and income Bias –MNAR Variable’s missingness related to its value after controlling for its relation to other variables e.g., income reporting Bias
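
The three mechanisms can be illustrated with a toy simulation. Everything here is assumed for the sake of the sketch: the income and depression numbers, the missingness probabilities, and the seed are invented, and the only outcome examined is the mean of the incomes that remain observed.

```python
import random
import statistics

random.seed(1950)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: depressed people earn less on average.
N = 20000
people = []
for _ in range(N):
    depressed = random.random() < 0.3
    income = random.gauss(40 if depressed else 55, 10)
    people.append((depressed, income))

def observed_mean(missing_prob):
    """Mean income among cases that survive a missingness rule."""
    kept = [inc for dep, inc in people
            if random.random() >= missing_prob(dep, inc)]
    return statistics.fmean(kept)

true_mean = statistics.fmean([inc for _, inc in people])

# MCAR: missingness ignores both income and depression.
mcar = observed_mean(lambda dep, inc: 0.3)
# MAR: missingness depends on depression only, not on income itself.
mar = observed_mean(lambda dep, inc: 0.6 if dep else 0.1)
# MNAR: high earners withhold their income.
mnar = observed_mean(lambda dep, inc: 0.8 if inc > 60 else 0.05)

print(true_mean, mcar, mar, mnar)
```

Under these assumptions the MCAR mean stays close to the full-sample mean, the MAR mean is too high because depressed (lower-income) cases drop out more often, and the MNAR mean is too low. In the MAR case the bias arises only through depression, which is observed, so it can in principle be adjusted for; the sketch does not show that step.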

23 Diagnosing Missing Data How much? How concentrated? How essential? MCAR, MAR, MNAR? How influential?

24 Dealing with Missing Data –Treat missing data as data –Note bias “lower income individuals are underrepresented” –Delete variables –Delete cases Listwise Pairwise –Estimation Prior knowledge Mean substitution Regression substitution Expectation-maximization (EM) Hot decking Multiple imputation (MI)
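
Two of the simpler options, listwise deletion and mean substitution, can be sketched as follows. The records, field names, and helper functions are hypothetical; EM and multiple imputation need dedicated software and are not attempted here.

```python
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 21, "income": 30, "score": 7},
    {"age": 35, "income": None, "score": 5},
    {"age": 42, "income": 55, "score": None},
    {"age": 29, "income": 41, "score": 6},
    {"age": 50, "income": 72, "score": 9},
]

def listwise_delete(records):
    """Keep only cases with no missing value on any variable."""
    return [r for r in records if all(v is not None for v in r.values())]

def mean_substitute(records, key):
    """Replace missing values of one variable with its observed mean."""
    observed = [r[key] for r in records if r[key] is not None]
    fill = statistics.fmean(observed)
    return [r[key] if r[key] is not None else fill for r in records]

complete = listwise_delete(rows)
observed_income = [r["income"] for r in rows if r["income"] is not None]
filled_income = mean_substitute(rows, "income")

print(len(complete))
print(statistics.stdev(observed_income), statistics.stdev(filled_income))
```

Listwise deletion loses two of the five cases here, and mean substitution keeps every case but shrinks the standard deviation, because the filled-in value adds nothing to the spread.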

25 Missing Data: Conclusions Avoid missing data! If rare (<5%), MCAR, nonessential, concentrated, or impotent, delete appropriately If frequent, patterned, essential, diffuse, influential, use MI If MNAR, treat missingness as DV

26 Question: What’s the best method for identifying and removing RT outliers? Alternatives –RT cutoff (5 values) –z-score cutoff (1, 1.5) –Transformation (log, inverse) –Trimming –Medians –Winsorizing (2 SD)
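
Several of these alternatives can be sketched on a single participant's RTs. The data, function names, and cutoff values below are illustrative assumptions, not the values used in the study the slides summarize.

```python
import math
import statistics

def absolute_cutoff(rts, max_rt):
    """Drop RTs above a fixed cutoff in ms."""
    return [rt for rt in rts if rt <= max_rt]

def sd_cutoff(rts, k):
    """Drop RTs more than k sample SDs from the mean."""
    mean, sd = statistics.mean(rts), statistics.stdev(rts)
    return [rt for rt in rts if abs(rt - mean) <= k * sd]

def trimmed(rts, proportion):
    """Drop the given proportion of scores from each tail."""
    k = int(len(rts) * proportion)
    ordered = sorted(rts)
    return ordered[k:len(ordered) - k]

# Hypothetical RTs (ms) for one condition, with one slow outlier.
rts = [412, 455, 478, 501, 530, 566, 1850]

raw_mean = statistics.fmean(rts)
cut_mean = statistics.fmean(absolute_cutoff(rts, 1000))
median_rt = statistics.median(rts)
# Means computed on a transformed scale, then converted back to ms.
log_mean = math.exp(statistics.fmean([math.log(rt) for rt in rts]))
inverse_mean = 1 / statistics.fmean([1 / rt for rt in rts])

print(raw_mean, cut_mean, median_rt, log_mean, inverse_mean)
```

For positive RTs the back-transformed inverse mean is never larger than the back-transformed log mean, which is never larger than the raw mean, so the inverse transformation discounts slow outliers most heavily.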

27 Method Conduct series of simulations –DV: power (# sig simulations/1000) 2 x 2 ANOVA –One main effect (20, 30, 40 ms) 7 observations/condition –10% outlier probability –Outliers 0-2000 ms 32 participants Between-participants variability
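
The data-generation step of such a simulation might look like the sketch below. An ex-Gaussian RT is the sum of a normal and an exponential component. The parameter values (mu, sigma, tau), the function names, and the choice to add a uniform 0-2000 ms delay on outlier trials are assumptions made for illustration; the ANOVA and the power tally are not implemented.

```python
import random

def ex_gaussian(mu, sigma, tau, rng):
    """One ex-Gaussian draw: Normal(mu, sigma) plus Exponential(mean tau)."""
    return rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau)

def simulate_condition(n_obs, mu, sigma, tau, p_outlier, rng):
    """RTs for one participant in one condition, contaminated by outliers."""
    rts = []
    for _ in range(n_obs):
        rt = ex_gaussian(mu, sigma, tau, rng)
        if rng.random() < p_outlier:
            rt += rng.uniform(0, 2000)  # assumed outlier model: added delay
        rts.append(rt)
    return rts

rng = random.Random(0)  # arbitrary seed for reproducibility
# 7 observations per condition and 10% outliers, as on the slide.
baseline = simulate_condition(7, mu=400, sigma=40, tau=200,
                              p_outlier=0.10, rng=rng)
# A 30 ms effect modeled here as a change in mu (hypothetical choice).
shifted = simulate_condition(7, mu=430, sigma=40, tau=200,
                             p_outlier=0.10, rng=rng)
print(baseline)
print(shifted)
```

Repeating this for 32 simulated participants, applying an outlier method, and counting significant tests over 1000 runs would give the power estimate described on the slide.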

28 Spread Drift ex-Gaussian distribution

29

30

31

32 Inferences Absolute cutoffs resulted in greatest power Best cutoff values depended on type of effect –Shift: 10-15% cutoff –Spread: 5% cutoff Inverse transformation good, too With high between-participant variability, SD cutoff becomes effective

33 Recommendations Try range of cutoffs to examine robustness Replicate with inverse transformation (or SD cutoff) Replicate novel, unexpected, or important effects Choose method before analyzing data

