Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 April 9, 2012.

Slides:

Advertisements

Similar presentations

T-tests continued.

Advertisements

Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 March 12, 2012.

Treatment of missing values

6-1 Introduction To Empirical Models 6-1 Introduction To Empirical Models.

Some birds, a cool cat and a wolf

CJT 765: Structural Equation Modeling Class 3: Data Screening: Fixing Distributional Problems, Missing Data, Measurement.

PSY 307 – Statistics for the Behavioral Sciences

Lecture 9: One Way ANOVA Between Subjects

How to deal with missing data: INTRODUCTION

Today Concepts underlying inferential statistics

Psych 524 Andrew Ainsworth Data Screening 2. Transformation allows for the correction of non-normality caused by skewness, kurtosis, or other problems.

FINAL REPORT: OUTLINE & OVERVIEW OF SURVEY ERRORS

Statistical Methods for Missing Data Roberta Harnett MAR 550 October 30, 2007.

Introduction to Hypothesis Testing

Multiple imputation using ICE: A simulation study on a binary response Jochen Hardt Kai Görgen 6 th German Stata Meeting, Berlin June, 27 th 2008 Göteborg.

Psy B07 Chapter 1Slide 1 ANALYSIS OF VARIANCE. Psy B07 Chapter 1Slide 2 t-test refresher  In chapter 7 we talked about analyses that could be conducted.

ANCOVA Lecture 9 Andrew Ainsworth. What is ANCOVA?

Chapter 13: Inference in Regression

Introduction to plausible values National Research Coordinators Meeting Madrid, February 2010.

APPENDIX B Data Preparation and Univariate Statistics How are computer used in data collection and analysis? How are collected data prepared for statistical.

Fundamentals of Data Analysis Lecture 4 Testing of statistical hypotheses.

F OUNDATIONS OF S TATISTICAL I NFERENCE. D EFINITIONS Statistical inference is the process of reaching conclusions about characteristics of an entire.

1 CSI5388: Functional Elements of Statistics for Machine Learning Part I.

Guide to Handling Missing Information Contacting researchers Algebraic recalculations, conversions and approximations Imputation method (substituting missing.

Making estimations Statistics for the Social Sciences Psychology 340 Spring 2010.

Estimation Bias, Standard Error and Sampling Distribution Estimation Bias, Standard Error and Sampling Distribution Topic 9.

Inferential Statistics 2 Maarten Buis January 11, 2006.

Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.

University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 08/10/ :23 PM 1 Some basic statistical concepts, statistics.

Chapter 12 Multiple Linear Regression Doing it with more variables! More is better. Chapter 12A.

User Study Evaluation Human-Computer Interaction.

Educational Research: Competencies for Analysis and Application, 9 th edition. Gay, Mills, & Airasian © 2009 Pearson Education, Inc. All rights reserved.

Standard Error and Confidence Intervals Martin Bland Professor of Health Statistics University of York

1 Introduction to Survey Data Analysis Linda K. Owens, PhD Assistant Director for Sampling & Analysis Survey Research Laboratory University of Illinois.

G Lecture 11 G Session 12 Analyses with missing data What should be reported?  Hoyle and Panter  McDonald and Moon-Ho (2002)

Applied Epidemiologic Analysis - P8400 Fall 2002 Lab 10 Missing Data Henian Chen, M.D., Ph.D.

Managerial Economics Demand Estimation & Forecasting.

CHAPTER 12 Descriptive, Program Evaluation, and Advanced Methods.

Sampling Error.  When we take a sample, our results will not exactly equal the correct results for the whole population. That is, our results will be.

Section 10.1 Confidence Intervals

SW 983 Missing Data Treatment Most of the slides presented here are from the Modern Missing Data Methods, 2011, 5 day course presented by the KUCRMDA,

1 G Lect 13W Imputation (data augmentation) of missing data Multiple imputation Examples G Multiple Regression Week 13 (Wednesday)

The Impact of Missing Data on the Detection of Nonuniform Differential Item Functioning W. Holmes Finch.

Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 February 22, 2012.

Simulation Study for Longitudinal Data with Nonignorable Missing Data Rong Liu, PhD Candidate Dr. Ramakrishnan, Advisor Department of Biostatistics Virginia.

T tests comparing two means t tests comparing two means.

Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 March 13, 2013.

IMPORTANCE OF STATISTICS MR.CHITHRAVEL.V ASST.PROFESSOR ACN.

D/RS 1013 Data Screening/Cleaning/ Preparation for Analyses.

Tutorial I: Missing Value Analysis

Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 3, 2013.

Statistics (cont.) Psych 231: Research Methods in Psychology.

Jump to first page Inferring Sample Findings to the Population and Testing for Differences.

Educational Research Inferential Statistics Chapter th Chapter 12- 8th Gay and Airasian.

Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.

Data Screening. What is it? Data screening is very important to make sure you’ve met all your assumptions, outliers, and error problems. Each type of.

Sampling and Sampling Distribution

Missing data: Why you should care about it and what to do about it

MISSING DATA AND DROPOUT

Maximum Likelihood & Missing data

Introduction to Survey Data Analysis

The European Statistical Training Programme (ESTP)

CH2. Cleaning and Transforming Data

Simple Linear Regression

Experimental Design: The Basic Building Blocks

Essential Statistics Two-Sample Problems - Two-sample t procedures -

Statistics for the Social Sciences

Clinical prediction models

Chapter 13: Item nonresponse

Considerations for the use of multiple imputation in a noninferiority trial setting Kimberly Walters, Jie Zhou, Janet Wittes, Lisa Weissfeld Joint Statistical.

Presentation transcript:

Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 April 9, 2012

Today’s Class Missing Data and Imputation

Missing Data Frequently, when collecting large amounts of data from diverse sources, there are missing values for some data sources

Examples It’s easy to think of examples; can anyone here give examples from your own current or past research or projects

Classes of missing data Missing all data/“Unit nonresponse” Missing all of one source of data – E.g. student did not fill out questionnaire but used tutor Missing specific data/“Item nonresponse” – E.g. student did not answer one question on questionnaire Subject dropout/attrition – Subject ceased to be part of population during study E.g. student was suspended for a fight

What do we do?

Case Deletion Simply delete any case that has at least one missing value Alternate form: Simply delete any case that is missing the dependent variable

Case Deletion In what situations might this be acceptable? In what situations might this be unacceptable? In what situations might this be practically impossible?

Case Deletion In what situations might this be acceptable? – Relatively little missing data in sample – Dependent variable missing, and journal unlikely to accept imputed dependent variable – Almost all data missing for case Example: A student who is absent during entire usage of tutor In what situations might this be unacceptable? In what situations might this be practically impossible?

Case Deletion In what situations might this be acceptable? In what situations might this be unacceptable? – Data loss appears to be non-random Example: The students who fail to answer “How much marijuana do you smoke?” have lower GPA than the average student who does answer that question – Data loss is due to attrition, and you care about inference up until the point of the data loss Student completes pre-test, tutor, and post-test, but not retention test In what situations might this be practically impossible?

Case Deletion In what situations might this be acceptable? In what situations might this be unacceptable? In what situations might this be practically impossible? – Almost all students missing at least some data

Analysis-by-Analysis Case Deletion Common approach Advantages? Disadvantages?

Analysis-by-Analysis Case Deletion Common approach Advantages? – Every analysis involves all available data Disadvantages? – Difficult to do omnibus or multivariate analyses – Are your analyses fully comparable to each other?

Mean Substitution Replace all missing data with the mean value for the data set Mathematically equivalent: unitize all variables, and treat missing values as 0

Mean Substitution Advantages? Disadvantages?

Mean Substitution Advantages? – Simple to Conduct – Essentially drops missing data from analysis without dropping case from analysis entirely Disadvantages? – Distorts variances and correlations – May bias against significance in an over- conservative fashion

Distortion From Mean Substitution Imagine a sample where the true sample is that 50 out of 1000 students have smoked marijuana GPA Smokers: M=2.6, SD=0.5 Non-Smokers: M=3.3, SD=0.5

Distortion From Mean Substitution However, 30 of the 50 smokers refuse to answer whether they smoke, and 20 of the 950 non- smokers refuse to answer And the respondents who remain are fully representative GPA Smokers: M=2.6 Non-Smokers: M=3.3

Distortion From Mean Substitution GPA Smokers: M=2.6 Non-Smokers: M=3.3 Overall Average: M=3.285

Distortion From Mean Substitution GPA Smokers: M=2.6 Non-Smokers: M=3.3 Overall Average: M=3.285 Smokers (Mean Sub): M= 3.02 Non-Smokers (Mean Sub): M= 3.3

MAR and MNAR “Missing At Random” “Missing Not At Random”

MAR Data is MAR if R = Missing data Ycom = Complete data set (if nothing missing) Yobs = Observed data set

MAR In other words If values for R are not dependent on whether R is missing or not, the data is MAR

MAR and MNAR Are these MAR or MNAR? Students who smoke marijuana are less likely to answer whether they smoke marijuana Students who smoke marijuana are likely to lie and say they do not smoke marijuana Some students don’t answer all questions out of laziness Some data is not recorded due to server logging errors Some students are not present for whole study due to suspension from school due to fighting

MAR and MNAR MAR-based estimation may often be reasonably robust to violation of MAR assumption (Graham et al., 2007; Collins et al., 2001) Often difficult to verify for real data – In many cases, you don’t know why data is missing…

MAR-assuming approaches Single Imputation Multiple Imputation Maximum Likelihood Estimation

Single Imputation Replace all missing items with statistically plausible values and then conduct statistical analysis Mean substitution is a simple form of single imputation

Single Imputation Relatively simple to conduct Probably OK when limited missing data

Other Single Imputation Procedures

Hot-Deck Substitution: Replace each missing value with a value randomly drawn from other students (for the same variable) Very conservative; biases strongly towards no effect by discarding any possible association for that value

Other Single Imputation Procedures Linear regression/classification: For missing data for variable X Build regressor or classifier predicting observed cases of variable X from all other variables Substitute predictor of X for missing values

Other Single Imputation Procedures Linear regression/classification: For missing data for variable X Build regressor or classifier predicting observed cases of variable X from all other variables Substitute predictor of X for missing values Limitation: if you want to correlate X to other variables, this will increase the strength of correlation

Other Single Imputation Procedures Distribution-based linear regression/classification: For missing data for variable X Build regressor or classifier predicting observed cases of variable X from all other variables Compute probability density function for X – Based on confidence interval if X normally distributed Randomly draw from probability density function of each missing value Limitation: A lot of work, still reduces data variance in undesirable fashions

Multiple Imputation Conduct procedure similar to single imputation many times, creating many data sets – times recommended by Schafer & Graham (2002) Conduct meta-analysis across data sets, to determine both overall answer and degree of uncertainty

Multiple Imputation Procedure Several procedures – essentially extensions of single imputation procedures One example

Multiple Imputation Procedure Conduct linear regression/classification For each data set – Add noise to each data point, drawn from a distribution which maps to the distribution of the original (non-missing) data set for that variable – Note: if original distribution is non-normal, use non-normal noise distribution

Maximum Likelihood Use expectation maximization to iterate between Estimating the function to predict missing data, using both observed data and predictions of missing data Predicting missing data

Maximum Likelihood Limitations: Do not account for noise and variance in observations as well as Multiple Imputation Not very robust to departures from MAR

MNAR Estimation Selection models – Predict missingness on a variable from other variables – Then attempt to predict missing cases using both the other variables, and the model of situations when the variable is missing

Reducing Missing Values Of course, the best way to deal with missing values is to not have missing values in the first place For more on this, please take PSY503!

Asgn. 9 Students who completed Asgn. 9, please discuss your solutions – Christian – Mike Wixon – Zak

Wixon Solution: Over-fitting? Why might it be overfit?

Rogoff Solution: Over-fitting? StandardizedExam = * studentID * MathCareerInterest * MathCourseInHS

Now to compare the models of Wixon versus Rogoff On a new test set with no missing data

Results (r) StudentTraining SetTest Set Wixon Rogoff

Wait, what?

Correlation in test set between student ID and standardized exam score = But coefficient = So weighted value ranges from to 11.98, which was not enough to hurt prediction

Asgn. 10 Questions? Comments?

Next Class Wednesday, April 11 3pm-5pm AK232 Power Analysis Readings Lachin, J.M. (1981) Introduction to Sample Size Determination and Power Analysis for Clinical Trials Controlled Clinical Trials,2, [pdf][pdf] Cohen, J. (1992) A Power Primer. Psychological Bulletin, 112 (1), Assignments Due: None

The End