STAT 3130 Statistical Methods II Missing Data and Imputation.



STAT3130 – Missing Data
Rarely does “real” data come to you without any missing values. And “missing” can take several forms:
1. Truly “missing”: no value is present.
2. Coded: there is a “value”, but it means something different from the scale of the data.
3. Miscoded: there is a “value”, but it is wrong.

Let's take each one individually…
1) Truly Missing Data
Consider the GSS08 dataset. You will see missing data in both character and numeric variables. Missing values for a character variable appear in SAS as a blank, while missing values for a numeric variable appear as a “.”. This is the most obvious form of missing data. You can check for truly missing data with the following SAS code, where data and var stand in for your own dataset and variable names:
    Proc Means data = data nmiss;
      Var var;
    Run;
    Proc Freq data = data;
      Tables var;
    Run;
Proc Means applies to numeric variables only. Proc Freq works for both types and reports the count of missing values as “Frequency Missing” beneath the table.
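The same check can be sketched outside SAS. The following is a minimal illustration in Python (used here only for illustration, not the course's language), assuming that missing numeric values arrive as None and missing character values as blank strings. The records and the helper name count_missing are hypothetical.

```python
# Hypothetical records: None marks a missing numeric value,
# a blank string marks a missing character value.
rows = [
    {"age": 34, "sex": "F"},
    {"age": None, "sex": "M"},
    {"age": 51, "sex": ""},
    {"age": None, "sex": "F"},
]


def count_missing(rows, column):
    """Count values in one column that are None or blank strings."""
    missing = 0
    for row in rows:
        value = row.get(column)
        if value is None or (isinstance(value, str) and value.strip() == ""):
            missing += 1
    return missing


missing_by_column = {col: count_missing(rows, col) for col in ("age", "sex")}
print(missing_by_column)
```

This plays the role of the nmiss count: one number per variable telling you how many values are truly absent.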

2) Coded Data
Frequently, when data is entered into a database, any values that were missing, incorrect, illegible, etc. are coded at the time of entry. These codes are typically (but not always) provided to you in a data dictionary. Coded values are sometimes easy to spot (the codes are character when the rest of the data is numeric) and sometimes not (the codes are numeric, but outside the “true” range of the data). Consider the GSS08 dataset again. Take a look at the age variable: there are coded values there. What are they, and how would you know?

3) Miscoded Data
Humans make mistakes, and sometimes, in weird ways, computers do too. When data is entered incorrectly, it can really mess things up when you are trying to run a model or a test. Consider the age variable again in the GSS08 dataset…
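Coded and miscoded values can both be screened with a simple rule-based pass. The sketch below is in Python, for illustration only. The codes 98 and 99 and the range 18 to 89 are hypothetical placeholders, not the actual GSS08 codes; the real ones have to come from the data dictionary.

```python
# Hypothetical codes and range, for illustration only.
# The real values must be taken from the data dictionary.
MISSING_CODES = {98, 99}
VALID_RANGE = (18, 89)

ages = [23, 45, 99, 67, 98, 31, 250]


def recode(value):
    """Turn documented missing-value codes into None; keep everything else."""
    return None if value in MISSING_CODES else value


def out_of_range(value):
    """Flag non-missing values that fall outside the plausible range."""
    low, high = VALID_RANGE
    return value is not None and not (low <= value <= high)


recoded = [recode(a) for a in ages]
suspects = [a for a in recoded if out_of_range(a)]

print(recoded)   # coded values are now treated as truly missing
print(suspects)  # remaining out-of-range values are candidates for miscoding
```

Recoding first and range-checking second keeps the two problems separate: a documented code is a known kind of missing value, while an out-of-range value that is not a documented code is more likely an entry error.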

With all of these issues, you also need to determine whether the data is missing:
1) Missing completely at random, also called MCAR. This means that the missingness has no pattern. In other words, whether a value is missing cannot be predicted in any way.
2) Missing at random, also called MAR. This means that whether a value is missing can be predicted using the other data available for an observation. In these instances, you may want to assign a categorical value of “MISSING” (when the variable is categorical) to identify these observations differently.
3) Missing in a way that depends upon latent variables, often called missing not at random (MNAR). For example, there could be a latent (unobserved) variable which is highly correlated with the missing values. A familiar example from medical studies is that if a particular treatment causes discomfort, a patient is more likely to drop out of the study. This “missingness” is not at random (unless “discomfort” is measured and observed for all patients).

In all instances, the affected data values need to be replaced, or “imputed”, with a logical, meaningful value. Before we discuss the strategies for imputation, let's make a quick point about why this needs to be done. By default, most analytical procedures, including those in SAS, require a “complete case” for an observation to be included in the analysis. This means that if there are 100 variables and an observation is missing just ONE value, the entire case is removed from the analysis, and you lose the other 99 perfectly good values.

Think about this: if each of 50 variables is missing only 1% of its values, independently of the others, and you have 1,000,000 observations, you can expect to lose about 395,000 observations when you go to model:
    observations lost = total observations - ((1 - percent missing)^variables × total observations)
                      = 1,000,000 - ((1 - .01)^50 × 1,000,000)
                      ≈ 394,994
That is A LOT of valid data that you would lose! And it could bias your results.
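The arithmetic above can be checked with a short calculation. This Python sketch (for illustration only) assumes, as the formula does, that each variable is missing the same fraction of values and that missingness is independent across variables. The function name expected_complete_cases is hypothetical.

```python
def expected_complete_cases(n_obs, n_vars, p_missing):
    """Expected number of rows with no missing value in any variable,
    assuming independent missingness at rate p_missing in every variable."""
    return n_obs * (1 - p_missing) ** n_vars


n_obs, n_vars, p_missing = 1_000_000, 50, 0.01

kept = expected_complete_cases(n_obs, n_vars, p_missing)
lost = n_obs - kept

print(round(kept), round(lost))
```

Changing n_vars shows how quickly the loss grows: the surviving share is (1 - p_missing) raised to the number of variables, so every added variable removes another slice of the remaining complete cases.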

STAT3130 – Imputation
We need a way to replace those values logically. Many options for imputation exist. Here are four of the primary methods:
1. Mean-based imputation
2. Median-based imputation
3. Stratified imputation
4. Regressed imputation (difficult with MCAR)
Each of these will be discussed briefly in turn.

Imputation Strategies:
1) Mean-Based Imputation: this process is the simplest. It involves replacing the missing values with the mean of the variable. But before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?
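Questions b and c can be explored on a toy example. The sketch below is in Python, for illustration only; the data values and the helper name impute_mean are made up, and None stands for a missing value.

```python
from statistics import mean, stdev


def impute_mean(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


values = [10.0, 12.0, None, 14.0, None, 20.0]
observed = [v for v in values if v is not None]
imputed = impute_mean(values)

print("mean before/after: ", mean(observed), mean(imputed))
print("stdev before/after:", stdev(observed), stdev(imputed))
```

On this data the mean is unchanged while the standard deviation shrinks, because every filled-in value sits exactly at the center of the distribution and adds no spread.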

Imputation Strategies:
2) Median-Based Imputation: this process is also very simple. It involves replacing the missing values with the median of the variable. But before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?
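The difference from mean imputation shows up most clearly on skewed data. This is a minimal Python sketch, for illustration only; the values and the helper name impute_median are made up, and None stands for a missing value.

```python
from statistics import mean, median


def impute_median(values):
    """Replace each None with the median of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Skewed toy data: one large value pulls the mean far above the median.
values = [1.0, 2.0, None, 3.0, 4.0, 100.0, None]
observed = [v for v in values if v is not None]
imputed = impute_median(values)

print("median before/after:", median(observed), median(imputed))
print("mean before/after:  ", mean(observed), mean(imputed))
```

Here the median is preserved, but the mean moves toward the median because the filled-in values ignore the long tail. Whether that is acceptable depends on which summary your later analysis relies on.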

Imputation Strategies:
3) Stratified Imputation: this process is slightly more involved. It involves replacing the missing values with the mean or median of the variable, computed within strata of similar observations. But before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?
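One way to read “similar strata” is to fill each missing value with the mean of its own group. The Python sketch below (for illustration only) assumes a single categorical stratum variable and uses the stratum mean. The column names, data, and helper name impute_by_stratum are hypothetical.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"group": "A", "score": 10.0},
    {"group": "A", "score": None},
    {"group": "A", "score": 14.0},
    {"group": "B", "score": 50.0},
    {"group": "B", "score": 54.0},
    {"group": "B", "score": None},
]


def impute_by_stratum(rows, stratum_key, value_key):
    """Fill each missing value with the mean of the observed values in the
    same stratum. Raises KeyError if a stratum has no observed values."""
    observed = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            observed[row[stratum_key]].append(row[value_key])
    stratum_means = {key: mean(vals) for key, vals in observed.items()}
    return [
        {**row, value_key: stratum_means[row[stratum_key]]}
        if row[value_key] is None
        else dict(row)
        for row in rows
    ]


imputed = impute_by_stratum(rows, "group", "score")
print([row["score"] for row in imputed])
```

With an overall mean, both gaps would have been filled with 32.0, a value that is typical of neither group. Filling within strata keeps the group difference intact, at the cost of needing enough observed values in every stratum.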

Imputation Strategies:
4) Regressed Imputation: this process involves predicting the missing values using regression. It works well if the variables are related to each other and only one or two variables have missing data. But before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?
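The idea can be sketched with a single predictor. The Python example below (for illustration only) assumes simple linear regression of y on one fully observed x, fitted by ordinary least squares on the complete cases. The data and the helper name fit_line are made up, and a real analysis would usually use more predictors.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# (x, y) pairs; None marks a missing y. x is assumed fully observed.
pairs = [(1.0, 3.0), (2.0, 5.0), (3.0, None), (4.0, 9.0), (5.0, None)]

# Fit on complete cases only, then predict y wherever it is missing.
complete = [(x, y) for x, y in pairs if y is not None]
slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
imputed = [(x, y if y is not None else intercept + slope * x) for x, y in pairs]

print(slope, intercept)
print(imputed)
```

Every imputed point lies exactly on the fitted line, so the filled-in values carry no residual scatter. That is worth keeping in mind for question c: the relationship between x and y will look tighter after imputation than it really is.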