Nonresponse in Survey Sampling

Nonresponse in Survey Sampling STAT262@UCI

Introduction
- Nonresponse is also called missing data in some areas.
- Two types of nonresponse:
  - Unit nonresponse: the entire unit is missing. It is usually handled by adjusting the sampling weights of the sampled units that respond.
  - Item nonresponse: some measurements are present but at least one is missing.

What is the best way to deal with nonresponses?

NHANES

NHANES
ID                SurveyYr          Gender            Age
0.000000000       0.000000000       0.000000000       0.000000000
Race1             WTINT2YR          WTMEC2YR          SDMVPSU
SDMVSTRA          HomeOwn           HomeRooms         Diabetes
0.000000000       0.006751096       0.007145321       0.041048637
Weight            Poverty           HHIncome          HHIncomeMid
0.043758932       0.090474548       0.102301286       0.102301286
……
Marijuana         SexNumPartYear    AlcoholDay        SexOrientation
0.651505445       0.653082344       0.655398413       0.662593012
Testosterone      SmokeNow          SmokeAge          nPregnancies
0.663627852       0.742127827       0.751194993       0.792933524
nBabies           AgeFirstMarij     RegularMarij      BMICatUnder20yrs
0.805893658       0.816981225       0.817079781       0.834672054
Age1stBaby        UrineVol2         UrineFlow2        PregnantNow
0.844379835       0.866554970       0.867097029       0.871236387
Length            TVHrsDayChild     CompHrsDayChild   AgeRegMarij
0.887399596       0.890208446       0.890208446       0.910313901
DiabetesAge       HeadCirc
0.929187405       0.976642192
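The values listed for the NHANES variables look like per-variable fractions of missing entries, sorted in increasing order; that reading is an assumption, since the slide does not label them. Under that assumption, a minimal sketch of how such a summary could be computed follows. The toy records and the function name `missing_fractions` are hypothetical and are not taken from the NHANES data.

```python
def missing_fractions(rows):
    """Fraction of missing (None) entries per variable, sorted in increasing order."""
    n = len(rows)
    names = rows[0].keys()
    fractions = {name: sum(r[name] is None for r in rows) / n for name in names}
    return sorted(fractions.items(), key=lambda item: item[1])

# Hypothetical toy records; None marks a missing item.
toy = [
    {"Age": 34, "Weight": 70.2, "HeadCirc": None},
    {"Age": 5,  "Weight": None, "HeadCirc": None},
    {"Age": 61, "Weight": 82.0, "HeadCirc": None},
    {"Age": 0,  "Weight": 4.1,  "HeadCirc": 36.5},
]
result = missing_fractions(toy)
for name, fraction in result:
    print(name, fraction)
```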

Possible causes
- The interviewer may not be able to contact the household.
- The person may be ill and unable to respond.
- The person may refuse to participate in the survey.
- The person may refuse to answer some questions.
- Demographic information (age, race, sex, etc.) is usually available and can be used to adjust for nonresponse.

How to Deal with Nonresponse
- Prevent (the best solution)
- Ignore (the least preferred)

How to Deal with Nonresponse
- Subsampling nonrespondents
- Use statistical methods to deal with item nonresponse
- Model missingness using methods similar to those for two-phase sampling
- Imputation and reweighting

1: Effects of Ignoring Nonresponse
- Increasing the sample size does not help: without targeting nonresponse, a larger sample cannot reduce nonresponse bias.
- Nonrespondents might be very different from respondents.
- E.g., item nonresponse for the income item is highest for low- and high-income households.

Nonrespondents might be very different from respondents

The Potential Bias of Ignoring Nonrespondents
- Ignoring nonrespondents leads to biased estimates of population quantities.
- NR: the number of respondents in the population; NM: the number of nonrespondents in the population.
- The bias is small if (1) the respondent and nonrespondent population means are close to each other, or (2) there is little nonresponse.
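The two conditions for a small bias can be read off a standard decomposition of the population mean. This is a sketch in assumed notation: $\bar{y}_U$, $\bar{y}_{RU}$, and $\bar{y}_{MU}$ denote the means of the whole population, of its respondents, and of its nonrespondents, with $N = N_R + N_M$; these symbols do not appear in the text.

```latex
\bar{y}_U = \frac{N_R}{N}\,\bar{y}_{RU} + \frac{N_M}{N}\,\bar{y}_{MU},
\qquad
\bar{y}_{RU} - \bar{y}_U = \frac{N_M}{N}\left(\bar{y}_{RU} - \bar{y}_{MU}\right)
```

An estimator built only from respondents targets $\bar{y}_{RU}$, so its approximate bias is the right-hand side: it vanishes when the two means agree or when $N_M/N$ is close to zero.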

2: Reduce Nonresponse
- The best way to deal with nonresponse is to prevent it.

How to Prevent Nonresponse
- Well-trained interviewers: a well-trained interviewer anticipates potential problems in the data-collection process. Training, workload, and motivation all matter.
- Protection of privacy: questions on sensitive topics such as drug use may draw a large number of refusals.
- Time of survey: a vacation month such as August might not be a good time.

Factors Affecting Nonresponse
- Questionnaire design: wording, order, respondent burden
  - Keep questions short and clear.
  - A short questionnaire can achieve a higher response rate.
- Survey introduction
  - Provides motivation.
  - States what the data will be used for, e.g., to decide which television shows are aired.
- Follow-up
- Test the questions before conducting the survey.

Factors Affecting Nonresponse
- Data-collection method: telephone and mail surveys have lower response rates than in-person surveys.
- E.g., Dillman et al. (1995), a factorial experiment with three factors: a prenotice letter or not, a stamped return envelope or not, a reminder postcard or not.
  - No-No-No: 50.0% response rate
  - Yes-Yes-Yes: 64.3% response rate
  - Yes-No-Yes: 62.7% response rate
  - …

3: Callbacks and Two-Phase Sampling
- Two 1984 Michigan polls on preference for presidential candidates (Traugott 1987)
- Response rate: 65%
  - 59% supported Reagan, 39% supported Mondale
- 21% of the 65% responded on the first call
  - 48% supported Reagan, 45% supported Mondale
- 79% of the 65% responded after multiple attempts

Two-Phase Sampling (Double Sampling)
- In ratio or regression estimation, an auxiliary variable (x) is used to improve the estimates of population quantities for another measurement (y).
- In stratified sampling, stratum membership can be treated as a kind of auxiliary information.
- Both require that the auxiliary information be known a priori.
- Two-phase sampling handles the situation where x is not known in advance but can be observed at a much lower cost than y.

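Two-Phase Sampling with Ratio Estimation
- Phase one: take a probability sample S(1) and measure x on every unit in it. The population total of x is estimated from S(1).
- Phase two: treat S(1) as a “population” and take a subsample S(2). Measure y on the units in S(2), and use the x and y values in S(2) to estimate the totals of x and y.
- The population total of y is then estimated by combining the phase-one estimate of the x total with the two phase-two estimates.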
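One common way to write the estimators for this two-phase ratio scheme is given below. The notation is an assumption: $w_i^{(1)}$ is the phase-one sampling weight and $w_i^{(2)}$ is the conditional weight for selecting unit $i$ into $S^{(2)}$ given $S^{(1)}$; neither symbol appears in the text.

```latex
\hat{t}_x^{(1)} = \sum_{i \in S^{(1)}} w_i^{(1)} x_i, \qquad
\hat{t}_x^{(2)} = \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} x_i, \qquad
\hat{t}_y^{(2)} = \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} y_i

\hat{t}_{yr}^{(2)} = \hat{t}_x^{(1)} \, \frac{\hat{t}_y^{(2)}}{\hat{t}_x^{(2)}}
```

The ratio $\hat{t}_y^{(2)} / \hat{t}_x^{(2)}$ comes from the expensive subsample, while the more precise x total comes from the larger and cheaper phase-one sample.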

Two-Phase Sampling for Stratification
- A simple case: SRS in both phase I and phase II.
- Observations fall into H strata, but the stratum information (xih) is unknown until phase I is completed.
- The stratum information in S(1), i.e., (xih), and the y values in the subsample S(2) are used to estimate the population total.

Callbacks and Two-Phase Sampling
- The population is divided into two strata: respondents and initial nonrespondents.
- Randomly select n units from the population.
  - nR units respond; record the average of their responses.
  - nM units do not respond.
- Select a random subsample of 100v% of the nM nonrespondents and record the average of its responses.
- Assumes that all units in the targeted subsample are reached.

Estimates for the Population Mean and Total
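A minimal sketch of the estimates for the callback design, assuming the usual two-stratum form: the average of the initial respondents and the average of the callback subsample are combined with weights nR/n and nM/n, and the total is N times the estimated mean. The function and argument names are illustrative, and the data are made up.

```python
def callback_estimates(N, n, resp_y, callback_y):
    """Estimate the population mean and total from respondents plus a callback subsample.

    N          -- population size
    n          -- size of the initial simple random sample
    resp_y     -- y values of the n_R initial respondents
    callback_y -- y values from the random subsample of the n_M initial nonrespondents
    """
    n_R = len(resp_y)
    n_M = n - n_R
    ybar_R = sum(resp_y) / n_R
    ybar_M = sum(callback_y) / len(callback_y)
    # Each stratum mean is weighted by that stratum's share of the initial sample.
    mean_hat = (n_R / n) * ybar_R + (n_M / n) * ybar_M
    return mean_hat, N * mean_hat

# Hypothetical data: 6 of 10 sampled units respond; 2 of the 4 nonrespondents are followed up.
mean_hat, total_hat = callback_estimates(
    N=1000, n=10, resp_y=[4, 6, 5, 5, 4, 6], callback_y=[9, 11]
)
print(mean_hat, total_hat)
```

In this made-up example the initial respondents average 5 while the followed-up nonrespondents average 10, so ignoring the nonrespondents would understate the mean.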

4: Mechanisms for Nonresponse
- Why study the mechanisms of nonresponse?
  - Nonresponse is difficult to avoid.
  - To make inferences about the nonrespondents, we have to assume that they are related to the respondents in some way.
- Response probability Φi = Pr(Ri = 1), also called the propensity score.

Missing Completely at Random (MCAR)
- The propensity score Φi does not depend on xi, yi, or the design of the survey.
  - All Φi are equal.
  - The events {Ri = 1} are independent.
  - Pr(Ri = 1 | yi) = Pr(Ri = 1)
- E.g., survey results lost in the mail.
- If data are MCAR, the respondents are representative of the selected sample.

Missing Completely at Random (MCAR)
- If an SRS of size n is taken, the sample of respondents can be treated as an SRS of size nR.
- The sample mean of the respondents is unbiased for the population mean.
- To reach a chosen precision, increase the sample size to account for nonresponse.
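A small sketch of the sample-size inflation in the last point, assuming MCAR and an anticipated response rate chosen by the analyst: divide the number of respondents needed for the target precision by the expected response rate and round up. The function name and the numbers are hypothetical.

```python
import math

def inflated_sample_size(n_needed, expected_response_rate):
    """Number of units to select so that about n_needed of them respond."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("expected_response_rate must be in (0, 1]")
    return math.ceil(n_needed / expected_response_rate)

# Hypothetical planning figures: 400 respondents needed, 65% expected to respond.
n_to_select = inflated_sample_size(400, 0.65)
print(n_to_select)
```

This only restores precision. It does nothing for bias when the data are not MCAR.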

Missing at Random (MAR)
- Also known as “missing at random given covariates” or “ignorable nonresponse”.
- The propensity score (probability of response) Φi depends on xi; conditional on xi, it does not depend on yi.
- Pr(Ri = 1 | xi, yi) = Pr(Ri = 1 | xi)
- E.g., the target population is adults aged 18 and over, and young people are less likely to respond; it is age, not y, that determines Φi.

Missing at Random (MAR)
- The nonresponse can be modeled successfully.
- “Ignorable” means that a model can explain the nonresponse mechanism.
- The nonresponse can be ignored once the model accounts for it.

Nonignorable Nonresponse
- The propensity score (probability of response) Φi depends on yi.
- The dependence cannot be completely explained by xi.
- Models are helpful, but cannot completely adjust for the nonresponse.

Mechanisms of Nonresponse: Summary
- The Φi are useful for determining the type of nonresponse, but they are unknown!
- MCAR and MAR can sometimes be distinguished by model fitting, e.g., by fitting logistic regressions.
- MAR and nonignorable nonresponse are difficult to differentiate.
- We will try to estimate the propensity scores.

5: Weighting Methods
- When all units in the sample respond, the sampling weight wi is the number of units in the population represented by unit i of the sample. In stratified sampling, whj = Nh/nh.
- Zi: indicator for presence in the sample; Ri: indicator for response.
- Pr(Zi = 1, Ri = 1) = Pr(Ri = 1 | Zi = 1) Pr(Zi = 1) = πi Φi
- Φi can be estimated using auxiliary information, e.g., marijuana and age.
- The final weight is 1/(πi Φi), with Φi replaced by its estimate. When there is nonresponse, the weights of the respondents are increased so that the respondents represent the nonrespondents’ share of the population as well as their own.
- MAR is assumed.

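5.1: Weighting-Class Adjustment
- Variables known for all units in the selected sample are used to construct weighting-adjustment classes.
- The probability of response (Φi) is assumed to be the same within each weighting class.
- Within a class, Φi is independent of y (MAR data).
- In class c, the response probability is estimated by the weighted response rate: the sum of the sampling weights of the respondents in class c divided by the sum of the sampling weights of all selected units in class c.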

Auxiliary variables
Case | Associated with nonresponse | Associated with the target variable | Bias         | Precision
1    | Yes                         | No                                  | No effect    | Decrease
2    |                             |                                     | Reduces bias | Unknown (weights become more variable)
3    |                             |                                     |              | Increase

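Weighting-Class Adjustment
- Let wi be the sampling weight for unit i.
- If unit i is in class c, the new weight after accounting for nonresponse is wi divided by the estimated response probability of class c if the unit is a respondent, and 0 otherwise.
- Define xci = 1 if unit i is in class c, and 0 otherwise. The new weight can be written as $\tilde{w}_i = R_i \, w_i \sum_c x_{ci} / \hat{\Phi}_c$, where $\hat{\Phi}_c$ is the estimated response probability in class c and Ri is the response indicator.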

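Weighting-Class Adjustment
- The estimates of the population total and mean use the adjusted weights: the total is the sum of (adjusted weight × y) over the respondents, and the mean is that total divided by the sum of the adjusted weights.
- In an SRS, the estimated mean reduces to a weighted average of the class respondent means, with each class weighted by its share nc/n of the selected sample.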
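The weighting-class adjustment can be sketched in a few lines. This is an illustration under stated assumptions, not the course's implementation: the response probability in each class is estimated by the weighted response rate, and the dictionary keys (`w`, `cls`, `responded`, `y`) and the toy sample are hypothetical.

```python
from collections import defaultdict

def weighting_class_adjust(units):
    """Return nonresponse-adjusted weights, one per unit (0 for nonrespondents)."""
    selected = defaultdict(float)    # sum of sampling weights of all selected units, per class
    responding = defaultdict(float)  # sum of sampling weights of respondents, per class
    for u in units:
        selected[u["cls"]] += u["w"]
        if u["responded"]:
            responding[u["cls"]] += u["w"]
    adjusted = []
    for u in units:
        if u["responded"]:
            phi_hat = responding[u["cls"]] / selected[u["cls"]]  # weighted response rate
            adjusted.append(u["w"] / phi_hat)
        else:
            adjusted.append(0.0)
    return adjusted

# Hypothetical SRS of 4 units from a population of 40 (sampling weight 10 each).
sample = [
    {"w": 10.0, "cls": "young", "responded": True,  "y": 2.0},
    {"w": 10.0, "cls": "young", "responded": False, "y": None},
    {"w": 10.0, "cls": "old",   "responded": True,  "y": 4.0},
    {"w": 10.0, "cls": "old",   "responded": True,  "y": 6.0},
]
new_w = weighting_class_adjust(sample)
total_hat = sum(w * u["y"] for w, u in zip(new_w, sample) if u["responded"])
mean_hat = total_hat / sum(new_w)
print(new_w, total_hat, mean_hat)
```

The single young respondent carries the weight of the young nonrespondent as well, so the adjusted weights still sum to the population size of 40.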

Construction of Weighting Classes
- Units within each class should be as similar as possible, and the response rates should vary from class to class.
- Little (1986):
  - Estimate the probability of response as a function of the known variables (e.g., by logistic regression).
  - Group observations into classes according to the estimated probability of response.
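The two steps attributed to Little (1986) can be sketched as follows. Several choices here are assumptions made only for illustration: a single covariate (age), a plain gradient-ascent fit of the logistic model instead of a statistics package, and equal-sized classes. The data are made up.

```python
import math

def fit_logistic(x, r, steps=5000, lr=0.1):
    """Fit Pr(R=1 | x) = 1 / (1 + exp(-(a + b*z))) by gradient ascent, z = standardized x.

    Returns the fitted response probability for every unit.
    """
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    z = [(v - mean) / sd for v in x]
    a = b = 0.0
    for _ in range(steps):
        p = [1.0 / (1.0 + math.exp(-(a + b * zi))) for zi in z]
        # Gradient of the average log-likelihood with respect to a and b.
        a += lr * sum(ri - pi for ri, pi in zip(r, p)) / n
        b += lr * sum((ri - pi) * zi for ri, pi, zi in zip(r, p, z)) / n
    return [1.0 / (1.0 + math.exp(-(a + b * zi))) for zi in z]

def propensity_classes(phat, n_classes):
    """Rank units by estimated propensity and cut the ranking into equal-sized classes."""
    order = sorted(range(len(phat)), key=lambda i: phat[i])
    classes = [0] * len(phat)
    for rank, i in enumerate(order):
        classes[i] = rank * n_classes // len(phat)
    return classes

# Hypothetical data: older people respond more often.
age = [19, 22, 25, 28, 33, 37, 41, 46, 52, 58, 63, 70]
responded = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1]
phat = fit_logistic(age, responded)
classes = propensity_classes(phat, n_classes=3)
print(classes)
```

The resulting class labels can then serve as the weighting classes in a weighting-class adjustment.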

5.2: Poststratification
- Very similar to weighting-class adjustment; both assume MAR.
- Assumptions:
  - Within each poststratum, each unit in the sample has the same probability of response.
  - The response Ri of a unit is independent of the responses of the other units.
  - Nonrespondents in a poststratum behave like respondents.
- That is, data are MCAR within each poststratum.

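Poststratification: SRS
- In weighting-class adjustment, the class respondent means are combined using the sample shares nc/n.
- In stratified sampling, the stratum means are combined using the known population shares Nh/N.
- Assuming an SRS, the poststratified estimator combines the respondent means of the poststrata using the known population shares Nc/N.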

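Poststratification: Using Weights
- SRS: each respondent in poststratum c receives the weight Nc/ncR, where ncR is the number of respondents in the poststratum.
- Generalization to probability sampling: define xci = 1 if unit i is in class c, and 0 otherwise.
- The weight due to poststratification and nonresponse scales the sampling weight wi of each respondent in class c by Nc divided by the sum of the sampling weights of the respondents in class c.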
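A minimal sketch of poststratification weights for the respondents, assuming the population counts Nc are known from an external source. The function name, the class labels, and the counts are hypothetical.

```python
from collections import defaultdict

def poststratify(weights, classes, pop_counts):
    """Scale respondents' sampling weights so they sum to N_c in each poststratum.

    weights    -- sampling weights of the respondents
    classes    -- poststratum label of each respondent
    pop_counts -- known population count N_c for each poststratum
    """
    resp_weight = defaultdict(float)  # sum of respondents' weights per poststratum
    for w, c in zip(weights, classes):
        resp_weight[c] += w
    return [w * pop_counts[c] / resp_weight[c] for w, c in zip(weights, classes)]

# Hypothetical respondents from an SRS, with known population counts by sex.
w = [50.0, 50.0, 50.0, 50.0, 50.0]
cls = ["F", "F", "F", "M", "M"]
post_w = poststratify(w, cls, {"F": 520, "M": 480})
print(post_w)
```

With equal sampling weights this reduces to Nc/ncR for every respondent in poststratum c, the SRS case on the slide.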

6: Imputation
- Purpose: assign values to the missing items to obtain a “clean”, rectangular data set.
  - Reduce the nonresponse bias.
  - Facilitate multivariate analysis.
- When imputation is used, an additional variable that indicates whether the response was measured or imputed should be created in the data set.

An example

6.1: Deductive Imputation
- Sometimes logical relations among the variables can be used to fill in the missing items.
- E.g., if a woman has two children in year 1 and two children in year 3 but the value for year 2 is missing, the logical value to fill in is 2.
- E.g., if a person answered yes to “violent-crime victim”, the missing value for “crime victim” can be replaced by “yes”.

6.2: Cell Mean Imputation
- Divide respondents into classes (cells) based on known variables.
- Use the average of the values of the responding units in cell c to substitute for each missing value in the same cell.
- Assumes MCAR within each cell.
- Distorts multivariate relationships because imputation is done separately for each missing item.
- The method fails to reflect the variability of the nonrespondents; a stochastic cell mean imputation can be used instead.
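A minimal sketch of cell mean imputation for a single item, with missing values marked as `None`. It also returns an indicator of which values were imputed. The function name and the data are hypothetical.

```python
from collections import defaultdict

def cell_mean_impute(values, cells):
    """Replace each missing value by the mean of the observed values in the same cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, c in zip(values, cells):
        if v is not None:
            sums[c] += v
            counts[c] += 1
    imputed = [v if v is not None else sums[c] / counts[c]
               for v, c in zip(values, cells)]
    flags = [v is None for v in values]  # True where the value was imputed
    return imputed, flags

values = [10, None, 14, 3, 5, None]
cells = ["A", "A", "A", "B", "B", "B"]
imputed, flags = cell_mean_impute(values, cells)
print(imputed, flags)
```

Every imputed value in a cell is identical, which is why the method understates variability.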

Cell Mean Imputation

6.3: Hot-Deck Imputation
- Sample units are divided into classes.
- The value of one responding unit in the class is used to impute the missing items.
- The term “hot deck” dates back to the days when computer programs and data were punched on cards: the deck of cards containing the data set being analyzed was warmed by the card reader.
- “Hot deck” thus refers to imputations made from the same data set.

Hot-Deck imputation: choosing the donor unit
- Sequential hot-deck imputation
  - A carryover from the card days.
  - It assumes that the data are arranged in some geographic order, so that adjacent units in the same subgroup are more similar than randomly chosen units.
  - Impute the value from the unit in the same subgroup that was last read by the computer.
  - E.g., person 19 is missing the response for crime victimization; person 13, the last person read in the same class with a response, is the donor.
  - A person may be a donor multiple times if nonrespondents cluster.
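Sequential hot-deck imputation amounts to a single pass over the file that remembers the last observed value in each class. A minimal sketch, assuming the records are already in file order and missing values are `None`; the names and data are hypothetical.

```python
def sequential_hot_deck(values, cells):
    """Impute each missing value with the last observed value read in the same cell."""
    last_seen = {}
    imputed = []
    for v, c in zip(values, cells):
        if v is not None:
            last_seen[c] = v           # this unit becomes the current donor for its cell
            imputed.append(v)
        else:
            # Stays None if no donor in this cell has been read yet.
            imputed.append(last_seen.get(c))
    return imputed

values = ["yes", None, "no", None, None, "yes"]
cells = ["A", "A", "B", "A", "B", "B"]
result = sequential_hot_deck(values, cells)
print(result)
```

In this toy file the first unit in cell A donates twice, which is the clustering effect noted in the last point.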

Hot-Deck imputation: choosing the donor unit
- Random hot-deck imputation
  - A donor is randomly chosen from the persons in the cell.
  - To preserve multivariate relationships, values from the same donor are usually used for all missing items of a person.
  - E.g., person 10 is missing the two variables for victimization. Persons 3, 5, and 14 in the same cell have both available. One person is randomly chosen from {3, 5, 14} as the donor for person 10.
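A sketch of random hot-deck imputation that takes all of a person's missing items from one donor, as the slide recommends. Restricting donors to persons with every listed item observed is an assumption made for simplicity, and the record layout and item names are hypothetical.

```python
import random

def random_hot_deck(records, cells, items, rng):
    """Fill each person's missing items from one randomly chosen donor in the same cell."""
    imputed = [dict(r) for r in records]  # copy so the input is left unchanged
    for i, rec in enumerate(records):
        missing = [k for k in items if rec[k] is None]
        if not missing:
            continue
        # Donors: persons in the same cell with all items observed.
        donors = [j for j, d in enumerate(records)
                  if cells[j] == cells[i] and all(d[k] is not None for k in items)]
        if donors:
            donor = records[rng.choice(donors)]
            for k in missing:
                imputed[i][k] = donor[k]
    return imputed

records = [
    {"crime": 1, "violent": 1},
    {"crime": 0, "violent": 0},
    {"crime": None, "violent": None},
    {"crime": 1, "violent": 0},
]
cells = ["a", "a", "a", "b"]
result = random_hot_deck(records, cells, ["crime", "violent"], random.Random(0))
print(result[2])
```

Because both items come from the same donor, an impossible combination such as a violent-crime victim who is not a crime victim cannot be created by the imputation.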

Hot-Deck imputation: choosing the donor unit
- Nearest-neighbor hot-deck imputation
  - Define a distance measure between observations.
  - Impute the value of the respondent who is “closest” to the person with the missing item.
  - Closeness is defined using the distance function.
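A minimal sketch of nearest-neighbor hot-deck imputation. The distance used here, the absolute difference in a single fully observed covariate, is an assumption chosen for brevity; any distance function could be substituted. The data are made up.

```python
def nearest_neighbor_impute(x, y):
    """Impute missing y values from the respondent whose covariate x is closest."""
    donors = [i for i, v in enumerate(y) if v is not None]
    out = list(y)
    for i, v in enumerate(y):
        if v is None:
            nearest = min(donors, key=lambda j: abs(x[j] - x[i]))
            out[i] = y[nearest]
    return out

age = [25, 40, 62, 27, 58]         # observed for everyone
income = [30, 45, 70, None, None]  # missing for the last two persons
result = nearest_neighbor_impute(age, income)
print(result)
```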

An example

6.4: Regression Imputation
- Predicts the missing value from a regression of the item of interest on variables observed for all cases.
- E.g., marijuana and age (fitted using the subjects with both age and marijuana observed).

6.4: Regression Imputation
- To add variability, stochastic regression imputation is often used, in which the missing value is replaced by the predicted value plus a random error.
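A sketch of regression imputation with an optional stochastic term. The assumptions here are illustrative: a simple linear regression on one covariate, fitted to the complete cases, with the random error drawn from a normal distribution whose standard deviation is the residual standard deviation. The names and data are hypothetical.

```python
import math
import random

def regression_impute(x, y, rng=None):
    """Impute missing y from a least-squares line; add normal noise if rng is given."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    xbar = sum(xi for xi, _ in pairs) / n
    ybar = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - xbar) ** 2 for xi, _ in pairs)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in pairs) / sxx
    intercept = ybar - slope * xbar
    # Residual standard deviation with n - 2 degrees of freedom.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in pairs)
    resid_sd = math.sqrt(sse / (n - 2))
    out = []
    for xi, yi in zip(x, y):
        if yi is not None:
            out.append(yi)
        else:
            noise = rng.gauss(0.0, resid_sd) if rng is not None else 0.0
            out.append(intercept + slope * xi + noise)
    return out

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, None, 8.1, 9.9]
deterministic = regression_impute(x, y)
stochastic = regression_impute(x, y, rng=random.Random(1))
print(deterministic[2], stochastic[2])
```

Calling the function without `rng` gives plain regression imputation; passing a random generator gives the stochastic version.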

6.5: Cold-Deck Imputation
- Impute values based on a previous survey or other information, such as historical data.
- Neither hot-deck nor cold-deck imputation is guaranteed to eliminate nonresponse bias.

6.6: Substitution
- If a unit cannot be reached, it is sometimes replaced by another unit nearby.
- May reduce nonresponse bias in some situations: the household next door may be more similar to the nonrespondent than a randomly chosen household.
- If the nonresponse is related to the characteristics of interest, there will still be nonresponse bias.

6.7: Multiple Imputation
- In multiple imputation, each missing value is imputed m different times. Usually the same stochastic model is used for each imputation.
- Each of the m data sets is analyzed.
- The variation among the m results reflects the additional variance due to imputation.
- Different models can be used to assess the sensitivity of the results to the model.
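The slide does not say how the m analyses are combined. A common choice is Rubin's combining rules, sketched here as an assumption rather than as the course's method: the point estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The numbers are made up.

```python
def combine_imputations(estimates, variances):
    """Combine m point estimates and their variances using Rubin's rules."""
    m = len(estimates)
    qbar = sum(estimates) / m                                  # combined point estimate
    within = sum(variances) / m                                # average within-imputation variance
    between = sum((q - qbar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between                     # extra variance due to imputation
    return qbar, total

# Hypothetical results from m = 3 imputed data sets.
qbar, total_var = combine_imputations([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
print(qbar, total_var)
```

The between-imputation term is the "additional variance due to imputation": it is zero only if all m imputed data sets give the same estimate.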

6.7: Multiple Imputation www.sas.com/rnd/app/papers/multipleimputation.pdf

7: Acceptable Response Rate
- There are no absolute guidelines for acceptable response rates.
- When data are MCAR, a response rate of about 50% is not bad; however, if nonresponse is associated with the characteristics of interest, even a response rate of 95% can still lead to biased results.
- Response rates may be defined in different ways. When reading reports, check which definition was used.