1
Nonresponse in Survey Sampling
2
Introduction
Nonresponse is also called missing data in some areas.
Two types of nonresponse:
  Unit nonresponse: the entire unit is missing. Handled by adjusting the sampling weights of the sampled units.
  Item nonresponse: some measurements are present but at least one is missing.
3
What is the best way to deal with nonresponses?
4
NHANES
5
NHANES
6
NHANES
Variables: ID, SurveyYr, Gender, Age, Race, WTINT2YR, WTMEC2YR, SDMVPSU, SDMVSTRA, HomeOwn, HomeRooms, Diabetes, Weight, Poverty, HHIncome, HHIncomeMid, ……, Marijuana, SexNumPartYear, AlcoholDay, SexOrientation, Testosterone, SmokeNow, SmokeAge, nPregnancies, nBabies, AgeFirstMarij, RegularMarij, BMICatUnder20yrs, Age1stBaby, UrineVol, UrineFlow, PregnantNow, Length, TVHrsDayChild, CompHrsDayChild, AgeRegMarij, DiabetesAge, HeadCirc
7
Possible causes
The interviewer may not be able to contact the household.
The person may be ill and unable to respond.
The person may refuse to participate in the survey.
The person may refuse to answer some questions.
Demographic information (age, race, sex, etc.) is usually available and can be used to adjust for nonresponse.
8
How to Deal with Nonresponse
Prevent (the best solution)
Ignore (the least preferred)
9
How to Deal with Nonresponse
Subsampling nonrespondents
Use statistical methods to deal with item nonresponse
Model missingness using methods similar to those used for two-phase sampling
Imputation and reweighting
10
1: Effects of Ignoring Nonresponse
Increasing the sample size doesn't help: without targeting nonresponse, a larger sample cannot reduce nonresponse bias.
Nonrespondents might be very different from respondents.
E.g., item nonresponse for the income item is highest for low- and high-income households.
11
Nonrespondents might be very different from respondents
12
Nonrespondents might be very different from respondents
13
The Potential Bias of Ignoring Nonrespondents
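Ignoring nonrespondents leads to biased estimates of population quantities.
NR: the number of population respondents
NM: the number of population nonrespondents
With N = NR + NM, the bias of the respondent mean is approximately (NM/N) × (mean of the population respondents − mean of the population nonrespondents).
The bias is small if (1) the two population means are close to each other, or (2) there is little nonresponse.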
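As a sketch, assume the usual approximation that the bias of the respondent mean equals the nonrespondent share NM/(NR+NM) times the difference between the two population means. The function name and all figures below are hypothetical, chosen only to illustrate the two conditions for a small bias.

```python
def nonresponse_bias(n_resp, n_nonresp, mean_resp, mean_nonresp):
    """Approximate bias of the respondent mean as an estimate of the population mean."""
    share_nonresp = n_nonresp / (n_resp + n_nonresp)
    return share_nonresp * (mean_resp - mean_nonresp)


# Hypothetical population: 7,000 respondents averaging 50, 3,000 nonrespondents averaging 40.
bias = nonresponse_bias(7000, 3000, 50.0, 40.0)

# The same quantity computed directly: respondent mean minus overall population mean.
population_mean = (7000 * 50.0 + 3000 * 40.0) / 10000
direct_bias = 50.0 - population_mean

# Either little nonresponse or similar means shrinks the bias.
small_share = nonresponse_bias(9900, 100, 50.0, 40.0)
close_means = nonresponse_bias(7000, 3000, 50.0, 49.5)
```

Note that the sample size does not appear anywhere in the expression, which is why a larger sample alone cannot remove the bias.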
14
2: Reduce nonresponse
The best way to deal with nonresponse is to prevent it.
15
How to Prevent Nonresponse
Well-trained interviewers
  A well-trained interviewer anticipates potential problems in the data-collection process.
  Training, workload, motivation
Protection of privacy
  Questions on sensitive topics such as drug use may draw a large number of refusals.
Time of survey
  A vacation month such as August might not be a good time.
16
Factors Affecting Nonresponse
Questionnaire design
  Wording
  Order
Respondent burden
  Keep questions short and clear.
  A short questionnaire can achieve a higher response rate.
Survey introduction
  Provides motivation
  States what the data will be used for, e.g., deciding which television shows are aired
Follow-up
Test the questions before conducting the survey.
17
Factors Affecting Nonresponse
Data-collection method: telephone and mail surveys have lower response rates than in-person surveys.
E.g., Dillman et al. (1995): a factorial experiment with three factors
  A prenotice letter or not
  A stamped return envelope or not
  A reminder postcard or not
Response rates by combination (prenotice letter, stamped envelope, reminder postcard):
  No-No-No: 50.0%
  Yes-Yes-Yes: 64.3%
  Yes-No-Yes: 62.7%
  …
18
3: Callbacks and Two-Phase Sampling
Two 1984 Michigan polls on preference for presidential candidates (Traugott 1987)
Response rate: 65%
  59% supported Reagan, 39% supported Mondale
21% of the 65% responded on the first call
  48% supported Reagan, 45% supported Mondale
79% of the 65% responded after multiple attempts
19
Two-Phase Sampling (Double Sampling)
In ratio or regression estimation, an auxiliary variable (x) is used to improve the estimates of population quantities for another measurement (y).
In stratified sampling, the stratum can be treated as a kind of auxiliary information.
Both require that the auxiliary information be known a priori.
Two-phase sampling deals with the situation where x is not known in advance but can be observed at a much lower cost than y.
20
Two-Phase Sampling with Ratio Estimation
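Phase one: take a probability sample S(1) and measure x in the sample. The population total of x can be estimated by the weighted phase-one total t̂x(1), the sum of wi(1) xi over i in S(1).
Phase two: treat S(1) as a “population” and take a subsample S(2). Measure y in S(2), and use the x and y values in S(2) to obtain the weighted totals t̂x(2) and t̂y(2).
Combine t̂x(1), t̂x(2), and t̂y(2) to estimate the population total of y by the ratio estimator t̂x(1) × t̂y(2) / t̂x(2).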
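A minimal sketch of two-phase ratio estimation, assuming simple random sampling in both phases (so the weights are N/n1 and N/n2) and the standard ratio form: phase-one x total times the phase-two ratio of y to x. The function name and the simulated population are hypothetical.

```python
import random


def two_phase_ratio_total(pop_size, x_phase1, x_phase2, y_phase2):
    """Ratio estimate of the y total: phase-one x total times the phase-two y/x ratio."""
    n1, n2 = len(x_phase1), len(x_phase2)
    t_x1 = pop_size / n1 * sum(x_phase1)   # x total estimated from the large, cheap sample
    t_x2 = pop_size / n2 * sum(x_phase2)   # x total estimated from the subsample
    t_y2 = pop_size / n2 * sum(y_phase2)   # y total estimated from the subsample
    return t_x1 * t_y2 / t_x2


# Hypothetical population in which y is roughly proportional to x.
rng = random.Random(1)
N = 10_000
x_pop = [rng.uniform(10, 50) for _ in range(N)]
y_pop = [2.0 * x + rng.gauss(0, 3) for x in x_pop]

phase1 = rng.sample(range(N), 1000)   # x is measured on 1,000 units
phase2 = rng.sample(phase1, 100)      # y is measured on a subsample of 100

estimate = two_phase_ratio_total(
    N,
    [x_pop[i] for i in phase1],
    [x_pop[i] for i in phase2],
    [y_pop[i] for i in phase2],
)
true_total = sum(y_pop)
```

The expensive variable y is measured on only 100 units, while the cheap x measurements from all 1,000 phase-one units anchor the estimate.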
21
Two-Phase Sampling for Stratification
A simple case: SRS in both phase I and phase II.
Observations fall in H strata, but the stratum membership (xih) is unknown until phase I is completed.
The stratum information in S(1), i.e., (xih), and the y values in the subsample S(2) are used to estimate the population total.
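A sketch of this simple case, assuming the usual double-sampling estimator: the phase-I sample supplies the estimated stratum shares, and the phase-II subsample supplies the stratum means of y. The function name and data are hypothetical, and every stratum seen in phase I is assumed to have at least one phase-II unit.

```python
from collections import Counter, defaultdict


def two_phase_stratified_total(pop_size, phase1_strata, phase2_strata, phase2_y):
    """Estimate the y total as N * sum over strata of (phase-I share) * (phase-II mean)."""
    n1 = len(phase1_strata)
    phase1_counts = Counter(phase1_strata)
    y_sums = defaultdict(float)
    y_counts = Counter()
    for stratum, y in zip(phase2_strata, phase2_y):
        y_sums[stratum] += y
        y_counts[stratum] += 1
    mean_hat = sum(
        phase1_counts[h] / n1 * y_sums[h] / y_counts[h] for h in phase1_counts
    )
    return pop_size * mean_hat


# Hypothetical example: 10 phase-I units classified into strata "a" and "b",
# with y measured on a subsample of 3 of them.
phase1_strata = ["a"] * 6 + ["b"] * 4
total_hat = two_phase_stratified_total(1000, phase1_strata, ["a", "a", "b"], [10.0, 20.0, 40.0])
```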
22
Callbacks and Two-Phase Sampling
The population is divided into two strata: respondents and initial nonrespondents.
Randomly select n units from the population.
nR respond; their average is ȳR.
nM do not respond.
Select a random subsample of 100v% of the nM nonrespondents; their average is ȳM.
This assumes that every unit in the targeted subsample is reached.
23
Estimates for the Population Mean and Total
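A minimal sketch, assuming the usual two-phase estimators for callbacks: the mean is estimated by weighting the initial-respondent average by nR/n and the callback-subsample average by nM/n, and the total by N times that mean. The function and argument names are hypothetical.

```python
def callback_estimates(pop_size, n, respondent_y, callback_y):
    """Return (estimated mean, estimated total) from initial respondents and a callback subsample."""
    n_r = len(respondent_y)
    n_m = n - n_r                                   # initial nonrespondents
    ybar_r = sum(respondent_y) / n_r
    ybar_m = sum(callback_y) / len(callback_y)      # stands in for all n_m nonrespondents
    mean_hat = n_r / n * ybar_r + n_m / n * ybar_m
    return mean_hat, pop_size * mean_hat


# Hypothetical sample of n = 10 from N = 1,000: six respond at first,
# and two of the four nonrespondents are reached in the callback subsample.
mean_hat, total_hat = callback_estimates(
    1000, 10, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [10.0, 12.0]
)
```

Each unit in the callback subsample effectively carries the weight of 1/v initial nonrespondents, which is why reaching every targeted unit matters.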
24
4: Mechanisms for Nonresponse
Why study the mechanisms of nonresponse?
Nonresponse is difficult to avoid.
To make inferences about the nonrespondents, we have to assume that they are related to the respondents in some way.
Response probability (propensity score): Φi = Pr(Ri=1), where Ri=1 if unit i responds and 0 otherwise.
25
Missing Completely at Random (MCAR)
The propensity score Φi does not depend on xi, yi, or the design of the survey.
All Φi are equal.
The events {Ri=1} are independent.
Pr(Ri=1| yi )=Pr(Ri=1)
E.g., survey returns lost in the mail.
If data are MCAR, the respondents are representative of the selected sample.
26
Missing Completely at Random (MCAR)
If an SRS of size n is taken, the sample of respondents can be treated as an SRS of size nR.
The sample mean of the respondents is unbiased for the population mean.
To reach a chosen precision, increase the sample size to account for nonresponse.
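Under MCAR the adjustment is simple arithmetic. A small sketch, assuming the anticipated response rate is known in advance; the function name and the figures are hypothetical.

```python
import math


def inflated_sample_size(required_respondents, expected_response_rate):
    """Sample size to draw so that the expected number of respondents meets the target."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("expected_response_rate must be in (0, 1]")
    return math.ceil(required_respondents / expected_response_rate)


# If 400 respondents are needed for the chosen precision and 65% are expected to respond:
n_to_draw = inflated_sample_size(400, 0.65)
```

This protects precision only; it does nothing about bias when the data are not MCAR.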
27
Missing at Random (MAR)
Also known as “missing at random given covariates” or “ignorable nonresponse”.
The propensity score (prob. of response) Φi depends on xi.
Conditional on xi, it doesn't depend on yi: Pr(Ri=1| xi,yi )=Pr(Ri=1| xi)
E.g., the target population is adults 18+; young people are more likely not to respond; it is age, not y, that determines Φi.
28
Missing at Random (MAR)
The nonresponse can be modeled successfully.
“Ignorable” means that a model can explain the nonresponse mechanism.
The nonresponse can be ignored only after the model accounts for it.
29
Nonignorable Nonresponse
The propensity score (prob. of response) Φi depends on yi.
The dependence cannot be completely explained by xi.
Models are helpful but cannot completely adjust for the nonresponse.
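The three mechanisms can be compared in a small simulation. This is only an illustrative sketch: the population, the response probabilities, and the function name are hypothetical. For each mechanism it reports the naive respondent mean and a mean adjusted within the two age classes (the x variable).

```python
import random

rng = random.Random(0)
N = 20_000
# Hypothetical population: x flags young people, and y is lower on average for the young.
young = [rng.random() < 0.4 for _ in range(N)]
y = [rng.gauss(30.0 if is_young else 50.0, 10.0) for is_young in young]
true_mean = sum(y) / N


def simulate(response_prob):
    """Return (naive respondent mean, mean adjusted within the two age classes)."""
    responded = [rng.random() < response_prob(yi, xi) for yi, xi in zip(y, young)]
    resp_y = [yi for yi, r in zip(y, responded) if r]
    naive = sum(resp_y) / len(resp_y)
    adjusted = 0.0
    for group in (True, False):
        group_size = sum(1 for xi in young if xi == group)
        group_resp = [yi for yi, xi, r in zip(y, young, responded) if r and xi == group]
        adjusted += group_size / N * sum(group_resp) / len(group_resp)
    return naive, adjusted


mcar = simulate(lambda yi, is_young: 0.6)                              # same for everyone
mar = simulate(lambda yi, is_young: 0.3 if is_young else 0.8)          # depends on x only
nonignorable = simulate(lambda yi, is_young: 0.8 if yi > 40 else 0.3)  # depends on y
```

In this setup the naive mean is close to the truth only under MCAR, the class adjustment repairs the MAR case, and under nonignorable nonresponse the adjusted mean remains biased.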
30
Mechanisms of Nonresponse: Summary
The Φi are useful for determining the type of nonresponse, but they are unknown!
MCAR and MAR can sometimes be distinguished by model fitting, e.g., by fitting logistic regressions.
MAR and nonignorable nonresponse are difficult to differentiate.
We will try to estimate the propensity scores.
31
5: Weighting Methods
When all units in the sample respond, the sampling weight wi is the number of units in the population represented by unit i of the sample. In stratified sampling: whj = Nh/nh.
When there is nonresponse, the weights of respondents are increased so that the respondents represent the nonrespondents' share of the population as well as their own.
Zi: indicator for presence in the sample
Ri: indicator for response
Pr(Zi=1, Ri=1) = Pr(Ri=1|Zi=1) Pr(Zi=1) = πiΦi
Φi can be estimated using auxiliary information, e.g., age when the item of interest is marijuana use.
The final weight of a respondent is 1/(πiΦ̂i), where Φ̂i is the estimate of Φi.
MAR is assumed.
32
5.1: Weighting-Class Adjustment
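Variables known for all units in the selected sample are used to construct weighting-adjustment classes.
The probability of response (Φi) is assumed to be the same within each weighting class.
Within a class, Φi is independent of y (MAR data).
In class c, the estimate is Φ̂c = (sum of the weights of the respondents in class c) / (sum of the weights of the selected sample in class c).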
33
Auxiliary variables
Three cases (1, 2, 3) are compared by whether the auxiliary variable is associated with nonresponse and with the target variable, and by its effect on bias and precision.
34
Auxiliary variables
Case 1: associated with nonresponse: Yes; associated with the target variable: No; bias: no effect; precision: decrease
Case 2: bias: reduced; precision: unknown (weights become more variable)
Case 3: precision: increase
35
Weighting-Class Adjustment
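Let wi be the sampling weight for unit i, and let Φ̂c be the estimated response probability in class c.
If unit i is in class c, the new weight after accounting for nonresponse is wi/Φ̂c if the unit is a respondent and 0 otherwise.
Define xci=1 if unit i is in class c, and 0 otherwise. The new weight of a respondent can then be written as w̃i = wi × Σc (xci/Φ̂c).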
36
Weighting-Class Adjustment
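The estimates of the population total and mean are t̂ = Σ w̃i yi and ȳ̂ = t̂ / Σ w̃i, where w̃i is the nonresponse-adjusted weight and the sums run over the respondents.
In an SRS, the estimated response probability in class c is ncR/nc, so ȳ̂ = Σc (nc/n) ȳcR, where nc is the number of sampled units in class c, ncR the number of respondents in class c, and ȳcR the respondent average in class c.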
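A minimal sketch of weighting-class adjustment, assuming the response probability of each class is estimated by the respondents' share of the sampling weight in that class. The function name, class labels, and data are hypothetical.

```python
from collections import defaultdict


def weighting_class_adjust(weights, classes, responded):
    """Return nonresponse-adjusted weights and the estimated response probability per class."""
    total_weight = defaultdict(float)
    respondent_weight = defaultdict(float)
    for w, c, r in zip(weights, classes, responded):
        total_weight[c] += w
        if r:
            respondent_weight[c] += w
    phi_hat = {c: respondent_weight[c] / total_weight[c] for c in total_weight}
    adjusted = [w / phi_hat[c] if r else 0.0 for w, c, r in zip(weights, classes, responded)]
    return adjusted, phi_hat


# Hypothetical SRS of n = 10 from N = 100, so every sampling weight is 10.
weights = [10.0] * 10
classes = ["young"] * 4 + ["old"] * 6
responded = [True, True, False, False, True, True, True, True, False, False]
y = [2.0, 4.0, None, None, 10.0, 12.0, 14.0, 16.0, None, None]

adjusted, phi_hat = weighting_class_adjust(weights, classes, responded)
total_hat = sum(w * yi for w, yi in zip(adjusted, y) if yi is not None)
mean_hat = total_hat / sum(adjusted)
```

In this example the adjusted weights still add up to N = 100, and the estimated mean agrees with the SRS form (4/10) × 3 + (6/10) × 13.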
37
Construction of Weighting Classes
Units within each class should be as similar as possible, and the response rates should vary from class to class.
Little (1986):
  Estimate the probability of response as a function of the known variables (e.g., by logistic regression).
  Group observations into classes according to the estimated probability of response.
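The two steps can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure from Little (1986) itself: a single covariate, a logistic model fitted by simple gradient ascent, and classes formed as equal-sized groups ordered by estimated propensity. All names and the simulated data are hypothetical.

```python
import math
import random


def fit_logistic(x, responded, steps=500, learning_rate=0.5):
    """Fit Pr(R=1 | x) = 1 / (1 + exp(-(a + b*x))) by gradient ascent on the log-likelihood."""
    a = b = 0.0
    n = len(x)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, ri in zip(x, responded):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            grad_a += ri - p
            grad_b += (ri - p) * xi
        a += learning_rate * grad_a / n
        b += learning_rate * grad_b / n
    return a, b


def propensity_classes(x, a, b, n_classes):
    """Assign units to n_classes equal-sized groups ordered by estimated propensity."""
    scores = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
    order = sorted(range(len(x)), key=lambda i: scores[i])
    classes = [0] * len(x)
    for rank, i in enumerate(order):
        classes[i] = rank * n_classes // len(x)
    return classes


# Hypothetical data: x is a standardized covariate known for every sampled unit.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(1000)]
responded = [1 if rng.random() < 1 / (1 + math.exp(-(0.5 + xi))) else 0 for xi in x]

a_hat, b_hat = fit_logistic(x, responded)
classes = propensity_classes(x, a_hat, b_hat, n_classes=4)
rates = [
    sum(r for r, c in zip(responded, classes) if c == k) / classes.count(k)
    for k in range(4)
]
```

The observed response rates in `rates` should differ clearly between the lowest and highest class, which is what makes the classes useful for weighting.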
38
5.2: Poststratification
Very similar to weighting-class adjustment; both assume MAR.
Assumptions:
  Within each poststratum, each unit in the sample has the same probability of response.
  The response of a unit (Ri) is independent of the responses of all other units.
  Nonrespondents in a poststratum behave like respondents.
That is, data are MCAR within each poststratum.
39
Poststratification: SRS
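In weighting-class adjustment, the class shares of the population are estimated from the sample.
In poststratification, as in stratified sampling, the population counts Nc of the poststrata are known.
Assuming an SRS, the poststratified estimator of the population mean is ȳpost = Σc (Nc/N) ȳcR, where ȳcR is the respondent average in poststratum c.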
40
Poststratification: Using Weights
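SRS: the weight of a respondent in poststratum c is Nc/ncR, where ncR is the number of respondents in c.
Generalization to probability sampling:
Define xci=1 if unit i is in class c, and 0 otherwise.
Weight due to poststratification and nonresponse: w*i = wi × Σc [xci Nc / (Σj Rj wj xcj)], where the inner sum runs over the sample, so that the respondents' weights in class c add up to Nc.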
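A sketch of poststratification weights, assuming the population count of each poststratum is known and each respondent's weight is scaled so that respondent weights in a poststratum add up to that count. The function name, labels, and counts are hypothetical.

```python
from collections import defaultdict


def poststratify(weights, classes, responded, population_counts):
    """Scale respondent weights so that they sum to the known count in each poststratum."""
    respondent_weight = defaultdict(float)
    for w, c, r in zip(weights, classes, responded):
        if r:
            respondent_weight[c] += w
    return [
        w * population_counts[c] / respondent_weight[c] if r else 0.0
        for w, c, r in zip(weights, classes, responded)
    ]


# Hypothetical SRS of n = 10 from N = 100 with known counts: 30 young, 70 old.
weights = [10.0] * 10
classes = ["young"] * 4 + ["old"] * 6
responded = [True, True, False, False, True, True, True, True, False, False]
y = [2.0, 4.0, None, None, 10.0, 12.0, 14.0, 16.0, None, None]

post_weights = poststratify(weights, classes, responded, {"young": 30, "old": 70})
total_hat = sum(w * yi for w, yi in zip(post_weights, y) if yi is not None)
mean_hat = total_hat / sum(post_weights)
```

Here the estimated mean equals (30/100) × 3 + (70/100) × 13, the known population shares times the respondent averages.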
41
6: Imputation
Purpose: assign values to the missing items to obtain a “clean”, rectangular data set.
  Reduce the nonresponse bias
  Enable multivariate analysis
When imputation is used, the data set should include an additional variable that indicates whether each response was measured or imputed.
42
An example
43
6.1: Deductive Imputation
Sometimes logical relations among the variables can be used to fill in the missing items.
E.g., if a woman has two kids in year 1 and two kids in year 3 but is missing the value for year 2, the logical value to fill in is 2.
E.g., if a person answered yes to “violent-crime victim”, a missing value for “crime victim” can be replaced by “yes”.
44
6.2: Cell Mean Imputation
Divide respondents into classes (cells) based on known variables.
Use the average of the values of the responding units in cell c to substitute for each missing value in the same cell.
Assumes the data are MCAR within each cell.
Distorts multivariate relationships because imputation is done separately for each missing item.
The method fails to reflect the variability of the nonrespondents' values; a stochastic cell mean imputation can be used instead.
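A minimal sketch of cell mean imputation for one item, with missing values coded as `None` and a flag recording which values were imputed. The function name and data are hypothetical, and every cell is assumed to contain at least one respondent.

```python
from collections import defaultdict


def cell_mean_impute(values, cells):
    """Replace each missing value by the respondent mean of its cell; also return imputation flags."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, c in zip(values, cells):
        if v is not None:
            sums[c] += v
            counts[c] += 1
    completed, imputed_flag = [], []
    for v, c in zip(values, cells):
        if v is None:
            completed.append(sums[c] / counts[c])
            imputed_flag.append(True)
        else:
            completed.append(v)
            imputed_flag.append(False)
    return completed, imputed_flag


values = [1.0, None, 3.0, 10.0, None]
cells = ["a", "a", "a", "b", "b"]
completed, imputed_flag = cell_mean_impute(values, cells)
```

Every imputed value in a cell is identical, which is why the completed data understate the variability of the item.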
45
Cell Mean Imputation
46
6.3: Hot-Deck Imputation
Sample units are divided into classes.
The value of one responding unit in the class is used to impute the missing items.
The “hot” deck dates back to the days when computer programs and data were punched on cards: the deck of cards containing the data set being analyzed was warmed by the card reader.
“Hot deck” thus refers to imputations made from the same data set.
47
Hot-Deck imputation: choosing the donor unit
Sequential hot-deck imputation
A carryover from the card days.
It assumes that the data are arranged in some geographic order, so that adjacent units in the same subgroup are more similar than randomly chosen units.
Impute the value from the unit in the same subgroup that was last read by the computer.
E.g., person 19 is missing the response for crime victimization; person 13, the last person read in the same class with a response, is the donor.
A person may be a donor multiple times if nonrespondents cluster.
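A sketch of the sequential rule, assuming the records are already in file order and missing values are coded as `None`. The function name and data are hypothetical; a missing value with no earlier respondent in its class is left missing.

```python
def sequential_hot_deck(values, classes):
    """Impute each missing value from the last respondent read in the same class."""
    last_respondent = {}            # class -> index of the most recent respondent
    completed, donors = [], []
    for index, (value, cls) in enumerate(zip(values, classes)):
        if value is not None:
            last_respondent[cls] = index
            completed.append(value)
            donors.append(None)
        elif cls in last_respondent:
            donor = last_respondent[cls]
            completed.append(values[donor])
            donors.append(donor)
        else:
            completed.append(None)  # no donor has been read yet for this class
            donors.append(None)
    return completed, donors


values = ["yes", None, "no", None, None]
classes = ["a", "b", "a", "a", "a"]
completed, donors = sequential_hot_deck(values, classes)
```

The last two records share the same donor, illustrating how one respondent is reused when nonrespondents cluster.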
48
Hot-Deck imputation: choosing the donor unit
Random hot-deck imputation
A donor is randomly chosen from the persons in the cell.
To preserve multivariate relationships, values from the same donor are usually used for all missing items of a person.
E.g., person 10 is missing the two victimization variables. Persons 3, 5, and 14 in the same cell have both available, so one of {3, 5, 14} is randomly chosen as the donor for person 10.
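A sketch of the random rule using one donor per person for all missing items. The record layout (keys `person`, `cell`, `violent`, `victim`), the function name, and the answers are hypothetical; only the person numbers follow the example above.

```python
import random
from collections import defaultdict


def random_hot_deck(records, items, rng):
    """Fill missing items from one randomly chosen complete donor in the same cell."""
    donors = defaultdict(list)
    for rec in records:
        if all(rec[item] is not None for item in items):
            donors[rec["cell"]].append(rec)
    completed = []
    for rec in records:
        filled = dict(rec, donor=None)          # copy, so the input is left untouched
        missing = [item for item in items if rec[item] is None]
        if missing and donors[rec["cell"]]:
            donor = rng.choice(donors[rec["cell"]])
            for item in missing:
                filled[item] = donor[item]
            filled["donor"] = donor["person"]
        completed.append(filled)
    return completed


records = [
    {"person": 3, "cell": "A", "violent": "no", "victim": "yes"},
    {"person": 5, "cell": "A", "violent": "no", "victim": "no"},
    {"person": 10, "cell": "A", "violent": None, "victim": None},
    {"person": 14, "cell": "A", "violent": "yes", "victim": "yes"},
    {"person": 7, "cell": "B", "violent": "no", "victim": "no"},
]
completed = random_hot_deck(records, ["violent", "victim"], random.Random(7))
person_10 = next(rec for rec in completed if rec["person"] == 10)
```

Taking both items from the same donor keeps impossible combinations, such as a violent-crime victim who is not a crime victim, out of the completed data.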
49
Hot-Deck imputation: choosing the donor unit
Nearest-neighbor hot-deck imputation
Define a distance measure between observations.
Impute the value from the respondent who is “closest”, under that distance, to the person with the missing item.
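A sketch with one possible distance, squared Euclidean distance on unscaled covariates. That choice, the function name, and the data are assumptions made for illustration; in practice the covariates would usually be scaled first.

```python
def nearest_neighbor_impute(covariates, values):
    """Impute each missing value from the respondent with the closest covariate vector."""
    respondents = [i for i, v in enumerate(values) if v is not None]

    def squared_distance(i, j):
        return sum((a - b) ** 2 for a, b in zip(covariates[i], covariates[j]))

    completed = list(values)
    for i, value in enumerate(values):
        if value is None:
            donor = min(respondents, key=lambda j: squared_distance(i, j))
            completed[i] = values[donor]
    return completed


# Hypothetical covariates (age, household size); the last two persons are missing y.
covariates = [(20, 1), (45, 3), (22, 1), (50, 4)]
values = [5.0, 9.0, None, None]
completed = nearest_neighbor_impute(covariates, values)
```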
50
An example
51
6.4: Regression Imputation
Predicts the missing value by using a regression of the item of interest on variables observed for all cases.
E.g., regress marijuana use on age, fitting the model with the subjects who have both age and marijuana recorded.
52
6.4: Regression Imputation
To add variability, stochastic regression imputation is often used, in which the missing value is replaced by the predicted value plus a random error.
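A sketch of both variants with a single predictor, assuming a straight-line model fitted by least squares to the complete cases and normal errors with the residual standard deviation. The function names and data are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept, slope, and residual standard deviation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = (sum(e * e for e in residuals) / (n - 2)) ** 0.5
    return intercept, slope, sigma


def regression_impute(x, y, rng=None):
    """Impute missing y from the fitted line; add a random error when rng is given."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope, sigma = fit_line([c[0] for c in complete], [c[1] for c in complete])
    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            yi = intercept + slope * xi
            if rng is not None:
                yi += rng.gauss(0.0, sigma)   # stochastic regression imputation
        completed.append(yi)
    return completed


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, None, 8.2, 9.8]
deterministic = regression_impute(x, y)
stochastic = regression_impute(x, y, rng=random.Random(3))
```

The deterministic version places every imputed value exactly on the fitted line; the stochastic version scatters them around it.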
53
6.5: Cold-Deck Imputation
Impute values based on a previous survey or other information, such as historical data.
Neither hot-deck nor cold-deck imputation is guaranteed to eliminate nonresponse bias.
54
6.6: Substitution
If a unit cannot be reached, it is sometimes replaced by another unit nearby.
This may reduce nonresponse bias in some situations: the household next door may be more similar to the nonrespondent than a randomly chosen household.
If the nonresponse is related to the characteristics of interest, there will still be nonresponse bias.
55
6.7: Multiple Imputation
In multiple imputation, each missing value is imputed m different times.
Usually the same stochastic model is used for each imputation.
Each of the m data sets is analyzed.
The variation among the m results reflects the additional variance due to imputation.
Different models can be used to assess the sensitivity of the results to the model.
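A sketch of how the m analyses are commonly combined, assuming Rubin's combining rules: the point estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def combine_imputations(estimates, variances):
    """Combine m completed-data estimates and their variances into one estimate and variance."""
    m = len(estimates)
    point_estimate = mean(estimates)
    within = mean(variances)        # average variance within each completed data set
    between = variance(estimates)   # sample variance of the m estimates
    total_variance = within + (1 + 1 / m) * between
    return point_estimate, total_variance


# Hypothetical results from m = 3 imputed data sets.
point_estimate, total_variance = combine_imputations([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
```

The between-imputation term is the additional variance due to imputation; analyzing a single imputed data set as if it were complete would omit it.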
56
6.7: Multiple Imputation
57
7: Acceptable Response Rate
There are no absolute guidelines for acceptable response rates.
When data are MCAR, a response rate of about 50% is not bad; however, if nonresponse is associated with the characteristics of interest, even a response rate of 95% can still lead to biased results.
Response rates may be defined differently. When reading reports, check which definition was used.