1
Unit and item non response
ESTP Training Course “Quality Management and Survey Quality Measurement”, Rome, 24–27 September 2013
Giorgia Simeoni, Researcher, Unit “Quality, Audit and Harmonisation”, Istat
Marcello D’Orazio, Senior Researcher, Chief, Unit “Methodologies in agriculture statistics”, Istat
2
Framework on nonresponse - definition
Nonresponse is a non-observation error: an unsuccessful attempt to obtain the desired information from an eligible unit selected for the survey.
Two types of nonresponse:
- Unit nonresponse is a complete failure to obtain data from a sample unit.
- Item nonresponse occurs when a respondent provides some, but not all, of the requested information, or when the reported information is not usable.
The distinction between unit and item nonresponse is not always clear.
3
Framework on nonresponse - definition
Considering only nonresponse and sampling errors*:
- Target/frame population: U, of size N
- Sample: s, of size n
- Response set: r, of size m
Nonresponse in multistage sampling can occur at each stage of sampling.
Nonresponse in longitudinal surveys can affect one or more waves. “Attrition”: nonresponse increases over time.
*Särndal and Lundström, 2005
4
Framework on nonresponse – causes and effects
Reasons for nonresponse:
- Unit nonresponse: refusal, no contact, inability to participate…
- Item nonresponse: interrupted interview, refusal, a due question skipped, “don’t know”
Why is it important to collect information on these reasons? To direct improvement efforts.
Impact of nonresponse on final survey estimates:
- Slight increase in variance (reduced sample size)
- Possible bias
Item nonresponse can also cause problems when estimating relationships among survey variables.
5
Framework on nonresponse – relevant topics
- Preventive actions
- Indirect measurement of nonresponse: unit and item nonresponse rates, unweighted and weighted
- Tools to monitor nonresponse errors during surveys
- Preventing variance increase due to nonresponse
- Direct measurement of nonresponse: measuring nonresponse bias
- Dealing with nonresponse: weighting procedures for unit nonresponse, imputation procedures for item nonresponse
6
Tools to monitor nonresponse errors during surveys
Quality indicators on unit nonresponse and on reasons for nonresponse.
Standard outcome classification:
- Total units: resolved units and unresolved units
- Resolved units: in-scope units and out-of-scope units
- In-scope units: respondents and nonrespondents
Adapted from Hidiroglou et al. (1993).
7
Standard outcome classification
Adapted from Hidiroglou et al. (1993).
8
Standard outcome definitions
- Nonrespondents: units for which it has not been possible to obtain information
- Respondents: units for which it has been possible to obtain information
- Refusals: in-scope units that explicitly refuse to participate in the survey
- No Contacts: in-scope units that it has not been possible to contact
- Other Nonrespondents: in-scope units that have been contacted but have not been able to provide the required information (e.g. ill or elderly persons)
- No Contacts Due to Frame Errors: in-scope units that have not been contacted because of errors or incomplete information in the frame (e.g. wrong addresses)
- No Contacts Due to Other Reasons: in-scope units that have not been contacted because they could not be found (e.g. a family away on holiday)
9
Quality indicators on reasons for non response
Indicators on the causes of nonresponse are useful during field monitoring to direct efforts. On the basis of the standard classification, ad hoc indicators can be defined and calculated for specific objectives.
10
Quality indicators on reasons for non response
Unit nonresponse rate, refusal rate, other-reasons rate, no-contact rate.
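The formulas for these indicators did not survive in the transcript. The sketch below shows one plausible way to compute them from the standard outcome classification. It assumes that every rate uses the number of resolved in-scope units as its denominator; that choice, the function name and the counts are illustrative assumptions, not the definitions used on the slide.

```python
def nonresponse_rates(respondents, refusals, no_contacts, other_nonrespondents):
    """Unweighted outcome rates (hypothetical helper).

    Assumption: the denominator of every rate is the number of
    resolved in-scope units (respondents + nonrespondents).
    """
    nonrespondents = refusals + no_contacts + other_nonrespondents
    in_scope = respondents + nonrespondents
    if in_scope == 0:
        raise ValueError("no in-scope units")
    return {
        "unit_nonresponse": nonrespondents / in_scope,
        "refusal": refusals / in_scope,
        "no_contact": no_contacts / in_scope,
        "other_reasons": other_nonrespondents / in_scope,
    }


# Illustrative counts, not taken from any real survey.
rates = nonresponse_rates(respondents=800, refusals=120,
                          no_contacts=50, other_nonrespondents=30)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Weighted versions would replace the counts with sums of design weights over the same categories.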
11
Preventing variance increase due to nonresponse
Additional nonresponse variance is due to the reduction of the sample size. It can be prevented by:
- Oversampling: at the planning stage, the sample size is inflated according to the expected nonresponse rate. Example: m = 1,000 respondents are needed and a nonresponse rate of about 20% is expected, so a sample of n = 1,000 / (1 − 0.20) = 1,250 units is selected.
- Substitutions: additional (reserve) sample units are selected. A nonrespondent unit of the base sample is replaced with a “substitute” unit with similar characteristics, chosen from the reserve set. Not recommended in practice!
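The oversampling arithmetic can be sketched in a few lines: the required number of respondents is divided by the expected response rate and rounded up. The function name is hypothetical, and exact fractions are used only to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction


def inflated_sample_size(target_respondents, expected_nonresponse_rate):
    """Sample size n needed to expect `target_respondents` respondents.

    Hypothetical helper: n = m / (1 - expected nonresponse rate), rounded up.
    """
    rate = Fraction(str(expected_nonresponse_rate))
    if not 0 <= rate < 1:
        raise ValueError("expected nonresponse rate must be in [0, 1)")
    return math.ceil(Fraction(target_respondents) / (1 - rate))


n = inflated_sample_size(1000, 0.20)
print(n)  # 1250, as in the example on the slide
```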
12
Measuring nonresponse bias
To study the impact of nonresponse on estimates, it is necessary to assume a model for nonresponse. Two possible models:
- Deterministic model: simple and easy to understand, but quite unrealistic
- Model based on the response propensity: more realistic, slightly more complex, useful for studying relationships between nonresponse and target variables
13
Measuring nonresponse bias: deterministic model
N = number of units in the population U.
$\bar{Y}$ = unknown mean of the variable Y in the population U.
- Population/Frame: U, number of units N, mean $\bar{Y}$
- Respondents: $U_R$, number of units $N_R$, mean $\bar{Y}_R$
- Nonrespondents: $U_{NR}$, number of units $N_{NR} = N - N_R$, mean $\bar{Y}_{NR}$
Adapted from Biemer and Lyberg, 2003
14
Measuring nonresponse bias: deterministic model
A sample s is drawn from U under a Simple Random Sampling (SRS) scheme. Let r be the set of responding units, of size m.
- Population/Frame: U, number of units N, mean $\bar{Y}$
- Respondents: $U_R$, number of units $N_R$, mean $\bar{Y}_R$
- Nonrespondents: $U_{NR}$, number of units $N_{NR} = N - N_R$, mean $\bar{Y}_{NR}$
- Sample: s, size n, mean $\bar{y}_s$
- Responding units: r, number of units m, mean $\bar{y}_r$
$\bar{y}_r$ is an unbiased estimator of $\bar{Y}_R$; however, it is a biased estimator of $\bar{Y}$.
15
Measuring nonresponse bias: deterministic model
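The bias of $\bar{y}_r$ is given by:
$\mathrm{Bias}(\bar{y}_r) = \bar{Y}_R - \bar{Y} = \dfrac{N_{NR}}{N}\left(\bar{Y}_R - \bar{Y}_{NR}\right)$
Based on this model, the bias due to nonresponse depends on two elements:
1) The rate of nonrespondents in the population
2) The difference in mean behaviour between respondents and nonrespondents in the population
If $\bar{Y}_R = \bar{Y}_{NR}$, the bias is zero whatever the nonresponse rate.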
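A small numerical check may make the two elements concrete. Under the deterministic model the bias is the population share of nonrespondents times the difference between the respondent and nonrespondent means. The function name and the figures below are illustrative assumptions.

```python
def deterministic_nonresponse_bias(N, N_R, mean_R, mean_NR):
    """Bias of the respondent mean under the deterministic model.

    N       : population size
    N_R     : number of respondents in the population
    mean_R  : mean of Y among respondents
    mean_NR : mean of Y among nonrespondents
    """
    N_NR = N - N_R
    return (N_NR / N) * (mean_R - mean_NR)


# Illustrative figures: 20% nonrespondents, means 50 vs 40.
N, N_R, mean_R, mean_NR = 10_000, 8_000, 50.0, 40.0
bias = deterministic_nonresponse_bias(N, N_R, mean_R, mean_NR)

# Cross-check against the definition: respondent mean minus overall mean.
overall_mean = (N_R * mean_R + (N - N_R) * mean_NR) / N
print(bias, mean_R - overall_mean)
```

With equal means in the two groups the function returns zero for any nonresponse rate, which is why the rate alone only signals potential bias.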
16
Measuring nonresponse bias: deterministic model
Nonresponse rate is an indicator of POTENTIAL bias
17
Measuring nonresponse bias: response propensities
Model based on the response propensity: each element i of the population has its own unobservable propensity (probability) to respond, $\rho_i$. Under this assumption, the bias due to nonresponse is approximately:
$\mathrm{Bias}(\bar{y}_r) \approx \dfrac{C(Y,\rho)}{\bar{\rho}}$
where:
- $C(Y,\rho)$ = covariance between the variable Y and the propensity to respond in the population
- $\bar{\rho}$ = expected (mean) response propensity in the population, computed with respect to both the different possible samples (under the sampling design) and the different possible observed data (given the data collection protocol)
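As a rough illustration of the propensity model, the sketch below computes the covariance between Y and the response propensity divided by the mean propensity, and compares it with the propensity-weighted mean of Y minus the population mean (a common approximation to the expected respondent mean). The population values and propensities are made up for the example.

```python
def propensity_bias(y, rho):
    """Approximate nonresponse bias: Cov(Y, rho) / mean(rho).

    Uses the population covariance (divisor N).
    """
    N = len(y)
    y_bar = sum(y) / N
    rho_bar = sum(rho) / N
    cov = sum((yi - y_bar) * (ri - rho_bar) for yi, ri in zip(y, rho)) / N
    return cov / rho_bar


def propensity_weighted_mean(y, rho):
    """Approximate expected respondent mean: sum(rho * y) / sum(rho)."""
    return sum(ri * yi for yi, ri in zip(y, rho)) / sum(rho)


# Hypothetical population: units with larger Y respond less often.
y = [10.0, 20.0, 30.0, 40.0]
rho = [0.9, 0.8, 0.6, 0.5]

bias = propensity_bias(y, rho)
direct = propensity_weighted_mean(y, rho) - sum(y) / len(y)
print(bias, direct)
```

The two quantities coincide algebraically, so the bias vanishes whenever Y and the propensity are uncorrelated, however low the mean propensity is.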
18
Measuring nonresponse bias: response propensities
Under this model, nonresponse bias is zero if the covariance $C(Y,\rho)$ between the target variable Y and the response propensity is zero. Different stochastic models can be assumed for the relationship between the target variable Y and the propensity to respond, and each model implies a different effect on the bias.
19
Measuring nonresponse bias: response propensities
Variable(s) Z determine the response propensity, but they are distinct from, and uncorrelated with, the variable(s) X that determine Y. Under this model $C(Y,\rho)=0 \Rightarrow$ Bias $=0$. In other words, under this model the nonresponse rate, whatever its level, has no effect on the bias. Unrealistic model (Groves, 2006).
20
Measuring nonresponse bias: response propensities
The variable(s) Z determine both the propensity to respond and the variables X. This yields $C(Y,\rho)>0$. Removing the effect of Z (i.e. analysing the data within homogeneous classes defined with respect to Z) again gives $C(Y,\rho)=0 \Rightarrow$ Bias $=0$. This is the model assumed when dealing with nonresponse (Groves, 2006).
21
Measuring nonresponse bias: response propensities
The variable of interest Y itself determines the propensity to respond. In this case $C(Y,\rho)\neq 0 \Rightarrow$ Bias $\neq 0$. Under this model, traditional corrections have no effect (Groves, 2006).
22
Measuring nonresponse bias
Summarising:
- The nonresponse rate can have an impact on the bias of estimates…
- …but a key role is played by the relationship between the response propensity and the target variable.
This implies that in the same survey, with the same nonresponse rate, different variables may have very different levels of bias in their estimates (Groves, 2006; Olsen, 2006).
This is why it is difficult to define a threshold for the “acceptable” level of nonresponse that can be tolerated in a survey.
23
Measuring nonresponse bias
Evaluation methods:
- Indirect analysis: comparison of nonresponse rates across subgroups; if the rates are similar, the bias due to nonresponse can be assumed to be negligible
- Identification studies
- Ad hoc surveys on a subsample of nonrespondents
24
Identification studies*
Characteristics of respondents and nonrespondents are compared on auxiliary variables (e.g. socio-demographic variables for individuals) available for all units from the sampling frame or from external sources. If the distribution of the auxiliary variables is similar for respondents and nonrespondents, then nonresponse bias can be assumed ignorable.
Variations:
- In panel surveys, data from previous waves can be used.
- Characteristics of early and late respondents (or respondents after several follow-ups) can be compared, assuming the latter are similar to nonrespondents.
- Census data can be used.
* Lessler, Kalsbeek (1992). FCSM (2001)
25
Ad hoc surveys on a subsample of nonrespondents*
A random subsample of nonrespondents is selected and a major effort is made to obtain responses from each unit of the subsample. If all the units in the subsample respond, unbiased estimates can be derived by combining the data obtained in the two surveys (a stochastic model is assumed for the response mechanism of the main survey).
* Cochran (1977), Särndal (1992)
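One way to combine the two data sets, sketched below, is the classical double-sampling estimator of the mean: the respondent mean and the follow-up subsample mean are weighted by the shares of respondents and nonrespondents in the original sample. The sketch assumes simple random sampling with equal weights and full response in the subsample; the slides do not state which estimator was intended, and the names and figures are illustrative.

```python
def two_phase_mean(n, m, mean_respondents, mean_followup):
    """Combine main-survey respondents with a follow-up subsample.

    n                : size of the original sample
    m                : number of respondents in the main survey
    mean_respondents : mean of Y among the m respondents
    mean_followup    : mean of Y in the fully responding subsample
                       of the n - m nonrespondents
    Assumes simple random sampling and equal design weights.
    """
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    return (m / n) * mean_respondents + ((n - m) / n) * mean_followup


# Illustrative figures: 800 of 1,000 respond, follow-up mean is lower.
estimate = two_phase_mean(n=1000, m=800, mean_respondents=50.0, mean_followup=40.0)
print(estimate)
```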
26
Dealing with nonresponse
The availability of auxiliary information is the starting point for dealing with nonresponse, whatever approach is followed. It can be available for all the population elements, for all sample elements, or partly for the population and partly for the sample. We will refer to the auxiliary information for element k as the auxiliary vector $x_k$.
27
Dealing with nonresponse
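[Figure: schematic data matrix with unit identifiers (id), inclusion probabilities π, an auxiliary variable x1 and survey variables y1, y2, …, yk. Missing cells within an otherwise observed record (item nonresponse) are treated by imputation; entirely missing records (unit nonresponse) are treated by weight adjustment.]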
28
Dealing with unit nonresponse: weights adjustment
The response set r can be considered the result of two probabilistic selections*:
i) s is a subset of U
ii) r is a subset of s
Each element k of U has:
- a known inclusion probability $\pi_k$, from which the design weight $d_k = 1/\pi_k$ can be calculated
- an unknown response probability $\theta_k$
*two-phase approach/re-weighting
29
Dealing with unit nonresponse: weights adjustment
$\theta_k$ should be estimated ($\hat{\theta}_k$) and used to adjust the design weights.
$\theta_k$ cannot be estimated without assuming (explicitly or implicitly) a model that exploits the available auxiliary information.
Methods to estimate $\theta_k$:
- Individual response probability estimation (explicit modelling)
- Response Homogeneity Groups
30
Individual response probabilities estimation (explicit modelling)
$\theta_k$ can be estimated assuming a model, e.g. logit:
$\log\dfrac{\theta_k}{1-\theta_k} = x_k'\beta$
Models can produce estimates that are too variable. To avoid this drawback, units with similar estimates can be grouped together. The average (or median, mode, …) of $\hat{\theta}_k$ in a group is then used as the estimated response probability for all the units in that group.
31
Response Homogeneity Groups (RHG)
The population is divided into non-overlapping groups, assuming that elements of the same group respond independently of one another with the same response probability* (Response Homogeneity Groups, RHG, model). Response probabilities can change from group to group. Under the hypothesis that only sampling and nonresponse errors occur, $\theta_k$ can be estimated for every unit k in group g by the group response rate:
$\hat{\theta}_k = \dfrac{m_g}{n_g}$
where $n_g$ is the number of sample units and $m_g$ the number of respondents in group g.
The information needed is the response rate for each RHG and a precise classification of the sample units into RHGs.
*Särndal and Lundström, 2005; Särndal et al. 1992
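The RHG adjustment can be sketched as follows: the response probability of each unit is estimated by the unweighted response rate of its group, and the design weight of each respondent is divided by that estimate. The record layout (`group`, `design_weight`, `responded`) and the data are hypothetical.

```python
from collections import defaultdict


def rhg_adjusted_weights(sample):
    """Nonresponse-adjusted weights under a response homogeneity group model.

    `sample` is a list of dicts with the hypothetical keys
    'id', 'group', 'design_weight' and 'responded'.
    Groups with no respondents yield no output records and would need
    to be collapsed with other groups beforehand.
    """
    n_g = defaultdict(int)  # sample units per group
    m_g = defaultdict(int)  # respondents per group
    for unit in sample:
        n_g[unit["group"]] += 1
        if unit["responded"]:
            m_g[unit["group"]] += 1

    adjusted = []
    for unit in sample:
        if not unit["responded"]:
            continue
        g = unit["group"]
        theta_hat = m_g[g] / n_g[g]  # estimated response probability
        adjusted.append({
            "id": unit["id"],
            "group": g,
            "theta_hat": theta_hat,
            "weight": unit["design_weight"] / theta_hat,
        })
    return adjusted


# Illustrative sample: group A has 2 of 4 respondents, group B has 4 of 5.
sample = (
    [{"id": i, "group": "A", "design_weight": 10.0, "responded": i < 2}
     for i in range(4)]
    + [{"id": 4 + i, "group": "B", "design_weight": 20.0, "responded": i < 4}
       for i in range(5)]
)
weights = rhg_adjusted_weights(sample)
for w in weights:
    print(w)
```

With equal design weights inside a group, the adjusted weights of the respondents add up to the design-weight total of the whole group, so respondents stand in for the nonrespondents of their own group only.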
32
Response Homogeneity Groups (RHG)
Units can be grouped into RHGs according to the values of some categorical auxiliary variables. In household surveys, RHGs are often obtained by classifying people according to geographic area, sex and age class. With several auxiliary variables (categorical and continuous), RHGs can be identified by means of classification and regression tree techniques*.
*Breiman, L. et al., 1984
33
Further correction of sampling weights (calibration)
Weights adjusted for nonresponse can be further modified so that the sum of the final weights $w_k$ over given subsets of the respondents (e.g. males and females) equals the known totals for the population:
$\sum_{k \in r} w_k x_k = \sum_{k \in U} x_k$
This further correction of the weights is a sort of post-stratification of the sample units, which can also compensate for bias due to coverage problems.
The new system of weights is said to be calibrated, i.e. it is calibrated to the population totals.
The calibration estimator is then defined as:
$\hat{t}_y = \sum_{k \in r} w_k y_k$
34
Further correction of sampling weights (calibration)
In calibration an additional condition is set: the distance between the initial and final weights is minimized. The aim is to change the sampling weights as little as possible, so as to limit the bias introduced (the sampling weights produce estimators that are unbiased or nearly unbiased). Different distance functions lead to different estimators (e.g. the Euclidean distance leads to the generalized regression estimator).
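General calibration requires solving a constrained minimization, which is beyond a short example. The sketch below covers only the simplest special case, post-stratification, where the auxiliary information is membership of non-overlapping classes with known population counts: each weight is multiplied by the ratio of the known class total to the current weighted class total. The function name, classes and totals are hypothetical.

```python
from collections import defaultdict


def poststratify(weights, classes, known_totals):
    """Rescale weights so that class sums match known population counts.

    weights      : nonresponse-adjusted weights of the respondents
    classes      : class label of each respondent (same order)
    known_totals : dict mapping class label to known population count
    """
    weighted_sum = defaultdict(float)
    for w, c in zip(weights, classes):
        weighted_sum[c] += w
    # One adjustment factor per class, shared by all its respondents.
    return [w * known_totals[c] / weighted_sum[c]
            for w, c in zip(weights, classes)]


# Illustrative data: known population counts of 600 males and 400 females.
weights = [100.0, 150.0, 200.0, 50.0]
classes = ["M", "M", "F", "F"]
calibrated = poststratify(weights, classes, {"M": 600, "F": 400})
print(calibrated)
```

Within each class all weights are scaled by the same factor, so the relative weights of respondents in a class are left unchanged.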
35
Dealing with item nonresponse
Possible approaches:
- Complete Case Analysis: omit all cases with missing values on any variable. A lot of information is lost!
- Available Case Analysis: omit cases with missing values on the variables required for a given analysis. Risk of incoherence among survey results!
- Weighting: different weighting systems would be necessary for different variables.
- Modelling Methods for Incomplete Data: allow for missing data in the model*.
- Imputation: the treatment of data used to deal with missing, invalid or inconsistent values identified during editing. This is done by substituting estimated values for the values flagged during editing and error localization**.
*Little, R.J.A. and Rubin, D.B. (2002) ** Luzi et al. (2007)
36
Imputation methods
- Manual: the values of data items deemed erroneous are changed by subject-matter experts, supported by programs specially written for this purpose.
- Deductive imputation: applies where only one correct value exists, as for a missing sum in a balance. The value is determined from other values on the same questionnaire.
- Imputation based on explicit models: data are imputed following an explicit model assumed for the data (averages, medians, regressions).
- Imputation based on implicit models: more attention is paid to the algorithm; however, there is a (possibly unknown) model underlying the data.
37
Imputation based on explicit models
Mean imputation: missing values are replaced by the mean of observed values (possibly by imputation cells). It is conceptually analogous to the re-weighting (inside imputation cells) Regression imputation: missing values for a given (response) variable are replaced by values predicted based on a regression model using appropriate covariates (possibly by imputation cells)
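Mean imputation within imputation cells can be sketched as below: each missing value is replaced by the mean of the observed values in the same cell. Missing values are represented by `None`; the function name and data are illustrative.

```python
from collections import defaultdict


def mean_impute_by_cell(values, cells):
    """Replace missing values (None) with the observed mean of their cell."""
    total = defaultdict(float)
    count = defaultdict(int)
    for v, c in zip(values, cells):
        if v is not None:
            total[c] += v
            count[c] += 1

    completed = []
    for v, c in zip(values, cells):
        if v is None:
            if count[c] == 0:
                raise ValueError(f"no observed values in cell {c!r}")
            completed.append(total[c] / count[c])
        else:
            completed.append(v)
    return completed


# Illustrative data with two imputation cells.
values = [10.0, None, 14.0, 30.0, None, 34.0]
cells = ["a", "a", "a", "b", "b", "b"]
completed = mean_impute_by_cell(values, cells)
print(completed)
```

Every imputed value sits exactly at its cell mean, so the completed data show less spread than fully observed data would.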
38
Imputation based on implicit models
- Hot-deck imputation: missing values are replaced by a value provided by another respondent (the donor).
  - Random donor: the donor is randomly selected (within imputation cells).
  - Nearest-neighbour donor: the donor is the most similar unit with respect to a distance function computed using appropriate covariates (within imputation cells) (Chen and Shao, 2000; Chen, Rao and Sitter, 2000).
- Cold-deck imputation: missing values are replaced by a value provided by a unit observed in another survey, or by the same unit in a previous repetition of the survey.
- Combined methods: different methods are combined. For example, in predictive mean matching a regression is performed at the first stage and a hot-deck at the second stage (Rubin, 1987).
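A nearest-neighbour hot-deck can be sketched with a single numeric covariate and the absolute difference as distance. Both choices, like the field names `employees` and `turnover`, are simplifying assumptions for illustration; real applications typically use several covariates and work within imputation cells.

```python
def nearest_neighbour_impute(records, target, covariate):
    """Fill missing `target` values with the value of the closest donor.

    Donors are the records where `target` is observed; closeness is the
    absolute difference on `covariate`. Ties go to the first donor found.
    """
    donors = [r for r in records if r[target] is not None]
    if not donors:
        raise ValueError("no donors available")

    completed = []
    for record in records:
        new_record = dict(record)
        if new_record[target] is None:
            donor = min(donors,
                        key=lambda d: abs(d[covariate] - new_record[covariate]))
            new_record[target] = donor[target]
        completed.append(new_record)
    return completed


# Illustrative records: turnover is missing for the unit with 6 employees.
records = [
    {"employees": 4, "turnover": 1400},
    {"employees": 3, "turnover": 2500},
    {"employees": 5, "turnover": 2300},
    {"employees": 6, "turnover": None},
    {"employees": 8, "turnover": 1200},
]
completed = nearest_neighbour_impute(records, "turnover", "employees")
print(completed[3])
```

Because the imputed value is an actually observed one, hot-deck methods never produce values outside the range seen among respondents.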
39
Deterministic and stochastic imputation methods
- Deterministic: the estimated value (e.g. a mean or a regression prediction) is used directly for imputation.
- Stochastic: a residual term is added to the estimated value.
Stochastic methods better preserve the variance of the distribution.
40
Pros and cons of imputation
Pros:
- Simple to use
- Standard methods for complete data can be used in subsequent data analyses
- Reduces bias in univariate statistics compared with complete case and available case analyses
- Uses all the available information, either observed or from other sources (registers, historical data, other sources)
Cons:
- Multivariate analyses: imputation generally attenuates the relationships in the data
- Variance (1): imputation introduces a further variance term (imputation variance)
- Variance (2): if imputed data are treated as originally observed, the precision of the estimates is overestimated (underestimation of total variance, confidence intervals that are too narrow, invalid tests, …)
41
References
AAPOR (2004). Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. 3rd edition. Lenexa, Kansas.
Bethlehem, J. “Weighting Nonresponse Adjustments Based on Auxiliary Information.” In Survey Nonresponse, ed. Robert M. Groves, Don A. Dillman, John L. Eltinge, and Roderick J. A. Little, pp. 275–88. New York: Wiley.
Biemer, P.P., Lyberg, L.E. (2003). Introduction to Survey Quality. Hoboken, New Jersey: John Wiley & Sons.
Brancato, G., Pellegrini, C., Signore, M., Simeoni, G. (2004). “Standardising, Evaluating and Documenting Quality: the Implementation of Istat Information System for Survey Documentation – SIDI”. Proceedings of the European Conference on Quality and Methodology in Official Statistics, Mainz, Germany.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont CA.
Cochran, W.G. (1977). Sampling Techniques, 3rd ed. Wiley, New York.
Deville, J.C., Särndal, C.E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, N. 87, pp
42
References
FCSM (2001). “Measuring and Reporting Sources of Error in Surveys”. Federal Committee on Statistical Methodology, Statistical Policy Working Paper 31.
Kalton, G., Kasprzyk, D. (1986). The treatment of missing survey data. Survey Methodology, vol. 12, n. 1, pp. 1-16.
Lessler, J., Kalsbeek, W. (1992). Nonsampling Errors in Surveys. Wiley, New York.
Little, R.J.A., Rubin, D.B. (2002). Statistical Analysis with Missing Data, 2nd Edition. Wiley, New York.
Luzi, O., Di Zio, M., Guarnera, U., Manzari, A., De Waal, T., Pannekoek, J., Hoogland, J., Tempelman, C., Hulliger, B., Kilchmann, D. (2008). Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys. Edimbus project.
Särndal, C.E., Lundström, S. (2005). Estimation in Surveys with Nonresponse. Wiley.
Särndal, C.E., Swensson, B., Wretman, J. (1992). Model Assisted Survey Sampling. Springer-Verlag, New York.