MEASUREMENT OF THE QUALITY OF STATISTICS

Item Nonresponse
Orietta Luzi (luzi@istat.it)
Istat – Department for National Accounts and Economic Statistics

Item nonresponse
- Is an error of non-observation
- Occurs when a respondent provides some, but not all, of the information required, or when the information provided cannot be used
- Common causes:
  - Interview interruption
  - Refusals
  - Skipping of a group of questions
  - "Don't know" answers
- Also known as missing values

Item nonresponse: actions for preventing item nonresponse
- Questionnaire wording
- Guarantee of statistical confidentiality
- Specific training for interviewers
- Accuracy of questionnaire instructions (online help for e-questionnaires)
- Adding a "Don't know" option to the question's response items
- ...

Item nonresponse: evaluation
- Item nonresponse rates can be produced for critical variables (the same kind of rates as those used for unit nonresponse)
- Item nonresponse rate = (units not responding to the question of interest) / (units eligible for the question of interest)
- This indicator may be difficult to compute for very complex questionnaires with many skip questions and alternative paths
- An indicator based on the number of missing values filled in during the data processing phase (editing and imputation) can also be used
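The rate defined above can be sketched in a few lines. This is only an illustration: the record layout, the `is_eligible` callback and the example data are assumptions made here, not part of the original slides. The point it shows is that units routed past a question by a skip rule are excluded from the denominator.

```python
def item_nonresponse_rate(records, item, is_eligible):
    """Share of eligible units whose value for `item` is missing (None)."""
    eligible = [r for r in records if is_eligible(r)]
    if not eligible:
        raise ValueError("no eligible units for this item")
    missing = sum(1 for r in eligible if r.get(item) is None)
    return missing / len(eligible)


# Hypothetical records: 'income' is asked only of employed respondents.
records = [
    {"employed": True, "income": 2100},
    {"employed": True, "income": None},   # item nonresponse
    {"employed": False, "income": None},  # skipped by design, not nonresponse
    {"employed": True, "income": 1800},
    {"employed": True, "income": None},   # item nonresponse
]

rate = item_nonresponse_rate(records, "income", lambda r: r["employed"])
print(rate)  # 0.5 (2 missing values among 4 eligible units)
```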

Item nonresponse: classification of nonresponse (Rubin, 1987)
- MCAR (Missing Completely At Random): the probability that a value is missing depends neither on the observed nor on the missing data
- MAR (Missing At Random): the probability that a value is missing depends only on the observed data
- MNAR, also NMAR (Missing Not At Random): the probability that a value is missing depends on the missing data themselves, possibly in addition to the observed data
- This classification is fundamental when using adjustment methods for nonresponse
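The three mechanisms are often written in terms of a response indicator. The notation below is an assumption introduced here for illustration (it does not appear on the slides): R is the vector of response indicators, Y_obs and Y_mis are the observed and missing parts of the data, and φ collects the parameters of the missingness mechanism.

```latex
% Notation (assumed): R = response indicators, Y_obs / Y_mis = observed / missing data,
% \phi = parameters of the missingness mechanism.
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}; \phi) = P(R; \phi) \\
\text{MAR:}\quad  & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}; \phi) = P(R \mid Y_{\mathrm{obs}}; \phi) \\
\text{MNAR:}\quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}; \phi) \text{ still depends on } Y_{\mathrm{mis}}
\end{align*}
```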

Dealing with item nonresponse (1)
- Complete case analysis: only units with no missing information are considered (low precision of estimates, additional bias if the mechanism is not MCAR)
- Available case analysis: for each variable, only units with observed data are analysed (bias in estimates of variance/covariance matrices)
- Re-weighting (different systems of weights for different items)
- All case analysis:
  - Modelling (methods for incomplete data)
  - Imputation
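The difference between complete case and available case analysis can be seen on a toy data set. The records and variable names below are hypothetical; the sketch only shows that the two approaches use different sets of units and can therefore give different estimates for the same variable.

```python
from statistics import mean

# Hypothetical microdata; None marks an item nonresponse.
rows = [
    {"age": 30, "income": 2000},
    {"age": 40, "income": None},
    {"age": None, "income": 3600},
    {"age": 50, "income": 4000},
]
variables = ["age", "income"]

# Complete case analysis: keep only units observed on every variable.
complete = [r for r in rows if all(r[v] is not None for v in variables)]
cc_means = {v: mean(r[v] for r in complete) for v in variables}

# Available case analysis: each variable uses its own set of respondents,
# so different statistics rest on different (and differently sized) subsets.
ac_means = {v: mean(r[v] for r in rows if r[v] is not None) for v in variables}
ac_counts = {v: sum(r[v] is not None for r in rows) for v in variables}

print(len(complete), cc_means)
print(ac_counts, ac_means)
```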

Dealing with item nonresponse (2)
- Modelling: given a model f(y; θ) assumed for the data, ML estimates of θ are obtained using all the data (with incomplete data, the EM algorithm is generally used). The estimated model is then used to impute the missing data.
- Advantages: explicit assumptions
- Drawbacks: costly approach for complex data distributions
(Little and Rubin, 2002; Little, 1988; Dempster et al., 1977)
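As a rough illustration of the modelling approach, the sketch below runs EM for a bivariate normal model in which x is fully observed and y is missing for some units. The model choice, the function name `em_bivariate_normal` and the data are assumptions made for this example; the slides do not specify a particular model. The E-step replaces the sufficient statistics involving a missing y by their conditional expectations given x, and the M-step recomputes the ML estimates from the completed statistics.

```python
def em_bivariate_normal(x, y, n_iter=2000):
    """EM for a bivariate normal (x, y); x fully observed, y may be None."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n  # ML estimate, fixed
    observed_y = [yi for yi in y if yi is not None]
    # Starting values taken from the respondents only.
    mu_y = sum(observed_y) / len(observed_y)
    s_yy = sum((yi - mu_y) ** 2 for yi in observed_y) / len(observed_y)
    s_xy = 0.0
    for _ in range(n_iter):
        # Current regression of y on x implied by the parameters.
        b = s_xy / s_xx
        a = mu_y - b * mu_x
        resid_var = s_yy - b * s_xy
        t_y = t_yy = t_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                # E-step: conditional expectations of y and y^2 given x.
                ey = a + b * xi
                ey2 = ey ** 2 + resid_var
            else:
                ey, ey2 = yi, yi ** 2
            t_y += ey
            t_yy += ey2
            t_xy += xi * ey
        # M-step: ML estimates from the expected sufficient statistics.
        mu_y = t_y / n
        s_yy = t_yy / n - mu_y ** 2
        s_xy = t_xy / n - mu_x * mu_y
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy}


# Hypothetical data: y is missing for the units with the largest x (MAR given x).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, None, None, None, None]

fit = em_bivariate_normal(x, y)
slope = fit["s_xy"] / fit["s_xx"]
intercept = fit["mu_y"] - slope * fit["mu_x"]

# The fitted model is then used to impute the missing values.
imputed = [yi if yi is not None else intercept + slope * xi for xi, yi in zip(x, y)]
print(round(fit["mu_y"], 4), round(slope, 4))
```

In this example the mean of y among respondents is 5.0, while the EM estimate of the mean is noticeably higher because it accounts for the fact that the nonrespondents have larger x.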

Dealing with item nonresponse (3)
- Imputation: missing data are replaced with properly estimated values. Often the substituted values are intended to create a data record that does not fail the so-called consistency rules (edits) (Kalton and Kasprzyk, 1986 and 1982; Kovar et al., 1995)
- Several imputation methods have been proposed in the literature

Imputation methods
- Deductive imputation: used where only one correct value exists, as for a missing component of a balance. The value is determined from other values on the same questionnaire
- Manual imputation: the values of data items deemed erroneous are changed by subject-matter experts, supported by programs specially developed for this purpose. Usually reserved for a small number of large or critical units (in terms of potential impact on target estimates). Sometimes these units can be re-contacted to collect the missing information

Imputation methods: imputation based on statistical models
- Imputation based on explicit models: data are imputed following an explicit model assumed for the data (averages, medians, regressions)
- Imputation based on implicit models: more attention is paid to the algorithm; however, there is still a (possibly unknown) model underlying the data

Imputation based on explicit models
- Mean imputation: missing values are replaced by the mean of the observed values. It is conceptually analogous to re-weighting. It can lead to serious bias if respondents and nonrespondents differ significantly in their behaviour (mean) with respect to the target variable under imputation
- Mean imputation within classes: classes of homogeneous units (imputation classes) are defined before imputation. Missing values in a class are imputed with the class mean or the class mode. If the auxiliary variables used to form the classes are correlated with the variable to be imputed, the bias due to nonresponse and imputation is reduced
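Mean imputation within classes can be sketched as follows. The function name, the record layout and the example data are assumptions for illustration; the class variable plays the role of the auxiliary information used to form imputation classes.

```python
from collections import defaultdict
from statistics import mean


def class_mean_impute(records, target, class_var):
    """Replace missing `target` values (None) with the respondent mean of the class."""
    observed = defaultdict(list)
    for r in records:
        if r[target] is not None:
            observed[r[class_var]].append(r[target])
    class_means = {c: mean(vals) for c, vals in observed.items()}

    completed = []
    for r in records:
        r = dict(r)  # work on a copy, leave the input untouched
        r["imputed"] = r[target] is None  # flag imputed values for later analysis
        if r["imputed"]:
            # A KeyError here means the class has no respondents at all.
            r[target] = class_means[r[class_var]]
        completed.append(r)
    return completed


# Hypothetical business data with two imputation classes.
records = [
    {"sector": "A", "turnover": 10},
    {"sector": "A", "turnover": 14},
    {"sector": "A", "turnover": None},
    {"sector": "B", "turnover": 100},
    {"sector": "B", "turnover": None},
    {"sector": "B", "turnover": 120},
]

completed = class_mean_impute(records, "turnover", "sector")
print([r["turnover"] for r in completed])  # [10, 14, 12, 100, 110, 120]
```

An overall mean (about 61 here) would be a poor value for either sector, which is the bias reduction the slide refers to.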

Imputation based on explicit models
- Regression imputation: missing values of a given (response) variable are replaced by values predicted from a regression model fitted on the responding units. The variable with missing values is the dependent variable; predictors are chosen among the available auxiliary variables. Regression models are generally estimated within imputation cells
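A minimal sketch of deterministic regression imputation with a single predictor, assuming a simple linear model fitted by ordinary least squares on the respondents. The helper names and data are hypothetical, and a real application would fit the model separately within each imputation cell.

```python
def ols_fit(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((xv - mean_x) ** 2 for xv in xs)
    s_xy = sum((xv - mean_x) * (yv - mean_y) for xv, yv in zip(xs, ys))
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def regression_impute(x, y):
    """Fit on responding units, then predict y where it is missing (None)."""
    pairs = [(xv, yv) for xv, yv in zip(x, y) if yv is not None]
    a, b = ols_fit([p[0] for p in pairs], [p[1] for p in pairs])
    return [yv if yv is not None else a + b * xv for xv, yv in zip(x, y)]


# Hypothetical auxiliary variable x (always observed) and target y.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.0, None, 8.0, None]

y_completed = regression_impute(x, y)
print(y_completed)
```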

Imputation based on implicit models
- Hot-deck imputation: missing values are replaced by a value provided by another respondent (the donor)
  - Random donor: the donor is randomly selected (within imputation cells)
  - Nearest-neighbour donor: the donor is the most similar unit with respect to a distance function computed on appropriate auxiliary variables (within imputation cells) (Chen and Shao, 2000; Chen, Rao and Sitter, 2000)
- Cold-deck imputation: missing values are replaced by a value provided by a unit observed in another survey, or by the same unit in a previous repetition of the survey
- Combined methods: different methods are combined. For example, in predictive mean matching, regression is performed at the first stage and hot-deck at the second stage (Rubin, 1987)
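The two hot-deck variants can be sketched as below. Everything named here (`hot_deck`, the `cell` and `employees` fields, the absolute-difference distance on one auxiliary variable) is an assumption made for illustration; production systems typically use richer distance functions and limit how often a donor is reused.

```python
import random


def nearest_neighbour_donor(recipient, donors, aux):
    """Donor minimising the absolute distance on one auxiliary variable."""
    return min(donors, key=lambda d: abs(d[aux] - recipient[aux]))


def hot_deck(records, target, cell_var, aux=None, rng=None):
    """Impute missing `target` values from a donor in the same imputation cell.

    With `aux` given, the nearest-neighbour donor is used; otherwise a donor
    is drawn at random among the respondents of the cell.
    """
    rng = rng if rng is not None else random.Random(0)
    completed = []
    for r in records:
        r = dict(r)
        if r[target] is None:
            donors = [d for d in records
                      if d[target] is not None and d[cell_var] == r[cell_var]]
            if not donors:
                raise ValueError(f"no donor available in cell {r[cell_var]!r}")
            if aux is not None:
                donor = nearest_neighbour_donor(r, donors, aux)
            else:
                donor = rng.choice(donors)
            r[target] = donor[target]
        completed.append(r)
    return completed


# Hypothetical business records grouped in two imputation cells.
records = [
    {"cell": "A", "employees": 5, "turnover": 50},
    {"cell": "A", "employees": 20, "turnover": 210},
    {"cell": "A", "employees": 18, "turnover": None},
    {"cell": "B", "employees": 6, "turnover": 900},
    {"cell": "B", "employees": 7, "turnover": None},
]

nn_completed = hot_deck(records, "turnover", "cell", aux="employees")
rd_completed = hot_deck(records, "turnover", "cell", rng=random.Random(42))
print([r["turnover"] for r in nn_completed])  # [50, 210, 210, 900, 900]
```

Because every imputed value is a value actually reported by some respondent, hot-deck never produces values outside the observed range.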

Deterministic and stochastic imputation
- Deterministic: the estimated value (e.g. a mean or a regression prediction) is used directly for imputation
- Stochastic: a random residual term is added to the estimated (predicted) value
- Deterministic imputation methods can distort the distributions and lead to a decrease in variability. Stochastic methods better preserve the variability of the distribution
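The loss of variability under deterministic imputation can be checked numerically. The sketch below uses mean imputation and, for the stochastic variant, adds a residual drawn at random from the respondents' own residuals; that particular choice of residual distribution, like the function name and the data, is an assumption for this example.

```python
import random
from statistics import mean, pvariance


def impute_mean(values, stochastic=False, rng=None):
    """Mean imputation; optionally add a randomly drawn observed residual."""
    rng = rng if rng is not None else random.Random(0)
    observed = [v for v in values if v is not None]
    centre = mean(observed)
    residuals = [v - centre for v in observed]
    completed = []
    for v in values:
        if v is None:
            noise = rng.choice(residuals) if stochastic else 0.0
            v = centre + noise
        completed.append(v)
    return completed


# Hypothetical variable with three missing values out of eight.
values = [8.0, 12.0, None, 9.0, 14.0, None, 13.0, None]

var_observed = pvariance([v for v in values if v is not None])
var_deterministic = pvariance(impute_mean(values))
var_stochastic = pvariance(impute_mean(values, stochastic=True, rng=random.Random(1)))

# Deterministic imputation piles the imputed units on the mean, shrinking
# the variance by the response rate (5/8 here); adding residuals restores
# part of the spread.
print(var_observed, var_deterministic, var_stochastic)
```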

Efficient use of information in imputation
- To better preserve univariate and multivariate distributions, the available information can be used:
  - as covariates in regression models, hot-deck and predictive mean matching
  - to form imputation cells
- Imputation cells make it possible to approximate the MCAR assumption within each cell
- Imputation cells are internally highly homogeneous, while different cells differ markedly from one another
- For hot-deck and predictive mean matching, imputation cells should contain 'enough' data (available donors)

Advantages of imputation
- Simple to use
- Standard methods for complete data can be used in subsequent data analyses
- Reduces bias in univariate statistics compared with complete case and available case analyses
- Uses all the available information, whether observed or drawn from other sources (registers, historical data, other sources)

Risks of imputation
- Multivariate analyses: imputation generally attenuates the relationships among variables
- Variance (1): imputation introduces a further variance term (imputation variance)
- Variance (2): if imputed data are treated as if they had been originally observed, the precision of the estimates is overestimated (underestimation of total variance, confidence intervals that are too narrow, invalid tests, ...)

Risks of imputation: variance estimation
- Variance estimation under single imputation:
  - Model-assisted (Särndal, 1992; Rao and Sitter, 1995; Lee, Rancourt and Särndal, 2002)
  - Re-sampling techniques (Shao, 2002):
    - Jackknife (Rao and Shao, 1992)
    - Bootstrap (Shao and Sitter, 1996)
  - Reversed approach (Shao and Steel, 1999)
- Multiple imputation (Rubin, 1987; Schafer, 1997; Raghunathan et al., 2001): the incomplete data set is imputed several times (say m). The m completed data sets are then combined in order to estimate the additional uncertainty due to missing data and imputation
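The combination step of multiple imputation is usually done with Rubin's combining rules: the point estimate is the average of the m completed-data estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below applies these rules to made-up numbers; the function name and the example estimates are assumptions for illustration.

```python
from statistics import mean, variance


def combine_multiple_imputations(estimates, variances):
    """Rubin's rules for m completed-data estimates and their variances."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)          # combined point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return {"estimate": q_bar, "within": within, "between": between, "total": total}


# Hypothetical results from m = 5 imputed data sets.
estimates = [10.0, 10.4, 9.8, 10.2, 9.6]
variances = [0.5, 0.5, 0.5, 0.5, 0.5]

combined = combine_multiple_imputations(estimates, variances)
print(combined)
```

Treating a single imputed data set as if it were observed would report only the within-imputation variance (0.5 here); the between-imputation term is the part that reflects the uncertainty due to missing data.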

References
Chen J., Shao J. (2000), Nearest Neighbor Imputation for Survey Data, Journal of Official Statistics, 16, pp. 113-131.
Chen J., Rao J.N.K., Sitter R. (2000), Efficient Random Imputation for Missing Data in Complex Surveys, Statistica Sinica, 10, pp. 1153-1169.
Dempster A.P., Laird N.M., Rubin D.B. (1977), Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Ser. B, 39, pp. 1-38.
Kalton G. (1983), Compensating for Missing Survey Data, Survey Research Center, University of Michigan, 75-76.
Kalton G., Kasprzyk D. (1982), Imputing for Missing Survey Responses, Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 22-31.
Kalton G., Kasprzyk D. (1986), The Treatment of Missing Survey Data, Survey Methodology, 12, 1, Statistics Canada.
Kovar J.G., MacMillan J.H., Whitridge P. (1988), Overview and Strategy for the Generalized Edit and Imputation System, Statistics Canada, Methodology Branch, April 1988.
Kovar J.G., Whitridge P. (1995), Imputation of Business Survey Data, in Business Survey Methods, John Wiley.
Little R.J.A. (1988), Missing Data Adjustments in Large Surveys, Journal of Business and Economic Statistics, 6, No 3, pp. 287-296.
Little R.J.A., Rubin D.B. (2002), Statistical Analysis with Missing Data, 2nd Edition, Wiley, New York.
Raghunathan T.E., Lepkowski J.M., Van Hoewyk J., Solenberger P. (2001), A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models, Survey Methodology, 27, No 1, pp. 85-95.
Rao J.N.K. (1996), On Variance Estimation with Imputed Survey Data, Journal of the American Statistical Association, 91, pp. 499-506.
Rao J.N.K., Shao J. (1992), Jackknife Variance Estimation with Survey Data under Hot-deck Imputation, Biometrika, 79, pp. 811-822.
Rao J.N.K., Sitter R.R. (1995), Variance Estimation under Two-Phase Sampling with Application to Imputation for Missing Data, Biometrika, 82, pp. 453-460.
Rubin D.B. (1976), Inference and Missing Data, Biometrika, 63, pp. 581-592.
Rubin D.B. (1987), Multiple Imputation for Nonresponse in Surveys, Wiley, New York.
Särndal C.E. (1992), Method for Estimating the Precision of Survey Estimates when Imputation Has Been Used, Survey Methodology, pp. 241-252.
Schafer J.L. (1997), Analysis of Incomplete Multivariate Data, Chapman & Hall.
Shao J. (2002), Replication Methods for Variance Estimation in Complex Surveys with Imputed Data, in Survey Nonresponse, Groves R. et al. eds., J. Wiley and Sons, New York, pp. 303-314.
Shao J., Sitter R.R. (1996), Bootstrap for Imputed Survey Data, Journal of the American Statistical Association, 91, pp. 1278-1288.
Shao J., Steel P. (1999), Variance Estimation for Survey Data with Composite Imputation and Nonnegligible Sampling Fractions, Journal of the American Statistical Association, 94, pp. 254-265.