The estimation strategy of the National Household Survey (NHS) François Verret, Mike Bankier, Wesley Benjamin & Lisa Hayden Statistics Canada Presentation.

Presentation transcript:

Slide 1: The estimation strategy of the National Household Survey (NHS)
François Verret, Mike Bankier, Wesley Benjamin & Lisa Hayden
Statistics Canada
Presentation at the ITSEW 2011, June 21, 2011

Slide 2: Outline of the presentation
1. Introduction
2. Handling non-response error
3. Simulation set-up
4. Results
5. Limits of the study
6. Conclusion
7. Future work

Slide 3: 1. Introduction
• 2006 Census: 20% long form, 80% short form
• 2011: 100% mandatory Census short form; 30% sampled to voluntarily complete the NHS long form
• Objectives of the long form: get data to plan, deliver and support government programs directed at target populations
• 2011 topics common to both forms: demography, family structure, language
• Additional 2011 long form topics: education, ethnicity, income, immigration, mobility…
• NHS sample size is 4.5 million dwellings (f = 30%)

Slide 4: 1. Introduction
• Non-response error in the NHS:
  - The survey is now voluntary, so significant non-response is expected
  - To minimize the impact, after a fixed date collection efforts are restricted to a random Non-Response Follow-Up (NRFU) sub-sample
• Set-up developed by Hansen & Hurwitz (1946):
  1. Select first-phase sample s from population U
  2. Non-response s_nr observed in s
  3. NRFU selected from s_nr
  4. Response NRFU_r and non-response NRFU_nr observed in the NRFU (Hansen & Hurwitz assumed a 100% response rate)
[Diagram: population U; sample s split into s_r and s_nr; NRFU drawn from s_nr and split into NRFU_r and NRFU_nr]

Slide 5: 1. Introduction
• When 100% of the NRFU responds (as in Hansen and Hurwitz's original setting), the NRFU can be used to estimate the total in s_nr without non-response bias
• This is not the case in the NHS
• However, focusing the collection efforts on the NRFU converts part of the non-response bias (that would be observed in the full s_nr) into sub-sampling error
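The full-response case described on this slide can be sketched in a few lines. This is a minimal illustration, assuming every NRFU unit answers and the NRFU is drawn with a single known sub-sampling fraction; the function name and argument layout are hypothetical and not taken from the presentation.

```python
def hansen_hurwitz_total(resp_values, resp_weights,
                         nrfu_values, nrfu_weights, subsample_fraction):
    """Estimate a population total from first-phase respondents plus the NRFU.

    Assumption (labelled): all NRFU units respond, as in the original
    Hansen & Hurwitz setting, and the NRFU is drawn from s_nr with one
    common sub-sampling fraction.
    """
    if not 0 < subsample_fraction <= 1:
        raise ValueError("subsample_fraction must be in (0, 1]")
    # First-phase respondents s_r keep their first-phase weights.
    first_phase_part = sum(w * y for w, y in zip(resp_weights, resp_values))
    # NRFU units stand in for all of s_nr: inflate by 1 / sub-sampling fraction.
    nrfu_part = sum((w / subsample_fraction) * y
                    for w, y in zip(nrfu_weights, nrfu_values))
    return first_phase_part + nrfu_part
```

When part of the NRFU does not respond, as in the NHS, the second term no longer covers all of s_nr, which is why a further adjustment (reweighting or imputation) is needed.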

Slide 6: 2. Handling non-response error
• The estimation method chosen to minimize the remaining non-response bias should have the following properties:
  - As few bias assumptions as possible should be made
  - The method should be simple to explain and to implement in production
• Available micro-level auxiliary data to adjust for non-response:
  - 2011 Census short form
  - Tax data
• Calibration: agreement with Census totals is desirable from a user's perspective

Slide 7: 2. Handling non-response error
• First class of contenders: Reweighting
  - Usual method used to compensate for total non-response in social surveys
  - The Hansen & Hurwitz estimator of a total is unbiased if 100% of the NRFU answers
• When the assumption does not hold, we must model the last non-response mechanism/phase and reweight accordingly…

Slide 8: 2. Handling non-response error
• Scores method:
  - Model the probability of response with a logistic regression
  - Form Response Homogeneity Groups (RHG) of respondents and non-respondents with similar predicted response probabilities
  - Calculate the response rate in each RHG and assign these new predicted response probabilities to respondents
  - Divide the NRFU_r weights by this probability
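The last three steps of the scores method can be sketched as follows. This is an illustration under stated assumptions, not the production procedure: predicted probabilities are taken as given (fitting the logistic regression is left out), groups are formed as equal-size slices of the units sorted by predicted probability, and the response rate within a group is unweighted. The function name is hypothetical.

```python
def rhg_adjusted_weights(pred_prob, responded, weights, n_groups):
    """Return adjusted weights for respondents and None for non-respondents.

    Assumptions (labelled): RHGs are equal-size slices of the units sorted
    by predicted response probability; the response rate is unweighted.
    """
    n = len(pred_prob)
    order = sorted(range(n), key=lambda i: pred_prob[i])
    adjusted = [None] * n
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not members:
            continue
        n_resp = sum(1 for i in members if responded[i])
        if n_resp == 0:
            raise ValueError("an RHG has no respondents; use fewer groups")
        rate = n_resp / len(members)  # observed response rate in this RHG
        for i in members:
            if responded[i]:
                adjusted[i] = weights[i] / rate  # divide weight by the rate
    return adjusted
```

With equal input weights, the adjusted weights of the respondents in a group add up to the total weight of that group, so the non-respondents' share is carried by respondents with similar predicted probabilities.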

Slide 9: 2. Handling non-response error
• Second class of contenders: Imputation
  - Usual method to compensate for item non-response
  - We will consider only nearest-neighbour imputation using the CANadian Census Edit & Imputation System (CANCEIS)
  1. Partial imputation: impute only non-respondents to the sub-sample (NRFU_nr) and use reweighting to take sampling into account
  2. Mass imputation: impute all non-respondents (s_nr / NRFU_r)

Slide 10: 2. Handling non-response error
• Some pros & cons
  Methods compared: Scores, Partial imputation, Mass imputation
  - Preserves micro-level information of non-respondents: √√√
  - Does not create synthetic information: √√√
  - Uses less heavy non-response hypotheses: √√
  - Fully takes sub-sampling design into account: √√
  - Census systems available: √√
  - More calibration to known Census totals can be done: √√√

Slide 11: 3. Simulation set-up
• Use 2006 Census 20% long form sample data
• Restricted to the Census Metropolitan Area (CMA) of Toronto
• Simulation aimed at preserving the properties of the NHS (except for f = 30%):
  - Non-response to the first phase was simulated by deterministically blanking out the data of the 63% of respondents who answered last in 2006
  - Of these non-respondents, the 78% who answered first will have their response restored if they are selected in the NRFU sub-sample
  - NRFU sub-sampling was simulated by selecting a stratified random sample of 41% of s_nr
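The three simulation steps can be sketched as below. This is a simplified illustration: the units are assumed to be supplied in order of their 2006 response time (earliest first), and the NRFU is drawn by simple random sampling rather than the stratified design used in the study. The function name and return layout are hypothetical.

```python
import random


def simulate_phases(units_by_response_order, rng,
                    nr_rate=0.63, late_resp_rate=0.78, nrfu_fraction=0.41):
    """Split units into s_r, s_nr, NRFU_r and NRFU_nr.

    Assumptions (labelled): units are ordered from earliest to latest
    respondent; the NRFU is a simple random sample (the study stratified).
    """
    n = len(units_by_response_order)
    n_nr = round(n * nr_rate)
    s_r = units_by_response_order[:n - n_nr]   # first-phase respondents
    s_nr = units_by_response_order[n - n_nr:]  # the 63% who answered last
    # Among s_nr, those who answered first respond if followed up.
    n_recover = round(len(s_nr) * late_resp_rate)
    would_respond = set(s_nr[:n_recover])
    nrfu = rng.sample(s_nr, round(len(s_nr) * nrfu_fraction))
    nrfu_r = [u for u in nrfu if u in would_respond]
    nrfu_nr = [u for u in nrfu if u not in would_respond]
    return s_r, s_nr, nrfu_r, nrfu_nr
```

Passing a seeded `random.Random` instance as `rng` makes a replication reproducible, which matters if the sub-sampling is repeated to separate bias from sampling error.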

Slide 12: 3. Simulation set-up
• Estimators calculated
  - As points of reference, unbiased estimators:
  - As contenders:

Slide 13: 3. Simulation set-up
• The scores method
  - A single logistic regression was done for the whole CMA of Toronto
  - Household response probability was predicted
  - Considered for stepwise selection: household-level variables, our best attempt at summarizing the person-level information and one paradata variable
  - R-square of 26%
  - 13 RHG formed with predicted probabilities ranging from 29% to 95%

Slide 14: 3. Simulation set-up
• Imputation methods
  - Nearest-neighbour imputation done with CANCEIS
  - RHG is defined by household size
  - The distance between non-respondents and donors (respondents) is defined by weighting each household-level, person-level and paradata characteristic in the distance function
  - Preference is given to donors who are geographically close
  - For each non-respondent, a list of donors is made and one is randomly selected with probability proportional to a measure of size (first-phase weight for mass imputation, scores method weights for partial imputation)
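The donor selection step can be sketched as follows. This does not reproduce CANCEIS: the distance here is a plain weighted count of mismatched characteristics, the donor list is simply the k closest donors of the same household size, and all names (`pick_donor`, `char_weights`, `size_measure`) are hypothetical.

```python
import random


def pick_donor(recipient, donors, char_weights, size_measure, k, rng):
    """Pick one donor for a non-respondent.

    Assumptions (labelled): records are dicts with an 'hh_size' key, donors
    also carry an 'id'; distance is a weighted mismatch count, which is an
    illustrative stand-in for the CANCEIS distance function.
    """
    def distance(donor):
        return sum(w for name, w in char_weights.items()
                   if donor[name] != recipient[name])

    # Restrict to donors of the same household size.
    pool = [d for d in donors if d["hh_size"] == recipient["hh_size"]]
    if not pool:
        raise ValueError("no donor with the same household size")
    nearest = sorted(pool, key=distance)[:k]  # list of candidate donors
    sizes = [size_measure[d["id"]] for d in nearest]
    # Random draw with probability proportional to the measure of size.
    return rng.choices(nearest, weights=sizes, k=1)[0]
```

Geographic proximity could be handled in this sketch by giving a geography characteristic a large weight in `char_weights`.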

Slide 15: 3. Simulation set-up
• M = 84 non-short-form characteristics over the various topics
• Average relative difference:
  - Calculated at the CMA level:
  - At the Weighting Area (WA) level within the CMA (953 WAs in total):
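The formulas on this slide did not survive in the transcript. The sketch below shows one plausible form of an average relative difference, assumed here to be the mean over characteristics of the absolute relative difference in percent, with the WA-level version averaging that quantity over areas. Both the form and the function names are assumptions, not the authors' definition.

```python
def average_relative_difference(estimates, references):
    """Mean of 100 * |estimate - reference| / reference over characteristics.

    Assumption (labelled): this absolute, percentage form is a guess at the
    measure; characteristics with a zero reference total are skipped.
    """
    terms = [100.0 * abs(e - r) / r
             for e, r in zip(estimates, references) if r != 0]
    if not terms:
        raise ValueError("no characteristic with a non-zero reference total")
    return sum(terms) / len(terms)


def average_relative_difference_by_area(estimates_by_area, references_by_area):
    """Average the per-area measure over all areas (e.g. weighting areas)."""
    per_area = [average_relative_difference(estimates_by_area[a],
                                            references_by_area[a])
                for a in references_by_area]
    return sum(per_area) / len(per_area)
```

Under this form, small areas with small reference totals contribute large relative differences, so a WA-level figure would be expected to exceed the CMA-level one.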

Slide 16: 4. Results
• Errors at the CMA and WA levels for Toronto
  Levels: CMA, WA
  Points of comparison (at each level): full first-phase, Hansen & Hurwitz
  Estimators: Hansen & Hurwitz estimator, Mass imputation, Partial imputation, Scores method
  Values: 2.97, N/A, 24.56, N/A

Slide 17: 5. Limits of the study
• Results:
  - The simulation only includes one replication of the sub-sampling and non-response mechanisms
  - Non-response bias is the measure of interest, but errors were presented
  - Non-response mechanisms were generated deterministically. Should they be generated probabilistically?
  - The 2011 sampling, non-response and available data (e.g. paradata) cannot be replicated exactly
  - Only totals were studied. What about other parameters such as correlations?

Slide 18: 5. Limits of the study
• Possible confounding effects:
  - Logistic regression was done at the aggregated level of the CMA and no WA effects or interactions were considered
  - Paradata for imputation is more closely related to the non-response mechanism (preference is given to late respondents in the distance)
  - Weighting of donors in imputation has an impact
  - Calibration was done from sample to U; calibration at inner levels/phases could help scores and partial imputation

Slide 19: 6. Conclusion
• With these preliminary results, it seems the scores method is doing well at aggregate levels, while partial imputation is doing better than scores at finer levels
  - Mass imputation: Can you override the known sub-sample design with an imputation model?
  - Partial imputation: Can include more information (person-level, paradata) than scores, but the weighting of each component in the distance is partially data driven and not straightforward
  - Scores method: More difficult to include the information, but variable selection to explain non-response is direct

Slide 20: 7. Future work
• Possible:
  - Replicate sub-sampling and imputation more than once to isolate bias components
  - Consider other levels of calibration in the comparisons
  - Hybrid of scores and partial imputation
• Definite:
  - Implement a method into NHS production
  - Estimate the errors and variances (multi-phase, large sampling fractions, errors due to modeling,…) and educate data users
• Important to get a good model for the last non-response mechanism. Whatever the method, quality of the results is a function of the auxiliary data available.

Slide 21: For more information, please contact: François Verret - SSMD/DMES (613)