The estimation strategy of the National Household Survey (NHS)
François Verret, Mike Bankier, Wesley Benjamin & Lisa Hayden
Statistics Canada
Presentation at the ITSEW 2011, June 21, 2011

Outline of the presentation
1. Introduction
2. Handling non-response error
3. Simulation set-up
4. Results
5. Limits of the study
6. Conclusion
7. Future work

1. Introduction
• 2006 Census: 20% long form, 80% short form
• 2011:
  - 100% Census mandatory short form
  - 30% sampled to voluntarily complete the NHS long form
• Objectives of the long form: get data to plan, deliver and support government programs directed at target populations
• 2011 topics common to both forms: demography, family structure, language
• Additional 2011 long form topics: education, ethnicity, income, immigration, mobility…
• NHS sample size is 4.5 million dwellings (f = 30%)

1. Introduction
• Non-response error in the NHS:
  - The survey is now voluntary, so significant non-response is expected
  - To minimize the impact, collection efforts after a fixed date are restricted to a Non-Response Follow-Up (NRFU) random sub-sample
• Set-up developed by Hansen & Hurwitz (1946):
  1. Select the 1st phase sample s from the population U
  2. Non-response s_nr is observed in s
  3. The NRFU is selected from s_nr
  4. Response NRFU_r and non-response NRFU_nr are observed in the NRFU (Hansen & Hurwitz assumed a 100% response rate)
[Diagram: U contains s; s splits into s_r and s_nr; the NRFU is drawn from s_nr and splits into NRFU_r and NRFU_nr]
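The set-up above can be illustrated with a small sketch of a Hansen & Hurwitz style estimator of a total. It assumes, for simplicity, that the NRFU is a simple random sub-sample of s_nr drawn with a constant fraction and that every NRFU unit responds; the function and argument names are hypothetical and are not taken from the presentation.

```python
def hansen_hurwitz_total(y_r, w_r, y_nrfu, w_nrfu, nrfu_fraction):
    """Two-phase estimate of a population total.

    y_r, w_r       : values and 1st phase weights of the respondents (s_r)
    y_nrfu, w_nrfu : values and 1st phase weights of the NRFU units,
                     all assumed to respond (the original 1946 setting)
    nrfu_fraction  : assumed constant sub-sampling fraction of s_nr
    """
    if not 0.0 < nrfu_fraction <= 1.0:
        raise ValueError("nrfu_fraction must be in (0, 1]")
    # Respondents represent themselves with their 1st phase weight.
    total_respondents = sum(w * y for w, y in zip(w_r, y_r))
    # Each NRFU unit also stands in for the non-respondents not followed up.
    total_nonrespondents = (
        sum(w * y for w, y in zip(w_nrfu, y_nrfu)) / nrfu_fraction
    )
    return total_respondents + total_nonrespondents


# Toy data: 1st phase weight 4 for everyone, half of s_nr followed up.
estimate = hansen_hurwitz_total([1, 0, 1], [4.0, 4.0, 4.0], [1, 1], [4.0, 4.0], 0.5)
```

With a stratified NRFU, as in the NHS, the single fraction would be replaced by one fraction per stratum.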

1. Introduction
• When 100% of the NRFU responds (as in Hansen and Hurwitz's original setting), the NRFU can be used to estimate the total in s_nr without non-response bias
• This is not the case in the NHS
• However, focusing the collection efforts on the NRFU converts part of the non-response bias (that would be observed in the full s_nr) into sub-sampling error
[Diagram: partition of U into s, s_r, s_nr, NRFU, NRFU_r and NRFU_nr]

2. Handling non-response error
• The estimation method chosen to minimize the remaining non-response bias should have the following properties:
  - As few bias assumptions as possible should be made
  - The method should be simple to explain and to implement in production
• Available micro-level auxiliary data to adjust for non-response:
  - 2011 Census short form
  - Tax data
• Calibration: agreement with Census totals is desirable from a user's perspective
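The calibration idea can be sketched with the simplest possible case, a post-stratification ratio adjustment that forces weighted counts to agree with known totals. The presentation does not say which calibration method is used, so the technique, the function name and the toy totals below are assumptions made only for illustration.

```python
from collections import defaultdict


def poststratify(weights, strata, census_totals):
    """Scale weights so that, within each stratum, they sum to the known total.

    weights       : current survey weights, one per unit
    strata        : stratum label of each unit (e.g. an age-sex group)
    census_totals : dict mapping each stratum label to its known total
    """
    weighted_count = defaultdict(float)
    for w, s in zip(weights, strata):
        weighted_count[s] += w
    # Ratio adjustment: known total divided by the current weighted count.
    return [w * census_totals[s] / weighted_count[s] for w, s in zip(weights, strata)]


# Toy data: two strata with hypothetical known totals 10 and 6.
calibrated = poststratify([2.0, 2.0, 4.0], ["a", "a", "b"], {"a": 10.0, "b": 6.0})
```

After the adjustment, the weights in stratum "a" sum to 10 and those in stratum "b" to 6, which is the kind of agreement with Census totals described above.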

2. Handling non-response error
• First class of contenders: reweighting
  - Usual method used to compensate for total non-response in social surveys
  - The Hansen & Hurwitz estimator of a total is unbiased if 100% of the NRFU answers
• When the assumption does not hold, we must model the last non-response mechanism/phase and reweight accordingly…

2. Handling non-response error
• Scores method:
  - Model the probability of response with a logistic regression
  - Form Response Homogeneity Groups (RHG) of respondents and non-respondents with similar predicted response probabilities
  - Calculate the response rate in each RHG and assign these new predicted response probabilities to respondents
  - Divide the NRFU_r weights by this probability: [formula not preserved in the transcript]
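A minimal sketch of the grouping and reweighting steps of the scores method follows. It assumes the predicted probabilities from the logistic regression are already available, forms equal-sized RHGs by rank, and uses unweighted response rates; these choices, and all names in the code, are illustrative assumptions rather than the production specification.

```python
from collections import defaultdict


def scores_weights(units, n_groups):
    """Return adjusted weights (None for non-respondents).

    units    : list of dicts with keys 'p_hat' (predicted response
               probability), 'responded' (bool) and 'weight'
    n_groups : number of Response Homogeneity Groups to form
    """
    # Rank units by predicted probability and cut the ranking into groups.
    order = sorted(range(len(units)), key=lambda i: units[i]["p_hat"])
    group_of = {i: rank * n_groups // len(units) for rank, i in enumerate(order)}

    n_units = defaultdict(int)
    n_resp = defaultdict(int)
    for i, u in enumerate(units):
        n_units[group_of[i]] += 1
        n_resp[group_of[i]] += int(u["responded"])

    adjusted = []
    for i, u in enumerate(units):
        g = group_of[i]
        if u["responded"]:
            # Observed response rate of the RHG replaces the model prediction.
            adjusted.append(u["weight"] / (n_resp[g] / n_units[g]))
        else:
            adjusted.append(None)
    return adjusted


# Toy data: four NRFU units with weight 10 each, two RHGs.
toy_units = [
    {"p_hat": 0.2, "responded": False, "weight": 10.0},
    {"p_hat": 0.3, "responded": True, "weight": 10.0},
    {"p_hat": 0.8, "responded": True, "weight": 10.0},
    {"p_hat": 0.9, "responded": True, "weight": 10.0},
]
toy_adjusted = scores_weights(toy_units, n_groups=2)
```

In the toy data the low-probability group has a 50% response rate, so its single respondent carries twice its original weight and the total weight of the four units is preserved.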

2. Handling non-response error
• Second class of contenders: imputation
  - Usual method to compensate for item non-response
  - We will consider only nearest-neighbour imputation using the CANadian Census Edit & Imputation System (CANCEIS)
  1. Partial imputation: impute only non-respondents to the sub-sample (NRFU_nr) and use reweighting to take sampling into account
  2. Mass imputation: impute all non-respondents (s_nr \ NRFU_r, that is, all of s_nr except the NRFU respondents)

2. Handling non-response error
• Some pros & cons
Methods compared (table columns): Scores | Partial imputation | Mass imputation
Criteria (table rows), with check marks as extracted (their column positions were lost):
  - Preserves micro-level information of non-respondents: √√√
  - Does not create synthetic information: √√√
  - Uses less heavy non-response hypotheses: √√
  - Fully takes sub-sampling design into account: √√
  - Census systems available: √√
  - More calibration to known Census totals can be done: √√√

3. Simulation set-up
• Use 2006 Census 20% long form sample data
• Restricted to the Census Metropolitan Area (CMA) of Toronto
• Simulation aimed at preserving the properties of the NHS (except for f = 30%):
  - Non-response to the 1st phase was simulated by deterministically blanking out the data of the 63% of respondents who answered last in 2006
  - Of these non-respondents, the 78% who answered first will have their response restored if they are selected in the NRFU sub-sample
  - NRFU sub-sampling was simulated by selecting a stratified random sample of 41% of s_nr
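The simulated mechanisms can be sketched as follows. The rates (63%, 78%, 41%) come from the slide; the function name, the use of a simple rather than stratified random sub-sample, and the representation of response times as sortable values are simplifying assumptions.

```python
import random


def simulate_nonresponse(response_time, nr_rate=0.63, convert_rate=0.78,
                         nrfu_fraction=0.41, seed=0):
    """Split units into s_r, s_nr, NRFU_r and NRFU_nr.

    response_time : dict mapping unit id to a sortable 2006 response time
    """
    ordered = sorted(response_time, key=response_time.get)
    n_nr = round(nr_rate * len(ordered))
    # The latest respondents are deterministically turned into non-respondents.
    s_r = ordered[: len(ordered) - n_nr]
    s_nr = ordered[len(ordered) - n_nr:]
    # Among them, the earliest ones would respond if followed up.
    would_respond = set(s_nr[: round(convert_rate * len(s_nr))])
    # Simplification: simple random sub-sample instead of a stratified one.
    rng = random.Random(seed)
    nrfu = rng.sample(s_nr, round(nrfu_fraction * len(s_nr)))
    nrfu_r = [u for u in nrfu if u in would_respond]
    nrfu_nr = [u for u in nrfu if u not in would_respond]
    return s_r, s_nr, nrfu_r, nrfu_nr


# Toy data: 100 units whose id doubles as their response time.
toy_s_r, toy_s_nr, toy_nrfu_r, toy_nrfu_nr = simulate_nonresponse(
    {i: i for i in range(100)}
)
```

Only the sub-sampling step is random here; the two non-response mechanisms are deterministic functions of the response order, as in the study.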

3. Simulation set-up
• Estimators calculated
  - As points of reference, unbiased estimators: [formulas not preserved in the transcript]
  - As contenders: [formulas not preserved in the transcript]

3. Simulation set-up
• The scores method
  - A single logistic regression was done for the whole CMA of Toronto
  - Household response probability was predicted
  - Considered for stepwise selection: household-level variables, our best attempt at summarizing the person-level information, and one paradata variable
  - R-square of 26%
  - 13 RHG formed, with predicted probabilities ranging from 29% to 95%

3. Simulation set-up
• Imputation methods
  - Nearest-neighbour imputation done with CANCEIS
  - RHG is defined by household size
  - The distance between non-respondents and donors (respondents) is defined by weighting each household-level, person-level and paradata characteristic in the distance function
  - Preference is given to donors who are geographically close
  - For each non-respondent, a list of donors is made and one is randomly selected with probability proportional to a measure of size (1st phase weight for mass imputation, scores method weight for partial imputation)
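The donor selection logic described above can be sketched as follows. This is not the CANCEIS distance function: it uses a simple weighted count of mismatching characteristics, and the variable names, weights and toy records are hypothetical.

```python
import random


def pick_donor(recipient, donors, var_weights, n_candidates, rng):
    """Pick one donor among the nearest candidates, with probability
    proportional to each candidate's measure of size.

    recipient    : dict of matching variables, including 'hh_size'
    donors       : list of dicts with the same variables plus 'size'
    var_weights  : dict mapping each matching variable to its distance weight
    n_candidates : length of the donor list kept for the random draw
    """
    # Donors must come from the same household-size group.
    pool = [d for d in donors if d["hh_size"] == recipient["hh_size"]]
    if not pool:
        raise ValueError("no donor with the same household size")

    def distance(donor):
        # Weighted count of characteristics that differ from the recipient.
        return sum(w for v, w in var_weights.items() if donor[v] != recipient[v])

    candidates = sorted(pool, key=distance)[:n_candidates]
    sizes = [d["size"] for d in candidates]
    return rng.choices(candidates, weights=sizes, k=1)[0]


# Toy data: the geographic variable gets the larger weight.
toy_donors = [
    {"id": 1, "hh_size": 2, "tenure": "own", "area": "A", "size": 5.0},
    {"id": 2, "hh_size": 2, "tenure": "rent", "area": "B", "size": 5.0},
    {"id": 3, "hh_size": 3, "tenure": "own", "area": "A", "size": 5.0},
]
toy_recipient = {"hh_size": 2, "tenure": "own", "area": "A"}
toy_var_weights = {"tenure": 1.0, "area": 2.0}
nearest = pick_donor(toy_recipient, toy_donors, toy_var_weights, 1, random.Random(0))
```

Giving the geographic variable a larger weight is one simple way to express the preference for nearby donors; the measure of size would be the 1st phase weight or the scores method weight, depending on the imputation variant.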

3. Simulation set-up
• M = 84 non-short-form characteristics over the various topics
• Average relative difference:
  - Calculated at the CMA level: [formula not preserved in the transcript]
  - At the weighting area (WA) level within the CMA (953 WA in total): [formula not preserved in the transcript]
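Since the formulas did not survive in the transcript, the sketch below shows one common way such a measure could be computed: the mean, over characteristics, of the absolute relative difference between an estimated total and its point of comparison, expressed in percent. This definition is an assumption, not the authors' formula, and the function name is hypothetical.

```python
def average_relative_difference(estimates, references):
    """Mean absolute relative difference, in percent, over characteristics.

    estimates  : estimated totals, one per characteristic
    references : totals from the point of comparison, in the same order
    """
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need two non-empty lists of equal length")
    relative = [abs(e - r) / r for e, r in zip(estimates, references)]
    return 100.0 * sum(relative) / len(relative)


# Toy data: two characteristics, off by 10% and 5% respectively.
toy_ard = average_relative_difference([110.0, 95.0], [100.0, 100.0])
```

Under the same assumption, a WA-level version would compute this quantity within each weighting area and then average over the areas.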

4. Results
• Errors at the CMA and WA levels for Toronto

                             CMA                                  WA
Point of comparison          Full first-phase  Hansen & Hurwitz   Full first-phase  Hansen & Hurwitz
Hansen & Hurwitz estimator   0.94              0.00               22.98             0.00
Mass imputation              2.97              N/A                24.56             N/A
Partial imputation           2.25              1.52               26.69             13.22
Scores method                2.03              1.45               26.77             18.67

5. Limits of the study
• Results:
  - The simulation only includes one replication of the sub-sampling and non-response mechanisms
  - Non-response bias is the measure of interest, but errors were presented
  - Non-response mechanisms were generated deterministically. Should they be generated probabilistically?
  - The 2011 sampling, non-response and available data (e.g. paradata) cannot be replicated exactly
  - Only totals were studied. What about other parameters such as correlations?

5. Limits of the study
• Possible confounding effects:
  - Logistic regression was done at the aggregated level of the CMA, and no WA effects or interactions were considered
  - Paradata for imputation is more closely related to the non-response mechanism (preference is given to late respondents in the distance)
  - Weighting of donors in imputation has an impact
  - Calibration was done from the sample to U; calibration at inner levels/phases could help scores and partial imputation

6. Conclusion
• With these preliminary results, it seems the scores method is doing well at aggregate levels, while partial imputation is doing better than scores at finer levels
  - Mass imputation: Can you override the known sub-sample design with an imputation model?
  - Partial imputation: Can include more information (person-level, paradata) than scores, but weighting of each component in the distance is partially data driven and not straightforward
  - Scores method: More difficult to include the information, but variable selection to explain non-response is direct

7. Future work
• Possible:
  - Replicate sub-sampling and imputation more than once to isolate bias components
  - Consider other levels of calibration in the comparisons
  - Hybrid of scores and partial imputation
• Definite:
  - Implement a method into NHS production
  - Estimate the errors and variances (multi-phase, large sampling fractions, errors due to modeling, …) and educate data users
• It is important to get a good model for the last non-response mechanism. Whatever the method, the quality of the results is a function of the auxiliary data available.

For more information, please contact:
François Verret - SSMD/DMES
Francois.Verret@statcan.gc.ca
(613) 951-7318
