A bootstrap method for estimators based on combined administrative and survey data
Sander Scholtus (Statistics Netherlands)
NTTS Conference, 13 March 2019

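Dutch Virtual Census

[Figure: the population is split into $U_1$, $S_2$ and $U_2$, with columns for $x$ and education level ($y$). Education level comes from admin. data in $U_1$ and from the LFS in $S_2$; it is unknown ("?") in $U_2$.]

Number of cases (2011 Census):
- 0-14 year olds: ±2.9 million
- $U_1$ (admin. data): ±6.5 million
- LFS ($S_2$): ±0.34 million
- $U_2$ (education level unknown, "?"): ±6.9 million
- Total: ±16.7 million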

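Dutch Virtual Census

Goal: estimate tables of frequencies involving education level.

Typical element:
$$\theta_{hc} = \sum_{i \in U} h_i y_{ci}, \qquad h_i \in \{0,1\},\ y_{ci} \in \{0,1\},\ c \in \{1,\dots,C\}$$

Table layout: the rows $1, \dots, h, \dots, H$ are defined by the other variables, the columns are the levels $1, \dots, c, \dots, C$ of educational attainment ($\boldsymbol{y}$), and cell $(h, c)$ contains $\theta_{hc}$.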

Dutch Virtual Census

Proposal (De Waal and Daalmans, 2018): use mass imputation
- Estimate a model (e.g., logistic regression) for $y_1, \dots, y_C$ based on $S_2$
- For each $i \in U_2 \setminus S_2$, impute predictions $\hat{y}_{1i}, \dots, \hat{y}_{Ci}$ based on this model
- Estimator for $\theta_{hc}$:
$$\hat{\theta}_{hc} = \theta_{hc1} + \hat{\theta}_{hc2} = \sum_{i \in U_1} h_i y_{ci} + \sum_{i \in S_2} h_i y_{ci} + \sum_{i \in U_2 \setminus S_2} h_i \hat{y}_{ci}$$

Question: how to evaluate the variance of $\hat{\theta}_{hc}$?
- Analytical approximation: possible but cumbersome (Scholtus, 2018)
- Bootstrap procedure

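General set-up

Target population $U = U_1 \cup U_2$

Subpopulation $U_1$:
- Admin. data available
- Considered fixed (no variance)

Probability sample $S$:
- May have overlap with admin. data: $S = S_1 \cup S_2$, with $S_1 = S \cap U_1$ and $S_2 = S \cap U_2$
- Inclusion probabilities $\pi_i$ known

Estimator of interest: $\hat{\theta} = t(S, U_1)$

[Figure: $U_1$ and $U_2$ side by side, with the sample $S = S_1 \cup S_2$ overlapping both; the part of $U_2$ outside $S_2$ is marked "?".]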

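Bootstrap

Classical bootstrap (Efron, 1979):
- Estimator $\hat{\theta} = t(S)$, with $S$ an i.i.d. random sample of size $n$ from a distribution $F$
- Resampling:
  - Draw a with-replacement sample $S^*_b$ of size $n$ from $S$
  - Compute $\hat{\theta}^*_b = t(S^*_b)$
  - Repeat this a large number of times ($b = 1, 2, \dots, B$)
- Bootstrap estimator for the variance of $\hat{\theta}$:
$$\mathrm{var}_{\mathrm{boot}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^*_b - \bar{\theta}^* \right)^2, \qquad \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b$$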
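The classical resampling loop can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: the function name `bootstrap_variance`, the normally distributed sample and the choice of the sample mean as $t(\cdot)$ are all hypothetical.

```python
import numpy as np

def bootstrap_variance(sample, estimator, B=1000, rng=None):
    """Classical i.i.d. bootstrap variance estimate of estimator(sample)."""
    rng = np.random.default_rng(rng)
    sample = np.asarray(sample)
    n = len(sample)
    replicates = np.empty(B)
    for b in range(B):
        # Draw a with-replacement sample S*_b of size n from S
        resample = sample[rng.integers(0, n, size=n)]
        replicates[b] = estimator(resample)
    # ddof=1 gives the 1/(B-1) divisor around the mean of the replicates
    return replicates.var(ddof=1)

# Hypothetical i.i.d. sample; the estimator t is the sample mean
rng = np.random.default_rng(2019)
s = rng.normal(loc=10.0, scale=2.0, size=200)
v_boot = bootstrap_variance(s, np.mean, B=2000, rng=1)
v_textbook = s.var(ddof=1) / len(s)  # usual variance estimate of a mean
```

For the sample mean the two values should be close, which gives a quick sanity check before applying the same loop to an estimator with no simple variance formula.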

Bootstrap

Classical bootstrap does not account for:
- Finite-population sampling
- Complex survey design

Different extensions of the bootstrap are available
- Overview: Mashreghi, Haziza and Léger (2016)

Here: extension based on pseudo-populations
- Theory: Booth, Butler and Hall (1994), Chauvet (2007)
- Previous application: Kuijvenhoven and Scholtus (2011)

Bootstrap

First assume: $w_i = 1/\pi_i$ is always integer-valued

Bootstrap algorithm:
1. Create a pseudo-population $U^*$ by taking $w_i$ copies of each unit $i \in S$.
2. For each $b = 1, \dots, B$ do the following:
   - Draw sample $S^*_b$ from $U^*$ analogous to the design used to draw $S$ from $U$. If $k \in U^*$ is a copy of $i \in S$ then its inclusion probability is $\pi^*_k \propto \pi_i$.
   - Analogously to $\hat{\theta} = t(S, U_1)$, construct replicate $\hat{\theta}^*_b = t(S^*_b, U_1)$.
3. Compute the variance estimate for $\hat{\theta}$ based on pseudo-population $U^*$ as:
$$\mathrm{var}_{\mathrm{boot}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^*_b - \bar{\theta}^* \right)^2, \quad \text{with } \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b.$$

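Bootstrap

General case: $w_i = 1/\pi_i = \lfloor w_i \rfloor + \varphi_i$, with $\lfloor w_i \rfloor \in \mathbb{N}$, $\varphi_i \in [0,1)$

Bootstrap algorithm:
1. Create a pseudo-population $U^*$ by taking $\omega_i$ copies of each unit $i \in S$. Random inflation weight: $\omega_i = \lfloor w_i \rfloor$ with probability $1 - \varphi_i$, $\omega_i = \lfloor w_i \rfloor + 1$ with probability $\varphi_i$.
2. For each $b = 1, \dots, B$ do the following:
   - Draw sample $S^*_b$ from $U^*$ analogous to the design used to draw $S$ from $U$. If $k \in U^*$ is a copy of $i \in S$ then its inclusion probability is $\pi^*_k \propto \pi_i$.
   - Analogously to $\hat{\theta} = t(S, U_1)$, construct replicate $\hat{\theta}^*_b = t(S^*_b, U_1)$.
3. Compute the variance estimate for $\hat{\theta}$ based on pseudo-population $U^*$ as:
$$\mathrm{var}_{\mathrm{boot}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^*_b - \bar{\theta}^* \right)^2, \quad \text{with } \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b.$$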
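A short check, not on the slide, of why the random inflation weight is a reasonable way to round a non-integer design weight. Write $\lfloor w_i \rfloor$ for the integer part of $w_i = 1/\pi_i$ and $\varphi_i = w_i - \lfloor w_i \rfloor$ for its fractional part. Then

$$E(\omega_i) = \lfloor w_i \rfloor (1 - \varphi_i) + \left( \lfloor w_i \rfloor + 1 \right) \varphi_i = \lfloor w_i \rfloor + \varphi_i = w_i .$$

As an illustrative example with made-up numbers: a unit with $\pi_i = 0.4$ has $w_i = 2.5$, so it receives 2 copies with probability 0.5 and 3 copies with probability 0.5, and $E(\omega_i) = 2 \cdot 0.5 + 3 \cdot 0.5 = 2.5$. Consequently the expected size of the pseudo-population is $\sum_{i \in S} w_i$, the usual design-weighted estimate of the population size.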

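Bootstrap

General case: $w_i = 1/\pi_i = \lfloor w_i \rfloor + \varphi_i$, with $\lfloor w_i \rfloor \in \mathbb{N}$, $\varphi_i \in [0,1)$

Bootstrap algorithm:

For each $a = 1, \dots, A$ do the following:
1. Create a pseudo-population $U^*_a$ by taking $\omega_i$ copies of each unit $i \in S$. Random inflation weight: $\omega_i = \lfloor w_i \rfloor$ with probability $1 - \varphi_i$, $\omega_i = \lfloor w_i \rfloor + 1$ with probability $\varphi_i$.
2. For each $b = 1, \dots, B$ do the following:
   - Draw sample $S^*_{ab}$ from $U^*_a$ analogous to the design used to draw $S$ from $U$. If $k \in U^*_a$ is a copy of $i \in S$ then its inclusion probability is $\pi^*_k \propto \pi_i$.
   - Analogously to $\hat{\theta} = t(S, U_1)$, construct replicate $\hat{\theta}^*_{ab} = t(S^*_{ab}, U_1)$.
3. Compute the variance estimate for $\hat{\theta}$ based on pseudo-population $U^*_a$ as:
$$v_a(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^*_{ab} - \bar{\theta}^*_a \right)^2, \quad \text{with } \bar{\theta}^*_a = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_{ab}.$$

Finally compute:
$$\mathrm{var}_{\mathrm{boot}}(\hat{\theta}) = \frac{1}{A} \sum_{a=1}^{A} v_a(\hat{\theta}).$$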
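The nested algorithm above can be sketched for the simplest possible design. The assumptions here are mine, not the presentation's: simple random sampling without replacement with equal $\pi_i = n/N$, no administrative part $U_1$, and a plain expansion estimator of a total as $t(\cdot)$. The names `expansion_total` and `pseudo_population_bootstrap` and the gamma-distributed population are hypothetical.

```python
import numpy as np

def expansion_total(sample, pop_size):
    """Hypothetical estimator t: population size times the sample mean."""
    return pop_size * sample.mean()

def pseudo_population_bootstrap(y, pi, estimator, A=5, B=200, seed=None):
    """Pseudo-population bootstrap variance under SRS without replacement."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    w = 1.0 / pi                   # design weight, equal for all units here
    w_floor = int(np.floor(w))
    phi = w - w_floor              # fractional part of the weight
    v = np.empty(A)
    for a in range(A):
        # Step 1: random inflation weights, floor(w) or floor(w) + 1 copies
        omega = w_floor + (rng.random(n) < phi)
        pseudo_pop = np.repeat(y, omega)
        reps = np.empty(B)
        for b in range(B):
            # Step 2: same design as the original sample (SRS of size n)
            s_star = rng.choice(pseudo_pop, size=n, replace=False)
            reps[b] = estimator(s_star, len(pseudo_pop))
        # Step 3: variance of the replicates within pseudo-population a
        v[a] = reps.var(ddof=1)
    return v.mean()                # average over the A pseudo-populations

# Hypothetical population and sample
rng = np.random.default_rng(42)
N, n = 1000, 300
population = rng.gamma(shape=2.0, scale=50.0, size=N)
sample = rng.choice(population, size=n, replace=False)
v_boot = pseudo_population_bootstrap(sample, pi=n / N,
                                     estimator=expansion_total, seed=7)
v_srs = N**2 * (1 - n / N) * sample.var(ddof=1) / n  # textbook SRS formula
```

With $n/N = 0.3$ the weight is $1/0.3 \approx 3.33$, so each sampled unit is copied 3 or 4 times. For this simple estimator the bootstrap value should land near the textbook formula, including the finite-population correction that the classical bootstrap would miss.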

Bootstrap

Key step: analogously to $\hat{\theta} = t(S, U_1)$, construct replicate $\hat{\theta}^*_{ab} = t(S^*_{ab}, U_1)$

Example: Dutch Virtual Census with mass imputation

Original estimator:
$$\hat{\theta}_{hc} = \sum_{i \in U_1} h_i y_{ci} + \sum_{i \in S_2} h_i y_{ci} + \sum_{i \in U_2 \setminus S_2} h_i \hat{y}_{ci}$$

Construction of bootstrap replicate $\hat{\theta}^*_{hc,ab}$:
- $U^*_{2a}$ is the subpopulation of $U^*_a$ consisting of copies of units from $S_2$
- $S^*_{2ab} = S^*_{ab} \cap U^*_{2a}$; note: size of overlap is random
- Use $S^*_{2ab}$ to re-estimate the imputation model for $y_1, \dots, y_C$
- Impute the missing values of $y_1, \dots, y_C$ in $U^*_{2a} \setminus S^*_{2ab}$
- Compute:
$$\hat{\theta}^*_{hc,ab} = \sum_{k \in U_1} h_k y_{ck} + \sum_{k \in S^*_{2ab}} h_k y_{ck} + \sum_{k \in U^*_{2a} \setminus S^*_{2ab}} h_k \hat{y}_{ck}$$
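A minimal sketch of how one set of replicates could be built when the estimator involves mass imputation. Several simplifications are assumed here and are not taken from the presentation: there is no administrative part $U_1$, $h_k = 1$ for every unit, only one binary category $y$ is imputed, the design is simple random sampling, and the logistic regression is replaced by a stand-in model that predicts the within-class proportion of $y = 1$ for a single categorical covariate $x$. All function and variable names are hypothetical.

```python
import numpy as np

def mass_imputation_total(x_sample, y_sample, x_rest, n_classes):
    """Observed y in the sample plus imputed predictions for the rest."""
    # Stand-in imputation model: proportion of y = 1 within each class of x
    counts = np.bincount(x_sample, minlength=n_classes)
    ones = np.bincount(x_sample, weights=y_sample, minlength=n_classes)
    # Classes without sampled units fall back to the overall proportion
    p_hat = np.divide(ones, counts,
                      out=np.full(n_classes, y_sample.mean()),
                      where=counts > 0)
    return y_sample.sum() + p_hat[x_rest].sum()

def bootstrap_replicates(x_s, y_s, w, n_classes, B=200, seed=None):
    """Replicates for one pseudo-population (one value of a)."""
    rng = np.random.default_rng(seed)
    n = len(y_s)
    w_floor = np.floor(w).astype(int)
    omega = w_floor + (rng.random(n) < (w - w_floor))
    x_pop = np.repeat(x_s, omega)  # pseudo-population of copies of S_2
    y_pop = np.repeat(y_s, omega)
    N_star = len(y_pop)
    reps = np.empty(B)
    for b in range(B):
        in_sample = np.zeros(N_star, dtype=bool)
        in_sample[rng.choice(N_star, size=n, replace=False)] = True
        # Re-estimate the model on the resample, then impute the remainder
        reps[b] = mass_imputation_total(x_pop[in_sample], y_pop[in_sample],
                                        x_pop[~in_sample], n_classes)
    return reps

# Hypothetical population with three covariate classes
rng = np.random.default_rng(0)
N, n = 2000, 400
x = rng.integers(0, 3, size=N)
y = (rng.random(N) < np.array([0.2, 0.4, 0.6])[x]).astype(int)
idx = rng.choice(N, size=n, replace=False)
x_s, y_s = x[idx], y[idx]
reps = bootstrap_replicates(x_s, y_s, w=np.full(n, N / n), n_classes=3, seed=1)
sd_boot = reps.std(ddof=1)
```

The point of the sketch is that the imputation model is refitted inside every replicate, so the spread of `reps` reflects both the sampling and the model estimation step.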

Simulation study

- Synthetic target population of $N = 3725$ persons
- Simple random sample of size $n = N/5 = 745$; no admin. data
- Mass imputation based on logistic regression: Gender × (Age + Income)
- True standard deviations: estimated by repeating sampling and imputing 20 000 times

True counts and true standard deviations by age and educational attainment:

                    true counts                    true standard deviations
age (years)         low     medium   high          low     medium   high
young (15–35)       330     795      400           34.5    42.2     36.8
middle (36–55)      115     560      480           22.3    36.1     [missing]
old (56+)           120     525      [missing]     22.8    35.6     [missing]

Simulation study

- Analytical variance approximation (Scholtus, 2018), repeated 100 times
- Bootstrap procedure with $A = 1$, $B = 200$, repeated 100 times

Estimated standard deviations by age and educational attainment:

                    estimated analytical st. dev.  estimated bootstrap st. dev.
age (years)         low     medium   high          low     medium   high
young (15–35)       32.2    39.5     34.5          34.1    41.9     36.4
middle (36–55)      20.6    34.0     33.3          22.7    36.6     36.0
old (56+)           21.1    32.8     31.8          22.5    35.2     [missing]
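To read the simulation results, it can help to express each estimate as a relative difference from the true standard deviation. The snippet below does this for the "young (15–35)" row only, the one row for which all values survive in this transcript. The helper `rel_diff_pct` is a hypothetical name, and the comparison is an illustration rather than an analysis from the presentation.

```python
# Values for the "young (15-35)" row: low, medium, high
true_sd = [34.5, 42.2, 36.8]
analytical_sd = [32.2, 39.5, 34.5]
bootstrap_sd = [34.1, 41.9, 36.4]

def rel_diff_pct(estimates, truth):
    """Relative difference from the true value, in percent."""
    return [100.0 * (e - t) / t for e, t in zip(estimates, truth)]

analytical_diff = rel_diff_pct(analytical_sd, true_sd)
bootstrap_diff = rel_diff_pct(bootstrap_sd, true_sd)

for label, a, b in zip(["low", "medium", "high"], analytical_diff, bootstrap_diff):
    print(f"{label:>6}: analytical {a:+.1f}%, bootstrap {b:+.1f}%")
```

For this row the analytical approximation sits roughly 6 to 7 percent below the true values, while the bootstrap estimates are within about 1 percent.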

Conclusion

Bootstrap method for estimating accuracy of statistics based on combined administrative and survey data
- Advantage over analytical variance estimation: flexibility
- Possible disadvantage: computational workload

Future work:
- Simulation study with real Dutch Census data (in progress)
- Extending method to account for additional sources of uncertainty:
  - Micro-integration of survey and admin. data in overlapping part
  - Measurement error
  - …

References

J.G. Booth, R.W. Butler, and P. Hall (1994), Bootstrap Methods for Finite Populations. Journal of the American Statistical Association 89, 1282–1289.

G. Chauvet (2007), Méthodes de Bootstrap en Population Finie. PhD Thesis (in French), L’Université de Rennes.

T. de Waal and J. Daalmans (2018), Mass Imputation for Census Estimation: Methodology. Report, Statistics Netherlands.

B. Efron (1979), Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.

L. Kuijvenhoven and S. Scholtus (2011), Bootstrapping Combined Estimators based on Register and Sample Survey Data. Discussion Paper, Statistics Netherlands.

Z. Mashreghi, D. Haziza, and C. Léger (2016), A Survey of Bootstrap Methods in Finite Population Sampling. Statistics Surveys 10, 1–52.

S. Scholtus (2018), Variances of Census Tables after Mass Imputation. Discussion Paper, Statistics Netherlands.