A bootstrap method for estimators based on combined administrative and survey data
Sander Scholtus (Statistics Netherlands)
NTTS Conference, 13 March 2019
Dutch Virtual Census
Availability of education level ($y$) by source, number of cases (2011 Census):
- 0-14 year olds: ±2.9 million
- $U_1$, admin. data: ±6.5 million
- $S_2$, LFS (Labour Force Survey): ±0.34 million
- $U_2$ outside the LFS, education level unknown (?): ±6.9 million
- Total: ±16.7 million
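Dutch Virtual Census
Goal: estimate tables of frequencies involving education level
Typical element: $\theta_{hc} = \sum_{i \in U} h_i y_{ci}$, with $h_i \in \{0,1\}$, $y_{ci} \in \{0,1\}$, $c \in \{1, \ldots, C\}$
Table layout: rows $1, \ldots, h, \ldots, H$ are formed by the other variables; columns level 1, …, level $c$, …, level $C$ are educational attainment ($y$); cell $(h, c)$ contains $\theta_{hc}$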
Dutch Virtual Census
Proposal (De Waal and Daalmans, 2018): use mass imputation
- Estimate a model (e.g., logistic regression) for $y_1, \ldots, y_C$ based on $S_2$
- For each $i \in U_2 \setminus S_2$, impute predictions $\hat{y}_{1i}, \ldots, \hat{y}_{Ci}$ based on this model
- Estimator for $\theta_{hc}$:
  $\hat{\theta}_{hc} = \theta_{hc1} + \hat{\theta}_{hc2} = \sum_{i \in U_1} h_i y_{ci} + \sum_{i \in S_2} h_i y_{ci} + \sum_{i \in U_2 \setminus S_2} h_i \hat{y}_{ci}$
Question: how to evaluate the variance of $\hat{\theta}_{hc}$?
- Analytical approximation: possible but cumbersome (Scholtus, 2018)
- Bootstrap procedure
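The mass-imputation estimator can be illustrated with a minimal Python sketch. It is not the authors' implementation: the logistic regression is swapped for a much simpler stand-in model (the observed share of each level within groups of a single auxiliary variable `x`), the imputed "predictions" are taken to be estimated probabilities, and all names (`fit_cell_proportions`, `mass_imputation_estimate`, the toy records) are hypothetical.

```python
from collections import defaultdict

def fit_cell_proportions(sample, n_levels):
    """Stand-in model (assumption): P(y = c | x) estimated by the share of level c in each x group."""
    counts = defaultdict(lambda: [0] * n_levels)
    for unit in sample:
        counts[unit["x"]][unit["y"]] += 1
    return {x: [k / sum(row) for k in row] for x, row in counts.items()}

def mass_imputation_estimate(u1, s2, u2_rest, model, h, c):
    """Observed part over U1 and S2, plus imputed part over U2 minus S2."""
    observed = sum(h(u) * (u["y"] == c) for u in u1 + s2)
    imputed = sum(h(u) * model[u["x"]][c] for u in u2_rest)
    return observed + imputed

# Toy data: y is an education level in {0, 1}, x is one auxiliary variable.
u1 = [{"x": "a", "y": 0}, {"x": "b", "y": 1}]                      # admin. data
s2 = [{"x": "a", "y": 0}, {"x": "a", "y": 1},
      {"x": "b", "y": 1}, {"x": "b", "y": 1}]                      # survey
u2_rest = [{"x": "a"}, {"x": "a"}, {"x": "b"}]                     # y unknown

model = fit_cell_proportions(s2, n_levels=2)
theta_all = mass_imputation_estimate(u1, s2, u2_rest, model, lambda u: 1, c=1)
theta_a = mass_imputation_estimate(u1, s2, u2_rest, model,
                                   lambda u: int(u["x"] == "a"), c=1)
```

Here `h` plays the role of the indicator $h_i$ that selects the row of the census table; the toy model assumes every `x` group in `u2_rest` also occurs in the survey.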
General set-up
- Target population $U = U_1 \cup U_2$
- Subpopulation $U_1$:
  - Admin. data available
  - Considered fixed (no variance)
- Probability sample $S = S_1 \cup S_2$:
  - May overlap with the admin. data: $S_1 = S \cap U_1$; $S_2 = S \cap U_2$
  - Inclusion probabilities $\pi_i$ known
- In $U_2 \setminus S_2$ the target variable is unknown (?)
- Estimator of interest: $\hat{\theta} = t(S, U_1)$
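Bootstrap
Classical bootstrap (Efron, 1979):
- Estimator $\hat{\theta} = t(S)$, with $S$ an i.i.d. random sample of size $n$ from a distribution $F$
- Resampling:
  - Draw a with-replacement sample $S_b^*$ of size $n$ from $S$
  - Compute $\hat{\theta}_b^* = t(S_b^*)$
  - Repeat this a large number of times ($b = 1, 2, \ldots, B$)
- Bootstrap estimator for the variance of $\hat{\theta}$:
  $\mathrm{var}_{\mathrm{boot}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}_b^* - \bar{\theta}^* \right)^2$, with $\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b^*$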
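A minimal Python sketch of the classical bootstrap variance estimator, using only the standard library. The function name `bootstrap_variance`, the toy data, and the choice of the sample mean as $t$ are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_variance(sample, t, B=1000, seed=0):
    """Variance of t(S) estimated from B with-replacement resamples of size n."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = [t(rng.choices(sample, k=n)) for _ in range(B)]
    # statistics.variance centres on the mean of the replicates and divides by B - 1
    return statistics.variance(replicates)

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]   # toy i.i.d. sample
v_boot = bootstrap_variance(data, statistics.mean, B=2000, seed=1)
```

For the sample mean, the result should be close to the plug-in value `statistics.pvariance(data) / len(data)`, which gives a quick sanity check on the resampling loop.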
Bootstrap
- Classical bootstrap does not account for:
  - Finite-population sampling
  - Complex survey design
- Different extensions of the bootstrap are available
  - Overview: Mashreghi, Haziza and Léger (2016)
- Here: extension based on pseudo-populations
  - Theory: Booth, Butler and Hall (1994), Chauvet (2007)
  - Previous application: Kuijvenhoven and Scholtus (2011)
Bootstrap
First assume: $w_i = 1/\pi_i$ is always integer-valued
Bootstrap algorithm:
1. Create a pseudo-population $U^*$ by taking $w_i$ copies of each unit $i \in S$.
2. For each $b = 1, \ldots, B$ do the following:
   - Draw sample $S_b^*$ from $U^*$ analogous to the design used to draw $S$ from $U$. If $k \in U^*$ is a copy of $i \in S$ then its inclusion probability is $\pi_k^* \propto \pi_i$.
   - Analogously to $\hat{\theta} = t(S, U_1)$, construct replicate $\hat{\theta}_b^* = t(S_b^*, U_1)$.
3. Compute the variance estimate for $\hat{\theta}$ based on pseudo-population $U^*$ as:
   $\mathrm{var}_{\mathrm{boot}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}_b^* - \bar{\theta}^* \right)^2$, with $\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b^*$.
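The three steps can be sketched in Python for the simplest design. The sketch assumes simple random sampling without replacement, integer weights, no admin. data, and an expansion estimator of a total as $t$; the function names and toy values are hypothetical.

```python
import random
import statistics

def pseudo_population(sample, weights):
    """Step 1: take w_i copies of each sampled unit i (w_i integer)."""
    return [unit for unit, w in zip(sample, weights) for _ in range(w)]

def pseudo_pop_bootstrap(sample, weights, t, B=200, seed=0):
    rng = random.Random(seed)
    u_star = pseudo_population(sample, weights)
    n = len(sample)
    # Step 2: redraw with the original design; here (assumption) an SRS
    # without replacement of size n, so every copy has the same probability.
    replicates = [t(rng.sample(u_star, n)) for _ in range(B)]
    # Step 3: variance of the replicates around their mean, divisor B - 1
    return statistics.variance(replicates)

sample_y = [3.0, 7.5, 1.2, 4.4, 6.1, 2.8, 5.0, 3.9, 8.2, 4.7]  # toy sample, n = 10
weights = [5] * len(sample_y)                                   # N = 50, w_i = N / n
v_boot = pseudo_pop_bootstrap(sample_y, weights, lambda s: 5 * sum(s), B=2000, seed=1)
```

Because the resampling is done without replacement from a pseudo-population of size $N$, the replicates reflect the finite-population correction that the classical bootstrap ignores.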
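Bootstrap
General case: $w_i = 1/\pi_i = \lfloor w_i \rfloor + \varphi_i$, with $\lfloor w_i \rfloor \in \mathbb{N}$, $\varphi_i \in [0,1)$
Bootstrap algorithm:
1. Create a pseudo-population $U^*$ by taking $\omega_i$ copies of each unit $i \in S$.
   Random inflation weight: $\omega_i = \lfloor w_i \rfloor$ with probability $1 - \varphi_i$, $\omega_i = \lfloor w_i \rfloor + 1$ with probability $\varphi_i$.
2. For each $b = 1, \ldots, B$ do the following:
   - Draw sample $S_b^*$ from $U^*$ analogous to the design used to draw $S$ from $U$. If $k \in U^*$ is a copy of $i \in S$ then its inclusion probability is $\pi_k^* \propto \pi_i$.
   - Analogously to $\hat{\theta} = t(S, U_1)$, construct replicate $\hat{\theta}_b^* = t(S_b^*, U_1)$.
3. Compute the variance estimate for $\hat{\theta}$ based on pseudo-population $U^*$ as:
   $\mathrm{var}_{\mathrm{boot}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}_b^* - \bar{\theta}^* \right)^2$, with $\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b^*$.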
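The random inflation weight amounts to random rounding of $1/\pi_i$, which a few lines of Python can illustrate. The function name `random_inflation_weight` is hypothetical.

```python
import math
import random

def random_inflation_weight(pi, rng):
    """Round w = 1/pi down with probability 1 - phi and up with probability phi."""
    w = 1.0 / pi
    base = math.floor(w)
    phi = w - base                      # fractional part, in [0, 1)
    return base + (1 if rng.random() < phi else 0)

rng = random.Random(0)
draws = [random_inflation_weight(0.4, rng) for _ in range(20000)]   # w = 2.5
mean_copies = sum(draws) / len(draws)
```

By construction the expected number of copies equals $w_i$, so the pseudo-population has the right size on average even though its size varies from draw to draw.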
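Bootstrap
General case: $w_i = 1/\pi_i = \lfloor w_i \rfloor + \varphi_i$, with $\lfloor w_i \rfloor \in \mathbb{N}$, $\varphi_i \in [0,1)$
Bootstrap algorithm:
For each $a = 1, \ldots, A$ do the following:
1. Create a pseudo-population $U_a^*$ by taking $\omega_i$ copies of each unit $i \in S$.
   Random inflation weight: $\omega_i = \lfloor w_i \rfloor$ with probability $1 - \varphi_i$, $\omega_i = \lfloor w_i \rfloor + 1$ with probability $\varphi_i$.
2. For each $b = 1, \ldots, B$ do the following:
   - Draw sample $S_{ab}^*$ from $U_a^*$ analogous to the design used to draw $S$ from $U$. If $k \in U_a^*$ is a copy of $i \in S$ then its inclusion probability is $\pi_k^* \propto \pi_i$.
   - Analogously to $\hat{\theta} = t(S, U_1)$, construct replicate $\hat{\theta}_{ab}^* = t(S_{ab}^*, U_1)$.
3. Compute the variance estimate for $\hat{\theta}$ based on pseudo-population $U_a^*$ as:
   $v_a(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}_{ab}^* - \bar{\theta}_a^* \right)^2$, with $\bar{\theta}_a^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_{ab}^*$.
Finally compute: $\mathrm{var}_{\mathrm{boot}}(\hat{\theta}) = \frac{1}{A} \sum_{a=1}^{A} v_a(\hat{\theta})$.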
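A compact Python sketch of the full procedure with its outer loop over $A$ pseudo-populations and inner loop over $B$ resamples. It assumes simple random sampling without replacement with equal inclusion probabilities, a fixed resample size $n$, and no admin. data; the names `build_pseudo_population` and `pseudo_population_bootstrap` and the toy values are hypothetical.

```python
import math
import random
import statistics

def build_pseudo_population(sample, pis, rng):
    """Step 1: omega_i copies of unit i, with omega_i a random rounding of 1/pi_i."""
    u_star = []
    for unit, pi in zip(sample, pis):
        w = 1.0 / pi
        base = math.floor(w)
        copies = base + (1 if rng.random() < w - base else 0)
        u_star.extend([unit] * copies)
    return u_star

def pseudo_population_bootstrap(sample, pis, t, A=5, B=200, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    v = []
    for _ in range(A):
        u_star = build_pseudo_population(sample, pis, rng)
        # Step 2 (assumed design): SRS without replacement of size n from U*_a
        replicates = [t(rng.sample(u_star, n)) for _ in range(B)]
        v.append(statistics.variance(replicates))   # Step 3: v_a, divisor B - 1
    return sum(v) / A                               # average over pseudo-populations

sample_y = [3.0, 7.5, 1.2, 4.4, 6.1, 2.8, 5.0, 3.9]   # toy sample, n = 8
pis = [0.4] * len(sample_y)                            # w_i = 2.5, not an integer
v_boot = pseudo_population_bootstrap(sample_y, pis, lambda s: 2.5 * sum(s), A=5, B=200, seed=1)
```

Averaging over several pseudo-populations smooths out the extra randomness introduced by the random inflation weights; with integer weights every $U_a^*$ is identical and $A = 1$ suffices.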
Bootstrap
Key step: analogously to $\hat{\theta} = t(S, U_1)$, construct replicate $\hat{\theta}_{ab}^* = t(S_{ab}^*, U_1)$
Example: Dutch Virtual Census with mass imputation
- Original estimator:
  $\hat{\theta}_{hc} = \sum_{i \in U_1} h_i y_{ci} + \sum_{i \in S_2} h_i y_{ci} + \sum_{i \in U_2 \setminus S_2} h_i \hat{y}_{ci}$
- Construction of bootstrap replicate $\hat{\theta}_{hc,ab}^*$:
  - $U_{2a}^*$ is the subpopulation of $U_a^*$ consisting of copies of units from $S_2$
  - $S_{2ab}^* = S_{ab}^* \cap U_{2a}^*$; note: size of overlap is random
  - Use $S_{2ab}^*$ to re-estimate the imputation model for $y_1, \ldots, y_C$
  - Impute the missing values of $y_1, \ldots, y_C$ in $U_{2a}^* \setminus S_{2ab}^*$
  - Compute:
    $\hat{\theta}_{hc,ab}^* = \sum_{k \in U_1} h_k y_{ck} + \sum_{k \in S_{2ab}^*} h_k y_{ck} + \sum_{k \in U_{2a}^* \setminus S_{2ab}^*} h_k \hat{y}_{ck}$
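The construction of one replicate can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the imputation model is a simple stand-in (level shares within groups of one auxiliary variable `x`) rather than logistic regression, resampling is simple random sampling without replacement, the uniform fallback for `x` groups absent from the resample is a hypothetical choice, and the flag `in_u2` and all function names are invented for the example.

```python
import random
from collections import defaultdict

def fit_proportions(units, n_levels):
    """Stand-in imputation model: share of each level c within each x group."""
    counts = defaultdict(lambda: [0] * n_levels)
    for u in units:
        counts[u["x"]][u["y"]] += 1
    return {x: [k / sum(row) for k in row] for x, row in counts.items()}

def replicate(theta_u1, u_star, n, h, c, n_levels, rng):
    """One replicate: fixed U1 term + observed part in S*_2ab + re-imputed rest of U*_2a."""
    drawn = set(rng.sample(range(len(u_star)), n))          # S*_ab as indices into U*_a
    s2_star = [u for k, u in enumerate(u_star) if k in drawn and u["in_u2"]]
    rest = [u for k, u in enumerate(u_star) if k not in drawn and u["in_u2"]]
    model = fit_proportions(s2_star, n_levels)              # re-estimated on S*_2ab
    uniform = [1.0 / n_levels] * n_levels                   # hypothetical fallback
    observed = sum(h(u) * (u["y"] == c) for u in s2_star)
    imputed = sum(h(u) * model.get(u["x"], uniform)[c] for u in rest)
    return theta_u1 + observed + imputed

rng = random.Random(42)
sample = [{"x": "a", "y": 1, "in_u2": True} for _ in range(6)]   # degenerate toy sample
u_star = [dict(u) for u in sample for _ in range(5)]             # 5 copies of each unit
theta_star = replicate(0, u_star, 6, lambda u: 1, 1, 2, rng)
```

In the degenerate toy sample every unit has level 1, so the replicate must equal the pseudo-population size of 30 whatever is drawn. The term `theta_u1` stays the same in every replicate, which reflects the assumption that $U_1$ is fixed.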
Simulation study

True counts, by educational attainment:
| age (years)    | low | medium | high |
| young (15–35)  | 330 | 795    | 400  |
| middle (36–55) | 115 | 560    | 480  |
| old (56+)      | 120 | 525    |      |

True standard deviations, by educational attainment:
| age (years)    | low  | medium | high |
| young (15–35)  | 34.5 | 42.2   | 36.8 |
| middle (36–55) | 22.3 | 36.1   |      |
| old (56+)      | 22.8 | 35.6   |      |

- Synthetic target population of $N = 3725$ persons
- Simple random sample of size $n = N/5 = 745$; no admin. data
- Mass imputation based on logistic regression: Gender × (Age + Income)
- True standard deviations: estimated by repeating sampling and imputing 20 000 times
Simulation study
- Analytical variance approximation (Scholtus, 2018), repeated 100 times
- Bootstrap procedure with $A = 1$, $B = 200$, repeated 100 times

Estimated analytical st. dev., by educational attainment:
| age (years)    | low  | medium | high |
| young (15–35)  | 32.2 | 39.5   | 34.5 |
| middle (36–55) | 20.6 | 34.0   | 33.3 |
| old (56+)      | 21.1 | 32.8   | 31.8 |

Estimated bootstrap st. dev., by educational attainment:
| age (years)    | low  | medium | high |
| young (15–35)  | 34.1 | 41.9   | 36.4 |
| middle (36–55) | 22.7 | 36.6   | 36.0 |
| old (56+)      | 22.5 | 35.2   |      |
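As a rough reading aid, the relative deviation of the two estimates from the true standard deviations can be computed for the "young (15–35)" row, the only row whose values are complete on the slides. The snippet below is plain arithmetic on those reported values; the helper name `rel_diff_percent` is hypothetical.

```python
true_sd = [34.5, 42.2, 36.8]      # young (15-35): low, medium, high
analytical = [32.2, 39.5, 34.5]
bootstrap = [34.1, 41.9, 36.4]

def rel_diff_percent(estimates, truth):
    """Relative difference (estimate - truth) / truth, in percent."""
    return [100.0 * (e - t) / t for e, t in zip(estimates, truth)]

analytical_diff = rel_diff_percent(analytical, true_sd)
bootstrap_diff = rel_diff_percent(bootstrap, true_sd)
```

For this row the analytical approximation lies roughly 6 to 7 percent below the true standard deviations, while the bootstrap estimates lie about 1 percent below.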
Conclusion
- Bootstrap method for estimating the accuracy of statistics based on combined administrative and survey data
- Advantage over analytical variance estimation: flexibility
- Possible disadvantage: computational workload
- Future work:
  - Simulation study with real Dutch Census data (in progress)
  - Extending the method to account for additional sources of uncertainty:
    - Micro-integration of survey and admin. data in the overlapping part
    - Measurement error
    - …
References
- J.G. Booth, R.W. Butler, and P. Hall (1994), Bootstrap Methods for Finite Populations. Journal of the American Statistical Association 89, 1282–1289.
- G. Chauvet (2007), Méthodes de Bootstrap en Population Finie. PhD Thesis (in French), L’Université de Rennes.
- T. de Waal and J. Daalmans (2018), Mass Imputation for Census Estimation: Methodology. Report, Statistics Netherlands.
- B. Efron (1979), Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
- L. Kuijvenhoven and S. Scholtus (2011), Bootstrapping Combined Estimators based on Register and Sample Survey Data. Discussion Paper, Statistics Netherlands.
- Z. Mashreghi, D. Haziza, and C. Léger (2016), A Survey of Bootstrap Methods in Finite Population Sampling. Statistics Surveys 10, 1–52.
- S. Scholtus (2018), Variances of Census Tables after Mass Imputation. Discussion Paper, Statistics Netherlands.