A bootstrap method for estimators based on combined administrative and survey data Sander Scholtus (Statistics Netherlands) NTTS Conference 13 March 2019.

1 A bootstrap method for estimators based on combined administrative and survey data
Sander Scholtus (Statistics Netherlands) NTTS Conference 13 March 2019

2 Dutch Virtual Census
[Diagram: availability of $x$ and education level ($y$) for $U_1$, $S_2$ and $U_2$; "?" marks the part where $y$ is not observed]
Number of cases (2011 Census):
- 0-14 year olds: ±2.9 million
- $U_1$, admin. data: ±6.5 million
- LFS ($S_2$): ±0.34 million
- $U_2$ (?): ±6.9 million
- total: ±16.7 million

3 Dutch Virtual Census
Goal: estimate tables of frequencies involving education level
Typical element: $\theta_{hc} = \sum_{i \in U} h_i y_{ci}$, with $h_i \in \{0,1\}$, $y_{ci} \in \{0,1\}$, $c \in \{1,\dots,C\}$
[Table layout: rows = other variables ($1, \dots, h, \dots, H$); columns = educational attainment ($\boldsymbol{y}$): level 1, …, level $c$, …, level $C$; cell $(h, c)$ contains $\theta_{hc}$]

4 Dutch Virtual Census
Goal: estimate tables of frequencies involving education level
Typical element: $\theta_{hc} = \sum_{i \in U} h_i y_{ci}$, with $h_i \in \{0,1\}$, $y_{ci} \in \{0,1\}$, $c \in \{1,\dots,C\}$
Proposal (De Waal and Daalmans, 2018): use mass imputation
- Estimate a model (e.g., logistic regression) for $y_1,\dots,y_C$ based on $S_2$
- For each $i \in U_2 \setminus S_2$, impute predictions $\hat{y}_{1i},\dots,\hat{y}_{Ci}$ based on this model
- Estimator for $\theta_{hc}$: $\hat\theta_{hc} = \theta_{hc1} + \hat\theta_{hc2} = \sum_{i \in U_1} h_i y_{ci} + \sum_{i \in S_2} h_i y_{ci} + \sum_{i \in U_2 \setminus S_2} h_i \hat{y}_{ci}$
Question: how to evaluate the variance of $\hat\theta_{hc}$?
- Analytical approximation: possible but cumbersome (Scholtus, 2018)
- Bootstrap procedure
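The mass-imputation estimator on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: units are hypothetical tuples `(h, x, y)`, the "predictions" are taken to be predicted class probabilities, and the logistic regression is replaced by plain class proportions within each covariate group `x`, which is a much simpler stand-in model. All function and variable names are invented for the example.

```python
from collections import Counter, defaultdict

def fit_group_proportions(S2, C):
    """Estimate P(y = c | x) from sample units (h, x, y) by cell proportions.

    Stand-in for the logistic regression model mentioned on the slide.
    """
    counts = defaultdict(Counter)
    for _, x, y in S2:
        counts[x][y] += 1
    return {x: [cnt[c] / sum(cnt.values()) for c in range(C)]
            for x, cnt in counts.items()}

def mass_imputation_estimate(U1, S2, U2_rest, C):
    """Return [theta_hat_hc for c in 0..C-1] for one fixed row indicator h."""
    probs = fit_group_proportions(S2, C)
    theta = [0.0] * C
    for h, _, y in U1 + S2:        # education level observed: count it directly
        theta[y] += h
    for h, x, _ in U2_rest:        # education level missing: add predictions
        # Raises KeyError if group x never occurs in S2 (not handled here).
        for c in range(C):
            theta[c] += h * probs[x][c]
    return theta

# Tiny hypothetical data set with C = 3 education levels and groups "a", "b".
U1 = [(1, "a", 0), (1, "b", 2), (0, "a", 1)]
S2 = [(1, "a", 0), (1, "a", 1), (1, "b", 2), (0, "b", 2)]
U2_rest = [(1, "a", None), (1, "b", None), (0, "a", None)]
theta_hat = mass_imputation_estimate(U1, S2, U2_rest, C=3)
```

Because the imputed probabilities sum to one per person, the estimated row total equals the number of persons with $h_i = 1$, which is a convenient sanity check.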

5 General set-up
Target population $U = U_1 \cup U_2$
Subpopulation $U_1$:
- Admin. data available
- Considered fixed (no variance)
Probability sample $S$:
- May have overlap with admin. data: $S_1 = S \cap U_1$; $S_2 = S \cap U_2$
- Inclusion probabilities $\pi_i$ known
Estimator of interest: $\hat\theta = t(U_1, S)$
[Diagram: $U_1$ and $U_2$ with sample $S = S_1 \cup S_2$; "?" marks the unobserved part of $U_2$]

6 Bootstrap
Classical bootstrap (Efron, 1979):
- Estimator $\hat\theta = t(S)$, with $S$ an i.i.d. random sample of size $n$ from a distribution $F$
- Resampling:
  - Draw a with-replacement sample $S_b^*$ of size $n$ from $S$
  - Compute $\hat\theta_b^* = t(S_b^*)$
  - Repeat this a large number of times ($b = 1, 2, \dots, B$)
- Bootstrap estimator for the variance of $\hat\theta$:
  $\mathrm{var}_{\mathrm{boot}}(\hat\theta) = \frac{1}{B-1} \sum_{b=1}^{B} (\hat\theta_b^* - \bar\theta^*)^2$, with $\bar\theta^* = \frac{1}{B} \sum_{b=1}^{B} \hat\theta_b^*$
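As a minimal sketch of the classical bootstrap described on this slide, the following uses only the Python standard library. The data, the choice of the sample mean as estimator $t$, and all names are assumptions made for illustration.

```python
import random
import statistics

def classical_bootstrap_variance(sample, estimator, B=2000, seed=0):
    """Bootstrap variance of estimator(sample) for an i.i.d. sample."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(B):
        # Draw a with-replacement sample S*_b of size n from S.
        resample = rng.choices(sample, k=n)
        replicates.append(estimator(resample))
    # statistics.variance uses the 1/(B-1) divisor, as in the formula.
    return statistics.variance(replicates)

# Hypothetical i.i.d. data; the estimator t is the sample mean.
data_rng = random.Random(42)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]
v_boot = classical_bootstrap_variance(sample, statistics.fmean)
# Textbook variance estimate of the mean, for comparison.
v_textbook = statistics.variance(sample) / len(sample)
```

For the sample mean the two numbers should be close, which makes this a useful check before moving to estimators without a simple variance formula.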

7 Bootstrap
Classical bootstrap does not account for:
- Finite-population sampling
- Complex survey design
Different extensions of the bootstrap are available
- Overview: Mashreghi, Haziza and Léger (2016)
Here: extension based on pseudo-populations
- Theory: Booth, Butler and Hall (1994), Chauvet (2007)
- Previous application: Kuijvenhoven and Scholtus (2011)

8 Bootstrap
First assume: $w_i = 1/\pi_i$ is always integer-valued
Bootstrap algorithm:
1. Create a pseudo-population $U^*$ by taking $w_i$ copies of each unit $i \in S$.
2. For each $b = 1,\dots,B$ do the following:
   - Draw sample $S_b^*$ from $U^*$ analogous to the design used to draw $S$ from $U$. If $k \in U^*$ is a copy of $i \in S$ then its inclusion probability is $\pi_k^* \propto \pi_i$.
   - Analogously to $\hat\theta = t(S, U_1)$, construct replicate $\hat\theta_b^* = t(S_b^*, U_1)$.
3. Compute the variance estimate for $\hat\theta$ based on pseudo-population $U^*$ as:
   $\mathrm{var}_{\mathrm{boot}}(\hat\theta) = \frac{1}{B-1} \sum_{b=1}^{B} (\hat\theta_b^* - \bar\theta^*)^2$, with $\bar\theta^* = \frac{1}{B} \sum_{b=1}^{B} \hat\theta_b^*$.
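The integer-weight algorithm can be illustrated for the simplest design. The sketch below assumes simple random sampling without replacement, no admin. data ($U_1$ empty), and the weighted total as estimator. These choices, the data and the function names are assumptions for the example, not part of the presentation.

```python
import random
import statistics

def pseudo_population(sample, weights):
    """Step 1: take w_i copies of each sampled unit (integer weights only)."""
    return [y for y, w in zip(sample, weights) for _ in range(w)]

def bootstrap_variance_srs(sample, N, B=1000, seed=0):
    """Steps 2 and 3 for the estimated total under simple random sampling."""
    rng = random.Random(seed)
    n = len(sample)
    if N % n != 0:
        raise ValueError("this sketch needs an integer weight N/n")
    U_star = pseudo_population(sample, [N // n] * n)
    replicates = []
    for _ in range(B):
        # Same design as the original draw: n units without replacement.
        S_star = rng.sample(U_star, n)
        # Same estimator as for S: weight each drawn unit by N/n.
        replicates.append(N / n * sum(S_star))
    # statistics.variance divides by B - 1, as in step 3.
    return statistics.variance(replicates)

data_rng = random.Random(1)
sample = [data_rng.gauss(50.0, 10.0) for _ in range(200)]  # hypothetical data
N = 1000
v_boot = bootstrap_variance_srs(sample, N)
# Textbook SRS variance estimator of the total, for comparison.
v_srs = N**2 * (1 - len(sample) / N) * statistics.variance(sample) / len(sample)
```

Resampling without replacement from the pseudo-population is what brings the finite-population correction $(1 - n/N)$ into the bootstrap variance, which the classical with-replacement bootstrap would miss.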

9 Bootstrap
General case: $w_i = 1/\pi_i = \lfloor w_i \rfloor + \varphi_i$, with $\lfloor w_i \rfloor \in \mathbb{N}$, $\varphi_i \in [0,1)$
Bootstrap algorithm:
1. Create a pseudo-population $U^*$ by taking $\omega_i$ copies of each unit $i \in S$.
   Random inflation weight: $\omega_i = \lfloor w_i \rfloor$ with probability $1 - \varphi_i$, $\omega_i = \lfloor w_i \rfloor + 1$ with probability $\varphi_i$.
2. For each $b = 1,\dots,B$ do the following:
   - Draw sample $S_b^*$ from $U^*$ analogous to the design used to draw $S$ from $U$. If $k \in U^*$ is a copy of $i \in S$ then its inclusion probability is $\pi_k^* \propto \pi_i$.
   - Analogously to $\hat\theta = t(S, U_1)$, construct replicate $\hat\theta_b^* = t(S_b^*, U_1)$.
3. Compute the variance estimate for $\hat\theta$ based on pseudo-population $U^*$ as:
   $\mathrm{var}_{\mathrm{boot}}(\hat\theta) = \frac{1}{B-1} \sum_{b=1}^{B} (\hat\theta_b^* - \bar\theta^*)^2$, with $\bar\theta^* = \frac{1}{B} \sum_{b=1}^{B} \hat\theta_b^*$.
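The random inflation weight is a randomized rounding of $w_i$, chosen so that its expectation equals $w_i$. A small sketch (the function name and the example value $\pi_i = 0.3$ are assumptions for illustration):

```python
import math
import random

def random_inflation_weight(pi, rng):
    """Round w = 1/pi down or up at random so that E[omega] = w."""
    w = 1.0 / pi
    base = math.floor(w)
    phi = w - base                      # fractional part, in [0, 1)
    # Round up with probability phi, down with probability 1 - phi.
    return base + 1 if rng.random() < phi else base

rng = random.Random(0)
draws = [random_inflation_weight(0.3, rng) for _ in range(20000)]
mean_omega = sum(draws) / len(draws)    # should be close to 1/0.3
```

With $\pi_i = 0.3$ every draw is 3 or 4, and the long-run average is about 3.33, so the expected size of the pseudo-population matches the sum of the design weights.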

10 Bootstrap
General case: $w_i = 1/\pi_i = \lfloor w_i \rfloor + \varphi_i$, with $\lfloor w_i \rfloor \in \mathbb{N}$, $\varphi_i \in [0,1)$
Bootstrap algorithm:
For each $a = 1,\dots,A$ do the following:
1. Create a pseudo-population $U_a^*$ by taking $\omega_i$ copies of each unit $i \in S$.
   Random inflation weight: $\omega_i = \lfloor w_i \rfloor$ with probability $1 - \varphi_i$, $\omega_i = \lfloor w_i \rfloor + 1$ with probability $\varphi_i$.
2. For each $b = 1,\dots,B$ do the following:
   - Draw sample $S_{ab}^*$ from $U_a^*$ analogous to the design used to draw $S$ from $U$. If $k \in U_a^*$ is a copy of $i \in S$ then its inclusion probability is $\pi_k^* \propto \pi_i$.
   - Analogously to $\hat\theta = t(S, U_1)$, construct replicate $\hat\theta_{ab}^* = t(S_{ab}^*, U_1)$.
3. Compute the variance estimate for $\hat\theta$ based on pseudo-population $U_a^*$ as:
   $v_a(\hat\theta) = \frac{1}{B-1} \sum_{b=1}^{B} (\hat\theta_{ab}^* - \bar\theta_a^*)^2$, with $\bar\theta_a^* = \frac{1}{B} \sum_{b=1}^{B} \hat\theta_{ab}^*$.
Finally compute: $\mathrm{var}_{\mathrm{boot}}(\hat\theta) = \frac{1}{A} \sum_{a=1}^{A} v_a(\hat\theta)$.
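Putting the steps together, the double loop over $a$ and $b$ might look as follows. This is a sketch under assumptions that are not in the slides: simple random sampling without replacement, no admin. data, resamples of the original size $n$, and an estimator passed in as a plain function of the resample. All names are hypothetical.

```python
import math
import random
import statistics

def pseudo_population_bootstrap(sample, pis, estimator, A=5, B=200, seed=0):
    """Average of A within-pseudo-population bootstrap variances v_a."""
    rng = random.Random(seed)
    n = len(sample)
    v = []
    for _ in range(A):
        # Step 1: pseudo-population U*_a with randomly rounded weights.
        U_star = []
        for unit, pi in zip(sample, pis):
            w = 1.0 / pi
            base = math.floor(w)
            omega = base + 1 if rng.random() < w - base else base
            U_star.extend([unit] * omega)
        # Step 2: B resamples drawn like the original sample (SRS, size n).
        reps = [estimator(rng.sample(U_star, n)) for _ in range(B)]
        # Step 3: variance over replicates, with divisor B - 1.
        v.append(statistics.variance(reps))
    # Final step: average the A variance estimates.
    return statistics.fmean(v)

data_rng = random.Random(3)
N, n = 1000, 300                       # N/n is not an integer here
sample = [data_rng.gauss(50.0, 10.0) for _ in range(n)]  # hypothetical data
pis = [n / N] * n
v_boot = pseudo_population_bootstrap(
    sample, pis, lambda s: N * statistics.fmean(s))
# Textbook SRS variance estimator of the total, for comparison.
v_srs = N**2 * (1 - n / N) * statistics.variance(sample) / n
```

Averaging over several pseudo-populations smooths out the extra randomness introduced by the random rounding of the weights; when all weights are integers, $A = 1$ is enough because every pseudo-population is identical.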

11 Bootstrap
Key step: analogously to $\hat\theta = t(S, U_1)$, construct replicate $\hat\theta_{ab}^* = t(S_{ab}^*, U_1)$
Example: Dutch Virtual Census with mass imputation
Original estimator: $\hat\theta_{hc} = \sum_{i \in U_1} h_i y_{ci} + \sum_{i \in S_2} h_i y_{ci} + \sum_{i \in U_2 \setminus S_2} h_i \hat{y}_{ci}$
Construction of bootstrap replicate $\hat\theta_{hc,ab}^*$:
- $U_{2a}^*$ is the subpopulation of $U_a^*$ consisting of copies of units from $S_2$
- $S_{2ab}^* = S_{ab}^* \cap U_{2a}^*$; note: size of overlap is random
- Use $S_{2ab}^*$ to re-estimate the imputation model for $y_1,\dots,y_C$
- Impute the missing values of $y_1,\dots,y_C$ in $U_{2a}^* \setminus S_{2ab}^*$
- Compute: $\hat\theta_{hc,ab}^* = \sum_{k \in U_1} h_k y_{ck} + \sum_{k \in S_{2ab}^*} h_k y_{ck} + \sum_{k \in U_{2a}^* \setminus S_{2ab}^*} h_k \hat{y}_{ck}$
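One replicate of this construction could be sketched as below. Several simplifications are assumed: the sample does not overlap with the admin. data (so $U_a^* = U_{2a}^*$ and the resample size is fixed rather than random), the design is simple random sampling, units are hypothetical tuples `(h, x, y)`, and the imputation model is a table of class proportions per covariate group rather than a logistic regression. The pooled-proportion fallback for groups missing from a resample is also an addition for the example.

```python
import random
from collections import Counter, defaultdict

def fit_group_proportions(units, C):
    """Estimate P(y = c | x) by cell proportions; key None holds pooled ones."""
    counts = defaultdict(Counter)
    for _, x, y in units:
        counts[x][y] += 1
        counts[None][y] += 1           # pooled counts, used as a fallback
    return {x: [cnt[c] / sum(cnt.values()) for c in range(C)]
            for x, cnt in counts.items()}

def bootstrap_replicate(U1, U2_star, n, C, rng):
    """Compute one replicate [theta*_hc for c in 0..C-1]."""
    drawn = set(rng.sample(range(len(U2_star)), n))
    S2_star = [u for k, u in enumerate(U2_star) if k in drawn]
    rest = [u for k, u in enumerate(U2_star) if k not in drawn]
    # Re-estimate the imputation model on the resample only.
    probs = fit_group_proportions(S2_star, C)
    theta = [0.0] * C
    for h, _, y in U1 + S2_star:       # observed education levels
        theta[y] += h
    for h, x, _ in rest:               # copied y is hidden and re-imputed
        p = probs.get(x, probs[None])
        for c in range(C):
            theta[c] += h * p[c]
    return theta

rng = random.Random(7)
C = 3
# Hypothetical data: every unit has h = 1, groups "a"/"b", levels 0..2.
U1 = [(1, rng.choice("ab"), rng.randrange(C)) for _ in range(30)]
S2 = [(1, rng.choice("ab"), rng.randrange(C)) for _ in range(40)]
U2_star = [u for u in S2 for _ in range(5)]    # integer weights w_i = 5
theta_star = bootstrap_replicate(U1, U2_star, n=len(S2), C=C, rng=rng)
```

Re-fitting the model inside every replicate is what lets the bootstrap variance reflect the uncertainty of the imputation model itself, not only the sampling of observed values.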

12 Simulation study
True counts and true standard deviations by age and educational attainment:

                  true counts             true standard deviations
age (years)       low   medium   high     low    medium   high
young (15–35)     330   795      400      34.5   42.2     36.8
middle (36–55)    115   560      480      22.3   36.1     …
old (56+)         120   525      …        22.8   35.6     …

- Synthetic target population of $N = 3725$ persons
- Simple random sample of size $n = N/5 = 745$; no admin. data
- Mass imputation based on logistic regression: Gender × (Age + Income)
- True standard deviations: estimated by repeated sampling and imputation

13 Simulation study
True counts and true standard deviations by age and educational attainment:

                  true counts             true standard deviations
age (years)       low   medium   high     low    medium   high
young (15–35)     330   795      400      34.5   42.2     36.8
middle (36–55)    115   560      480      22.3   36.1     …
old (56+)         120   525      …        22.8   35.6     …

- Analytical variance approximation (Scholtus, 2018), repeated 100 times
- Bootstrap procedure with $A = 1$, $B = 200$, repeated 100 times

                  estimated analytical st. dev.   estimated bootstrap st. dev.
age (years)       low    medium   high            low    medium   high
young (15–35)     32.2   39.5     34.5            34.1   41.9     36.4
middle (36–55)    20.6   34.0     33.3            22.7   36.6     36.0
old (56+)         21.1   32.8     31.8            22.5   35.2     …

14 Conclusion
Bootstrap method for estimating accuracy of statistics based on combined administrative and survey data
- Advantage over analytical variance estimation: flexibility
- Possible disadvantage: computational workload
Future work:
- Simulation study with real Dutch Census data (in progress)
- Extending method to account for additional sources of uncertainty:
  - Micro-integration of survey and admin. data in overlapping part
  - Measurement error
  - …

15 References
J.G. Booth, R.W. Butler, and P. Hall (1994), Bootstrap Methods for Finite Populations. Journal of the American Statistical Association 89, 1282–1289.
G. Chauvet (2007), Méthodes de Bootstrap en Population Finie. PhD Thesis (in French), L'Université de Rennes.
T. de Waal and J. Daalmans (2018), Mass Imputation for Census Estimation: Methodology. Report, Statistics Netherlands.
B. Efron (1979), Bootstrap methods: another look at the jack-knife. The Annals of Statistics 7, 1–26.
L. Kuijvenhoven and S. Scholtus (2011), Bootstrapping Combined Estimators based on Register and Sample Survey Data. Discussion Paper, Statistics Netherlands.
Z. Mashreghi, D. Haziza, and C. Léger (2016), A Survey of Bootstrap Methods in Finite Population Sampling. Statistics Surveys 10, 1–52.
S. Scholtus (2018), Variances of Census Tables after Mass Imputation. Discussion Paper, Statistics Netherlands.

