Data Imputation and Estimation for the Austrian Register Based Census Test
Reinhard Fiedler, Peter Schodl
April 23rd, 2008
© STATISTIK AUSTRIA, www.statistik.at
Welcome 4/23/2008
Introduction
  Background Information
    Time Plan
    Pros and Cons
    Registers used for RBCT (Register Based Census Test)
  Estimation procedures
    Record Linkage
    Estimation
    Hot-deck technique
    Clustering
Background Information: Time Plan
  2001: last conventional census
  31 October 2006: reference date for RBCT
  April 2008: first report RBCT
  2010: first register-based census
Background Information: Pros and cons
  Pros:
    economic efficiency
    faster
    can be carried out more often
    lower burden on respondents
    privacy
  Cons:
    incomplete data
    inconsistent data
    timeliness
Background Information: Registers used for RBCT
  8 base registers, e.g.
    Central Population Register (CPR)
    Central Social Security Register (CSSR)
    Register of Educational Attainment
  7 comparison registers for cross-checks, e.g.
    Register of Family Allowance
    Register of Social Welfare
  Linkage by unique keys
    Branch-specific identification number (bPK) (a specific personal code)
    Social Security Number (RBCT)
Background Information: Missing data
  Low missing rates (covered by more than one data source)
    Sex (<1% missing)
    Date of birth (<1% missing)
  Medium to high missing rates
    Marital status (11% missing)
    Graduates (7% missing)
  Not included in any register
    Occupation (100% missing)
Estimation Strategy
  Record Linkage
    For all registers
  Estimation
    Marital status (high missing rate)
    Occupation (not contained in any register)
    Graduates (immigrants since last census)
Record Linkage
  Problem: imperfect linkage of registers
    Wrong or missing keys
  Attributes used:
    Date of birth
    Address
    Nationality
    Sex
  Standardization of notations
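
The fallback described on this slide can be sketched as follows. This is only a minimal illustration, not the procedure actually used for the RBCT: the field names (`key`, `date_of_birth`, `address`, `nationality`, `sex`), the exact-match rule, and the normalization steps are all assumptions made for the example.

```python
import unicodedata

def normalize(text):
    """Standardize notation: strip accents, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())

def match_key(record):
    """Fallback key built from the four attributes named on the slide."""
    return (
        record["date_of_birth"],
        normalize(record["address"]),
        normalize(record["nationality"]),
        record["sex"],
    )

def link(register_a, register_b):
    """Pair each record of register_a with a record of register_b, or None.

    The unique key is tried first. The attribute key is used only when the
    unique key is missing or unknown, and only when it is unambiguous.
    """
    by_id = {r["key"]: r for r in register_b if r.get("key")}
    by_attrs = {}
    for r in register_b:
        by_attrs.setdefault(match_key(r), []).append(r)

    links = []
    for rec in register_a:
        partner = by_id.get(rec.get("key"))
        if partner is None:
            candidates = by_attrs.get(match_key(rec), [])
            partner = candidates[0] if len(candidates) == 1 else None
        links.append((rec, partner))
    return links

# Hypothetical records, invented for illustration only.
population = [
    {"key": "P1", "date_of_birth": "1990-05-01", "address": "Markt 1, Innsbruck",
     "nationality": "Austria", "sex": "f"},
    {"key": None, "date_of_birth": "1985-02-11", "address": "Ring 5,  Wien",
     "nationality": "AUSTRIA", "sex": "m"},
]
education = [
    {"key": "P1", "date_of_birth": "1990-05-01", "address": "Markt 1, Innsbruck",
     "nationality": "Austria", "sex": "f"},
    {"key": "E77", "date_of_birth": "1985-02-11", "address": "ring 5, wien",
     "nationality": "Austria", "sex": "m"},
]
links = link(population, education)
```

In this toy data the second person has no key and is linked only because the standardized address and nationality agree. A production system would likely need fuzzy comparison and clerical review, which this sketch leaves out.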
Record Linkage: Example
  Current school enrolment
  Record linkage reduces the number of school-age people without a current school enrolment by 40%
Estimation: Occupation and graduates
  Graduates
    Source: RBCT itself
    6,600,000 people with graduation
    409,000 people with missing graduation
  Occupation
    Source: Labour Force Survey
    Quarterly sample survey
    About 35,000 people with occupation in the survey
    3,800,000 people with missing occupation (all working persons)
Estimation: Basic idea
  Same procedure for estimation of occupation and graduates
  Estimation on person level
  Target distribution
  Groups are built to transfer the distribution of the source to the corresponding group of the target
  Groups are formed by attributes with significant influence on the target variable
Hot-deck technique: Example
  1000 people from 30 to 34 years living in Tyrol form one deck
  Labour Force Survey:
    200 with occupation A
    300 with occupation B
    500 with occupation C
  The weighting scheme is applied to all people within the deck in the RBCT:
    20% probability for occupation A
    30% probability for occupation B
    50% probability for occupation C
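
The example above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RBCT implementation: the deck is defined by the hypothetical fields `age_group` and `region`, and each target person draws an occupation at random with probabilities equal to the source shares in the same deck.

```python
import random
from collections import Counter, defaultdict

def deck_key(person):
    """Assumed deck definition: age group and region, as in the slide example."""
    return (person["age_group"], person["region"])

def build_decks(source):
    """Count occupations per deck in the source (here: the survey)."""
    decks = defaultdict(Counter)
    for person in source:
        decks[deck_key(person)][person["occupation"]] += 1
    return decks

def impute(target, decks, rng):
    """Assign each target person an occupation drawn from the deck's distribution."""
    result = []
    for person in target:
        counts = decks.get(deck_key(person))
        if not counts:
            # No donor in this deck: leave the value missing.
            result.append({**person, "occupation": None})
            continue
        values = list(counts)
        weights = [counts[v] for v in values]
        drawn = rng.choices(values, weights=weights, k=1)[0]
        result.append({**person, "occupation": drawn})
    return result

# Source deck from the slide: 200 x A, 300 x B, 500 x C.
source = [
    {"age_group": "30-34", "region": "Tyrol", "occupation": occ}
    for occ, n in (("A", 200), ("B", 300), ("C", 500))
    for _ in range(n)
]
# Hypothetical target: 10,000 register persons in the same deck.
target = [{"age_group": "30-34", "region": "Tyrol"} for _ in range(10_000)]

rng = random.Random(2008)  # fixed seed so the sketch is reproducible
decks = build_decks(source)
imputed = impute(target, decks, rng)
shares = Counter(p["occupation"] for p in imputed)
```

With a large target deck the imputed shares come out close to 20%, 30% and 50%, while each individual value remains a random draw. Persons whose deck has no donor stay missing, which is the problem the clustering slides address.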
Which attributes have influence?
  Graduates
    Age
    Status in employment
    Sex
    Nationality
    Urban / rural environment
  Occupation
    Age
    Status in employment
    Sex
    Nationality
    Region
    NACE of employment
    Level of educational achievement
Clustering: groups must not be too small
  No donor for many persons
  Wrong distribution
  Example:
    Source:
      Tyrol, male, 87 years, German nationality: 10 persons
        5 occupation A (50% A)
        5 occupation B (50% B)
      Tyrol, female, 87 years, German nationality: 1 person
        1 occupation B (100% B)
    Target:
      Tyrol, male, 87 years, German nationality: 1000 persons
        500 A, 500 B
      Tyrol, female, 87 years, German nationality: 1000 persons
        1000 B
    Resulting total for Tyrol, 87 years, German nationality (male and female):
      500 occupation A
      1500 occupation B
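
The arithmetic of this example can be checked with a short sketch. It computes expected counts rather than random draws. The pooled variant at the end is added here only as a contrast and is an assumption of this illustration, not a grouping proposed on the slide.

```python
def expected_counts(source_counts, target_size):
    """Scale a source group's occupation counts to the size of the target group."""
    total = sum(source_counts.values())
    return {occ: target_size * n / total for occ, n in source_counts.items()}

# Source and target group sizes taken from the slide.
source = {"male": {"A": 5, "B": 5}, "female": {"B": 1}}
target_sizes = {"male": 1000, "female": 1000}

# Separate groups: the single female donor fixes 1000 target persons to B.
separate = {}
for sex, counts in source.items():
    for occ, n in expected_counts(counts, target_sizes[sex]).items():
        separate[occ] = separate.get(occ, 0) + n

# Contrast (assumption): one pooled group of 11 donors for all 2000 persons.
pooled_source = {"A": 5, "B": 6}
pooled = expected_counts(pooled_source, sum(target_sizes.values()))
```

Here `separate` reproduces the 500 A and 1500 B of the slide, while the pooled group would give roughly 909 A and 1091 B. The gap comes entirely from one donor determining the distribution for 1000 persons.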
Clustering: groups must not be too big
  Correct distribution only on the highest level, incorrect distribution on lower levels
  Example: 2 groups, male / female
    The distributions for males and for females are transferred to the target
    The distribution of source and target is then the same for males and for females
    But: the distribution for regions, age, … can be incorrect!
  Optimal groups by cluster analysis
Clustering: Occupation
  First clustering: variables with many values (age, nationality, region, …)
  Second clustering: since the groups after the first clustering are too small, the groups are clustered again
  [Figure: first and second aggregation over nationality and age]
Results
  Graduates:
    No second clustering, 457 groups with about 14,000 persons in each group
    Never more than 2.4% deviation from the Labour Force Survey 2006 on the highest level
  Occupation:
    65 groups after second clustering with about 500 persons in each group
    Never more than 1.7% deviation from the traditional 2001 census on the highest level
    Never more than 3.2% deviation on medium levels
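
The slides do not say how the deviation is measured. One plausible reading, used here purely as an assumption, is the largest absolute difference in category shares between the estimate and the reference, in percentage points. The counts below are invented for illustration and are not RBCT results.

```python
def shares(counts):
    """Convert absolute counts to shares that sum to 1."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def max_deviation(estimated, reference):
    """Largest absolute share difference over all categories, in percentage points."""
    est, ref = shares(estimated), shares(reference)
    categories = set(est) | set(ref)
    return max(abs(est.get(c, 0.0) - ref.get(c, 0.0)) * 100 for c in categories)

# Hypothetical counts on one aggregation level (not taken from the slides).
estimated = {"A": 510, "B": 290, "C": 200}
reference = {"A": 500, "B": 300, "C": 200}
deviation = max_deviation(estimated, reference)
```

For these made-up counts the largest gap is about one percentage point. The same function could be applied per aggregation level to compare the highest and medium levels.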