A theoretical framework for register-based statistics: Can we carry on without it?
Li-Chun Zhang
Statistics Norway
Statistical data by combination of sources: Coverage, content & relevance
Quality: Statistical vs. administrative register
Wallgren & Wallgren (2007, Wiley):
–“An administrative register is maintained to store records on all objects to be administered.” (Ideally)
–“A statistical register is based on data from administrative registers that have been processed to suit statistical purposes.”
A defining distinction in perspectives
–Administrative register: Individual data of all importance
–Statistical register: Properties at various aggregated levels
Quality of register-based statistics
Micro-data quality of a statistical register
Notable lag of theoretical framework (Platek and Särndal, Holt, Nanopoulos, 2001)
–A framework for quality assessment
–Theoretical frameworks for different quality aspects
Process accuracy vs. statistical accuracy: Any unbiased, efficient estimators based on statistical registers?
Process accuracy
–Matching/mismatching rate
–Extent of duplicates
–Amount of missing values
–…
Statistical accuracy
–Coverage
–Relevance
–Inherent stochastic variation
An example of the UK claimant register (Holt, 2007, TAS)
–people claiming unemployment related benefits
–entire population of claimants (say 1.5 million)
–no sampling error and arguably a perfect measure
–derived once each month on the same working day
–daily variation about 10,000 in this count
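As a rough sense of scale for the claimant example above, the quoted daily variation can be set against the size of the count. This is only a back-of-the-envelope sketch: it assumes the "about 10,000" is expressed in numbers of claimants and takes the "say 1.5 million" figure at face value. The variable names are illustrative, not from the source.

```python
# Figures quoted on the slide; both are approximate ("say", "about").
claimant_count = 1_500_000   # entire population of claimants
daily_variation = 10_000     # day-to-day variation in the count

# Relative size of the daily variation, as a fraction of the count.
relative_variation = daily_variation / claimant_count

print(f"Daily variation relative to the count: {relative_variation:.2%}")
```

Even without sampling error, a monthly figure taken on one working day carries stochastic variation of this relative size, which is the "inherent stochastic variation" listed under statistical accuracy.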
A historic parallel: Survey sampling before Neyman (1934)
The representative method (Kiær, 1895) with a three-stage design using 1890 census as frame:
–1st: 128 counties and 23 towns throughout the country
–2nd: cohorts of males of age 17, 22, 27, 32, etc.
–3rd: persons with surname initial A, B, C, L, M, N
ISI-committee 1924 report: “I think I may venture to say that nowadays there is hardly one statistician, who in principle will contest the legitimacy of the representative method”. (Jensen)
Representative sampling (Neyman, 1934):
Comparisons to non-sampling errors in sample survey and census
Unidentified units in register & non-response in survey
–Related to under-coverage
–Yes, imputation. But a quite different theory!
–Example: register households
‘Imputation’ of household identity
Which imputation methods do you use? Hot-deck?
Definitional error in register source & measurement error
–Related to relevance
–Yes, a kind of measurement error. But bias dominates! And often clearly different in different sub-populations.
–Example: register unemployment (REG_unemp)
REG_unemp = ILO_unemp + Bias + Random_error
Sample survey, Census, Register-based survey:
–Coverage errors
–Relevance errors
–Non-response errors
–Integration errors
–Measurement errors
–Sampling errors
–Coverage errors
–Matching/mismatching errors
–Missing-link errors
–Aggregation errors (Partial classification)
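The decomposition REG_unemp = ILO_unemp + Bias + Random_error can be turned into a small calculation that shows what "bias dominates" means in terms of mean squared error. The sketch below assumes the bias is fixed within a sub-population and the random error has mean zero, so that MSE = Bias^2 + Var(Random_error). The function name, the sub-population labels and all numbers are hypothetical and chosen only for illustration.

```python
import math

def mse_decomposition(bias, random_sd):
    """Split the MSE of REG_unemp, viewed as a measure of ILO_unemp,
    into squared bias and random-error variance.

    Assumes a fixed bias and a zero-mean random error (an assumption
    of this sketch, not a statement from the slides).
    """
    bias_sq = bias ** 2
    variance = random_sd ** 2
    mse = bias_sq + variance
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "bias_share": bias_sq / mse,  # fraction of MSE due to bias
    }

# Hypothetical sub-populations: (bias, standard deviation of random error).
subpopulations = {
    "group A": (-3000.0, 500.0),
    "group B": (800.0, 500.0),
}

for name, (bias, sd) in subpopulations.items():
    result = mse_decomposition(bias, sd)
    print(f"{name}: RMSE = {result['rmse']:.0f}, "
          f"bias share of MSE = {result['bias_share']:.1%}")
```

With a bias that differs in sign and size between sub-populations, a single overall correction is unlikely to be adequate, which is one reason this is treated differently from ordinary measurement error.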
A theory for detailed statistics: Signal or noise?
A theory for micro-data quality
Reality at “Storgata 9”:
–H0101: Astrid (72) - widow
–H0102: Tommy (32) & Jenny (29) & Ronny (2) - cohabitation
–H0201: Olav (29) & Lena (29) - cohabitation since Census 2001
–H0202: Knut (27) - single
Register:
–H0101: Astrid (72) - widow
–H0101: Tommy (32) & Jenny (29) & Ronny (2) - cohabitation
–H0101: Olav (29) - single
–?: Lena (29) - single
–?: Knut (27) - single
Only Astrid is correctly registered. But when/how does it matter?
Administrative register => Individual data of all importance => Unit-specific error
Statistical register => A theory of types
- How real is a record: how are variables related to each other
- How representative is a record: distribution of the types
Imputed cohabitation in household register
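The “Storgata 9” example can be checked at both levels: unit by unit (the administrative view) and by distribution of household types (the statistical view). The sketch below is one possible encoding, not the author's method. It assumes each register row is a separate household, writes an unknown dwelling ("?") as None, and uses household size as a simple stand-in for "type". All function and variable names are illustrative.

```python
from collections import Counter

# Each household: (dwelling id, members). None means dwelling unknown ("?").
reality = [
    ("H0101", ("Astrid",)),
    ("H0102", ("Tommy", "Jenny", "Ronny")),
    ("H0201", ("Olav", "Lena")),
    ("H0202", ("Knut",)),
]
register = [
    ("H0101", ("Astrid",)),
    ("H0101", ("Tommy", "Jenny", "Ronny")),
    ("H0101", ("Olav",)),
    (None, ("Lena",)),
    (None, ("Knut",)),
]

def person_records(households):
    """Map each person to (dwelling id, set of household members)."""
    return {
        person: (dwelling, frozenset(members))
        for dwelling, members in households
        for person in members
    }

def size_distribution(households):
    """Count households by number of members (a simple 'type')."""
    return Counter(len(members) for _, members in households)

true_rec = person_records(reality)
reg_rec = person_records(register)

# Unit-specific view: dwelling and household composition must both match.
correct_units = sorted(p for p in true_rec if reg_rec.get(p) == true_rec[p])

# Type view: compare the distributions of household sizes.
true_sizes = size_distribution(reality)
reg_sizes = size_distribution(register)

print("Correctly registered persons:", correct_units)
print("Household sizes in reality:  ", dict(true_sizes))
print("Household sizes in register: ", dict(reg_sizes))
```

Under this encoding only one of seven person records is fully correct, yet the size distribution differs only because Olav and Lena appear as two singles instead of one couple. The wrong or missing dwelling ids for the other households do not affect this particular aggregate.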