1
Demography Statistics - looking to the future
Administrative Data Seminar, 4th December 2018
Good morning. My name is John Dunne and I head up the Administrative Data Centre at CSO. Part of my job is to promote the use of administrative data sources for statistical purposes. Unfortunately, traditional methods do not always allow us to fully exploit these new data sources, so there is a requirement to source or develop new methods that realise their full statistical value. This will be the meat of my talk today. The motivation behind this talk is as follows.
2
Motivation
Requirement: to provide ‘Census like’ statistics at detailed geography on an annual basis from 2024 onwards
It is not feasible to conduct an annual Census in the traditional manner
Currently, proposals in the European Statistical System will require Ireland, along with other EU member states, to compile ‘Census like’ small area statistics from reference year 2024 onwards. Ireland typically compiles this type of information every 5 years using a traditional census, at considerable cost. It isn’t feasible or cost effective to carry out a Census every year. The intention is that member states will be able to use administrative data, and in particular population registers, as the basis for compiling these estimates. Unfortunately, unlike other countries, Ireland does not have a Central Population Register it can use as a starting point. So if Ireland wants to fulfil this statistical requirement, it will need either to put in place a high quality central population register for administrative purposes, or to develop and source new methods that can exploit existing administrative data sources. Given the low likelihood that a Central Population Register will come into place in Ireland, the talk today will focus on how these new methods might be developed or deployed to exploit existing data sources to fulfil this requirement. I will talk about some research work already done as well as ideas for future work.
3
Overview
Population Estimates
  PECADO Project and first Research Outputs
Broken down by geography (emerging thoughts)
  Use EIRCODE
Household Composition (emerging thoughts)
  A Methodological Framework
Putting the pieces together
First, we have to be confident that we can estimate the population at State level. We have done some work here with the PECADO project and I will present some methods and research outputs in this context. I will then go on to present some thoughts on how we could extend these methods to break down the population estimates by geography and household composition. Finally, I will talk about how we might then put these pieces together to create a Census like dataset using administrative data sources. So first, some work done in the PECADO project.
4
PECADO Project Population Estimates Compiled from Administrative Data Only
This work has come out of collaboration between CSO and University of Southampton and the underlying methods are actively being investigated in other countries where similar challenges exist. The PECADO system can be outlined simply as follows:
5
Methodology Outline
Create a Statistical Population Dataset (SPD) using Signs of Life (SoL)
Adjust SPD counts using DSE to obtain population estimates
Validate assumptions
  no linkage error
  no erroneous data
  equal catchability in one list
In the absence of a Central Population Register you need to create a Statistical Population Dataset or some type of population spine. The approach we took was one based on Signs of Life in administration systems. So if a person is identified with an interaction on one or more public administration systems, they will have a record in the SPD. In theory, counts taken from this SPD should underestimate the population, because the SPD will miss those persons who have not been identified as engaging with public administration systems. To address this undercount problem we then applied a commonly used method called Dual System Estimation (DSE). There are a number of assumptions underpinning this methodology, which we also attempt to validate. But first, we will have a quick look at the distributions in some of the underlying data sources.
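The Signs of Life step described above can be illustrated with a small sketch. This is only an illustration under assumptions: the source names and the `build_spd` helper are hypothetical, and each source is reduced to a set of pseudonymised person keys for one reference year.

```python
def build_spd(sources):
    """Return {person_key: set of source names} for every key seen in any source.

    A person gets an SPD record as soon as they show a sign of life
    (an interaction) on at least one administrative source.
    """
    spd = {}
    for source_name, person_keys in sources.items():
        for key in person_keys:
            spd.setdefault(key, set()).add(source_name)
    return spd


# Hypothetical sources and keys, for illustration only.
sources = {
    "child_benefit": {"p1", "p2"},
    "employment": {"p3", "p4"},
    "education": {"p2", "p5"},
}
spd = build_spd(sources)
spd_count = len(spd)  # raw SPD count, expected to undercount the population
```

Keeping the set of contributing sources per person makes it straightforward to rebuild the SPD without a source that is later found to be unreliable.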
6
Data Source Distributions
(SPD or PAR)
Here we use population trees to present the underlying data distributions: age is on the y-axis and number of persons is on the x-axis. Males are presented to the left and females to the right. We have plotted each of the underlying data sources here for the reference year. We use data sources that cover all the life stages from the cradle to the grave. For example, the pink continuous line under the black dashed line in the age category 0 to 18 represents children for whom a child benefit payment is being made. The green continuous line represents children enrolled in secondary education, and the black dashed line on the outside represents the distribution of the SPD taking account of all data sources.
Pause
Now we have to adjust the SPD to get the population distribution, and we use DSE methods to do this.
7
DSE explained
Naïve DSE
Ideal DSE
List A of size x but r is unknown
List B of size n
AB list match size m
List B ‘equal catchability’ or ‘homogeneous capture’
Matching assumption: no linkage error between List A and List B
No erroneous records or over-coverage
So in a nutshell, we start with our SPD, which in our notation we call list A and which has size x. We then match it against a suitable second list, which we call list B and which has size n. If the match between the two lists has size m, we can estimate the population, capital N, simply as nx over m (N = nx/m); in other words, the product of the sizes of the two lists divided by the size of the match. For example, if list B has size 20 and the match has size 16, then we adjust list A by 20/16; in other words, we adjust the count in list A upwards by 25%, a simple ratio adjustment.
There are a number of assumptions that must be satisfied, and these include the following:
Equal catchability: each unit in the population should have an equal chance of being caught in list B. This ensures a form of independence between both lists.
No linkage error: in this work we use a Protected Identifier Key, or PIK, based on the PPSN, and this ensures high quality linkage. A Protected Identifier Key protects a person's identity while at the same time preserving linkage across data sources and over time. All data used in this project is pseudonymised and linkage is undertaken with the PIK.
No erroneous records or overcount in either List A or List B, otherwise the population estimates will be inflated.
So before we go on, the candidate we have been using for list B is based on another administrative data source, the Driver Licence Database. Evidence in our work suggests that a list of those applying for a new driver licence or renewing an existing driver licence will provide a suitable list B.
Pause
So what do the population estimates look like?
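The estimator N = nx/m can be written out as a short sketch. The function names are hypothetical, and the list A size of 1000 is an assumed figure used only to show the arithmetic; the talk gives only the list B size (20) and the match size (16).

```python
def dual_system_estimate(x, n, m):
    """Estimate population size N from list sizes x (A), n (B) and match size m."""
    if m <= 0:
        raise ValueError("the match between the two lists must be non-empty")
    return x * n / m


def estimate_from_lists(list_a, list_b):
    """Apply the estimator to two collections of pseudonymised person keys."""
    a, b = set(list_a), set(list_b)
    return dual_system_estimate(len(a), len(b), len(a & b))


# Figures from the talk: list B has size 20 and the match has size 16.
adjustment = 20 / 16  # ratio applied to the list A count (a 25% uplift)

# Hypothetical list A size of 1000, used only to show the arithmetic.
estimate = dual_system_estimate(x=1000, n=20, m=16)
```

The estimate is only as good as the assumptions listed on the slide: erroneous records in list A inflate x without adding matches, and so inflate the estimate.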
8
First attempt population estimates, 2011
(SPD)
Blocking by age, gender and nationality group; includes a source subsequently found to have erroneous records in the over 65 age group
Trimmed Dual System Estimation (TDSE)
  Extension of DSE methodologies that enables hunting for erroneous records
  Joint work between CSO and University of Southampton
Our first reasonable attempt, for the year 2011, looked as follows. Again using population trees, the SPD is represented by the blue continuous line, and applying DSE methods gives our population estimates, represented by the black continuous line. We also include the Census counts as a point of reference. On the female side these estimates look reasonable, but on the male side we see that we are overestimating when compared to the Census.
Differences between the PECADO estimates in black and Census counts in red can be attributed to:
1. Differences in population concepts: PECADO uses signs of life over the reference year, while the Census is based on persons usually resident and present on Census night
2. Small violations in underlying assumptions, such as equal catchability
3. Erroneous records or over-coverage present in one or both lists
DSE methodology has been developed and extended to include a toolkit we call Trimmed Dual System Estimation, which can be used to hunt for overcoverage in list A based on suspicions. We have used this and identified evidence of overcoverage in list A in the over 65 age group. This was due to an incorrect use of a proxy to identify those in receipt of a State pension; this source was subsequently dropped when compiling the SPD. Our most recent attempt, for 2016, is shown on the next slide.
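Blocking, as mentioned on this slide, simply means applying the DSE adjustment separately within each age, gender and nationality group. A minimal sketch follows, assuming each list is a mapping from a pseudonymised key to its block; the `blocked_dse` helper and the example records are hypothetical, and the TDSE trimming step itself is not shown.

```python
from collections import Counter


def blocked_dse(list_a, list_b):
    """Return {block: estimate} using N = x * n / m within each block.

    list_a and list_b map a person key to a block such as
    (age_group, gender, nationality_group). A block with no matches
    gets None, because the estimator is undefined there.
    """
    x = Counter(list_a.values())
    n = Counter(list_b.values())
    # Matched records are counted in the block recorded on list A.
    m = Counter(block for key, block in list_a.items() if key in list_b)
    return {
        block: (x[block] * n[block] / m[block]) if m[block] else None
        for block in x
    }


# Hypothetical records, for illustration only.
young = ("20-24", "M", "Irish")
older = ("65+", "F", "Irish")
list_a = {"a1": young, "a2": young, "a3": young, "a4": young, "a5": older}
list_b = {"a1": young, "a2": young, "b9": young}
estimates = blocked_dse(list_a, list_b)
```

In practice sparse blocks would need to be merged with neighbours rather than left without an estimate.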
9
Most recent attempt population estimates, 2016
(SPD)
Note
  Different population concepts
  Possibility of erroneous records remaining
These estimates are now far more closely aligned with those of the Census. While accepting differences in population concepts and the possibility of some residual erroneous records in either list, we now look to have a strong basis on which to build a system of population estimates. The system of estimates also looks to be very stable when we compare each year from 2011 forward to 2016. We will make these research outputs and detailed methodology notes available with the presentations on our website next week.
Pause
This research is innovative in a number of ways.
10
Innovation in Application
SPD SoL approach reduces statistical problems from 4 to 1
  Domain misclassification, linkage error, overcoverage, undercoverage
  Only one list requires equal catchability
Use administrative data (Driver Licence Renewals - DLD) as list B in DSE
Validate assumptions
  no linkage error (given)
  no erroneous data (TDSE extension of DSE methods)
  equal catchability in list B (swap admin list B for survey list B and compare results)
Examine robustness of PECADO system (TDSE, Trimmed Dual System Estimation)
The Signs of Life approach (as opposed to looking at registrations) reduces the number of statistical problems we have to deal with from 4 to 1. PPSN based linkage using the PIK eliminates linkage error and domain misclassification (with respect to age, gender and nationality). In theory, we don’t have to deal with over-coverage either, although in practice we have found there is still a residual risk. That leaves undercoverage as the one problem to be addressed, which is what DSE does. We have extended the DSE methods with a new approach to identify suspicious parts of the SPD that need to be removed. We have also looked at other mechanisms to validate assumptions, and these are presented in detailed methodological notes that will be made available next week.
11
Breaking down by Geography - emerging thoughts
We now need a mechanism to break out these estimates into suitable geography.
12
Geographical breakdown
Dependency on EIRCODE in Public Administration Systems
In the absence of EIRCODE, big challenge to code address strings
The only show in town is really the EIRCODE. In the absence of EIRCODES, geocoding address strings remains a big challenge.
13
Methodology (emerging thinking)
Business rules to choose geography based on address strings (case of multiple addresses for persons)
DSE
  blocking by geography, age and gender
  clustering of similar small area geographies
  constrain by State level population estimates
If we have high quality geographic information such as EIRCODES on public administration systems, then we should be able to estimate the breakdown of the population estimates by detailed geography. Again we would see DSE methods coming into play. Paul's talk, next up, will further demonstrate the statistical potential of administrative data once the EIRCODE is in place. So now, turning to household composition.
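The last bullet, constraining small-area estimates to the State level estimate, could be done in several ways. The sketch below assumes the simplest one, a pro-rata scaling; the talk does not say which method would be used, and the `constrain_to_total` helper and the area figures are hypothetical.

```python
def constrain_to_total(area_estimates, state_total):
    """Scale small-area estimates pro rata so that they sum to state_total."""
    raw_total = sum(area_estimates.values())
    if raw_total <= 0:
        raise ValueError("area estimates must sum to a positive number")
    factor = state_total / raw_total
    return {area: value * factor for area, value in area_estimates.items()}


# Hypothetical small-area DSE estimates and State-level estimate.
area_estimates = {"area_1": 100.0, "area_2": 300.0}
constrained = constrain_to_total(area_estimates, state_total=500.0)
```

Pro-rata scaling preserves each area's share of the total; a production system might instead adjust by age and gender margins at the same time.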
14
Household Composition - emerging thoughts
We also need to be able to provide statistical information on household composition. There should be enough auxiliary information available. We are developing some ideas on how we might use the data sources at our disposal, and I will try to give a flavour of them here.
15
Data sources
Geography (EIRCODE)
Relationships in administrative data sources
Household surveys
The data sources we have that will help us form the household composition analysis are as follows.
EIRCODE information: where persons have the same EIRCODE on administrative data systems, there is a significant likelihood that they are in the same household. If EIRCODE were available or mandatory on administrative systems it would make life very easy here.
Relationship information in administrative data sources: for example, from Child Benefit systems we can use the relationship between designated parent and child; from tax returns we may be able to identify spouses and partners for tax purposes; and some administrative data in health systems may also hold family formation information that says something about the likelihood of people being in the same household.
Household surveys (LFS): here we get the best information (or what we could call the truth) about household composition in a small number of houses.
I should also say that CSO has deployed another PIK at property level to preserve linking for statistical analysis purposes while protecting address and x, y co-ordinate information. This prevents unauthorised lookup on EIRCODE or address string in our data systems.
Given these data sources, can we now deploy methods in a sound manner to obtain statistical information on household composition? Our early thinking is inspired by some work at the University of Utrecht, where they have extended DSE methods to consider missing information. I will try to lay out the framework to give you an idea of what's involved.
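As an illustration of the first data source, persons sharing a property-level key can be grouped into candidate households. This is a sketch under assumptions: the `candidate_households` helper, the person keys and the property keys are all hypothetical, and sharing a key only indicates a likelihood of sharing a household.

```python
from collections import defaultdict


def candidate_households(person_property):
    """Group person keys by property-level key.

    person_property maps a person key to a property key, or to None
    when no usable EIRCODE is held for that person.
    """
    groups = defaultdict(set)
    for person, prop in person_property.items():
        if prop is not None:
            groups[prop].add(person)
    return dict(groups)


# Hypothetical pseudonymised keys, for illustration only.
person_property = {"p1": "h1", "p2": "h1", "p3": "h2", "p4": None}
households = candidate_households(person_property)
unplaced = [p for p, prop in person_property.items() if prop is None]
```

Persons without a property key remain unplaced at this stage and would need the relationship information or the estimation framework to be allocated.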
16
Possible Methodological Framework
Classify each person on SPD by what type of household they belong to (best effort): HH(A)
Take Household Survey as correct Household Composition: HH(B)
Use extension of DSE methods with missing covariates to estimate for population
Constrain to population size and estimated number of occupied dwellings
Collapse over HH(A)
Each cell of the table below is a cross-tabulation of HH(A) by HH(B):
                 In Survey (B)    Not in Survey    Total
In SPD (A)       A and B          A, not B         In A
Not in SPD       B, not A         not A, not B
Total            In B                              Population
An extension of DSE methods with missing covariates: research work at Utrecht University
OK, maybe this ain't so easy, but I will give it a go, and if I am failing I will move on quickly.
The idea is that we classify each person on the SPD by a best guess of what type of household they belong to, based on admin data. The household composition classification is simply of the form 1 adult, 2 children; 2 adults, 1 child; and so on. We will label this classification variable HH(A), as we have denoted the SPD as A in this set up. This gives us our margin total on the first row.
We do the same for each person in the household survey, but in this instance we know the household composition because we have visited the house. We will label this classification HH(B). This gives us our column total in the first column.
We now match SPD and survey at person level and get the distribution table for the top left cell, with HH(A) as per the SPD and HH(B) as per the survey.
(Click to get result.)
OK. Now we use the relationship between HH(A) and HH(B) in the match cell to estimate what HH(A) should be in the second row, and then by adding the two cells together we get the joint HH(A), HH(B) distribution for the survey in column 1. We then simply repeat the process on the first row to get an estimate of the joint HH(A), HH(B) distribution for the SPD.
And now we are left with a DSE setup with covariates HH(A) and HH(B), with the SPD again as list A but this time the survey as list B. We follow through on the DSE to get the population broken down by HH(A) and HH(B). We then constrain our overall population distribution by making small adjustments such that it fits the constraints of population size and possibly the number of occupied houses. The number of occupied houses is another project. Finally, we collapse over HH(A), our original best guess effort from admin data, to get our population broken down by household composition.
OK, if you got that I am impressed, maybe more so with me than you. So now we need to put the pieces together.
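The steps just described can be walked through in a deliberately simplified sketch. It is not the Utrecht method: the missing classifications are filled by plain proportional allocation from the matched cell, the constraint step is omitted, and the function name, inputs and figures are all hypothetical.

```python
def estimate_household_composition(matched, spd_only, survey_only):
    """Return estimated population counts by HH(B).

    matched[(a, b)]  persons on both lists with HH(A)=a and HH(B)=b
    spd_only[a]      persons on the SPD only, HH(A)=a, HH(B) missing
    survey_only[b]   persons in the survey only, HH(B)=b, HH(A) missing
    """
    row_total, col_total = {}, {}
    for (a, b), count in matched.items():
        row_total[a] = row_total.get(a, 0) + count
        col_total[b] = col_total.get(b, 0) + count

    by_hh_b = {}
    for (a, b), m in matched.items():
        if m == 0:
            continue  # no matched evidence for this cell in this sketch
        # Allocate unmatched records to the cell using the matched shares.
        x = m + spd_only.get(a, 0) * m / row_total[a]     # list A size in cell
        n = m + survey_only.get(b, 0) * m / col_total[b]  # list B size in cell
        # DSE within the cell, then collapse over HH(A) by summing into b.
        by_hh_b[b] = by_hh_b.get(b, 0.0) + x * n / m
    return by_hh_b


# Hypothetical counts in which HH(A) and HH(B) always agree.
matched = {("1 adult", "1 adult"): 10, ("2 adults", "2 adults"): 20}
spd_only = {"1 adult": 90, "2 adults": 180}
survey_only = {"1 adult": 5, "2 adults": 10}
population_by_hh = estimate_household_composition(matched, spd_only, survey_only)
```

SPD-only records whose HH(A) category never appears in the matched cell are dropped by this sketch, which is one reason a model-based treatment of the missing covariates is preferable.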
17
Putting the pieces together
So we somehow need to estimate the population broken down by geography and household composition.
18
Estimation Workflow
[Diagram: for each of Year t - 2, Year t - 1 and Year t, boxes for Persons/Population, Household Composition, Geography, and Geography + Household Composition]
We have the basis of a system that can estimate the population at State level.
We may have a methodological framework that will allow us to estimate household composition at State level. The arrow between geography and household composition here simply recognises that if we identify persons living in the same dwelling we can make a determination that they live in the same household.
We can get to good quality population estimates broken down by geography once we have good quality EIRCODES on public administration systems.
Now we need to figure out how to put geography and household composition together to get population estimates broken down by household composition by geography.
Once we can do that, we can start building out the SPD to a full SPD covering the whole population as described by the estimates obtained here.
19
Building out the SPD
[Diagram: Households/Geography against Persons within Households/Geography, in three blocks: persons on SPD/PAR with known geography/household; persons on SPD/PAR allocated to existing and new households; persons not on SPD/PAR allocated to existing and new households]
First we identify the part of the SPD where we are confident of geography and household (in grey). Then, for the remainder of the SPD where we have person records, we would need to attach a geography and household identifier. Then, for those persons that the SPD does not cover, we could impute records such that the SPD covers the full population. This then becomes a full spine that can be built out with the necessary attributes to create a full Census dataset, which is another day's work.
This type of approach has a number of advantages:
There is a methodological framework to underpin the estimates (rather than relying on best rules)
The methodological approach should also facilitate quality indicators and diagnostic information with respect to the estimates
As the coverage and quality of the SPD increases, we should also be able to monitor the impact on the quality of the estimates
20
Concluding
21
Concluding comments
Objective is to
  Enhance quality of data (EIRCODE Coverage)
  Create a methodological framework
    Survey will be required (ground truth)
  Requires a maturing NDI
  Requires new methods, Collaboration/partnership (with NSIs and Academia)
  CSO developing new capabilities
Hopefully we can meet requirement for reference year 2024
Make a “Virtual Census” a reality
To meet the requirement of annual census like estimates and push on towards a Virtual Census, we need to enhance the quality of the data that we have access to, both in terms of coverage and use of EIRCODES (a job for the Public Sector). We also need to ensure we have sound methodological underpinnings to what we do, and part of that will be ensuring that we incorporate information from surveys into that framework in an efficient and effective way. In terms of the data, we need to push and ensure that the NDI continues to mature. And in particular the CSO itself needs to continue investing in developing and deploying new methods within a sound methodological framework. We need to ensure we are collaborating with the right people in NSIs and academia, and from this perspective it's great to have Peter and Eric over to visit and talk with us today. So hopefully we can meet this requirement for reference year 2024 and push on and make a Virtual Census a reality here in Ireland. Thank you.
22
Thank you