Quality evaluation of register-based statistics
Bart F.M. Bakker, Statistics Netherlands and VU University Amsterdam
27 June 2018
Session 11
Background and topic of this paper
Increasing use of registers for official statistics
Replacement of primary data collection by registers
Is the quality of the register-based statistics sufficient?
Customary method: assume that primary data collection is the gold standard
Naïve: all sources contain error
How can we assess the quality of register-based statistics?

More and more administrative data are used for official statistics. The most important reasons are that it is cheaper and that the reporting burden for companies becomes lower. However, during this transition, the question is often raised whether the quality of the register-based statistics is sufficient. Usually this question is answered by comparing the outcomes of the register with the outcomes of the existing primary data collection. So the existing data collection method is used as a gold standard. If the register-based outcomes differ too much from the existing ones, it is easily concluded that the quality of the register-based statistics is too low to use this source. However, using the outcomes of existing primary data collection as a gold standard is rather naïve, because all sources contain error. Therefore, the question is: how can we assess the quality of register-based statistics?
Theoretical framework
Total Survey Error approach for register data: Bakker & Daas (2012) and Li-Chun Zhang (2012)
Based on the idea that register data are collected mainly with traditional methods

My point of departure is the Total Survey Error approach for register data. It was developed by Bakker & Daas and by Li-Chun Zhang, both published in 2012. It is based on the idea that register data are collected mainly with traditional survey techniques. Look, for instance, at the tax forms you have to fill in: that is an ordinary questionnaire.
Theoretical framework (based on Groves et al. 2007)
[Diagram, measurement side: Administrative concept, Operationalization of administrative concept, Response, Corrected response, Statistical concept; error sources: validity of administrative concept, measurement error (adm. concept), processing error]
[Diagram, representation side: Target population, Set of registered population elements, Set of linked registered population elements, Postlinking corrections; error sources: coverage errors, linking error, correction]
[Both sides lead to: Outcome]

And it looks like this. We have adopted the total survey error framework of Groves and others. I will not go into detail, because that would take too much time. However, the idea is simple. We take the usual process of producing register-based statistics and look for error sources during that process. We distinguish between representation error and measurement error. Representation error occurs if the data do not represent the target population. Measurement error occurs if a variable is not measured properly, so that it is biased. All steps in the process can contribute to the total error of the outcome. I will refer to this theoretical framework in the rest of my presentation, but I hope it is not necessary to have it in your head to follow along.
The regular process: representation
Define target population
Create longitudinal Base register (BR)
Derive target population from BR, creating BR*
Link other administrative sources to the BR*
Link sample surveys for variables not in administrative sources
Correct for representation error
[Diagram, representation side: Target population, Set of registered population elements, Set of linked registered population elements, Postlinking corrections, Outcome; error sources: coverage errors, linking error, correction]

So let us start at the representation side. You start with the theoretical definition of your target population. In the next step you try to find a complete list of the units of the target population. There should be a base register for each kind of unit: persons, companies, jobs, etc. The base register should be longitudinal in character, so that you can derive a list of the units of the target population for each date or each period. If you are lucky, you have one register of good quality for those units. If that is not available, you can try to combine a number of different registers in order to cover the population entirely. Then you have a set of registered population elements. In the next step you link other registers with relevant subject matter variables. Then you have a set of linked registered population elements. If variables are missing from your administrative sources, you can link sample surveys for the remaining variables. Finally, you correct for representation error, e.g. by weighting or by imputing missing elements.
Assessment of representation error
The BR* can have over coverage, duplicates and under coverage
Over coverage can be assessed by applying the definition of the target population
Delete the registered elements that do not belong to the target population
Start and end dates are not perfect; use more registers to verify
[Figure: overlapping target population and BR* population; the overlap is the covered population, the part of the target population outside the BR* is under coverage, and the part of the BR* outside the target population is over coverage]

So that is the regular process. What errors can you come across in that process? At the representation side you can have three kinds of error: 1) over coverage, which means that you have elements in your register that do not belong to the target population; 2) duplicates; and 3) under coverage, which means that units of the target population are missing from your registers. How can you assess these three kinds of error? Over coverage can be assessed by applying the definition of the target population and deleting those who do not belong to it. But if you do, it is wise to combine the information from all the registers you use, because starting dates can differ between registers. An example for persons: people come to your country to work and are registered by the tax authorities, but not yet by the municipalities that maintain the population register (PR). After 10 months they register themselves in the PR. If you just look at the PR, you underestimate the number of residents.
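The point about imperfect start dates can be illustrated with a minimal Python sketch. It assumes each register delivers spells as (start, end) date pairs, with `None` for an open end; the function name `present_on`, the register contents and the dates are hypothetical and only mirror the example of a worker who appears in the tax register ten months before the PR.

```python
from datetime import date

def present_on(reference_date, spells):
    """Return True if any (start, end) spell covers the reference date.

    An end of None means the spell is still open.
    """
    return any(
        start <= reference_date and (end is None or reference_date <= end)
        for start, end in spells
    )

# Hypothetical spells for one person (illustrative dates only).
tax_register = [(date(2017, 1, 1), None)]          # registered on arrival
population_register = [(date(2017, 11, 1), None)]  # registered 10 months later

reference_date = date(2017, 6, 30)

in_pr_only = present_on(reference_date, population_register)
in_combined = present_on(reference_date, population_register + tax_register)

print(in_pr_only, in_combined)  # False True
```

Using the PR alone, this person is missed on the reference date; combining the spells from both registers recovers them. Whether a tax registration is enough to count someone as a resident depends on the definition of the target population.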
Assessment of representation error
Duplicates should be deleted by applying linkage techniques within your BR*
The success of this step depends on the quality of the identifying variables in the BR*
One particular problem for the BR* for persons is that if you combine different sources, movers are registered in two places
Use longitudinal data on removals to identify these cases

You can estimate the number of duplicates by applying linkage techniques within your Base register. Once you have determined which records are duplicates, you should choose which one to delete, or perhaps it is best to combine the information from the two records belonging to the same person. The success of this step depends on the quality of the identifying variables in the sources. If you cannot identify your individuals properly, there will still be duplicates in your files. One particular problem is that if you combine different sources, movers are registered in two places. That can easily be the case for registers in which it is not important to be registered correctly, or in which it is advantageous for people to be registered in the wrong place or in more than one place. For example, if you use crime registers, people who were recorded for a crime may give addresses where they do not live, or move to another address to avoid punishment. You can of course use longitudinal data on removals (changes of address) to identify most of these cases, but some error will remain.
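As a rough illustration of linkage within the Base register, the Python sketch below groups records on a normalised set of identifying variables and reports groups with more than one record. The field names (`record_id`, `name`, `birth_date`, `address`) and the exact-match rule are assumptions for the example; linkage in practice often uses probabilistic matching and richer identifiers.

```python
from collections import defaultdict

def normalise(value):
    """Lower-case a value and collapse whitespace so trivial variants match."""
    return " ".join(str(value).lower().split())

def candidate_duplicates(records, key_fields):
    """Group record ids that share the same normalised identifying variables."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(normalise(record[field]) for field in key_fields)
        groups[key].append(record["record_id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical records: a mover who appears at two addresses.
records = [
    {"record_id": 1, "name": "Jan  Jansen", "birth_date": "1980-02-01", "address": "A-street 1"},
    {"record_id": 2, "name": "jan jansen", "birth_date": "1980-02-01", "address": "B-street 9"},
    {"record_id": 3, "name": "Piet Pietersen", "birth_date": "1975-07-15", "address": "C-street 3"},
]

duplicates = candidate_duplicates(records, ["name", "birth_date"])
print(duplicates)  # [[1, 2]]
```

Including the address in the key would hide the mover's duplicate, which is why the choice and quality of the identifying variables matter so much for this step.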
Assessment of representation error
Estimate under coverage: capture-recapture
The number of fish in a pond
Assume independence between capture and recapture

                  first: yes    first: no
  second: yes     n1,1          n0,1
  second: no      n1,0          n0,0

                  first: yes    first: no
  second: yes     14            106
  second: no      86            n0,0

\[ \mathrm{OR} = \frac{n_{1,1} / n_{0,1}}{n_{1,0} / n_{0,0}} = 1 \]
\[ \hat{n}_{0,0} = \frac{n_{1,0} \times n_{0,1}}{n_{1,1}} = \frac{86 \times 106}{14} \approx 651 \]

Under coverage is one of the main problems in register-based statistics. You can estimate under coverage by applying capture-recapture methods. For those who are not familiar with capture-recapture methods, I will explain them with a simple example. Assume that you want to estimate the number of fish in a pond. In the first step you capture 100 fish, mark them and put them back in the pond. You leave them alone for a few hours, so that they can mix, and then you capture 120 fish and count how many are marked. In this case: 14. You then get the table with counts: 14 fish are in both samples, 86 are only in the first sample and 106 only in the second. If you assume independence between the two samples, the odds ratio is one and you can compute the missed part with the formula above: 86 multiplied by 106 divided by 14, which is about 651. The 206 fish that were in your samples (14 + 86 + 106) should be added to the 651 you estimated for the missed part, and that gives you 857 fish in total. Instead of samples you can also use registers. If you use two registers, the first one is the capture and the second one is the recapture. Well, this is the general idea. However, much more complicated models are used to avoid or to meet the severe assumptions underlying this method.
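The fish example can be reproduced with a few lines of Python. This is a minimal sketch of the two-source estimator under the independence assumption only; the function name `capture_recapture` is chosen for illustration, and the more complicated models mentioned in the notes are not covered.

```python
def capture_recapture(n_both, n_first_only, n_second_only):
    """Two-source capture-recapture estimate, assuming independent sources.

    Returns (estimated missed units, estimated population total).
    """
    if n_both <= 0:
        raise ValueError("at least one unit must appear in both sources")
    # Odds ratio of one implies n00 = n10 * n01 / n11.
    missed = n_first_only * n_second_only / n_both
    observed = n_both + n_first_only + n_second_only
    return missed, observed + missed

# Counts from the pond example: 14 in both, 86 first only, 106 second only.
missed, total = capture_recapture(n_both=14, n_first_only=86, n_second_only=106)
print(round(missed), round(total))  # 651 857
```

The total is the same as the product of the two sample sizes divided by the overlap, 100 × 120 / 14, which is another common way of writing this estimator.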
The regular process: measurement
Derive the statistical concepts from the administrative ones
Or from the sample survey data
(Correct for measurement error)
Weight the sample survey data
Produce tables
[Diagram, measurement side: Administrative concept, Operationalisation of administrative concept, Response, Corrected response, Statistical concept, Outcome; error sources: validity of administrative concept, measurement error (adm. concept), processing error]

Now we go to the measurement side. We start with the step in which we derive the statistical concepts from the administrative ones. In most cases, you can do this quite straightforwardly. However, if the administrative concept differs from the statistical one, you have a problem. Sometimes you have to use information from different sources to derive your variable correctly. A particular problem can be the reference date. If you want to know whether someone holds a job on the reference date, for instance, the starting and ending dates should be correct. And there is a tendency, in particular in job registers, for ending dates to be incorrect. Employers keep dismissed employees in their files, because a lot of administrative work still has to be done for their former employees. Therefore, employees stay in the files longer than they actually work for this employer. So be careful with that. If information is missing from your registers, you should use sample survey data and derive the statistical variables from the survey. Then you can correct for measurement error. This step is between brackets, because not all countries apply measurement error correction methods. Of course most countries do some kind of editing of the data, but actually determining measurement error and correcting for it is not very usual. Then you weight the sample survey data and produce tables.
Assessment of measurement error
In register-based statistics, bias is more important than reliability
Two methods based on the idea of repeated measurement
Link a source in which the same concept has been measured, e.g. a PES or other survey
Structural Equation Models (SEM) for continuous and Latent Class Analysis (LCA) for categorical variables
SEMs can be used for intercept bias by using an audit sample

So how can we assess measurement error? In register-based statistics bias is more important than reliability. You assume that you have data on the entire population, so you do not have samples and your reliability is high. However, the data can be biased, in particular because you are working with administrative data and administrative concepts. There are two methods that you can use for determining measurement error, and both are based on the idea of repeated measurement. You measure a variable twice, and the correlation between the two measures is a measure of quality. Of course, the difference in means is also important. Therefore you should link to your register a source in which the same concept has been measured, e.g. a PES (post-enumeration survey) or another survey. Then apply a method to determine the measurement error: Structural Equation Models (SEM) for continuous variables and Latent Class Analysis (LCA) for categorical variables. SEMs can be used for correcting intercept bias by using an audit sample. In the case of Latent Class Analysis, the classification table gives you the corrected outcomes.
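The repeated-measurement idea can be sketched in Python for a continuous variable: after linking a survey to the register, compare the two measurements of the same concept by their correlation and by the difference in means. This is only a descriptive first check, not an SEM or LCA; the values and variable names below are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical linked records: the same concept (e.g. income in thousands)
# measured once in the register and once in the survey.
register_values = [30.0, 42.0, 25.0, 51.0, 38.0]
survey_values = [31.0, 40.0, 27.0, 50.0, 41.0]

correlation = pearson(register_values, survey_values)
mean_difference = (
    sum(register_values) / len(register_values)
    - sum(survey_values) / len(survey_values)
)

print(f"correlation: {correlation:.2f}, mean difference: {mean_difference:+.2f}")
```

A high correlation with a non-zero mean difference would point to a systematic shift between the two sources. Deciding which source is shifted needs extra information, which is the role the slide gives to an audit sample.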
Linkage error
Linkage error consists of false negatives and false positives
False negatives can easily lead to biased estimates
Weighting or imputing can be a solution
False positives can be corrected by applying SEMs or LCAs

If you produce register-based statistics, you have to link different registers and sometimes surveys. You simply do not have one register that contains all the variables you are interested in. If you link registers, you create error. We distinguish two kinds of linkage error: false negatives (you have missed links) and false positives (you link the records of two different persons as if they were one). False negatives, so missed links, can easily lead to biased estimates. They are a bit like item non-response: you miss part of the information on the people whose links you have missed. Therefore you can weight or impute the information. But that is not always possible. Let us assume that you have a job register and you use that register to determine whether someone has a job or not. You link your job register to your BR* and assume that all the persons who do not link to the job register do not have a job. In that case you have biased outcomes that you cannot correct for by weighting or imputation. False positives, so links between the records of two different persons, are like measurement error. Therefore, your estimates can be corrected by applying SEMs or LCAs.
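The job register example can be made concrete with a small calculation in Python. The sketch assumes that missed links occur only for persons who do have a job record, that there are no false positives, and that every unlinked person is treated as having no job; the rates used are hypothetical.

```python
def observed_employment_rate(true_rate, missed_link_share):
    """Employment rate seen after linkage when unlinked persons count as jobless.

    true_rate: share of the BR* population that really has a job.
    missed_link_share: share of job holders whose job record fails to link.
    """
    if not (0.0 <= true_rate <= 1.0 and 0.0 <= missed_link_share <= 1.0):
        raise ValueError("rates must lie between 0 and 1")
    return true_rate * (1.0 - missed_link_share)

true_rate = 0.60  # hypothetical true employment rate
observed = observed_employment_rate(true_rate, missed_link_share=0.05)
bias = observed - true_rate

print(f"observed: {observed:.3f}, bias: {bias:+.3f}")  # observed: 0.570, bias: -0.030
```

Under these assumptions every missed link moves a job holder into the "no job" group, so the estimate is always too low and the bias grows with the share of missed links.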
Conclusions
Register-based statistics have measurement and representation error
The size of these errors can in most cases be estimated
In some cases it is also possible to correct the estimates
Visit:
Dimitris Pavlopoulos, session 23, 28-6, 14.30, 3B for the use of LCA in official statistics
Daan Zult, session 24, 29-6, 10.30, 4A for new developments in capture-recapture models

Then I come to my conclusions. Register-based statistics have measurement and representation error. The size of the errors can in most cases be estimated. In some cases it is also possible to correct the estimates. However, a lot of work should be done before we can apply all these measures and correction procedures in the process of the production of official statistics. I recommend that you visit Dimitris Pavlopoulos, who is giving a very nice presentation on the use of Latent Class Analysis in official statistics, and Daan Zult, who is giving a presentation on the correction for linkage error in capture-recapture models. Thank you for your attention!