The Application of the Concept of Uniqueness for Creating Public Use Microdata Files
Jay J. Kim, U.S. National Center for Health Statistics
Dong M. Jeong, Korea National Statistical Office
Contents
Introduction
Intruders and Disclosure
Measures of Disclosure Risk
1. Narrow Definition of Disclosure Risk
2. Broader Definition of Disclosure Risk
Evaluation of Definition of Disclosure Risk
Concluding Remarks
1. Introduction
Government agencies release microdata files from their survey data or administrative records data.
Large amounts of information on individuals are available to many organizations and data users, who can become “intruders”.
If a public use microdata file (PUMF) is released, intruders can try to match their records with those on the PUMF and gain access to new information.
Intruders link the records on the two files using variables common to the PUMF and their own files; these are called “key variables” or “matching variables”.
In the U.S., laws such as Title 13 stipulate protection of the confidentiality of many types of data.
Thus, the data disseminating agencies must protect the confidentiality of the individuals on the PUMFs. On the other hand, they should not ignore the data users’ needs, i.e., the utility of the data files.
Here, we develop probability models quantifying disclosure risk for a microdata file.
This is a modification of the Marsh et al. (1991) procedure.
The model can use population and sample “uniques” only, or it can also include population twins or triplets.
We will show the results of applying the probability model, using population and sample uniques only, for creating disclosure-limited microdata files from the 2005 Korean demographic census data.
2. Intruders and Disclosure
Potential intruders:
i). Organizational intruders, e.g., credit card companies, mortgage departments of banks, insurance companies, credit bureaus, trade associations, etc.
ii). Individual intruders: with readily available high-powered computers, anyone can assemble his own database using information in the public domain and become an intruder.
Two types of disclosure:
i). Identity disclosure: identification.
If the intruder is a journalist trying to embarrass the data disseminating agencies, his claim that he has succeeded in identifying someone on their PUMF would be sufficient.
If the intruder publicizes the findings in the news media, it could have a devastating effect on the agencies’ data collection efforts.
ii). Attribute disclosure: after identification is made, one can gain new sensitive information.
For defining a measure of disclosure risk, we will treat identity disclosure as the same as disclosure.
3. Measures of Disclosure Risk
Define
P(a) = the probability that the key variables are recorded identically in both the PUMF and the intruder’s file;
P(b|a) = the probability that an individual appears in the PUMF, which is the same as the sampling fraction for that individual in the PUMF;
P(c|a,b) = the probability that the individual is a population unique; and
P(d|a,b,c) = the probability of verifying that the individual is a population unique.
Marsh et al. (1991) defined the probability of correct identification of an individual as
P(a) P(b|a) P(c|a,b) P(d|a,b,c)
We modify the model of Marsh et al. In their formula, we assume that
i). There are no recording or classification errors in the values of the key variables, i.e., P(a) = 1.
ii). Population uniqueness can be verified correctly with certainty, i.e., P(d|a,b,c) = 1.
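As a small worked step, the two assumptions above can be substituted into the Marsh et al. product. The symbol R for the resulting risk is introduced here for illustration only and does not appear in the original.

```latex
\[
R \;=\; P(a)\,P(b \mid a)\,P(c \mid a,b)\,P(d \mid a,b,c)
  \;=\; P(b \mid a)\,P(c \mid a,b)
  \qquad \text{when } P(a) = 1 \text{ and } P(d \mid a,b,c) = 1 .
\]
```

Under these assumptions the risk depends only on the sampling fraction and on the probability of population uniqueness.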
Disclosure can occur when all of the following five conditions are met:
i). The individual is unique in the population based on the key variables. If the intruder’s file is a 100 percent population file, he can establish the uniqueness of a given individual from his own file.
ii). The individual is on the PUMF.
iii). The individual is on the intruder’s file. An intruder can have information on key variables for a specific person and try to examine whether that person appears in the PUMF. In this case, the intruder’s file has a single record.
iv). The individual is unique on the PUMF, and
v). The individual is unique on the intruder’s file.
Define
A = an individual of interest;
= PUMF;
= an intruder’s file;
= unique class in the population;
= unique class in PUMF; and
= unique class in intruder’s file.
3.1 A Narrow Definition of Disclosure Risk
This definition depends on the population and sample uniques only.
Assume an intruder goes on a phishing (fishing) expedition.
The probability of correct identification: (1)
If an individual is a population unique, he is also a sample unique, i.e.,
Equation (1) reduces to which can be further re-expressed as follows: (2)
The event that A is unique in the population is independent of whether A is selected in the sample. Thus, equation (2) reduces to (3)
The event that A is in the PUMF is usually independent of the event that A is in the intruder’s file. In this case, equation (3) can be simplified as (4)
However, a survey can be a subset of another survey. For example, the U.S. Census Bureau’s PUMF is a subset of its census sample. Thus, if the PUMF is a subset of the intruder’s file, equation (3) becomes (5)
Also, (6)
3.1.2 Assuming an Intruder Already Knows That A is in the PUMF
If the intruder has response knowledge, then
Thus, from equation (4), the disclosure risk will be
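The chain of reductions described in Section 3.1 can be sketched as follows. The notation is an assumption made here for illustration (F for the PUMF, I for the intruder's file, U_P, U_F and U_I for the unique classes in the population, the PUMF and the intruder's file, and R for the risk), and the displayed forms are reconstructed from the verbal steps in the text, not copied from the original equations. Equation (6) is not attempted.

```latex
% Assumed notation: F = PUMF, I = intruder's file,
% U_P, U_F, U_I = unique classes in the population, PUMF and intruder's file.
\begin{align}
R &= P(A \in F,\; A \in I,\; A \in U_P,\; A \in U_F,\; A \in U_I) \tag{1}\\
  &= P(A \in F,\; A \in I,\; A \in U_P)
     && \text{(a population unique is also unique on any file containing it)} \notag\\
  &= P(A \in U_P \mid A \in F,\; A \in I)\; P(A \in F,\; A \in I) \tag{2}\\
  &= P(A \in U_P)\; P(A \in F,\; A \in I)
     && \text{(uniqueness independent of selection)} \tag{3}
\end{align}
% Independent files:
\[
R = P(A \in U_P)\, P(A \in F)\, P(A \in I) \tag{4}
\]
% PUMF a subset of the intruder's file, so P(A \in F, A \in I) = P(A \in F):
\[
R = P(A \in U_P)\, P(A \in F) \tag{5}
\]
% Response knowledge in the sense of 3.1.2, P(A \in F) = 1, applied to (4):
\[
R = P(A \in U_P)\, P(A \in I)
\]
```

Read this way, each simplification removes one factor from the joint probability, so the risk is always bounded above by the probability of population uniqueness.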
3.2 Broader Definition of Disclosure Risk
Even if an individual is not unique in the population, he can still be identified with additional information.
Suppose C individuals in the population have the same values of the key variables and matching to any one of them is equally likely.
Define
= Equivalence class of size C in the population.
Then the probability of correct identification is,
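One plausible form of this probability, offered only as a sketch: if matching to any of the C members of the class is equally likely, a correct match has conditional probability 1/C. The symbols E_C (equivalence class of size C), F (PUMF), I (intruder's file) and R_C are assumptions introduced here, and the expression is inferred from the equal-likelihood statement above, not taken from the original slide.

```latex
\[
R_C \;=\; \frac{1}{C}\; P(A \in E_C)\; P(A \in F,\; A \in I)
\]
```

With C = 1 the equivalence class is the unique class, and the expression falls back to the narrow definition based on population uniques only.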
4. Evaluation of Disclosure Risk
We used the measures of disclosure risk developed here in creating PUMFs from the 2005 Korean census data.
We show the results of the applications on the 2005 census data from Choongchung (CC) Province.
The masking scheme used is to coarsen (group) categories.
The Korea National Statistical Office (KNSO) creates the 2 percent PUMFs by taking a 20 percent subsample of the 10 percent census sample (0.1 x 0.2 = 0.02).
: 2 percent PUMF.
: 10 percent census sample.
                       Population   Households   Housing Units
Census                  1,798,397      660,526         586,757
Census Sample (10%)     189, ,505       71,091          65,398
2% Microdata               38,027       14,218          13,038

Table 1. Population Size, and Number of Households and Housing Units, CC Province
Key variables used (number of categories in parentheses): gender (2); age (111); marital status (4); relationship to householder (14); household type (5); tenure (6); building type of residence (12); and type of housing and number of floors of the building (12).
The probability of a population unique is calculated using the 100 percent census file.
Without grouping, the number of uniques is 9,664. It is 0.54% of 1.8 million.
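Counting uniques on a set of key variables amounts to tabulating every combination of key values and counting the records whose combination occurs exactly once. The following is a minimal sketch of that idea on toy data; the function name `count_uniques`, the field names and the records are hypothetical and are not taken from the census file or from KNSO's procedure.

```python
from collections import Counter

def count_uniques(records, key_vars):
    """Count records whose combination of key-variable values occurs exactly once."""
    keys = [tuple(rec[v] for v in key_vars) for rec in records]
    class_sizes = Counter(keys)          # size of each equivalence class
    return sum(1 for k in keys if class_sizes[k] == 1)

# Toy population (hypothetical values, for illustration only).
people = [
    {"gender": "F", "age": 34, "marital": "married"},
    {"gender": "F", "age": 34, "marital": "married"},
    {"gender": "M", "age": 71, "marital": "widowed"},
    {"gender": "M", "age": 23, "marital": "single"},
]

n_unique = count_uniques(people, ["gender", "age", "marital"])
share_unique = n_unique / len(people)    # estimate of the probability of a unique
print(n_unique, share_unique)
```

The share of uniques computed this way on the full population file plays the role of the probability of a population unique.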
If we assume that the intruder has a 10 percent census sample file, the disclosure risk is
However, whole blocks are selected in the 10 percent census sample, so residents of the sample blocks know that their neighbors are also in the sample. To those who have response knowledge, the disclosure risk is
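The arithmetic behind these two risks can be sketched roughly as follows. The sketch assumes the narrow risk is the probability of a population unique times the probability that the individual is on the PUMF, which is the subset case of Section 3.1 as understood here. It further assumes that response knowledge means the individual is known to be in the 10 percent sample, so only the 20 percent subsampling rate remains. Both readings are assumptions, the function name `narrow_risk` is hypothetical, and the printed values are not the figures from the original slide.

```python
# Figures quoted in the text for CC Province.
POPULATION = 1_798_397     # persons in the 100 percent census file
POP_UNIQUES = 9_664        # population uniques before grouping

def narrow_risk(p_unique: float, p_in_pumf: float) -> float:
    """Assumed form: P(population unique) * P(individual is on the PUMF)."""
    return p_unique * p_in_pumf

p_unique = POP_UNIQUES / POPULATION                   # about 0.0054

# Intruder holds the 10% sample; the PUMF is a 20% subsample of it.
risk_sample_file = narrow_risk(p_unique, 0.1 * 0.2)

# Assumption: with response knowledge the individual is known to be in the
# 10% sample, so only the 20% subsampling rate remains.
risk_response = narrow_risk(p_unique, 0.2)

print(f"P(unique) = {p_unique:.4%}")
print(f"intruder holds 10% sample: about 1 in {1 / risk_sample_file:,.0f}")
print(f"response knowledge (assumed): about 1 in {1 / risk_response:,.0f}")
```

Under these assumptions response knowledge raises the risk tenfold, because the 10 percent first-stage sampling fraction no longer protects the individual.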
# of Vars   Gender   Age   Relationship   Marital Status   # of Uniques
1           x                                                         0
1                    x                                                2
1                          x                                          0
1                                         x                           0
2           x        x                                                5
2           x              x                                          0
2           x                             x                           0
2                    x     x                                         65
2                    x                    x                          11
2                          x              x                           0
3           x        x     x                                        167
3           x        x                    x                          30
3           x              x              x                           2
3                    x     x              x                         349
4           x        x     x              x                         713

Table 2. Number of Unique Persons before Grouping Categories
Table 3. Number of Uniques with 5 Year Intervals for Age

# of Vars   Gender   Grouped Age   Relationship   Marital Status   # of Uniques
1                    x                                             2 → 0
2           x        x                                             5 → 2
2                    x             x                               65 → 6
2                    x                            x                11 → 1
3           x        x             x                               167 → 18
3           x        x                            x                30 → 3
3                    x             x              x                349 → 53
4           x        x             x              x                713 → 106
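The effect of coarsening age into 5-year intervals can be illustrated with a small sketch: recode the age variable, then recount the uniques on the same key variables. The helper names (`group_age`, `coarsen`, `count_uniques`) and the toy records are hypothetical, and the interval boundaries (0-4, 5-9, ...) are an assumption, since the slides do not state where the intervals start.

```python
from collections import Counter

def group_age(age, width=5):
    """Map a single year of age to its interval, e.g. 33 -> (30, 34)."""
    lower = (age // width) * width
    return (lower, lower + width - 1)

def count_uniques(records, key_vars):
    """Count records whose key-variable combination occurs exactly once."""
    keys = [tuple(rec[v] for v in key_vars) for rec in records]
    sizes = Counter(keys)
    return sum(1 for k in keys if sizes[k] == 1)

def coarsen(records, var, recode):
    """Return copies of the records with one variable recoded."""
    return [{**rec, var: recode(rec[var])} for rec in records]

# Toy data (hypothetical), keyed on gender and age.
toy = [
    {"gender": "F", "age": 31},
    {"gender": "F", "age": 33},
    {"gender": "M", "age": 40},
    {"gender": "M", "age": 44},
    {"gender": "M", "age": 87},
]

before = count_uniques(toy, ["gender", "age"])
after = count_uniques(coarsen(toy, "age", group_age), ["gender", "age"])
print(f"{before} -> {after}")
```

In this toy data the single very old person stays unique after grouping, which is why further steps such as wider intervals for extreme ages or grouping other variables may still be needed.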
Table 4. Number of Uniques with Grouped Age and Relationship Categories

# of Vars   Gender   Grouped Age   Grouped Relationship   Marital Status   # of Uniques
2                    x             x                                       6 → 2
3           x        x             x                                       18 → 4
3                    x             x                      x                53 → 3
4           x        x             x                      x                106 → 8
Table 5. Number of Uniques with Grouped Age, Relationship and Marital Status Categories

# of Vars   Gender   Grouped Age   Grouped Relationship   Grouped Marital Status   # of Uniques
3           x        x                                    x                        3 → 1
3                    x             x                      x                        3 → 3
4           x        x             x                      x                        8 → 4
Table 6. Two different groupings in the number of categories
Relationship Building Type Type of Housing and # of Floors # of Uniques
Grouping 1 9 (14) (14) 6 (12) (12)
Grouping 2 3 (14) (14) 4 (12) (12)
Probability of unique = 0.028% for both groupings.
If we assume the intruder has the 10 percent census sample file, the disclosure risk is < 1 in 100,000.
If we assume response knowledge, the disclosure risk goes up to
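As a rough check of the "< 1 in 100,000" figure, assume the risk is the probability of a unique times the 2 percent sampling fraction of the PUMF. This is the subset case of Section 3.1 as understood here, and it is an assumption, not a formula quoted from the slide.

```latex
\[
0.00028 \times 0.02 \;=\; 5.6 \times 10^{-6} \;\approx\; \frac{1}{178{,}571} \;<\; \frac{1}{100{,}000}
\]
```

The product is consistent with the bound stated above.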
5. Concluding Remarks
We developed comprehensive probability models quantifying disclosure risk for microdata files and applied them to the Korean census data.
Using the models, we measured the disclosure risks for the original census data. The risks were too high.
We grouped categories of the key variables and re-calculated the disclosure risks. The risks were lowered to a satisfactory level.
For creating its official 2 percent PUMFs from the census data, KNSO used the approaches described here, including the measures of disclosure risk and the grouping of categories.
Thank you very much!
Jay J. Kim
Dong M. Jeong
Disclaimer: This paper represents the views of the authors and should not be interpreted as representing the views, policies or practices of the Centers for Disease Control and Prevention, National Center for Health Statistics.