

The Application of the Concept of Uniqueness for Creating Public Use Microdata Files
Jay J. Kim, U.S. National Center for Health Statistics
Dong M. Jeong, Korea National Statistical Office

Contents
Introduction
Intruders and Disclosure
Measures of Disclosure Risk
   1. Narrow Definition of Disclosure Risk
   2. Broader Definition of Disclosure Risk
Evaluation of Definition of Disclosure Risk
Concluding Remarks

1. Introduction
Government agencies release microdata files from their survey data or administrative records data.
Large amounts of information on individuals are available to many organizations and data users, who can become “intruders”.
If a public use microdata file (PUMF) is released, intruders can try to match their records with those in the PUMF and gain access to new information.

Intruders link the records on the two files through variables common to the PUMF and their own files; these are called “key variables” or “matching variables”.
In the U.S., laws such as Title 13 stipulate protection of the confidentiality of many types of data.
Thus, the data disseminating agencies must protect the confidentiality of the individuals on the PUMFs. On the other hand, they should not ignore the data users’ needs, i.e., the utility of the data files.
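As a rough illustration of linking on key variables, the sketch below indexes a toy PUMF by its key values and looks up each record of a toy intruder file. The variable names, the records, and the helper `link_on_keys` are all hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical key variables shared by both files (illustrative names only).
KEY_VARS = ("gender", "age", "marital_status")


def link_on_keys(pumf, intruder_file, key_vars=KEY_VARS):
    """Pair each intruder record with the PUMF records that share its key values."""
    index = defaultdict(list)
    for rec in pumf:
        index[tuple(rec[k] for k in key_vars)].append(rec)

    links = []
    for rec in intruder_file:
        matches = index.get(tuple(rec[k] for k in key_vars), [])
        if matches:
            links.append((rec, matches))
    return links


# Toy data: the PUMF carries a sensitive attribute but no names;
# the intruder's file carries names but not the sensitive attribute.
pumf = [
    {"gender": "F", "age": 97, "marital_status": "widowed", "income": 12000},
    {"gender": "M", "age": 40, "marital_status": "married", "income": 50000},
    {"gender": "M", "age": 40, "marital_status": "married", "income": 61000},
]
intruder_file = [
    {"name": "Person X", "gender": "F", "age": 97, "marital_status": "widowed"},
    {"name": "Person Y", "gender": "M", "age": 40, "marital_status": "married"},
]

links = link_on_keys(pumf, intruder_file)
for rec, matches in links:
    print(rec["name"], "->", len(matches), "candidate PUMF record(s)")
```

In this toy case the first person matches a single PUMF record, so the sensitive attribute would be revealed, while the second matches two records and stays ambiguous.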

Here, we develop probability models quantifying disclosure risk for a microdata file.
This is a modification of the Marsh et al. (1991) procedure.
The model can use population and sample “uniques” only, or it can also include population twins or triplets.
We will show the results of applying the probability model, using population and sample uniques only, for creating disclosure-limited microdata files from the 2005 Korean demographic census data.

2. Intruders and Disclosure
Potential intruders:
i). Organizational intruders, e.g., credit card companies, mortgage departments of banks, insurance companies, credit bureaus, trade associations, etc.
ii). Individual intruders: with readily available high-powered computers, anyone can assemble their own database using information in the public domain and become an intruder.

Two types of disclosure:
i). Identity disclosure (identification).
If the intruder is a journalist trying to embarrass the data disseminating agencies, a claim of having successfully identified someone on their PUMF would be sufficient.
If the intruder publicizes the findings in the news media, it could have a devastating effect on the agencies’ data collection efforts.

ii). Attribute disclosure.
After identification is made, one can gain new sensitive information.
For defining a measure of disclosure risk, we treat identity disclosure as equivalent to disclosure.

3. Measures of Disclosure Risk
Define
P(a) = the probability that the key variables are recorded identically in both the PUMF and the intruder’s file;
P(b|a) = the probability that an individual appears in the PUMF, which equals the sampling fraction for that individual in the PUMF;
P(c|a,b) = the probability that the individual is a population unique; and
P(d|a,b,c) = the probability of verifying population uniqueness.
Marsh et al. (1991) defined the probability of correct identification of an individual as
P(a) P(b|a) P(c|a,b) P(d|a,b,c)

We modify the model of Marsh et al. We assume in their formula that
i). There are no recording or classification errors in the values of the key variables, i.e., P(a) = 1.
ii). Population uniqueness can be verified with certainty, i.e., P(d|a,b,c) = 1.
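Written out, the two assumptions collapse the four-factor product to two factors. The display below is only a restatement of the definitions given on the slides; the left-hand label is introduced here for readability.

```latex
\begin{align*}
P(\text{correct identification})
  &= P(a)\,P(b \mid a)\,P(c \mid a,b)\,P(d \mid a,b,c) \\
  &= 1 \cdot P(b \mid a)\,P(c \mid a,b) \cdot 1 \\
  &= P(b \mid a)\,P(c \mid a,b).
\end{align*}
```

Under these assumptions the risk is the sampling fraction multiplied by the probability of being a population unique.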

Disclosure can occur when all of the following five conditions are met:
i). The individual is unique in the population based on the key variables. If the intruder’s file is a 100 percent population file, the intruder can establish the uniqueness of a given individual from that file.
ii). The individual is on the PUMF.
iii). The individual is on the intruder’s file. An intruder can have information on the key variables for a specific person and try to determine whether that person appears in the PUMF. In this case, the intruder’s file has a single record.
iv). The individual is unique on the PUMF.
v). The individual is unique on the intruder’s file.
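The five conditions can be checked mechanically on toy files. In the minimal sketch below, each file is assumed to be a dict from a hypothetical person id to that person's tuple of key-variable values; the function names and the data are illustrative only.

```python
from collections import Counter


def disclosure_conditions(person_id, population, pumf, intruder_file):
    """Evaluate the five conditions for one person.

    Each file is a dict mapping a (hypothetical) person id to that
    person's tuple of key-variable values.
    """
    key = population[person_id]
    pop_counts = Counter(population.values())
    pumf_counts = Counter(pumf.values())
    intruder_counts = Counter(intruder_file.values())
    return {
        "unique_in_population": pop_counts[key] == 1,
        "on_pumf": person_id in pumf,
        "on_intruder_file": person_id in intruder_file,
        "unique_on_pumf": pumf_counts[key] == 1,
        "unique_on_intruder_file": intruder_counts[key] == 1,
    }


def disclosure_possible(person_id, population, pumf, intruder_file):
    """True only when all five conditions hold at once."""
    checks = disclosure_conditions(person_id, population, pumf, intruder_file)
    return all(checks.values())


# Toy files keyed by person id; values are (gender, age).
population = {1: ("F", 97), 2: ("M", 40), 3: ("M", 40), 4: ("F", 30)}
pumf = {1: ("F", 97), 2: ("M", 40)}
intruder_file = {1: ("F", 97), 3: ("M", 40), 4: ("F", 30)}

print(disclosure_possible(1, population, pumf, intruder_file))
print(disclosure_possible(2, population, pumf, intruder_file))
```

Person 2 is unique on the toy PUMF but shares key values with person 3 in the population, which is the sample-unique-but-not-population-unique case.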

Define
A = an individual of interest;
[symbol missing] = PUMF;
[symbol missing] = an intruder’s file;
[symbol missing] = unique class in the population;
[symbol missing] = unique class in PUMF; and
[symbol missing] = unique class in intruder’s file.

3.1 A Narrow Definition of Disclosure Risk
This definition depends on the population and sample uniques only.
Assume an intruder goes on a phishing (fishing) expedition.

The probability of correct identification:
[equation missing] (1)
If an individual is a population unique, it is also a sample unique, i.e.,
[expression missing]
Equation (1) reduces to
[expression missing]
which can be further re-expressed as follows:
[equation missing] (2)
The event that A is unique in the population is independent of whether A is selected in the sample. Thus, equation (2) reduces to
[equation missing] (3)
The event that A is in the PUMF is usually independent of the event that A is in the intruder’s file. In this case, equation (3) can be simplified as
[equation missing] (4)
However, a survey can be a subset of another survey. For example, the U.S. Census Bureau’s PUMF is a subset of its census sample. Thus, if [symbol missing] is a subset of [symbol missing], then [expression missing], and equation (3) becomes
[equation missing] (5)
Also,
[equation missing] (6)
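The displayed equations did not survive in this transcript. As a hedged sketch of the factorization that the text describes, and not a reconstruction of the authors' equations, one possible reading is given below. The notation is assumed: F_P stands for the PUMF, F_I for the intruder's file, U for the event that A is unique in the population, and R for the resulting risk.

```latex
% Assumed notation (not from the slides): F_P = PUMF, F_I = intruder's file,
% U = the event that A is unique in the population, R = disclosure risk.
\begin{align*}
R &= P(A \in F_P,\ A \in F_I,\ U) \\
  &= P(A \in F_P,\ A \in F_I)\, P(U)
     && \text{uniqueness independent of selection} \\
  &= P(A \in F_P)\, P(A \in F_I)\, P(U)
     && \text{if the two files are selected independently} \\
\intertext{and, if instead $F_P \subset F_I$, then $P(A \in F_P,\ A \in F_I) = P(A \in F_P)$, so}
R &= P(A \in F_P)\, P(U).
\end{align*}
```

The second, third, and last lines plausibly correspond to the steps labelled (3), (4), and (5) in the text, but that mapping is itself an assumption.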

3.1.2 Assuming an Intruder Already Knows That A Is in the PUMF
If the intruder has response knowledge, then
[expression missing]
Thus, from equation (4), the disclosure risk will be
[expression missing]

3.2 Broader Definition of Disclosure Risk
Even if an individual is not unique in the population, the individual can still be identified with additional information.
Suppose C individuals in the population have the same values of the key variables and matching to any one of them is equally likely.
Define
[symbol missing] = equivalence class of size C in the population.
Then the probability of correct identification is
[equation missing]
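The full risk expression for the broader definition is not in the transcript. The sketch below illustrates only the "equally likely" ingredient: a record in an equivalence class of size C is matched correctly with probability 1/C. The function name and the toy keys are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def match_probabilities(population_keys):
    """Map each key combination to 1/C, where C is its equivalence-class size."""
    class_sizes = Counter(population_keys)
    return {key: Fraction(1, size) for key, size in class_sizes.items()}


# Toy population of (gender, age) keys: one unique, one twin pair, one triplet.
keys = [("F", 97), ("M", 40), ("M", 40), ("F", 30), ("F", 30), ("F", 30)]
probs = match_probabilities(keys)
for key, p in probs.items():
    print(key, p)
```

A population unique is the special case C = 1, so the narrow definition falls out of this view when only classes of size one are counted.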

4. Evaluation of Disclosure Risk
We used the measures of disclosure risk developed here in creating PUMFs from the 2005 Korean census data.
We show the results of the applications on the 2005 census data from Choongchung (CC) Province.
The masking scheme used is to coarsen (group) categories.

The Korea National Statistical Office (KNSO) creates the 2 percent PUMFs by taking a 20 percent subsample of the 10 percent census sample (0.1 x 0.2 = 0.02).
[symbol missing]: 2 percent PUMF.
[symbol missing]: 10 percent census sample.

Table 1. Population Size, and Number of Households and Housing Units – CC Province

                       Population   Households   Housing Units
Census                  1,798,397      660,526         586,757
Census Sample (10%)       189,505       71,091          65,398
2% Microdata               38,027       14,218          13,038

Key variables used (number of categories in parentheses): gender (2); age (111); marital status (4); relationship to householder (14); household type (5); tenure (6); building type of residence (12); and type of housing and number of floors of the building (12).
The probability of a population unique is calculated using the 100 percent census file.
Without grouping, the number of uniques is 9,664, which is 0.54% of the population of 1.8 million.

If we assume that the intruder has a 10 percent census sample file, the disclosure risk is
[value missing]
However, whole blocks are selected in the 10 percent census sample, so residents of the sample blocks know that their neighbors are also in the sample. For those who have this response knowledge, the disclosure risk is
[value missing]

Table 2. Number of Unique Persons before Grouping Categories

# of Vars   Gender   Age   Relationship   Marital Status   # of Uniques
1           x                                              0
1                    x                                     2
1                          x                               0
1                                         x                0
2           x        x                                     5
2           x              x                               0
2           x                             x                0
2                    x     x                               65
2                    x                    x                11
2                          x              x                0
3           x        x     x                               167
3           x        x                    x                30
3           x              x              x                2
3                    x     x              x                349
4           x        x     x              x                713
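A table of this kind can be produced by counting, for every subset of the key variables, the records whose combination of values occurs exactly once. The sketch below does this on four toy records; the record layout and function names are assumptions, and the counts have nothing to do with the census figures.

```python
from collections import Counter
from itertools import combinations


def count_uniques(records, variables):
    """Number of records whose values on `variables` occur exactly once."""
    counts = Counter(tuple(rec[v] for v in variables) for rec in records)
    return sum(1 for size in counts.values() if size == 1)


def uniques_by_subset(records, variables):
    """Count uniques for every non-empty subset of `variables`."""
    return {
        subset: count_uniques(records, subset)
        for r in range(1, len(variables) + 1)
        for subset in combinations(variables, r)
    }


# Toy records, not census data.
records = [
    {"gender": "F", "age": 30, "marital": "single"},
    {"gender": "F", "age": 30, "marital": "married"},
    {"gender": "M", "age": 30, "marital": "married"},
    {"gender": "M", "age": 85, "marital": "married"},
]

table = uniques_by_subset(records, ("gender", "age", "marital"))
for subset, n in table.items():
    print(len(subset), subset, n)
```

Adding a variable can only split equivalence classes, so the number of uniques never decreases as variables are added, which is the pattern visible in the table.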

Table 3. Number of Uniques with 5 Year Intervals for Age

# of Vars   Gender   Grouped Age   Relationship   Marital Status   # of Uniques
1                    x                                             2 → 0
2           x        x                                             5 → 2
2                    x             x                               65 → 6
2                    x                            x                11 → 1
3           x        x             x                               167 → 18
3           x        x                            x                30 → 3
3                    x             x              x                349 → 53
4           x        x             x              x                713 → 106
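The effect of coarsening age can be sketched on toy data: exact ages are mapped to 5-year intervals and uniques are recounted. The interval boundaries (multiples of five) and the helper names are assumptions; the slides do not say how the intervals were cut.

```python
from collections import Counter


def group_age(age, width=5):
    """Map an exact age to the lower bound of its `width`-year interval."""
    return (age // width) * width


def count_unique_keys(keys):
    """Number of key combinations that occur exactly once."""
    counts = Counter(keys)
    return sum(1 for size in counts.values() if size == 1)


# Toy (gender, age) keys, not census data.
people = [("F", 31), ("F", 33), ("M", 40), ("M", 44), ("M", 87)]

before = count_unique_keys(people)
after = count_unique_keys((gender, group_age(age)) for gender, age in people)
print(before, "->", after)
```

Records that were distinguishable only by exact age fall into the same class after grouping, which is why the counts drop.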

Table 4. Number of Uniques with Grouped Age and Relationship Categories

# of Vars   Gender   Grouped Age   Grouped Relationship   Marital Status   # of Uniques
2                    x             x                                       6 → 2
3           x        x             x                                       18 → 4
3                    x             x                      x                53 → 3
4           x        x             x                      x                106 → 8

Table 5. Number of Uniques with Grouped Age, Relationship and Marital Status Categories

# of Vars   Gender   Grouped Age   Grouped Relationship   Grouped Marital Status   # of Uniques
3           x        x                                    x                        3 → 1
3                    x             x                      x                        3 → 3
4           x        x             x                      x                        8 → 4

Table 6. Two different groupings in the number of categories (original number of categories in parentheses)
Columns: Relationship; Building Type; Type of Housing and # of Floors; # of Uniques
Grouping 1: 9 (14); 6 (12)
Grouping 2: 3 (14); 4 (12)
[remaining cell values missing]

Probability of unique = 0.028% for both groupings.
If we assume the intruder has the 10 percent census sample file, the disclosure risk is < 1 in 100,000.
If we assume response knowledge, the disclosure risk goes up to
[value missing]
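The arithmetic behind such risk figures can be sketched under stated assumptions. The code below assumes the risk is the probability of a population unique multiplied by the probability of being in the PUMF, which is one reading of the nested case where the PUMF is a subset of the intruder's 10 percent sample. It further assumes that response knowledge means conditioning on membership in the 10 percent sample, so the inclusion probability becomes the 20 percent subsampling rate. Neither reading is confirmed by the transcript, and the printed values are not the authors' reported figures.

```python
POPULATION = 1_798_397        # census population of CC Province (Table 1)
UNIQUES_UNGROUPED = 9_664     # population uniques before grouping
P_UNIQUE_GROUPED = 0.00028    # 0.028% after grouping

PUMF_FRACTION = 0.02          # 2 percent PUMF
SUBSAMPLE_FRACTION = 0.2      # PUMF is a 20 percent subsample of the 10 percent sample


def risk_nested(p_unique, p_in_pumf):
    """Assumed form when the PUMF is a subset of the intruder's file."""
    return p_unique * p_in_pumf


p_unique_ungrouped = UNIQUES_UNGROUPED / POPULATION

# Scenario labels are illustrative; the inclusion probabilities follow the
# assumptions stated in the lead-in, not the slides.
scenarios = {
    "ungrouped, intruder holds 10% sample": risk_nested(p_unique_ungrouped, PUMF_FRACTION),
    "ungrouped, response knowledge": risk_nested(p_unique_ungrouped, SUBSAMPLE_FRACTION),
    "grouped, intruder holds 10% sample": risk_nested(P_UNIQUE_GROUPED, PUMF_FRACTION),
    "grouped, response knowledge": risk_nested(P_UNIQUE_GROUPED, SUBSAMPLE_FRACTION),
}

for label, risk in scenarios.items():
    print(f"{label}: {risk:.2e} (about 1 in {round(1 / risk):,})")
```

Under these assumptions the grouped, 10 percent sample scenario comes out below 1 in 100,000, which agrees with the bound stated on the slide; the other three values are only what the assumed formula yields.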

5. Concluding Remarks
We developed comprehensive probability models quantifying disclosure risk for microdata files and applied them to the Korean census data.
Using the models, we measured the disclosure risks for the original census data. The risks were too high.

We grouped categories of the key variables and re-calculated the disclosure risks. The risks were lowered to a satisfactory level.
To create its official 2 percent PUMFs from the census data, KNSO used the approaches described here, including the measures of disclosure risk and the grouping of categories.

Thank you very much!
Jay J. Kim
Dong M. Jeong
Disclaimer: This paper represents the views of the authors and should not be interpreted as representing the views, policies or practices of the Centers for Disease Control and Prevention, National Center for Health Statistics.