Controlled shuffling experiment: detailed 10% sample of 2011 census of Ireland - Risk, confidentiality and utility Presenter: Robert McCaa


Co-authors: Krish Muralidhar, Rathindra Sarathy, Michael Comerford, Albert Esteve

Outline
1. The challenge: disseminate high-precision household census samples with minimum risk and maximum utility
   • Test case: Ireland 2011, a 10% sample with 60 variables and 1,500 unique codes, including single years of age, relationship to head, 3-digit occupation and industry, etc.
2. Risk: although anonymized, a highly risky sample
3. Controlled shuffling: 5 variables
4. Utility: after 3 experiments, amazingly good utility
5. Next steps
   • Re-do the experiment to increase precision
   • Apply the IPUMS suite of disclosure controls
   • Submit the sample to CSO-Ireland for testing and approval
   • Integrate and disseminate, for launch July 2014
   • Other candidates? Canada, Italy, Netherlands, South Korea, UK?

IPUMS-International microdata dissemination: Trusted researchers download customized extracts
• Each NSI (NSI 1 … NSI 100+) entrusts census metadata and anonymized microdata to MPC
• MPC integrates metadata and confidentializes microdata samples
• IPUMS-International manages access and entrusts researchers with custom-tailored SAS, STATA, and SPSS metadata and microdata extracts for any combination of countries, censuses, sub-populations, and variables
[Diagram: NSI 1 … NSI 100+ → MPC → trusted researchers]

IPUMS-International (map, Mollweide projection)
• Dark green: confidentialized, harmonized and disseminating (74 countries, 238 samples, 544 million person records)
• Medium green: integrating (25 countries, 75 censuses, 100 million)
• Light green: negotiating

238 samples, 74 countries, 544 million person records (2014: ~260 samples, 80 countries, including Ireland 2011!)
More countries and samples added yearly

Number of samples per country, by region:

Africa               Americas             Asia                 Europe
Burkina Faso     2   Argentina        5   Armenia          1   Austria          4
Cameroon         3   Bolivia          3   Bangladesh       3   Belarus          1
Egypt            2   Brazil           6   Cambodia         2   France           7
Ghana            1   Canada           4   China            2   Germany          4
Guinea           2   Chile            5   Fiji             5   Greece           4
Kenya            5   Colombia         5   India            5   Hungary          4
Malawi           3   Costa Rica       4   Indonesia        9   Ireland          8
Mali             2   Cuba (2002)      1   Iran             1   Italy            1
Morocco          3   Ecuador          6   Iraq             1   Netherlands      3
Rwanda           2   El Salvador      2   Israel           3   Portugal         3
Senegal          2   Haiti            3   Jordan           1   Romania          3
Sierra Leone     1   Jamaica          3   Kyrgyz Republic  2   Slovenia         1
South Africa     3   Mexico           7   Malaysia         4   Spain            3
South Sudan      1   Nicaragua        3   Mongolia         2   Switzerland      4
Sudan            1   Panama           6   Nepal            1   Turkey           3
Tanzania         2   Peru             2   Pakistan         3   United Kingdom   2
Uganda           2   Puerto Rico      5   Palestine        2
                     Saint Lucia      2   Philippines      3
                     United States    7   Thailand         4
                     Uruguay          5   Vietnam          3
                     Venezuela        4

The 2010 round of censuses poses increased confidentiality risks, yet the demand for data is greater than ever
Risks:
• Big Data: vast troves of electronic information in the cybersphere
• Data mining: large numbers of highly motivated geeks wanting to be the next Bill, Steve, or … Ed (Snowden)
• Public anxiety about identity theft
Demands:
• Huge challenges to the environmental, economic, social, cultural, and political foundations of nations, populations, …
• Researchers demand/need more, higher quality data
• Population census microdata constitute one of the greatest treasures of official statistics

Ireland: first to entrust 2011 census sample (challenge, opportunity)
Initial 2011 sample of Ireland for IPUMS, drained of detail:
• 5-year age bands: single years suppressed
• Households, but relationship variable suppressed!
• Geography, but only for 8 regions, no counties, etc.
Meanwhile, IPUMS was seeking a sample for a confidentiality test:
• CSO agreed to entrust a second, high-precision sample with single years of age, relationship, geography… 60 variables, 1,500 unique codes, every 10th household
• Test controlled shuffling (Muralidhar & Sarathy agreed)
2 challenges for Muralidhar & Sarathy:
1. Persuade IPUMS of data utility (and precision)
2. Convince IPUMS & CSO that confidentiality is protected

k-Anonymity Disclosure Risk Assessment
A standard k-anonymity approach was used to assess the disclosiveness of records:
• Test parameters were drawn from the data environment, sensitivity and characteristics.
• Different configurations of quasi-identifiers were used.
• Variables were flagged and ranked by the number of records affected.
We aim to provide some degree of ground truth on the relative uniqueness of records to inform later experiments.
Results show that age, education, occupational group, industry classification and geographic identifiers, in that order of priority, should be considered in the implementation of any disclosure control methods.
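
Illustration (not from the slides): a minimal sketch of a k-anonymity check, assuming each record is a dictionary and that the quasi-identifiers are chosen by the analyst. The function name, the toy records and the threshold k=3 are hypothetical; the actual test parameters used for the Irish sample are not given in the presentation.

```python
from collections import Counter

def k_anonymity_report(records, quasi_identifiers, k=3):
    """Flag records whose quasi-identifier combination occurs fewer than k times."""
    # One key per record: the tuple of its quasi-identifier values.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    at_risk = [i for i, key in enumerate(keys) if counts[key] < k]
    return {"k": k, "n_records": len(records),
            "n_at_risk": len(at_risk), "at_risk_index": at_risk}

# Toy data (hypothetical values, for illustration only).
records = [
    {"age": 34, "education": "tertiary", "region": "Dublin"},
    {"age": 34, "education": "tertiary", "region": "Dublin"},
    {"age": 34, "education": "tertiary", "region": "Dublin"},
    {"age": 71, "education": "primary", "region": "West"},
    {"age": 52, "education": "secondary", "region": "Border"},
    {"age": 52, "education": "secondary", "region": "Border"},
]
report = k_anonymity_report(records, ["age", "education", "region"], k=3)
print(report["n_at_risk"], "of", report["n_records"], "records fall below k")
```

Running the same function over different quasi-identifier configurations, and counting how many records each variable pushes below k, is one way to rank variables by the number of records affected.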

Data shuffling (see Muralidhar & Sarathy, 2006)
• Shuffling is ideal for nominal data with hierarchical structure, such as age, education, occupation, industry, etc.
• Shuffling is a multivariate procedure in which values are reassigned based on rank order correlation.
• Data shuffling offers the following advantages:
   1. The shuffled values Y have the same marginal distribution as the original values X. Hence, all univariate analyses using Y give exactly the same results as those using X.
   2. The rank order correlation matrix of {Y, S} is asymptotically the same as the rank order correlation matrix of {X, S}. Hence, most multivariate analyses using {Y, S} should asymptotically give the same results as those using {X, S}.
• "Controlled" shuffling: the level of disclosure protection is specified by the data administrator
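
Illustration (not from the slides): a deliberately simplified sketch of the rank-based reassignment step, assuming X is one confidential numeric variable and S is one numeric covariate. It is not the Muralidhar & Sarathy procedure, which models the joint distribution of X and S; here a noisy score driven by S simply decides which record receives which original value. The function name, the `noise_sd` knob and the toy data are hypothetical.

```python
import random

def rank_shuffle(x, s, noise_sd=1.0, seed=0):
    """Reassign the original values of x by the rank order of a noisy score based on s."""
    rng = random.Random(seed)
    n = len(x)
    # Perturbed score: the covariate plus random noise (assumed model, for illustration).
    score = [s_i + rng.gauss(0.0, noise_sd) for s_i in s]
    # Record indices ordered from lowest to highest perturbed score.
    order = sorted(range(n), key=lambda i: score[i])
    x_sorted = sorted(x)
    y = [None] * n
    # The record with the j-th smallest score receives the j-th smallest original value.
    for rank, i in enumerate(order):
        y[i] = x_sorted[rank]
    return y

ages = [23, 35, 41, 52, 67, 29, 74, 18]   # confidential variable X (toy values)
covariate = [2, 12, 20, 30, 40, 6, 45, 0]  # covariate S (toy values)
shuffled = rank_shuffle(ages, covariate, noise_sd=5.0, seed=1)
print(sorted(shuffled) == sorted(ages))    # the marginal distribution is unchanged
```

Because every output value is one of the original values, used exactly once, any univariate tabulation of the shuffled variable matches the original. In this sketch a larger `noise_sd` moves more values between records and weakens the association with the covariate, which loosely mirrors the idea of a protection level set by the data administrator.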

Confidentiality protections are considered strong
• 3 of 4 person records were modified:
   • Age: 50.1% of records
   • Sex: 13.6%
   • Educational attainment: 8.1%
   • Industry: 13.7%
   • Occupation: 12.4%
• Multiple shuffles for adults, couples and households:
   • Adults: 80% with at least 1 shuffled value; 25% with 2 or more
   • Couples: 50% with both ages perturbed; 91% with at least one age perturbed
• Perturbations at the individual level are compounded for households.
• Note: we do not intend to provide this information unless requested by the data provider.
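
Illustration (not from the slides): a minimal sketch of how modification rates of this kind could be tabulated, assuming the original and shuffled files hold the same records in the same order. The function name and the toy records are hypothetical and do not reproduce the Irish figures.

```python
def share_modified(original, shuffled, variables):
    """Return per-variable change rates and the share of records with >=1 and >=2 changes."""
    n = len(original)
    per_variable = {
        v: sum(o[v] != s[v] for o, s in zip(original, shuffled)) / n
        for v in variables
    }
    # Number of changed variables for each record.
    changed = [sum(o[v] != s[v] for v in variables) for o, s in zip(original, shuffled)]
    any_changed = sum(c >= 1 for c in changed) / n
    two_or_more = sum(c >= 2 for c in changed) / n
    return per_variable, any_changed, two_or_more

# Toy data (hypothetical values, for illustration only).
original = [
    {"age": 30, "sex": "F", "occupation": "teacher"},
    {"age": 45, "sex": "M", "occupation": "farmer"},
    {"age": 52, "sex": "F", "occupation": "nurse"},
    {"age": 19, "sex": "M", "occupation": "student"},
]
shuffled_records = [
    {"age": 32, "sex": "F", "occupation": "teacher"},
    {"age": 45, "sex": "M", "occupation": "farmer"},
    {"age": 50, "sex": "F", "occupation": "teacher"},
    {"age": 20, "sex": "F", "occupation": "student"},
]
rates, any_changed, two_or_more = share_modified(
    original, shuffled_records, ["age", "sex", "occupation"])
print(rates, any_changed, two_or_more)
```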

Analytical Utility: 3 tests
1. A. Age gap between spouses (husband’s age minus wife’s age)

Analytical Utility: 3 tests
1. B. Perturbations gone wrong (US 2000 census PUMS)

Analytical Utility: Test #2: Own-Child Fertility
2. Mothers are matched with their children to construct annual birth series by single year of age of mothers.
Note: CSO confirmed the age-specific and total fertility estimates against vital registration figures.

Analytical Utility: Test #3: Educational Homogamy
3. Log-odds of similarity in educational attainment of husbands with wives
Not so good, but the difference may be due to an error in linking couples, discovered after shuffling.
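
Illustration (not from the slides): the presentation does not say how the log-odds were computed. One simple reading, assumed here, is a log odds ratio from the 2x2 table "husband at level L or not" by "wife at level L or not", with a 0.5 continuity correction added to every cell so that empty cells do not break the calculation. The function name and the toy couples are hypothetical.

```python
import math

def homogamy_log_odds(couples, level, correction=0.5):
    """Log odds ratio that spouses share `level`; couples is a list of (husband, wife) pairs."""
    a = sum(h == level and w == level for h, w in couples)   # both at the level
    b = sum(h == level and w != level for h, w in couples)   # husband only
    c = sum(h != level and w == level for h, w in couples)   # wife only
    d = sum(h != level and w != level for h, w in couples)   # neither
    a, b, c, d = (cell + correction for cell in (a, b, c, d))
    return math.log((a * d) / (b * c))

# Toy couples (hypothetical values, for illustration only).
couples = ([("low", "low")] * 4 + [("low", "high")] * 1
           + [("high", "low")] * 2 + [("high", "high")] * 3)
print(round(homogamy_log_odds(couples, "high"), 4))
```

Computing the same quantity on the original and on the shuffled file, level by level, gives a direct comparison of the kind this test describes.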

Conclusions, 1: Refinements to be made
• Precision: more closely approximate frequencies in the unperturbed data
• Fine-tune controlled shuffling:
   • When shuffling sex for unmarried children aged 0-19, take into account educational attainment
   • For industry, take into account 23 first-level groups instead of only 10
   • For occupation and industry, maintain associations with other social variables: segment, social class and disability
   • For educational attainment, take into account the joint characteristics of spouses, and associate with field of study

Conclusions, 2: Refinements to be made
• Apply the classic technical protections for all datasets entrusted to IPUMS:
   • Top/bottom/group codes for sparse categories
   • Convert large households to “group quarters”, removing household identities
   • Swap a fraction of households across places of residence
• Take into account lessons learned, criticisms and suggestions.
• Additional protections required by CSO
• Invite others: Canada, Italy, Netherlands, South Korea, UK?
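
Illustration (not from the slides): a minimal sketch of two of the classic protections, top-coding a numeric variable and grouping sparse categories. The function names, the ceiling of 85 and the minimum count of 2 are hypothetical; the thresholds IPUMS actually applies are not given in the presentation.

```python
from collections import Counter

def top_code(values, ceiling):
    """Replace every value above `ceiling` with `ceiling`."""
    return [min(v, ceiling) for v in values]

def group_sparse(values, min_count, other="other"):
    """Collapse categories that occur fewer than `min_count` times into one group code."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Toy data (hypothetical values, for illustration only).
ages = [5, 40, 88, 97]
occupations = ["farmer", "farmer", "astronaut", "lighthouse keeper", "farmer"]
print(top_code(ages, 85))
print(group_sparse(occupations, min_count=2))
```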

Thank you!