Alternative Approaches to Data Dissemination and Data Sharing Jerome Reiter Duke University

Two general settings
- An agency seeks to release confidential data to the public.
- Multiple agencies seek to improve analyses by sharing their confidential data.
For both settings, agencies seek strategies that (i) do not reveal identities or sensitive attributes, (ii) are useful for a wide range of analyses, and (iii) are easy for analysts and agencies to use.

Some alternative approaches
- Remote access servers
- Synthetic (i.e., simulated) data
- Secure computation techniques

Definition of servers
A server is any system that (i) allows users to submit queries for output from statistical analyses of microdata, but (ii) does not give direct access to the microdata.
Types: table servers and model servers.

Queries and responses
Queries to model server: users request results from fitting a statistical model to the data.
Response from model server:
-- Answerable query: model output.
-- Unanswerable query: no results.
Model output also should include diagnostics.
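The query and response pattern above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `model_server`, the `MIN_SUBSET_SIZE` refusal rule, and the toy records are all hypothetical and do not come from the slides, and a real server would accept a restricted query language rather than an arbitrary Python filter.

```python
from statistics import mean

# Hypothetical disclosure rule (an assumption, not from the slides):
# refuse any query whose subset contains fewer than this many records.
MIN_SUBSET_SIZE = 5

# Made-up confidential microdata held by the server; users never see it.
_MICRODATA = [
    {"age": 25, "income": 31.0, "region": "N"},
    {"age": 32, "income": 38.0, "region": "N"},
    {"age": 36, "income": 40.0, "region": "S"},
    {"age": 41, "income": 45.0, "region": "N"},
    {"age": 47, "income": 52.0, "region": "S"},
    {"age": 53, "income": 55.0, "region": "N"},
    {"age": 58, "income": 61.0, "region": "S"},
    {"age": 62, "income": 64.0, "region": "S"},
]


def model_server(response, predictor, subset=None):
    """Fit response ~ predictor by least squares, or refuse the query."""
    rows = [r for r in _MICRODATA if subset is None or subset(r)]
    xs = [r[predictor] for r in rows]
    ys = [r[response] for r in rows]
    # Unanswerable query: subset too small, or the predictor does not vary.
    if len(rows) < MIN_SUBSET_SIZE or len(set(xs)) < 2:
        return {"answerable": False}
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    # Answerable query: model output plus simple diagnostics.
    return {
        "answerable": True,
        "coefficients": {"intercept": intercept, predictor: slope},
        "diagnostics": {
            "n": len(rows),
            "r_squared": 1 - ss_res / ss_tot if ss_tot > 0 else None,
        },
    }


full = model_server("income", "age")
refused = model_server("income", "age", subset=lambda r: r["age"] > 60)
```

Here `full` carries coefficients and diagnostics, while `refused` carries no results because only one record satisfies the filter. A size threshold alone would not stop the "smart query" attacks listed among the statistical challenges; it only illustrates the answerable/unanswerable split.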

Challenges in developing model servers
Non-statistical: operation costs, server security, etc.
Statistical:
-- Disclosure risks from smart queries (e.g., subsets, transformations).
-- Inferential disclosure risks.
-- Enabling complex model fitting.

Synthetic data
Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that:
- No unit in released data has sensitive data from actual unit in population.
- Released data look like actual data.
- Statistical procedures valid for original data are valid for released data.

Generating fully synthetic data
1. Randomly sample new units from the sampling frame.
2. Impute survey variables for new units using models fit from observed data.
3. Repeat multiple times and release the datasets.
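A minimal sketch of these three steps, assuming a sampling frame that records one covariate (age) for every unit and a survey that observed one confidential variable (income). The names `fit_line` and `make_fully_synthetic`, the linear model, and the simulated data are illustrative assumptions. The sketch also plugs in point estimates of the model parameters; a full implementation would typically draw the parameters from their posterior distribution before imputing.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line; returns (intercept, slope, residual sd)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = (sum(e * e for e in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sigma


def make_fully_synthetic(frame_ages, observed, n_synthetic, m, rng):
    """Return m synthetic datasets of (age, imputed income) pairs."""
    # The imputation model is fit once, from the observed survey data.
    intercept, slope, sigma = fit_line(
        [age for age, _ in observed], [inc for _, inc in observed]
    )
    datasets = []
    for _ in range(m):  # step 3: repeat multiple times
        # Step 1: randomly sample new units from the sampling frame.
        new_ages = rng.sample(frame_ages, n_synthetic)
        # Step 2: impute the survey variable for the new units.
        datasets.append(
            [(a, intercept + slope * a + rng.gauss(0, sigma)) for a in new_ages]
        )
    return datasets


rng = random.Random(0)  # fixed seed so the illustration is repeatable
frame_ages = [rng.randint(20, 70) for _ in range(1000)]
observed = [(a, 10 + 0.8 * a + rng.gauss(0, 5))
            for a in rng.sample(frame_ages, 100)]
synthetic_datasets = make_fully_synthetic(frame_ages, observed, 50, 5, rng)
```

Every released income is a model draw, so no released record carries an observed sensitive value.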

Modification: Release partially synthetic data
Little (1993, JOS): create multiple, partially synthetic datasets for public release so that:
- Released data comprise mix of observed and synthetic values.
- Released data look like actual data.
- Statistical procedures valid for original data are valid for released data.
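As a rough sketch of partial synthesis, the code below keeps every observed record and replaces the sensitive variable only for selected units. The selection rule (incomes above 55), the linear model, and the names `make_partially_synthetic` and `is_high_income` are assumptions made for illustration; as a simplification, the model parameters are plugged in rather than drawn from a posterior distribution.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line; returns (intercept, slope, residual sd)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = (sum(e * e for e in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sigma


def is_high_income(age, income):
    """Hypothetical rule marking units whose income is treated as sensitive."""
    return income > 55


def make_partially_synthetic(records, is_selected, m, rng):
    """Return m datasets mixing observed and synthetic incomes."""
    intercept, slope, sigma = fit_line(
        [a for a, _ in records], [y for _, y in records]
    )
    datasets = []
    for _ in range(m):
        released = []
        for age, income in records:
            if is_selected(age, income):
                # Replace the observed value with a draw from the model.
                income = intercept + slope * age + rng.gauss(0, sigma)
            released.append((age, income))  # unselected values pass through
        datasets.append(released)
    return datasets


rng = random.Random(1)  # fixed seed so the illustration is repeatable
records = [(a, 10 + 0.8 * a + rng.gauss(0, 5)) for a in range(20, 70)]
partial = make_partially_synthetic(records, is_high_income, m=3, rng=rng)
```

Unlike the fully synthetic case, the released units are the original units, so some observed values remain in the released files.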

Existing applications
- Kennickell (1997, Record Linkage Techniques): Replace sensitive values for selected units.
- Liu and Little (2002, JSM Proceedings): Replace values of key identifiers for selected units.
- Abowd and Woodcock (2001, Confidentiality, Disclosure, and Data Access): Replace all values of sensitive variables.

Sample of research agenda
- Implement and compare various data generation approaches on genuine data in production settings.
- Evaluate risk/usefulness profile on genuine data in production setting.
- Develop packaged synthesizers for data disseminators to use.

Secure computations
- Horizontally Partitioned: Agencies have different records but same variables.
- Purely Vertically Partitioned: Agencies have same records but different variables.
- Partially Overlapping, Vertically Partitioned: Agencies have different records and different variables, with some common records and variables.

Horizontally Partitioned Data: Secure Summation
Secure summation
-- shares sums without sharing data.
-- allows regressions, clustering, classifications.
-- assumes agencies are semi-honest.
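To see why shared sums are enough for a regression, note that a least-squares line depends on the data only through n, Σx, Σy, Σxy, and Σx². The sketch below, with made-up data and hypothetical function names, has each agency compute these sums locally. Plain addition stands in for the secure summation protocol, which would deliver the same totals without revealing any agency's contribution.

```python
def local_sums(xs, ys):
    """Sufficient statistics one agency computes on its own records."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * y for x, y in zip(xs, ys)),
        sum(x * x for x in xs),
    )


def regression_from_sums(n, sx, sy, sxy, sxx):
    """Least-squares intercept and slope from pooled sums only."""
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Three agencies hold different records on the same two variables.
agencies = [
    ([1, 2, 3], [2.1, 3.9, 6.2]),
    ([4, 5], [8.1, 9.8]),
    ([6, 7, 8], [12.2, 13.9, 16.1]),
]

# Plain addition here; in practice each total would come from secure summation.
totals = tuple(
    sum(parts) for parts in zip(*(local_sums(xs, ys) for xs, ys in agencies))
)
intercept, slope = regression_from_sums(*totals)
```

The result matches the fit obtained by pooling all eight records, although no agency has seen another agency's rows. With several predictors, the same idea applies to the entries of X'X and X'y.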

Horizontal Partitioning: Secure summation
Goal: obtain the sum of the agencies' values x without sharing any individual value.
1. Agency A adds a random number R, known only to itself, to its x and passes (x + R) to Agency B.
2. Agency B adds its x to this value and passes the sum to Agency C.
3. The process continues until all agencies have added their x.
4. Agency A subtracts R from the final sum to obtain the total.
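The four steps can be simulated in a single process as follows. This is a sketch, not a networked implementation: the function name `secure_sum` is hypothetical, and the use of modular arithmetic (so that each passed value looks uniformly random) is an assumption that goes beyond what the slide states. Values are assumed to be non-negative integers whose true sum is smaller than the modulus.

```python
import random


def secure_sum(private_values, modulus=2**61 - 1, rng=None):
    """Simulate ring-based secure summation over a list of agency values.

    Returns (total, messages), where messages are the masked running sums
    each agency passes on. Assumes non-negative integers with sum < modulus.
    """
    rng = rng or random.SystemRandom()
    mask = rng.randrange(modulus)  # R, known only to Agency A

    # Step 1: Agency A passes its value plus R to the second agency.
    running = (private_values[0] + mask) % modulus
    messages = [running]

    # Steps 2-3: each remaining agency adds its own value and passes it on.
    for x in private_values[1:]:
        running = (running + x) % modulus
        messages.append(running)

    # Step 4: Agency A removes R from the value that returns to it.
    total = (running - mask) % modulus
    return total, messages


total, messages = secure_sum([12, 7, 30])  # total == 49
```

Each agency sees only a masked running sum, never another agency's raw value. The protection relies on the semi-honest assumption: if the two neighbours of an agency collude, they can subtract the value it received from the value it passed on and recover its x.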

Purely vertical partitioning
Secure dot/matrix product
-- shares dot/matrix products without sharing data.
-- allows regressions, clustering, classification.
-- assumes semi-honest.
Synthetic data approaches
-- share synthetic copies of data across agencies.
-- allows any analysis when distributions used to generate data are accurate.
-- generates public use data file.

A research agenda for secure computation methods
- How to specify models without viewing data?
- What if sophisticated models are needed?
- How to incorporate matching errors and differences in data quality and definitions?
- How to account for disclosure risks from models that fit too well?

Some References
Remote access servers
- Rowland (2003, NAS Panel on Data Access)
- Gomatam, Karr, Reiter, Sanil (2005, Stat. Science)
Synthetic data
- Raghunathan, Reiter, and Rubin (2003, JOS)
- Reiter (2003, Surv. Meth.; 2005, JRSSA)
Secure computation
- Benaloh (1987, CRYPTO86)
- Karr, Lin, Sanil, and Reiter (2005, NISS tech. rep.)