Natalie Shlomo Social Statistics, School of Social Sciences

Presentation transcript:

Perspective on User Needs for Government Data: Where Do We Go From Here?
Natalie Shlomo, Social Statistics, School of Social Sciences, University of Manchester
Natalie.Shlomo@manchester.ac.uk

Topics Covered
- Traditional forms of statistical outputs
- Disclosure risk and data utility
- Differential privacy/Inferential disclosure
- Future dissemination strategies
  - Table generating servers
  - Synthetic data
  - Remote access
  - Remote analysis
- Challenges and limitations

Traditional Statistical Outputs
- Survey microdata
  - Social survey data are generally released via data archives for registered users
  - Business surveys have large sampling fractions (e.g. take-all strata) and highly skewed distributions, and are generally not released
- Tabular data
  - Frequency tables
    - Census (whole population) counts with careful design of output variables
    - Weighted sample counts
  - Magnitude tables
    - Mainly for business statistics

Types of Disclosure Risk: Identity disclosure
- Identification is widely referred to in confidentiality pledges and codes of practice, e.g. “…no statistics will be produced that are likely to identify an individual unless specifically agreed with them” (principle 5 of NS Code of Practice)
- Examples:
  - Survey microdata: a respondent is identified through rare categories (population unique) and/or response knowledge
  - Census tables: a small cell (1 or 2)

Types of Disclosure Risk: Individual attribute disclosure
- Confidential information about a data subject is revealed and can be attributed to the subject
- Identity disclosure is a necessary condition for individual attribute disclosure
- Examples:
  - Survey microdata: an individual is identified and survey target variables are learnt, e.g. health, income
  - Census table: a unique cell on the margin, i.e. structural zeros on the rows/columns

Types of Disclosure Risk: Group attribute disclosure
- Confidential information is learnt about a group and may cause harm, e.g. all adults in a village collect unemployment benefit
- Examples:
  - Survey microdata: group attribute disclosure is difficult to find under survey conditions
  - Census tables: caused by structural zeros, i.e. a row/column consists of all zeros except one cell

Types of Disclosure Risk: Inferential disclosure
- Confidential information may be revealed exactly or to a close approximation
- Examples:
  - Survey microdata: a good prediction model with high $R^2$
  - Census tables: disclosure by differencing
- This type of disclosure has been largely ignored!

Standard SDC Methods
- Survey microdata from social surveys
  - Identity disclosure is the main concern since it can lead to attribute disclosure
  - Disclosure control methods are generally non-perturbative:
    - Deleting highly identifying variables (e.g. geography)
    - Recoding identifying variables (e.g. age, ethnicity)
- Magnitude tables
  - Attribute disclosure (since identities are likely known); the concern is dominance in a cell
  - Disclosure control methods:
    - Table design
    - Cell suppression

Standard SDC Methods
- Census tables
  - Identity disclosure, attribute disclosure and disclosure by differencing
  - Disclosure control methods:
    - Careful design of tables and threshold criteria
    - Fixed variables spanning tables to avoid differencing
    - In some countries, the long form is a sub-sample
    - Pre-tabular methods, e.g. record swapping
    - Post-tabular methods, e.g. forms of rounding

Inferential Disclosure (Differential Privacy)
- Differential privacy is based on disclosure of a target unit where the intruder has knowledge of the entire database except for the target unit itself
- It makes no distinction between key variables and sensitive variables, between types of disclosure risk, or between data arising from a sample and from a population
- Differential privacy is similar to the notion of disclosure by differencing, since in this setting even a sum of counts or an average is disclosive

Inferential Disclosure (Differential Privacy)
- Definition of differential privacy with respect to statistical databases (Dwork et al., 2006; Shlomo and Skinner, 2012)
- Assume a population database $X_U = (x_1, \dots, x_N)$ from which a sample $s$ is drawn
- Assume the agency releases a set of counts $f = (f_1, \dots, f_K)$, where $f_k = \sum_{i \in s} I(x_i = k)$
- Assume the intruder knows the population database except for one target unit
- Let $\Pr[f \mid X_U]$ denote the probability of $f$ with respect to an SDC mechanism, where $X_U$ is treated as fixed

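Inferential Disclosure (Differential Privacy)
- Then $\varepsilon$-differential privacy holds iff $\max \left| \ln \frac{\Pr[f \mid X_U^{(1)}]}{\Pr[f \mid X_U^{(2)}]} \right| \le \varepsilon$ for some $\varepsilon > 0$, with the maximum taken over all possible pairs $(X_U^{(1)}, X_U^{(2)})$ which differ by only one unit and across all possible vectors $f$
- Differential privacy is guaranteed by adding noise to all outputs
- The amount of noise depends on the number of units in the query but is independent of the data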
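
The slide does not say which noise distribution is used. A common choice for released counts is the Laplace mechanism, and the sketch below assumes it. The function names, the default sensitivity of 2 (moving one unit between cells lowers one count by 1 and raises another by 1) and the example counts are illustrative assumptions, not taken from the presentation.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) value as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(counts, epsilon, sensitivity=2.0, rng=None):
    """Add Laplace noise with scale sensitivity/epsilon to every count.

    Assumption: two databases differ by moving one unit between cells,
    so the L1 sensitivity of the count vector is 2.
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


if __name__ == "__main__":
    rng = random.Random(2024)
    print(noisy_counts([1534, 44, 35, 27, 20], epsilon=1.0, rng=rng))
```

The noise scale depends only on the privacy parameter and the sensitivity of the query, never on the observed counts, which is the sense in which the noise is independent of the data.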

Inferential Disclosure (Differential Privacy)
- Does sampling and the release of microdata guarantee differential privacy (Shlomo and Skinner, 2012)? No!
- Let $f_k$ be a sample count and $F_k$ the corresponding population count
- It is assumed that an intruder knows everything in the population table except for one unit
- If $F_k = f_k$ and we move one of the units counted in $F_k$ to another cell, then we obtain $F_k < f_k$, which is impossible
- Sampling is therefore not differentially private
- How likely is it to get $F_k = f_k$ in a sample? Usually 2-3%
- Agencies will generally decide to allow the ‘slippage’ and issue the controlled release of microdata
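
The argument can be checked numerically for a single cell. The sketch below assumes Bernoulli sampling with a common inclusion probability `pi`, so that the sample count given the population count is binomial. The sampling scheme, the function names and the numbers are assumptions for illustration only.

```python
from math import comb, inf, log


def prob_sample_count(f, F, pi):
    """Pr(sample count = f | population count = F) under Bernoulli sampling."""
    if f < 0 or f > F:
        return 0.0  # a sample count cannot exceed the population count
    return comb(F, f) * pi**f * (1 - pi) ** (F - f)


def abs_log_ratio(f, F1, F2, pi):
    """|ln(Pr[f | F1] / Pr[f | F2])| for one cell; infinite if exactly one side is 0."""
    p1 = prob_sample_count(f, F1, pi)
    p2 = prob_sample_count(f, F2, pi)
    if p1 == 0.0 and p2 == 0.0:
        return 0.0  # f is impossible under both databases
    if p1 == 0.0 or p2 == 0.0:
        return inf
    return abs(log(p1 / p2))


if __name__ == "__main__":
    # Sample count equals population count: removing one unit makes f impossible.
    print(abs_log_ratio(f=3, F1=3, F2=2, pi=0.1))
    # Sample count below population count: the ratio stays finite.
    print(abs_log_ratio(f=1, F1=3, F2=2, pi=0.1))
```

Because the probability of the whole released vector is zero as soon as one cell is impossible, a single cell with sample count equal to population count is enough to make the ratio unbounded.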

Inferential Disclosure (Differential Privacy)
- Does perturbation guarantee differential privacy?
- Assume a perturbation mechanism $M_{jk} = \Pr(\tilde{x}_i = j \mid x_i = k)$
- Then the ratio in the definition will contain elements of the form $M_{jk} / M_{jk'}$
- If no entry of the perturbation mechanism is zero, the ratio is bounded and the perturbation scheme is differentially private
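
A record-level sketch of this check follows. It assumes the mechanism is a matrix `M` with `M[j][k]` the probability that original category `k` is released as category `j`, applied independently to each record. That layout, the function name and the two example matrices are assumptions for illustration.

```python
from math import inf, log


def epsilon_of_mechanism(M):
    """Largest |ln(M[j][k] / M[j][k'])| over outputs j and inputs k, k'.

    Returns inf when some output j is possible for one input category
    but impossible for another, i.e. a zero sits next to a non-zero
    entry in the same row.
    """
    eps = 0.0
    for row in M:
        largest, smallest = max(row), min(row)
        if largest == 0:
            continue  # this output category is never released
        if smallest == 0:
            return inf
        eps = max(eps, log(largest / smallest))
    return eps


if __name__ == "__main__":
    pram = [[0.9, 0.1],
            [0.1, 0.9]]    # every change has positive probability
    recode = [[1, 1, 0],
              [0, 0, 1]]   # deterministic recoding of 3 categories into 2
    print(epsilon_of_mechanism(pram))
    print(epsilon_of_mechanism(recode))
```

Under these assumptions the first matrix gives a finite bound, while the deterministic recoding gives an infinite one because it contains zero probabilities.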

Inferential Disclosure (Differential Privacy)
- Examples of perturbation mechanisms:
  - Recoding
  - Random data swapping
  - PRAM
- In practice we control the perturbation and place zeros in the mechanism so that edit constraints hold

‘Safe Data’ vs ‘Safe Access’
- In the last decade agencies have become increasingly concerned about breaches of confidentiality, particularly given the large number of open databases that can be used to attack statistical data
- Agencies are restricting access to data with more stringent licensing and the use of on-site data labs
- How can we make statistical data more available to users?
- Why aren’t agencies making more use of ‘modern’ dissemination strategies?

Future Dissemination Strategies: Census Tables
- On-line flexible table generation based on a web package
- Input data are frequency counts in a multi-dimensional hypercube with small geographical areas
- Disclosure risk measures and SDC methods are applied ‘on-the-fly’
- A set of rules is embedded in the package, e.g. population thresholds, proportion of small cells, etc.
- To avoid disclosure by differencing, noise must be added

Example: Simulation Hypercube (Shlomo, Antal and Elliot, 2015)
- Population N = 1,500,000
- NUTS2 region: 2 categories
- Gender: 2 categories
- Banded age groups: 21 categories
- Current activity status: 5 categories
- Occupation: 13 categories
- Educational attainment: 9 categories
- Country of citizenship: 5 categories

Flexible Table Generating Servers
- Based on restrictions of the server, define a 3-dimensional table with one variable to define the population: banded age group, education group and occupation group, defined for NUTS2=1
- The table has 2,457 cells and 854,539 individuals, giving an average cell size of 347.8

Cell Value    Number of Cells    Percentage of Cells
0             1534               62.43%
1             44                 1.79%
2             35                 1.42%
3             27                 1.10%
4             20                 0.81%
5 and over    797                32.44%
Total         2457               100.00%

Information Based Disclosure Risk and Data Utility Measures
- To assess attribute disclosure in tables, mainly caused by structural zeros, use the entropy $H(F/N) = -\sum_{k=1}^{K} \frac{F_k}{N} \log \frac{F_k}{N}$, where $F = (F_1, \dots, F_K)$ is the vector of frequency counts and $N = \sum_{k=1}^{K} F_k$
- The entropy is bounded below by 0, attained if all cells are zero except one cell, and above by $\log K$, attained if all cell values are equal, i.e. cell proportions are $1/K$
- Risk measure: $1 - H(F/N) / \log K$
- Combine with other measures (proportion of zeros and size of the population) and define a weighted average
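
The entropy bounds stated above can be verified with a few lines of code. The normalised risk `1 - H / log(K)` used here is an assumed form that is consistent with those bounds (risk 1 when a single cell holds everyone, risk 0 when all cells are equal). The function names and test vectors are illustrative.

```python
from math import log


def entropy(counts):
    """Entropy of the cell proportions; empty cells contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def entropy_risk(counts):
    """Assumed normalised risk: 1 - H / log(K), between 0 and 1."""
    return 1.0 - entropy(counts) / log(len(counts))


if __name__ == "__main__":
    print(entropy_risk([12, 0, 0, 0]))   # all units in one cell
    print(entropy_risk([3, 3, 3, 3]))    # uniform table
    print(entropy_risk([9, 1, 1, 1]))    # skewed table, in between
```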

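Information Based Disclosure Risk and Data Utility Measures
- The risk measure is extended to account for perturbation and sampling, based on conditional entropy
- Utility measure: Hellinger's distance $HD(F, F') = \sqrt{\frac{1}{2} \sum_{k=1}^{K} \left( \sqrt{F_k} - \sqrt{F'_k} \right)^2}$, where $F$ are the original counts and $F'$ the perturbed counts
- Hellinger's distance is bounded by 0 and $\sqrt{\left( \sum_k F_k + \sum_k F'_k \right) / 2}$ and can be used to compare SDC methods: $1 - HD(F, F') / \sqrt{\left( \sum_k F_k + \sum_k F'_k \right) / 2}$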
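
A minimal sketch of a Hellinger-based utility score follows, assuming the distance is computed on raw counts and scaled by its upper bound, the square root of half the sum of both totals, so that 1 means identical tables and 0 means no overlap. The scaling and the function names are assumptions for illustration.

```python
from math import sqrt


def hellinger_distance(original, perturbed):
    """Hellinger distance between two vectors of non-negative counts."""
    if len(original) != len(perturbed):
        raise ValueError("tables must have the same number of cells")
    return sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2
                          for a, b in zip(original, perturbed)))


def hellinger_utility(original, perturbed):
    """Assumed utility score: 1 - HD / upper bound (requires a non-empty table)."""
    bound = sqrt((sum(original) + sum(perturbed)) / 2)
    return 1.0 - hellinger_distance(original, perturbed) / bound


if __name__ == "__main__":
    original = [0, 1, 2, 7, 40]
    rounded = [0, 0, 3, 6, 39]   # e.g. after rounding to base 3
    print(hellinger_utility(original, rounded))
```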

Results (Table 1)

Method                                Disclosure Risk in (3)    Data Utility in (4)
Original                              0.318                     -
Perturbed input:
  Record swapping                     0.282                     0.988
  Semi-controlled random rounding     0.137                     0.991
  Stochastic perturbation             0.239                     0.995
Perturbed output:
  Semi-controlled random rounding     0.135                     0.993

Comparing the rounding before and after tabulation shows that SDC ‘on the fly’ has lower disclosure risk (0.135 vs 0.137) and higher utility (0.993 vs 0.991) than rounding the input. It has the lowest disclosure risk of all the methods, while stochastic perturbation of the input has the highest utility (0.995).
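
For readers unfamiliar with rounding as an SDC method, the sketch below shows plain unbiased random rounding to base 3. This is a simpler relative of the semi-controlled random rounding compared in the table, not that method itself: it does not control marginal totals. The base and the function name are assumptions.

```python
import random


def random_round(count, base=3, rng=None):
    """Round a count to a multiple of `base`, unbiased in expectation.

    A count with remainder r is rounded up with probability r/base and
    down otherwise, so the expected rounded value equals the count.
    """
    rng = rng or random.Random()
    remainder = count % base
    if remainder == 0:
        return count  # multiples of the base are left unchanged
    lower = count - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower


if __name__ == "__main__":
    rng = random.Random(7)
    cells = [0, 1, 2, 3, 4, 20, 797]
    print([random_round(c, rng=rng) for c in cells])
```

Applying the rounding to the requested output table, rather than to the input hypercube, corresponds to the "perturbed output" row of the results.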

Future Dissemination Strategies: Synthetic Datasets
- Partially-synthetic microdata
  - Preserves the record structure of the gold standard microdata
  - Replaces data elements with synthetic values sampled from an appropriate probability model
  - Future work to assess disclosure risk
- Fully-synthetic microdata
  - Preserves some of the gold standard microdata
  - Generates synthetic entities and data elements from appropriate probability models
  - In practice, very difficult to capture all conditional relationships between variables and within sub-populations
- CTA (controlled tabular adjustment), where suppressed cells take imputed values

Future Dissemination Strategies: Data Enclaves
- A secure IT environment where researchers can access confidential data on-site, e.g. the Virtual Microdata Lab (VML) at the ONS
- Researchers apply to carry out a project and sign a contract and confidentiality agreement
- Minimise risk of disclosure:
  - No removal of data, no printers, not linked to the internet
  - All outputs checked manually by staff
  - Training course for understanding security rules
- Research is needed on what constitutes a disclosive output

Future Dissemination Strategies: Remote Access
- Access to data through a remote connection to a secure server, typically at universities and research institutes
- Carry out analysis as if on a personal PC and view results on screen
- Outputs are dropped in a mailbox to be manually checked and emailed back to researchers

Future Dissemination Strategies: Remote Analysis
- Some agencies (e.g. Census Bureau, ABS) are developing platforms for remote analysis or allowing researchers to submit code to be run on-site
- The aim is to protect outputs without the need for manual intervention
- Example (O’Keefe and Shlomo, 2012): comparison of confidentialised input versus confidentialised output
  - Sugar cane farm data: 338 farms from a 1982 survey of the sugar cane industry in Queensland, Australia
  - Variables: Region (4 categories) and 5 continuous variables: Area, Harvest, Receipts, Costs, Profits (= Receipts - Costs)
  - Input confidentialised by adding noise and removing outliers

Future Dissemination Strategies: Remote Analysis (Example)
[Figures: Receipts and Residuals, each shown for the original data, the confidentialised input and the confidentialised output]

Challenges and Discussion
- In recent years, managing disclosure risk has been about restricting access to data
- More government initiatives for ‘open data’
- Agencies need to use modern dissemination strategies to accommodate increasing demands for ‘open data’
- Stricter and tighter definitions of disclosure risk are needed, but users will have to work with perturbative SDC methods
- Agencies should release the methods and parameters of the perturbation so researchers can cope with measurement error
- For ‘on the fly’ SDC methods, agencies should release utility measures based on the original file/tables