Statistical Disclosure Control Mark Elliot Confidentiality and Privacy Group CCSR University of Manchester.


Overview
CAPRI: who we are / what we do
SDC: some basics
SD Risk Assessment and Microdata
– General Concepts
– Our Approach
SD Risk Assessment and Aggregate Data
– General Concepts
– Our Approach
Statistical Disclosure and the Grid

Confidentiality And PRIvacy group, University of Manchester

Purpose To investigate the Confidentiality and Privacy issues that arise from the collection, dissemination and analysis of data.

Multidisciplinary Approach
– Mark Elliot, Knowledge and Data Engineering
– Kingsley Purdam, Politics and Information Society
– Anna Manning, Data Mining and HPC
– Elaine Mackey, Social Policy
– Duncan Smith, Statistics and Stochastic Systems
– Karen McCullagh, the Law and Social Policy

Associate Members in Manchester
– CS: Alan Rector, John Gurd, Len Freeman, Adel Taweel
– Computation: John Keane
– Psychology: Karen Lander, Lee Wickham
– Medicine: Iain Buchan
– Manchester Computing Centre: Stephen Pickles
– Law: Joseph Jakaneli, John Harris

Research Programmes
– The Social and Political Aspects of Confidentiality and Privacy
– The Detection of Risky Records: Special Uniqueness
– The Disclosure Risk Issues Posed by the Grid
– High Performance Computing and Statistical Disclosure
– Medical Records: Clinical E-Science Framework
– The SAMDIT Methodology: Data Monitoring Centre

Consultancy ONS Census Social Survey Neighbourhood statistics US Census Bureau Australian Bureau of Statistics Statistics New Zealand

Statistical Disclosure Control

Sub Fields Disclosure risk assessment. Disclosure control methodology. Analytical validity. Microdata and Aggregate data. Business and Personal data. Intentional and Consequential data

Our General Approach: The SAMDIT method
– Scenario Analysis (Elliot and Dale 1999)
– Metric Development
– Implementation
– Testing

Microdata

The Microdata Disclosure Risk Problem: An Example
Identification file: Name, Address (ID variables); Sex, Age, ... (key variables)
Target file: Sex, Age, ... (key variables); Income, ... (target variables)

Risk Assessment methods
File level
– Population uniqueness, e.g. Bethlehem (1990), Samuels (1998)
– DIS: Skinner and Elliot (2002)
Record level
– Statistical modelling (Fienberg and Makov 1998, Skinner and Holmes 1998)
– Computational search: Elliot et al (2002)

Data Intrusion Simulation
– Uses the microdata set (or table) itself to estimate risk; no population data is needed.
– Produces an estimate of the probability of a correct match given a unique match.
– Special method: sub-sampling and resampling.
– General method: derivation from the equivalence class structure.

The DIS Method I: Remove a small number of records from the microdata sample.

The DIS Method II: Copy back a random number of the removed records (each with a probability equal to the original sampling fraction).

The DIS Method III: Match the removed fragment against the truncated microdata file.
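The three steps above can be sketched as a small simulation. This is a minimal illustration rather than the authors' implementation: the function name `dis_special`, the representation of records as tuples of key-variable values, and the number of repetitions are all assumptions made for the example.

```python
import random
from collections import Counter


def dis_special(records, sampling_fraction, n_remove, n_rounds=200, seed=0):
    """Estimate Pr(correct match | unique match) by sub-sampling and resampling.

    records: list of hashable tuples of key-variable values (hypothetical layout).
    """
    rng = random.Random(seed)
    unique_matches = 0
    correct_matches = 0
    for _ in range(n_rounds):
        # Step I: remove a small number of records.
        removed = rng.sample(range(len(records)), n_remove)
        removed_set = set(removed)
        # Step II: copy each removed record back with probability
        # equal to the sampling fraction.
        copied_back = {i for i in removed if rng.random() < sampling_fraction}
        in_file = [
            i for i in range(len(records))
            if i not in removed_set or i in copied_back
        ]
        key_counts = Counter(records[i] for i in in_file)
        # Step III: match every removed record against the truncated file.
        for i in removed:
            if key_counts[records[i]] == 1:      # unique match
                unique_matches += 1
                if i in copied_back:             # the match is the record itself
                    correct_matches += 1
    if unique_matches == 0:
        return float("nan")
    return correct_matches / unique_matches
```

When every record has a distinct key, every unique match is necessarily correct and the estimate is 1; when the sampling fraction is 0, no removed record is copied back, so any unique match is a false one and the estimate is 0.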

Validation
Validation studies compare DIS estimates with the results obtained using population data.
– Empirical results: no bias and small error. Elliot (2001)
– Mathematical proof: Skinner and Elliot (2002)

Pr(cm|um) for 2% sample with basic key (age, sex, marital status)

Levels of Risk Analysis
DIS
– Works at the file level
– Very good for comparative analyses, e.g. SAMs

Levels of Risk Analysis
Record level risk is important
– Variations in risk topography
– Risky records

Special Uniques: original concept
– A counterintuitive geographical effect indicated two types of sample uniques: random and special.
– Special: epidemiological peculiarity.
– Random: effect of sampling and variable definition.

Special Uniques: changing definition
1. Sample uniques which remain unique despite geographical aggregation
2. Sample uniques which remain unique through any variable aggregation
3. Sample uniques on a subset of key variables
4. Dichotomy to dimension

Minimal Sample Unique: a sample-unique set of variable values for which no subset is also unique.

Risk Signatures: combinations of minimal uniques
Example
– Unique pairs: 0
– Unique triples: 5
– Unique fourfolds: 1
– Unique fivefolds: 3
– Unique sixfolds: 0
– Unique sevenfolds: 0
– ...
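As a rough illustration of these two ideas, the sketch below finds the minimal sample uniques of one record by brute-force search over variable subsets and then counts them by size to form a signature. It is not the SUDA algorithm; the function names and the toy data are hypothetical, and the exhaustive search is only practical for a handful of variables.

```python
from collections import Counter
from itertools import combinations


def minimal_sample_uniques(data, row):
    """Return the variable-index subsets on which data[row] is minimally unique."""
    n_vars = len(data[row])
    found = []
    for size in range(1, n_vars + 1):
        for cols in combinations(range(n_vars), size):
            # Skip supersets of an already-found unique subset: not minimal.
            if any(set(m) <= set(cols) for m in found):
                continue
            target = tuple(data[row][c] for c in cols)
            matches = sum(1 for r in data if tuple(r[c] for c in cols) == target)
            if matches == 1:
                found.append(cols)
    return found


def risk_signature(msus):
    """Count minimal sample uniques by the number of variables they involve."""
    return Counter(len(m) for m in msus)


# Toy file with three key variables: sex, age, marital status.
toy = [
    ("M", 30, "single"),
    ("M", 30, "married"),
    ("F", 30, "single"),
    ("F", 40, "married"),
]
print(minimal_sample_uniques(toy, 3))                   # [(1,), (0, 2)]
print(risk_signature(minimal_sample_uniques(toy, 3)))   # one single, one pair
```

The number of subsets examined grows as 2 to the power of the number of variables, which is the combinatorial explosion that motivates a more efficient search.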

Special Uniques
Problem: how to look at all the variables?
– A file may contain hundreds
– Even with scenario keys, individual records can contain hundreds of minimal sample uniques
– Combinatorial explosion

HIPERSTAD Projects
Funded by ESRC, ONS and EPSRC
Use of high performance computing
– Enables comprehensive analysis of patterns of uniqueness within each record
– Has allowed investigation of more complex grading systems

Risk Signatures II Allow grading and classification of records –Differential treatment –Low impact high efficacy disclosure control

Combining DIS and SUDA
– A heuristic method for combining the two methods to provide a per-record matching confidence has proved very effective.
– ONS evaluation studies show that the combined method picks out high-probability risk very well.

SUDA software
– Available free under licence
– Used at ONS, ABS and Statistics New Zealand

Aggregate Data

Introduction
– Measurement of disclosure risk is an important precursor for its control.
– Intruder/scenario-based metrics are better than abstract ones.
– Such metrics are available for microdata but not for aggregate data.

Overview Overview of the issues and introducing the method on a conceptual level Details of the algorithms Ongoing and Future Work

The Issues
– Aggregate data is usually 100% data, so measures based on identification disclosure and sampling are meaningless.
– A better approach is to evaluate what can be inferred through attribute disclosure.

Attribute Disclosure

The Approach
Rather than assess the risk of actual attribute disclosure, we propose estimating the probability of producing a potentially disclosive table, which we define as any table containing at least one zero.
The method/measure we propose can be applied to:
– Single tables
– Groups of tables
– Unperturbed and perturbed tables
– Unpublished tables

The Bounds Problem
In a general sense, any set of tables can be viewed as a set of bounds on the full table. For example, if we release two one-way frequency tables:

The Bounds Problem We are effectively releasing the marginals to a two-way frequency table where the entire joint distribution has been suppressed

The cells in the joint distribution can be expressed as a set of bounds (or ranges of feasible values)
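One standard way to obtain such ranges for a two-way table from its two one-way margins is the Fréchet bounds, max(0, r + c − N) ≤ n ≤ min(r, c). The slides do not say which bounds are used, so the sketch below is an assumption, and the margins in the usage line are illustrative.

```python
def frechet_bounds(row_totals, col_totals):
    """Lower and upper bounds for each cell of a two-way table given its margins."""
    total = sum(row_totals)
    if total != sum(col_totals):
        raise ValueError("row and column totals must sum to the same count")
    return [
        [(max(0, r + c - total), min(r, c)) for c in col_totals]
        for r in row_totals
    ]


# Illustrative margins: one variable with totals 5 and 11, another with 12 and 4.
for row in frechet_bounds([5, 11], [12, 4]):
    print(row)
# [(1, 5), (0, 4)]
# [(7, 11), (0, 4)]
```

A cell whose upper bound is zero is known to be empty even though the joint table was never published, which is what makes the bounds relevant to attribute disclosure.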

The Subtraction – Attribution Probability (SAP) Method
– The risk associated with a table release depends on the set of tables jointly, rather than on the individual tables.
– SAP can be used on single tables, groups of tables, perturbed or unperturbed tables.
– Bounds are calculated, and then the probability of an intruder producing one or more upper bounds of zero by subtracting k random individuals from the table is calculated.
– The output can be set for user-defined levels of k.

Var2 \ Var1    A    B
C              3    9
D              2    2

Var3 \ Var1    A    B
E              1   10
F              4    1

Var3 \ Var2    C    D
E              8    3
F              4    1

Var3 \ (Var1, Var2)    A, C    A, D    B, C    B, D
E                         0       1       8       2
F                         3       1       1       0

Original cell counts can be recovered from the marginal tables.
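The recovery claim can be checked by enumerating every non-negative 2×2×2 table consistent with the three two-way tables. The sketch below assumes binary variables and reads the counts from the tables above as (A, C) = 3, (A, D) = 2, (B, C) = 9, (B, D) = 2 and so on; the function name and index convention are hypothetical.

```python
from itertools import product


def feasible_tables(t12, t13, t23):
    """All 2x2x2 tables n[i, j, k] >= 0 matching the three two-way margins.

    t12[i][j] is the Var1 x Var2 count, t13[i][k] the Var1 x Var3 count and
    t23[j][k] the Var2 x Var3 count (index 0 = first category).
    """
    solutions = []
    # With all two-way margins fixed, a 2x2x2 table has one free cell.
    for x in range(min(t12[0][0], t13[0][0], t23[0][0]) + 1):
        n = {(0, 0, 0): x}
        n[0, 0, 1] = t12[0][0] - x
        n[0, 1, 0] = t13[0][0] - x
        n[0, 1, 1] = t12[0][1] - n[0, 1, 0]
        n[1, 0, 0] = t23[0][0] - x
        n[1, 0, 1] = t12[1][0] - n[1, 0, 0]
        n[1, 1, 0] = t13[1][0] - n[1, 0, 0]
        n[1, 1, 1] = t12[1][1] - n[1, 1, 0]
        if min(n.values()) < 0:
            continue
        # Check every margin, including those not used in the derivation.
        consistent = all(
            n[a, b, 0] + n[a, b, 1] == t12[a][b]
            and n[a, 0, b] + n[a, 1, b] == t13[a][b]
            and n[0, a, b] + n[1, a, b] == t23[a][b]
            for a, b in product((0, 1), repeat=2)
        )
        if consistent:
            solutions.append(n)
    return solutions


# Var1: A=0, B=1; Var2: C=0, D=1; Var3: E=0, F=1.
tables = feasible_tables(
    t12=[[3, 2], [9, 2]],
    t13=[[1, 4], [10, 1]],
    t23=[[8, 4], [3, 1]],
)
print(len(tables))   # 1: only one table fits
print(tables[0])
```

Only one table is feasible, and it contains zeros in the cells (A, C, E) and (B, D, F), so the release of the three margins reveals the full joint distribution.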

Subtraction
– We consider that an intruder might have knowledge of the relevant population, as well as information in the table release.
– We assume (at least initially) that the intruder has perfect knowledge of k randomly selected individuals.

Single exact tables
– The lower / upper bounds are equal to the published counts.
– The probability of an intruder recovering at least one zero by subtracting known individuals is found by calculating hypergeometric probabilities and applying the inclusion / exclusion principle.

The marginal probability of observing all individuals in a cell is calculated for each individual cell, and the sum is added to a total (initially zero).
The marginal probability of observing all individuals in a pair of cells is calculated for each pair of cells, and subtracted from the total.
The marginal probability of observing all individuals in a triple of cells is calculated for each triple of cells, and added to the total.
And so on, until we have considered the table total, or all subsequent probabilities are zero.

For example, for k = 3 and the following table (not showing zero-probability terms): 1 2 4
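The inclusion / exclusion calculation can be sketched as follows. The worked figures on the slide did not survive extraction, so this example assumes the table is a one-way table with cell counts 1, 2 and 4; the function name is hypothetical. The probability that k individuals drawn at random from N include all s members of a given set of cells is C(N − s, k − s) / C(N, k), which is zero when s > k.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def prob_recover_zero(cells, k):
    """Probability that subtracting k random individuals empties at least one cell."""
    n = sum(cells)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the table total")
    total = Fraction(0)
    for size in range(1, len(cells) + 1):
        sign = 1 if size % 2 == 1 else -1   # add singles, subtract pairs, ...
        for subset in combinations(cells, size):
            s = sum(subset)
            if s <= k:                       # otherwise the term is zero
                total += sign * Fraction(comb(n - s, k - s), comb(n, k))
    return total


print(prob_recover_zero([1, 2, 4], 3))   # 19/35
```

Under that assumption the non-zero terms are 15/35 for the cell of size 1, 5/35 for the cell of size 2, and 1/35 subtracted for the pair of those two cells, giving 19/35.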

Example output

Confidentiality and the Grid
1) What new data possibilities does the Grid provide and what confidentiality implications do they have?
2) How could the Grid (or a Grid) be used to enable disclosure risk assessment and control?
3) How could a grid enable a data intruder?
4) What are the possibilities and issues provided by remote access?

New data
– One of the key potentials for the Grid is the possibility of bringing together different data sources through linking and fusing.
– This is precisely the disclosure risk situation.
– Our pilot project work shows that adding a third data set tends to increase the linkability of two other datasets.

New Access
– Virtual remote access has the potential to provide a safe setting model for data access.
– New question: how safe is that output?

Data Intrusion Detection Virtual access allows the possibility of monitoring use. Use patterns by user and across users can be analysed for patterns resembling intrusion (similar to fraud detection).

Confidential data access via a grid
[Diagram. Components: Raw Datasets; Pre-access Data Quality Monitor; Pre-access Disclosure Control; Treated Datasets; Data Intrusion Sentry; Grid Firewall; Pre-output Disclosure Control; Pre-output Data Quality Monitor; User; Analytical request]

Conclusions
Statistical Disclosure Control is a maturing field.
– Basic issues well defined
– Theory and practice still in development
The Grid presents new opportunities and new confidentiality risks.

Finally a plug...
International Symposium on Confidentiality, Privacy and Disclosure in the 21st Century
– Date: 3rd May
– Venue: Manchester MANDEC Centre
– See