Disclosure Risk and Grid Computing Mark Elliot, Kingsley Purdam, Duncan Smith and Stephan Pickles CCSR, University of Manchester

Presentation transcript:

Disclosure Risk and Grid Computing
Mark Elliot, Kingsley Purdam, Duncan Smith and Stephan Pickles
Cathie Marsh Centre for Census and Survey Research (CCSR), University of Manchester

Overview
What is disclosure risk?
The confidentiality and e-science research programme
Confidentiality and Grid computing project
–Data Environment analysis
–Disclosure risk experiments

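The Disclosure Risk Problem: Type I: Identification
[Diagram: an identification file holding ID variables (Name, Address) and key variables (Sex, Age, ..) is matched through the shared key variables to a target file holding key variables (Sex, Age, ..) and target variables (Income, ..).]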
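The identification scenario in the diagram can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' software: the records, the `KEY_VARIABLES` tuple and the `link_unique_matches` helper are all hypothetical. An intruder holding the identification file matches it to the target file on the shared key variables and attaches a name to a target value wherever the match is one-to-one.

```python
from collections import defaultdict

# Hypothetical identification file: ID variables plus key variables.
identification_file = [
    {"name": "A. Smith", "address": "1 High St", "sex": "F", "age": 34},
    {"name": "B. Jones", "address": "2 Low Rd", "sex": "M", "age": 51},
    {"name": "C. Patel", "address": "3 Mill Ln", "sex": "M", "age": 51},
]

# Hypothetical target file: key variables plus a target variable, no ID variables.
target_file = [
    {"sex": "F", "age": 34, "income": 28000},
    {"sex": "M", "age": 51, "income": 41000},
    {"sex": "F", "age": 67, "income": 15000},
]

# Assumed set of variables common to both files.
KEY_VARIABLES = ("sex", "age")


def key_of(record):
    """Return the record's combination of key-variable values."""
    return tuple(record[v] for v in KEY_VARIABLES)


def link_unique_matches(id_records, target_records):
    """Pair names with target values where the key combination
    occurs exactly once in each file."""
    id_index = defaultdict(list)
    for rec in id_records:
        id_index[key_of(rec)].append(rec)

    target_index = defaultdict(list)
    for rec in target_records:
        target_index[key_of(rec)].append(rec)

    matches = []
    for key, targets in target_index.items():
        candidates = id_index.get(key, [])
        if len(targets) == 1 and len(candidates) == 1:
            matches.append((candidates[0]["name"], targets[0]["income"]))
    return matches


matches = link_unique_matches(identification_file, target_file)
print(matches)
```

In this toy data only the (F, 34) combination is unique in both files, so only that record is linked. A one-to-one match between two samples is not proof of a correct identification, because another person with the same key values may exist in the population outside both files.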

The Disclosure Risk Problem: Type II: Attribution

The Confidentiality and e-Science research programme: key questions
1) What new data possibilities does grid computing provide and what confidentiality implications do they have? (1st NCeSS PDP)
2) How could grid computing be used to enable disclosure risk assessment and control? (2nd NCeSS PDP)
3) How could grid computing enable a data intruder?
4) What are the possibilities and issues provided by remote access? (CLEF project and further funding)

Confidentiality and the Grid Project Aims
–To develop methods for classifying the data environment
–To investigate the risk associated with release of multiple overlapping datasets
–To produce prototype disclosure risk assessment software for assessing risk of multiple

Data Data Everywhere…
Massive and exponential increase in data; Mackey and Purdam (2002); Purdam and Elliot (2003, 2005).
–These studies have led to the setting up of the data monitoring service.
Singer (1999) noted three behavioural tendencies:
–Collect more information on each population unit
–Replace aggregate data with person-specific databases
–Given the opportunity, collect personal information
Purdam and Elliot (2003) add:
–Link data whenever you can

“New data” One of the key potentials for e-social science is the possibility of bringing together different data sources through linking and fusing. However, this is precisely the disclosure risk situation.

Data Environment Analysis
The increasing availability of personal information has impacts on the disclosure control problem in 4 ways:
1. Decrease in sensitivity of information
2. Decrease in value to an intruder
3. Increase in probability of intruder access to key data
4. Increase in amount of key data intruder has access to

Data Environment Analysis
Need to move with the technology from:
–One-shot analyses of individual datasets
to:
–Ongoing analyses of the data environment
The question is not “How safe is my data?” but “How disclosive is the data environment?”. A process of data monitoring is one aspect of this.

DEA provides a measure of the amount and type of individual information in
–the public domain
–restricted access datasets and
–commercially available data
Metadata are generated through
–form analysis
–metadata questionnaires
–web-crawling software.
Ultimately the process could be automated and tailored to specific grid computing systems.

DEA Interface

The DEA meta-data provides an understanding of:
–what variables are available
–under what coverage,
–which could be linked with the anonymised release sets
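As a rough illustration of how such metadata might be queried, the sketch below records, for each data source, its access type, coverage and variables, and then reports which variables overlap with an anonymised release. Everything here is an assumption made for the example: the `DataSource` class, the source names, the variable names and the `potential_keys` helper are hypothetical and do not describe the actual DEA system or its interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One hypothetical entry in a data environment metadata store."""
    name: str
    access: str            # "public", "restricted" or "commercial"
    coverage: str          # free-text description of the population covered
    variables: frozenset   # names of the variables the source holds


# Hypothetical metadata entries, one per access type named in the slides.
sources = [
    DataSource("public_register", "public", "adults",
               frozenset({"name", "address"})),
    DataSource("commercial_list", "commercial", "consumers",
               frozenset({"name", "address", "age", "sex", "occupation"})),
    DataSource("restricted_extract", "restricted", "service users",
               frozenset({"age", "sex", "region"})),
]

# Hypothetical variables in an anonymised release set.
release_variables = frozenset({"age", "sex", "region", "occupation", "income"})


def potential_keys(sources, release_variables):
    """For each source, list the variables it shares with the release.
    Shared variables are candidate key variables for linkage."""
    report = {}
    for source in sources:
        shared = source.variables & release_variables
        if shared:
            report[source.name] = {
                "access": source.access,
                "coverage": source.coverage,
                "shared": sorted(shared),
            }
    return report


report = potential_keys(sources, release_variables)
for name, entry in report.items():
    print(name, entry)
```

A source with no variables in common with the release drops out of the report, while the others show which variables an intruder with that level of access could use as keys.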

The potential value of DEA:
1. It has the potential to enable a more appropriate understanding and classification of the total real risk of disclosive events.
2. It gives a description of the de facto attitude of our culture towards personal data, thus enabling us to make more informed decisions on such subjects as privacy and data protection law.

Risk Experiments
Experiment 1: Assessed the impact on a data intruder’s ability to link microdata records arising from the co-presence of population aggregate data.
Experiment 2: Assessed the impact on an intruder’s ability to make attribution inferences from population data arising from the co-presence of microdata samples.

Headline findings
Adding aggregate data increases the linkability of two microdata sets.
Adding microdata to the mix significantly increases the accuracy of attributions, once the sample fraction rises above 5%.
That is, the more data concerning a given population that exist in a given data environment, the greater the disclosure risk.
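The general direction of these findings can be illustrated with a toy calculation, which is not a reproduction of either experiment. Assuming a small hypothetical microdata file, the sketch counts how many records are unique on their key-variable combination as an intruder gains access to more key variables. The records, variable names and `count_uniques` helper are invented for the illustration.

```python
from collections import Counter

# Hypothetical microdata records described by four candidate key variables.
records = [
    {"sex": "F", "age_band": "30-39", "region": "North", "occupation": "teacher"},
    {"sex": "F", "age_band": "30-39", "region": "North", "occupation": "nurse"},
    {"sex": "F", "age_band": "30-39", "region": "South", "occupation": "teacher"},
    {"sex": "M", "age_band": "30-39", "region": "North", "occupation": "teacher"},
    {"sex": "M", "age_band": "30-39", "region": "North", "occupation": "teacher"},
    {"sex": "M", "age_band": "50-59", "region": "South", "occupation": "driver"},
]


def count_uniques(records, key_variables):
    """Count records whose key-variable combination occurs exactly once."""
    counts = Counter(tuple(r[v] for v in key_variables) for r in records)
    return sum(1 for c in counts.values() if c == 1)


# Assumed order in which an intruder gains access to key variables.
keys_in_order = ["sex", "age_band", "region", "occupation"]

uniques_by_key_size = [
    count_uniques(records, keys_in_order[:k])
    for k in range(1, len(keys_in_order) + 1)
]
print(uniques_by_key_size)
```

Adding a key variable can only split existing groups of records, so the count of unique records never falls as the key grows. Here it rises from 0 with one key variable to 4 with all four.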

Concluding Remarks Grid computing provides the potential for unprecedented access to high quality individual level data. However, as the amount of data on individual population units stored on computing systems increases, so does the threat to anonymised data releases.

Such data release may come to a halt as it becomes impossible to maintain sufficient data quality whilst meeting ever more stringent disclosure control constraints.
It is vital that:
–creative data access solutions are developed.
–grounded measures of data utility are developed.
–data environment analysis is developed as an alternative to bureaucratic control.