Gillian Raab, Chris Dibben, & Paul Burton UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, 2015 Running an analysis of combined.

Slides:



Advertisements
Similar presentations
Alternative Approaches to Data Dissemination and Data Sharing Jerome Reiter Duke University
Advertisements

Secure Statistics Software Francisco Vera National Institute of Statistical Sciences.
Managing data using CSPro
Centre for Reviews and Dissemination May 2013 Registering a systematic review on PROSPERO.
Integrating Data for Analysis, Anonymization, and SHaring Supported by the NIH Grant U54HL to the University of California, San Diego Shuang Wang,
Using synthetic data to improve the accessibility of the SLS Susan Carsley, SLS Project Manager.
Administrative Data Research Centre – Scotland Chris Dibben.
Analysis and Communication of US News Rankings using Monte Carlo Simulations: A Comparison to Regression Modeling Presented by Chris Maxwell Purdue University.
Networking Theory (Part 1). Introduction Overview of the basic concepts of networking Also discusses essential topics of networking theory.
Privacy Preserving K-means Clustering on Vertically Partitioned Data Presented by: Jaideep Vaidya Joint work: Prof. Chris Clifton.
Valid Statistical Analysis for Logistic Regression with Multiple Sources Rob Hall (Dept of Machine Learning, CMU) Joint work with Yuval Nardi and Steve.
Long-term Archive Service Requirements draft-ietf-ltans-reqs-00.txt.
Using EDC-Rave to Conduct Clinical Trials at Genentech
Linking lives through time Marital Status, Health and Mortality: The Role of Living Arrangement Paul Boyle, Peteke Feijten and Gillian Raab.
Balance of Payments Collection and Compilation 23 Feb 2012 Central Statistics Office Ireland.
Measuring recycling in the UK Cindy Lee Waste Data Strategy Manager.
Chemistry CPD Presentation Programme Welcome and outline for event Overview of Internal Assessment in Chemistry Verification –brief outline Key messages.
Microdata Simulation for Confidentiality of Tax Returns Using Quantile Regression and Hot Deck Jennifer Huckett Iowa State University June 20, 2007.
Good practice in Research Data Management Module 6: Tools, training and support.
Copyright 2010, The World Bank Group. All Rights Reserved. Prosecution Statistics Part 2 Crime, Justice & Security Statistics Produced in Collaboration.
Intruder Testing: Demonstrating practical evidence of disclosure protection in 2011 UK Census Keith Spicer, Caroline Tudor and George Cornish 1 Joint UNECE/Eurostat.
Qualifications Portal Guide Personal Development and Employability Qualification.
TIMES 3 Technological Integrations in Mathematical Environments and Studies Jacksonville State University 2010.
Providing Access to Census- based Interaction Data in the UK: That’s WICID! John Stillwell School of Geography, University of Leeds Leeds, LS2 9JT, United.
Piotr Wolski Introduction to R. Topics What is R? Sample session How to install R? Minimum you have to know to work in R Data objects in R and how to.
Tools for Privacy Preserving Distributed Data Mining
Scot Exec Course Nov/Dec 04 Survey design overview Gillian Raab Professor of Applied Statistics Napier University.
Cryptographic methods for privacy aware computing: applications.
Monitoring public satisfaction through user satisfaction surveys Committee for the Coordination of Statistical Activities Helsinki 6-7 May 2010 Steve.
Frameworks for the Access and Use of Administrative Data, With the Example of Current Practice in the UK Steven Vale Office for National Statistics UK.
CE Operating Systems Lecture 3 Overview of OS functions and structure.
Publication / Fit for Purpose Kirsty Anderson Principal Information Analyst.
Use of Administrative Data Seminar on Developing a Programme on Integrated Statistics in support of the Implementation of the SNA for CARICOM countries.
Administrative procedures for microdata access at SURS October 2013.
David Carr The Wellcome Trust Data management and sharing: the Wellcome Trust’s approach Economic & Social Data Service conference.
General Register Office for S C O T L A N D information about Scotland's people Comparison between NHSCR and Community health index sources of migration.
2008 NCHS Data Users’ Conference Omni Shoreham Hotel Washington, DC Wednesday, August 13, 2008.
Research for 2021 Census in England and Wales Potential innovations.... Garnett Compton, ONS UNECE Census Meeting, 30 September – 2 October 2015.
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
Using EDC-Rave to Conduct Clinical Trials at Genentech Susanne Prokscha Principal CDM PTM Process Analyst February 2012.
Has Agent Scripting Got You FRUSTRATED?. Agent Scripting Simplified!
Access to microdata in Statistics Estonia First DwB European Data Access Forum Luxembourg, 28th March 2012 Tuulikki Sillajõe.
The DataSHIELD Legal Analysis Template Susan Wallace, University Of Leicester, Leicester, UK Jennifer Harris, Norwegian Institute of Public Health, Oslo,
Modelling international migration to produce local level estimates Ruth Fulton Office for National Statistics.
PREPARING FOR PARTICIPATION IN HALT-2 LECTURE 6. To outline the steps necessary to prepare successfully for the HALT 2013 PPS. LECTURE OBJECTIVES.
Security fundamentals Topic 5 Using a Public Key Infrastructure.
BY: CHRIS GROVES Privacy in the Voting Booth. Reason for Privacy Voters worry that their vote may be held against them in the future  People shouldn’t.
Virtual Research Environments (VREs) to enable access to confidential data for scientific purposes Joint UNECE/Eurostat Work Session on Statistical Data.
Restitution on Work Session 1 Paul Jackson DwB – WP3.
Research and teaching with the SHS data Gillian Raab Professor of Applied Statistics Napier University.
Geneva, April 2010 Joint UNECE/Eurostat Work Session on Migration Statistics Migration Statistics Mainstreaming Katarzyna Kraszewska European Commission,
Methods of Secure Computation and Data Integration Jerome Reiter, Duke University Alan Karr, NISS Xiaodong Lin, University of Cincinnati Ashish Sanil,
1 What types of historical records are available?.
Data Management Practices for Early Career Scientists: Closing Robert Cook Environmental Sciences Division Oak Ridge National Laboratory Oak Ridge, TN.
Joint UNECE/Eurostat work session on statistical data confidentiality October 2015 Helsinki, Finland Circle of trust Maurice Brandt DESTATIS.
Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, 5-7 September 2015, Helsinki Beata Nowok Administrative Data Research Centre – Scotland.
Armenia Twinning 2011 Component F – Information Society, 2 – 6 May DEVELOPMENT OF INFORMATION SOCIETY STATISTICS IN LITHUANIA SURVEY ON.
With the support of the LPP programme of the European Union 1 This project has been funded with support from the European Commission. This publication.
11 Measuring Disclosure Risk and Data Utility for Flexible Table Generators Natalie Shlomo, Laszlo Antal, Mark Elliot University of Manchester
Differentially Private Verification of Regression Model Results
Paul Burton University of Bristol, DataSHIELD Research Program
Beata Nowok Chris Dibben & Gillian Raab Administrative Data
Accelerated Sampling for the Indian Buffet Process
Paul Burton University of Bristol, SSCM, DataSHIELD Research Program
The Beginnings of a European Remote Access Network
Artificial data in social science
Point 6. Eurostat plans for Time Use Survey data processing and dissemination Working Group on Time Use Surveys 10 April 2013.
Coordination : Roxane Silberman CNRS/Réseau Quetelet Presented by
Summary of the Consultation on Labour Mobility and Globalisation
Presentation transcript:

Gillian Raab, Chris Dibben, & Paul Burton UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, 2015 Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation }

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation What is multi-party computation?  The data to contribute to an analysis are distributed across several sites  Confidentiality concerns prevent them being pooled for analysis  Methods have been developed to allow analyses to be carried out without pooling the data

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Methods developed for this  Since around 2005  By statisticians who call it the analysis of distributed data  And computer scientists who call it privacy- preserving data mining (PPDM)

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Horizontally or vertically partitioned data ? v1V2v3V4 v1V2v3V4 Study 1 Study 2 v1v2v3v4v5v6 Study 2 Study 1

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Horizontally partitioned data - examples  Genomic data  Adverse drug effects  Comparable surveys or censuses collected by different agencies

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation The UK Longitudinal Studies  Three studies each run by a different agency  England and Wales – (ONS LS)  Scottish Longitudinal Study – (SLS)  Norther Ireland Longitudinal study (NILS)  Each run by a different National Agency  Data needs to be held very securely (Census act)  The servers that hold the data have no internet access

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation The UK Longitudinal Studies  Three studies each run by a different agency  England and Wales – (ONS LS)  Scottish Longitudinal Study – (SLS)  Norther Ireland Longitudinal study (NILS)  Each run by a different National Agency  Data needs to be held very securely (Census act)  The servers that hold the data have no internet access

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Methods for multi-party computation  Rely on exchanging and combining sufficient statistics from the different studies  For horizontally partitioned data, the sufficient statistics for the pooled data are just the sum over those from the individual studies  E.g. For regression inferences can be obtained from the sum of the information matrices (I)(X’X) and the score statistics (S) X’Y.  Coefficients are (∑I ) -1 ∑S  For iterative methods (e.g. GLMs) these quantities need to be exchanged at each iteration until the models converge.

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Iterative estimation for GLMs  An analysis computer (AC) can control the estimation  Data is held on several data computers (DCs)  The following steps are needed  Starting values for the coefficients are sent to all DCs  Each DC returns their current I and S to the AC  The AC calculates a new value of the coefficients and checks if model converged  These steps are repeated until convergence

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation The DataSHIELD project   Provides open source code to carry out these computations in the R package  They also provide a means of setting up an interface (opal server) between the AC and the DCs that restricts what objects can be transferred  In this protocol the AC controls all the analyses  After data harmonisation between the sites the AC establishes communication with each one and can then run a limited number of commands to extract data summaries from the DCs

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation The DataSHIELD protocol  AC (yellow) controls the whole analysis  The interface restricts what can be extracted from the DCs

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation The E-DataSHIELD protocol

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation At each iteration an analyst at a DC must  receive a vector of coefficients from the AC by  transfer it to the secure server via a secure data stick (for some LSs this transfer can only be done by agency staff, not be the individual researcher)  read it into their R workspace  run a routine to update the information matrix and score vector write them to a file  export the file from the secure setting via the secure data stick  it to the AC.

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation At each iteration an analyst at a DC must  receive a vector of coefficients from the AC by  transfer it to the secure server via a secure data stick (for some LSs this transfer can only be done by agency staff, not be the individual researcher)  read it into their R workspace  run a routine to update the information matrix and score vector write them to a file  export the file from the secure setting via the secure data stick  it to the AC.

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Features of the E-DataSHIELD method  Automatic calculation of starting values  Fits a whole set of models together and stops when they have all converged  Generates the names of the files to transfer automatically at each iteration, and then checks that everything is as it should be.  Other routines available for pre-analysis harmonisation and summaries of data and presentation of results  Two projects using the LSs have used this methodology successfully

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Secure multi-part computation  The individual parts of the summaries are encrypted using homomorphic encryption  Simplest example of this (used by statisticians) is for the first DS to add a random number to their result and pass it round the others, then when it comes back to the first one the random number is subtracted  Computer scientists use more sophisticated ones  But it wouldn’t work with DataShield

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation The DataSHIELD protocol  DCs don’t communicate with each other  But no reason why it should not work with E-DataShield

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation E-DataSHIELD with secure summaries

Will shortly to be submitted to CRAN E-DataSHIELD package eds

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Sample eds code At each DC >eds.glm.update("spine",itno=1,data=SEast,quietly=T) File written Project_spine_Itno_1_Study_2.R Now file to analysis computer At the AC > eds.joint.update(project="spine",itno=1,studies=1:3,quietly=T) which gives the message 0 out of 3 models converged File written: Joint_fit_project_spine_Itno_2.R.

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Sample eds code with secure summaries At the first DC >eds.glm.update("spine",itno=1,data=London,quietly=F,secure=T) Result read from file Start_project_spine_Itno_1.R Fitting 3 models for project spine at iteration 1 File with random objects is written as Project_spine_Itno_1_Randoms.R Result for project spine iteration 1 study 1 written to file Project_spine_Itno_1_Study_1.R Now file to next data computer, study 2 At the next DC Same commands but the random file is only written by the first DC And at the next iteration the first DC reads it before the Joint update and writes a new one for the next iteration.

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Further points in the written paper  Examples of how the results of the analyses can be computed  How covariate by study interactions are fitted  An example using synthetic data generated from one of the LSs  References to other papers

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Conclusions  We have a package that works for the LSs  We plan to make it publicly available via CRAN  And hope it will be useful to others  More details and future directions in the published paper Farr Institute International Conference 2015 | Joshua Snoke, Beata Nowok, Gillian M. Raab, Chris Dibben, and Aleksandra B. Slavković | 27 August 2015

UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, October 2015 Raab, Dibben, Burton Running an analysis of combined data when the individual records cannot be combined: practical issues in secure computation Acknowledgments Farr Institute International Conference 2015 | Joshua Snoke, Beata Nowok, Gillian M. Raab, Chris Dibben, and Aleksandra B. Slavković | 27 August 2015 We are grateful to the DataSHIELD project for having introduced us to these ideas and shared their code with us. Also to the staff of the LSs for helping us to develop eds. The DataSHIELD project is supported by funding from: the European Union’s Seventh Framework Programme - BioSHaRE- EU (Biobank Standardisation and Harmonisation for Research Excellence in the European Union); a strategic award from MRC and Wellcome Trust for the ALSPAC project; and the Welsh and Scottish Farr Institutes, MRC funded E-Health Informatics Research Centres (EHIRCs ).