Paul Burton University of Bristol, DataSHIELD Research Program

Presentation transcript:

DataSHIELD: taking the analysis to the data not the data to the analysis
Paul Burton
University of Bristol, DataSHIELD Research Program
McGill University, OICR, Maelstrom Research
MRC Epidemiology Unit, Cambridge
The Norwegian Institute of Public Health, Dept of Epidemiology
Technical University of Eindhoven
University College, London
University of Bristol, School of Social and Community Medicine
WUN DAPPER Workshop: 22nd-23rd August 2016, Bristol
Satellite meeting to the 2016 International Population Data Linkage Conference (IPDLN)

Constraints and barriers to sharing and combining individual-level data (microdata)
Ethico-legal or other governance restrictions
Maintaining control of intellectual property
Physical size of data

The DataSHIELD approach
Take “analysis to data” ….. not “data to analysis”
Leave the data to be analysed on local servers behind the firewalls where they usually reside
The analysis centre co-ordinates parallelised analyses in all studies simultaneously
Tie analyses together with non-disclosive information:
  (i) non-disclosive statistics of an appropriate nature (ideally sufficient statistics);
  (ii) encrypted information that can be decrypted as a final step of processing
Analytic processing - and options for disclosure control - located with the data

DataSHIELD: Data Aggregation Through Anonymous Summary-statistics from Harmonized Individual-levEL Databases
Horizontal DataSHIELD
  Different sources hold all variables but on different individuals
  Secure meta-analysis (IPD and Study-Level)
  Secure single-site analysis
Vertical DataSHIELD
  Different sources hold different variables on the same individuals
  Secure processing and analysis of linked data without bringing the data together

Horizontal DataSHIELD
One step analyses: e.g. ds.table2D - request non-disclosive output from all sources
Multi-step analyses: e.g. ds.lexis - set up and then request
Iterative analyses: e.g. ds.glm - parallel processes linked together by non-identifying summary statistics, e.g. for glm = score vectors and information matrices
Can be used as equivalent to full individual level analysis or to study level meta-analysis
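The iterative case can be written out for a logistic model with its canonical link. This is a sketch in notation introduced here for illustration (k indexes studies; X_k, y_k, mu_k and W_k are not symbols from the slides). It assumes the standard Fisher scoring update, which is what combining score vectors and information matrices amounts to:

```latex
\[
s_k(\beta) = X_k^{\top}\bigl(y_k - \mu_k(\beta)\bigr), \qquad
I_k(\beta) = X_k^{\top} W_k(\beta)\, X_k, \qquad
W_k(\beta) = \operatorname{diag}\bigl(\mu_{ki}(\beta)\,(1 - \mu_{ki}(\beta))\bigr)
\]
\[
\beta^{(t+1)} = \beta^{(t)} + \Bigl(\sum_{k} I_k\bigl(\beta^{(t)}\bigr)\Bigr)^{-1} \sum_{k} s_k\bigl(\beta^{(t)}\bigr)
\]
```

Each study returns only s_k and I_k, whose dimensions depend on the number of coefficients and not on the number of individuals. Because both quantities are sums over individuals, adding them across studies gives the same update as fitting the model to the pooled individual-level data.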

DataSHIELD: a novel solution
Analysis commands (1)
b.vector <- c(0, 0, 0, 0)
glm(cc ~ 1 + BMI + BMI.456 + SNP, family = binomial, start = b.vector, maxit = 1)

DataSHIELD: a novel solution
Summary Statistics (1)
Score vector Study 5: [36, 487.2951, 487.2951, 149]
Information Matrix Study 5

DataSHIELD: a novel solution
Summary Statistics (1)
Score vector Study 5: [36, 487.2951, 487.2951, 149]
Information Matrix Study 5
Σ

DataSHIELD: a novel solution
Analysis commands (2)
b.vector <- c(-0.322, 0.0223, 0.0391, 0.535)
glm(cc ~ 1 + BMI + BMI.456 + SNP, family = binomial, start = b.vector, maxit = 1)

and so on .....

DataSHIELD: a novel solution
Σ
Updated parameters (4)
Final parameter estimates
Coefficient   Estimate   Std Error
Intercept     -0.3296    0.02838
BMI            0.02300   0.00621
BMI.456        0.04126   0.01140
SNP            0.5517    0.03295

Does it work?
Direct conventional analysis
Coefficients:
              Estimate   Std. Error
(Intercept)   -0.32956   0.02838
BMI            0.02300   0.00621
BMI.456        0.04126   0.01140
SNP            0.55173   0.03295
DataSHIELD analysis
[Table not preserved in the transcript]
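The agreement between the two analyses can be reproduced on simulated data. The following is a minimal sketch, not the DataSHIELD implementation: the data, the helper names make_study and study_summaries, and the three-study setup are all hypothetical, and the "server" and "client" are simply functions in one R session. Each study returns only a score vector and an information matrix, the client sums them and updates the coefficients, and the result is compared with glm() on the pooled data.

```r
# Hypothetical simulated data: three studies, same variables, different people.
set.seed(1)
make_study <- function(n) {
  BMI <- rnorm(n)
  SNP <- rbinom(n, 2, 0.3)
  cc  <- rbinom(n, 1, plogis(-0.3 + 0.2 * BMI + 0.5 * SNP))
  data.frame(cc = cc, BMI = BMI, SNP = SNP)
}
studies <- list(make_study(400), make_study(500), make_study(600))

# "Server side" (illustrative): only summary statistics leave the study.
study_summaries <- function(d, beta) {
  X  <- cbind(1, d$BMI, d$SNP)
  mu <- as.vector(plogis(X %*% beta))
  list(score = as.vector(t(X) %*% (d$cc - mu)),   # X'(y - mu)
       info  = t(X) %*% (X * (mu * (1 - mu))))    # X'WX
}

# "Client side" (illustrative): sum the summaries, update, repeat.
beta <- c(0, 0, 0)
for (iter in 1:25) {
  parts <- lapply(studies, study_summaries, beta = beta)
  score <- Reduce(`+`, lapply(parts, `[[`, "score"))
  info  <- Reduce(`+`, lapply(parts, `[[`, "info"))
  step  <- as.vector(solve(info, score))
  beta  <- beta + step
  if (max(abs(step)) < 1e-10) break
}
se <- sqrt(diag(solve(info)))

# Conventional analysis on the pooled individual-level data, for comparison.
pooled <- glm(cc ~ BMI + SNP, family = binomial, data = do.call(rbind, studies))
print(cbind(summed_estimate = beta, summed_se = se,
            pooled_estimate = coef(pooled),
            pooled_se = sqrt(diag(vcov(pooled)))))
```

The two sets of estimates and standard errors should agree to numerical tolerance, because the score vector and information matrix of the pooled likelihood are exactly the sums of the per-study ones.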

Does it work?
Association of diabetes with BMI>30 in 9 HOP studies
[Figure: conventional analysis compared with DataSHIELD analyses; panels: individual level meta-analysis, random effects study level meta-analysis]

Horizontal DataSHIELD: current implementation
Individual level data never transmitted or seen by the statistician in charge, or by anybody outside the original centre in which they are stored.
[Diagram labels: Analysis client; Client-side functions; Server-side functions; R; Web services; Data server; Opal; Finrisk; Prevend; 1958BC; BioSHaRE web site]

Vertical DataSHIELD
Regression coefficients = XTY / XTX
XTX: need to calculate XAXA, XAXB, XAXC, XBXB, XBXC, XCXC
XAXB = XA1 * XB1 + XA2 * XB2 + XA3 * XB3 + ……
[Diagram labels: Analysis Computer; R; Web services; Data computer; Opal; NHS; ALSPAC; Education; XA; XB; MA; MB; MC; IM5: 500 70.56657 297 7646.29164 65.39412 382]
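In matrix terms, and reading "XTY / XTX" as the usual least-squares solution (an interpretation, since the slide writes it as a quotient), the quantities above can be sketched as follows. The indices i, j and l are introduced here for illustration:

```latex
\[
\hat{\beta} = (X^{\top}X)^{-1} X^{\top} Y, \qquad
X = \begin{pmatrix} X_A & X_B & X_C \end{pmatrix}
\]
\[
X^{\top}X =
\begin{pmatrix}
X_A^{\top}X_A & X_A^{\top}X_B & X_A^{\top}X_C \\
X_B^{\top}X_A & X_B^{\top}X_B & X_B^{\top}X_C \\
X_C^{\top}X_A & X_C^{\top}X_B & X_C^{\top}X_C
\end{pmatrix},
\qquad
\bigl(X_A^{\top}X_B\bigr)_{jl} = \sum_{i} x_{A,ij}\, x_{B,il}
\]
```

The diagonal blocks involve one source's variables only and can be computed locally. The off-diagonal blocks combine variables held by different sources on the same individuals, which is why they need a secure procedure. By symmetry only the six blocks listed on the slide are required.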

Vertical DataSHIELD
Regression coefficients = XTY / XTX
XTX: need to calculate XAXA, XAXB, XAXC, XBXB, XBXC, XCXC
MA XTA and XB MB give MA XTA XB MB
(MA)^-1 MA XTA XB MB (MB)^-1 = XAXB
[Diagram labels: Analysis Computer; R; Web services; Data computer; Opal; NHS; ALSPAC; Education; XA; XB; MA; MB; IM5: 500 70.56657 297 7646.29164 65.39412 382]
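The identity on the last line can be checked numerically. The sketch below only verifies the algebra with random matrices. It makes no claim about which party performs which multiplication in DataSHIELD, nor about the security of the scheme. The dimensions, data and variable names are hypothetical, and the rows of XA and XB are assumed to refer to the same individuals in the same order.

```r
# Numerical check of (MA)^-1 MA XA' XB MB (MB)^-1 = XA' XB (illustrative only).
set.seed(42)
n  <- 50
XA <- matrix(rnorm(n * 2), n, 2)   # hypothetical variables held by source A
XB <- matrix(rnorm(n * 3), n, 3)   # hypothetical variables held by source B

# Random square masking matrices; a random Gaussian matrix is invertible
# with probability 1, which is an assumption of this sketch.
MA <- matrix(rnorm(2 * 2), 2, 2)
MB <- matrix(rnorm(3 * 3), 3, 3)

masked_A <- MA %*% t(XA)                 # MA XA'
masked_B <- XB %*% MB                    # XB MB
masked_product <- masked_A %*% masked_B  # MA XA' XB MB

# Removing both masks recovers the cross-product block.
recovered <- solve(MA) %*% masked_product %*% solve(MB)
direct    <- t(XA) %*% XB

print(max(abs(recovered - direct)))      # should be at rounding-error level
```

Only the holder of a mask can remove it, which is the property the slide relies on. How the masked matrices are exchanged between the data computers and the analysis computer is not specified here.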

Why DataSHIELD?
Horizontal DataSHIELD multi-site
  Secured IPD meta-analysis or study-level meta-analysis where different studies hold the same variables on different individuals
  Data remain behind the firewall of the study holding the data (which could be a formal repository) and are invisible and unobtainable externally
  Appropriate disclosure settings under control of the data controller
  Open-source freeware
Horizontal DataSHIELD single site
  Open-source freeware approach to creating a secure data enclave
Vertical DataSHIELD
  Ultra-secure analysis of very sensitive linked data where no source is prepared for its linked data to be held by any other source or by a trusted third party
  Securing the linkage process itself (possibly!)
Single site Horizontal DataSHIELD
  Post-publication “open access” to sensitive data
  “Open access” to simple descriptive stats from rigorously governed studies
  Analytic access to sensitive (but not ultra-sensitive) linked data
  Analytic access to data collected by researchers in resource-poor regions

A proposal: securing linked data on Farr/ADRN secure enclave behind Horizontal DataSHIELD
Individual level data or disclosive analysis not transmitted or seen outside repository
SUMMARY STATISTICS
SV5: [36, 487.2951, 487.2951, 149]
IM5: 500 70.56657 297 7646.29164 65.39412 382
[Diagram labels: R User community (R User 1, R User 2, R User 3); Web services; Farr ADRN Portal; R; Linked datasets in secure enclave; ALSPAC data linked to National Pupil database; NHS WALES (Outpatient data, Emergency dept, Patient episode database); UK Biobank]

THANK YOU FOR LISTENING