1
Topics at the Interface of Privacy and Genomics
Anthony Philippakis, MD PhD, Chief Data Officer, Broad Institute. June 4th, 2018
2
Who am I? Disclaimer: I am not a crypto person!!!
Chief Data Officer of the Broad Institute: lead the Data Sciences Platform. Previously studied pure math, but that’s ancient history.
Trained as a cardiologist at BWH: care for patients with rare, genetic CV diseases.
Venture Partner at GV (venture capital): invest at the intersection of tech and life sciences.
3
The Challenge of Scalability in Genomics
Globally, genomic data doubles every 8 months
4
Inverting the Model of Genomic Data Sharing
Traditional approach: bring data to researchers. Problems: data sharing = data copying; security (data handoffs); huge infrastructure needed; siloed compute.
Opportunity: bring researchers to the data, hosted in a public cloud. Advantages: cost; threat detection and auditing; increased accessibility; shared and elastic compute.
5
Opportunities in Security & Compliance
Areas where technology can have a big impact: management of data use, multiparty computation, differential privacy.
6
Our Current Protocol for Data Access
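Diagram: Data Depositors submit Data Use Limitations, Data Requestors submit Project Request Forms, and the Data Access Committee compares the two.
Data use limitation: “This data is available for cancer research in a non-profit setting.” Request: “I am studying breast cancer at a company.” Committee decision: No!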
7
Our Current Protocol for Data Access
Diagram: Data Depositors submit Data Use Limitations, Data Requestors submit Project Request Forms, and the Data Access Committee compares the two.
Data use limitation: “This data is available for cancer research in a non-profit setting.” Request: “I am studying breast cancer at a non-profit.” Committee decision: Yes!
8
Human review of data access does not scale
Diagram: Data Depositors (Data Use Limitations), Data Requestors (Data Access Request), Data Access Committee. Scales poorly!! O(N^2)
dbGaP as of July 1, 2017 (source: dbGaP at PRIM&R 2017): 50,167 Submitted; 34,16 Approved; Number of studies in dbGaP 5,. The slide also lists the number of PIs requesting data, the number of PI countries, and the number of publications resulting from secondary data use.
9
Problem: Data Use is not Coded!
Data use restrictions: what are you doing with the data? Example: “The donor wants her data used only for non-commercial cancer research.”
Permissions: who are you? Example: “Only consortium members can READ this data until it is published.”
Decision flow: does the request satisfy the data use restrictions, and are appropriate permissions available? If either answer is no: no access.
Main question: can data use restrictions be made machine-readable?
10
DUOS: Broad Data Use Oversight System
What is DUOS?
(1) Interfaces to transform data use restrictions and data access requests into machine-readable code (ADA-M & Consent Codes).
(2) A matching algorithm that checks whether data access requests are compatible with data use restrictions.
(3) Interfaces for the Data Access Committee to adjudicate whether structuring and matching have been done appropriately.
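The matching idea can be illustrated with a minimal sketch. This is not the DUOS algorithm or the ADA-M / Consent Codes schema: the field names, the `matches` function, and the three-term disease hierarchy are all hypothetical, chosen only to reproduce the cancer / non-profit example from the earlier slides.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy disease hierarchy (child -> parent). A real system would
# rely on a curated disease ontology rather than a hand-written dict.
DISEASE_PARENT = {
    "breast cancer": "cancer",
    "cancer": "disease",
    "diabetes": "disease",
}


def is_a(term: str, ancestor: str) -> bool:
    """Return True if `term` equals `ancestor` or descends from it."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = DISEASE_PARENT.get(current)
    return False


@dataclass(frozen=True)
class DataUseRestriction:
    allowed_disease: str      # research must fall under this disease term
    commercial_allowed: bool  # may for-profit organizations use the data?


@dataclass(frozen=True)
class DataAccessRequest:
    research_disease: str     # what the requestor is studying
    is_commercial: bool       # is the requestor a company?


def matches(restriction: DataUseRestriction, request: DataAccessRequest) -> bool:
    """Check whether a structured request is compatible with a restriction."""
    disease_ok = is_a(request.research_disease, restriction.allowed_disease)
    commercial_ok = restriction.commercial_allowed or not request.is_commercial
    return disease_ok and commercial_ok


# "This data is available for cancer research in a non-profit setting."
restriction = DataUseRestriction(allowed_disease="cancer", commercial_allowed=False)
company_request = DataAccessRequest("breast cancer", is_commercial=True)
nonprofit_request = DataAccessRequest("breast cancer", is_commercial=False)

print(matches(restriction, company_request))    # False
print(matches(restriction, nonprofit_request))  # True
```

Once both sides are structured, each comparison becomes a cheap lookup, and the committee can spend its time reviewing whether the structuring itself was done correctly.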
11
Validation of DUOS
Claim: data use can be structured. Review of ~150 Data Use Limitations letters at Broad demonstrated that ~90% can be structured with the following ontologies:
Diseases: diabetes research only, breast cancer research only, etc.
Commercial use: allowed / not allowed.
Special populations: ethnicities, gender, pediatric, etc.
Future use for methods development, aggregate statistics, controls.
Test: run a trial! We have formed a DAC to compare automated review of access to the traditional mode.
Data Access Committee: Pearl O’Rourke (Partners), Laura Rodriguez (NIH), John Wilbanks (Sage), Stacey Donnelly (Broad), Anthony Philippakis (Broad).
12
Initial results are very promising!
Validation of DUOS: >90% of data use restrictions were approved in structured form by the DAC.
13
Opportunities in Security & Compliance
Areas where technology can have a big impact: management of data use, multiparty computation, differential privacy.
14
Multiparty Computation in Genomics
Areas where secure multiparty computation (SMC) could have a big impact in genomics: the meta-analysis problem (Cohort 1, Cohort 2). Large cohorts have been assembled over many years as part of clinical trials and epidemiology research. Many researchers (especially in industry) are reluctant to share the whole cohort with another group. However, it is often mutually advantageous to do a focused meta-analysis (e.g., enrichment of a variant).
15
Multiparty Computation in Genomics
Areas where SMC could have a big impact in genomics: geographic restrictions on data storage. Many countries are passing laws requiring that data from human subjects be physically stored in that country. Clearly, we want to be able to cross-analyze large genetic cohorts from different countries. Most would agree, however, that storing encrypted data is acceptable (i.e., the secret-sharing paradigm).
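As a rough illustration of the secret-sharing paradigm, the sketch below uses additive secret sharing to total a variant count across two cohorts without any single server seeing a raw count. It is a toy under stated assumptions: the counts, the number of servers, and the modulus are made up, and a real deployment would need authenticated channels and protection against misbehaving parties.

```python
import secrets

# Illustrative modulus; shares are uniform integers in [0, PRIME).
PRIME = 2**61 - 1


def share(value: int, n_parties: int) -> list[int]:
    """Split `value` into n additive shares that sum to `value` mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares: list[int]) -> int:
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME


# Hypothetical carrier counts for one variant in Cohort 1 and Cohort 2.
cohort_counts = [132, 87]
n_parties = 3  # hypothetical compute servers, possibly in different countries

# Each cohort splits its count; server i receives share i from every cohort.
all_shares = [share(count, n_parties) for count in cohort_counts]

# Each server adds the shares it holds. Any single share is a uniformly
# random number, so no server learns a cohort's raw count.
server_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(n_parties)]

# Only the combined total is revealed when the server sums are recombined.
total = reconstruct(server_sums)
print(total)  # 219
```

Addition is the easy case because shares can be summed locally; multiplications and comparisons need extra rounds of communication, which is one source of the computational overhead that secure multiparty computation carries.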
16
Secure Data Exchange: Secure Multiparty Computation
Challenges to software-based implementations: requires learning new, specialized programming languages; increased computational overhead (a big deal, given the size of genomic datasets); with N parties, need to keep N copies of the data.
17
Secure Data Exchange: Secure Multiparty Computation
Idea: a hardware-based approach. My group is in the early days of exploring hardware-based SMC with Intel. The idea is to build a “Data Switzerland” for life sciences.
18
Opportunities in Security & Compliance
Areas where technology can have a big impact: management of data use, multiparty computation, differential privacy.
19
Differential Privacy
Figure: data matrix with individuals as rows, O(10^6) but growing fast, and two blocks of columns: genotypes, O(10^7), and phenotypes, O(10^1 – 10^4).
Potential for differential privacy in human genetics: there is huge interest in correlating genotypes and phenotypes to discover the genetic basis of disease. There is real appetite for the idea of a trusted and trustworthy database that researchers can query against without risk of re-identifying participants.
20
Differential Privacy: Why hasn’t it happened??? (my $0.02)
I’m told that allowing arbitrary computations requires adding a LOT of noise... How much privacy is OK to leak? How do you dole out the privacy budget to researchers? Is there a robust, production-grade system that is ready to use (and at this scale)?
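To make the noise and budget questions concrete, here is a minimal sketch of the Laplace mechanism for counting queries with a simple per-database privacy budget. It is an assumption-laden toy, not a production system: the class and field names are hypothetical, the participants are made up, and the sampler uses ordinary floating-point arithmetic, which real deployments have to treat more carefully.

```python
import random


class PrivacyBudgetExceeded(Exception):
    """Raised when a query asks for more epsilon than remains."""


class PrivateCounter:
    """Answers counting queries with Laplace noise and tracks epsilon spent."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self._remaining = total_epsilon
        self._rng = random.Random(seed)

    @property
    def remaining_epsilon(self):
        return self._remaining

    def _laplace(self, scale):
        # The difference of two independent Exponential(1) draws follows a
        # Laplace(0, 1) distribution; multiplying by `scale` rescales it.
        return scale * (self._rng.expovariate(1.0) - self._rng.expovariate(1.0))

    def noisy_count(self, predicate, epsilon):
        """Return a noisy count of records matching `predicate`."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PrivacyBudgetExceeded("not enough privacy budget left")
        self._remaining -= epsilon
        true_count = sum(1 for record in self._records if predicate(record))
        # Adding or removing one person changes a count by at most 1
        # (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
        return true_count + self._laplace(1.0 / epsilon)


# Hypothetical participants: variant carrier status and a disease phenotype.
participants = [
    {"carrier": True, "has_disease": True},
    {"carrier": True, "has_disease": False},
    {"carrier": False, "has_disease": False},
    {"carrier": False, "has_disease": True},
]

db = PrivateCounter(participants, total_epsilon=1.0)
print(db.noisy_count(lambda p: p["carrier"] and p["has_disease"], epsilon=0.5))
print(db.remaining_epsilon)  # 0.5
```

Every answered query spends part of a finite budget, and smaller epsilon values mean noisier answers, so exploring O(10^7) genotypes against thousands of phenotypes exhausts a budget quickly. That is the tension behind the questions on this slide.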
21
Closing thoughts
The Broad Data Sciences Platform is a team of nearly 150 people that focuses on making robust software products. We are organized more like a tech company than an academic research group. We are heavily involved in applied security efforts as part of large, national sequencing initiatives. I would love for us to be more involved in innovation at the intersection of life sciences and data sciences. Please contact me if you think you might want to collaborate!