1
The Data Commons An introduction & Overview
BD2K AHM, November 29, 2016
Vivien Bonazzi (ADDS)
Current snapshot of Commons status
2
Outline
What's driving the need for a Data Commons?
Development of the Data Commons at NIH
Current Data Commons Pilots
Next steps
Considerations & concluding thoughts
3
What’s driving the need for a Data Commons?
4
Convergence of factors
Mountains of data
Increasing need and support for data sharing
Availability of digital technologies and infrastructures that support data at scale
7
Went into effect January 25
NCI guidance: management/nci-policies/genomic-data
Requires public sharing of genomic data sets
8
Recommendation #4: A national cancer data ecosystem for sharing and analysis.
Create a National Cancer Data Ecosystem to collect, share, and interconnect a broad array of large datasets so that researchers, clinicians, and patients will be able to both contribute and analyze data, facilitating discovery that will ultimately improve patient care and outcomes.
11
Challenges with Biomedical Data
The journal article is the end goal
Data is a means to an end (low value)
Data is not FAIR: Findable, Accessible, Interoperable, Reusable
Limited e-infrastructures to support FAIR data
12
What’s Changing? Digital ecosystems
13
Development of the NIH Data Commons
14
Changing the conversation around Data sharing and access
NIH Data Commons
How do we find data, software, and standards?
How can we make (large) data, annotations, software, and metadata accessible?
How do we reuse data, tools, and standards?
How do we make more data machine readable?
How do we leverage existing digital technologies, systems, and infrastructures?
How do we collaborate?
How do we enable a digital ecosystem?
15
Data Commons enabling data driven science
"Enable investigators to leverage all possible data and tools in the effort to accelerate biomedical discoveries, therapies and cures by driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering." – Matthew Trunnell, FHC
16
Data Commons’s
17
Developing a Data Commons
Treats products of research – data, methods, papers, etc. – as digital objects
These digital objects exist in a shared virtual space
Find, deposit, manage, share, and reuse data, software, metadata, and workflows
Digital object compliance through the FAIR principles: Findable, Accessible (and usable), Interoperable, Reusable
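To make the digital-object idea concrete, here is a minimal Python sketch of a metadata record that carries FAIR-relevant fields alongside a research product; the field names, identifier scheme, and example values are illustrative assumptions, not a Commons specification.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DigitalObject:
    """Illustrative record for a research product (data set, tool, or workflow)."""
    object_id: str            # persistent identifier (Findable)
    title: str
    object_type: str          # e.g. "dataset", "software", "workflow"
    access_url: str           # where the object can be retrieved (Accessible)
    media_type: str           # standard format aids machine readability (Interoperable)
    license: str              # clear reuse terms (Reusable)
    keywords: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the metadata so an indexer could harvest it."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example values, used only to show the shape of the record.
example = DigitalObject(
    object_id="doi:10.1234/example",
    title="Example reference data set",
    object_type="dataset",
    access_url="https://example.org/data",
    media_type="text/tab-separated-values",
    license="CC0-1.0",
    keywords=["commons", "pilot"],
)
print(example.to_json())
```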
18
The Data Commons is a framework
The Data Commons is a framework that supports FAIR data access and sharing and fosters the development of a digital ecosystem
19
The Data Commons Framework
[Framework diagram]
Compute Platform: Cloud
Services: APIs, Containers, Indexing
Software: Services & Tools – scientific analysis tools/workflows
Data: "Reference" Data Sets, user-defined data
Digital Object Compliance
App store/User Interface
Service models spanned by the stack: IaaS, PaaS, SaaS
Detailed description of the Commons Framework can be found at:
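For orientation, here is a small, purely descriptive Python summary of the framework layers listed above; the keys and groupings are illustrative assumptions rather than an official schema, and the slide's IaaS/PaaS/SaaS labels are left out because their exact alignment to the layers is not spelled out in the text.

```python
# Descriptive summary of the Commons Framework layers shown above.
# The names and grouping are illustrative, not a formal specification.
COMMONS_FRAMEWORK_LAYERS = {
    "compute_platform": ["cloud"],
    "services": ["APIs", "containers", "indexing"],
    "software": ["scientific analysis tools", "workflows"],
    "data": ["reference data sets", "user-defined data"],
    "cross_cutting": ["digital object compliance", "app store / user interface"],
}

for layer, components in COMMONS_FRAMEWORK_LAYERS.items():
    print(f"{layer}: {', '.join(components)}")
```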
20
Mapping BD2K Activities and Commons Pilots to the Commons Framework
[Diagram: BD2K activities and Commons Pilots overlaid on the framework layers (Compute Platform: Cloud or HPC; Services: APIs, Containers, Indexing; Software: Services & Tools – scientific analysis tools/workflows; Data: "Reference" Data Sets and user-defined data; Digital Object Compliance; App store/User Interface). Activities shown: BD2K Centers, MODs, HMP & Interoperability Supplements; NCI & NIAID Cloud Pilots + GDC; bioCADDIE/other indexing; NIH + community-defined data sets; Cloud credits model (CCM).]
21
Current Data Commons Pilots
22
Current Data Commons Pilots
Reference Data Sets: Making large and/or high-impact NIH-funded data sets and tools accessible in the cloud
Commons Framework Pilots: Explore feasibility of the Commons Framework; facilitate collaboration and interoperability
Resource Search & Index: Developing data and software indexing methods; leveraging BD2K efforts (bioCADDIE and others); collaborating with external groups
Cloud Credits Model: Provide access to cloud (IaaS) and PaaS/SaaS via credits; connecting credits to the grants system
23
Reference Data Sets Pilot: Large, High-Impact Datasets in the Cloud
Vivien Bonazzi
24
Mapping to the Commons Framework: Large, High-Impact Datasets in the Cloud – Populating the Commons
[Framework diagram: Compute Platform (Cloud or HPC); Services (APIs, Containers, Indexing); Software (Services & Tools – scientific analysis tools/workflows); Data ("Reference" Data Sets, user-defined data); Digital Object Compliance; App store/User Interface. Label overlaid on the diagram: Large, High-Impact Data Sets in the Cloud.]
25
Overview: Large, High-Impact Datasets in the Cloud – Populating the Commons
Make large, high-impact, NIH-funded data sets available in the cloud/Commons
Co-locate large data sets and compute power to improve access, use, re-use, and sharing of data and tools (see the sketch below)
Kick-start the Commons with Commons-compliant data and tools
Data must adhere to Commons compliance / FAIR principles
Provide indexable test data sets for bioCADDIE (and other indexing efforts)
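As an illustration of co-locating data and compute, the following sketch lists a few files from a cloud-hosted, open-access reference data set using the AWS SDK for Python (boto3); the bucket name and prefix are hypothetical placeholders, and a real Commons-hosted data set would have its own location, region, and access policy.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: suitable only for open-access data; controlled-access
# data would require authentication and authorization (see the Auth component later).
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Hypothetical bucket and prefix standing in for a Commons-hosted reference data set.
response = s3.list_objects_v2(
    Bucket="example-commons-reference-data",
    Prefix="hmp1/",
    MaxKeys=10,
)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```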
26
What will we learn: Large, High-Impact Datasets in the Cloud - Populating the Commons
This pilot project will inform NIH on:
Which clouds are most functional, practical, and cost-effective?
What is involved in moving data resources to the cloud?
What will it cost?
How to manage challenges associated with both open-access and controlled-access data?
How do we find data and resources across clouds?
How do we compute across clouds?
27
Proposed Components: Large, High-Impact Datasets in the Cloud
Biomedical data resources and tools
  Support to migrate large, high-impact data sets and associated tools into multiple cloud providers
  Data and tool sets must be FAIR
Cloud infrastructure
  Support for cloud storage and architectural engineering to support data and tools
Coordination
  Facilitate activities across the biomedical data resources and cloud providers
  Development of marketplace/app store approaches
  Auth: authorization & access controls
  Tracking metrics (cost, usage, etc.) and impact of the overall project
28
Reference Data Sets – Next Steps
NIH Data Task Force
  Chaired by Francis Collins; involves many NIH ICs
  Developing some shorter-term preliminary pilots for larger NIH-funded data sets in the cloud
  Expect to see some announcements in Jan/Feb 2017
RFI – engage in dialogue with the community
  Planned Winter 2017
FOAs – supporting large, high-impact data sets in the cloud
  Spring 2017
29
Commons Framework Pilots
Exploring feasibility of the Commons Framework: Software and Services layer
Valentina Di Francesco
30
Commons Framework Pilots (CFPs)
Exploring feasibility of the Commons Framework
Facilitating connectivity, interoperability, and access to digital objects
Providing digital research objects to populate the Commons
31
Commons Framework Pilots
PI (parent grant's IC) – project description:
Toga (NIBIB): Cloud-hosted data publication system; allows the automatic creation and publication of data as a personalized data repository
Musen (NIAID): Smart APIs – improved handling of metadata within APIs; ontological support for metadata within an API; improving smart API discoverability via a registry of APIs
Han (NIGMS): Docker container hub for the BD2K community; Docker containers for genomic analysis applications and pipelines; benchmarking, evaluation & best practices
Cooper/Kohane (NHGRI): Cloud-based authenticated API access and exchange of causal modeling data, tools + genomic and phenomic data (PIC); Docker containers for CCD tools available in AWS
Haussler: Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations (GA4GH); API for querying this data and metadata
Ohno-Machado (NHLBI): Development of an ecosystem for repeatable science – easy reuse of data and software, tracking of provenance; use of container technologies for software and data reuse
White: The entire HMP1 data set made accessible on AWS; analysis tools for microbiome data in AWS
Ma’ayan: A cloud-based microscopy imaging commons – portal with microscopy data and metadata
Sternberg: Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites; an API to provide programmatic access to the relevant papers in PMC
MODs PIs: Development of a common data model for the MODs; development of APIs accessing data across the MODs
32
Commons Framework Pilots
APIs
Containerization: Docker containers, guidelines, registry store (see the sketch below)
Workbenches, connectors
Indexing
Marketplace/app store
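A rough sketch of what the containerization theme looks like in practice: wrapping an analysis run in a Docker container so the software environment travels with the tool. The image name and the tool's command-line options below are hypothetical; only the standard docker run flags are real.

```python
import os
import subprocess

IMAGE = "example/genomics-toolkit:1.0"  # hypothetical image name


def run_containerized_tool(input_dir: str, output_dir: str) -> None:
    """Run an analysis inside a container so the software environment is reproducible."""
    in_dir = os.path.abspath(input_dir)    # docker bind mounts need absolute paths
    out_dir = os.path.abspath(output_dir)
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{in_dir}:/data/in:ro",   # mount inputs read-only
            "-v", f"{out_dir}:/data/out",    # mount a writable output area
            IMAGE,
            "analyze", "--in", "/data/in", "--out", "/data/out",  # hypothetical CLI
        ],
        check=True,
    )


if __name__ == "__main__":
    run_containerized_tool("./inputs", "./results")
```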
33
Mapping the Commons Framework PILOTS to the Commons Framework
[Diagram: This slide maps the FY15-funded CFPs onto the framework layers (Compute Platform: Cloud or HPC; Services: APIs, Containers, Indexing; Software: Services & Tools – scientific analysis tools/workflows; Data: "Reference" Data Sets and user-defined data; App store/User Interface). Pilots shown: White (HMP), Musen, Ma’ayan, Cooper, Han, Haussler, MODs, Sternberg, Ohno-Machado, Toga.]
34
Commons Framework Pilots : Updates
Sept – first set of CFPs awarded
Nov – CFPs participated in the AHM and the Commons breakout session
Feb – established the Commons Framework Working Group (CFWG)
  CFWG members: the pilots' PIs and/or technical leads, plus a few PIs of the BD2K interoperability projects
  Met in person on March 1, 2016
35
Commons Framework Pilots : Updates
March 2016 – CFPs met in person to develop an initial plan for the implementation of the Commons Framework
  Meeting presentations here
  A manuscript describing the outcomes of the meeting was submitted
Established the Commons Framework Working Group (CFWG) and sub-WGs on the following topics:
  FAIRness metrics (Neil McKenna & Michel Dumontier)
  Data-object registry (Lucila Ohno-Machado, Michel Dumontier, Wei Wang)
  Interoperability of APIs (Michel Dumontier)
  Workflow sharing and Docker registry (Umberto Ravaioli & Brian O’Connor)
  Commons Framework publications (Owen White)
Nov 28, 2016 – held a CFWG meeting in person
These groups will present a report of their activities at the Commons session tomorrow at 10:30 am
36
Commons Framework WG - Next Steps
GET INVOLVED: see Valentina Di Francesco or the WG leads for details
A broad announcement to the BD2K research community went out in late summer – we are seeking more participants
Contribute to the implementation of the Commons Framework
Suggest other scientific areas of interest that need coordination
Generate guidelines that all of our peers will use as we begin to jumpstart the NIH Commons
Participate in meetings of the CFWG and hear the latest news
37
Commons Framework – Next Steps
FOA: Support investigator-initiated projects to further develop the Data Commons Framework
  Could leverage and expand upon resources developed with the reference data sets
  Planned Fall 2017
FOA: Making existing data and tools Commons-compliant/FAIR
  Competitive supplements to existing NIH awards
  Provide support to existing projects to make current digital resources FAIR & Commons-compliant
  Digital resources could include data, analytical software, or workflows
38
Resource Search & Indexing
Discoverability of data and software
Ian Fore, Ron Margolis, Alison Yao, Claire Schulkey, Dawei Lin
39
Mapping to the Commons Framework
[Framework diagram: Compute Platform (Cloud or HPC); Services (APIs, Containers, Indexing); Software (Services & Tools – scientific analysis tools/workflows); Data ("Reference" Data Sets, user-defined data); Digital Object Compliance; App store/User Interface. Label overlaid on the diagram: Indexing.]
40
An Indexing Ecosystem for the Commons: a virtual environment for ‘FIND’
Enable biomedical research by providing scientists with the ability to FIND digital resources
Establish mature resource discovery tool(s) that can be sustained as long as the need for them exists
Focus on the characteristics of the tool as infrastructure:
  Maintains a defined level of service
  Contributes to a Commons that is reliable, available, easy to use, and adaptable
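As a sketch of what "FIND" could look like programmatically, the snippet below queries a hypothetical resource index over HTTP; the endpoint, query parameters, and response fields are assumptions for illustration and do not describe bioCADDIE/DataMed or any specific NIH service.

```python
import requests

# Hypothetical index endpoint; real discovery services define their own
# query APIs and response schemas.
INDEX_URL = "https://index.example.org/api/v1/search"


def find_resources(query: str, resource_type: str = "dataset") -> list:
    """Search the index for digital objects matching a free-text query."""
    response = requests.get(
        INDEX_URL,
        params={"q": query, "type": resource_type, "limit": 5},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("results", [])


for hit in find_resources("human microbiome 16S"):
    print(hit.get("id"), "-", hit.get("title"))
```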
41
Current Activities
Identify indexing activities in and outside NIH
  BD2K: bioCADDIE, Centers of Excellence
  ICs: NLM, NCI, NHGRI, others
  Non-BD2K: ELIXIR (EBI), publishers (Elsevier), repositories, schema.org
Compare ongoing activities and identify needs
  Benchmarking; identify gaps in strategy
  Dimensions to consider: content, metadata, platform/technology
Coordinate with other BD2K PMWGs
  Standards
  Specific Center WGs
42
Cloud Credits Model
George Komatsoulis
43
Mapping to the Commons Framework
[Framework diagram: Compute Platform (Cloud or HPC); Services (APIs, Containers, Indexing); Software (Services & Tools – scientific analysis tools/workflows); Data ("Reference" Data Sets, user-defined data); Digital Object Compliance; App store/User Interface. Label overlaid on the diagram: Cloud Credits Pilot.]
45
How do credits work from the point of view of an investigator?
Investigators receive credits worth a certain amount (in dollars) that can be used at the conformant provider(s) of their choice
Credits are pre-purchased and applied to the investigator's account with the relevant provider(s)
As the investigator uses services with a conformant provider, the provider debits the value of the investigator's usage against the pre-loaded credits
INVESTIGATORS ARE NOT BILLED BY PROVIDERS AS LONG AS THEY DO NOT EXCEED THEIR CREDIT ALLOCATION.
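The credit flow above can be summarized with a toy bookkeeping sketch; this is only an illustration of the debit-against-allocation idea, not the pilot's actual accounting system, and the names and amounts are made up.

```python
class CreditAccount:
    """Toy model of the credit flow described above (not the pilot's real system)."""

    def __init__(self, investigator: str, credits_usd: float) -> None:
        self.investigator = investigator
        self.balance = credits_usd  # pre-purchased credits, in dollars

    def debit_usage(self, service: str, cost_usd: float) -> None:
        """Provider debits the value of usage against the pre-loaded credits."""
        if cost_usd > self.balance:
            # Usage beyond the allocation falls outside the credit mechanism;
            # in this sketch we simply refuse the charge.
            raise ValueError(
                f"{self.investigator} would exceed the credit allocation "
                f"({cost_usd:.2f} > {self.balance:.2f})"
            )
        self.balance -= cost_usd
        print(f"{service}: charged ${cost_usd:.2f}, ${self.balance:.2f} remaining")


account = CreditAccount("Dr. Example", credits_usd=1000.00)
account.debit_usage("object storage (1 month)", 42.50)
account.debit_usage("compute (batch run)", 310.00)
```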
46
Commons Credits Model Pilot
3-year pilot to test this business model to facilitate researcher use of cloud resources (enhance data sharing and potentially reduce costs)
Contract with the CMS Alliance to Modernize Healthcare (CAMH), a Federally Funded Research and Development Center (FFRDC) managed by the MITRE Corporation
  FFRDCs are special-purpose, government-owned but contractor-managed entities that meet R&D needs that can't be well managed by traditional grants and contracts
  Examples: the National Labs and organizations like RAND
The pilot will not directly interact with the existing grant system; instead it is modeled on the mechanisms used to gain access to NSF and DOE national resources (HPC, light sources, etc.)
The only required qualification for applying for credits is that the investigator must have an existing NIH grant
47
Commons Credits Model Pilot
Current list of approved vendors:
  DLT (Amazon Web Services reseller)
  IBM
  Onix (Google reseller)
  Broad and ISB NCI Cloud Pilots accessible via Google
  Two more approved but negotiating participation agreements
First batch of credits issued Sep 29, 2016
  8 investigators (cohort 1) as part of an 'alpha test'
  Only IBM/AWS available at the time: 93% AWS, 7% IBM
  First credits have been used; usage information coming
First "production" credit request period opening this month
48
Considerations and Concluding Thoughts
49
Considerations
Communication
Metrics – understanding and accounting of data usage patterns
Cost
  Cloud storage
  Pay-for-use cloud compute (NIH credits pilot)
  Indirect costs for cloud
Hybrid clouds – institutional (private) and commercial (public) clouds
Managing open- vs controlled-access data
Auth: single sign-on – dreams/nightmares?
Archive vs working copies of data
Interoperability with other Commons (clouds)
50
Standards – metadata, UIDs, APIs
Discoverability – finding digital objects across clouds
Interfaces – for users with different needs and capabilities
Consent – re-consenting data, dynamic consents?
Policies
  Data sharing policies that are useful and effective
  Keep pace with use of technology (e.g. dbGaP data in the cloud)
Incentives
  Access to, and shareability of, FAIR data as part of NIH grant review criteria
Governance – community involvement in governance models
Sustainability – long-term support
51
Summary: "We need an unprecedented level of convergence and collaboration to drive biomedical science to the next level. Supporting this model of data-intensive collaborative science requires a shift in academic research culture and new investments in data infrastructure and capabilities." – Matthew Trunnell, FHC
52
Acknowledgments
ADDS Office: Jennie Larkin, Phil Bourne, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)
NCBI: George Komatsoulis
NHGRI: Valentina Di Francesco
NIGMS: Susan Gregurick
CIT: Andrea Norris, Debbie Sinmao
NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
NCI Cloud Pilots/GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)
RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)
OSP: Dina Paltoo, Kris Langlais, Erin Luetkemeier, Agnes Rooke
Research and Industry: Matthew Trunnell (FHC), Bob Grossman (Chicago), Toby Bloom (NYGC)
53
Acknowledgements- CFPs
NIH CFPs WG: Valentina Di Francesco, Sam Moore, Vivien Bonazzi, Allen Dearry, Maria Giovanni, Susan Gregurick, Weiniu Gan, James Luo, Stacia Friedman-Hill, Ajay Pillai, Leslie Derr, Debbie Sinmao, Eric Choi, Claire Schulkey, George Komatsoulis
CFWG: Owen White, Neil McKenna, Michel Dumontier, Umberto Ravaioli, Brian O’Connor, Lucila Ohno-Machado, Wei Wang, and all the other members
54
Acknowledgements - Credits Model
ADDS Office: Vivien Bonazzi, Phil Bourne, Jennie Larkin, Mark Guyer
MITRE: Ari Abrams-Kudan, Wenling (Eileen) Chang, Peter Gutgarts, Lynette Hirschman, William Kim, Eldred Rubeiro, Bruce Shirk, David Tanenbaum, Lisa Tutterow
Grant Thornton: Katie Beringer, Mike Clifford, Tamara Reynolds
NIH: Tanja Davidsen (NCI), Valentina Di Francesco (NHGRI), Susan Gregurick (NIGMS), David Lipman (NCBI), Vivek Navale (CIT), Jim Ostell (NCBI), Debbie Sinmao (CIT), Nick Weber (NIAID)
NITRD: Peter Lyster
55
Stay in Touch
Vivien Bonazzi – bonazziv@mail.nih.gov
QR Business Card
LinkedIn: @Vivien.Bonazzi
Slideshare
Blog (coming soon!)