Enhancements to Galaxy for delivering on NIH Commons

1 Enhancements to Galaxy for delivering on NIH Commons
Ravi K Madduri

2 Outline
NIH Commons
NIH BD2K Center - BDDS
Building blocks for NIH Commons
  Data Management
  Data Identification
  Data Analysis
  Data Publication

3 What is Commons?

4 NIH Commons
The Commons is a shared virtual space where scientists can work with the digital objects of biomedical research, i.e. it is a system that will allow investigators to find, manage, share, use and reuse data, software, metadata and workflows.
It will be a complex ecosystem and thus the realization of the Commons will require the use, further development and harmonization of several components.
A reference architecture
A collection of best practices
A policy

5 The Commons is a distributed system
Diagram labels: NCI GDC, Cloud 1, Cloud 2, Data Commons 2, Bionimbus

6 Building the Commons
https://datascience.nih.gov/commons
A computing environment, such as the cloud or HPC resources, which support access, utilization, and storage of digital objects
Public data sets that adhere to Commons Digital Object Compliance principles
Software services and tools that enable:
  Scalable provisioning of compute resources
  Interoperability between digital objects within the Commons
  Indexing and thus discoverability of digital objects
  Sharing of digital objects between individuals or groups
  Access to and deployment of scientific analysis tools and pipeline workflows
  Connectivity with other repositories, registries and resources that support scholarly research

8 The Commons
“To meet the most basic level of compliance, it is expected that digital objects would have the following elements:
  Unique digital object identifiers
  A minimal set of searchable metadata
  Physical availability through a cloud-based Commons provider
  Clear access rules and controls (especially important for human subjects data)
  An entry (with metadata) in one or more indices”

9 The Big Data for Discovery Science Center (BDDS) - comprised of leading experts in biomedical imaging, genetics, proteomics, and computer science - is taking an "-ome to home" approach toward streamlining big data management, aggregation, manipulation, integration, and the modeling of biological systems across spatial and temporal scales.

10 Globus and the research data lifecycle
Diagram labels: Compute Facility, Instrument, Personal Computer, Publication Repository
Transfer
  1 Researcher initiates transfer request; or requested automatically by script, science gateway
  2 Globus transfers files reliably, securely
Share
  3 Researcher selects files to share, selects user or group, and sets access permissions
  4 Globus controls access to shared files on existing storage; no need to move files to cloud storage!
  5 Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
Publish
  6 Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)
  7 Curator reviews and approves; data set published on campus or other system
Discover
  8 Peers, collaborators search and discover datasets; transfer and share using Globus
SaaS
  Only a web browser required
  Use storage system of your choice
  Access using your campus credentials

11 BDbag: Packaging data for interchange
A packaging format for encapsulating
  Payload: arbitrary content
  Tags: metadata describing the payload
  Checksums: supports verification of content

Bio_data_bag/
|-- data
|   \-- genomic
|       \-- 2a673.fastq
|-- manifest-md5.txt
|     afbfa bfa data/genomic/2a673.fasta
|-- bagit.txt
      Contact-Name: John Smith
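The payload, tags, and checksums described on this slide can be sketched with the Python standard library alone. This is a minimal illustration of the BagIt-style layout, not the BDBag tooling itself: the function names `md5_of` and `make_bag` are hypothetical, the `BagIt-Version` value is an assumption, and placing tags such as `Contact-Name` in `bag-info.txt` follows the general BagIt convention rather than anything shown on the slide.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, payload: dict[str, bytes], tags: dict[str, str]) -> None:
    """Write payload files under data/, then the manifest and tag files.

    Hypothetical helper for illustration only; it is not part of BDBag.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Payload: arbitrary content, stored below data/.
    for rel_path, content in payload.items():
        target = data_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)

    # Bag declaration file (version value is an assumption).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Checksums: one "<md5>  <relative path>" line per payload file.
    lines = [
        f"{md5_of(p)}  {p.relative_to(bag_dir).as_posix()}"
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    ]
    (bag_dir / "manifest-md5.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Tags: metadata describing the payload.
    (bag_dir / "bag-info.txt").write_text(
        "".join(f"{key}: {value}\n" for key, value in tags.items()),
        encoding="utf-8",
    )
```

Calling `make_bag(Path("Bio_data_bag"), {"genomic/2a673.fastq": b"..."}, {"Contact-Name": "John Smith"})` would produce a directory shaped like the tree above, with the manifest computed from the actual file contents.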

12 Minimal viable identifiers (minid)
Every data item that you create can be automatically assigned a digital ID
You can reference it, share it, resolve it
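The assign, reference, and resolve cycle can be illustrated with a toy in-memory registry. This is only a sketch of the idea: `ToyRegistry`, `Record`, the `toy:` prefix, and the choice to key identifiers on a SHA-256 checksum are all assumptions made for the example, and none of it reflects the actual minid service or its API.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Record:
    """What an identifier resolves to in this toy model."""
    checksum: str
    locations: list[str] = field(default_factory=list)
    title: str = ""


class ToyRegistry:
    """In-memory stand-in for an identifier service (hypothetical)."""

    def __init__(self, prefix: str = "toy:") -> None:
        self.prefix = prefix
        self._records: dict[str, Record] = {}
        self._by_checksum: dict[str, str] = {}

    def mint(self, content: bytes, location: str, title: str = "") -> str:
        """Assign an identifier to content, reusing it for identical content."""
        checksum = hashlib.sha256(content).hexdigest()
        existing = self._by_checksum.get(checksum)
        if existing is not None:
            # Same bytes seen before: record the extra location only.
            record = self._records[existing]
            if location not in record.locations:
                record.locations.append(location)
            return existing
        identifier = f"{self.prefix}{len(self._records) + 1:06d}"
        self._records[identifier] = Record(checksum, [location], title)
        self._by_checksum[checksum] = identifier
        return identifier

    def resolve(self, identifier: str) -> Record:
        """Return the record for an identifier; raises KeyError if unknown."""
        return self._records[identifier]
```

In this model, sharing a data item means passing the short identifier around, and anyone holding it can call `resolve` to get the checksum and the known locations.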

13 Resolve a minid

14 Bringing it all together
Diagram labels: BDDS Collection, ERMrest, PPMI, ADNI, Adenocarcinoma, Adrenal, Brain
Workflow stages: QC, Alignment, Feature count, Alignment QC
Outputs: Alignment Files, Differential expression
1. Query and discover data
2. Transfer bags
3. Execute parallel alignment workflow on dynamically provisioned cloud resources
   Run workflow on each normal and tumor and publish QC files, alignment file, and count file.
3. Publish bags
4. Discover published data and execute comparison workflow

15 Bringing it all together: Phenome-Wide Association Study (PheWAS)
Diagram labels: dbGaP, IDA, BDDS Data Catalog, Alignment Files, Dynamic database, Raw genetic data, Alleles per subject, Process genetic data, Raw Brain MRI data, Processed MRI data, Process imaging data, Genetic Data, Brain MRI
1. Query and discover data (wherever it is)
2. Create bags
3. Query for specific genotype data
4. Create new bags of derived data
5. Query for specific imaging information based on the derived genetic data
6. Create new bags of derived data
7. Transfer bags out for PheWAS analysis

16 Galaxy tools created
The following tools are being created:
Tools to retrieve BDBags using minids
Tools to expand BDBags into input datasets
Tools to create BDBags of results along with minids
Tools to publish BDBags into Publication Service
Minids for Docker containers
Minids for Galaxy workflows
Available at:
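A step like "expand BDBags into input datasets" plausibly starts by checking the bag's manifest before handing the payload files to a workflow. The sketch below shows that check with the standard library only; `verify_bag` is a hypothetical helper, it assumes a `manifest-md5.txt` with one "checksum path" entry per line, and it is not the code of the Galaxy tools listed above.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir: Path) -> list[Path]:
    """Check every manifest entry and return the verified payload paths.

    Raises ValueError on a checksum mismatch and FileNotFoundError if a
    listed file is missing. Hypothetical helper for illustration only.
    """
    verified: list[Path] = []
    manifest = bag_dir / "manifest-md5.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        # Each entry is "<md5> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        if actual != expected.lower():
            raise ValueError(f"checksum mismatch for {rel_path}")
        verified.append(target)
    return verified
```

The returned paths are the files a downstream tool could register as input datasets once the checksums have been confirmed.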

17 Building the Commons: Review
Transfer, share, synchronize, track data
Package and identify data for sharing (BDbag)
Scalable cloud-based analysis

18 Thank you to our supporters!
U.S. DEPARTMENT OF ENERGY

