Enhancements to Galaxy for Delivering on the NIH Commons
Ravi K. Madduri
Outline
- NIH Commons
- NIH BD2K Center: BDDS
- Building blocks for the NIH Commons:
  - Data management
  - Data identification
  - Data analysis
  - Data publication
What is the Commons?
NIH Commons
The Commons is a shared virtual space where scientists can work with the digital objects of biomedical research: a system that allows investigators to find, manage, share, use, and reuse data, software, metadata, and workflows. It will be a complex ecosystem, so realizing the Commons will require the use, further development, and harmonization of several components:
- A reference architecture
- A collection of best practices
- A policy
The Commons is a distributed system
[Diagram: Commons components spanning multiple clouds, including the NCI Genomic Data Commons (GDC), Bionimbus, and other data commons]
Building the Commons (https://datascience.nih.gov/commons)
- A computing environment, such as cloud or HPC resources, that supports access, utilization, and storage of digital objects
- Public data sets that adhere to Commons Digital Object Compliance principles
- Software services and tools that enable:
  - Scalable provisioning of compute resources
  - Interoperability between digital objects within the Commons
  - Indexing, and thus discoverability, of digital objects
  - Sharing of digital objects between individuals or groups
  - Access to and deployment of scientific analysis tools and pipeline workflows
  - Connectivity with other repositories, registries, and resources that support scholarly research
The Commons
“To meet the most basic level of compliance, it is expected that digital objects would have the following elements:
- Unique digital object identifiers
- A minimal set of searchable metadata
- Physical availability through a cloud-based Commons provider
- Clear access rules and controls (especially important for human subjects data)
- An entry (with metadata) in one or more indices”
https://datascience.nih.gov/commons
The Big Data for Discovery Science Center (BDDS), comprising leading experts in biomedical imaging, genetics, proteomics, and computer science, is taking an "-ome to home" approach toward streamlining big data management, aggregation, manipulation, and integration, and the modeling of biological systems across spatial and temporal scales.
Globus and the research data lifecycle
Transfer:
1. Researcher initiates a transfer request, or a script or science gateway requests one automatically (e.g., from an instrument or personal computer to a compute facility)
2. Globus transfers files reliably and securely
Share:
3. Researcher selects files to share, selects a user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage
5. Collaborator logs in to Globus and accesses shared files; no local account required; downloads via Globus
Publish:
6. Researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific)
7. Curator reviews and approves; the data set is published on a campus or other system (publication repository)
Discover:
8. Peers and collaborators search for and discover data sets; transfer and share using Globus
Delivered as SaaS: only a web browser is required; use the storage system of your choice; access with your campus credentials.
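As a concrete illustration of step 1, here is a minimal sketch of submitting a transfer request with the Globus Python SDK (globus-sdk). The client ID, endpoint UUIDs, and paths below are placeholders, not values from this talk.

```python
# Minimal sketch: request a Globus transfer between two endpoints.
# CLIENT_ID, endpoint UUIDs, and paths are placeholders.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
SRC_ENDPOINT = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"  # e.g., instrument
DST_ENDPOINT = "ddb59af0-6d04-11e5-ba46-22000b92c6ec"  # e.g., compute facility

# Authenticate with the native-app flow (user logs in via browser)
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_tokens = tokens.by_resource_server["transfer.api.globus.org"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_tokens["access_token"])
)

# Assemble and submit the transfer request; Globus handles retries,
# integrity checking, and notification from here on.
tdata = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT, label="Instrument to compute facility"
)
tdata.add_item("/instrument/run42/", "/project/run42/", recursive=True)
task = tc.submit_transfer(tdata)
print("Task ID:", task["task_id"])
```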
BDbag: Packaging data for interchange
A packaging format that encapsulates:
- Payload: arbitrary content
- Tags: metadata describing the payload
- Checksums: support verification of content
Example bag layout:
  Bio_data_bag/
  |-- data/
  |   \-- genomic/
  |       \-- 2a673.fastq
  |-- manifest-md5.txt
  |     afbfa231324812378123bfa  data/genomic/2a673.fastq
  |-- bagit.txt
  \-- bag-info.txt
        Contact-Name: John Smith
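A minimal sketch of creating and validating such a bag with the bdbag Python API follows; the directory name and metadata are illustrative, and the calls assume the bdbag package's bdbag_api module.

```python
# Minimal sketch: turn a directory into a BDbag and verify it.
# The directory and metadata values are illustrative.
from bdbag import bdbag_api

bag_dir = "Bio_data_bag"  # contains data to package, e.g. genomic/2a673.fastq

# Create the bag in place: payload moves under data/, and checksum
# manifests, bagit.txt, and bag-info.txt are generated.
bdbag_api.make_bag(
    bag_dir,
    algs=["md5", "sha256"],
    metadata={"Contact-Name": "John Smith"},
)

# Verify that the payload matches the checksum manifests
assert bdbag_api.is_bag(bag_dir)
bdbag_api.validate_bag(bag_dir, fast=False)

# Optionally serialize the bag for exchange
archive = bdbag_api.archive_bag(bag_dir, "zip")
print("Wrote", archive)
```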
Minimal viable identifiers (minids)
- Every data item that you create can be automatically assigned a digital ID
- You can reference it, share it, and resolve it
Resolve a minid
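One way to resolve a minid programmatically is through the name-to-thing resolver (n2t.net), which forwards to the minid landing-page service. In the sketch below, the identifier is made up, and the JSON field names are assumptions that may differ across minid service versions.

```python
# Minimal sketch: resolve a minid to its landing-page metadata.
# The identifier is a made-up example; JSON field names are assumed.
import requests

identifier = "ark:/57799/b9j01d"  # hypothetical minid in ARK form

resp = requests.get(
    "https://n2t.net/" + identifier,
    headers={"Accept": "application/json"},
    allow_redirects=True,
)
resp.raise_for_status()
info = resp.json()

# Typical landing-page fields include a title, checksum, and the
# location(s) where the referenced data can be fetched.
print(info.get("titles"), info.get("locations"))
```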
Bringing it all together
1. Query and discover data in the BDDS collection, an ERMrest catalog spanning PPMI, ADNI, and adenocarcinoma data (e.g., http://bit.ly/1M0h6Yx, http://bit.ly/A10R89y)
2. Transfer bags of input data (e.g., adrenal and brain samples)
3. Execute a parallel alignment workflow on dynamically provisioned cloud resources: run the workflow on each normal and tumor sample, then publish the QC, alignment, feature-count, and alignment-QC files as bags
4. Discover the published data and execute a comparison (differential expression) workflow
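Step 1 could look like the following query against an ERMrest entity API; the host, catalog number, schema, table, and column names here are hypothetical.

```python
# Minimal sketch: query an ERMrest catalog for datasets.
# Host, catalog number, schema, table, and column names are hypothetical.
import requests

BASE = "https://bdds.example.org/ermrest/catalog/1"

# ERMrest exposes tables via an entity API:
#   /entity/<schema>:<table>/<column>=<value>
url = BASE + "/entity/isa:dataset/study=PPMI"
resp = requests.get(url, headers={"Accept": "application/json"})
resp.raise_for_status()

for row in resp.json():
    print(row.get("id"), row.get("description"))
```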
Bringing it all together: Phenome-Wide Association Study (PheWAS)
1. Query and discover data, wherever it is (e.g., dbGaP and IDA), via the BDDS Data Catalog (a dynamic database)
2. Create bags
3. Query for specific genotype data (raw genetic data, alleles per subject)
4. Create new bags of derived data (processed genetic data)
5. Query for specific imaging information based on the derived genetic data (raw brain MRI data, processed MRI data)
6. Create new bags of derived data (processed imaging data)
7. Transfer bags of genetic data and brain MRI out for PheWAS analysis
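Steps 4 and 6 produce bags whose payload files may still live remotely. The sketch below builds such a "holey" bag using bdbag's remote-file-manifest support, so the data can be fetched on demand; all URLs, sizes, and checksums are made up.

```python
# Minimal sketch: create a bag that references remote derived data by
# URL and checksum (recorded in fetch.txt) rather than copying it in.
# All URLs, sizes, and checksums are illustrative.
import json
import os

from bdbag import bdbag_api

manifest = [
    {
        "url": "https://example.org/derived/subject42_alleles.tsv",
        "length": 204800,
        "filename": "subject42_alleles.tsv",
        "sha256": "0a1b2c...",  # checksum of the remote file
    }
]
with open("remote-file-manifest.json", "w") as f:
    json.dump(manifest, f)

os.makedirs("phewas_derived_bag", exist_ok=True)
bdbag_api.make_bag(
    "phewas_derived_bag",
    algs=["sha256"],
    remote_file_manifest="remote-file-manifest.json",
)
# A consumer can later materialize the payload, e.g.:
#   bdbag_api.resolve_fetch("phewas_derived_bag")
```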
Galaxy tools created
The following tools are being created:
- Tools to retrieve BDbags using minids
- Tools to expand BDbags into input datasets
- Tools to create BDbags of results, along with minids
- Tools to publish BDbags to a publication service
- Minids for Docker containers
- Minids for Galaxy workflows
Available at: http://bd2k.ini.usc.edu/tools/
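Once installed in a Galaxy instance, tools like these can also be driven from scripts via BioBlend. In the sketch below, the Galaxy URL, API key, tool ID, and minid are all placeholders; in particular, the tool ID is an assumption, not the published name of the BDDS wrapper.

```python
# Minimal sketch: invoke a bag-retrieval tool in Galaxy via BioBlend.
# URL, API key, tool ID, and minid are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR-API-KEY")
history = gi.histories.create_history(name="BDbag analysis")

# Run a (hypothetical) "retrieve BDbag by minid" tool; Galaxy expands
# the bag's payload into datasets in the new history.
result = gi.tools.run_tool(
    history_id=history["id"],
    tool_id="bdbag_fetch",  # assumed tool ID
    tool_inputs={"minid": "ark:/57799/b9j01d"},
)
print("Created datasets:", [d["id"] for d in result["outputs"]])
```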
Building the Commons: Review
- Globus: transfer, share, synchronize, and track data
- BDbag and minid: package and identify data for sharing
- Galaxy: scalable cloud-based analysis
Thank you to our supporters!
U.S. Department of Energy