Enhancements to Galaxy for delivering on NIH Commons

Presentation transcript:

Enhancements to Galaxy for delivering on NIH Commons
Ravi K Madduri

Outline
- NIH Commons
- NIH BD2K Center - BDDS
- Building blocks for NIH Commons
  - Data Management
  - Data Identification
  - Data Analysis
  - Data Publication

What is Commons?

NIH Commons
The Commons is a shared virtual space where scientists can work with the digital objects of biomedical research, i.e. it is a system that will allow investigators to find, manage, share, use and reuse data, software, metadata and workflows. It will be a complex ecosystem and thus the realization of the Commons will require the use, further development and harmonization of several components.
- A reference architecture
- A collection of best practices
- A policy

The Commons is a distributed system
[Diagram labels: NCI GDC, Cloud 1, Cloud 2, Data Commons 2, Bionimbus]

Building the Commons
https://datascience.nih.gov/commons
- A computing environment, such as the cloud or HPC resources, which support access, utilization, and storage of digital objects
- Public data sets that adhere to Commons Digital Object Compliance principles
- Software services and tools that enable:
  - Scalable provisioning of compute resources
  - Interoperability between digital objects within the Commons
  - Indexing and thus discoverability of digital objects
  - Sharing of digital objects between individuals or groups
  - Access to and deployment of scientific analysis tools and pipeline workflows
  - Connectivity with other repositories, registries and resources that support scholarly research


The Commons
“To meet the most basic level of compliance, it is expected that digital objects would have the following elements:
- Unique digital object identifiers
- A minimal set of searchable metadata
- Physical availability through a cloud-based Commons provider
- Clear access rules and controls (especially important for human subjects data)
- An entry (with metadata) in one or more indices”
https://datascience.nih.gov/commons
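
The five compliance elements quoted above can be read as a checklist. The sketch below is illustrative only: the `DigitalObject` class, its field names, and `missing_elements` are hypothetical and are not part of any NIH Commons specification or software. It simply reports which of the five elements a record lacks.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical record for a digital object; field names are assumptions."""
    identifier: str = ""                                # unique digital object identifier
    metadata: dict = field(default_factory=dict)        # minimal searchable metadata
    cloud_location: str = ""                            # where a Commons provider hosts it
    access_policy: str = ""                             # access rules and controls
    index_entries: list = field(default_factory=list)   # indices that list the object


def missing_elements(obj: DigitalObject) -> list:
    """Return the names of the basic compliance elements the object lacks."""
    checks = {
        "unique identifier": bool(obj.identifier),
        "searchable metadata": bool(obj.metadata),
        "cloud availability": bool(obj.cloud_location),
        "access rules": bool(obj.access_policy),
        "index entry": bool(obj.index_entries),
    }
    return [name for name, present in checks.items() if not present]
```

An empty result means all five elements are present; a real compliance check would also need to validate the content of each element, which this sketch does not attempt.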

The Big Data for Discovery Science Center (BDDS) - comprised of leading experts in biomedical imaging, genetics, proteomics, and computer science - is taking an "-ome to home" approach toward streamlining big data management, aggregation, manipulation, integration, and the modeling of biological systems across spatial and temporal scales.

Globus and the research data lifecycle
Transfer
1. Researcher initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
Share
3. Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
Publish
6. Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system
Discover
8. Peers, collaborators search and discover datasets; transfer and share using Globus
SaaS
- Only a web browser required
- Use storage system of your choice
- Access using your campus credentials
[Diagram labels: Compute Facility, Instrument, Personal Computer, Publication Repository]

BDbag: Packaging data for interchange
A packaging format for encapsulating
- Payload: arbitrary content
- Tags: metadata describing the payload
- Checksums: support verification of content

Bio_data_bag/
|-- data
|   \-- genomic
|       \-- 2a673.fastq
|-- manifest-md5.txt
|       afbfa231324812378123bfa data/genomic/2a673.fastq
|-- bagit.txt
        Contact-Name: John Smith
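
To make the payload / tags / checksums split concrete, here is a minimal sketch that writes a bag-like directory using only the Python standard library. It is not the BDBag tooling: `make_minimal_bag` is a hypothetical helper, and it assumes the BagIt convention that `Contact-Name` lives in `bag-info.txt` while `bagit.txt` carries the version declaration.

```python
import hashlib
from pathlib import Path


def make_minimal_bag(bag_dir, payload, contact_name):
    """Write payload files, an MD5 manifest, and tag files under bag_dir.

    payload maps a relative path (e.g. "genomic/2a673.fastq") to bytes.
    """
    bag = Path(bag_dir)
    bag.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        # Payload: arbitrary content stored under data/
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        # Checksums: one "<md5>  <path>" line per payload file
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")
    # Tags: bag declaration plus descriptive metadata
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag / "bag-info.txt").write_text(f"Contact-Name: {contact_name}\n")
    return bag
```

Calling `make_minimal_bag("Bio_data_bag", {"genomic/2a673.fastq": b"..."}, "John Smith")` would produce a layout similar to the tree on the slide; a production workflow would more likely use an existing BagIt or BDBag library than hand-written files.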

Minimal viable identifiers (minid)
- Every data item that you create can be automatically assigned a digital ID
- You can reference it, share it, resolve it
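
The assign / reference / resolve cycle can be sketched with an in-memory registry. This is a toy model and not the minid service or its client: the `IdentifierRegistry` class, the `example:` prefix, and the record fields (checksum, locations, title) are assumptions made for illustration.

```python
import hashlib
import uuid


class IdentifierRegistry:
    """Toy in-memory stand-in for an identifier service (hypothetical)."""

    def __init__(self):
        self._records = {}

    def register(self, content: bytes, location: str, title: str = "") -> str:
        """Assign a new identifier to a data item and store its record."""
        identifier = f"example:{uuid.uuid4().hex[:12]}"  # made-up identifier scheme
        self._records[identifier] = {
            "checksum": hashlib.sha256(content).hexdigest(),
            "locations": [location],
            "title": title,
        }
        return identifier

    def resolve(self, identifier: str) -> dict:
        """Return the stored record; raises KeyError for unknown identifiers."""
        return self._records[identifier]
```

Storing a checksum next to the location lets whoever resolves the identifier confirm that the file they fetched is the one that was registered.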

Resolve a minid

Bringing it all together
1. Query and discover data
2. Transfer bags
3. Execute parallel alignment workflow on dynamically provisioned cloud resources
   Run workflow on each normal and tumor sample (QC, alignment, feature count, alignment QC) and publish QC files, alignment file, and count file.
3. Publish bags
4. Discover published data and execute comparison workflow (differential expression)
[Diagram labels: BDDS Collection, ERMrest, PPMI, ADNI, Adenocarcinoma, http://bit.ly/1M0h6Yx, http://bit.ly/A10R89y, Adrenal, Brain, Alignment Files, Differential expression]

Bringing it all together: Phenome-Wide Association Study (PheWAS)
1. Query and discover data (wherever it is): dbGaP, IDA
2. Create bags
3. Query for specific genotype data
4. Create new bags of derived data
5. Query for specific imaging information based on the derived genetic data
6. Create new bags of derived data
7. Transfer bags out for PheWAS analysis
[Diagram labels: BDDS Data Catalog, Dynamic database, Alignment Files, Raw genetic data, Alleles per subject, Process genetic data, Raw Brain MRI data, Processed MRI data, Process imaging data, Genetic Data, Brain MRI]

Galaxy tools created
The following tools are being created:
- Tools to retrieve BDBags using minids
- Tools to expand BDBags into input datasets
- Tools to create BDBags of results along with minids
- Tools to publish BDBags into Publication Service
- Minids for Docker containers
- Minids for Galaxy workflows
Available at: http://bd2k.ini.usc.edu/tools/
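
As a rough idea of what "expanding a BDBag into input datasets" involves, the sketch below verifies each payload file against `manifest-md5.txt` and returns the verified paths. It is not the code of the Galaxy tools listed above: `expand_bag` is a hypothetical helper, it assumes an already-unpacked bag directory, and it uses only the standard library.

```python
import hashlib
from pathlib import Path


def expand_bag(bag_dir):
    """Verify payload files against the MD5 manifest and return their paths."""
    bag = Path(bag_dir)
    datasets = []
    for line in (bag / "manifest-md5.txt").read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Each manifest line is "<checksum> <relative path>"
        expected, rel_path = line.split(maxsplit=1)
        path = bag / rel_path
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        if actual != expected:
            raise ValueError(f"checksum mismatch for {rel_path}")
        datasets.append(path)
    return datasets
```

A tool wrapper could then hand each returned path to the analysis platform as an input dataset; failing fast on a checksum mismatch keeps corrupted files out of downstream workflows.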

Building the Commons: Review
- Transfer, share, synchronize, track data
- Package and identify data for sharing
- Scalable cloud-based analysis
[Diagram label: BDbag]

Thank you to our supporters! U.S. DEPARTMENT OF ENERGY