NCI’s Genomics Data Commons (GDC) & NCI Cloud Pilots

Presentation transcript:

NCI’s Genomics Data Commons (GDC) & NCI Cloud Pilots
Tanja Davidsen, PhD
NCI Center for Biomedical Informatics and IT
National Cancer Institute
March 2017

The NCI Genomic Data Commons
Provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
One of the NCI resources supporting this vision is the NCI Genomic Data Commons, which will provide…

The NCI Genomic Data Commons
Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies.
Available data: NCI-funded cancer genomics datasets and user submissions
Data searching and retrieval/downloading
Harmonization of raw sequence data (alignment and variant calling) for all GDC data
Application of state-of-the-art methods of generating derived data
Developed, supported, and hosted by U. Chicago
The GDC achieves this goal as a knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs.
Genomic Data Commons: https://gdc.cancer.gov/

NCI Genomic Data Commons: a unified data repository for the research community
[Diagram: Researchers; NCI Genomic Data Commons; Data Storage; Retrieval, Submission, & Harmonization]

NCI Genomic Data Commons
The GDC went live on June 6, 2016 with approximately 4.1 PB of data. This includes:
2.6 PB of legacy data
1.5 PB of “harmonized” data
577,878 files covering 14,194 cases (patients), in 42 cancer types, across 29 primary sites
10 major data types, including Raw Sequencing Data, Raw Microarray Data, Copy Number Variation, Simple Nucleotide Variation, and Gene Expression
Data are derived from 17 different experimental strategies, the major ones being RNA-Seq, WXS, WGS, miRNA-Seq, Genotyping Array, and Expression Array.
Foundation Medicine announced the release of 18,000 genomic profiles to the GDC at the Cancer Moonshot Summit.

GDC: Data Submission & Harmonization
[Diagram: Data Harmonization]
https://gdc.cancer.gov/

GDC: Data Retrieval
GDC Website
Data Transfer Tool
Data Portal
Visualization Tools
Legacy Archive
GDC API
Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Labeled parts of the request: API URL, Endpoint, URL parameters, Query parameters
Response (excerpt):
{ "data": { "hits": [
{"project_id": "TCGA-SKCM", "primary_site": "Skin"},
{"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
{"project_id": "TCGA-LAML", "primary_site": "Blood"},
{"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
{"project_id": "TCGA-UVM", "primary_site": "Eye"}
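The request shown on this slide can be sketched in Python as follows. This is a minimal illustration, not the GDC's own client: the helper names `build_projects_url` and `extract_hits` are hypothetical, the host is the one printed on the slide (it may have changed since 2017), and the sample response is the slide's excerpt with closing brackets added so that it parses. No network call is made.

```python
import json
from urllib.parse import urlencode

# Host as printed on the slide; treat it as an assumption, it may have moved.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_projects_url(fields, pretty=True):
    """Build a URL for the 'projects' endpoint that requests only the given fields."""
    query = {"fields": ",".join(fields)}
    if pretty:
        query["pretty"] = "true"
    # safe="," keeps the comma-separated field list unescaped, as on the slide.
    return f"{GDC_API_BASE}/projects?{urlencode(query, safe=',')}"


def extract_hits(response_text):
    """Return (project_id, primary_site) pairs from a projects response body."""
    payload = json.loads(response_text)
    return [(hit["project_id"], hit["primary_site"]) for hit in payload["data"]["hits"]]


# Slide excerpt, with closing brackets added here so the text is valid JSON.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
  {"project_id": "TCGA-UVM", "primary_site": "Eye"}
]}}
"""

if __name__ == "__main__":
    print(build_projects_url(["project_id", "primary_site"]))
    for project_id, primary_site in extract_hits(SAMPLE_RESPONSE):
        print(project_id, primary_site)
```

To query the live service, the URL from `build_projects_url` could be passed to any HTTP client and the response body handed to `extract_hits`.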

Content in the Genomic Data Commons
TCGA: 11,353 cases
TARGET: 3,178 cases
Current
~58,000 cases
Foundation Medicine: 18,000 cases
Cancer studies in dbGaP: ~4,000 cases
MMRF: ~1,000 cases
Coming soon
NCI-MATCH: ~3,000 cases
Clinical Trial Sequencing Program: ~3,000 cases
Planned (1-3 years)
Cancer Driver Discovery Program: ~5,000 cases
Human Cancer Model Initiative: ~1,000 cases
APOLLO – VA and DoD: ~8,000 cases
GDC launched with two of the major NCI genomic data sets, TCGA and TARGET.
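As a quick consistency check on the figures above, the per-program case counts can be summed. The sketch assumes that the "~58,000 cases" figure is the total across all listed programs, which the slide text does not state explicitly; the approximate counts are taken at face value.

```python
# Case counts as listed on the slide (values marked "~" are approximate).
CASE_COUNTS = {
    "TCGA": 11_353,
    "TARGET": 3_178,
    "Foundation Medicine": 18_000,
    "Cancer studies in dbGaP": 4_000,
    "MMRF": 1_000,
    "NCI-MATCH": 3_000,
    "Clinical Trial Sequencing Program": 3_000,
    "Cancer Driver Discovery Program": 5_000,
    "Human Cancer Model Initiative": 1_000,
    "APOLLO (VA and DoD)": 8_000,
}


def total_cases(counts):
    """Sum the case counts across all programs."""
    return sum(counts.values())


if __name__ == "__main__":
    print(total_cases(CASE_COUNTS))  # 57531, which rounds to ~58,000
```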

The NCI Cancer Genomics Cloud Pilots
Understanding how to meet the research community’s need to analyze large-scale cancer genomic and clinical data

NCI Cancer Genomics Cloud Pilots
Cloud Pilots provide:
Access to large genomic data sets without the need to download
Access to popular pipelines and visualization tools
Ability for researchers to bring their own tools and pipelines to the data
Ability for researchers to bring their own data and analyze it in combination with NCI genomic data
Workspaces for researchers to save and share their data and results of analyses
Democratize access to NCI-generated genomic and related data, and create a cost-effective way to provide scalable computational capacity to the cancer research community.
These pilots were initiated two years ago, based on our awareness that the traditional model of data download and management by every research group was no longer scalable. The NCI wanted to explore the effectiveness of co-locating data and compute in a cloud environment for access and analysis. The overall goal is to…

GDC/Cloud Pilot Ecosystem
[Diagram: Researchers; Cancer Genomic Data; NCI Genomic Data Commons (Data Storage; Retrieval, Submission, & Harmonization); NCI Cloud Pilots: Broad FireCloud, ISB CGC, SBG CGC (Data Compute; Analysis, Workflows, & Pipelines)]

Three NCI Genomics Cloud Pilots
Broad Institute (PI: Gad Getz): Google Cloud; Firehose in the cloud, including Broad best practices workflows; http://firecloud.org
Institute for Systems Biology (PI: Ilya Shmulevich): Leverage Google infrastructure; novel query and visualization; http://cgc.systemsbiology.net/
Seven Bridges Genomics (PI: Deniz Kural): Amazon Web Services; interactive data exploration; > 30 public pipelines; http://www.cancergenomicscloud.org

Broad Institute Cloud Pilot
Targeted at users performing analyses at scale.
Modeled after the Broad’s Firehose analysis infrastructure, developed for the TCGA program.
Users can upload their own data and tools and/or run the Broad’s best practice tools and pipelines on pre-loaded data.

Institute for Systems Biology Cloud Pilot
Closely tied with Google Cloud Platform tools, including BigQuery, App Engine, Cloud Datalab, Google Genomics, and Compute Engine
Level-3 TCGA data in BigQuery allows fast SQL-like queries across the entire dataset
Web interface allows scientists to interactively compare and define cohorts
[Diagram: user types: PI / Biologist (web access); Computational Research Scientist (Python, R, SQL); Algorithm Developer (ssh, programmatic access). Components: ISB-CGC Web App, Google Cloud Console, Google APIs, ISB-CGC APIs, Compute Engine VMs, Cloud Storage, BigQuery, Genomics, Local Storage. Data: ISB-CGC Hosted Data, Controlled-Access Data, Open-Access Data, User Data]
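The idea of defining a cohort with a SQL-like query can be illustrated with a small, self-contained sketch. It does not use BigQuery, which needs a Google Cloud project and credentials; Python's built-in sqlite3 stands in for it. The table name `clinical`, its columns, and all rows are made up for illustration and do not reflect the actual ISB-CGC schema.

```python
import sqlite3

# Made-up rows for illustration only: (case_id, project_id, primary_site, age_at_diagnosis).
TOY_ROWS = [
    ("case-001", "TCGA-SKCM", "Skin", 54),
    ("case-002", "TCGA-SKCM", "Skin", 41),
    ("case-003", "TCGA-LAML", "Blood", 67),
    ("case-004", "TCGA-SKCM", "Skin", 72),
]


def build_demo_db():
    """Create an in-memory database with a hypothetical 'clinical' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clinical ("
        "case_id TEXT, project_id TEXT, primary_site TEXT, age_at_diagnosis INTEGER)"
    )
    conn.executemany("INSERT INTO clinical VALUES (?, ?, ?, ?)", TOY_ROWS)
    return conn


def define_cohort(conn, project_id, min_age):
    """Return case IDs in one project at or above a minimum age at diagnosis."""
    sql = (
        "SELECT case_id FROM clinical "
        "WHERE project_id = ? AND age_at_diagnosis >= ? "
        "ORDER BY case_id"
    )
    return [row[0] for row in conn.execute(sql, (project_id, min_age))]


if __name__ == "__main__":
    conn = build_demo_db()
    print(define_cohort(conn, "TCGA-SKCM", 50))
```

Against the hosted data, the same kind of SELECT statement would be submitted through a BigQuery client rather than sqlite3, with table and column names taken from the ISB-CGC documentation.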

Seven Bridges Genomics Cloud Pilot
Built upon the SBG commercial cloud-based genomics platform
Graphical query interface to identify hosted data of interest
Includes a native implementation of the Common Workflow Language specification for creating user-defined workflows
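For readers unfamiliar with the Common Workflow Language, a tool description is a small structured document. The sketch below builds a minimal `CommandLineTool` that wraps `echo` and serializes it as JSON, which CWL accepts because JSON is a subset of YAML. The choice of `cwlVersion` `v1.0` is an assumption, since the slide does not say which CWL version the platform implements, and the helper names are hypothetical.

```python
import json


def echo_tool():
    """Return a minimal CWL CommandLineTool description that wraps `echo`."""
    return {
        "cwlVersion": "v1.0",        # assumed version, not stated in the slides
        "class": "CommandLineTool",
        "baseCommand": "echo",
        "inputs": {
            "message": {
                "type": "string",
                "inputBinding": {"position": 1},  # first positional argument
            }
        },
        "outputs": [],               # this toy tool declares no output files
    }


def to_cwl_json(tool):
    """Serialize a tool description to JSON text."""
    return json.dumps(tool, indent=2)


if __name__ == "__main__":
    print(to_cwl_json(echo_tool()))
```

A user-defined workflow would chain several such tool descriptions in a document of class `Workflow`, connecting the outputs of one step to the inputs of the next.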

Timeline & Extension
Phases: Selection, Design/Build I, Design/Build II, Evaluation, Extension
Dates shown: Jan 2014, Sept 2014, April 2015, Jan 2016, Sept 2016
One-year contract extension for all three NCI Cloud Pilots
Continue to make all current tools and data available for an additional year
Build on the tools/analyses available
Continue to make the platform, pipelines, and tools portable (Dockerization; workflow languages: CWL/WDL)
New datasets added (including pediatric cancer data)
New data types added: proteomics, imaging data, multiple genome builds
Overall ~2.5 PB of data in extension

Community Evaluation of the Cloud Pilots
Cloud Credits: Storage and compute credits are available to researchers through a tiered system to use the Cloud Pilots.
Grant supplements: Support NCI grantees to serve as beta-testers and conduct genomic analysis relevant to their research on one or more Cloud Pilots.
DREAM Challenge: Crowd-based competition to identify the optimal methods for detecting and quantifying mRNA fusions and isoforms from RNA-Seq data.