NCI Cloud Pilot Collaboration Meeting

Presentation transcript:

GDC API and Pipelines
NCI Cloud Pilot Collaboration Meeting
August 10, 2015

Agenda
- GDC API
  - Overview
  - Authentication
  - Entity Endpoints
  - Download Endpoint
- GDC Pipelines
  - GRCh38 Reference Genome
  - BWA (DNA-Seq)

GDC API Overview
- REST API for programmatically interfacing with the GDC to query and download data
- Drives the current data portal
- Current features include:
  - Search an indexed view of the GDC data model for projects, files, cases, and annotations
  - Gather details about a project, file, case, or annotation
  - Download files
- Example URL: https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state
  - API URL: https://gdc-api.nci.nih.gov
  - Endpoint: files
  - Optional entity ID: 5003adf1-1cfd-467d-8234-0d396422a4ee
  - Query parameters: fields=state
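The four URL parts named on this slide can be assembled mechanically. The following is a minimal sketch using only the Python standard library; the function name `build_gdc_url` is hypothetical and not part of any GDC client, and the base URL is the one shown on the slide.

```python
from urllib.parse import urlencode

# API URL as shown on the slide (it may have changed since 2015).
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_gdc_url(endpoint, entity_id=None, params=None):
    """Join API URL, endpoint, optional entity ID, and query parameters."""
    parts = [GDC_API_BASE, endpoint.strip("/")]
    if entity_id:
        parts.append(entity_id)
    url = "/".join(parts)
    if params:
        # urlencode percent-encodes values and joins them with '&'.
        url += "?" + urlencode(params)
    return url
```

Calling `build_gdc_url("files", "5003adf1-1cfd-467d-8234-0d396422a4ee", {"fields": "state"})` reproduces the example URL from the slide.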

GDC API: Endpoints
- Query and retrieve: expose GDC data through a query mechanism that returns JSON output
- Six endpoints available:
  - Four search and retrieval endpoints, which return JSON output: /projects, /cases, /files, /annotations
  - One for download, which returns a single file or a tar.gz archive: /data
  - One for reporting the API status: /status

GDC API: Token-Based Authentication
- A browser must be used first to obtain a token
- No direct API access due to limitations of the current SAML protocol; see https://wiki.shibboleth.net/confluence/display/CONCEPT/ECP
- To obtain a token:
  - Log in to the GDC portal using your eRA Commons account
  - After login, the option to download a token appears under your username in the upper right
- The token is provided to the API using the X-Auth-Token header
[Screenshots: eRA Commons Login; GDC Token]
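Once a token has been downloaded from the portal, attaching it to a request is a matter of setting one header. The sketch below assumes the token has been saved to a local text file; the helper names `read_token` and `authenticated_request` are hypothetical. It only builds the request object and does not contact the API.

```python
import urllib.request


def read_token(path):
    """Read a token previously downloaded from the GDC portal."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


def authenticated_request(url, token):
    """Return a Request that carries the token in the X-Auth-Token header."""
    request = urllib.request.Request(url)
    request.add_header("X-Auth-Token", token.strip())
    return request
```

Passing the returned object to `urllib.request.urlopen` would send the request with the header attached.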

GDC API: Search and Retrieval Endpoints
- Utilizes Elasticsearch to provide an indexed view of the GDC data model
- For each endpoint type, a walk is done on the graph to create a nested JSON document
- Each endpoint has a "_mapping" function for obtaining the current document structure, including query fields and their types
- Example: https://gdc-api.nci.nih.gov/projects/_mapping

GDC API: Query by Filtering
- Direct API querying is done by using the "filters" parameter
- The portal provides 'GQL', a more human-friendly syntax that gets translated into filters
- Single-field example:
    {"op": "=", "content": {"field": "cases.clinical.gender", "value": ["male"]}}
- Multi-field example:
    {"op": "and", "content": [
      {"op": "=", "content": {"field": "cases.clinical.gender", "value": "female"}},
      {"op": "=", "content": {"field": "files.platform", "value": "Affymetrix SNP Array 6.0"}}
    ]}
- The full URL: https://gdc-api.nci.nih.gov/files?filters={"op":"=","content":{"field":"cases.clinical.gender","value":["male"]}}
- Single-field operators: =, !=, <, <=, >, >=, in, is, not, range, exclude
- Multi-field operators: and, or
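Writing nested filter JSON by hand is error-prone, so it can help to build it from small functions and let the library handle URL encoding. This is a minimal sketch; the helper names are hypothetical, and the base URL and field names are taken from the slide.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse


def field_filter(op, field, value):
    """Single-field filter, e.g. op '=' on 'cases.clinical.gender'."""
    return {"op": op, "content": {"field": field, "value": value}}


def combine(op, *filters):
    """Multi-field filter: op is 'and' or 'or', content is a list of filters."""
    if op not in ("and", "or"):
        raise ValueError("multi-field operator must be 'and' or 'or'")
    return {"op": op, "content": list(filters)}


def files_query_url(filters, base="https://gdc-api.nci.nih.gov"):
    """Serialize the filter compactly and percent-encode it into the URL."""
    encoded = urlencode({"filters": json.dumps(filters, separators=(",", ":"))})
    return base + "/files?" + encoded


def filters_from_url(url):
    """Decode the 'filters' parameter again, useful for checking a URL."""
    query = parse_qs(urlparse(url).query)
    return json.loads(query["filters"][0])
```

For example, `combine("and", field_filter("=", "cases.clinical.gender", "female"), field_filter("=", "files.platform", "Affymetrix SNP Array 6.0"))` builds the multi-field filter, with each condition wrapped in its own single-field filter.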

GDC API: Search and Retrieval Endpoint Parameters
All of the entity endpoints take the same query string parameters:
- facets: fields for which to include a document count
- fields: fields to include in the response; _mapping reports the defaults used if none are specified
- filters: as described on the previous slide
- from: the first record to return, for pagination
- size: number of results to return
- sort: field to sort by
- pretty: prettify the JSON response
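The from and size parameters together allow a large result set to be fetched page by page. The sketch below generates one query string per page. The function name `page_queries` is hypothetical, and the default of 0 for the first record is an assumption; the slide does not say whether `from` counts from 0 or 1, so check this against the API before relying on it.

```python
from urllib.parse import urlencode


def page_queries(total, size, first=0, **extra):
    """Yield one query string per page of `size` records covering `total` records.

    `first` is the `from` value of the first record (0 is an assumption).
    Extra keyword arguments (fields, sort, ...) are repeated on every page.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    end = first + total
    for start in range(first, end, size):
        params = dict(extra)
        params["from"] = start  # 'from' is a Python keyword, so set it by key
        params["size"] = min(size, end - start)
        yield urlencode(params)
```

Each yielded string can be appended after the `?` of an entity endpoint URL such as /files.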

GDC API: Sample Call (projects endpoint)
List projects or retrieve details about a specific project.

Retrieving a list of projects
Example: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&facets=primary_site&pretty=true
    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ],
        "aggregations": {
          "primary_site": {
            "buckets": [
              {"key": "Blood", "doc_count": 6},
              {"key": "Kidney", "doc_count": 6},
              // Portion removed for readability
    }}}

Retrieving project-specific details
Example: https://gdc-api.nci.nih.gov/projects/TCGA-BRCA?fields=name,summary.case_count&pretty=true
    {
      "data": {
        "name": "Breast Invasive Carcinoma",
        "summary": {"case_count": 1101}
      },
      "warnings": {}
    }
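A response of this shape can be unpacked with the standard json module. The following sketch works on an abridged copy of the list response shown on the slide, completed so that it parses; the helper names are hypothetical, and the layout (hits and aggregations both under "data") is taken from the slide.

```python
import json

# Abridged copy of the slide's list response, closed off so it is valid JSON.
SAMPLE_RESPONSE = """
{"data": {
   "hits": [
     {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
     {"project_id": "TCGA-LAML", "primary_site": "Blood"},
     {"project_id": "TARGET-AML", "primary_site": "Blood"}
   ],
   "aggregations": {"primary_site": {"buckets": [
     {"key": "Blood", "doc_count": 6},
     {"key": "Kidney", "doc_count": 6}
   ]}}
 },
 "warnings": {}}
"""


def projects_by_site(response):
    """Group the project IDs in 'hits' by their primary site."""
    grouped = {}
    for hit in response["data"]["hits"]:
        grouped.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return grouped


def facet_counts(response, field):
    """Map each facet bucket key to its document count."""
    buckets = response["data"]["aggregations"][field]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


response = json.loads(SAMPLE_RESPONSE)
```

Note that the hits only cover the returned page, while the facet counts come from the aggregations requested with the facets parameter.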

GDC API: Data Endpoint
- Streams a single file, or a gzipped archive (tar.gz) of several files, back to the user
- Accepts one or more UUIDs (comma separated)
- A token, provided in the X-Auth-Token header, is required to access restricted files

Open-access file (no authentication needed)
    {
      "origin": "migrated",
      "data_type": "Raw microarray data",
      "platform": "MDA_RPPA_Core",
      "file_name": "Collagen_VI-R-V_GBL1112757.tif",
      "md5sum": "68d1edc2b7fda0c7c97d67b7b617a1f2",
      "data_format": "TIF",
      "acl": "open",
      "access": "open",
      "uploaded_datetime": 1425340539,
      "state": "live",
      "data_subtype": "Raw intensities",
      "file_id": "6eb0e7f2-f0a6-420a-9511-9fef295c653e",
      "file_size": 6273772,
      "experimental_strategy": "Protein expression array"
    }

Restricted-access file (authentication needed)
    {
      "data_type": "Simple nucleotide variation",
      "platform": "Affymetrix SNP Array 6.0",
      "file_name": "DUNGS_p_TCGA_b84_115_SNP_N_GenomeWideSNP_6_C09_771624.birdseed.data.txt",
      "md5sum": "a2a8e75e08dec27035f4af89c81f6e08",
      "data_format": "TXT",
      "acl": "phs000178",
      "access": "protected",
      "data_subtype": "Genotypes",
      "file_id": "178c2af5-181a-4312-b45e-0320b6daefb1",
      "file_size": 20851964,
      "experimental_strategy": "Genotyping array"
    }
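Before downloading, a client can use the "access" field of each file record to decide whether a token must be sent. A minimal sketch under that assumption follows; the helper names are hypothetical, and treating a record without an "access" field as restricted is a cautious choice made here, not documented behavior.

```python
def data_url(file_ids, base="https://gdc-api.nci.nih.gov"):
    """Build a /data URL for one or more comma-separated file UUIDs."""
    if not file_ids:
        raise ValueError("at least one file UUID is required")
    return base + "/data/" + ",".join(file_ids)


def needs_token(file_record):
    """True unless the record is explicitly marked as open access."""
    return file_record.get("access") != "open"


# Trimmed versions of the two file records shown on the slide.
open_file = {"file_id": "6eb0e7f2-f0a6-420a-9511-9fef295c653e", "access": "open"}
protected_file = {"file_id": "178c2af5-181a-4312-b45e-0320b6daefb1", "access": "protected"}
```

A request for both files at once would need the token, since one of them is protected.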

API References and GDC Resources
- GDC Web Site (contains User's Guides)
  - https://gdc.nci.nih.gov
- GDC Data Portal
  - https://gdc.nci.nih.gov/access-data/about-gdc-data-portal
  - https://gdc-portal.nci.nih.gov
- GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
Questions?

GDC Alignment Pipeline Overview
- GDC reference based on GRCh38p0-no-alt
- DNA-Seq Alignment Pipeline
  - BAM Validation and Convert to FASTQ
  - Alignment and QC by Read Groups
  - BAM Merge and Validation
- RNA-Seq Alignment Pipeline

GDC Reference Genome

BAM Validation and Convert to FASTQ

Alignment and QC by Read Groups

QC Metrics
Alignment metrics collected:
- Samtools flagstat
- Samtools stats
- Samtools idxstats
- Picard Tools CollectWgsMetrics / CollectHsMetrics
- Picard Tools CollectMultipleMetrics
  - CollectAlignmentSummaryMetrics
  - CollectInsertSizeMetrics (and histogram)
  - QualityScoreDistribution (and histogram)
  - MeanQualityByCycle (and histogram)
  - CollectBaseDistributionByCycle
  - CollectGcBiasMetrics (and summary)
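As an illustration of the Samtools part of this list, the three subcommands can each be run against an aligned BAM file. The sketch below only builds the argument lists and does not execute anything; the function name is hypothetical, and this is not the GDC pipeline's actual invocation, which the slides do not show.

```python
def samtools_qc_commands(bam_path):
    """Return one argument list per Samtools QC subcommand named on the slide."""
    subcommands = ("flagstat", "stats", "idxstats")
    return [["samtools", subcommand, bam_path] for subcommand in subcommands]


commands = samtools_qc_commands("sample.bam")  # 'sample.bam' is a placeholder path
```

Each list could be handed to `subprocess.run` with its output captured to a file. Note that `samtools idxstats` expects the BAM file to be indexed.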

BAM Merge and Validation

Pipelines in the Works
- RNA-Seq: based on the ICGC STAR 2-pass pipeline; QC metrics still TBD
- miRNA: based on the BWA-based pipeline created by BCCA for TCGA

Discussion and Questions?