APIs for the Biomedical Commons Cloud and NCI Genomic Data Commons
Robert L. Grossman and Allison Heath
University of Chicago and Open Cloud Consortium
September 3, 2015

Overview 2
The NCI Genomic Data Commons (GDC) is a cloud-based computing platform that will aggregate, harmonize, and share cancer genomics data and associated clinical information for NIH/NCI-funded researchers.
The GDC is now in beta testing and contains about 2 PB of cancer genomics data.
The GDC is one of the NIH Commons and the one furthest along in development. The National Cancer Institute (NCI) is the largest NIH institute by budget.
The GDC is open source, and we are using the same software stack and associated services to build other commons for biology, medicine, and health care, as well as for other sciences, including environmental science.
The Biomedical Data Commons (BDC), developed and operated by the not-for-profit Open Cloud Consortium (OCC), will span multiple medical research centers.
We are interested in prototyping an international biomedical commons that interoperates through APIs developed by the international research community, including GA4GH.

Proposed Software Stack 3
–Software-defined networking
–Various IaaS, PaaS, and DiCOS platforms that can run containers and virtual machines
–APIs for access to genomic and clinical data with AAA controls
–A genomic application with clinical significance, such as clustering RNA-seq data for cancer patients, deliverable via a container
–3-D visualization using RStudio

Clustering of RNA-Seq Data for Cancer Genomics 4
We have developed a preliminary pipeline for clustering RNA-seq cancer data.

Proposed Sites 5
Chicago, Taiwan, Amsterdam

GDC Data Access APIs 6

GDC API Overview 7
REST API for programmatically interfacing with the GDC to query and download data
–Drives the current data portal
Current features include:
–Search an indexed view of the GDC data model for projects, files, cases, and annotations
–Gather details about a project, file, case, or annotation
–Download files
Anatomy of a request:
https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state
–API URL: https://gdc-api.nci.nih.gov
–Endpoint: /files
–Optional entity ID: 5003adf1-1cfd-467d-8234-0d396422a4ee
–Query parameters: ?fields=state
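The URL anatomy on this slide can be sketched in a few lines of Python. This is an illustration only: the helper name `build_url` is hypothetical and not part of any GDC client; only the base URL, endpoint, entity ID, and `fields` parameter come from the slide.

```python
from urllib.parse import urlencode

# Base URL as shown on the slide.
API_URL = "https://gdc-api.nci.nih.gov"


def build_url(endpoint, entity_id=None, **params):
    """Join API URL + endpoint + optional entity ID + query parameters."""
    parts = [API_URL, endpoint.strip("/")]
    if entity_id:
        parts.append(entity_id)
    url = "/".join(parts)
    if params:
        url += "?" + urlencode(params)
    return url


# Rebuild the example request from the slide.
example = build_url(
    "files", "5003adf1-1cfd-467d-8234-0d396422a4ee", fields="state"
)
print(example)
```

Leaving out the entity ID yields a search-style URL for the whole endpoint, for example `build_url("projects")`.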

GDC API: Endpoints 8
Query and retrieve: expose GDC data through a query mechanism that returns JSON output
Six endpoints available:
–Four search and retrieval endpoints, which return JSON output: /projects, /cases, /files, /annotations
–One for download, which returns a single file or a tar.gz archive: /data
–One for reporting the API status: /status

GDC API: Token-Based Authentication 9
A browser must be used first to obtain a token
–No direct API access due to limitations of the current SAML protocol; see https://wiki.shibboleth.net/confluence/display/CONCEPT/ECP
To obtain a token:
–Log in to the GDC portal using your eRA Commons account
–After login, the option to download a token appears under your username in the upper right
–Provide the token to the API using the X-Auth-Token header
Flow: eRA Commons login, then GDC token
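As a rough sketch of the last step, a downloaded token can be attached to a request with Python's standard library. The function name `authed_request` and the placeholder token are hypothetical; only the `X-Auth-Token` header name comes from the slide. The request is built but not sent.

```python
import urllib.request


def authed_request(url, token):
    """Build (but do not send) a request carrying the GDC token."""
    req = urllib.request.Request(url)
    # Strip the trailing newline that a token saved to a file usually has.
    req.add_header("X-Auth-Token", token.strip())
    return req


# Placeholder token text; a real token is downloaded from the portal.
req = authed_request("https://gdc-api.nci.nih.gov/files", "example-token\n")

# urllib stores header names capitalized ("X-auth-token"); HTTP header
# names are case-insensitive, so the server sees the same header.
print(req.header_items())
```

Passing `req` to `urllib.request.urlopen` would perform the call.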

GDC API: Search and Retrieval Endpoints 10
Uses Elasticsearch to provide an indexed view of the GDC data model
–For each endpoint type, a walk is done on the graph to create a nested JSON document
Each endpoint has a "_mapping" function for obtaining the current document structure, including query fields and their types
–https://gdc-api.nci.nih.gov/projects/_mapping

GDC API: Query by Filtering 11
Direct API querying is done by using the "filters" parameter
The portal provides 'GQL', a more human-friendly syntax that gets translated into these filters
Single-field example:
  {
    "op": "=",
    "content": {
      "field": "cases.clinical.gender",
      "value": ["male"]
    }
  }
Multi-field example:
  {
    "op": "and",
    "content": [
      {
        "op": "=",
        "content": {
          "field": "cases.clinical.gender",
          "value": "female"
        }
      },
      {
        "op": "=",
        "content": {
          "field": "files.platform",
          "value": "Affymetrix SNP Array 6.0"
        }
      }
    ]
  }
The full URL:
https://gdc-api.nci.nih.gov/files?filters={"op":"=","content":{"field":"cases.clinical.gender","value":["male"]}}
Single-field operators: =, !=, <, <=, >, >=, in, is, not, range, exclude
Multi-field operators: and, or
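The filter documents above are plain nested dictionaries, so they can be assembled programmatically and then URL-encoded. The sketch below assumes the `filters` query parameter used in the full URL on this slide; the helper names `eq`, `all_of`, `files_url`, and `decode_filters` are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def eq(field, value):
    """Single-field '=' clause in the shape shown on the slide."""
    return {"op": "=", "content": {"field": field, "value": value}}


def all_of(*clauses):
    """Multi-field 'and' clause wrapping other clauses."""
    return {"op": "and", "content": list(clauses)}


def files_url(filters):
    """Percent-encode the filter JSON into the 'filters' parameter."""
    compact = json.dumps(filters, separators=(",", ":"))
    return "https://gdc-api.nci.nih.gov/files?" + urlencode({"filters": compact})


def decode_filters(url):
    """Recover the filter document from a URL built by files_url."""
    return json.loads(parse_qs(urlsplit(url).query)["filters"][0])


single = eq("cases.clinical.gender", ["male"])
multi = all_of(
    eq("cases.clinical.gender", "female"),
    eq("files.platform", "Affymetrix SNP Array 6.0"),
)
print(files_url(single))
```

Encoding the JSON matters in practice: braces, quotes, and spaces are not safe to send raw in a query string, even though the slide shows them unencoded for readability.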

GDC API: Search and Retrieval Endpoint Parameters 12
All of the entity endpoints take the same query string parameters:
–facets: fields for which to include a document count
–fields: fields to include in the response; _mapping will report the defaults if none are specified
–filters: as described on the previous slide
–from: the first record to return, for pagination
–size: number of results to return
–sort: a field to sort by
–pretty: prettify the JSON response
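A minimal sketch of how these parameters might be combined into a query string. The function name `search_params` is hypothetical, and joining multiple facets with commas is an assumption (the slides only show comma-separated `fields`). The `from` value is passed through unchanged because the slides do not say whether the first record is numbered 0 or 1.

```python
from urllib.parse import urlencode


def search_params(fields=(), facets=(), start=None, size=None,
                  sort=None, pretty=False):
    """Build the query string shared by the search endpoints."""
    params = {}
    if fields:
        params["fields"] = ",".join(fields)
    if facets:
        # Assumption: multiple facets are comma-separated like fields.
        params["facets"] = ",".join(facets)
    if start is not None:
        params["from"] = start
    if size is not None:
        params["size"] = size
    if sort:
        params["sort"] = sort
    if pretty:
        params["pretty"] = "true"
    # Keep commas literal so the result matches the URLs on the slides.
    return urlencode(params, safe=",")


query = search_params(
    fields=["project_id", "primary_site"],
    facets=["primary_site"],
    pretty=True,
)
print("https://gdc-api.nci.nih.gov/projects?" + query)
```

Paging through results then amounts to repeating the call with `start` advanced by `size` each time.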

GDC API: Sample Call (projects endpoint) 13
List projects or retrieve details about a specific project
Retrieving a list of projects
–Example: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&facets=primary_site&pretty=true
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ],
      "aggregations": {
        "primary_site": {
          "buckets": [
            {"key": "Blood", "doc_count": 6},
            {"key": "Kidney", "doc_count": 6},
            // Portion removed for readability
          ]
        }
      }
    }
  }
Retrieving project-specific details
–Example: https://gdc-api.nci.nih.gov/projects/TCGA-BRCA?fields=name,summary.case_count&pretty=true
  {
    "data": {
      "name": "Breast Invasive Carcinoma",
      "summary": {
        "case_count": 1101
      }
    },
    "warnings": {}
  }
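To show how a client might consume the list response, here is a small sketch that parses an abridged copy of the JSON above (three hits and the two visible buckets) and pulls out the project IDs and facet counts. The function name `summarize` is hypothetical; the key names `data`, `hits`, `aggregations`, `buckets`, `key`, and `doc_count` are taken from the sample response.

```python
import json

# Abridged copy of the sample response from the slide.
SAMPLE = """
{"data": {
  "hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"}
  ],
  "aggregations": {"primary_site": {"buckets": [
    {"key": "Blood", "doc_count": 6},
    {"key": "Kidney", "doc_count": 6}
  ]}}
}}
"""


def summarize(payload, facet):
    """Return (project IDs, {facet value: document count})."""
    data = payload["data"]
    ids = [hit["project_id"] for hit in data.get("hits", [])]
    buckets = data.get("aggregations", {}).get(facet, {}).get("buckets", [])
    counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
    return ids, counts


project_ids, site_counts = summarize(json.loads(SAMPLE), "primary_site")
print(project_ids)
print(site_counts)
```

Note that the facet counts cover the whole index, not just the hits on the current page, which is why "Blood" reports 6 while only two of the listed hits have that primary site.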

GDC API: Data Endpoint 14
Streams a single file, or multiple files as a gzipped tar archive (tar.gz), back to the user
Accepts one or more UUIDs (comma separated)
A token, provided in the X-Auth-Token header, is required to access restricted files
Open-access file (no authentication needed):
  {
    "origin": "migrated",
    "data_type": "Raw microarray data",
    "platform": "MDA_RPPA_Core",
    "file_name": "Collagen_VI-R-V_GBL1112757.tif",
    "md5sum": "68d1edc2b7fda0c7c97d67b7b617a1f2",
    "data_format": "TIF",
    "acl": "open",
    "access": "open",
    "uploaded_datetime": 1425340539,
    "state": "live",
    "data_subtype": "Raw intensities",
    "file_id": "6eb0e7f2-f0a6-420a-9511-9fef295c653e",
    "file_size": 6273772,
    "experimental_strategy": "Protein expression array"
  }
Restricted-access file (authentication needed):
  {
    "origin": "migrated",
    "data_type": "Simple nucleotide variation",
    "platform": "Affymetrix SNP Array 6.0",
    "file_name": "DUNGS_p_TCGA_b84_115_SNP_N_GenomeWideSNP_6_C09_771624.birdseed.data.txt",
    "md5sum": "a2a8e75e08dec27035f4af89c81f6e08",
    "data_format": "TXT",
    "acl": "phs000178",
    "access": "protected",
    "uploaded_datetime": 1425340539,
    "state": "live",
    "data_subtype": "Genotypes",
    "file_id": "178c2af5-181a-4312-b45e-0320b6daefb1",
    "file_size": 20851964,
    "experimental_strategy": "Genotyping array"
  }
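A sketch of how a client could decide whether a download needs a token, based on the `access` field in the file records above. Two things here are assumptions rather than facts from the slides: that the comma-separated UUIDs go in the URL path after `/data/`, and that any `access` value other than `"open"` requires a token. The function names are hypothetical, and the request is built but not sent.

```python
import urllib.request

DATA_URL = "https://gdc-api.nci.nih.gov/data/"


def needs_token(file_records):
    """True if any requested file is not open access."""
    return any(rec.get("access") != "open" for rec in file_records)


def data_request(file_records, token=None):
    """Build (but do not send) a download request for one or more files."""
    # Assumption: UUIDs are placed in the path, separated by commas.
    ids = ",".join(rec["file_id"] for rec in file_records)
    req = urllib.request.Request(DATA_URL + ids)
    if needs_token(file_records):
        if not token:
            raise ValueError("restricted files requested without a token")
        req.add_header("X-Auth-Token", token)
    return req


open_file = {"file_id": "6eb0e7f2-f0a6-420a-9511-9fef295c653e",
             "access": "open"}
protected_file = {"file_id": "178c2af5-181a-4312-b45e-0320b6daefb1",
                  "access": "protected"}

print(data_request([open_file]).full_url)
print(data_request([open_file, protected_file], token="example-token").full_url)
```

Failing early on a missing token avoids starting a multi-file download that the server would reject.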

API References 15
GDC Resources
–GDC Web Site (contains User's Guides): https://gdc.nci.nih.gov
–GDC Data Portal: https://gdc.nci.nih.gov/access-data/about-gdc-data-portal and https://gdc-portal.nci.nih.gov
–GDC Application Programming Interface (API): https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api and https://gdc-api.nci.nih.gov
Questions?

Questions? 16

Backup Material 17

GDC APIs for Data Submission 18

Type 1 Submission: Overview 19
GDC submitter types
–"Type 1"
  Users associated with an institution or group with significant informatics resources
  Large one-time submissions or long-term ongoing data submission
  Mostly use APIs or command line tools for submission
–"Type 2"
  Users associated with a single group (such as a laboratory PI), with limited informatics resources
  One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  Submit via web browser, supported by the same API
–This audience is Type 1, so we focus on that use case
Entities: Program, Project, Case, Biospecimen, Clinical, Data Bundles, Files

Type 1 Submission: Identifiers 20
GDC has to track at least three different identifiers:
–GDC ID: UUID for all GDC entities except for Program/Project; created by GDC* and unique across the whole system
–Alias: an optional string that can be used by the submitter to store their own IDs; must be unique on a per-project basis
–dbGaP ID: for Program/Project, this is the study accession (e.g. phs000178 for TCGA). It is still TBD if/how the GDC tracks participant and sample dbGaP IDs.
*Tentatively could support allowing Type 1 submitters to create GDC IDs if they are UUID4; this was done for legacy data
Identifiers per entity:
–Program (GDC ID, dbGaP ID)
–Project (GDC ID, dbGaP ID)
–Case (GDC ID, Alias, dbGaP ID?)
–Biospecimen (GDC ID, Alias, dbGaP ID?)
–Clinical (GDC ID, Alias, dbGaP ID?)
–Data Bundles (GDC ID, Alias)
–Files (GDC ID, Alias)
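If submitter-created GDC IDs are allowed under the UUID4 condition mentioned in the footnote, a submitter could generate and check them with Python's `uuid` module. This is a sketch of that idea only; the function names are hypothetical, and requiring the canonical hyphenated form is an assumption.

```python
import uuid


def new_gdc_id():
    """Generate a random (version 4) UUID in canonical string form."""
    return str(uuid.uuid4())


def is_uuid4(text):
    """True only for a canonical, hyphenated version-4 UUID string."""
    try:
        parsed = uuid.UUID(text)
    except (ValueError, AttributeError, TypeError):
        return False
    # Assumption: only the canonical hyphenated form is acceptable.
    return parsed.version == 4 and str(parsed) == text.lower()


# A file_id that appears elsewhere in these slides is a valid UUID4.
print(is_uuid4("6eb0e7f2-f0a6-420a-9511-9fef295c653e"))
# An alias is a free-form string, not a UUID.
print(is_uuid4("aliquot101"))
```

Checking the version locally catches time-based (version 1) UUIDs, which some tools produce by default, before they reach the API.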

Type 1 Submission: Data Dictionary Driven REST API 21
The native language of the GDC is JSON; each entity is represented as a JSON document
The search/retrieval API has opinionated views on the GDC data model for query performance
The submission API is designed to operate more directly on entities and relationships in the GDC:
–The data dictionary is defined in JSON schema; it will include core GDC schemas with a basic inheritance model for per-project properties
–The base submission API then has a single endpoint per project that takes in a JSON object or an array of JSON objects representing submissions
  https://gdc-api.nci.nih.gov/v0/submission/program1/project1
–Uses standard HTTP methods:
  GET to retrieve an entity
  POST to create an entity
  PUT to fully replace an entity
  PATCH to do partial updates to an entity
  DELETE to delete an entity
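A minimal sketch of how the per-project endpoint and the five HTTP methods could be exercised from Python. The URL pattern comes from the slide; the function name `submission_request`, the `Content-Type` header, and the reuse of `X-Auth-Token` for submission are assumptions. Requests are built but not sent.

```python
import json
import urllib.request

SUBMISSION_ROOT = "https://gdc-api.nci.nih.gov/v0/submission"
METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE"}


def submission_request(method, program, project, entities=None, token=None):
    """Build (but do not send) a request to a project's submission endpoint."""
    if method not in METHODS:
        raise ValueError(f"unsupported method: {method}")
    url = f"{SUBMISSION_ROOT}/{program}/{project}"
    # The endpoint takes one JSON object or an array of JSON objects.
    body = None if entities is None else json.dumps(entities).encode("utf-8")
    req = urllib.request.Request(url, data=body, method=method)
    if body is not None:
        req.add_header("Content-Type", "application/json")
    if token:
        # Assumption: submission uses the same token header as download.
        req.add_header("X-Auth-Token", token)
    return req


create = submission_request(
    "POST", "program1", "project1",
    entities=[{"type": "aliquot", "alias": "aliquot101"}],
)
print(create.get_method(), create.full_url)
```

Swapping `"POST"` for `"PATCH"` with a partial entity body would express a partial update against the same URL.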

Type 1 Submission: JSON Example 22
The formats are still undergoing iterative design, and the authoritative versions will be released as JSON schemas, but to provide a flavor of where we're going:
General Entity Format
  {
    'type': string,
    'id': string,
    'alias': string,
    ' ': any type,
    ' ': [{
      'id': string,
      'alias': string,
      'type': string
    }],
    ...
  }
General Response Format
  {
    'success': boolean,
    'code': int,
    'message': string,
    'transactional_errors': [transactional_error],
    'transactional_error_count': int,
    'entity_error_count': int,
    'entity': [object]
  }
Example Aliquot Entity (as of right now)
  {
    "type": "aliquot",
    "amount": null,
    "concentration": 0.14,
    "alias": "aliquot101",
    "derived_from": {
      "alias": "sample200",
      "type": "sample"
    }
  }
Example Aliquot Entity (under discussion)
  {
    "type": "aliquot",
    "amount": null,
    "concentration": 0.14,
    "alias": "aliquot101",
    "sample": {
      "alias": "sample200"
    }
  }
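For illustration, the "as of right now" aliquot entity can be written as a Python dictionary and serialized; Python's `None` becomes JSON `null` on the wire. The response check below uses only field names from the general response format above, but how those fields combine into an overall pass/fail is an assumption, and `submission_ok` is a hypothetical helper.

```python
import json

aliquot = {
    "type": "aliquot",
    "amount": None,  # serialized as JSON null
    "concentration": 0.14,
    "alias": "aliquot101",
    "derived_from": {"alias": "sample200", "type": "sample"},
}
wire = json.dumps(aliquot)
print(wire)


def submission_ok(response):
    """Assumed rule: success flag set and both error counts zero."""
    return (
        bool(response.get("success"))
        and response.get("transactional_error_count", 0) == 0
        and response.get("entity_error_count", 0) == 0
    )


# Hand-written response in the general response format, for illustration.
failed = {
    "success": False,
    "code": 400,
    "message": "validation failed",
    "transactional_errors": [],
    "transactional_error_count": 0,
    "entity_error_count": 1,
    "entity": [],
}
print(submission_ok(failed))
```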

Type 1 Submission: Validation 23
Validation has two categories:
1. Does not require access to the GDC database:
  Validate entities and properties based on JSON schema
  Validate entities and properties based on custom property-level functions
2. Requires access to the GDC database:
  Identifiers do not violate uniqueness constraints
  Validate that linked-to entities exist (if not included in the current transaction)
  Validate that adding a link does not violate multiplicity constraints
The intention is to be as helpful as possible when fixing invalid data errors; the API will provide a list of all known errors regarding each entity
–Provide a "dry run" option to just do validation without writing to the GDC
–Category 1 validation could be done without access to the API, and fixing errors returned by the dry run should result in successful submission
–Category 2 validation has to be done through the API; a successful dry run most likely means a successful submission, but this cannot be guaranteed
  E.g. someone else submits "case1" before you do
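Category 1 checks need no database access, so a submitter could run something like them locally before a dry run. The sketch below is hand-rolled and purely illustrative: the `RULES` table and the non-negative concentration check are hypothetical stand-ins for the GDC JSON schemas and property-level functions, which the slides say were not yet released. It collects every error per entity rather than stopping at the first.

```python
# Hypothetical rules; the real ones would come from the GDC JSON schemas.
RULES = {
    "aliquot": {"alias": str, "concentration": (int, float)},
}


def validate_entity(entity):
    """Return a list of all problems found in one entity."""
    rules = RULES.get(entity.get("type"))
    if rules is None:
        return [f"unknown entity type: {entity.get('type')!r}"]
    errors = []
    for key, expected in rules.items():
        if key not in entity:
            errors.append(f"missing property: {key}")
        elif not isinstance(entity[key], expected):
            errors.append(f"wrong type for property: {key}")
    # Hypothetical property-level function: no negative concentrations.
    conc = entity.get("concentration")
    if isinstance(conc, (int, float)) and conc < 0:
        errors.append("concentration must not be negative")
    return errors


def validate_batch(entities):
    """Map the index of each invalid entity to its list of errors."""
    report = {}
    for index, entity in enumerate(entities):
        errors = validate_entity(entity)
        if errors:
            report[index] = errors
    return report


batch = [
    {"type": "aliquot", "alias": "aliquot101", "concentration": 0.14},
    {"type": "aliquot", "concentration": "high"},
]
print(validate_batch(batch))
```

Category 2 checks (uniqueness, link existence, multiplicity) cannot be reproduced this way because they depend on the current state of the GDC database.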

Type 1 Submission: Programs, Projects, and Cases 24
Programs (e.g. TCGA/TARGET) and their Projects (e.g. BRCA) are created in the GDC and linked to the appropriate dbGaP projects
–Program/Project creation and updates are to be done by the GDC team (under advisement from the PO)
Cases can then be submitted to a specific project
–Cases are primarily an administrative entity; the term was chosen to allow the freedom of defining this based on the needs of the specific program/project
  For TCGA, participants are cases
  Could be cell lines or xenografts, as needed
  Moving forward, we recommend that each Program/Project have a well-defined concept of case to prevent confusion
–TBD: level of synchronization required with dbGaP subjects

Type 1 Submission: Biospecimen and Clinical 25
Current model supports sample, portion, analyte, aliquot, and slide*
Will provide endpoints to accept biospecimen and clinical XML
–Accept BCR XML format and store XML as a blob for user download
–Perform one extra validation step against XSD
–Translate XML into native GDC JSON to submit against the core GDC API
–Need to communicate/collaborate on this translation layer because the errors returned will be based on the core JSON-based GDC API
With Type 1 submission, we anticipate that the creation/update of a case and its corresponding biospecimen data will happen in one transaction via the XML-based endpoint
Clinical data submission is a separate transaction; creation/update is also expected to be a single transaction via the XML-based endpoint
–Select fields will be indexed and made searchable on the GDC data portal
  Likely more stringent validation
  Initial set contains 28 fields categorized into demographics, diagnosis, exposure, family history, and treatment

Type 1 Submission: Data Bundles 26
The most "in progress" GDC concept
Currently, the GDC data model has just files, but we realized a need to represent related sets of files
A data bundle is a set of files with metadata
–A data bundle is attached to one biospecimen entity
  Working through use cases to see if this needs to be expanded
Examples of possible combinations:
–1 BAM file
–2 paired FASTQ files + SRA XML files
–1 BAM file + SRA XML files
–1 slide image
Data bundle types are to be defined by what files are expected and what introspection is done for validation or linking to other GDC entities

Type 1 Submission: Data Bundle Lifecycle 27
From a submission standpoint, this is a two-step process due to the size of the files:
–Register the data bundle, either by explicitly linking the bundle to the biospecimen entity in the API call or by uploading metadata (e.g. SRA XML) that contains the GDC ID or Alias
–Upload the raw data using the GDC Data Transfer Tool, referencing the data bundle ID or Alias

Submission Dashboard 28
