GigaDB explained Christopher I Hunter International Training Workshop on Big Data 11-Mar-2015
Presentation contents: GigaDB introduction; Data types hosted; Anatomy of a dataset DOI; Navigating the GigaDB site; Search tool; Submission tool; The extensible metadata schema; Coffee break; ISA tools introduction; ISA-tab as an exchange format; ISA in action
GigaScience Database
Giga-overview: GigaDB hosts biological data (any type of data related to, or used in, biological studies). Primarily associated with the BMC journal GigaScience. Funded by BGI-Research and China National GeneBank.
Currently ~160 datasets available. Genomic datasets represent the majority of the data (~70%), and ~90% of all data comes from BGI (or partner) studies, but there are 13 different data types represented. All manually curated.
Data types Various Nucleotide data types: – Genomic, Transcriptomic, Metagenomic, Epigenomic, Genome mapping. Mass spectrometry: – Proteomics, Metabolomics, MS-Imaging. Software & Workflows Other – Imaging, Neuroscience, Network analysis
Navigating the GigaDB website Home page Dataset DOI page Data download options Search tool Submission: – Who should submit to GigaDB – How to submit data
Anatomy of a GigaDB entry All relevant information is held together in packets called Datasets Each dataset has a stable DOI page If required there can be a hierarchy of datasets
Dataset DOI page contents: Title; Study type(s); Image; Citation; Description; Funders; Links to Google Scholar and Europe PMC to see who has cited the dataset; Submitter; Link to manuscript; Links to external resources. (Cont.)
Dataset DOI page contents (cont.): Samples used in the study; Files listed as part of the study; History of dataset changes; Social media links; Links to other datasets of a similar nature.
Downloading the data: FTP. Conventional and easy to use. Files can be pulled individually from the web page, or one or multiple files via the command line (Unix). Speed: up to 1 Mb/sec.
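As a hedged illustration of the command-line route, here is a minimal Python sketch using the standard ftplib module. The host name, directory, and file name below are placeholders, not real GigaDB paths; take the actual FTP address and file path from the dataset's DOI page.

```python
# Minimal sketch: fetch one file from an FTP site with Python's ftplib.
# HOST and REMOTE_DIR are placeholders, not real GigaDB locations.
from ftplib import FTP

HOST = "ftp.example.org"                  # placeholder for the GigaDB FTP host
REMOTE_DIR = "/pub/gigadb/dataset_12345"  # hypothetical dataset directory
FILENAME = "sample_data.fasta.gz"         # hypothetical file name

with FTP(HOST) as ftp:
    ftp.login()                           # anonymous login
    ftp.cwd(REMOTE_DIR)
    with open(FILENAME, "wb") as out:
        ftp.retrbinary(f"RETR {FILENAME}", out.write)
```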
Downloading the data: Aspera. Requires a plugin download. Only available via the web app. One or multiple files. Speed: up to 100 Mb/sec (i.e. up to 100x faster than FTP).
Search tool Search for the term “genome” in the search bar at the top of any dataset page:
Search tool
Results are grouped into GigaDB datasets, Samples, and Files. Only files that contain matches to the search term are displayed.
Submitting data to GigaDB: All data submitted to GigaDB must be fully consented for public release. Where appropriate, data should be submitted to established public archives first (e.g. INSDC). At present we only host data associated with GigaScience journal articles, or by prior approval from the Editors of GigaScience. Potential submitters should contact the editors and database curators to discuss possible inclusion.
Submission workflow
1. Submission – Submitter completes an Excel spreadsheet or uses the online wizard.
2. Validation checks – Fail: submitter is provided with an error report; Pass: dataset is uploaded to GigaDB and a GigaDB DOI is assigned.
3. Curator review – Curator contacts the submitter with the DOI citation and to arrange file transfer (and resolve any other questions/issues).
4. Files – Submitter provides files by FTP or Aspera.
5. DataCite XML file – XML is generated and registered with DataCite; the curator makes the dataset public (can be set to a future date if required).
6. Public GigaDB dataset – e.g. Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011).
Once approved, there are two options for submitting metadata: – offline, using an Excel spreadsheet – online, using the wizard. Soon there will be a third option (ISA-tab).
Online vs Offline
Online wizard: – Guided – Good for a few large samples – Allows greater addition of linking
Offline spreadsheet: – Limited documentation – Best for a large number of samples/files
Submission wizard: Register, log in, go to your profile page:
Add all the links to related data
Add all Sample metadata
Manual check / curation: After the metadata has been submitted it is checked by a curator. A private upload area is assigned and the user can upload data files by Aspera or FTP.
Mint the DOI: Once all the files and metadata are stored and linked appropriately, we mint the DOI with our partners at DataCite.
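For illustration only, here is a minimal sketch of what assembling a DataCite-style metadata record could look like in Python. The element names follow the DataCite metadata kernel as I recall it, but the exact schema version, namespaces, required fields, and registration step belong to DataCite's documentation; all values shown are hypothetical.

```python
# Sketch only: build a minimal DataCite-style XML record with the standard library.
# Field values are hypothetical; real records must follow the DataCite metadata
# schema and are registered through DataCite's services (not shown here).
import xml.etree.ElementTree as ET

resource = ET.Element("resource")
ET.SubElement(resource, "identifier", identifierType="DOI").text = "10.XXXX/exampledoi"
creators = ET.SubElement(resource, "creators")
ET.SubElement(ET.SubElement(creators, "creator"), "creatorName").text = "Example Consortium"
titles = ET.SubElement(resource, "titles")
ET.SubElement(titles, "title").text = "Genomic data from an example species"
ET.SubElement(resource, "publisher").text = "GigaScience Database"
ET.SubElement(resource, "publicationYear").text = "2015"

print(ET.tostring(resource, encoding="unicode"))
```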
Publish the dataset: Publication date = the date on which the DOI is released to the public. The dataset is immediately added to the GigaDB RSS feed. Any other promotion of datasets is done in conjunction with the manuscript publication.
Behind the scenes
The extensible metadata schema: The spectrum of data being hosted is very broad. The database needs to: – Be structured, but allow a wide variety of data – Be able to incorporate multiple standards – Utilise ontologies – Link to multiple external sources
The GigaDB schema looks like this:
Just the Dataset tables
dataset table columns: id, submitter_id, image_id, identifier, title, description, dataset_size, ftp_site, upload_status, excelfile, excelfile_md5, publication_date, modification_date, publisher_id, token
Store a wide variety of attributes. attribute table columns: id, attribute_name, definition, model, structured_comment_name, value_syntax, allowed_units, occurrence, ontology_link, note
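To make the attribute idea concrete, here is a rough Python sketch of the entity-attribute-value pattern these tables suggest. The field names are a subset of the columns shown on the slides, while the DatasetAttribute link class and its value field are my own illustration, not the actual GigaDB schema.

```python
# Rough sketch of the entity-attribute-value idea behind the schema.
# Attribute fields mirror a subset of the slide's columns; the link class is illustrative.
from dataclasses import dataclass

@dataclass
class Attribute:
    id: int
    attribute_name: str   # e.g. a term from MIxS, SRA, or another checklist
    definition: str
    model: str            # which checklist/standard the term comes from
    value_syntax: str
    allowed_units: str
    ontology_link: str

@dataclass
class DatasetAttribute:
    dataset_id: int       # points at a row in the dataset table
    attribute_id: int     # points at a row in the attribute table
    value: str            # the free-text or ontology-backed value

# Any new checklist term becomes a new Attribute row -- no schema change needed.
```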
Checklists Different things are important in different experiment types Various communities have standard checklists they try to adhere to GigaDB can leverage those different checklists and integrate them where possible.
MIxS: Genomic Standards Consortium (GSC) – Minimum Information about any (x) Sequence. It includes: – A set of core descriptors for sequence data – A set of measurements and observations describing the environment of the sample – It goes beyond the minimum by defining ~370 attributes that could be used. It is hoped that adoption of this standard will elevate the quality, accessibility and utility of the information that can be collected.
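As a hedged example, a handful of MIxS-style environmental descriptors for one sample, expressed as attribute/value pairs. The term names are as I recall them from the GSC checklists and the values are invented; check the current MIxS documentation for the exact terms.

```python
# Illustrative only: MIxS-style attribute/value pairs for one hypothetical sample.
# Term spellings should be verified against the current GSC MIxS checklists.
sample_attributes = {
    "geo_loc_name": "China: Shenzhen",
    "collection_date": "2014-06-01",
    "env_biome": "urban biome",
    "env_material": "soil",
    "lat_lon": "22.54 N 114.05 E",
}
```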
SRA & PX The Sequence Read Archive (SRA) and the ProteomeXchange (PX) also both provide specific terms (attributes) that we can map to.
Other checklists: We are able to include attributes from any model or standard and link to it from the attribute table. So if there is a recommended standard for a particular data type, we can incorporate it.
Ontologies Units Taxonomy Any that are defined in standards Common ones in use: – DOID - Disease ontology – EFO - Experimental Factor ontology – SO - Sequence ontology – UBERON - cross-species ontology of anatomical structures – ENVO - ENVironment Ontology
Future developments: Develop an Application Programming Interface (API) – Including support for ISA format import and export. Improve dataset DOI display pages – Include experiment information. Improve the submission wizard – Include bulk upload tables. Add automatic ontology look-up. Integrate other tools.
That’s it for GigaDB. Thanks for listening! Any Questions? Next up, ISA tools.
ISA tools Christopher I Hunter International Training Workshop on Big Data 11-Mar-2015
What is ISA? I = Investigation, S = Study (and/or Sample), A = Assay
What is ISA-tab? ISA-tab is a general-purpose, domain-agnostic, flexible format for describing multi-omic experiments. It can be used as a submission format to some archives, and there is a suite of tools for conversion into other common formats.
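As a toy illustration, the sketch below writes a tab-delimited study sample table in the spirit of ISA-tab. The column headers and file name are only approximate: the real investigation (i_), study (s_) and assay (a_) files are defined by the ISA-tab specification and are best produced with ISAcreator.

```python
# Toy sketch: write a tab-delimited study sample table in the spirit of ISA-tab.
# Headers and file name are illustrative; the ISA-tab specification defines the
# exact investigation (i_), study (s_) and assay (a_) files and their columns.
import csv

rows = [
    ["Source Name", "Characteristics[organism]", "Protocol REF", "Sample Name"],
    ["specimen_01", "Macaca fascicularis", "sample collection", "sample_01"],
    ["specimen_02", "Macaca fascicularis", "sample collection", "sample_02"],
]

with open("s_example_study.txt", "w", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)
```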
What are ISA-tools? A suite of tools based on the ISA-tab format. Developed and maintained by a team at Oxford University, UK. The main tool of interest here is ISAcreator.
Live demo Don’t panic, I have screen shots if it all goes wrong!
The Ontology lookup function
ISA validation tool: By default it only checks that ISA-tab files are formed correctly. It can be configured to check against checklists.
ISA converter tools: The development team actively works on new converter tools, and is always happy to work with domain experts to make more.
The ISA team Susanna Sansone Philippe Rocca-Serra Alejandra Gonzalez-Beltran
That’s it for ISA. Thanks for listening! Any Questions? Next up, GigaScience software and workflows.