1
GigaDB explained
Christopher I Hunter
Conference name Date
2
Presentation contents
GigaDB introduction
Data types hosted
Anatomy of a dataset DOI
Navigate GigaDB site
Search tool
Submission tool
The extensible metadata schema
Coffee Break
ISA tools introduction
ISA-Tab as an exchange format
ISA in action
3
GigaScience Database
4
Giga-overview
GigaDB hosts biological data (any type of data related to, or used in, biological studies).
Primarily associated with the BMC journal GigaScience.
Funded by BGI-Research and China National GeneBank.
5
Currently ~160 datasets available
Genomic datasets represent the majority of data (~70%).
~90% of all data come from BGI (or partner) studies.
But there are 13 different data types represented.
All manually curated.
6
Data types
Various nucleotide data types: Genomic, Transcriptomic, Metagenomic, Epigenomic, Genome mapping.
Mass spectrometry: Proteomics, Metabolomics, MS-Imaging.
Software & Workflows
Other: Imaging, Neuroscience, Network analysis
7
Navigating the GigaDB website
Home page
Dataset DOI page
Data download options
Search tool
Submission: who should submit to GigaDB, and how to submit data
9
Anatomy of a GigaDB entry
All relevant information is held together in packets called Datasets.
Each dataset has a stable DOI page.
If required, there can be a hierarchy of datasets.
10
Title
Study type(s)
Image
Citation
Description
Funders
Links to Google Scholar and EuroPMC to see who has cited this dataset
Submitter
Link to manuscript
Links to external resources
Cont.
11
Samples used in the study
Files listed as part of the study
History of dataset changes
Social media links
Links to other datasets of a similar nature
12
Downloading the data: FTP
Conventional/easy to use
Can pull files individually from the web page
One or multiple files using the Unix command line
Speed = up to 1 Mb/sec
13
Downloading the data: Aspera
Requires plugin download
Only available to use via web-app
One or multiple files
Speed = up to 100 Mb/sec (i.e. up to 100x faster than FTP)
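To make the speed difference concrete, here is a minimal arithmetic sketch. The two rates are the upper bounds quoted on the slides, taken at face value with the units as written; the dataset size and the function name `transfer_seconds` are hypothetical, and real transfers will rarely sustain the maximum rate.

```python
def transfer_seconds(size_megabits: float, rate_megabits_per_sec: float) -> float:
    """Return the idealised transfer time in seconds at a constant rate."""
    if rate_megabits_per_sec <= 0:
        raise ValueError("rate must be positive")
    return size_megabits / rate_megabits_per_sec


# Upper-bound speeds quoted on the slides (units taken as written).
FTP_RATE = 1.0       # "up to 1 Mb/sec"
ASPERA_RATE = 100.0  # "up to 100 Mb/sec"

# Hypothetical dataset size, chosen only so the numbers are easy to read.
dataset_size = 36_000.0

ftp_hours = transfer_seconds(dataset_size, FTP_RATE) / 3600
aspera_minutes = transfer_seconds(dataset_size, ASPERA_RATE) / 60
print(f"FTP: {ftp_hours:.1f} h, Aspera: {aspera_minutes:.1f} min")
```

At these idealised rates the same transfer drops from 10 hours to 6 minutes, which is the 100x ratio stated on the slide.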
14
Search tool
Search for the term “genome” in the search bar at the top of any dataset page:
15
Search tool
16
= GigaDB datasets = Samples = Files
It will only display files that contain matches to the search term
19
Submitting data to GigaDB
All data submitted to GigaDB must be fully consented for public release.
Where appropriate, data should be submitted to established public archives first (e.g. INSDC).
At present we only host data associated with GigaScience journal articles, or by prior approval of the Editors of GigaScience.
Potential submitters should approach the editors and database curators to discuss possible inclusion.
20
Submission Workflow
Submission: submitter sends an Excel spreadsheet or uses the online wizard.
Validation checks: Fail – submitter is provided with an error report. Pass – dataset is uploaded to GigaDB.
Curator Review: curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues).
DOI assigned.
Files: submitter provides files by FTP or Aspera.
DataCite XML file: XML is generated and registered with DataCite.
Public GigaDB dataset: curator makes dataset public (can be set as a future date if required).
Example: DOI /100003 Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
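The workflow can be pictured as a small state machine. The sketch below is illustrative only: the stage names, their ordering and the `advance` function are assumptions based on one reading of the diagram, not GigaDB's actual implementation.

```python
from enum import Enum


class Stage(Enum):
    # Ordering is an assumption drawn from the workflow diagram.
    SUBMITTED = "metadata submitted (spreadsheet or wizard)"
    VALIDATED = "validation checks passed, dataset uploaded to GigaDB"
    CURATOR_REVIEW = "curator review, DOI citation sent to submitter"
    FILES_PROVIDED = "files provided by FTP or Aspera"
    REGISTERED = "XML generated and registered with DataCite"
    PUBLIC = "dataset made public"


ORDER = list(Stage)  # Enum members iterate in definition order


def advance(stage: Stage, validation_passed: bool = True) -> Stage:
    """Move a submission to its next stage."""
    if stage is Stage.PUBLIC:
        raise ValueError("dataset is already public")
    if stage is Stage.SUBMITTED and not validation_passed:
        # Fail: the submitter receives an error report and resubmits.
        return Stage.SUBMITTED
    return ORDER[ORDER.index(stage) + 1]


stage = Stage.SUBMITTED
stage = advance(stage, validation_passed=False)  # stays at SUBMITTED
while stage is not Stage.PUBLIC:
    stage = advance(stage)
    print(stage.name, "-", stage.value)
```

A failed validation loops back to the start, while every other step moves strictly forward until the curator makes the dataset public.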
22
Submission
Once approved, there are two options for submitting metadata:
offline, using an Excel spreadsheet
online, using the wizard
Soon there will be a third option (ISA-tab).
23
Online vs Offline
Online (wizard): guided; good for a few large samples.
Offline (Excel spreadsheet): allows greater addition of linking; limited documentation; best for a large number of samples/files.
24
Submission wizard
Register, log in, go to your profile page:
28
Add all the links to related data
32
Add all Sample metadata
34
Manual check/curate
After metadata has been submitted, it is checked by a curator.
A private upload area is assigned and the user can upload data files by Aspera or FTP.
36
Mint the DOI
Once all the files and metadata are stored and linked appropriately, we mint the DOI with our partners at DataCite.
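As a rough illustration of the kind of record sent to DataCite, the sketch below builds a simplified XML document carrying the core DataCite properties (identifier, creators, titles, publisher, publication year). It deliberately omits the schema namespace and version, so it is not a valid DataCite submission. The function name `build_datacite_record`, the DOI value and the creator name are hypothetical placeholders; the title and year come from the example dataset in the workflow slide.

```python
import xml.etree.ElementTree as ET


def build_datacite_record(doi, creators, title, publisher, year):
    """Build a simplified, namespace-free DataCite-style record as a string."""
    resource = ET.Element("resource")
    identifier = ET.SubElement(resource, "identifier", identifierType="DOI")
    identifier.text = doi
    creators_el = ET.SubElement(resource, "creators")
    for name in creators:
        creator = ET.SubElement(creators_el, "creator")
        ET.SubElement(creator, "creatorName").text = name
    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)
    return ET.tostring(resource, encoding="unicode")


record_xml = build_datacite_record(
    doi="10.0000/placeholder",       # hypothetical DOI, not a real identifier
    creators=["Example Submitter"],  # hypothetical creator
    title="Genomic data from the crab-eating macaque/cynomolgus monkey "
          "(Macaca fascicularis)",
    publisher="GigaScience Database",
    year=2011,
)
print(record_xml)
```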
37
Publish the dataset
Publication date = date on which the DOI is released to the public.
Immediately added to the GigaDB RSS feed.
Any other promotion of datasets is done in conjunction with manuscript publication.
38
Behind the scenes
39
The extensible metadata schema
The spectrum of data being hosted is very broad.
The database needs to:
Be structured, but allow wide variety
Be able to incorporate multiple standards
Utilise ontologies
Link to multiple external sources
40
The GigaDB schema looks like this:
41
Just the Dataset tables
Columns: id, submitter_id, image_id, identifier, title, description, dataset_size, ftp_site, upload_status, excelfile, excelfile_md5, publication_date, modification_date, publisher_id, token
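The column list above maps naturally onto a record type. The sketch below is a hypothetical Python mirror of those columns: the field types and defaults are assumptions (the slide gives names only), and `md5_matches` is an illustrative helper showing one plausible use of the `excelfile_md5` column, namely checking that a stored submission spreadsheet has not changed.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    # Field names follow the slide; types and defaults are assumptions.
    id: int
    submitter_id: int
    image_id: Optional[int] = None
    identifier: str = ""
    title: str = ""
    description: str = ""
    dataset_size: Optional[int] = None
    ftp_site: str = ""
    upload_status: str = ""
    excelfile: str = ""
    excelfile_md5: str = ""
    publication_date: Optional[date] = None
    modification_date: Optional[date] = None
    publisher_id: Optional[int] = None
    token: Optional[str] = None


def md5_matches(dataset: Dataset, spreadsheet_bytes: bytes) -> bool:
    """Check spreadsheet contents against the stored MD5 checksum."""
    return hashlib.md5(spreadsheet_bytes).hexdigest() == dataset.excelfile_md5


content = b"hypothetical spreadsheet contents"
record = Dataset(
    id=1,
    submitter_id=1,
    identifier="100003",
    title="Genomic data from the crab-eating macaque",
    excelfile="submission.xls",  # hypothetical file name
    excelfile_md5=hashlib.md5(content).hexdigest(),
)
print(md5_matches(record, content))
```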
44
Store wide variety of attributes
Columns: attribute_name, definition, model, structured_comment_name, value_syntax, allowed_units, occurrence, ontology_link, note
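A short sketch of how such an attribute definition could drive validation of submitted values. Everything beyond the column names is an assumption: here `value_syntax` is treated as a regular expression and `allowed_units` as a tuple of unit strings, and the `depth` attribute shown is a made-up example entry rather than a row from GigaDB.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Attribute:
    # Field names follow the slide; types are assumptions.
    attribute_name: str
    definition: str
    model: str
    structured_comment_name: str
    value_syntax: str                # assumed to hold a regular expression
    allowed_units: Tuple[str, ...]   # empty tuple means "no unit expected"
    occurrence: str
    ontology_link: Optional[str] = None
    note: str = ""


def is_valid(attr: Attribute, value: str, unit: Optional[str] = None) -> bool:
    """Check a value (and optional unit) against an attribute definition."""
    if not re.fullmatch(attr.value_syntax, value):
        return False
    if attr.allowed_units and unit not in attr.allowed_units:
        return False
    return True


# Hypothetical attribute entry for illustration.
depth = Attribute(
    attribute_name="depth",
    definition="Depth at which the sample was collected",
    model="MIxS",
    structured_comment_name="depth",
    value_syntax=r"\d+(\.\d+)?",
    allowed_units=("m",),
    occurrence="1",
)

print(is_valid(depth, "12.5", "m"))
print(is_valid(depth, "deep", "m"))
```

Keeping syntax and units in the attribute row, rather than in code, is what lets one table serve many different standards.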
45
Checklists
Different things are important in different experiment types.
Various communities have standard checklists they try to adhere to.
GigaDB can leverage those different checklists and integrate them where possible.
46
MIxS
Minimum Information about any (x) Sequence, from the Genomic Standards Consortium (GSC)
It includes:
A set of core descriptors for sequence data
A set of measurements and observations describing the environment of the sample
Goes beyond the minimum by defining ~370 attributes that could be used.
It is hoped that adoption of this standard will elevate the quality, accessibility and utility of the information that can be collected.
47
SRA & PX
The Sequence Read Archive (SRA) and ProteomeXchange (PX) both also provide specific terms (attributes) that we can map to.
48
Other checklists
We are able to include attributes from any model or standard and link to it from the attributes table.
So if there is a recommended standard for a particular data type, we can incorporate it.
49
Ontologies
Units
Taxonomy
Any that are defined in standards
Common ones in use:
DOID - Disease Ontology
EFO - Experimental Factor Ontology
SO - Sequence Ontology
UBERON - cross-species ontology of anatomical structures
ENVO - ENVironment Ontology
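Ontology terms are commonly written as a prefix and a local identifier separated by a colon. The sketch below shows how a term could be resolved to one of the ontologies listed above. The function names are hypothetical, and the identifier `UBERON:0000000` is a placeholder, not a real term.

```python
# Prefixes and names as listed on the slide.
ONTOLOGIES = {
    "DOID": "Disease Ontology",
    "EFO": "Experimental Factor Ontology",
    "SO": "Sequence Ontology",
    "UBERON": "cross-species ontology of anatomical structures",
    "ENVO": "ENVironment Ontology",
}


def split_term(term: str):
    """Split 'PREFIX:ID' into its parts, rejecting unknown prefixes."""
    prefix, sep, local_id = term.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a prefixed term: {term!r}")
    if prefix not in ONTOLOGIES:
        raise ValueError(f"unknown ontology prefix: {prefix!r}")
    return prefix, local_id


def describe(term: str) -> str:
    """Return a human-readable description of a prefixed term."""
    prefix, local_id = split_term(term)
    return f"{local_id} in {ONTOLOGIES[prefix]} ({prefix})"


# Placeholder identifier for illustration only.
print(describe("UBERON:0000000"))
```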
50
Future developments
Develop an Application Programming Interface (API), including support for ISA format import and export
Improve dataset DOI display pages: include experiment information
Improve submission wizard: include bulk upload tables; add ontology look-up automatically
Integrate other tools
51
That’s it for GigaDB. Thanks for listening! Any Questions? Next up, ISA tools.
52
ISA tools
Christopher I Hunter
Conference name Date
53
What is ISA?
Investigation
Study (and/or Sample)
Assay
54
What is ISA-tab?
ISA-tab is a general-purpose, domain-agnostic, flexible format to describe multi-omic experiments.
It can be used as a submission format to some archives, and there is a suite of tools for conversion into other common formats.
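Because ISA-tab files are plain tab-delimited text, they can be read with standard tooling. The sketch below parses a tiny, made-up study sample table using only the Python standard library. The column headers follow the ISA-tab naming style (`Source Name`, `Characteristics[organism]`, `Sample Name`); the rows are hypothetical, and this does not use or represent the ISA tools themselves.

```python
import csv
import io

# Hypothetical study sample table in ISA-tab style (tab-delimited).
STUDY_TABLE = (
    "Source Name\tCharacteristics[organism]\tSample Name\n"
    "source1\tMacaca fascicularis\tsample1\n"
    "source1\tMacaca fascicularis\tsample2\n"
)


def read_samples(text: str):
    """Parse a tab-delimited table into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


samples = read_samples(STUDY_TABLE)
sample_names = [row["Sample Name"] for row in samples]
organisms = {row["Characteristics[organism]"] for row in samples}
print(sample_names, organisms)
```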
56
What are ISA-tools?
A suite of tools based on the ISA-tab format.
Developed and maintained by a team at Oxford University, UK.
The main tool of interest here is the ISA-creator.
58
Live demo Don’t panic, I have screen shots if it all goes wrong!
64
The Ontology lookup function
65
ISA Validation tool
By default it only checks that ISA-tab files are formed correctly.
It can be configured to check against checklists.
66
ISA converter tool
The development team actively works on new converter tools and is always happy to work with domain experts to make more.
67
The ISA team
http://www.isa-tools.org/
Susanna Sansone
Philippe Rocca-Serra
Alejandra Gonzalez-Beltran
68
That’s it for ISA. Thanks for listening! Any Questions? Next up, GigaScience software and workflows.