GigaDB and Galaxy: revolutionizing data dissemination, organization and analysis. Peter Li, GigaScience (peter@gigasciencejournal.com)

Presentation transcript:

GigaDB and Galaxy: revolutionizing data dissemination, organization and analysis. Peter Li, GigaScience, peter@gigasciencejournal.com

Journal and database for large-scale data in conjunction with … Editor-in-Chief: Laurie Goodman; Editor: Scott Edmunds; Commissioning Editor: Nicole Nogoy; Lead Curator: Tam Sneddon; Data Platform: Peter Li. www.gigasciencejournal.com

Mini-pig genome published this month

Why another *omics journal? Many journals already publish research involving large data sets. The open problem is results reproducibility.

Unrepeatability of scientific results: out of 18 microarray papers, results from 10 could not be reproduced. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155.

How are we supporting data reproducibility? Data sets: linked to a DOI and linked to a GigaScience paper. Analyses: community tools for data reproduction and reuse.

Linking of papers and data by citation of DOIs. Provide an example of a GigaScience paper; mention the DOI for the paper itself; highlight the data set generated and its DOI. Slide labels: data set DOI, paper DOI.

http://gigadb.org

GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data”… Now that you all want to submit to GigaDB, how do you do that? How will people search for and find your data and, other than citing your DOI, what will they be able to do with it? We have redesigned the underlying database and are working on the front end, which we hope to make public early next month. The following slides are therefore a mix of screenshots from the development site overlaid with tweaks made in PowerPoint to illustrate features you can expect to see when we go live. These include a home page image slider for browsing datasets and a text box search, which I will demonstrate shortly.

Faster download speeds: Aspera data transfer. This is an example landing page for DOI 10.5524/100015, the YH genome dataset. These pages are still in development, but you can see the release date, the title and abstract, and how the dataset should be cited. Additional information includes links to manuscripts and to data accessions at EBI, NCBI or DDBJ, followed by information on the samples and files.
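The landing page above is reached through a dataset DOI. As a minimal sketch of how such a DOI could be checked and turned into a link, the Python below uses the `10.5524` prefix seen in the example and the public `https://doi.org/` resolver. The function names are hypothetical, and the assumption that a GigaDB suffix is purely numeric rests only on the single example DOI quoted in the slide.

```python
import re

GIGADB_PREFIX = "10.5524"          # prefix taken from the example DOI in the slide
DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


def is_gigadb_doi(doi: str) -> bool:
    """Return True if the DOI has the GigaDB prefix and a numeric suffix.

    The numeric-suffix rule is an assumption based on the one example shown.
    """
    pattern = re.escape(GIGADB_PREFIX) + r"/\d+"
    return re.fullmatch(pattern, doi.strip()) is not None


def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a DOI. No network request is made."""
    return DOI_RESOLVER + doi.strip()


if __name__ == "__main__":
    example = "10.5524/100015"  # YH genome dataset, as quoted in the slide
    print(is_gigadb_doi(example))
    print(doi_to_url(example))
```

Opening the printed URL in a browser should redirect to whatever landing page is registered for that DOI.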

BGI Datasets Get DOI®s (39 data sets). Legend: released pre-publication; paper published in GigaScience.
Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Schistosoma; Silkworm; Parasitic nematode; Pacific oyster.
Microbe: E. coli O104:H4 TY-2482; T2D gut metagenome.
Cell lines: Chinese Hamster Ovary; Mouse methylomes.
Vertebrates: Darwin’s Finch; Giant panda; Macaque (Chinese rhesus, Crab-eating); Mini-Pig; Naked mole rat; Parrot, Puerto Rican; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope.
Human: Asian individual (YH) (DNA methylome, genome assembly, transcriptome); Cancer (14TB); Single cell bladder cancer; HBV infected exomes; Ancient DNA (Saqqaq Eskimo, Aboriginal Australian).
Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum.

Currently: 39 public datasets, with 10 citations in references. Human datasets: Ancient DNA (Aboriginal Australian, Saqqaq Eskimo); Asian individual (YH). A GigaDB dataset citation is also included in the YH transcriptome paper published in Nature Biotechnology in February this year. As you can see, the dataset was published in 2011, but this did not prevent subsequent publication of the analysis paper.

What about the analyses? Data sets: linked to a DOI and linked to a GigaScience paper. Analyses: how will we make analyses available for downloading and execution?

Bioinformatics data analyses as workflows. Many bioinformatics tools are available on the Web, and people typically combine two or more of them in a pipeline. For example, you might be a biologist interested in how a protein of interest is related to other proteins in its family. How does T-Coffee align protein sequences? Alignment tools use a number of computational algorithms to compare sequences. They can be divided into two types: global alignment tools, which attempt to align every residue in the sequence, and local alignment tools, which attempt to match regions of similarity between sequences. Example workflow: investigate the evolutionary relationships between proteins. Steps: query, protein sequences, multiple sequence alignment.
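The difference between global and local alignment can be made concrete with two textbook dynamic-programming scores. The sketch below uses Needleman-Wunsch for the global case and Smith-Waterman for the local case, with a simple match/mismatch/gap scoring scheme chosen here for illustration. It is not a description of how T-Coffee works internally, and the two example sequences are invented.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: every residue of a and b must be aligned."""
    prev = [j * gap for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # aligning a[:i] against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of regions, never below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Invented sequences sharing only the central region "HEAG".
    s1, s2 = "WWHEAGWW", "PPHEAGPP"
    print(global_score(s1, s2))  # 0: the four matches are cancelled by the flanks
    print(local_score(s1, s2))   # 4: only the shared region is scored
```

The global score is dragged down by the unrelated flanking residues, while the local score reports only the shared region, which is the distinction the slide draws.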

Implement GigaScience workflows in a community-accepted format. Galaxy is open source, with over 20,000 users on the main server, over 500 papers citing the use of Galaxy, and over 55 Galaxy servers deployed on the Web. http://galaxyproject.org

Galaxy allows scientists who may not have programming skills to compose data analysis pipelines. Interface panels: tool list, tool parameterisation, results panel.

Pilot project: integrate the BGI SOAP package into Galaxy. Enable SOAP tools to be used from within Galaxy workflows.

Integrate BGI SOAP package into Galaxy. Data analysis pipelines reach each tool through its own Python wrapper: SOAP1, SOAP2, SOAPdenovo1, SOAPdenovo2, SOAPsnp, SOAPsplice.
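A Python wrapper of the kind named on this slide usually does little more than translate parameters into a command line and run the tool. The sketch below shows that general pattern only. The function names `build_command` and `run_tool` are hypothetical, no real SOAP command-line flags are reproduced, and the Python interpreter itself stands in for the wrapped executable so the example runs anywhere.

```python
import subprocess
import sys


def build_command(executable, options, positional=()):
    """Turn a dict of options into an argv list.

    A value of None marks a flag that takes no argument.
    """
    argv = [executable]
    for flag, value in options.items():
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    argv.extend(str(p) for p in positional)
    return argv


def run_tool(argv):
    """Run the command and return (exit code, stdout, stderr) as text."""
    done = subprocess.run(argv, capture_output=True, text=True, check=False)
    return done.returncode, done.stdout, done.stderr


if __name__ == "__main__":
    # Stand-in for a real tool: the current interpreter prints a message.
    argv = build_command(sys.executable, {"-c": "print('tool finished')"})
    code, out, err = run_tool(argv)
    print(code, out.strip())
```

Keeping command construction separate from execution lets the argument list be checked without the wrapped tool being installed.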

GitHub open code repository https://github.com/gigascience

Tool list Tool parameterisation Results panel

SOAPdenovo2 Galaxy workflow

http://www.myexperiment.org

Why publish in GigaScience? Benefits: data hosted in GigaDB; allocation of DOIs to data; metadata in ISA-Tab format; Galaxy tool integration; use of tools in Galaxy workflows. Added value: no need to use own servers; citable data; aids reuse of data; supports reuse of tools; improves documentation; shows how a tool can be used with other bioinformatics software. DOIs can now be tracked in the new Thomson Reuters Data Citation Index, which gives a form of credit and makes the data more discoverable (Scott).

Thanks to: Tin-Lap Lee and Huayan Gao (CUHK); Tam, Jesse, Scott, Nicole & Laurie (GigaScience). Contact: peter@gigasciencejournal.com