Open Data, Open Source: preparing for Big Data in Metabolomics Rob L Davidson GigaScience @bobbledavidson #MetSoc2015

2 Big Science

3 R&D is getting bigger

4 More PhDs doi:10.1038/472276a

5 More postdocs

6 Not so much at the top

7 Big is at the bottom

THE NEED FOR OPEN DATA IN SCIENCE

9 Let me tell you about… “I am appalled sometimes at some papers today: they are so data-heavy and I don't think that makes them better papers.” – Tim Hunt 2014 Lab –

Researcher bias
Positive result bias: 20 teams do studies, 1 publishes p<0.05
Poorly explained analyses
DOI: 10.1371/journal.pmed.0020124
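The "20 teams, 1 publishes" point can be made concrete with a short calculation. This is a minimal sketch, not taken from the cited paper: it assumes every team tests an effect that does not exist, that the studies are independent, and that each uses the p<0.05 threshold. The function name is illustrative.

```python
# Assumption: all teams study a true null effect, independently,
# so each has a 5% chance of a false positive at p < 0.05.
ALPHA = 0.05
TEAMS = 20


def prob_at_least_one_false_positive(teams: int, alpha: float) -> float:
    """Chance that one or more of `teams` independent null studies reach p < alpha."""
    # P(no team is "significant") = (1 - alpha) ** teams; take the complement.
    return 1.0 - (1.0 - alpha) ** teams


if __name__ == "__main__":
    p = prob_at_least_one_false_positive(TEAMS, ALPHA)
    print(f"P(at least one of {TEAMS} teams gets p < {ALPHA}) = {p:.2f}")
```

Under these assumptions the chance that at least one team gets a publishable p<0.05 is about 0.64, even though there is nothing to find. If only that team publishes, the literature records a positive result.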

Problem: Reproducibility
Out of 18 microarray papers, results from 10 could not be reproduced
DOI: 10.1038/ng.295

12 Software?
“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”
613 papers tested
123 successful reproductions
DOI: 10.6084/m9.figshare.1439750

85% of research resources are wasted!
We must... favor... unbiased, transparent, collaborative research with greater standardization
Share data, protocols, materials, software, other tools
DOI: 10.1371/journal.pmed.1001747

OPEN DATA CASE STUDY

15 Pregnancy-Induced Metabolic Phenotype Variations in Maternal Plasma DOI: 10.1021/pr401068k

16 Data Note


18 Devil in the detail

19 Minor discrepancies Major considerations

20 Open Data
Release data prior to peer review
Produce highly detailed metadata descriptions
– ISA-Tab
Expect/accept updates, ‘ongoing review’
Release ‘negative data’
– Get credit for ALL work

OPEN SOURCE CASE STUDY

22 Birmingham metabolomics workflow
Many tools
Many languages
Complex to learn
Many parameters
Complex to report

23 Galaxy-M GUI

24 Galaxy-M Workflows

25 Accessible, reusable
GitHub
– Ease of access
Galaxy
– Ease of use
– Ease of reporting
– Ease of adaptation
Virtual Machine
– Ease of installation
– Guaranteed reproducibility
Test Datasets
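One way test datasets can back a reproducibility claim is to record a checksum of the expected output and compare it after each run. The sketch below is not part of Galaxy-M, and the function names are illustrative. It assumes the workflow produces byte-identical output files for the same input, which may not hold for floating-point results written by different platforms.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read 64 KiB at a time so large result files do not fill memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def output_matches(path: Path, expected_hex: str) -> bool:
    """True if the file at `path` has the recorded reference checksum."""
    return sha256_of(path) == expected_hex.lower()
```

The reference digest would be published alongside the test dataset, so anyone rerunning the workflow can check their output against it.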

26 And yet… referee 2
“I think important aspects of reproducibility are lost when building on closed source and non-free applications.”
“To be frank, if this were a genomics article I would recommend not publishing a purely computational methods paper when large parts of the pipeline are non-free and closed source - limiting both the reproducibility and transparency of the pipeline. Realistically though my understanding is that this is quite common in metabolomics”
“I would have indicated the paper was of more broad interest if there was at least one complete open source pipeline for data analysis”

27 Solution
Compiled all Matlab code
REMOVED PLS Toolbox analysis
Will work towards Matlab-free system in future

28 Open Source
Use all the tools for
– sharing,
– installing,
– reusing
Do not use proprietary systems
– To increase collaboration
– To increase interest and citations
– Sorry Eigenvector

THANKS!
GigaScience team:
Scott Edmunds
Peter Li
Chris Hunter
Jesse Xiao
Rob Davidson

30 Call for papers
Plant Metabolomics
Guest Edited by: Ute Roessner and Ruth Welti
Open Access - Citable Data - Integrated Tools - Signed Peer Review
– Activities of plant metabolomics consortia
– Metabolomics and physiology of plant-environment interactions
– Insights into biochemical pathways and related physiology
– Plant MS-imaging

