Software workflows as research objects & GigaGalaxy Rob L Davidson, Chris I Hunter ISI CODATA International Training Workshop on Big Data 11th March 2015 DOI: /m9.figshare
Big data! (The new oil) New dot com bubble?
Yay, we’re all unicorns! (From: “Are you recruiting a data scientist or a unicorn?”)
But why are we sad unicorns?
Measuring software reproducibility
  Systematic study: 515 papers (429 conference, 86 journal)
  <30% reproducible
Reasons for failure
  “The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”
Cost of failure
  Waste time
  Waste money
  Frustrating
  Distrust
How to fix it
The path to enlightenment
  Look to the experts (4 x 10 simple rules)
  Share code
    – Licenses
  Share environment
    – Codify the environment
  Share workflows
    – All parameters, versions, order of steps
    – GalaxyProject.org
  Share outputs
    – Share intermediate results
    – Share code for figures
    – Codify publications
Look to the experts
A word from the experts: 1
  Keep it simple
    – Don’t be a perfectionist
    – Aim for multiple versions
    – Optimise/improve later
    – Get feedback/help from community
  Hastings #1 + Prlic #5
A word from the experts: 2
  Versioning
    – Use a versioning system (e.g. Git, hosted on GitHub)
    – Allow others to know what version they use
    – Release early, release often (Linus Torvalds)
    – Get help from community
  Seemann #3, Hastings #10, Sandve #3/4
A word from the experts: 3
  Use good coding practice
    – You don’t have to be the best
    – Learn from others
    – Become involved in a community
    – Write as though others will be watching
  Prlic #2 + all of Seemann and Hastings
A word from the experts: highlight
  Start simple
  Release early
  Use versioning
  Build a community
  Get community feedback, testing, support
  …but wait, won’t that mean???
Sharing code
Sharing code
  “Scientific software…public release is then only considered around the time of publication” – Prlic #4
  “the fear of getting scooped”
    – Reality: “staking a claim in the field”
Sharing code: don’t worry
  Share early
    – Be simple
    – Don’t be a perfectionist
  CRAPL license
Sharing code: licenses
  Know your licenses
    – Apache License 2.0
    – BSD 3-Clause “New” or “Revised”
    – BSD 2-Clause “Simplified” or “FreeBSD”
    – GNU General Public License (GPL)
    – MIT
    – Mozilla Public License 2.0
    – etc.
Sharing code: repositories
  GitHub
  SourceForge
  Zenodo
  GigaDB/GigaGalaxy
  Versioning, sharing, collaboration, community feedback
Sharing environment
Your environment
  How hard would it be to start from scratch?
  What if you move from Ubuntu to CentOS?
  If it took you a while to set up your box, or if you hesitate to set it up for your colleagues…
    – Create a virtual machine or Docker image that can be shared whole.
    – Time-stamp of working system
Share your environment
  Virtual machine
    – Copy your exact environment
    – If it works for you, it works for anyone
    – Reproducibility, frozen in time
  DOI: / X-3-23
Share your environment
  Docker
    – ‘Light’ VM
    – Discrete unit of code + environment
    – Can be called like a compiled tool
  New possibilities, e.g. nucleotid.es benchmarking
    – Data-driven peer review
Share your environment
  VM = black box?
  Docker == black box!
  harmful.html
Codify your environment
  Provisioning scripts are ‘research objects’
  Improves adaptability (easier to recode for an alternative OS etc.)
  Builds in extra documentation
  Easier to share
    – although GigaDB still wants a compiled snapshot (i.e. full machine)
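A provisioning script or machine image captures the whole system. As a much lighter complement, an analysis script can at least record the environment it ran in. The following is a rough sketch using only the Python standard library; the function names and the output file name are illustrative assumptions, and this is not a substitute for a VM, Docker image or provisioning script.

```python
import json
import platform
from importlib import metadata


def describe_environment() -> dict:
    """Collect the interpreter, OS and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def write_snapshot(path: str) -> None:
    """Write the environment description next to the analysis outputs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(describe_environment(), handle, indent=2)


if __name__ == "__main__":
    # Hypothetical file name; store it alongside the results it describes.
    write_snapshot("environment_snapshot.json")
```

A file like this is easy to diff between two machines, which helps when an analysis works on one box and fails on another.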
Short list of provisioning systems
  Vagrant
  Chef
  Salt
  Puppet
  Ansible
  Many more – see link for info
Sharing workflows
Share your workflow
  Any analysis is a string of tools with a great many parameters.
  The order of the sequence, the version of each part, and the inputs and outputs are never fully explained.
  These should be shared!
  Help is at hand: there are many ‘workflow’ systems for this.
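Even without a workflow system, the order of steps, tool versions, parameters, inputs and outputs can be written down in a machine-readable form. The sketch below shows one possible layout; the `Step` class, the tool names, versions and file names are all hypothetical and only illustrate the idea.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Step:
    """One tool invocation in the analysis."""
    tool: str
    version: str
    parameters: dict
    inputs: list
    outputs: list


def workflow_to_json(name: str, steps: list) -> str:
    """Serialise the steps in execution order, numbering them from 1."""
    record = {
        "name": name,
        "steps": [dict(order=i, **asdict(step)) for i, step in enumerate(steps, start=1)],
    }
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical two-step analysis.
    steps = [
        Step("trim_reads", "1.2", {"min_quality": 20}, ["raw.fastq"], ["trimmed.fastq"]),
        Step("assemble", "2.0", {"kmer": 31}, ["trimmed.fastq"], ["contigs.fasta"]),
    ]
    print(workflow_to_json("example assembly", steps))
```

A workflow system such as Galaxy records this kind of information automatically; a hand-written record like this is a fallback when no such system is in use.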
Workflow systems
  Galaxy
  KNIME
  Taverna
  Many more…
  GigaScience uses Galaxy – galaxy.cbiit.cuhk.edu.hk
Galaxy
  Over 36,000 main Galaxy server users
  Over 1,000 papers citing Galaxy use
  Over 55 Galaxy servers deployed
  Open source
Galaxy User Interface
  Tool List
  Tool Parameters
  History/results
Galaxy: Under the hood
  python myfunction input1
  Basic XML 'wrapper'
    – Describes inputs and outputs
    – Calls command
    – Monitors for output
    – Logs/returns to 'history'
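The wrapper's four jobs (describe inputs and outputs, call the command, monitor for output, log to the history) can be imitated in a few lines of Python. This is only a conceptual sketch and not Galaxy's actual implementation: `run_wrapped_tool`, the history list and the stand-in command are assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_wrapped_tool(command, output_path, history):
    """Call a command, check that its output appeared, and log the run."""
    result = subprocess.run(command, capture_output=True, text=True)
    entry = {
        "command": command,
        "exit_code": result.returncode,
        "output": str(output_path),
        # The run counts as successful only if the expected output file exists.
        "ok": result.returncode == 0 and Path(output_path).exists(),
        "stderr": result.stderr,
    }
    history.append(entry)
    return entry


def demo():
    """Run a stand-in for `python myfunction input1` inside a temp directory."""
    history = []
    with tempfile.TemporaryDirectory() as tmp:
        input1 = Path(tmp) / "input1.txt"
        output1 = Path(tmp) / "output1.txt"
        input1.write_text("acgt\n")
        # Hypothetical tool: upper-case the input file into the output file.
        script = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        command = [sys.executable, "-c", script, str(input1), str(output1)]
        entry = run_wrapped_tool(command, output1, history)
        return entry["ok"], output1.read_text()


if __name__ == "__main__":
    print(demo())
```

In Galaxy itself the description of inputs, outputs and the command line lives in the XML wrapper rather than in Python code.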
Galaxy Workflow: visualise
Galaxy Workflow: export
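An exported workflow can be inspected outside Galaxy. The sketch below assumes the export is a JSON file with a `steps` mapping whose entries carry `id`, `name`, `tool_id` and `tool_version` fields; the sample content is a cut-down, made-up example rather than a real export, so check the field names against your own file.

```python
import json

# Illustrative, heavily reduced stand-in for an exported workflow file.
SAMPLE_EXPORT = """
{
  "a_galaxy_workflow": "true",
  "name": "Example workflow",
  "steps": {
    "0": {"id": 0, "type": "data_input", "name": "Input dataset",
          "tool_id": null, "tool_version": null},
    "1": {"id": 1, "type": "tool", "name": "Example tool",
          "tool_id": "example_tool", "tool_version": "1.0.0"}
  }
}
"""


def summarise_workflow(text: str) -> list:
    """Return (id, name, tool_id, tool_version) for each step, in step order."""
    workflow = json.loads(text)
    steps = sorted(workflow["steps"].values(), key=lambda step: step["id"])
    return [
        (step["id"], step["name"], step.get("tool_id"), step.get("tool_version"))
        for step in steps
    ]


if __name__ == "__main__":
    for row in summarise_workflow(SAMPLE_EXPORT):
        print(row)
```

A summary like this makes it easy to state, in a methods section, exactly which tool versions a published workflow used.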
Citable workflow
  Add as supplemental files, or publish with a distinct DOI via GigaDB or FigShare
Galaxy Toolshed
  Many 'omics, stats and visualisation tools!
  Download; run instantly
GigaGalaxy
  Web site: galaxy.cbiit.cuhk.edu.hk
SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk
  Implemented entire workflow in our Galaxy server, inc.:
    – 3 pre-processing steps
    – 4 SOAPdenovo modules
    – 1 post-processing step
    – Evaluation and visualization tools
  Will also be available to download by >50K Galaxy users
Can we reproduce results? SOAPdenovo2 S. aureus pipeline
Sharing outputs
Share outputs – intermediate results
  Workflow systems help with this
    – Results in history
  If a part of your analysis can’t be replicated
    – Requires a license
    – Is no longer compatible
    – Just plain won’t work
  The rest of the analysis can still be used
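When intermediate results are shared as plain files, a checksum manifest lets others confirm they are starting from the same data at any step. This is a minimal sketch using only the standard library; the directory name and the manifest layout are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path) -> str:
    """Hash a file in chunks so large intermediate files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory) -> dict:
    """Map each file under the directory (relative path) to its SHA-256."""
    directory = Path(directory)
    return {
        path.relative_to(directory).as_posix(): sha256_of(path)
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


if __name__ == "__main__":
    # Hypothetical directory holding the outputs of each analysis step.
    print(json.dumps(build_manifest("intermediate_results"), indent=2))
```

If one step can no longer be rerun, someone can still verify the shared output of that step and continue the analysis from there.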
Share outputs – code for figures
  Data transform for figures
    – Remove points?
    – 3D: choose ‘best angle’?
    – PCA: choose ‘best components’?
  Figure choice
    – Bar chart or box & whisker?
  Allow reinterpretation!!!
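Sharing the code behind a figure makes choices such as removed points visible and reversible. The sketch below shows one way to keep such a choice explicit; the function name, the threshold parameter and the data are hypothetical, and plotting itself is left out.

```python
def prepare_for_plot(values, max_value=None):
    """Split values into those kept for the figure and those removed.

    The threshold is an explicit parameter, so a reader can rerun the
    figure with a different cut-off or with no points removed at all.
    """
    kept = [v for v in values if max_value is None or v <= max_value]
    removed = [v for v in values if max_value is not None and v > max_value]
    return {"kept": kept, "removed": removed, "max_value": max_value}


if __name__ == "__main__":
    # Hypothetical measurements with one extreme point.
    print(prepare_for_plot([1, 2, 50], max_value=10))
    print(prepare_for_plot([1, 2, 50]))
```

Returning the removed points alongside the kept ones means the exclusion is recorded in the output rather than hidden in the figure.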
Share outputs – codify publication
  “This article is an example of a literate programming document. It has been created in R using the knitr package. Figures and tables in this paper are generated dynamically as the document is compiled. Several R packages are required to run the analysis. Materials are archived in the Gigascience database”
  DOI: / X-3-3
Literate coding options
  See listing: 3/1/19
    – R: knitr, Sweave, R Markdown
    – JavaScript: Tangle, Active Markdown (CoffeeScript)
    – Python: IPython Notebooks
    – iReport links this functionality for Galaxy
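The core idea of the tools listed above, text whose numbers are computed when the document is built, can be shown with the Python standard library alone. This is a toy stand-in and not how knitr or IPython Notebooks work internally; the sentence template and the data are hypothetical.

```python
from statistics import mean
from string import Template

# Hypothetical measurements; in a real paper these would be loaded from the archived data.
MEASUREMENTS = [4.1, 3.9, 4.4, 4.0]

# The sentence contains placeholders rather than hand-typed numbers.
TEMPLATE = Template("We measured $n samples; the mean value was $mean_value.")


def render_report(values) -> str:
    """Fill the placeholders from the data each time the document is built."""
    return TEMPLATE.substitute(n=len(values), mean_value=f"{mean(values):.2f}")


if __name__ == "__main__":
    print(render_report(MEASUREMENTS))
```

Because the text is regenerated from the data, a corrected dataset updates every reported number without manual editing.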
SUMMARY
The path to enlightenment
  Look to the experts (4 x 10 simple rules)
  Share code
    – Licenses
  Share environment
    – Codify the environment
  Share workflows
    – All parameters, versions, order of steps
    – GalaxyProject.org
  Share outputs
    – Share intermediate results
    – Share code for figures
    – Codify publications
All Your Research Objects
  Project proposal
  Project experimental SOPs
  Images of equipment, subjects, conditions
  RAW data
  Meta-data
  Analysis code, parameters, pipelines
  Analysis environment, VM or provisioning script
  Intermediate results
  Publication figures/images/tables: codify
  Publication text
@gigascience
facebook.com/GigaScience
Scott Edmunds, Peter Li, Chris Hunter, Rob Davidson, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC)
galaxy.cbiit.cuhk.edu.hk
blogs.biomedcentral.com/gigablog/