Software workflows as research objects & GigaGalaxy. Rob L Davidson, Chris I Hunter. ISI CODATA International Training Workshop on Big Data, 11th March 2015.


1 Software workflows as research objects & GigaGalaxy
Rob L Davidson, Chris I Hunter
ISI CODATA International Training Workshop on Big Data, 11th March 2015
DOI: 10.6084/m9.figshare.1330219

2 Article: http://econ.st/1o12gCN

3

4 Big data! (The new oil)
New dot com bubble?
Article: http://bit.ly/1AN8ysJ

5 Source: @flowchainsensei

6 Article: http://bit.ly/1xdCxbY

7 Article: http://bit.ly/1Mdll03

8 Yay, we’re all unicorns!
From: “Are you recruiting a data scientist or a unicorn?” http://ubm.io/1Gpxizh

9 But why are we sad unicorns?

10 Measuring software reproducibility
Systematic study: 515 papers (429 conference, 86 journal)
<30% reproducible
http://reproducibility.cs.arizona.edu

11 Measuring software reproducibility
http://reproducibility.cs.arizona.edu

12 Reasons for failure
“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”
http://reproducibility.cs.arizona.edu

13 Cost of failure
Waste time
Waste money
Frustrating
Distrust

14 How to fix it

15 The path to enlightenment
Look to the experts (4 x 10 simple rules)
Share code
  – licenses
Share environment
  – Codify the environment
Share workflows
  – All parameters, versions, order of steps
  – GalaxyProject.org
Share outputs
  – Share intermediate results
  – Share code for figures
  – Codify publications

16 Look to the experts

17 Look to the experts

18 A word from the experts: 1
Keep it simple
  – Don’t be a perfectionist
  – Aim for multiple versions
  – Optimise/improve later
  – Get feedback/help from community
Hastings #1 + Prlic #5

19 A word from the experts: 2
Versioning
  – Use a versioning system (e.g. Git, hosted on GitHub)
  – Allow others to know what version they use
  – Release early, release often (Linus Torvalds)
  – Get help from community
Seemann #3, Hastings #10, Sandve #3/4

20 A word from the experts: 3
Use good coding practice
  – You don’t have to be the best
  – Learn from others
  – Become involved in a community
  – Write as though others will be watching
Prlic #2 + all of Seemann and Hastings

21 A word from the experts: highlight
Start simple
Release early
Use versioning
Build a community
Get community feedback, testing, support
…but wait, won’t that mean???

22 Sharing code

23 Sharing code
“Scientific software…public release is then only considered around the time of publication” – Prlic #4
“the fear of getting scooped”
  – Reality: “staking a claim in the field”

24 Sharing code: don’t worry
Share early
  – Be simple
  – Don’t be a perfectionist
CRAPL license
Source: http://matt.might.net/articles/crapl/

25 Sharing code: licenses
Know your licenses
  – Apache License 2.0
  – BSD 3-Clause “New” or “Revised”
  – BSD 2-Clause “Simplified” or “FreeBSD”
  – GNU General Public License (GPL)
  – MIT
  – Mozilla Public License 2.0
  – etc.
Source: http://opensource.org/licenses

26 Sharing code: repositories
GitHub
SourceForge
Zenodo
GigaDB/GigaGalaxy
Versioning, sharing, collaboration, community feedback

27 Sharing environment

28 Your environment
How hard would it be to start from scratch?
What if you move from Ubuntu to CentOS?
If it took you a while to set up your box, if you hesitate to set it up for your colleagues…
  – Create a virtual machine or Docker image that can be shared whole.
  – Time-stamp of working system

29 Share your environment
Virtual machine
  – Copy your exact environment
  – If it works for you, it works for anyone
  – Reproducibility, frozen in time
DOI: 10.1186/2047-217X-3-23

30 Share your environment
Docker
  – ‘light’ VM
  – Discrete unit of code+environment
  – Can be called like a compiled tool
New possibilities e.g. nucleotid.es benchmarking
  – Data-driven peer-review
http://nucleotid.es/

31 Share your environment
VM = black box?
Docker == black box!
http://ivory.idyll.org/blog/vms-considered-harmful.html

32 Codify your environment
Provisioning scripts are ‘research objects’
Improves adaptability (easier to recode for alternative OS etc.)
Builds in extra documentation
Easier to share
  – although GigaDB still wants a compiled snapshot (i.e. full machine)
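
As a small illustration of codifying an environment, the sketch below records a time-stamped description of the running Python environment as JSON. It is not a substitute for a virtual machine, a Docker image or a provisioning script, since it captures only Python-level details; the function name `snapshot_environment` and the field names are assumptions made for this example.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect a time-stamped description of the current Python environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installs whose metadata is missing or broken
            packages[name] = dist.version
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def snapshot_as_json() -> str:
    """Serialise the snapshot so it can be archived next to the results."""
    return json.dumps(snapshot_environment(), indent=2)
```

The resulting JSON file can be committed alongside the analysis code, giving a dated record of a working system.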

33 Short list of provisioning systems
Vagrant
Chef
Salt
Puppet
Ansible
Many more – see link for info
http://bit.ly/1wrYiuI

34 Sharing workflows

35 Share your workflow
Any analysis is a string of tools with a great many parameters.
The order of the sequence, the version of each part and the inputs and outputs are never fully explained.
These should be shared!
Help is at hand: there are many ‘workflow’ systems for this.
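
To make concrete what a shared workflow needs to contain, here is a minimal sketch of a record holding the order of steps, the version of each tool, its parameters, and its inputs and outputs. It is not the format used by Galaxy or any other workflow system; the classes `Step` and `Workflow` and the tool names in the usage note are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    tool: str          # name of the tool that was run
    version: str       # exact version of that tool
    parameters: dict   # every parameter, including defaults that were relied on
    inputs: list       # files read by this step
    outputs: list      # files written by this step


@dataclass
class Workflow:
    name: str
    steps: list = field(default_factory=list)  # kept in execution order

    def add(self, step: Step) -> "Workflow":
        self.steps.append(step)
        return self

    def external_inputs(self) -> list:
        """Inputs that no earlier step produced, i.e. data that must be shared too."""
        produced, external = set(), []
        for step in self.steps:
            for item in step.inputs:
                if item not in produced and item not in external:
                    external.append(item)
            produced.update(step.outputs)
        return external

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

For example, a two-step record could be built with `Workflow("demo").add(Step("trimmer", "1.2", {"min_quality": 20}, ["reads.fastq"], ["trimmed.fastq"]))` followed by a second `add` for an assembler step; `external_inputs()` then lists the raw data that has to be shared with the workflow.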

36 Workflow systems
Galaxy
KNIME
Taverna
Many more…
GigaScience uses Galaxy – galaxy.cbiit.cuhk.edu.hk

37 Galaxy
Over 36,000 main Galaxy server users
Over 1,000 papers citing Galaxy use
Over 55 Galaxy servers deployed
Open source
http://galaxyproject.org

38 Galaxy User Interface
Tool List
Tool Parameters
History/results

39 Galaxy: Under the hood
python myfunction input1
Basic XML 'wrapper'
Describe inputs and outputs
Calls command
Monitors for output
Logs/returns to 'history'
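
The wrapper idea on this slide (describe inputs and outputs, call the command, monitor for output, log to a history) can be sketched in a few lines of Python. This is an illustration of the pattern only, not Galaxy's implementation or its XML schema; `run_tool`, `demo` and the inline upper-casing script that stands in for `python myfunction input1` are all assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tool(command: list, output: Path, history: list) -> dict:
    """Call a command, check that its declared output appeared, log the result."""
    result = subprocess.run(command, capture_output=True, text=True)
    entry = {
        "command": command,
        "exit_code": result.returncode,
        "output": str(output),
        "ok": result.returncode == 0 and output.exists(),
    }
    history.append(entry)  # the 'history' is just a list in this sketch
    return entry


def demo() -> dict:
    """Run a stand-in tool that upper-cases a small input file."""
    history = []
    with tempfile.TemporaryDirectory() as workdir:
        infile = Path(workdir) / "input1.txt"
        outfile = Path(workdir) / "output1.txt"
        infile.write_text("acgt\n")
        # Stand-in for the real tool: read argv[1], write upper case to argv[2].
        script = (
            "import sys; "
            "open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        )
        command = [sys.executable, "-c", script, str(infile), str(outfile)]
        entry = run_tool(command, outfile, history)
        entry["content"] = outfile.read_text()
    return entry
```

A real wrapper would also record the tool version and parameters in each history entry, so that the run can be repeated later.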

40 Galaxy Workflow: visualise

41 Galaxy Workflow: visualise

42 Galaxy Workflow: visualise

43 Galaxy Workflow: export

44 Citable workflow
Add as supplemental files or publish with distinct DOI via GigaDB or FigShare

45 Galaxy Toolshed
https://toolshed.g2.bx.psu.edu/
Many 'omics, stats, visualisations
2700+ tools!
Download; Run instantly

46 GigaGalaxy
Web Site: galaxy.cbiit.cuhk.edu.hk

47 SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk

48 SOAPdenovo2 workflows implemented in
Implemented entire workflow in our Galaxy server, inc.:
  – 3 pre-processing steps
  – 4 SOAPdenovo modules
  – 1 post-processing step
  – Evaluation and visualization tools
Also will be available to download by >50K Galaxy users in galaxy.cbiit.cuhk.edu.hk

49 Can we reproduce results?
SOAPdenovo2 S. aureus pipeline

50 Sharing outputs

51 Share outputs – intermediate results
Workflow systems help with this
  – Results in history
If a part of your analysis can’t be replicated
  – Requires a license
  – Is no longer compatible
  – Just plain won’t work
The rest of the analysis can still be used
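
The point about intermediate results can be sketched as follows: each step saves its output, and when a step can no longer be run, the saved copy lets the rest of the analysis proceed. This is a simplified illustration under assumed names (`run_step`, `normalise`, `unavailable`, `summarise`), not how any particular workflow system stores its history.

```python
import json
import tempfile
from pathlib import Path


def run_step(name, func, data, store: Path):
    """Run one step and save its result; reuse the shared copy if it fails."""
    path = store / f"{name}.json"
    try:
        result = func(data)
    except Exception:
        if not path.exists():
            raise  # no shared intermediate result to fall back on
        return json.loads(path.read_text())
    path.write_text(json.dumps(result))
    return result


def normalise(values):
    top = max(values)
    return [v / top for v in values]


def unavailable(values):
    # Stand-in for a tool that needs a licence or no longer installs.
    raise RuntimeError("tool cannot be run on this machine")


def summarise(values):
    return {"mean": sum(values) / len(values)}


def demo():
    """First run saves intermediates; second run survives a broken step."""
    with tempfile.TemporaryDirectory() as workdir:
        store = Path(workdir)
        data = [1, 2, 4]
        normalised = run_step("normalised", normalise, data, store)
        first = run_step("summary", summarise, normalised, store)
        recovered = run_step("normalised", unavailable, data, store)
        second = run_step("summary", summarise, recovered, store)
    return first, second
```

In `demo`, the second pass reaches the same summary even though the normalisation tool is unavailable, because its earlier output was kept.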

52 Share outputs – code for figures
Data transform for figures
  – Remove points?
  – 3D: choose ‘best angle’?
  – PCA: choose ‘best components’?
Figure choice
  – Bar chart or box&whisker?
Allow reinterpretation!!!
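
Sharing the code behind a figure makes choices such as removed points explicit and lets a reader redraw the figure differently. The sketch below, using only the standard library, keeps the data transform and both summaries (bar chart and box-and-whisker) as functions; the function names and the `max_value` cut-off are assumptions for illustration.

```python
import statistics


def prepare_figure_data(values, max_value=None):
    """Apply the figure's data transform and report which points were dropped."""
    kept = [v for v in values if max_value is None or v <= max_value]
    removed = [v for v in values if max_value is not None and v > max_value]
    return {"kept": kept, "removed": removed}


def bar_summary(values):
    """What a bar chart would show: a single mean per group."""
    return {"mean": statistics.mean(values)}


def box_summary(values):
    """What a box-and-whisker plot would show: the five-number summary."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }
```

Because the removed points are returned rather than silently discarded, a reader can rerun the summaries with or without them and judge the effect of that choice.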

53 Share outputs – codify publication
“This article is an example of a literate programming document. It has been created in R using the knitr package. Figures and tables in this paper are generated dynamically as the document is compiled. Several R packages are required to run the analysis. Materials are archived in the Gigascience database”
DOI: 10.1186/2047-217X-3-3

54 Literate coding options
See listing: http://www.gigasciencejournal.com/content/3/1/19
  – R: knitr, Sweave, R Markdown
  – JavaScript: Tangle, Active Markdown (CoffeeScript)
  – Python: IPython Notebooks
  – iReport links this functionality for Galaxy
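
The core idea behind these literate tools is that numbers in the text are computed when the document is built rather than typed by hand. The sketch below shows that idea with Python's standard `string.Template`; it is far simpler than knitr or a notebook, and the template text and function name are invented for this example.

```python
import statistics
from string import Template

# Report text with placeholders that are filled from the data at build time.
REPORT = Template(
    "We measured $n samples. The mean value was $mean "
    "(standard deviation $sd)."
)


def render_report(samples) -> str:
    """Fill the text from the data so the numbers cannot drift from the analysis."""
    return REPORT.substitute(
        n=len(samples),
        mean=f"{statistics.mean(samples):.2f}",
        sd=f"{statistics.stdev(samples):.2f}",
    )
```

If the data change, rerunning `render_report` regenerates the sentence, which is the behaviour the quoted article describes for its figures and tables.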

55 SUMMARY

56 The path to enlightenment
Look to the experts (4 x 10 simple rules)
Share code
  – licenses
Share environment
  – Codify the environment
Share workflows
  – All parameters, versions, order of steps
  – GalaxyProject.org
Share outputs
  – Share intermediate results
  – Share code for figures
  – Codify publications

57 All Your Research Objects
Project proposal
Project experimental SOPs
Images of equipment, subjects, conditions
RAW data
Meta-data
Analysis code, parameters, pipelines
Analysis environment, VM or provisioning script
Intermediate results
Publication figures/images/tables: codify
Publication text

58 @gigascience
facebook.com/GigaScience
Scott Edmunds
Peter Li
Chris Hunter
Rob Davidson
Jesse Si Zhe
Nicole Nogoy
Laurie Goodman
Amye Kenall (BMC)
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
blogs.biomedcentral.com/gigablog/

59 DOI: 10.6084/m9.figshare.1330219

