Software workflows as research objects & GigaGalaxy. Rob L Davidson, Chris I Hunter. ISI CODATA International Training Workshop on Big Data, 11th March 2015.


1 Software workflows as research objects & GigaGalaxy
Rob L Davidson, Chris I Hunter
ISI CODATA International Training Workshop on Big Data, 11th March 2015
DOI: 10.6084/m9.figshare.1330219

2 Article: http://econ.st/1o12gCN

3

4 Big data! (The new oil) Article: http://bit.ly/1AN8ysJ

5 Source: @flowchainsensei

6 Article: http://bit.ly/1xdCxbY

7 Article: http://bit.ly/1Mdll03

8 Yay, we’re all unicorns! Are you recruiting a data scientist or a unicorn? http://ubm.io/1Gpxizh

9 But why are we sad unicorns? Source: http://bit.ly/1MdA8rI

10 Measuring software reproducibility
515 papers (429 conference, 86 journal)
<30% reproducible
http://reproducibility.cs.arizona.edu

11 Measuring software reproducibility

12 Reasons for failure
“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”

13 Cost of failure
Waste time
Waste money
Frustrating
Distrust

14 How to fix it

15 The path to enlightenment
A word from the experts (4 x 10 simple rules)
Share code
– Licenses
Share environment
– Codify the environment
Share workflows
– All parameters, versions, order of steps
– GalaxyProject.org
Share outputs
– Share intermediate results
– Share code for figures
– Codify publications

16 A word from the experts

17 A word from the experts

18 A word from the experts: 1
Keep it simple
– Don’t be a perfectionist
– Aim for multiple versions
– Optimise/improve later
– Get feedback/help from community
Hastings #1 + Prlić #5

19 A word from the experts: 2
Versioning
– Use a version control system (e.g. Git, hosted on GitHub)
– Let others know which version they are using
– Release early, release often (Linus Torvalds)
– Get help from community
Seemann #3, Hastings #10, Sandve #3/4
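One way to let users know which version they are running is to expose it on the command line. The following is a minimal sketch using Python's standard `argparse`; the tool name `mytool` and the version number are hypothetical placeholders, not something taken from the talk.

```python
import argparse

# Hypothetical tool name and version, for illustration only.
TOOL_NAME = "mytool"
__version__ = "0.1.0"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --version flag reports the release in use."""
    parser = argparse.ArgumentParser(prog=TOOL_NAME)
    parser.add_argument(
        "--version",
        action="version",  # prints the version string and exits
        version=f"%(prog)s {__version__}",
    )
    parser.add_argument("input", help="path to the input file")
    return parser


def version_string() -> str:
    """Return the same string that --version prints, for logs and reports."""
    return f"{TOOL_NAME} {__version__}"
```

Writing `version_string()` into every log or output file means a result can later be traced back to the release that produced it.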

20 A word from the experts: 3
Use good coding practice
– You don’t have to be the best
– Learn from others
– Become involved in a community
– Write as though others will be watching
Prlić #2 + all of Seemann and Hastings

21 A word from the experts: highlight
Start simple
Release early
Use versioning
Build a community
Get community feedback, testing, support
…but wait, won’t that mean???

22 Sharing code

23 Sharing code
“Scientific software…public release is then only considered around the time of publication” – Prlić #4
“the fear of getting scooped”
– Reality: “staking a claim in the field”

24 Sharing code: don’t worry
Share early
– Be simple
– Don’t be a perfectionist
CRAPL license
Source: http://matt.might.net/articles/crapl/

25 Sharing code: licenses
Know your licenses
– Apache License 2.0
– BSD 3-Clause “New” or “Revised”
– BSD 2-Clause “Simplified” or “FreeBSD”
– GNU General Public License (GPL)
– MIT
– Mozilla Public License 2.0
– etc.
Source: http://opensource.org/licenses

26 Sharing code: repositories
GitHub
SourceForge
Zenodo
GigaDB/GigaGalaxy
Versioning, sharing, collaboration, community feedback

27 Sharing environment

28 Your environment
How hard would it be to start from scratch?
What if you move from Ubuntu to CentOS?
If it took you a while to set up your box, or if you hesitate to set it up for your colleagues…
– Create a virtual machine or Docker image that can be shared whole
– Time-stamp of working system

29 Share your environment
Virtual machine
– Copy your exact environment
– If it works for you, it works for anyone
– Reproducibility, frozen in time

30 Share your environment
Docker
– ‘Light’ VM
– Discrete unit of code + environment
– Can be called like a compiled tool
New possibilities, e.g. nucleotid.es benchmarking
– Data-driven peer review

31 Share your environment
VM = black box? Docker == black box!
http://ivory.idyll.org/blog/vms-considered-harmful.html

32 Codify your environment
Provisioning scripts are ‘research objects’
Improves adaptability (easier to recode for an alternative OS etc.)
Builds in extra documentation
Easier to share
– although GigaDB still wants a compiled snapshot (i.e. full machine)
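As a small illustration of a "time-stamp of a working system", the sketch below records the operating system, Python version and installed Python packages as JSON. It is not a provisioning script and does not replace a VM or Docker image; it only captures the Python side of an environment, and the field names are assumptions chosen for this example.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def snapshot_environment() -> dict:
    """Record when, where and with which packages an analysis was run."""
    packages = sorted(
        f"{dist.metadata.get('Name')}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata.get("Name")  # skip entries with broken metadata
    )
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": packages,
    }


def snapshot_as_json() -> str:
    """Serialise the snapshot so it can be shared alongside the results."""
    return json.dumps(snapshot_environment(), indent=2)
```

A file like this can be committed next to the analysis code, so that a later reader can at least see which package versions the working system had.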

33 List of provisioning systems
Vagrant
Chef
Salt
Ansible

34 Sharing pipelines

35 Share your pipeline
Any analysis is a string of tools with a great many parameters
The order of the sequence, the version of each part and the inputs and outputs are never fully explained
These should be shared!
Help is at hand: there are many ‘workflow’ systems for this
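To make concrete what "fully explained" could look like, here is a minimal sketch that writes down the order of steps, each tool's version, its parameters and its inputs and outputs as JSON. The tool names, versions, parameters and file names are all hypothetical; a workflow system such as Galaxy records this kind of information for you.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One tool invocation in a pipeline."""
    tool: str
    version: str
    parameters: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


# Hypothetical two-step pipeline, for illustration only.
EXAMPLE_STEPS = [
    Step("trim_reads", "1.2.0", {"min_quality": 20},
         inputs=["reads.fastq"], outputs=["trimmed.fastq"]),
    Step("align_reads", "0.7.1", {"threads": 4},
         inputs=["trimmed.fastq"], outputs=["aligned.bam"]),
]


def describe_pipeline(steps) -> str:
    """Serialise the steps in execution order, numbering them from 1."""
    ordered = [{"order": i, **asdict(s)} for i, s in enumerate(steps, start=1)]
    return json.dumps(ordered, indent=2, sort_keys=True)
```

Calling `describe_pipeline(EXAMPLE_STEPS)` yields a document that can be shared as a supplementary file alongside the results.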

36 Workflow systems
Galaxy
KNIME
Taverna
Many more…
GigaScience uses Galaxy
– galaxy.cbiit.cuhk.edu.hk

37 Galaxy
Over 36,000 main Galaxy server users
Over 1,000 papers citing Galaxy use
Over 55 Galaxy servers deployed
Open source
http://galaxyproject.org

38 Galaxy User Interface
Tool List
Tool Parameters
History/results

39 Galaxy: Under the hood
python myfunction input1
Basic XML 'wrapper'
Describes inputs and outputs
Calls command
Monitors for output
Logs/returns to 'history'
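The four wrapper duties on this slide (describe inputs and outputs, call the command, monitor for output, log to a history) can be sketched in plain Python. This is not Galaxy's code and not its XML format; it is an illustrative stand-in in which the "tool" is a tiny script that upper-cases a file, and the history is an ordinary list.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for "python myfunction input1": upper-cases the input file.
TOOL_SOURCE = """
import sys
with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
    dst.write(src.read().upper())
"""


def run_wrapped_tool(input_path: Path, output_path: Path, history: list) -> bool:
    """Call the command, check that the output appeared, and log the run."""
    command = [sys.executable, "-c", TOOL_SOURCE, str(input_path), str(output_path)]
    result = subprocess.run(command, capture_output=True, text=True)
    ok = result.returncode == 0 and output_path.exists()
    history.append({
        "input": str(input_path),
        "output": str(output_path),
        "ok": ok,
        "stderr": result.stderr,
    })
    return ok


def demo():
    """Run the stand-in tool once in a temporary directory."""
    history = []
    with tempfile.TemporaryDirectory() as workdir:
        source = Path(workdir) / "input1.txt"
        target = Path(workdir) / "output1.txt"
        source.write_text("acgt")
        run_wrapped_tool(source, target, history)
        return target.read_text(), history
```

The point of the separation is that the tool itself stays an ordinary command-line program; everything the workflow system needs lives in the wrapper.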

40 Galaxy Workflow: visualise

41 Galaxy Workflow: visualise

42 Galaxy Workflow: visualise

43 Galaxy Workflow: export
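An exported workflow is a file that can be inspected outside Galaxy. The sketch below assumes the export is JSON with a `steps` mapping whose entries carry `name`, `tool_id` and `tool_version`; that layout and the example content are assumptions made for illustration, so check them against a real export before relying on them.

```python
import json

# Hypothetical export content, shaped like the assumed layout described above.
EXAMPLE_EXPORT = json.dumps({
    "a_galaxy_workflow": "true",
    "name": "Example workflow",
    "steps": {
        "0": {"name": "Input dataset", "type": "data_input",
              "tool_id": None, "tool_version": None},
        "1": {"name": "Hypothetical filter", "type": "tool",
              "tool_id": "hypothetical_filter", "tool_version": "1.0.0"},
    },
})


def summarise_workflow(text: str) -> list:
    """List each step in order with the tool and version it uses."""
    workflow = json.loads(text)
    steps = workflow.get("steps", {})
    lines = []
    for key in sorted(steps, key=int):  # step keys are numeric strings
        step = steps[key]
        tool = step.get("tool_id") or "(no tool)"
        version = step.get("tool_version") or "-"
        lines.append(f"{int(key) + 1}. {step.get('name', '?')} [{tool} {version}]")
    return lines
```

A summary like this is a quick way for a reviewer to check the order of steps and tool versions without opening the workflow editor.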

44 Citable workflow
Add as supplemental files or publish with a distinct DOI via GigaDB or FigShare

45 Galaxy Toolshed
https://toolshed.g2.bx.psu.edu/
Many 'omics, stats, visualisations
2700+ tools!
Download; run instantly

46 GigaGalaxy
Web site: galaxy.cbiit.cuhk.edu.hk

47 Sharing outputs

48 Share outputs – intermediate results
Workflow systems help with this
If a part of your analysis can’t be replicated
– Requires a license
– Is no longer compatible
– Just plain won’t work
The rest of the analysis can still be used
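The idea that shared intermediate results keep the rest of an analysis usable can be sketched as follows. The step names and data are hypothetical, and this is only an illustration of the principle, not how any particular workflow system implements it: when a step cannot be rerun, its shared output is used in its place and the substitution is recorded.

```python
def run_with_intermediates(steps, data, shared):
    """Run steps in order; fall back to a shared result if a step fails.

    steps  -- list of (name, function) pairs
    shared -- dict mapping step name to a previously shared output
    """
    provenance = []
    for name, func in steps:
        try:
            data = func(data)
            provenance.append((name, "recomputed"))
        except Exception:
            if name not in shared:
                raise  # nothing was shared for this step, so we are stuck
            data = shared[name]
            provenance.append((name, "reused shared result"))
    return data, provenance


# Hypothetical three-step analysis; the middle step needs a license.
def double(values):
    return [2 * v for v in values]


def licensed_step(values):
    raise RuntimeError("license required")


def total(values):
    return sum(values)


EXAMPLE_STEPS = [("double", double), ("licensed_step", licensed_step), ("total", total)]
EXAMPLE_SHARED = {"licensed_step": [10, 20]}
```

With these examples, `run_with_intermediates(EXAMPLE_STEPS, [1, 2], EXAMPLE_SHARED)` still reaches the final step, and the provenance list shows which result was reused rather than recomputed.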

49 Share outputs – code for figures

50 Share outputs – codify publication
knitr, e.g. http://www.gigasciencejournal.com/content/3/1/3

51 Literate coding options
See listing: http://www.gigasciencejournal.com/content/3/1/19
– R: knitr, Sweave, R Markdown
– JavaScript: Tangle, Active Markdown (CoffeeScript)
– Python: IPython Notebooks
– iReport links this functionality for Galaxy

52 Research objects
Project proposal
Project experimental SOPs
Images of equipment, subjects, conditions
Raw data
Metadata
Analysis code, parameters, pipelines
Analysis environment, VM or provisioning script
Intermediate results
Publication figures/images/tables: codify
Publication text
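When many kinds of research object are shared together, a checksum manifest lets others confirm they received exactly the same files. The sketch below is one simple way to build such a manifest with SHA-256; the directory layout and file names in `demo()` are hypothetical, and this is not a GigaDB or FigShare requirement described in the talk.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large data files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root) -> dict:
    """Map every file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def demo() -> dict:
    """Build a manifest for a tiny, made-up project directory."""
    with tempfile.TemporaryDirectory() as workdir:
        root = Path(workdir)
        (root / "data").mkdir()
        (root / "analysis").mkdir()
        (root / "data" / "raw.csv").write_bytes(b"a,b\n1,2\n")
        (root / "analysis" / "run.py").write_bytes(b"print('hi')\n")
        return build_manifest(root)
```

Publishing the manifest alongside the files means any later change to data, code or figures shows up as a changed digest.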

53 @gigascience
facebook.com/GigaScience
Scott Edmunds, Peter Li, Chris Hunter, Rob Davidson, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC)
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
blogs.biomedcentral.com/gigablog/

