Software workflows as research objects Rob L Davidson, Chris I Hunter ISI CODATA International Training Workshop on Big Data 11th March 2015 Slideshow-URL.


1 Software workflows as research objects Rob L Davidson, Chris I Hunter ISI CODATA International Training Workshop on Big Data 11th March 2015 Slideshow-URL

2 Article: http://econ.st/1o12gCN

3

4 Big data! (The new oil) Article: http://bit.ly/1AN8ysJ

5 Source: @flowchainsensei

6 Article: http://bit.ly/1xdCxbY

7 Article: http://bit.ly/1Mdll03

8 Yay, we’re all unicorns! Are you recruiting a data scientist or a unicorn? http://ubm.io/1Gpxizh

9 Source: http://bit.ly/1MdA8rI But why are we sad unicorns?

10 Measuring software reproducibility 515 papers (429 conference, 86 journal) <30% reproducible http://reproducibility.cs.arizona.edu

11 Measuring software reproducibility

12 Reasons for failure “The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”

13 Cost of failure Waste time Waste money Frustrating Distrust

14 How to fix it

15 The path to enlightenment A word from the experts (4 x 10 simple rules) Share code – licenses Share environment – Codify the environment Share workflows – All parameters, versions, order of steps – GalaxyProject.org Share outputs – Share intermediate results – Share code for figures – Codify publications Slideshow URL……………..

16 A word from the experts

17

18 A word from the experts: 1 Keep it simple – Don’t be a perfectionist – Aim for multiple versions – Optimise/improve later – Get feedback/help from community Hastings #1 + Prlic # 5

19 A word from the experts: 2 Versioning – Use a version control system (e.g. Git, hosted on GitHub) – Let others know which version they are using – Release early, release often (Linus Torvalds) – Get help from community Seemann #3, Hastings #10, Sandve #3/4

20 A word from the experts: 3 Use good coding practice – You don’t have to be the best – Learn from others – Become involved in a community – Write as though others will be watching Prlic #2 + all of Seemann and Hastings

21 A word from the experts: highlight Start simple Release early Use versioning Build a community Get community feedback, testing, support …but wait, won’t that mean???

22 Sharing code

23 “Scientific software…public release is then only considered around the time of publication” – Prlic #4 “the fear of getting scooped” – Reality: “staking a claim in the field”

24 Sharing code: don’t worry Share early – Be simple – Don’t be perfectionist CRAPL license Source: http://matt.might.net/articles/crapl/

25 Sharing code: licenses Know your licenses – Apache License 2.0 – BSD 3-Clause “New” or “Revised” – BSD 2-Clause “simplified” or “FreeBSD” – GNU (GPL) – MIT – Mozilla Public License 2.0 – etc Source: http://opensource.org/licenses

26 Sharing code: repositories GitHub SourceForge Zenodo GigaDB/GigaGalaxy Versioning, sharing, collaboration, community feedback

27 Sharing environment

28 Your environment How hard would it be to start from scratch? What if you move from Ubuntu to CentOS? If it took you a while to set up your box, or if you hesitate to set it up for your colleagues… – Create a virtual machine or Docker image that can be shared whole. – Time-stamp of working system

29 Share your environment Virtual machine – Copy your exact environment – If it works for you, it works for anyone – Reproducibility, frozen in time

30 Share your environment Docker – ‘light’ VM – Discrete unit of code+environment – Can be called like a compiled tool New possibilities e.g. nucleotid.es benchmarking – Data-driven peer-review

31 Share your environment VM = black box? Docker == black box! http://ivory.idyll.org/blog/vms-considered-harmful.html

32 Codify your environment Provisioning scripts are ‘research objects’ Improves adaptability (easier to recode for alternative OS etc) Builds in extra documentation Easier to share – although GigaDB still wants a compiled snapshot (i.e. full machine)
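A provisioning script is one way to codify an environment. A much lighter complement, sketched here with the Python standard library only, is a time-stamped record of what the working system looked like. The function name and the keys of the record are assumptions made for this illustration, and the record is not a substitute for a VM, a Docker image or a real provisioning script.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def snapshot_environment():
    """Collect a time-stamped description of the current Python environment."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with unreadable metadata
            versions[name] = dist.version
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": [f"{name}=={version}" for name, version in sorted(versions.items())],
    }


snapshot = snapshot_environment()
# The JSON text can be committed alongside the analysis code.
text = json.dumps(snapshot, indent=2)
print(snapshot["captured_at"], snapshot["python"], len(snapshot["packages"]), "packages")
```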

33 List of provisioning systems Vagrant Chef Salt Ansible

34 Sharing pipelines

35 Share your pipeline Any analysis is a string of tools with a great many parameters The order of the sequence, the version of each part and the inputs and outputs are never fully explained These should be shared! Help is at hand: there are many ‘workflow’ systems for this
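Even without a workflow system, the order of steps, the version of each tool, its parameters and the identity of the input can be written down in a machine-readable form. The following is a minimal sketch of such a record; the tool names, version numbers and manifest layout are invented for the example.

```python
import hashlib
import json


def sha256_of(data):
    """Checksum that identifies exactly which input bytes were analysed."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(steps, input_bytes):
    """Record step order, tool versions and parameters for one analysis run."""
    return {
        "input_sha256": sha256_of(input_bytes),
        "steps": [
            {
                "order": order,
                "tool": step["tool"],
                "version": step["version"],
                "parameters": step["parameters"],
            }
            for order, step in enumerate(steps, start=1)
        ],
    }


# Hypothetical two-step pipeline; these tools and versions are placeholders.
steps = [
    {"tool": "trim_reads", "version": "1.2.0", "parameters": {"min_quality": 20}},
    {"tool": "align_reads", "version": "0.7.1", "parameters": {"threads": 4}},
]
manifest = build_manifest(steps, b"ACGT\n")
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Workflow systems such as those listed on the next slide capture this kind of information automatically.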

36 List of workflow systems Galaxy KNIME Taverna

37 Galaxy Over 36,000 main Galaxy server users Over 1,000 papers citing Galaxy use Over 55 Galaxy servers deployed Open source http://galaxyproject.org

38 Galaxy User Interface Tool List Tool Parameters History/results

39 Galaxy: Under the hood python myfunction input1 Basic xml 'wrapper' Describe inputs and outputs Calls command Monitors for output Logs/returns to 'history'
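The wrapper idea on this slide (describe inputs and outputs, call the command, watch for the output, log to a history) can be imitated in a few lines of Python. This is not Galaxy's code or its XML format: the dictionary layout, the `{input}`/`{output}` placeholders and the function name are assumptions made purely to illustrate the pattern.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical tool description: a name plus a command-line template.
UPPERCASE_TOOL = {
    "name": "uppercase",
    "command": [
        sys.executable,
        "-c",
        "import sys, pathlib; "
        "pathlib.Path(sys.argv[2]).write_text(pathlib.Path(sys.argv[1]).read_text().upper())",
        "{input}",
        "{output}",
    ],
}


def run_wrapped_tool(description, input_path, output_path, history):
    """Fill in the template, run the command, check for output, log the run."""
    command = [
        part.replace("{input}", str(input_path)).replace("{output}", str(output_path))
        for part in description["command"]
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    ok = result.returncode == 0 and Path(output_path).exists()
    history.append({"tool": description["name"], "command": command, "ok": ok})
    return ok


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input1.txt"
    target = Path(tmp) / "output1.txt"
    source.write_text("acgt\n")
    history = []
    run_wrapped_tool(UPPERCASE_TOOL, source, target, history)
    print(history[0]["tool"], history[0]["ok"], target.read_text().strip())
```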

40 Galaxy Workflow: visualise

41

42

43 Galaxy Workflow: export

44 Citable workflow Add as supplemental files or publish with distinct DOI via GigaDB or FigShare

45 Galaxy Toolshed https://toolshed.g2.bx.psu.edu/ Many 'omics, stats, visualisations 2700+ tools! Download; Run instantly

46 Sharing outputs

47 Share outputs – intermediate results Workflow systems help with this If a part of your analysis can’t be replicated – Requires a license – Is no longer compatible – Just plain won’t work The rest of the analysis can still be used (show diagram)
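The point about intermediate results can be shown with a small checkpointing sketch: each step writes its output to a file, and a later run reuses that file instead of re-running a step that no longer works. The step names, file naming scheme and functions below are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


def run_with_checkpoints(steps, data, workdir):
    """Run (name, function) steps in order, saving each result as JSON.

    If a step's result file already exists it is loaded instead of recomputed.
    """
    workdir = Path(workdir)
    for index, (name, func) in enumerate(steps, start=1):
        checkpoint = workdir / f"{index:02d}_{name}.json"
        if checkpoint.exists():
            data = json.loads(checkpoint.read_text())
        else:
            data = func(data)
            checkpoint.write_text(json.dumps(data))
    return data


def drop_negatives(values):
    return [v for v in values if v >= 0]


def unavailable_filter(values):
    # Stands in for a step that needs a license or just plain won't work.
    raise RuntimeError("this step can no longer be run")


with tempfile.TemporaryDirectory() as tmp:
    # Original analysis: both steps run and their results are saved.
    total = run_with_checkpoints([("filter", drop_negatives), ("total", sum)], [3, -1, 5], tmp)
    # Later re-analysis: the filter cannot be re-run, but its shared
    # intermediate result lets a new downstream step proceed.
    average = run_with_checkpoints([("filter", unavailable_filter), ("mean", mean)], [3, -1, 5], tmp)
    print(total, average)
```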

48 Share outputs – code for figures

49 Share outputs – codify publication knitr e.g. http://www.gigasciencejournal.com/content/3/1/3 Options given here: http://www.gigasciencejournal.com/content/3/1/19 – R: knitr, Sweave, R Markdown – JavaScript: Tangle, Active Markdown (CoffeeScript) – Python: IPython Notebooks – iReport links this functionality for Galaxy
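The tools listed above embed computed values directly in the manuscript text. The underlying idea can be sketched with nothing but the Python standard library; the sentence, the placeholder names and the measurements here are invented for the illustration and are not taken from any of the cited papers.

```python
from statistics import mean
from string import Template


def render_sentence(template_text, measurements):
    """Fill a manuscript sentence with numbers computed from the data."""
    values = {
        "n": len(measurements),
        "mean": f"{mean(measurements):.2f}",
        "maximum": f"{max(measurements):.2f}",
    }
    return Template(template_text).substitute(values)


sentence = "We analysed $n samples; mean coverage was $mean (maximum $maximum)."
print(render_sentence(sentence, [30.5, 28.0, 35.5]))
# prints: We analysed 3 samples; mean coverage was 31.33 (maximum 35.50).
```

If the data change, re-running the script updates every number in the text, so the publication cannot drift out of step with the analysis.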

50 Research objects Project proposal Project experimental SOPs Images of equipment, subjects, conditions RAW data Meta-data Analysis code, parameters, pipelines Analysis environment, VM or provisioning script Intermediate results Publication figures/images/tables: codify Publication text

51 Share early Share widely Share openly Slideshow URL

