Published by Jerome Lee. Modified over 9 years ago.
1
Software workflows as research objects Rob L Davidson, Chris I Hunter ISI CODATA International Training Workshop on Big Data 11th March 2015 Slideshow-URL
2
Article: http://econ.st/1o12gCN
4
Big data! (The new oil) Article: http://bit.ly/1AN8ysJ
5
Source: @flowchainsensei
6
Article: http://bit.ly/1xdCxbY
7
Article: http://bit.ly/1Mdll03
8
Yay, we’re all unicorns! Are you recruiting a data scientist or a unicorn? http://ubm.io/1Gpxizh
9
Source: http://bit.ly/1MdA8rI But why are we sad unicorns?
10
Measuring software reproducibility 515 papers (429 conference, 86 journal) <30% reproducible http://reproducibility.cs.arizona.edu
11
Measuring software reproducibility
12
Reasons for failure “The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”
13
Cost of failure Waste time Waste money Frustrating Distrust
14
How to fix it
15
The path to enlightenment
A word from the experts (4 x 10 simple rules)
Share code
– Licenses
Share environment
– Codify the environment
Share workflows
– All parameters, versions, order of steps
– GalaxyProject.org
Share outputs
– Share intermediate results
– Share code for figures
– Codify publications
Slideshow URL……………..
16
A word from the experts
18
A word from the experts: 1 Keep it simple – Don’t be a perfectionist – Aim for multiple versions – Optimise/improve later – Get feedback/help from the community Hastings #1 + Prlic #5
19
A word from the experts: 2 Versioning – Use a version control system (e.g. Git, hosted on GitHub) – Let others know which version they are using – Release early, release often (Linus Torvalds) – Get help from the community Seemann #3, Hastings #10, Sandve #3/4
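One way to "let others know which version they are using" is to have the tool report its own version and stamp it into its outputs. The following is a minimal sketch only: the tool name `mytool`, the version string `0.1.0` and the helper `provenance_line` are hypothetical and do not come from the talk.

```python
import argparse

# Hypothetical version string; bump it with every release.
__version__ = "0.1.0"


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser whose --version flag reports the release in use."""
    parser = argparse.ArgumentParser(prog="mytool", description="Example analysis tool")
    # argparse's built-in "version" action prints the string and exits.
    parser.add_argument("--version", action="version", version=f"%(prog)s {__version__}")
    parser.add_argument("input", help="path to the input file")
    return parser


def provenance_line(parser: argparse.ArgumentParser) -> str:
    """Return a one-line stamp to write at the top of every output file."""
    return f"# generated by {parser.prog} {__version__}"
```

Calling `build_parser().parse_args()` in a script then gives users `mytool --version`, and writing `provenance_line(...)` into each result file ties that file to a specific release.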
20
A word from the experts: 3 Use good coding practice – You don’t have to be the best – Learn from others – Become involved in a community – Write as though others will be watching Prlic #2 + all of Seemann and Hastings
21
A word from the experts: highlight Start simple Release early Use versioning Build a community Get community feedback, testing, support …but wait, won’t that mean???
22
Sharing code
23
“Scientific software…public release is then only considered around the time of publication” – Prlic #4 “the fear of getting scooped” – Reality: “staking a claim in the field”
24
Sharing code: don’t worry Share early – Be simple – Don’t be a perfectionist CRAPL license Source: http://matt.might.net/articles/crapl/
25
Sharing code: licenses Know your licenses – Apache License 2.0 – BSD 3-Clause “New” or “Revised” – BSD 2-Clause “Simplified” or “FreeBSD” – GNU General Public License (GPL) – MIT – Mozilla Public License 2.0 – etc. Source: http://opensource.org/licenses
26
Sharing code: repositories GitHub, SourceForge, Zenodo, GigaDB/GigaGalaxy Versioning, sharing, collaboration, community feedback
27
Sharing environment
28
Your environment How hard would it be to start from scratch? What if you move from Ubuntu to CentOS? If it took you a while to set up your box, or if you hesitate to set it up for your colleagues… – Create a virtual machine or Docker image that can be shared whole – Time-stamp of a working system
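A full virtual machine or Docker image is the route the slide recommends. As a much lighter complement, a "time-stamp of a working system" can also be recorded as a small manifest. The sketch below assumes a Python-based analysis and uses only the standard library; the function names and the manifest layout are illustrative choices, not something the talk prescribes.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def installed_packages() -> list:
    """List installed Python distributions as 'name==version' strings."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip entries with incomplete metadata
            found.add(f"{name}=={version}")
    return sorted(found)


def snapshot_environment() -> dict:
    """Collect a time-stamped description of the current working system."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": installed_packages(),
    }


def write_snapshot(path: str) -> None:
    """Save the snapshot as JSON so it can be shared alongside the code."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot_environment(), handle, indent=2)
```

A manifest like this does not reproduce the environment by itself, but it tells a colleague what a shared image or provisioning script needs to contain.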
29
Share your environment Virtual machine – Copy your exact environment – If it works for you, it works for anyone – Reproducibility, frozen in time
30
Share your environment Docker – ‘Light’ VM – Discrete unit of code + environment – Can be called like a compiled tool New possibilities, e.g. nucleotid.es benchmarking – Data-driven peer review
31
Share your environment VM = black box? Docker == black box! http://ivory.idyll.org/blog/vms-considered-harmful.html
32
Codify your environment Provisioning scripts are ‘research objects’ Improves adaptability (easier to recode for alternative OS etc) Builds in extra documentation Easier to share – although GigaDB still wants a compiled snapshot (i.e. full machine)
33
List of provisioning systems Vagrant Chef Salt Ansible
34
Sharing pipelines
35
Share your pipeline Any analysis is a string of tools with a great many parameters The order of the sequence, the version of each part and the inputs and outputs are never fully explained These should be shared! Help is at hand: there are many ‘workflow’ systems for this
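The items this slide says are rarely explained (step order, tool versions, parameters, inputs and outputs) can be written down in a machine-readable record even without a workflow system. The following is a minimal sketch under that assumption; the `Step` fields and the JSON layout are hypothetical, and real workflow systems define their own formats.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Step:
    """One tool invocation in the analysis."""
    tool: str
    version: str
    parameters: dict
    inputs: list
    outputs: list


def describe_pipeline(steps: list) -> str:
    """Serialise the steps, in execution order, as a JSON document."""
    record = [{"order": position, **asdict(step)}
              for position, step in enumerate(steps, start=1)]
    return json.dumps(record, indent=2, sort_keys=True)
```

For example, with two made-up tools:

```python
pipeline = [
    Step("trim_reads", "1.2.0", {"min_quality": 20}, ["reads.fastq"], ["trimmed.fastq"]),
    Step("align_reads", "0.7.1", {"threads": 4}, ["trimmed.fastq"], ["aligned.bam"]),
]
print(describe_pipeline(pipeline))
```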
36
List of workflow systems Galaxy Knime Taverna
37
Galaxy Over 36,000 main Galaxy server users Over 1,000 papers citing Galaxy use Over 55 Galaxy servers deployed Open source http://galaxyproject.org
38
Galaxy User Interface Tool List Tool Parameters History/results
39
Galaxy: Under the hood
python myfunction input1
Basic XML 'wrapper'
– Describe inputs and outputs
– Calls command
– Monitors for output
– Logs/returns to 'history'
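The wrapper idea on this slide (declare inputs, call a command, capture its output, log to a history) can be illustrated in a few lines of Python. This is a conceptual sketch only and is not Galaxy's implementation: Galaxy wrappers are XML files, and the `ToolWrapper` class, its fields and the history format below are assumptions made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolWrapper:
    """Describes a command-line tool: its name, argv template and declared inputs."""
    name: str
    command: list  # argv template, e.g. ["python", "myfunction", "{input1}"]
    inputs: list   # names of the inputs the tool expects

    def run(self, history: list, **values) -> str:
        """Fill in the inputs, call the command and log the result to the history."""
        missing = [name for name in self.inputs if name not in values]
        if missing:
            raise ValueError(f"missing inputs: {missing}")
        argv = [part.format(**values) for part in self.command]
        # Run the tool and wait for it to finish, capturing what it prints.
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        history.append({
            "tool": self.name,
            "argv": argv,
            "returncode": result.returncode,
            "stdout": result.stdout,
            "stderr": result.stderr,
        })
        return result.stdout
```

Usage, with an inline one-liner standing in for a real tool:

```python
history = []
upper = ToolWrapper(
    name="upper",
    command=[sys.executable, "-c", "import sys; print(sys.argv[1].upper())", "{input1}"],
    inputs=["input1"],
)
upper.run(history, input1="acgt")
```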
40
Galaxy Workflow: visualise
43
Galaxy Workflow: export
44
Citable workflow Add as supplemental files or publish with distinct DOI via GigaDB or FigShare
45
Galaxy Toolshed https://toolshed.g2.bx.psu.edu/ Many 'omics, stats, visualisations 2700+ tools! Download; Run instantly
46
Sharing outputs
47
Share outputs – intermediate results Workflow systems help with this If a part of your analysis can’t be replicated – Requires a license – Is no longer compatible – Just plain won’t work The rest of the analysis can still be used (show diagram)
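The point about intermediate results can be sketched as follows: when one step cannot be rerun, the shared output of that step lets every later step still run. The function name `run_with_fallback` and the structure of `steps` and `shared_results` are hypothetical, chosen only to illustrate the idea.

```python
def run_with_fallback(steps, shared_results, data):
    """Run steps in order; if a step fails, continue from its shared output.

    steps          -- list of (name, function) pairs, applied in sequence
    shared_results -- dict mapping step name to its published intermediate output
    data           -- the input to the first step
    """
    log = []
    for name, func in steps:
        try:
            data = func(data)
            log.append((name, "recomputed"))
        except Exception as error:
            if name not in shared_results:
                raise RuntimeError(
                    f"step {name!r} failed and no shared result exists"
                ) from error
            # The step cannot be replicated here, so reuse what the authors shared.
            data = shared_results[name]
            log.append((name, "shared"))
    return data, log
```

The returned log makes it explicit which parts of the analysis were actually replicated and which relied on the authors' shared files.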
48
Share outputs – code for figures
49
Share outputs – codify publication knitr, e.g. http://www.gigasciencejournal.com/content/3/1/3 Options given here: http://www.gigasciencejournal.com/content/3/1/19 – R: knitr, Sweave, R Markdown – JavaScript: Tangle, Active Markdown (CoffeeScript) – Python: IPython Notebooks – iReport links this functionality for Galaxy
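The tools listed on this slide all rest on the same idea: numbers in the text are computed from the data rather than typed by hand. A minimal standard-library sketch of that idea follows, using the paper counts from the reproducibility slide earlier in this deck; the template wording and function name are hypothetical, and this is not how knitr or IPython Notebooks are implemented.

```python
from string import Template

# Counts taken from the "Measuring software reproducibility" slide.
COUNTS = {"conference": 429, "journal": 86}

SENTENCE = Template(
    "Of $total papers examined, $conference ($conference_pct%) were conference "
    "papers and $journal ($journal_pct%) were journal papers."
)


def render_sentence(counts: dict) -> str:
    """Fill the sentence template with values computed from the data."""
    total = sum(counts.values())
    return SENTENCE.substitute(
        total=total,
        conference=counts["conference"],
        conference_pct=f"{100 * counts['conference'] / total:.1f}",
        journal=counts["journal"],
        journal_pct=f"{100 * counts['journal'] / total:.1f}",
    )
```

If the underlying counts change, rerunning `render_sentence(COUNTS)` updates every figure in the sentence consistently.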
50
Research objects
– Project proposal
– Project experimental SOPs
– Images of equipment, subjects, conditions
– Raw data
– Metadata
– Analysis code, parameters, pipelines
– Analysis environment, VM or provisioning script
– Intermediate results
– Publication figures/images/tables: codify
– Publication text
51
Share early Share widely Share openly Slideshow URL