Software workflows as research objects. Rob L Davidson, Chris I Hunter. ISI CODATA International Training Workshop on Big Data, 11th March 2015.

Presentation transcript:

Big data! (The new oil) Article:

Yay, we’re all unicorns! Are you recruiting a data scientist or a unicorn?

Source:
But why are we sad unicorns?

Measuring software reproducibility
515 papers (429 conference, 86 journal)
<30% reproducible

Measuring software reproducibility

Reasons for failure
“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”

Cost of failure
Wasted time
Wasted money
Frustration
Distrust

How to fix it

The path to enlightenment
A word from the experts (4 x 10 simple rules)
Share code
– Licenses
Share environment
– Codify the environment
Share workflows
– All parameters, versions, order of steps
– GalaxyProject.org
Share outputs
– Share intermediate results
– Share code for figures
– Codify publications

A word from the experts

A word from the experts: 1
Keep it simple
– Don’t be a perfectionist
– Aim for multiple versions
– Optimise/improve later
– Get feedback/help from community
Hastings #1 + Prlić #5

A word from the experts: 2
Versioning
– Use a versioning system (e.g. Git, hosted on GitHub)
– Allow others to know what version they use
– Release early, release often (Linus Torvalds)
– Get help from community
Seemann #3, Hastings #10, Sandve #3/4
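One way to let others know which version they are using is a `--version` switch plus a version string written into every run record. This is a minimal sketch only: the tool name `mytool`, the version `0.1.0` and the helper `describe_run` are hypothetical and not taken from the talk.

```python
import argparse

__version__ = "0.1.0"  # hypothetical version string; bump it on every release


def build_parser() -> argparse.ArgumentParser:
    """Create a command-line parser that can report the tool's version."""
    parser = argparse.ArgumentParser(prog="mytool", description="Example analysis tool")
    # argparse's built-in "version" action prints the string and exits.
    parser.add_argument("--version", action="version", version=f"%(prog)s {__version__}")
    parser.add_argument("input", help="path to the input file")
    return parser


def describe_run(argv: list[str]) -> str:
    """Return a one-line record of the tool version and input used for a run."""
    args = build_parser().parse_args(argv)
    return f"mytool {__version__} input={args.input}"


if __name__ == "__main__":
    # Explicit argument list so the example does not depend on the real command line.
    print(describe_run(["reads.fastq"]))
```

Writing the same string into logs and output headers means a result file can always be traced back to the release that produced it.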

A word from the experts: 3
Use good coding practice
– You don’t have to be the best
– Learn from others
– Become involved in a community
– Write as though others will be watching
Prlić #2 + all of Seemann and Hastings

A word from the experts: highlight
Start simple
Release early
Use versioning
Build a community
Get community feedback, testing, support
…but wait, won’t that mean???

Sharing code

“Scientific software…public release is then only considered around the time of publication” – Prlić #4
“the fear of getting scooped”
– Reality: “staking a claim in the field”

Sharing code: don’t worry
Share early
– Be simple
– Don’t be a perfectionist
CRAPL license
Source:

Sharing code: licenses
Know your licenses
– Apache License 2.0
– BSD 3-Clause “New” or “Revised”
– BSD 2-Clause “Simplified” or “FreeBSD”
– GNU General Public License (GPL)
– MIT
– Mozilla Public License 2.0
– etc.
Source:

Sharing code: repositories
GitHub
SourceForge
Zenodo
GigaDB/GigaGalaxy
Versioning, sharing, collaboration, community feedback

Sharing environment

Your environment
How hard would it be to start from scratch?
What if you move from Ubuntu to CentOS?
If it took you a while to set up your box, or if you hesitate to set it up for your colleagues…
– Create a virtual machine or Docker image that can be shared whole
– Time-stamp of a working system

Share your environment
Virtual machine
– Copy your exact environment
– If it works for you, it works for anyone
– Reproducibility, frozen in time

Share your environment
Docker
– ‘Light’ VM
– Discrete unit of code + environment
– Can be called like a compiled tool
New possibilities, e.g. nucleotid.es benchmarking
– Data-driven peer review

Share your environment
VM = black box? Docker == black box!
harmful.html

Codify your environment
Provisioning scripts are ‘research objects’
Improves adaptability (easier to recode for an alternative OS, etc.)
Builds in extra documentation
Easier to share, although GigaDB still wants a compiled snapshot (i.e. full machine)
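A full provisioning script is tool-specific, but the idea of writing the environment down as a shareable text file can be sketched in a few lines. The example below is an illustration only, not a replacement for a VM or provisioning script: it records the Python interpreter, operating system and installed Python package versions, and the function name `snapshot_environment` is hypothetical.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect the interpreter, operating system and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            packages[name] = version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be committed next to the analysis code.
    json.dump(snapshot_environment(), sys.stdout, indent=2)
```

A snapshot like this only covers the Python layer; system libraries and external tools still need a VM, container image or provisioning script.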

List of provisioning systems
Vagrant
Chef
Salt
Ansible

Sharing pipelines

Share your pipeline
Any analysis is a string of tools with a great many parameters
The order of the sequence, the version of each part, and the inputs and outputs are never fully explained
These should be shared!
Help is at hand: there are many ‘workflow’ systems for this
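What a workflow system records can be sketched as a small data structure: an ordered list of steps, each with its tool, version, parameters, inputs and outputs. This is a hand-rolled illustration, not the export format of Galaxy or any other workflow system; all tool names, versions and file names below are made up.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One tool invocation in the pipeline."""
    tool: str
    version: str
    parameters: dict
    inputs: list
    outputs: list


@dataclass
class Workflow:
    """An ordered list of steps; list order is execution order."""
    name: str
    steps: list = field(default_factory=list)

    def add(self, step: Step) -> "Workflow":
        self.steps.append(step)
        return self

    def to_json(self) -> str:
        # asdict() converts the nested Step dataclasses to plain dictionaries.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def example_workflow() -> Workflow:
    # Tool names, versions and parameters are hypothetical, for illustration only.
    wf = Workflow(name="example-analysis")
    wf.add(Step("trim_reads", "1.2.0", {"min_quality": 20}, ["raw.fastq"], ["trimmed.fastq"]))
    wf.add(Step("align", "0.7.1", {"threads": 4}, ["trimmed.fastq", "ref.fa"], ["aligned.bam"]))
    return wf


if __name__ == "__main__":
    print(example_workflow().to_json())
```

The resulting JSON file is small enough to publish alongside a paper as a supplemental file.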

List of workflow systems
Galaxy
KNIME
Taverna

Galaxy
Over 36,000 main Galaxy server users
Over 1,000 papers citing Galaxy use
Over 55 Galaxy servers deployed
Open source

Galaxy User Interface
Tool list
Tool parameters
History/results

Galaxy: Under the hood
python myfunction input1
Basic XML 'wrapper'
Describes inputs and outputs
Calls command
Monitors for output
Logs/returns to 'history'
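The wrapper idea (call a command, wait for its output, log the result to a history) can be imitated in plain Python. This is a loose sketch of the pattern only and is not how Galaxy itself is implemented; `History` and `run_tool` are hypothetical names, and an inline script stands in for `python myfunction input1`.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class History:
    """Accumulates one record per tool run, loosely like a results history."""
    entries: list = field(default_factory=list)


def run_tool(command: list[str], history: History) -> str:
    """Run a command, wait for it to finish, and log the result to the history."""
    result = subprocess.run(command, capture_output=True, text=True)
    history.entries.append({
        "command": command,
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    })
    return result.stdout


if __name__ == "__main__":
    history = History()
    # Stand-in for "python myfunction input1": a tiny inline script that
    # upper-cases its single argument.
    run_tool([sys.executable, "-c", "import sys; print(sys.argv[1].upper())", "input1"], history)
    print(history.entries[-1]["stdout"])
```

Because every run is logged with its exact command line, the history doubles as a record of what was done and in which order.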

Galaxy Workflow: visualise

Galaxy Workflow: export

Citable workflow
Add as supplemental files or publish with a distinct DOI via GigaDB or Figshare

Galaxy Tool Shed
Many 'omics, stats and visualisation tools!
Download and run instantly

Sharing outputs

Share outputs – intermediate results
Workflow systems help with this
If a part of your analysis can’t be replicated because it
– Requires a license
– Is no longer compatible
– Just plain won’t work
the rest of the analysis can still be used
(show diagram)
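The value of keeping intermediate results can be shown with a toy pipeline runner: every completed step's output is stored, so when one step cannot be run, everything before it is still available. This is a sketch under assumptions, not taken from any workflow system; the step names and the failing `licensed_tool` are hypothetical.

```python
def run_pipeline(steps, data):
    """Run steps in order, keeping every intermediate result.

    Returns (results, failed_step): results maps step name to its output for
    every step that completed; failed_step is the name of the first step that
    raised an exception, or None if all steps ran.
    """
    results = {}
    for name, func in steps:
        try:
            data = func(data)
        except Exception:
            # Stop here, but keep everything computed so far.
            return results, name
        results[name] = data
    return results, None


def licensed_tool(values):
    # Hypothetical step standing in for a tool that can no longer be run.
    raise RuntimeError("license not available")


STEPS = [
    ("normalise", lambda xs: [x / max(xs) for x in xs]),
    ("filter", lambda xs: [x for x in xs if x >= 0.5]),
    ("licensed_tool", licensed_tool),
]

if __name__ == "__main__":
    results, failed = run_pipeline(STEPS, [1, 2, 4])
    print("completed:", list(results), "failed at:", failed)
```

If the stored results are shared, a reader can resume from the last working step instead of starting from the raw data.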

Share outputs – code for figures

Share outputs – codify publication
KnitR e.g. /3
Options given here: /19
– R: KnitR, Sweave, R Markdown
– JavaScript: Tangle, Active Markdown (CoffeeScript)
– Python: IPython Notebooks
– iReport links this functionality for Galaxy
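The core of a codified publication is that numbers in the text are computed from the data rather than typed by hand. The tools listed above do this properly; the snippet below is only a minimal stand-in using the Python standard library, with a made-up sentence template and the hypothetical helper `render_report`.

```python
from statistics import mean
from string import Template

# Report text with placeholders that are filled from the data at build time.
REPORT = Template("We analysed $n samples; the mean value was $mean_value.")


def render_report(values: list[float]) -> str:
    """Fill the report text from the data so the numbers never go stale."""
    return REPORT.substitute(n=len(values), mean_value=f"{mean(values):.2f}")


if __name__ == "__main__":
    print(render_report([1.0, 2.0, 4.5]))
```

Rerunning the script after the data changes regenerates the sentence, which is the same principle the notebook and literate-programming tools apply to whole papers.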

Research objects
Project proposal
Project experimental SOPs
Images of equipment, subjects, conditions
Raw data
Metadata
Analysis code, parameters, pipelines
Analysis environment, VM or provisioning script
Intermediate results
Publication figures/images/tables: codify
Publication text

Share early
Share widely
Share openly