Software workflows as research objects & GigaGalaxy. Rob L Davidson, Chris I Hunter. ISI CODATA International Training Workshop on Big Data, 11th March 2015.

Presentation transcript:

Software workflows as research objects & GigaGalaxy
Rob L Davidson, Chris I Hunter
ISI CODATA International Training Workshop on Big Data, 11th March 2015
DOI: /m9.figshare

Big data! (The new oil) New dot com bubble?

Yay, we’re all unicorns!
From: Are you recruiting a data scientist or a unicorn?

But why are we sad unicorns?

Measuring software reproducibility
Systematic study: 515 papers (429 conference, 86 journal)
<30% reproducible

Measuring software reproducibility

Reasons for failure
“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”

Cost of failure
Wasted time
Wasted money
Frustration
Distrust

How to fix it

The path to enlightenment
Look to the experts (4 x 10 simple rules)
Share code
– Licenses
Share environment
– Codify the environment
Share workflows
– All parameters, versions, order of steps
– GalaxyProject.org
Share outputs
– Share intermediate results
– Share code for figures
– Codify publications

Look to the experts

A word from the experts: 1
Keep it simple
– Don’t be a perfectionist
– Aim for multiple versions
– Optimise/improve later
– Get feedback/help from community
Hastings #1 + Prlic #5

A word from the experts: 2
Versioning
– Use a versioning system (e.g. GitHub)
– Allow others to know what version they use
– Release early, release often (Linus Torvalds)
– Get help from community
Seemann #3, Hastings #10, Sandve #3/4
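
One way to "allow others to know what version they use" is for the tool to report its own version and to write it into every run record. The following is a minimal sketch, not something shown in the talk: the tool name `mytool`, the version string and the input file name are all hypothetical.

```python
import argparse

# Hypothetical version string; in practice it would track the tagged release.
__version__ = "0.1.0"


def build_parser():
    """Return a command-line parser that can report the tool's version."""
    parser = argparse.ArgumentParser(prog="mytool")
    parser.add_argument("--version", action="version",
                        version=f"%(prog)s {__version__}")
    parser.add_argument("input", help="path to the input file")
    return parser


def describe_run(args):
    """Return a one-line record of what was run, suitable for a log file."""
    return f"mytool {__version__} input={args.input}"


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["reads.fastq"])
run_record = describe_run(args)
print(run_record)
```

Writing the version into the run record means a reader of the results can tell which release produced them without asking the author.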

A word from the experts: 3
Use good coding practice
– You don’t have to be the best
– Learn from others
– Become involved in a community
– Write as though others will be watching
Prlic #2 + all of Seemann and Hastings

A word from the experts: highlight
Start simple
Release early
Use versioning
Build a community
Get community feedback, testing, support
…but wait, won’t that mean???

Sharing code

Sharing code
“Scientific software…public release is then only considered around the time of publication” – Prlic #4
“the fear of getting scooped”
– Reality: “staking a claim in the field”

Sharing code: don’t worry
Share early
– Be simple
– Don’t be a perfectionist
CRAPL license

Sharing code: licenses
Know your licenses
– Apache License 2.0
– BSD 3-Clause “New” or “Revised”
– BSD 2-Clause “Simplified” or “FreeBSD”
– GNU General Public License (GPL)
– MIT
– Mozilla Public License 2.0
– etc.

Sharing code: repositories
GitHub
SourceForge
Zenodo
GigaDB/GigaGalaxy
Versioning, sharing, collaboration, community feedback

Sharing environment

Your environment
How hard would it be to start from scratch?
What if you move from Ubuntu to CentOS?
If it took you a while to set up your box, or if you hesitate to set it up for your colleagues…
– Create a virtual machine or Docker image that can be shared whole.
– Time-stamp of a working system

Share your environment
Virtual machine
– Copy your exact environment
– If it works for you, it works for anyone
– Reproducibility, frozen in time
DOI: / X-3-23

Share your environment
Docker
– ‘Light’ VM
– Discrete unit of code+environment
– Can be called like a compiled tool
New possibilities, e.g. nucleotid.es benchmarking
– Data-driven peer-review

Share your environment
VM = black box?
Docker == black box!
harmful.html

Codify your environment
Provisioning scripts are ‘research objects’
Improves adaptability (easier to recode for alternative OS etc.)
Builds in extra documentation
Easier to share
– although GigaDB still wants a compiled snapshot (i.e. full machine)
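
A full provisioning script depends on the system chosen, but the underlying idea of writing the environment down as a shareable text object can be sketched in a few lines. The example below is an illustration only and is not from the talk: it records the operating system, the interpreter version and the installed Python packages as JSON. The function name `capture_environment` is hypothetical, and a real provisioning script would also need to install these components, not just list them.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Collect a snapshot of the OS, interpreter and installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": dict(sorted(packages.items())),
    }


# The snapshot is plain JSON, so it can be versioned and shared with the code.
snapshot = capture_environment()
print(json.dumps(snapshot, indent=2))
```

Because the snapshot is text rather than a machine image, it can be diffed between runs, which is part of what makes a codified environment easier to adapt than a frozen one.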

Short list of provisioning systems
Vagrant
Chef
Salt
Puppet
Ansible
Many more – see link for info

Sharing workflows

Share your workflow
Any analysis is a string of tools with a great many parameters
The order of the sequence, the version of each part and the inputs and outputs are never fully explained
These should be shared!
Help is at hand: there are many ‘workflow’ systems for this
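
What "fully explained" could look like can be sketched as a small data structure: each step records its order, tool, version, parameters, inputs and outputs. This is a hand-rolled illustration, not the format of any workflow system named in the talk, and every tool name, version and file name in it is a placeholder.

```python
import json

# Hypothetical three-step analysis; all names and versions are placeholders.
workflow = {
    "name": "example-analysis",
    "steps": [
        {"order": 1, "tool": "trim_reads", "version": "1.2.0",
         "parameters": {"min_quality": 20},
         "inputs": ["raw_reads.fastq"], "outputs": ["trimmed.fastq"]},
        {"order": 2, "tool": "assemble", "version": "2.0.4",
         "parameters": {"kmer": 31},
         "inputs": ["trimmed.fastq"], "outputs": ["contigs.fasta"]},
        {"order": 3, "tool": "summarise", "version": "0.9.1",
         "parameters": {},
         "inputs": ["contigs.fasta"], "outputs": ["stats.tsv"]},
    ],
}


def external_inputs(workflow):
    """Return the inputs that no earlier step produces, in first-use order."""
    produced = set()
    external = []
    for step in sorted(workflow["steps"], key=lambda s: s["order"]):
        for name in step["inputs"]:
            if name not in produced and name not in external:
                external.append(name)
        produced.update(step["outputs"])
    return external


# Files a reader must obtain before the workflow can be rerun.
needed = external_inputs(workflow)
print(json.dumps(workflow, indent=2))
print("Data to share alongside the workflow:", needed)
```

Once the steps are written down this way, simple checks become possible, such as listing exactly which data files have to be shared for someone else to rerun the analysis.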

Workflow systems
Galaxy
KNIME
Taverna
Many more…
GigaScience uses Galaxy – galaxy.cbiit.cuhk.edu.hk

Galaxy
Over 36,000 main Galaxy server users
Over 1,000 papers citing Galaxy use
Over 55 Galaxy servers deployed
Open source

Galaxy User Interface
Tool List
Tool Parameters
History/results

Galaxy: Under the hood
python myfunction input1
Basic XML 'wrapper'
Describes inputs and outputs
Calls command
Monitors for output
Logs/returns to 'history'
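
The four jobs listed for the wrapper (describe inputs and outputs, call the command, monitor for output, log to a history) can be mimicked in a few lines of Python. This is only a conceptual sketch: it does not use Galaxy's XML format or any Galaxy code, and the `tool` dictionary, the `$input`/`$output` placeholders and the `run_tool` function are all made up for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a wrapper: a name plus a command template. The command is a
# tiny Python one-liner that upper-cases its input file (hypothetical tool).
tool = {
    "name": "uppercase",
    "command": [
        sys.executable, "-c",
        "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
        "$input", "$output",
    ],
}


def run_tool(tool, input_path, output_path, history):
    """Fill in the command, run it, check for output and log the result."""
    substitutions = {"$input": str(input_path), "$output": str(output_path)}
    command = [substitutions.get(part, part) for part in tool["command"]]
    result = subprocess.run(command, capture_output=True, text=True)
    ok = result.returncode == 0 and Path(output_path).exists()
    history.append({"tool": tool["name"], "command": command, "ok": ok,
                    "output": str(output_path) if ok else None})
    return ok


with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "input1.txt"
    target = Path(workdir) / "output1.txt"
    source.write_text("acgt\n")
    history = []
    succeeded = run_tool(tool, source, target, history)
    result_text = target.read_text() if succeeded else ""

print(succeeded, history[0]["tool"])
```

The point of the sketch is that the exact command line ends up in the history next to its result, which is the record a reader needs to repeat the step.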

Galaxy Workflow: visualise

Galaxy Workflow: export

Citable workflow
Add as supplementary files, or publish with a distinct DOI via GigaDB or figshare

Galaxy Toolshed
Many 'omics, stats and visualisation tools!
Download; run instantly

GigaGalaxy
Web site: galaxy.cbiit.cuhk.edu.hk

SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk
Implemented the entire workflow in our Galaxy server, including:
3 pre-processing steps
4 SOAPdenovo modules
1 post-processing step
Evaluation and visualization tools
Will also be available for download by >50K Galaxy users

Can we reproduce results?
SOAPdenovo2 S. aureus pipeline

Sharing outputs

Share outputs – intermediate results
Workflow systems help with this
– Results in history
If a part of your analysis can’t be replicated
– Requires a license
– Is no longer compatible
– Just plain won’t work
The rest of the analysis can still be used
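
Outside a workflow system, the same effect can be approximated by saving each step's result with a checksum, so that later steps can be rerun from the saved file when an earlier step no longer works. The sketch below assumes results that fit in JSON; the function names and the toy data are hypothetical and not from the talk.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def save_intermediate(directory, step_name, data):
    """Write one step's result to disk and return its manifest entry."""
    path = Path(directory) / f"{step_name}.json"
    text = json.dumps(data, sort_keys=True)
    path.write_text(text)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"step": step_name, "file": path.name, "sha256": digest}


def load_intermediate(directory, entry):
    """Reload a saved result, refusing it if the checksum no longer matches."""
    text = (Path(directory) / entry["file"]).read_text()
    if hashlib.sha256(text.encode("utf-8")).hexdigest() != entry["sha256"]:
        raise ValueError(f"checksum mismatch for {entry['file']}")
    return json.loads(text)


with tempfile.TemporaryDirectory() as workdir:
    # Step 1: imagine this step needs a licensed tool others cannot run.
    manifest = [save_intermediate(workdir, "sorted_values", sorted([3, 1, 2]))]
    # Step 2 starts from the shared intermediate instead of rerunning step 1.
    reloaded = load_intermediate(workdir, manifest[0])
    total = sum(reloaded)

print(reloaded, total)
```

The manifest of file names and checksums is itself a small research object that can be shared alongside the intermediate files.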

Share outputs – code for figures
Data transform for figures
– Remove points?
– 3D: choose ‘best angle’?
– PCA: choose ‘best components’?
Figure choice
– Bar chart or box & whisker?
Allow reinterpretation!!!

Share outputs – codify publication
“This article is an example of a literate programming document. It has been created in R using the knitr package. Figures and tables in this paper are generated dynamically as the document is compiled. Several R packages are required to run the analysis. Materials are archived in the Gigascience database”
DOI: / X-3-3
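
The quoted article uses R and knitr. The same principle, that numbers in the text are computed when the document is compiled and not typed in by hand, can be shown with a toy Python sketch using only the standard library. The measurements, the sentence template and the function name `compile_document` are invented for illustration.

```python
import statistics
from string import Template

# Hypothetical measurements standing in for real data.
measurements = [4.1, 3.9, 4.4, 4.0]

# The manuscript text holds placeholders instead of hard-coded numbers.
manuscript = Template(
    "We took $n measurements; the mean was $mean (standard deviation $sd)."
)


def compile_document(template, values):
    """Fill the placeholders with statistics computed from the data."""
    return template.substitute(
        n=len(values),
        mean=f"{statistics.mean(values):.2f}",
        sd=f"{statistics.stdev(values):.2f}",
    )


text = compile_document(manuscript, measurements)
print(text)
```

If the data change, recompiling the document updates every reported number, so the text cannot drift out of step with the analysis.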

Literate coding options
See listing: 3/1/19
– R: knitr, Sweave, R Markdown
– JavaScript: Tangle, Active Markdown (CoffeeScript)
– Python: IPython Notebooks
– iReport links this functionality for Galaxy

SUMMARY

The path to enlightenment
Look to the experts (4 x 10 simple rules)
Share code
– Licenses
Share environment
– Codify the environment
Share workflows
– All parameters, versions, order of steps
– GalaxyProject.org
Share outputs
– Share intermediate results
– Share code for figures
– Codify publications

All Your Research Objects
Project proposal
Project experimental SOPs
Images of equipment, subjects, conditions
Raw data
Metadata
Analysis code, parameters, pipelines
Analysis environment, VM or provisioning script
Intermediate results
Publication figures/images/tables: codify
Publication text

@gigascience
facebook.com/GigaScience
Scott Edmunds, Peter Li, Chris Hunter, Rob Davidson, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC)
galaxy.cbiit.cuhk.edu.hk
blogs.biomedcentral.com/gigablog/
