Data as code: Data management for reproducible research
Martin O’Reilly
Principal Research Software Engineer
The Alan Turing Institute
08/09/2017
@martinoreilly | @turinginst
The Alan Turing Institute is the national centre for data science, headquartered at the British Library.
Turing Research Engineering: Radka Jersakova, May Yong, Tim Hobson, James Geddes, James Hetherington
Turing Research Fellows: Kirstie Whitaker, Tomas Petricek
Data management for reproducible research
FAIR Data Principles
Findable
Accessible
Interoperable
Re-usable
Source: FORCE11 website. https://www.force11.org/group/fairgroup/fairprinciples. Accessed on 07 Sep 2017
Code management for reproducible research
How do I get your code? Online repositories and persistent archives with versioning support
How do I use your code? Documentation, examples, packages, virtual machines, containers
How do I trust your code? Tests, examples, readable code
How do I build on your code? Documentation, readable code, tests
What am I allowed to do with your code? Licence
Data management for reproducible research
How do I get your data? Online repositories with versioning and APIs for data access
How do I use your data? Documentation, metadata, common data formats, data packages
How do I trust your data? Record of provenance and processing, versioning
How do I build on your data? Record of provenance and processing, compatible content, linkable to other data
What am I allowed to do with your data? Licences, terms of use, data access agreements, ethics
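As a minimal sketch of the metadata, versioning, provenance and licence points above, a dataset can travel with a small machine-readable descriptor. The field names, the function `describe_dataset` and the sample values are assumptions made for illustration; they do not follow any particular data package standard.

```python
import hashlib
import json


def describe_dataset(name, version, payload, licence, provenance):
    """Build a small machine-readable descriptor for one dataset file.

    All field names here are illustrative, not taken from a standard.
    """
    return {
        "name": name,
        "version": version,
        "licence": licence,
        "bytes": len(payload),
        # The checksum lets a reader confirm they hold exactly this version.
        "sha256": hashlib.sha256(payload).hexdigest(),
        # Ordered record of how the data was obtained and processed.
        "provenance": list(provenance),
    }


# Hypothetical dataset content and processing history.
raw = b"id,value\n1,10\n2,20\n"
descriptor = describe_dataset(
    name="example-table",
    version="1.0.0",
    payload=raw,
    licence="CC-BY-4.0",
    provenance=["downloaded from source", "dropped rows with missing values"],
)
print(json.dumps(descriptor, indent=2))
```

Publishing the descriptor next to the data gives a reader one place to check which version they have, how it was produced and what they may do with it.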
Good examples
UN Comtrade database
Web API for programmatic access
Can apply current and historical classification codes to entire dataset
Can select subset of data to retrieve along multiple dimensions
Source: Screenshot of UN Comtrade database website. https://comtrade.un.org/data. Accessed on 06 Sep 2017
UN Comtrade database
Third-party R package available for querying web API
Source: Screenshot from Comtradr R package GitHub README.md. https://github.com/ChrisMuir/comtradr. Accessed on 06 Sep 2017
ConnectomeDB
Website requires registration and login
Source: Screenshot of ConnectomeDB login page. https://db.humanconnectome.org. Accessed on 06 Sep 2017
ConnectomeDB
One-time click for acceptance of terms
Generate dedicated Amazon AWS access credentials
Source: Screenshot of ConnectomeDB main page. https://db.humanconnectome.org. Accessed on 06 Sep 2017
The Gamma
Dot-driven development
Intellisense autocomplete for data exploration
Interactive dynamic data preview
Uses F# type providers
For more details, see http://tomasp.net/academic/papers/pivot/
Source: The Gamma homepage. https://thegamma.net/. Accessed on 06 Sep 2017
The Gamma
Sub categories indicated by initial numerals
Sub-sub categories indicated by text formatting
Subtotals indicated by background colour
Source: UK National Statistics Public Expenditure Statistical Analyses 2016. Chapter 5 table 5.2. https://www.gov.uk/government/statistics/public-expenditure-statistical-analyses-2016/. Accessed on 06 Sep 2017
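A table laid out this way keeps its hierarchy in presentation rather than in structure, so it has to be recovered before the data can be used programmatically. The sketch below covers only the first convention, category levels signalled by leading numerals; levels signalled by text formatting or background colour would need access to the spreadsheet's cell styles. The function `parse_row_label` and the sample labels are hypothetical and are not taken from the source table.

```python
import re

# A leading numeral such as "1." or "1.2", then whitespace, then the label text.
_NUMBERED = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*)$")


def parse_row_label(label):
    """Return (code, depth, text) for a numbered label, else (None, 0, text)."""
    match = _NUMBERED.match(label)
    if match is None:
        return None, 0, label.strip()
    code = match.group(1)
    # "1" is depth 1, "1.2" is depth 2, and so on.
    depth = code.count(".") + 1
    return code, depth, match.group(2).strip()


# Hypothetical row labels in the style of a numbered statistical table.
labels = [
    "1. Category A",
    "1.1 Subcategory A1",
    "1.2 Subcategory A2",
    "2. Category B",
    "Total",
]
parsed = [parse_row_label(label) for label in labels]
for code, depth, text in parsed:
    print("  " * depth + text)
```

A label that happens to start with a number for another reason, such as a year, would be misread as a category code, so a real parser would need rules specific to the table.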
The Gamma
Source: Gamma @ The Turing: Accounting for Democracy. http://gamma.turing.ac.uk/expenditure/. Accessed on 06 Sep 2017
Dream data
My wish list
Repository supporting versioning and content-aware sub-setting
Data includes raw and processed data, with code to replicate processing
Content-aware, on-demand differential download
Automatable access to data requiring an access agreement / authentication
Data accessible as native code objects
Documentation accessible in context of data presentation
Standard, machine-readable licences
Repository tracks download / usage stats
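Two of these wishes, data accessible as native code objects and documentation accessible in context, can be illustrated together. In this sketch each column's description is attached to the typed field it documents, so it is available wherever the data is used. The `Observation` class, its columns and the sample CSV are hypothetical.

```python
import csv
import io
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class Observation:
    """One row of a hypothetical dataset."""

    site: str = field(metadata={"doc": "Site identifier"})
    year: int = field(metadata={"doc": "Calendar year of the measurement"})
    value: float = field(metadata={"doc": "Measured value, arbitrary units"})


def load_observations(text):
    """Parse CSV text into typed Observation objects."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Observation(site=row["site"], year=int(row["year"]), value=float(row["value"]))
        for row in reader
    ]


def describe(cls):
    """Return {column name: documentation} taken from the field metadata."""
    return {f.name: f.metadata["doc"] for f in fields(cls)}


# Hypothetical data, inlined so the example is self-contained.
sample = "site,year,value\nA,2016,1.5\nB,2017,2.0\n"
observations = load_observations(sample)

# Values arrive as native types, and the column documentation is one call away.
print(observations[0].year + 1)
print(describe(Observation)["value"])
```

Typed objects also let an editor offer autocomplete on column names, which is the same benefit the type-provider approach in The Gamma aims for.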
Interesting tools
Repositories: Figshare, Zenodo, Dataverse, DataONE, Dryad
Data access: Repository APIs, rOpenSci, SPARQL
Data formats: RDF, OWL, Research object bundles, BagIt, Frictionless data
Differencing data: Daff (tables), data-diff (JSON), data-diff (Python)
Provenance / processing record: Workflow platforms (e.g. Galaxy), execution capture tools (e.g. Sumatra)
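To make the idea of differencing tabular data concrete, here is a minimal sketch that compares two versions of a table row by row, matching rows on a key column. It is not how Daff or any other listed tool works internally; the function `diff_tables` and the sample rows are assumptions for illustration.

```python
def diff_tables(old_rows, new_rows, key):
    """Compare two versions of a table (lists of dicts), matching rows on `key`."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        # Rows whose key exists only in the new version.
        "added": [new[k] for k in new if k not in old],
        # Rows whose key exists only in the old version.
        "removed": [old[k] for k in old if k not in new],
        # Rows present in both versions whose contents differ.
        "changed": [
            {"key": k, "before": old[k], "after": new[k]}
            for k in new
            if k in old and old[k] != new[k]
        ],
    }


# Two hypothetical versions of the same small table.
v1 = [{"id": 1, "value": 10}, {"id": 2, "value": 20}, {"id": 3, "value": 30}]
v2 = [{"id": 1, "value": 10}, {"id": 2, "value": 25}, {"id": 4, "value": 40}]

delta = diff_tables(v1, v2, key="id")
print(delta)
```

A repository that could serve only such a delta, rather than the whole file, would be one way to approach the differential download item on the wish list.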
turing.ac.uk @turinginst
moreilly@turing.ac.uk @martinoreilly