1
Data management for reproducible research
Data as code: Data management for reproducible research
Martin O’Reilly, Principal Research Software Engineer, The Alan Turing Institute
08/09/2017 @martinoreilly
2
The Alan Turing Institute is the national centre for data science, headquartered at the British Library.
Turing Research Engineering: Radka Jersakova, May Yong, Tim Hobson, James Geddes, James Hetherington
Turing Research Fellows: Kirstie Whitaker, Tomas Petricek
3
Data management for reproducible research
4
FAIR Data Principles: Findable, Accessible, Interoperable, Re-usable
Source: FORCE11 website. Accessed on 07 Sep 2017
5
Code management for reproducible research
How do I get your code? Online repositories and persistent archives with versioning support
How do I use your code? Documentation, examples, packages, virtual machines, containers
How do I trust your code? Tests, examples, readable code
How do I build on your code? Documentation, readable code, tests
What am I allowed to do with your code? Licence
6
Data management for reproducible research
How do I get your data? Online repositories with versioning and APIs for data access
How do I use your data? Documentation, metadata, common data formats, data packages
How do I trust your data? Record of provenance and processing, versioning
How do I build on your data? Record of provenance and processing, compatible content, linkable to other data
What am I allowed to do with your data? Licences, terms of use, data access agreements, ethics
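As a rough illustration of what a "record of provenance and processing" could look like in practice, the sketch below fingerprints a raw and a processed dataset with SHA-256 and stores the processing step next to the checksums. The function names, the record fields and the example CSV are all hypothetical and are not taken from the talk; real projects would more likely use an established provenance format.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest that identifies this exact content."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(raw: bytes, processed: bytes, step: str, licence: str) -> dict:
    """Describe one processing step so a reader can check what was done.

    The field names are illustrative assumptions, not a standard schema.
    """
    return {
        "input_sha256": sha256_of_bytes(raw),
        "output_sha256": sha256_of_bytes(processed),
        "processing_step": step,
        "licence": licence,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical example data: a tiny CSV and a cleaned version of it.
raw_csv = b"country,value\nUK, 10\nFR, 12\n"
clean_csv = raw_csv.replace(b", ", b",")

record = provenance_record(
    raw_csv, clean_csv, "strip spaces after commas", "CC-BY-4.0"
)
print(json.dumps(record, indent=2))
```

Anyone who later obtains the same raw file can recompute the input checksum, rerun the documented step and compare the output checksum, which addresses the "trust" and "build on" questions together.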
7
Good examples
8
UN Comtrade database
Web API for programmatic access
Can apply current and historical classification codes to entire dataset
Can select subset of data to retrieve along multiple dimensions
Source: Screenshot of UN Comtrade database website. Accessed on 06 Sep 2017
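To make "select subset of data along multiple dimensions" concrete, here is a minimal sketch of how a script might assemble such a request. The endpoint URL and every parameter name below are hypothetical placeholders and do not reflect the real UN Comtrade API; consult the service's own documentation for the actual interface.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint, for illustration only (not the UN Comtrade API).
BASE_URL = "https://example.org/api/trade"


def build_query(reporter: str, partner: str, year: int, classification: str) -> str:
    """Build a request URL that selects a subset along several dimensions.

    All parameter names are assumptions made for this sketch.
    """
    params = {
        "reporter": reporter,              # country reporting the trade
        "partner": partner,                # trading partner
        "year": year,                      # time dimension
        "classification": classification,  # commodity classification to apply
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query("GBR", "FRA", 2016, "HS2012")
print(url)

# The query can be parsed back, so the selection is recorded in the URL itself.
print(parse_qs(urlsplit(url).query))
```

Because the whole selection is captured in one URL, it can be committed alongside analysis code, which makes the data retrieval step repeatable.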
9
UN Comtrade database
Third-party R package available for querying web API
Source: Screenshot from comtradr R package GitHub README.md. Accessed on 06 Sep 2017
10
ConnectomeDB
Website requires registration and login
Source: Screenshot of ConnectomeDB login page. Accessed on 06 Sep 2017
11
ConnectomeDB
One-time click for acceptance of terms
Generate dedicated Amazon AWS access credentials
Source: Screenshot of ConnectomeDB main page. Accessed on 06 Sep 2017
12
The Gamma
Dot-driven development
Intellisense autocomplete for data exploration
Interactive dynamic data preview
Uses F# type providers
For more details, see
Source: The Gamma homepage. Accessed on 06 Sep 2017
13
The Gamma
Sub categories indicated by initial numerals
Sub-sub categories indicated by text formatting
Subtotals indicated by background colour
Source: UK National Statistics Public Expenditure Statistical Analyses Chapter 5 table. Accessed on 06 Sep 2017
14
The Gamma
Source: The Turing: Accounting for Democracy. Accessed on 06 Sep 2017
15
The Gamma
Source: The Turing: Accounting for Democracy. Accessed on 06 Sep 2017
16
Dream data
17
My wish list
Repository supporting versioning and content-aware sub-setting
Data includes raw and processed data, with code to replicate processing
Content-aware, on-demand differential download
Automatable access to data requiring an access agreement / authentication
Data accessible as native code objects
Documentation accessible in context of data presentation
Standard, machine-readable licences
Repository tracks download / usage stats
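One way to picture the "differential download" item on this wish list is a client that compares a local manifest of file checksums with the repository's manifest and fetches only what differs. The sketch below assumes both manifests are simple name-to-checksum mappings; the manifest format, file names and checksum strings are invented for illustration, and a truly content-aware repository would work at a finer grain than whole files.

```python
def plan_download(local: dict[str, str], remote: dict[str, str]) -> dict[str, list[str]]:
    """Decide which files to fetch or delete so that local matches remote.

    Both arguments map a file name to a checksum (hypothetical manifest format).
    """
    # Fetch files that are new or whose checksum differs from the local copy.
    fetch = sorted(
        name for name, digest in remote.items() if local.get(name) != digest
    )
    # Delete files that the remote version no longer contains.
    delete = sorted(name for name in local if name not in remote)
    return {"fetch": fetch, "delete": delete}


# Hypothetical manifests for two versions of a dataset.
local_manifest = {"2015.csv": "aaa", "2016.csv": "bbb", "old.csv": "ccc"}
remote_manifest = {"2015.csv": "aaa", "2016.csv": "bbd", "2017.csv": "ddd"}

plan = plan_download(local_manifest, remote_manifest)
print(plan)
# {'fetch': ['2016.csv', '2017.csv'], 'delete': ['old.csv']}
```

Only the changed and new files are transferred; the unchanged `2015.csv` is left alone, which is the saving the wish-list item is after.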
18
Interesting tools
Repositories: Figshare, Zenodo, Dataverse, DataONE, Dryad
Data access: Repository APIs, rOpenSci, SPARQL
Data formats: RDF, OWL, Research object bundles, BagIt, Frictionless data
Differencing data: Daff (tables), data-diff (JSON), data-diff (Python)
Provenance / processing record: Workflow platforms (e.g. Galaxy), execution capture tools (e.g. Sumatra)
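To give a feel for what differencing tabular data involves, the sketch below compares two small CSV tables row by row using a key column. It is a simplified stand-in written with the Python standard library only; it does not use or reproduce the API or algorithm of Daff or any other tool listed above, and the table contents are made up.

```python
import csv
import io


def read_table(text: str, key: str) -> dict[str, dict[str, str]]:
    """Parse CSV text into a mapping from key value to row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def diff_tables(old: dict, new: dict) -> dict:
    """Report added rows, removed rows and changed cells between two tables."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = {}
    for k in sorted(old.keys() & new.keys()):
        # Keep only the cells whose value differs, as (old, new) pairs.
        cells = {
            col: (old[k][col], new[k].get(col))
            for col in old[k]
            if old[k][col] != new[k].get(col)
        }
        if cells:
            changed[k] = cells
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical two versions of the same table, keyed on "code".
old_csv = "code,name,value\nA,Alpha,1\nB,Beta,2\nC,Gamma,3\n"
new_csv = "code,name,value\nA,Alpha,1\nB,Beta,5\nD,Delta,4\n"

result = diff_tables(read_table(old_csv, "code"), read_table(new_csv, "code"))
print(result)
# {'added': ['D'], 'removed': ['C'], 'changed': {'B': {'value': ('2', '5')}}}
```

A line-based text diff would flag the same changes but could not say which cell of which record changed; keying on a column is what makes the comparison content-aware.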
19
turing.ac.uk @turinginst
@martinoreilly