Data management for reproducible research


Data as code: Data management for reproducible research
Martin O’Reilly, Principal Research Software Engineer, The Alan Turing Institute
08/09/2017 | @martinoreilly | @turinginst

The Alan Turing Institute is the national centre for data science, headquartered at the British Library.
Turing Research Engineering: Radka Jersakova, May Yong, Tim Hobson, James Geddes, James Hetherington
Turing Research Fellows: Kirstie Whitaker, Tomas Petricek

Data management for reproducible research

FAIR Data Principles
- Findable
- Accessible
- Interoperable
- Re-usable
Source: FORCE11 website. https://www.force11.org/group/fairgroup/fairprinciples. Accessed on 07 Sep 2017

Code management for reproducible research
- How do I get your code? Online repositories and persistent archives with versioning support
- How do I use your code? Documentation, examples, packages, virtual machines, containers
- How do I trust your code? Tests, examples, readable code
- How do I build on your code? Documentation, readable code, tests
- What am I allowed to do with your code? Licence

Data management for reproducible research
- How do I get your data? Online repositories with versioning and APIs for data access
- How do I use your data? Documentation, metadata, common data formats, data packages
- How do I trust your data? Record of provenance and processing, versioning
- How do I build on your data? Record of provenance and processing, compatible content, linkable to other data
- What am I allowed to do with your data? Licences, terms of use, data access agreements, ethics
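Several of the answers above (metadata, data packages, versioning, provenance, licences) can travel with the data as one machine-readable descriptor. The following is a minimal sketch, loosely modelled on the Frictionless Data Package layout; the dataset name, path, source title and licence are hypothetical, and the field names are an assumption based on that specification rather than anything shown in the slides.

```python
import hashlib
import json


def describe_resource(name, path, content, fmt="csv"):
    """Return a resource entry carrying a SHA-256 checksum of the raw bytes."""
    digest = hashlib.sha256(content).hexdigest()
    return {"name": name, "path": path, "format": fmt, "hash": "sha256:" + digest}


def build_descriptor(name, version, licence_name, resources, sources):
    """Bundle version, licence, provenance and resources into one descriptor."""
    return {
        "name": name,
        "version": version,
        "licenses": [{"name": licence_name}],
        "sources": sources,
        "resources": resources,
    }


# Hypothetical example data; in practice the bytes would be read from the file.
raw = b"country,year,value\nGB,2016,1.0\n"
descriptor = build_descriptor(
    name="example-trade-data",
    version="1.0.0",
    licence_name="CC-BY-4.0",
    resources=[describe_resource("trade", "data/trade.csv", raw)],
    sources=[{"title": "Hypothetical upstream source"}],
)
print(json.dumps(descriptor, indent=2))
```

A consumer can recompute the checksum after download and compare it with the `hash` field to confirm they hold the same bytes the publisher described.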

Good examples

UN Comtrade database
- Web API for programmatic access
- Can apply current and historical classification codes to entire dataset
- Can select subset of data to retrieve along multiple dimensions
Source: Screenshot of UN Comtrade database website. https://comtrade.un.org/data. Accessed on 06 Sep 2017
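To illustrate what selecting a subset along multiple dimensions can look like from code, here is a small sketch that builds a query URL. The endpoint and every parameter name are hypothetical placeholders, not the real UN Comtrade API; consult the service's own documentation for the actual request format.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint used only for illustration.
BASE_URL = "https://example.org/api/get"


def build_query_url(reporter, partner, period, commodity_code, classification="HS"):
    """Build a query URL that narrows the dataset along several dimensions."""
    params = {
        "reporter": reporter,              # reporting country
        "partner": partner,                # trade partner
        "period": period,                  # year of interest
        "commodity": commodity_code,       # commodity code
        "classification": classification,  # classification scheme to apply
    }
    return BASE_URL + "?" + urlencode(params)


url = build_query_url("GB", "FR", 2016, "0901")
print(url)

# Round-trip the query string to show each dimension is independently addressable.
query = parse_qs(urlsplit(url).query)
print(query["period"], query["commodity"])
```

Because the whole request is a single URL, the exact subset used in an analysis can be recorded alongside the results and re-fetched later.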

UN Comtrade database
- Third-party R package available for querying web API
Source: Screenshot from Comtradr R package Github README.md. https://github.com/ChrisMuir/comtradr. Accessed on 06 Sep 2017

ConnectomeDB
- Website requires registration and login
Source: Screenshot of ConnectomeDB login page. https://db.humanconnectome.org. Accessed on 06 Sep 2017

ConnectomeDB
- One-time click for acceptance of terms
- Generate dedicated Amazon AWS access credentials
Source: Screenshot of ConnectomeDB main page. https://db.humanconnectome.org. Accessed on 06 Sep 2017

The Gamma
- Dot-driven development
- Intellisense autocomplete for data exploration
- Interactive dynamic data preview
- Uses F# type providers
- For more details, see http://tomasp.net/academic/papers/pivot/
Source: The Gamma homepage. https://thegamma.net/. Accessed on 06 Sep 2017

The Gamma
- Sub categories indicated by initial numerals
- Sub-sub categories indicated by text formatting
- Subtotals indicated by background colour
Source: UK National Statistics Public Expenditure Statistical Analyses 2016. Chapter 5 table 5.2. https://www.gov.uk/government/statistics/public-expenditure-statistical-analyses-2016/. Accessed on 06 Sep 2017

The Gamma
Source: Gamma @ The Turing: Accounting for Democracy. http://gamma.turing.ac.uk/expenditure/. Accessed on 06 Sep 2017


Dream data

My wish list
- Repository supporting versioning and content-aware sub-setting
- Data includes raw and processed data, with code to replicate processing
- Content-aware, on-demand differential download
- Automatable access to data requiring an access agreement / authentication
- Data accessible as native code objects
- Documentation accessible in context of data presentation
- Standard, machine-readable licences
- Repository tracks download / usage stats
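The differential download item in the wish list can be sketched as a comparison of two manifests that map file paths to content hashes. This is an illustrative assumption about how such a feature might work, not a description of any existing repository; the manifest layout and the file names are hypothetical.

```python
def plan_sync(local, remote):
    """Compare {path: content_hash} manifests and decide what to transfer.

    Returns (to_fetch, to_delete): paths whose content is new or changed in
    the remote manifest, and paths that no longer exist remotely.
    """
    to_fetch = sorted(p for p, h in remote.items() if local.get(p) != h)
    to_delete = sorted(p for p in local if p not in remote)
    return to_fetch, to_delete


# Hypothetical manifests for a local copy and a newer repository version.
local_manifest = {"raw/2015.csv": "aaa", "raw/2016.csv": "bbb", "old/notes.txt": "ccc"}
remote_manifest = {"raw/2015.csv": "aaa", "raw/2016.csv": "bbb2", "raw/2017.csv": "ddd"}

to_fetch, to_delete = plan_sync(local_manifest, remote_manifest)
print(to_fetch)   # files to download
print(to_delete)  # files to remove locally
```

Only the changed and new files are transferred, so updating to a new dataset version costs in proportion to what changed rather than to the size of the whole dataset.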

Interesting tools
- Repositories: Figshare, Zenodo, Dataverse, DataONE, Dryad
- Data access: Repository APIs, rOpenSci, SPARQL
- Data formats: RDF, OWL, Research object bundles, BagIt, Frictionless data
- Differencing data: Daff (tables), data-diff (JSON), data-diff (Python)
- Provenance / processing record: Workflow platforms (e.g. Galaxy), execution capture tools (e.g. Sumatra)
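As a rough illustration of what differencing tabular data involves, the sketch below compares two versions of a small CSV table row by row, keyed on an identifier column. It uses only the Python standard library and is not the algorithm or API of Daff or any other tool listed above; the column names and values are hypothetical.

```python
import csv
import io


def index_rows(csv_text, key):
    """Parse CSV text into a dict mapping the key column's value to the row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_tables(old_text, new_text, key):
    """Return (added, removed, changed) rows between two table versions."""
    old = index_rows(old_text, key)
    new = index_rows(new_text, key)
    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]
    changed = []
    for k in old:
        if k in new and old[k] != new[k]:
            # Record (old value, new value) for each column that differs.
            cols = {c: (old[k].get(c), new[k][c])
                    for c in new[k] if old[k].get(c) != new[k][c]}
            changed.append((k, cols))
    return added, removed, changed


old_version = "id,value\n1,a\n2,b\n3,c\n"
new_version = "id,value\n1,a\n2,B\n4,d\n"
added, removed, changed = diff_tables(old_version, new_version, key="id")
print(added)
print(removed)
print(changed)
```

A row-level diff like this can be stored with each dataset version, giving readers a reviewable record of what changed between releases.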

turing.ac.uk | @turinginst
moreilly@turing.ac.uk | @martinoreilly