RDA WG on Dynamic Data Citation

Slides:

Advertisements

Similar presentations

Software change management

Advertisements

Configuration management

Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.

U of R eXtensible Catalog Team MetaCat. Problem Domain.

This chapter is extracted from Sommerville’s slides. Text book chapter

Presented by DOI Create: TERN as a use-case Siddeswara Guru

Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E.

Database System Concepts and Architecture Lecture # 3 22 June 2012 National University of Computer and Emerging Sciences.

Approaches to Making Dynamic Data Citable: Update on the Activities of the RDA Working Group Andreas Rauber Vienna University of Technology Viennna, Austria.

Online Autonomous Citation Management for CiteSeer CSE598B Course Project By Huajing Li.

Data File Access API : Under the Hood Simon Horwith CTO Etrilogy Ltd.

OracleAS Reports Services. Problem Statement To simplify the process of managing, creating and execution of Oracle Reports.

The role of knowledge bases in improving discoverability now and in the future- why national and international collaboration is key The role of knowledge.

Preserving Complex Scientific Objects: Process Capture and Data Identification Andreas Rauber J.Binder, T.Miksa, R.Mayer, S.Pröll, S.Strodl, M.Unterberger.

EBSCOhost 2.0 GOLD/GALILEO ANNUAL USERS GROUP CONFERENCE August 1, 2008.

1 Schema Registries Steven Hughes, Lou Reich, Dan Crichton NASA 21 October 2015.

Joint Declaration of Data Citation Principles Notes [1] CODATA 2013: sec 3.2.1; Uhlir (ed.) 2012, ch 14; Altman &

1.file. 2.database. 3.entity. 4.record. 5.attribute. When working with a database, a group of related fields comprises a(n)…

Data Citation Working Group P6 23 nd Sep 2015, Paris.

C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.

NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.

Chapter 9 Database Systems © 2007 Pearson Addison-Wesley. All rights reserved.

Summary of RDA Outputs so far dr. Ir. Herman Stehouwer 22 September 2015.

Introduction to the Semantic Web and Linked Data

“Dynamic” Data at BCO-DMO Biological and Chemical Oceanography Data Management Office (BCO-DMO) Shannon Rauch -- Danie Kinkade --

4 way comparison of Data Citation Principles: Amsterdam Manifesto, CoData, Data Cite, Digital Curation Center FORCE11 Data Citation Synthesis Group Should.

M. Stockhause 1, G. Levavasseur 2, K. Berger 1 1 Deutsches Klimarechenzentrum (DKRZ) 2 Institute Pierre Simon Laplace (IPSL) ESGF-QCWT Quality Control.

1 Chapter 12 Configuration management This chapter is extracted from Sommerville’s slides. Text book chapter 29 1.

CS5604: Final Presentation ProjOpenDSA: Log Support Victoria Suwardiman Anand Swaminathan Shiyi Wei Department of Computer Science, Virginia Tech December.

Digital Preservation and Citeable Data: Challenges in Data Citation in e-Science Andreas Rauber Vienna University of Technology & Secure Business Austria.

4 way comparison of Data Citation Principles: Amsterdam Manifesto, CoData, Data Cite, Digital Curation Center FORCE11 Data Citation Synthesis Group Should.

4 way comparison of Data Citation Principles: Amsterdam Manifesto, CoData, Data Cite, Digital Curation Center FORCE11 Data Citation Synthesis Group.

NIH BioCADDIE / Force11 Data Citation Pilot Kickoff Meeting Nine Zero Hotel, Boston MA, 3 February 2016 Introduction: Tim Clark, Maryann Martone and Joan.

Data Citation Implementation Pilot Workshop

Data Citation Dataverse Mercè Crosas Chief Data Science and Technology Officer, IQSS, Harvard Workshop: Data Citation.

Joint Declaration of Data Citation Principles (Overview) The Data Citation Synthesis Group Joint Declaration.

Retele de senzori Curs 2 - 1st edition UNIVERSITATEA „ TRANSILVANIA ” DIN BRAŞOV FACULTATEA DE INGINERIE ELECTRICĂ ŞI ŞTIINŢA CALCULATOARELOR.

ICSU-WDS & RDA Data Publication Services WG. 2 Linking Research Data and the Literature: why? Why link? 1.Increase visibility & discoverability of research.

Dr. Ari Asmi Research Coordinator Faculty of Science Department of Physics MAKING DYNAMIC DATA CITABLE: APPROACHES TO DATA CITATION WITHIN AS A RDA WORKING.

An Introduction to Data Modeling with Fedora Thorny Staples Fedora Commons, Inc.

Approaches to Making Data Citeable Recommendations of the RDA Working Group Andreas Rauber, Ari Asmi, Dieter van Uytvanck Stefan Pröll.

What is a database? (a supplement, not a substitute for Chapter 1…) some slides copied/modified from text Collection of Data? Data vs. information Example:

Management Information Systems by Prof. Park Kyung-Hye Chapter 7 (8th Week) Databases and Data Warehouses 07.

MIKADO – Generation of ISO – SeaDataNet metadata files

Identifiers and Citation

Open data For improved land governance

Current and Upcoming RDA Recommendations Dr. ir. Herman Stehouwer

B.6 Roadmap 2013 – 2014 SDMX RI User Group Luxembourg, September 2013.

MongoDB Er. Shiva K. Shrestha ME Computer, NCIT

WG Research Data Collections RDA P10 Montréal – September 2017

Adoption of Data Citation Outcomes by BCO-DMO Cynthia Chandler, Adam Shepherd, David Bassendine Biological and Chemical Oceanography Data Management.

Identifiers and Citation

Data Type Registries (DTR)

C2CAMP (A Working Title)

Hyperledger Fabric Private Data - Collection Types

MANAGING DATA RESOURCES

OpenML Workshop Eindhoven TU/e,

New input for CEOS Persistent Identifier Best Practices

EUDAT B2FIND A Cross-Discipline Metadata Service and Discovery Portal

PID‘s ( in theory land ) M. Dreyer.

NoSQL Databases Antonino Virgillito.

WG Research Data Collections Draft outputs of a RDA bottom-up effort P9 - April 2017 Co-chairs: Bridget Almas, Frederik Baumgardt, Tobias Weigel, Thomas.

WG Research Data Collections An overview of the recommendation

Database Systems Instructor Name: Lecture-3.

Chapter 10 ADO.

Agenda (AM) 9:30-10:15 Introduction to RDA

The ultimate in data organization

Configuration management

Leveraging PIDs for object management in data infrastructures RDA UK Node Workshop, July Tobias Weigel (DKRZ)

Presentation transcript:

RDA WG on Dynamic Data Citation Andreas Rauber Vienna University of Technology & Secure Business Austria rauber@ifs.tuwien.ac.at http://www.ifs.tuwien.ac.at/~andi 1

RDA WG Data Citation Research Data Alliance WG on Data Citation: Making Dynamic Data Citeable WG officially endorsed in March 2014 2 areas of focus Citing arbitrary subsets of data Citing data that is dynamic Stable across technology changes, scalable, implementable Not in focus Metadata for citing data, landing page design Which PID solution to adopt (DOI, URI, ARK, …) Liaise with other initiatives on data citation (CODATA, DataCite, Force11, …) https://rd-alliance.org/working-groups/data-citation-wg.html

Citation of Dynamic Data Citable datasets have to be static Fixed set of data, no changes: no corrections to errors, no new data being added But: (research) data is dynamic Adding new data, correcting errors, enhancing data quality, … Changes sometimes highly dynamic, at irregular intervals Current approaches Identifying entire data stream, without any versioning Using “accessed at” date (how useful is this on its own?) “Artificial” versioning by identifying batches of data (e.g. annual), aggregating changes into releases time-delayed artificial boundaries Would like to cite (and retrieve) precisely the data as it existed at certain point in time

Granularity of Data Citation What about granularity of data to be cited? Databases collect enormous amounts of data over time Researchers use specific subsets (“rows/columns”) of data Need to identify precisely the subset used Current approaches Storing pre-defined sub-sets -> not flexible enough Storing a copy of subset as used in study -> scalability Citing entire dataset, providing textual description of subset -> imprecise (ambiguity) Storing list of record identifiers in subset (“rows”) -> scalability, not for arbitrary subsets (e.g. when not entire record selected) Would like to be able to cite precisely the subset of (dynamic) data used in a study

Data Citation – Requirements Dynamic data corrections, additions, … Arbitrary subsets of data (granularity) rows/columns, time sequences, … from single number to the entire set Stable across technology changes e.g. migration to new database / data representation Machine-actionable not just machine-readable, definitely not just human-readable and interpretable Scalable to very large / highly dynamic datasets but should also work for small, static 5

Dynamic Data Citation We have: Data + Means-of-access Dynamic Data Citation: cite data dynamically via query Steps: Data  versioned (history, with time-stamps) Researcher creates working-set via some interface: Access  assign PID to “QUERY”, enhanced with Time-stamping for re-execution against versioned DB Re-writing for normalization, unique-sort, mapping to history Hashing result-set: verifying identity/correctness leading to landing page S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013 http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf

Data Citation – Deployment Note: query string provides valuable provenance information on the data set! Researcher uses workbench to identify subset of data Upon executing selection („download“) user gets Data (package, access API, …) PID (e.g. DOI) (Query is time-stamped and stored) Hash value computed over the data for local storage Recommended citation text (e.g. BibTeX) PID resolves to landing page Provides detailed metadata, link to parent data set, subset,… Option to retrieve original data OR current version OR changes Upon activating PID associated with a data citation Query is re-executed against time-stamped and versioned DB Results as above are returned This is an important advantage over traditional approaches relying on, e.g. storing a list of identifiers/DB dump!!!

To Discuss Today Q/A on RDA recommendation Needs in pilots present: sub-sets, dynamics, precision/automation Feasibility of approach for pilots: SQL, CSV, XML, within-file, … Effort for / Impact of / Interest in moving forward Maybe for later Which timestamp to assign (query-time, latest change global, latest change of data affected) Importance of unique sort and query re-write complexity Hash-key computation (entire result set, row/column header, subset, none) Relationship between subset identified and entire dataset Distributed data sources: linked / independent Migration across technological changes

Join RDA and Working Group If you are interested in joining the discussion, contributing a pilot, wish to establish a data citation solution, … Register for the RDA WG on Data Citation: Website: https://rd-alliance.org/working-groups/data-citation-wg.html Mailinglist: https://rd-alliance.org/node/141/archive-post-mailinglist Web Conferences: https://rd-alliance.org/webconference-data-citation-wg.html List of pilots: https://rd-alliance.org/groups/data-citation-wg/wiki/collaboration-environments.html 9

Thank you! https://rd-alliance.org/working-groups/data-citation-wg.html 10