Approaches to Making Dynamic Data Citable: Update on the Activities of the RDA Working Group
Andreas Rauber, Vienna University of Technology, Vienna, Austria

rauber@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at/~andi/

Outline
- What are the challenges in citing dynamic data?
  - Data Citation: the status quo and requirements
- How can we enable precise citation of dynamic data?
  - Making dynamic data citable
- Will the concept work?
  - A look at some pilots and reference implementations
- Does this work for all data?
  - Next steps, open issues, and the RDA Working Group

Data Citation
- Citing data should be easy
  - from providing a URL in a footnote
  - via providing a reference in the bibliography section
  - to assigning a PID to a dataset (DOI, ARK, …) in a repository

Dynamic Data Citation
- Citable datasets have to be static
  - Fixed set of data, no changes: no corrections to errors, no new data being added
- But: (research) data is dynamic
  - Adding new data, correcting errors, enhancing data quality, …
  - Changes are sometimes highly dynamic, at irregular intervals
- Current approaches
  - Identifying the entire data stream, without any versioning
  - Using an “accessed at” date
  - “Artificial” versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!)
- We would like to cite precisely the data as it existed at a certain point in time, without delaying the release of new data

Fine-Granular Data Citation
- What about the granularity of the data to be cited?
  - Datasets contain vast amounts of data
  - Researchers use specific subsets of this data
  - Need to identify precisely the subset used
- Current approaches
  - Storing a copy of the subset as used in the study -> scalability
  - Citing the entire dataset, providing a textual description of the subset -> imprecise (ambiguity)
  - Storing a list of record identifiers in the subset -> scalability, not for arbitrary subsets (e.g. when not the entire record is selected)
- We would like to be able to cite precisely the subset of dynamic data used in a study

Data Citation: Current Approaches
- Persistent Identifiers (PIDs), e.g. DOI, URI, ARK, …, are currently provided for
  - entire data sets, copies of subsets
  - static data, sometimes releases of versions
  - data cited in its entirety, with a textual description of subsets
- This is insufficient in many settings
  - imprecise
  - not machine-actionable
  - not scalable for large data sets
  - insufficient support for data that changes
  - insufficient support for arbitrary subsets (rows/columns)

Data Citation – Requirements for Citing
- Arbitrary subsets of data
  - rows/columns, time sequences, …
  - from a single number to the entire set
- Dynamic data
  - corrections, additions, …
- Stable across technology changes
  - e.g. migration to a new database
- Machine-actionable
  - not just machine-readable, definitely not just human-readable and interpretable
- Scalable to very large / highly dynamic datasets

RDA WG Data Citation
- Research Data Alliance
- WG on Data Citation: Making Dynamic Data Citeable
- WG officially endorsed in March 2014
  - Concentrating on the problems of dynamic (changing) data (sub)sets
  - Focus!
  - Liaise with other WGs on attribution, metadata, …
  - Liaise with other initiatives on data citation (CODATA, DataCite, Force11, …)


Making Dynamic Data Citeable
Data Citation: Data + Means-of-access
- Data: time-stamped & versioned (aka history)
- Researcher creates a working set via some interface
- Access: assign PID to QUERY, enhanced with
  - Time-stamping for re-execution against the versioned DB
  - Re-writing for normalization, unique sort, mapping to history
  - Hashing of the result set: verifying identity/correctness
- The PID leads to a landing page
S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013.
http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf

PID Assignment
- A PID is assigned to a query identifying a new dataset
- When to assign an existing/new PID to a query?
  - Existing PID: identical query (semantics) with identical result set, i.e. no change to any element touched upon by the query since first processing of the query
  - New PID: whenever query semantics is not absolutely identical (irrespective of the result set being potentially identical!) or when results differ (update to data)
- Note:
  - An identical result does not mean that the query semantics is identical
  - Different PIDs will be assigned to capture query semantics
  - Need to normalize the query to allow comparison -> query re-writing
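The assignment rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the working group's reference implementation: `assign_pid`, the in-memory `store` dictionary and the `mint_pid` callback are hypothetical names, and the two hashes are assumed to come from a normalized query string and from the result set. A result hash alone cannot see a record that was added and later removed again, so a real system would also compare change time stamps.

```python
from typing import Callable, Dict, Tuple


def assign_pid(
    query_hash: str,
    result_hash: str,
    store: Dict[Tuple[str, str], str],
    mint_pid: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (pid, is_new) for a normalized query and its result set."""
    key = (query_hash, result_hash)
    if key in store:
        # Same query semantics and same result: reuse the existing PID.
        return store[key], False
    # Different semantics or changed data: mint and remember a new PID.
    pid = mint_pid()
    store[key] = pid
    return pid, True
```

Keying the store on both hashes means that two different queries that happen to return the same rows still receive different PIDs, which is the behaviour the slide asks for.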

Query Re-Writing
- Query re-writing is needed to
  - Standardize/normalize the query to help identify semantically identical queries
  - Adapt to the versioning approach chosen (versioning in operational tables, separate history table, …)
  - Add a timestamp to any select statement in the query
  - Potentially re-write to identify the last change to the result set touched upon (i.e. select including elements marked deleted, check the most recent timestamp, to determine correct PID assignment)
  - Apply a unique sort to the source data prior to the query to ensure a repeatable result order

Query Re-Writing
- Normalization of the query string
  - Upper / lower case spelling
  - Sorting of filtering criteria (order does not influence result semantics)
  - Compute a hash key over the query string to identify whether an identical query has been issued already
  - If an identical query is found, re-run it and check for changes in the result set based on the time stamps of data records added/deleted
  - If different, assign a new PID, otherwise the existing PID
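A small sketch of the normalization and hashing steps follows. It assumes the query has already been parsed into a table name, a column list and a list of `(column, operator, value)` filters; the function names and that representation are illustrative choices, and SHA-256 is used only as an example hash. Identifier case is folded and filters are sorted, while column order is preserved because it changes the shape of the result.

```python
import hashlib
from typing import Any, List, Sequence, Tuple

Filter = Tuple[str, str, Any]


def normalize_query(table: str, columns: Sequence[str], filters: List[Filter]) -> str:
    """Build a canonical query string from parsed query parts."""
    # Fold identifier case; keep the column order chosen by the user.
    cols = ", ".join(c.strip().lower() for c in columns)
    # Filter order does not influence result semantics, so sort it away.
    conds = sorted(
        f"{col.strip().lower()} {op.strip().upper()} {value!r}"
        for col, op, value in filters
    )
    text = f"SELECT {cols} FROM {table.strip().lower()}"
    if conds:
        text += " WHERE " + " AND ".join(conds)
    return text


def query_hash(normalized: str) -> str:
    """Hash key over the normalized query string."""
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Literal values are deliberately left untouched, since `'Rock'` and `'rock'` may select different records.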

Query Re-Writing
- Unique sort of result list
  - Most databases are set-based
  - Most subsequent processing is sequence-based
  - Need to re-write the query to apply a unique sort on any table prior to applying any user-defined sort, for repeatability
- Hashing of the result set to verify identity of the result
  - Compute over the entire result set: comprehensive, potentially slow
  - Compute over column headers and row IDs: verifies correctness of the attributes and data items selected, but does not safeguard against unmonitored changes to attribute values
  - Well-defined hash input data (data migrations)
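Both ideas on this slide can be illustrated on plain tuples. The sketch below assumes each row carries a unique record identifier at position `id_index`; `repeatable_order` and `hash_result` are hypothetical helper names. Python's `sorted` is stable, so sorting by the identifier first and by the user's key second leaves ties in identifier order on every run.

```python
import hashlib
from typing import Any, Callable, List, Optional, Sequence, Tuple

Row = Tuple[Any, ...]


def repeatable_order(
    rows: Sequence[Row],
    id_index: int = 0,
    user_key: Optional[Callable[[Row], Any]] = None,
) -> List[Row]:
    """Apply a unique sort first, then the (stable) user-defined sort."""
    ordered = sorted(rows, key=lambda r: r[id_index])
    if user_key is not None:
        ordered = sorted(ordered, key=user_key)
    return ordered


def hash_result(
    columns: Sequence[str],
    ordered_rows: Sequence[Row],
    id_index: int = 0,
    full: bool = True,
) -> str:
    """Hash the whole result set, or only column headers and row IDs."""
    digest = hashlib.sha256()
    digest.update("\x1f".join(columns).encode("utf-8"))
    for row in ordered_rows:
        cells = row if full else (row[id_index],)
        digest.update(b"\x1e" + "\x1f".join(str(c) for c in cells).encode("utf-8"))
    return digest.hexdigest()
```

The lightweight variant (`full=False`) is cheaper but, as the slide notes, it does not notice a changed attribute value.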

Timestamping
- Which timestamp to assign to a new query?
  - Timestamp of query processing
  - Timestamp of last change to DB (global)
  - Timestamp of last change to the result set touched upon by the query (including deletes)
    - most complex approach in terms of the query re-writing required: select with deletes, extract latest TS, then filter
    - closest to the traditional concept of "version"
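The third option can be sketched on in-memory records. The example assumes each record is a mapping with an `inserted` time stamp and an optional `deleted` time stamp, both comparable values such as integers; the field names and the `last_change` helper are assumptions made for illustration.

```python
from typing import Any, Callable, Iterable, Mapping, Optional


def last_change(
    records: Iterable[Mapping[str, Any]],
    selected: Callable[[Mapping[str, Any]], bool],
    query_time: Any,
) -> Optional[Any]:
    """Latest insert or delete, up to query_time, among records the query touches."""
    latest = None
    for rec in records:
        # Deleted records are still considered: a delete is a change too.
        if not selected(rec):
            continue
        for ts in (rec["inserted"], rec.get("deleted")):
            if ts is not None and ts <= query_time:
                if latest is None or ts > latest:
                    latest = ts
    return latest
```

Two executions of the same query that yield the same `last_change` value saw the same version of the subset, which is why this option comes closest to a traditional version number.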

Making Dynamic Data Citeable
- Building blocks for supporting dynamic data citation:
  - Uniquely identifiable data records (for unique sort)
  - Versioned data, marking changes as insertion/deletion
  - Time stamps of data insertions / deletions
  - “Query language” for constructing subsets
- Add modules:
  - Persistent query store: queries, timestamp, hash, metadata including creator of subset
  - Query rewriting module
  - PID assignment to queries
  - Landing page design, citation text
- Stable across data source migrations (e.g. different DBMS), scalable, machine-actionable

Data Citation – Deployment
- Researcher uses a workbench to identify a subset of data
- Upon executing the selection ("download") the user gets
  - Data (package, access API, …)
  - PID (e.g. DOI) (query is time-stamped and stored)
  - Hash value computed over the data for local storage
  - Recommended citation text (e.g. BibTeX)
- PID resolves to a landing page
  - Provides detailed metadata, link to parent data set, subset, …
  - Option to retrieve original data OR current version OR changes
- Upon activating the PID associated with a data citation
  - Query is re-executed against the time-stamped and versioned DB
  - Results as above are returned
- This is an important advantage over traditional approaches that rely on, e.g., storing a list of identifiers!


Report from Pilots
- Reports from pilots
  - SQL data: LNEC, MSD reference implementations
  - CSV: MSD presentation
  - CLARIN presentation
  - XML data
  - Results from the VAMDC workshop
  - Results from the NERC workshop

Dynamic Data Citation - Pilots
Dynamic Data Citation for SQL Data
LNEC, MSD Reference Implementation

SQL Prototype Implementation
- LNEC Laboratory of Civil Engineering, Portugal
- Monitoring dams and bridges
- 31 manual sensor instruments
- 25 automatic sensor instruments
- Web portal
  - Select sensor data
  - Define timespans
- Report generation
  - Analysis processes produce LaTeX, which produces a PDF report
Image: Florian Fuchs [CC-BY-3.0 (http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons

SQL Prototype Implementation
- Million Song Dataset: http://labrosa.ee.columbia.edu/millionsong/
- Largest benchmark collection in Music Retrieval
- Original set provided by The Echo Nest
- No audio, only a set of features
- Harvested; additional features and metadata extracted and offered by several groups, e.g. http://www.ifs.tuwien.ac.at/mir/msd/download.html
- Dynamics because of metadata errors, extraction errors
- Research groups select subsets by genre, audio length, audio quality, …

SQL Time-Stamping and Versioning
- Integrated
  - Extend original tables by temporal metadata
  - Expand primary key by record-version column
- Hybrid
  - Utilize history table for deleted record versions with metadata
  - Original table reflects latest version only
- Separated
  - Utilizes full history table
  - Inserts are also reflected in the history table
- Solution to be adopted depends on trade-off
  - Storage demand
  - Query complexity
  - Software adaptation
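The integrated variant can be tried out with SQLite from the Python standard library. The sketch below is an illustration under assumptions rather than the LNEC or MSD implementation: the `songs` table, the `valid_from` / `valid_to` columns and the integer time stamps are invented for the example. The primary key is expanded by a `version` column, and rows are never physically deleted.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a table whose primary key includes a record-version column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE songs (
               id INTEGER NOT NULL,
               version INTEGER NOT NULL,
               title TEXT,
               valid_from INTEGER NOT NULL,
               valid_to INTEGER,
               PRIMARY KEY (id, version))"""
    )
    return conn


def close_version(conn: sqlite3.Connection, rec_id: int, ts: int) -> None:
    """End the current version of a record (a logical delete)."""
    conn.execute(
        "UPDATE songs SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
        (ts, rec_id),
    )


def write_version(conn: sqlite3.Connection, rec_id: int, title: str, ts: int) -> None:
    """Insert a record, or add a new version of an existing one."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM songs WHERE id = ?", (rec_id,)
    ).fetchone()
    close_version(conn, rec_id, ts)
    conn.execute(
        "INSERT INTO songs VALUES (?, ?, ?, ?, NULL)",
        (rec_id, latest + 1, title, ts),
    )


def as_of(conn: sqlite3.Connection, ts: int) -> list:
    """Return the table content as it existed at time ts."""
    return conn.execute(
        "SELECT id, title FROM songs "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY id",
        (ts, ts),
    ).fetchall()
```

The trade-off named on the slide is visible here: every read needs the extra time predicate, but no second table has to be maintained.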

SQL: Storing Queries
- Add a query store containing
  - PID of the query
  - Original query
  - Re-written query + query string hash
  - Timestamp (as used in re-written query)
  - Hash key of query result
  - Metadata useful for citation / landing page (creator, institution, rights, …)
  - PID of parent dataset (or using fragment identifiers for query)
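One possible shape for such a query store is a single table with one column per item listed above. The schema, the column names and the helper functions below are assumptions made for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3

FIELDS = (
    "pid", "original_query", "rewritten_query", "query_hash",
    "exec_timestamp", "result_hash", "creator", "institution",
    "rights", "parent_pid",
)


def open_query_store() -> sqlite3.Connection:
    """Create an in-memory query store with one row per cited query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE query_store (
               pid TEXT PRIMARY KEY,
               original_query TEXT NOT NULL,
               rewritten_query TEXT NOT NULL,
               query_hash TEXT NOT NULL,
               exec_timestamp TEXT NOT NULL,
               result_hash TEXT NOT NULL,
               creator TEXT,
               institution TEXT,
               rights TEXT,
               parent_pid TEXT)"""
    )
    return conn


def save_query(conn: sqlite3.Connection, entry: dict) -> None:
    """Store one query; entry must provide a value for every name in FIELDS."""
    columns = ", ".join(FIELDS)
    params = ", ".join(":" + name for name in FIELDS)
    conn.execute(f"INSERT INTO query_store ({columns}) VALUES ({params})", entry)


def find_by_query_hash(conn: sqlite3.Connection, query_hash: str) -> list:
    """Return PIDs of earlier executions of the same normalized query."""
    rows = conn.execute(
        "SELECT pid FROM query_store WHERE query_hash = ? ORDER BY exec_timestamp",
        (query_hash,),
    )
    return [row[0] for row in rows]
```

Because `pid` is the primary key, an attempt to store a second query under an already used PID fails instead of silently overwriting the citation.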


SQL Query Re-Writing
- Normalizing queries to detect identical queries
  - WHERE clause sorted
  - Calculate query string hash
  - Identify semantically identical queries
  - Non-identical queries: columns in different order

SQL Query Re-Writing
- Adapt query to history table
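The slide's example query is not preserved in this transcript, so the following is only a sketch of what such a rewrite could look like. It assumes a live table and a `<table>_history` table with identical columns, both carrying `valid_from` and `valid_to`; those names, the `:ts` parameter and the `rewrite_for_history` helper are hypothetical. The `where` text is interpolated as given and must therefore come from a trusted query parser.

```python
from typing import Sequence


def rewrite_for_history(
    columns: Sequence[str], table: str, where: str, id_column: str = "id"
) -> str:
    """Rewrite a simple SELECT so it reads the data as of the time :ts."""
    cols = ", ".join(columns)
    return (
        f"SELECT {cols} FROM ("
        f"SELECT * FROM {table} UNION ALL SELECT * FROM {table}_history"
        f") WHERE ({where}) "
        "AND valid_from <= :ts AND (valid_to IS NULL OR valid_to > :ts) "
        f"ORDER BY {id_column}"  # unique sort for a repeatable sequence
    )
```

The rewritten statement is what would be stored next to the original query, together with the value bound to `:ts`.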

Dynamic Data Citation - Pilots
Dynamic Data Citation for CSV Data
Stefan Pröll, Secure Business Austria
sproell@sba-research.org

Dynamic Data Citation for CSV Data
- Goals:
  - Ensure citability of CSV data
  - Enable subset citation
  - Support particularly small and large volume data
  - Support dynamically changing data
- Why CSV data?
  - Well understood and widely used
  - Simple and flexible
  - Most frequently requested during initial RDA meetings

CSV: Basic Steps
- Upload interface
  - Upload CSV files
- Migrate CSV file into RDBMS
  - Generate table structure
  - Add metadata columns for versioning
  - Add indices
- Dynamic data
  - Update existing records
  - Append new data
- Access interface
  - Track subset creation
  - Store queries
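The migration step can be sketched with the standard `csv` and `sqlite3` modules. This is an illustration only: the `load_csv` helper, the `row_id`, `inserted_at` and `deleted_at` columns and the integer time stamp are assumptions, every data column is stored as text, and header names are assumed not to contain double quotes.

```python
import csv
import io
import sqlite3


def load_csv(conn: sqlite3.Connection, table: str, csv_text: str, ts: int) -> None:
    """Migrate CSV text into a table with versioning metadata columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Generate the table structure from the header row.
    data_cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" ('
        f"row_id INTEGER PRIMARY KEY, {data_cols}, "
        "inserted_at INTEGER NOT NULL, deleted_at INTEGER)"
    )
    names = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" ({names}, inserted_at) VALUES ({marks}, ?)',
        ([*row, ts] for row in reader),
    )
    # Index the time stamps that every re-written query will filter on.
    conn.execute(f'CREATE INDEX "{table}_ts" ON "{table}" (inserted_at, deleted_at)')
```

The generated `row_id` gives every record the unique identifier that the unique sort depends on, even when the CSV file has no key column of its own.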

CSV Data Prototype

Dynamic Data Citation - Pilots
Dynamic Data Citation for XML Data
Stefan Pröll, Secure Business Austria
sproell@sba-research.org

Dynamic Data Citation for XML Data
- Goals:
  - Cite arbitrary subsets of XML data (subsets, nodes, attributes)
  - Enable dynamic data
  - Utilize available query languages
- Why XML data?
  - Used in many different settings
  - Complex structures possible
  - Schema available

XML Data Approaches
- Apply data citation framework
  - Add metadata for versioning
  - Mark insert, update and delete operations
  - No actual deletes
- Approaches
  - Copy branches upon updates and deletes
    - Simple approach, but uses storage space
  - Introduce parent/child relationships and
    - Resolution more complex
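The branch-copy approach can be sketched with Python's `xml.etree.ElementTree`. The `inserted` and `deleted` attribute names, the integer time stamps and the helper functions are assumptions for illustration, not the pilot's actual markup. Nothing is ever removed from the tree; an update marks the old node as deleted and appends a new copy, so the new version lands at the end of its parent rather than at the old position.

```python
import xml.etree.ElementTree as ET

VERSION_ATTRS = ("inserted", "deleted")


def add_node(parent: ET.Element, tag: str, text: str, ts: int) -> ET.Element:
    """Insert a child and record when it was inserted."""
    node = ET.SubElement(parent, tag)
    node.text = text
    node.set("inserted", str(ts))
    return node


def delete_node(node: ET.Element, ts: int) -> None:
    """Mark a node as deleted; the node itself stays in the tree."""
    node.set("deleted", str(ts))


def replace_node(parent: ET.Element, old: ET.Element, text: str, ts: int) -> ET.Element:
    """Update by retiring the old node and adding a new copy."""
    delete_node(old, ts)
    return add_node(parent, old.tag, text, ts)


def snapshot(node: ET.Element, ts: int) -> ET.Element:
    """Return a copy of the tree as it existed at time ts."""
    attrs = {k: v for k, v in node.attrib.items() if k not in VERSION_ATTRS}
    copy = ET.Element(node.tag, attrs)
    copy.text = node.text
    for child in node:
        inserted = int(child.get("inserted", "0"))
        deleted = child.get("deleted")
        if inserted <= ts and (deleted is None or int(deleted) > ts):
            copy.append(snapshot(child, ts))
    return copy
```

The storage cost mentioned on the slide shows up directly: every update leaves one more node in the stored document.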

Dynamic Data Citation for XML Data
- XML database: BaseX
  - Lightweight system
  - Client/server architecture
  - ACID-safe transactions
  - XPath/XQuery 3.0 processor
  - Scalable
- E.g. used in UK Data Service use case
  - Textual transcripts in XML format
  - Unique identification of (sub)sections

Dynamic Data Citation for XML Data
- Adapt XQuery parser, rewrite operations for
  - Inserting
  - Deleting
  - Replacing
  - Renaming
- Reuse query parser for alternative implementations
  - E.g. eXistDB

Dynamic Data Citation - Pilots
Support for Dynamic Data Citation in CLARIN
Dieter van Uytvanck, Max Planck Institute
dieter.vanuytvanck@mpi.nl

Use case: field linguistics
- Field linguistics: language archive: transcriptions
- Type: well-described XML files (stand-off annotations to stable video/audio files)
- Size: small (a few 100 KB)
- Dynamics: tens rather than hundreds of versions
- Citation practice: fragments rather than the whole file (illustrations, examples, counter-examples)

Use case: field linguistics
- https://corpus1.mpi.nl/ds/imdi_browser/versioninginfo.jsp?nodeid=MPI2002357%23
- The only timestamps displayed are those of the moment a new version is created (hover over the date), but this is rather a limit of the displaying application
- Permission system: by default, older versions are not accessible to anyone other than the resource owner

Use case: field linguistics – handle record
- MD5 checksum and timestamp in handle records:
  - http://hdl.handle.net/1839/00-0000-0000-001E-8DA4-1?noredirect
  - http://hdl.handle.net/1839/00-0000-0000-001E-8DB5-6?noredirect

Virtual Collections - Implementation
- Beta version of the Virtual Collection Registry available: http://clarin.eu/vcr
- Official release planned for early October
- Free software, GPL v3
- Comes with
  - Federated identity
  - Persistent identifiers
  - Metadata harvesting
- Makes it easy to publish links to versioned datasets

Dynamic Data Citation - Pilots
Results from VAMDC Workshop
Carlo Maria Zwölf, Virtual Atomic and Molecular Data Centre
carlo-maria.zwolf@obspm.fr

VAMDC
- Virtual Atomic and Molecular Data Centre
- Worldwide e-infrastructure federating 41 heterogeneous and interoperable atomic and molecular databases
- Nodes decide independently about growth rate, ingest system, and corrections to apply to already stored data
- Data nodes may use different technologies for storing data (SQL, NoSQL, ASCII files)
- All implement the VAMDC access/query protocols
- Results are returned in a standardized XML format (XSAMS)
- Access directly node by node, or via the VAMDC portal, which relays the user request to each node

VAMDC: Issues identified
- Each data node could modify/delete/add data without tracing
- No support for reproducibility of past data extractions
Proposed Data Citation WG solution:
- Considering the distributed architecture of the federated VAMDC infrastructure, it seemed very complex to apply the “Query Store” (QS) strategy
  - Would we need a QS on each node?
  - Would we need an additional QS on the central portal?
  - Since the portal acts as a relay between the user and the existing nodes, how can we coordinate the generation of PIDs for queries in this distributed context?

VAMDC: Changes adopted following the workshop
- An existing dataset will never be deleted or modified
- If a correction and/or addition to an existing data node is needed, it will be associated with the creation of a new dataset
- The genealogy of the families of datasets is maintained automatically
- Users and data providers will be able to know the creation date of a dataset, its ancestors and its descendants
- The XSAMS format (the VAMDC standard for formatting the results) will be modified to natively include references to the datasets used for its composition
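The adopted rules amount to an append-only registry with parent links. The sketch below is not VAMDC code; `DatasetRegistry` and its methods are hypothetical names, and it assumes each new dataset has at most one direct ancestor.

```python
from typing import Dict, List, Optional


class DatasetRegistry:
    """Append-only registry: datasets are never modified or deleted."""

    def __init__(self) -> None:
        self._parent: Dict[str, Optional[str]] = {}
        self._created: Dict[str, str] = {}

    def publish(self, dataset_id: str, created: str, parent_id: Optional[str] = None) -> None:
        """Register a new dataset, optionally as a correction of parent_id."""
        if dataset_id in self._parent:
            raise ValueError(f"{dataset_id} already exists and is immutable")
        if parent_id is not None and parent_id not in self._parent:
            raise KeyError(f"unknown parent {parent_id}")
        self._parent[dataset_id] = parent_id
        self._created[dataset_id] = created

    def created(self, dataset_id: str) -> str:
        return self._created[dataset_id]

    def ancestors(self, dataset_id: str) -> List[str]:
        """Walk the parent links, nearest ancestor first."""
        chain = []
        current = self._parent[dataset_id]
        while current is not None:
            chain.append(current)
            current = self._parent[current]
        return chain

    def descendants(self, dataset_id: str) -> List[str]:
        """Collect all datasets derived from dataset_id, breadth first."""
        found: List[str] = []
        frontier = [dataset_id]
        while frontier:
            node = frontier.pop(0)
            children = [d for d, p in self._parent.items() if p == node]
            found.extend(children)
            frontier.extend(children)
        return found
```

A result file that names the dataset identifiers it was composed from can then be traced back, or forward to newer corrections, without any query store on the nodes.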

Dynamic Data Citation - Pilots
Results from NERC Workshop
June 1-2 2014, London
John Watkins, Centre for Ecology and Hydrology
jww@ceh.ac.uk

Data Citation WG – July 2014 London workshop report
Following RDA Plenary 3 in Dublin:
- Plans were made for a workshop looking at specific use cases of data citation
- WG members were invited to the British Library in London to contribute use cases
- The workshop was arranged and mainly attended by UK Natural Environment Research Council data centres and hosted by the British Library

Data Citation WG – July 2014 London workshop report
Aims of workshop
- To present the RDA WG conceptual model addressing citation of dynamic data to a group of data curation practitioners
- To assess goodness of fit of the model for the requirements of users, curators, publishers, authors
- To extend and/or improve the model to meet the widest range of data users
- To plan test implementations of the citation model with various dynamic data curated by the group

Data Citation WG – July 2014 London workshop report
Workshop facilitation: what we did
- We had a facilitator!
- Detailed plan for the 2 days
- We made sure everybody got out of their seats and contributed ideas
- We captured all the work done and tried to present a summary report
- Looked from 4 perspectives
  - Data user
  - Data depositor
  - Data centre / repository
  - Data publishers / journals
- Looked at 4 use cases
  - Butterfly monitoring
  - Ocean buoy network
  - Sociological archive
  - National hydrological archive

Data Citation WG – July 2014 London workshop report
Worked on positives and negatives of the model:
- Data centres liked PIDs but didn’t want thousands of them
- Publishers didn’t want very fine-grained PIDs into data sets
- Originators wanted recognition for the data sets they produced
- We voted on which aspects we thought were the most important to improve

Data Citation WG – July 2014 London workshop report
Ideas for general improvements and practical steps
- We looked at different issues in the model, especially subsets and versioning in file repositories
- UKDA gave a clear demonstration of how to cite parts of documents
- We found a need for broad PIDs for collections and finer PIDs for versions and additions
- We looked at improvements to the working of the current use cases

Data Citation WG – July 2014 London workshop report
Outcome of workshop
- Stefan has written a paper for IEEE detailing the approach and use cases
- The ARGO buoy network has a draft proposal for how to implement data citation of the SeaDataNet
- Other UK NERC data centres are continuing with their data citation developments
- The ESIP (Federation of Earth Science Information Partners) will hold a case study workshop in Jan 2015 (Washington DC)
- A working group may be organized in China next year by the Inst. of Scientific and Technical Information of China (ISTIC)


Data Citation: Next steps
- Solution devised for SQL -> expand to other data types
  - SQL: LNEC, MSD
  - Pilot for CSV: MSD
  - Analyze how to make XML and RDF time-stamped, versioned
  - LOD, NoSQL, …
- Verify pilots conceptually
  - Does it work?
  - Impact on data center (size, operations, APIs, …), specifically: how to realize versioning
  - How to integrate in workbenches?
- Implement several pilots and verify
- Test stability under migrations of data management systems

Join RDA and Working Group
If you are interested in joining the discussion, wish to establish a data citation solution, …
- Register for the RDA WG on Data Citation:
  - Website: https://rd-alliance.org/working-groups/data-citation-wg.html
  - Mailing list: https://rd-alliance.org/node/141/archive-post-mailinglist
  - Web conferences: https://rd-alliance.org/webconference-data-citation-wg.html
  - List of pilots: https://rd-alliance.org/groups/data-citation-wg/wiki/collaboration-environments.html

Thank you for your attention.
