1
Approaches to Making Dynamic Data Citable: Update on the Activities of the RDA Working Group Andreas Rauber Vienna University of Technology Vienna, Austria rauber@ifs.tuwien.ac.at http://www.ifs.tuwien.ac.at/~andi/
2
Outline What are the challenges in citing dynamic data? -Data Citation: the status quo and requirements How can we enable precise citation of dynamic data? -Making dynamic data citeable Will the concept work? -A look at some pilots and reference implementations Does this work for all data? -Next steps, open issues, and the RDA Working Group
3
Data Citation Citing Data should be easy -from providing a URL in a footnote -via providing a reference in the bibliography section -to assigning a PID to dataset (DOI, ARK, …) in a repository
4
Dynamic Data Citation Citable datasets have to be static -Fixed set of data, no changes: no corrections to errors, no new data being added But: (research) data is dynamic -Adding new data, correcting errors, enhancing data quality, … -Changes sometimes highly dynamic, at irregular intervals Current approaches -Identifying entire data stream, without any versioning -Using “accessed at” date -“Artificial” versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!) Would like to cite precisely the data as it existed at a certain point in time, without delaying the release of new data
5
Fine-Granular Data Citation What about granularity of data to be cited? -Datasets contain vast amounts of data -Researchers use specific subsets of this data -Need to identify precisely the subset used Current approaches -Storing a copy of subset as used in study -> scalability -Citing entire dataset, providing textual description of subset -> imprecise (ambiguity) -Storing list of record identifiers in subset -> scalability, not for arbitrary subsets (e.g. when not entire record selected) Would like to be able to cite precisely the subset of dynamic data used in a study
6
Data Citation Current Approaches Persistent Identifier (PID) e.g. DOI, URI, ARK, … currently provided for -entire data sets, copies of subsets -static data, sometimes release of versions -cited in their entirety with textual description of subsets This is insufficient in many settings -imprecise -not machine-actionable -not scalable for large data sets -insufficient support for data that changes -insufficient support for arbitrary subsets (rows/columns)
7
Data Citation – Requirements for Citing Arbitrary subsets of data -rows/columns, time sequences, … -from single number to the entire set Dynamic data -corrections, additions, … Stable across technology changes -e.g. migration to new database Machine-actionable -not just machine-readable, definitely not just human-readable and interpretable Scalable to very large / highly dynamic datasets
8
Research Data Alliance WG on Data Citation: Making Dynamic Data Citeable WG officially endorsed in March 2014 -Concentrating on the problems of dynamic (changing) data (sub)sets -Focus! -Liaise with other WGs on attribution, metadata, … -Liaise with other initiatives on data citation (CODATA, DataCite, Force11, …) RDA WG Data Citation
9
Outline What are the challenges in citing dynamic data? -Data Citation: the status quo and requirements How can we enable precise citation of dynamic data? -Making Dynamic Data Citeable Will the concept work? -A look at some pilots and reference implementations Does this work for all data? -Next steps, open issues, and the RDA Working Group
10
Making Dynamic Data Citeable Data Citation: Data + Means-of-access Data time-stamped & versioned (aka history) Researcher creates working-set via some interface Access: assign PID to QUERY, enhanced with -Time-stamping for re-execution against versioned DB -Re-writing for normalization, unique-sort, mapping to history -Hashing result-set: verifying identity/correctness leading to landing page S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013 http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf
11
PID Assignment PID assigned to a query identifying a new dataset When to assign an existing/new PID to a query? -Existing PID: identical query (semantics) with identical result set, i.e. no change to any element touched upon by the query since the query was first processed -New PID: whenever query semantics are not absolutely identical (irrespective of the result set being potentially identical!) or when results differ (update to data) Note: -An identical result does not mean that the query semantics are identical -Will assign different PIDs to capture query semantics -Need to normalize query to allow comparison -> query re-writing
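The decision rule on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `StoredQuery`, `mint_pid` and `assign_pid` are hypothetical names, the query hash is taken over the normalized query string, and comparing result hashes stands in for the slide's check that no touched element has changed. A real deployment would mint PIDs through a DOI or Handle service.

```python
import itertools
from dataclasses import dataclass


@dataclass
class StoredQuery:
    pid: str
    query_hash: str    # hash of the normalized query string
    result_hash: str   # hash of the result set when the PID was assigned


_counter = itertools.count(1)


def mint_pid() -> str:
    # Hypothetical stand-in for a call to a PID service.
    return f"pid:{next(_counter):04d}"


def assign_pid(store, query_hash, result_hash):
    """Reuse a PID only for an identical query with an unchanged result."""
    # Most recent entry for this query, if any.
    latest = next((e for e in reversed(store) if e.query_hash == query_hash), None)
    if latest is not None and latest.result_hash == result_hash:
        return latest.pid
    # Different query semantics, or same query over changed data: new PID.
    pid = mint_pid()
    store.append(StoredQuery(pid, query_hash, result_hash))
    return pid
```

Note that two different queries returning the same result set still receive different PIDs, matching the note on the slide.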
12
Query Re-Writing Query re-writing is needed to -Standardize/normalize the query to help identify semantically identical queries -Adapt to the versioning approach chosen (versioning in operational tables, separate history table, …) -Add a timestamp to any select statement in the query -Potentially re-write to identify the last change to the result set touched upon (i.e. select including elements marked deleted, check most recent timestamp, to determine correct PID assignment) -Apply a unique sort to the source data prior to the query to ensure a repeatable order
13
Query Re-Writing Normalization of query string -Upper / lower case spelling -Sorting of filtering criteria (order does not influence result semantics) -Compute hash-key over query string to identify whether identical query has been issued already -If identical query found, re-run and check for changes in result set based on time-stamps of data records added/deleted -If different, assign new PID, otherwise existing PID
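A minimal sketch of the normalization and hashing steps, assuming a deliberately simple query shape: a single SELECT whose WHERE clause is a flat conjunction joined by AND, with no literal containing " and " or " where ". The function names are illustrative; a production system would normalize on a parsed query rather than on the string.

```python
import hashlib
import re


def normalize_query(sql: str) -> str:
    """Lower-case keywords/identifiers, collapse whitespace, sort AND-predicates."""
    # Split out single-quoted literals so their case is preserved.
    parts = re.split(r"('(?:[^']|'')*')", sql.strip().rstrip(";"))
    text = "".join(
        p if p.startswith("'") else re.sub(r"\s+", " ", p.lower())
        for p in parts
    )
    head, sep, where = text.partition(" where ")
    if sep:
        # Order of filter criteria does not influence the result semantics.
        predicates = sorted(p.strip() for p in where.split(" and "))
        text = head + " where " + " and ".join(predicates)
    return text.strip()


def query_hash(sql: str) -> str:
    """Hash key over the normalized query string."""
    return hashlib.sha256(normalize_query(sql).encode("utf-8")).hexdigest()
```

The column list is intentionally left in its original order, so queries that select the same columns in a different order hash differently and count as non-identical.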
14
Query Re-Writing Unique sort of result list -Most databases are set-based -Most subsequent processing is sequence-based -Need to re-write query to apply unique sort on any table prior to applying any user-defined sort for repeatability Hashing of result set to verify identity of result -Compute over entire result set: comprehensive, potentially slow -Compute over column headers and row IDs: verifies correctness of attributes and data items selected, but does not safeguard against unmonitored changes to attribute values -Well-defined hash input data (data migrations)
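The two hashing variants can be contrasted with a short sketch. The serialization used here (values converted with `str`, fields and rows joined by control characters) is an assumption chosen for brevity; the slide's point about well-defined hash input means a real system would have to fix an encoding that survives data migrations.

```python
import hashlib


def hash_result_set(columns, rows, id_column):
    """Hash the entire result set after a unique sort on the record ID."""
    idx = columns.index(id_column)
    ordered = sorted(rows, key=lambda r: r[idx])   # repeatable order
    h = hashlib.sha256()
    h.update("\x1f".join(columns).encode("utf-8"))
    for row in ordered:
        h.update(b"\x1e")
        h.update("\x1f".join(str(v) for v in row).encode("utf-8"))
    return h.hexdigest()


def hash_headers_and_ids(columns, rows, id_column):
    """Cheaper variant: hash only the column headers and the row IDs."""
    idx = columns.index(id_column)
    ids = sorted(r[idx] for r in rows)
    payload = "\x1f".join(columns) + "\x1e" + "\x1f".join(str(i) for i in ids)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The full hash changes when any attribute value changes, whereas the header-and-ID hash stays the same as long as the same records and columns are selected.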
15
Timestamping Which timestamp to assign to new query? -Timestamp of query processing -Timestamp of last change to DB (global) -Timestamp of last change to result set touched upon by query (including deletes): most complex approach in terms of query re-writing (required to select with deletes, extract latest TS, then filter), but closest to the traditional concept of “version”
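The third option can be illustrated on an in-memory list of record versions. This is a sketch with assumed names (`Version`, `last_change`) and integer logical timestamps; in a database the same steps would be expressed through query re-writing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    record_id: int
    value: float
    inserted_at: int                  # logical timestamp of insertion
    deleted_at: Optional[int] = None  # set when the version was marked deleted


def last_change(versions, predicate, query_time):
    """Latest insert/delete timestamp among versions the query touches."""
    stamps = []
    for v in versions:
        # Select including versions marked deleted, but nothing from the future.
        if not predicate(v) or v.inserted_at > query_time:
            continue
        stamps.append(v.inserted_at)
        if v.deleted_at is not None and v.deleted_at <= query_time:
            stamps.append(v.deleted_at)
    return max(stamps, default=None)
```

Two executions of the same query that fall between changes to the touched records yield the same timestamp, which is what makes this option behave like a version number.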
16
Making Dynamic Data Citeable Building blocks of supporting dynamic data citation: -Uniquely identifiable data records (for unique sort) -Versioned data, marking changes as insertion/deletion -Time stamps of data insertion / deletions -“Query language” for constructing subsets Add modules: -Persistent query store: queries, timestamp, hash, metadata including creator of subset -Query rewriting module -PID assignment to queries -Landing page design, citation text Stable across data source migrations (e.g. diff. DBMS), scalable, machine-actionable
17
Data Citation – Deployment Researcher uses workbench to identify subset of data Upon executing selection (“download”) user gets -Data (package, access API, …) -PID (e.g. DOI) (Query is time-stamped and stored) -Hash value computed over the data for local storage -Recommended citation text (e.g. BibTeX) PID resolves to landing page -Provides detailed metadata, link to parent data set, subset, … -Option to retrieve original data OR current version OR changes Upon activating PID associated with a data citation -Query is re-executed against time-stamped and versioned DB -Results as above are returned This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers!
18
Outline What are the challenges in citing dynamic data? -Data Citation: the status quo and requirements How can we enable precise citation of dynamic data? -Making Dynamic Data Citeable Will the concept work? -A look at some pilots and reference implementations Does this work for all data? -Next steps, open issues, and the RDA Working Group
19
Reports from pilots -SQL Data: LNEC, MSD Reference implementations -CSV: MSD presentation -CLARIN presentation -XML data -Results from the VAMDC workshop -Results from the NERC workshop Report from Pilots
20
Dynamic Data Citation for SQL Data LNEC, MSD Reference Implementation Dynamic Data Citation - Pilots
21
SQL Prototype Implementation LNEC Laboratory of Civil Engineering, Portugal Monitoring dams and bridges 31 manual sensor instruments 25 automatic sensor instruments Web portal -Select sensor data -Define timespans Report generation -Analysis processes produce LaTeX, which produces the PDF report Florian Fuchs [CC-BY-3.0 (http://creativecommons.org/licenses/by/3.0)], via Wikimedia Commons
22
SQL Prototype Implementation Million Song Dataset http://labrosa.ee.columbia.edu/millionsong/ Largest benchmark collection in Music Retrieval Original set provided by Echonest No audio, only set of features Harvested, additional features and metadata extracted and offered by several groups e.g. http://www.ifs.tuwien.ac.at/mir/msd/download.html Dynamics because of metadata errors, extraction errors Research groups select subsets by genre, audio length, audio quality, …
23
SQL Time-Stamping and Versioning Integrated -Extend original tables by temporal metadata -Expand primary key by record-version column Hybrid -Utilize history table for deleted record versions with metadata -Original table reflects latest version only Separated -Utilizes full history table -Also inserts reflected in history table Solution to be adopted depends on trade-off -Storage demand -Query complexity -Software adaptation
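The integrated variant can be tried out with SQLite from the Python standard library. This is a sketch, not the LNEC or MSD implementation: the table `readings`, its columns and the integer logical timestamps are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        sensor_id  INTEGER NOT NULL,
        version    INTEGER NOT NULL,   -- record-version column, part of the key
        value      REAL    NOT NULL,
        valid_from INTEGER NOT NULL,   -- logical timestamp of insertion
        valid_to   INTEGER,            -- NULL while this version is current
        PRIMARY KEY (sensor_id, version)
    )
""")


def close_current(sensor_id, ts):
    # Mark the current version as superseded/deleted instead of removing it.
    conn.execute(
        "UPDATE readings SET valid_to = ? WHERE sensor_id = ? AND valid_to IS NULL",
        (ts, sensor_id),
    )


def put(sensor_id, value, ts):
    """Insert a record, or supersede its current version with a new one."""
    close_current(sensor_id, ts)
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM readings WHERE sensor_id = ?",
        (sensor_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?, NULL)",
        (sensor_id, latest + 1, value, ts),
    )


def delete(sensor_id, ts):
    close_current(sensor_id, ts)


def as_of(ts):
    """Re-execute the selection against the state of the table at time ts."""
    return conn.execute(
        "SELECT sensor_id, value FROM readings "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY sensor_id",
        (ts, ts),
    ).fetchall()


put(1, 10.0, 100)
put(2, 20.0, 100)
put(1, 11.0, 200)   # correction to sensor 1
delete(2, 300)      # sensor 2 withdrawn
```

Every statement in a cited query needs the same `valid_from`/`valid_to` filter, which is the query-complexity cost named in the trade-off; the hybrid and separated variants move old versions into a history table instead.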
24
SQL: Storing Queries Add query store containing -PID of the query -Original query -Re-written query + query string hash -Timestamp (as used in re-written query) -Hash-key of query result -Metadata useful for citation / landing page (creator, institution, rights, …) -PID of parent dataset (or using fragment identifiers for query)
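One possible shape for such a query store, again sketched with SQLite. The table and column names are assumptions derived from the list above, not the schema of the reference implementation, and timestamps are assumed to be ISO 8601 strings so that they sort lexicographically.

```python
import sqlite3

COLUMNS = (
    "pid", "original_query", "rewritten_query", "query_hash",
    "executed_at", "result_hash", "creator", "institution",
    "rights", "parent_pid",
)

store = sqlite3.connect(":memory:")
store.execute("""
    CREATE TABLE query_store (
        pid             TEXT PRIMARY KEY,
        original_query  TEXT NOT NULL,
        rewritten_query TEXT NOT NULL,
        query_hash      TEXT NOT NULL,   -- hash of the normalized query string
        executed_at     TEXT NOT NULL,   -- timestamp used in the re-written query
        result_hash     TEXT NOT NULL,   -- hash key of the query result
        creator         TEXT,
        institution     TEXT,
        rights          TEXT,
        parent_pid      TEXT             -- PID of the parent dataset
    )
""")
store.execute("CREATE INDEX idx_query_hash ON query_store (query_hash)")


def save(entry):
    """Store one cited query; optional metadata fields default to NULL."""
    row = {**dict.fromkeys(COLUMNS), **entry}
    store.execute(
        f"INSERT INTO query_store ({', '.join(COLUMNS)}) "
        f"VALUES ({', '.join(':' + c for c in COLUMNS)})",
        row,
    )


def lookup(query_hash):
    """Most recent PID and result hash recorded for an identical query."""
    return store.execute(
        "SELECT pid, result_hash FROM query_store "
        "WHERE query_hash = ? ORDER BY executed_at DESC LIMIT 1",
        (query_hash,),
    ).fetchone()
```

`lookup` returns `None` when the query has not been issued before, in which case a new PID is needed.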
26
SQL Query Re-Writing Normalizing queries to detect identical queries -WHERE clause sorted -Calculate query string hash -Identify semantically identical queries -Non-identical queries: columns in different order
27
SQL Query Re-Writing Adapt query to history table
28
Dynamic Data Citation - Pilots Dynamic Data Citation for CSV Data Stefan Pröll Secure Business Austria sproell@sba-research.org
29
Dynamic Data Citation for CSV Data Goals: -Ensure cite-ability of CSV data -Enable subset citation -Support particularly small and large volume data -Support dynamically changing data Why CSV data? -Well understood and widely used -Simple and flexible -Most frequently requested during initial RDA meetings
30
CSV: Basic Steps Upload interface -Upload CSV files Migrate CSV file into RDBMS -Generate table structure -Add metadata columns for versioning -Add indices Dynamic data -Update existing records -Append new data Access interface -Track subset creation -Store queries
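The migration step can be sketched with the standard `csv` and `sqlite3` modules. This is an illustration only: `load_csv`, the `row_id`, `inserted_at` and `deleted_at` columns, and the integer timestamp are assumptions, and the header names are trusted as given (they are quoted but not otherwise sanitized).

```python
import csv
import io
import sqlite3


def load_csv(conn, table, text, ts):
    """Migrate CSV text into a table with versioning metadata columns."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Generate the table structure from the header row.
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(
        f'CREATE TABLE "{table}" ('
        f"row_id INTEGER PRIMARY KEY, {cols}, "
        f"inserted_at INTEGER NOT NULL, deleted_at INTEGER)"
    )
    # Index the metadata columns used by time-stamped queries.
    conn.execute(f'CREATE INDEX "{table}_ts" ON "{table}" (inserted_at, deleted_at)')
    names = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" ({names}, inserted_at) VALUES ({marks}, ?)',
        ([*row, ts] for row in reader),
    )
    return header
```

The generated `row_id` gives every record the unique identifier needed for the unique sort, and later updates would set `deleted_at` on the old row and append a new one rather than overwrite values.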
31
CSV Data Prototype
34
Dynamic Data Citation for XML Data Stefan Pröll, Secure Business Austria sproell@sba-research.org Dynamic Data Citation - Pilots
35
Dynamic Data Citation for XML Data Goals: -Cite arbitrary subsets of XML data: subsets, nodes, attributes -Enable dynamic data -Utilize available query languages Why XML data? -Used in many different settings -Complex structures possible -Schema available
36
XML Data Approaches Apply data citation framework -Add metadata for versioning -Mark insert, update and delete operations -No actual deletes Approaches -Copy branches upon updates and deletes: simple approach, but uses storage space -Introduce parent/child relationships: resolution more complex
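The copy-on-update approach can be sketched with `xml.etree.ElementTree`. The attribute names `v_inserted` and `v_deleted` and the integer timestamps are assumptions for illustration; the BaseX pilot works through a rewritten XQuery layer rather than Python.

```python
import copy
import xml.etree.ElementTree as ET


def insert_child(parent, tag, text, ts):
    child = ET.SubElement(parent, tag, {"v_inserted": str(ts)})
    child.text = text
    return child


def mark_deleted(node, ts):
    node.set("v_deleted", str(ts))   # no actual delete


def update_text(parent, node, text, ts):
    """Copy on update: keep the old node marked deleted, append a new one."""
    mark_deleted(node, ts)
    return insert_child(parent, node.tag, text, ts)


def snapshot(root, ts):
    """Copy of the tree as it existed at time ts, without versioning attributes."""
    view = copy.deepcopy(root)

    def prune(elem):
        for child in list(elem):
            inserted = int(child.get("v_inserted", 0))
            deleted = child.get("v_deleted")
            if inserted > ts or (deleted is not None and int(deleted) <= ts):
                elem.remove(child)
            else:
                prune(child)
                child.attrib.pop("v_inserted", None)
                child.attrib.pop("v_deleted", None)

    prune(view)
    return view
```

In this simplified form an updated node is appended at the end of its parent, so sibling order is not preserved across updates; handling that is part of what makes the parent/child variant more complex to resolve.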
37
Dynamic Data Citation for XML Data XML Database: BaseX -Lightweight system -Client/Server architecture, ACID safe transactions -XPath/XQuery 3.0 Processor -Scalable -E.g. used in UK Data Service Use Case -Textual transcripts in XML format -Unique identification of (sub)sections
38
Dynamic Data Citation for XML Data Adapt XQuery parser, rewrite operations for -Inserting -Deleting -Replacing -Renaming Reuse query parser for alternative implementations -E.g. eXistDB [Figure labels: XQuery, XQuery Parser]
39
Support for Dynamic Data Citation in CLARIN Dieter van Uytvanck Max Planck Institute dieter.vanuytvanck@mpi.nl Dynamic Data Citation - Pilots
40
Field Linguistics: language archive: transcriptions Type: well-described XML files (stand-off annotations to stable video/audio files) Size: small (few 100 KBs) Dynamics: rather tens than hundreds of versions Citation practice: rather fragments than the whole file (illustration, examples, counter-examples) Use case: field linguistics
41
https://corpus1.mpi.nl/ds/imdi_browser/versioninginfo.jsp?nodeid=MPI2002357%23 The only timestamps displayed are those of the moment a new version is created (hover over the date), but this is a limitation of the displaying application Permission system: by default older versions are accessible only to the resource owner Use case: field linguistics
42
MD5 checksum and timestamp in handle records: -http://hdl.handle.net/1839/00-0000-0000-001E-8DA4-1?noredirect -http://hdl.handle.net/1839/00-0000-0000-001E-8DB5-6?noredirect Use case: field linguistics – handle record
43
Beta version of the Virtual Collection Registry available: http://clarin.eu/vcr Official release planned for early October Free software, GPL v3 Comes with -Federated Identity -Persistent Identifiers -Metadata harvesting Makes it easy to publish links to versioned datasets Virtual Collections - Implementation
45
Dynamic Data Citation - Pilots Results from VAMDC Workshop Carlo Maria Zwölf Virtual Atomic and Molecular Data Centre carlo-maria.zwolf@obspm.fr
46
VAMDC Virtual Atomic and Molecular Data Centre Worldwide e-infrastructure federating 41 heterogeneous and interoperable Atomic and Molecular databases Nodes decide independently about growth rate, ingest system, corrections to apply to already stored data Data nodes may use different technologies for storing data (SQL, NoSQL, ASCII files) All implement VAMDC access/query protocols Return results in standardized XML format (XSAMS) Access directly node-by-node or via VAMDC portal, which relays the user request to each node
47
VAMDC Issues identified -Each data node could modify/delete/add data without tracing -No support for reproducibility of past data extraction Proposed Data Citation WG Solution: Considering the distributed architecture of the federated VAMDC infrastructure, it seemed very complex to apply the “Query Store” (QS) strategy -Would we need a QS on each node? -Would we need an additional QS on the central portal? -Since the portal acts as a relay between the user and the existing nodes, how can we coordinate the generation of PIDs for queries in this distributed context?
48
VAMDC Changes adopted following the workshop -An existing dataset will never be deleted or modified -If a correction and/or addition is needed on an existing data node, it will be associated with the creation of a new dataset -Automatically maintain the genealogy for the families of datasets -Users and data providers will be able to know the creation date of a dataset, its ancestor and its descendants -The XSAMS format (the VAMDC standard for formatting the results) will be modified to natively include references to the datasets used for its composition
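The rules adopted above (datasets never change, corrections create new datasets, genealogy is kept) can be sketched as follows. `Dataset` and `DataNode` are hypothetical names and the in-memory dictionary stands in for whatever storage a VAMDC node actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)            # frozen: an existing dataset is never modified
class Dataset:
    dataset_id: str
    created: str                   # creation date
    records: tuple
    ancestor: Optional[str] = None


class DataNode:
    def __init__(self):
        self._datasets = {}

    def publish(self, dataset_id, created, records, ancestor=None):
        """A correction or addition is published as a new dataset."""
        if dataset_id in self._datasets:
            raise ValueError(f"{dataset_id} already exists and cannot be replaced")
        if ancestor is not None and ancestor not in self._datasets:
            raise KeyError(f"unknown ancestor {ancestor}")
        ds = Dataset(dataset_id, created, tuple(records), ancestor)
        self._datasets[dataset_id] = ds
        return ds

    def ancestors(self, dataset_id):
        """Chain of ancestors, nearest first."""
        chain = []
        current = self._datasets[dataset_id].ancestor
        while current is not None:
            chain.append(current)
            current = self._datasets[current].ancestor
        return chain

    def descendants(self, dataset_id):
        """All datasets derived, directly or indirectly, from this one."""
        found, frontier = [], [dataset_id]
        while frontier:
            parent = frontier.pop()
            children = sorted(
                d.dataset_id for d in self._datasets.values() if d.ancestor == parent
            )
            found.extend(children)
            frontier.extend(children)
        return found
```

Because each dataset identifier then refers to fixed content, a result document can cite the identifiers of the datasets it was composed from without a query store on every node.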
49
Results from NERC Workshop July 1-2 2014, London John Watkins Centre for Ecology and Hydrology jww@ceh.ac.uk Dynamic Data Citation - Pilots
50
Following RDA Plenary 3 in Dublin – Plans made for a workshop looking at specific use cases of data citation WG members were invited to British Library in London to contribute use cases This workshop was arranged and mainly attended by UK Natural Environment Research Council data centres and hosted by British Library Data Citation WG – July 2014 London workshop report
51
Data Citation WG – July 2014 London Workshop report To present RDA WG conceptual model addressing citation of dynamic data to a group of data curation practitioners To assess goodness of fit of the model for the requirements of users, curators, publishers, authors To extend and/or improve the model to meet the widest range of data users To plan test implementations of the citation model with various dynamic data curated by the group Aims of workshop
52
We had a facilitator! Detailed plan for the 2 days We made sure everybody got out of their seats and contributed ideas We captured all the work done and tried to present a summary report Looked from 4 perspectives Data user Data depositor Data Centre / repository Data publishers / journals Looked from 4 use cases Butterfly monitoring Ocean buoy network Sociological Archive National hydrological archive Workshop facilitation: What we did: Data Citation WG – July 2014 London Workshop report
53
Data centres liked PIDs but didn’t want thousands of them Publishers didn’t want very fine grained PIDs into data sets Originators wanted recognition for data sets produced We voted on what we thought were the most important aspects to think of improvements Worked on positives and negatives of the model: Data Citation WG – July 2014 London Workshop report
54
We looked at different issues in the model especially subsets and versioning in file repositories UKDA gave a clear demonstration of how to cite parts of documents We found need for broad PIDs for collections and finer PIDs for versions and additions We looked at improvements to the working of the current use cases Ideas of general improvements and practical steps Data Citation WG – July 2014 London Workshop report
55
Stefan has generated a paper for IEEE detailing the approach and use cases The ARGO buoy network has a draft proposal for how to implement data citation of the SeaDataNet Other UK NERC data centres are continuing with their data citation developments The ESIP (Federation of Earth Science Information Partners) will hold a case study workshop in Jan 2015 (Washington DC) A working group may be organized in China next year by Inst. of Scientific and Technical Information of China (ISTIC) Outcome of workshop Data Citation WG – July 2014 London Workshop report
56
Outline What are the challenges in citing dynamic data? -Data Citation: the status quo and requirements How can we enable precise citation of dynamic data? -Making Dynamic Data Citeable Will the concept work? -A look at some pilots and reference implementations Does this work for all data? -Next steps, open issues, and the RDA Working Group
57
Data Citation: Next steps Solution devised for SQL -> expand to other data types -SQL: LNEC, MSD -Pilot for CSV: MSD -Analyze how to make XML and RDF time-stamped, versioned -LOD, noSQL, … Verify pilots conceptually -Does it work? -Impact on data center (size, operations, APIs, …) specifically: how to realize versioning -How to integrate in workbenches? Implement several pilots and verify Test stability under migrations of data management systems
58
Join RDA and Working Group If you are interested in joining the discussion, wish to establish a data citation solution, … Register for the RDA WG on Data Citation: -Website: https://rd-alliance.org/working-groups/data-citation-wg.html -Mailinglist: https://rd-alliance.org/node/141/archive-post-mailinglist -Web Conferences: https://rd-alliance.org/webconference-data-citation-wg.html -List of pilots: https://rd-alliance.org/groups/data-citation-wg/wiki/collaboration-environments.html
59
Thank you for your attention.