Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation
Andreas Rauber, Vienna University of Technology, Favoritenstr. 9-11/188, 1040 Vienna, Austria
Motivation
Research data is fundamental for science: results are based on data; data serves as input for workflows and experiments; data is the source for graphs and visualisations in publications.
Data is needed for reproducibility: repeat experiments, verify / compare results. We need to provide the specific data set used.
Service for data repositories: put the data in a data repository, assign a PID (DOI, ARK, URI, …), make it accessible. Done!?
Source: open.wien.gv.at
Identification of Dynamic Data
Usually, datasets have to be static: a fixed set of data, no changes, no corrections of errors, no new data being added.
But (research) data is dynamic: new data is added, errors are corrected, data quality is enhanced, … Changes are sometimes highly dynamic, at irregular intervals.
Current approaches: identifying the entire data stream, without any versioning; using an "accessed at" date; "artificial" versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!).
We would like to identify precisely the data as it existed at a specific point in time.
Granularity of Subsets
What about the granularity of the data to be identified? There are enormous amounts of CSV data, and researchers use specific subsets of the data; we need to identify precisely the subset used.
Current approaches: storing a copy of the subset as used in the study (scalability); citing the entire dataset and providing a textual description of the subset (imprecise, ambiguous); storing the list of record identifiers in the subset (scalability, and not suitable for arbitrary subsets, e.g. when not the entire record is selected).
We would like to be able to identify precisely the subset of (dynamic) data used in a process.
What we do NOT want… Common approaches to data management (from PhD Comics: "A Story Told in File Names")
RDA WG on Data Citation (Research Data Alliance)
WG on Data Citation: Making Dynamic Data Citeable, March 2014 – September 2015. Concentrating on the problems of large, dynamic (changing) datasets. The final version was presented in September 2015 at P7 in Paris, France, and endorsed in September 2016 at P8 in Denver, CO.
RDA WGDC - Solution
We have: data and some means of access (a "query").
Make the data time-stamped and versioned, and prepare some way of storing the queries.
Data citation: store the query with a timestamp; assign the PID to the time-stamped query (which, dynamically, leads to the data).
Access: re-execute the query on the versioned data according to its timestamp.
Dynamic data citation: dynamic data and dynamic citation of data.
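The core of this idea can be sketched in a few lines of SQL. This is a minimal illustration, assuming a hypothetical query_store table and a versioned source table with valid_from/valid_to columns (all names are illustrative, not prescribed by the WG):

```sql
-- Hypothetical query store: one row per citable query.
CREATE TABLE query_store (
    pid            text PRIMARY KEY,     -- persistent identifier assigned to the query
    original_query text NOT NULL,        -- query exactly as issued by the researcher
    query_checksum text NOT NULL,        -- checksum of the normalized query (R4)
    result_hash    text NOT NULL,        -- checksum of the result set (R6)
    query_ts       timestamptz NOT NULL  -- timestamp the citation refers to (R7)
);

-- Citing: store the time-stamped query under a new PID.
INSERT INTO query_store VALUES
    ('10.1234/q-0001',
     'SELECT * FROM observations WHERE station = ''A''',
     'md5:…', 'md5:…', now());

-- Resolving: re-execute against the versioned data, restricted to the
-- record versions that were valid at the stored timestamp.
SELECT o.*
FROM   observations_versioned o,
       query_store q
WHERE  q.pid = '10.1234/q-0001'
  AND  o.station = 'A'
  AND  o.valid_from <= q.query_ts
  AND  (o.valid_to IS NULL OR o.valid_to > q.query_ts);
```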
Data Citation – Deployment
Researcher uses a workbench to identify the subset of data. Upon executing the selection ("download"), the user gets: the data (package, access API, …); a PID (e.g. DOI), while the query is time-stamped and stored; a hash value computed over the data for local storage; and a recommended citation text (e.g. BibTeX).
The PID resolves to a landing page that provides detailed metadata, a link to the parent data set, the subset, …, and the option to retrieve the original data OR the current version OR the changes.
Upon activating a PID associated with a data citation, the query is re-executed against the time-stamped and versioned DB, the results as above are returned, and the query store aggregates data usage.
Note: the query string provides excellent provenance information on the data set! This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers or a DB dump: we can identify which parts of the data are used and, if the data changes, which queries (studies) are affected.
Data Citation – Output
14 recommendations grouped into 4 phases: preparing data and the query store; persistently identifying specific data sets; resolving PIDs; upon modifications to the data infrastructure.
Available as a 2-page flyer; more detailed report: IEEE TCDL.
Data Citation – Recommendations
A) Preparing Data & Query Store: R1 – Data Versioning; R2 – Timestamping; R3 – Query Store
B) When Data Should Be Persisted: R4 – Query Uniqueness; R5 – Stable Sorting; R6 – Result Set Verification; R7 – Query Timestamping; R8 – Query PID; R9 – Store Query; R10 – Citation Text
C) When Resolving a PID: R11 – Landing Page; R12 – Machine Actionability
D) Upon Modifications to the Data Infrastructure: R13 – Technology Migration; R14 – Migration Verification
Data Citation – Recommendations
A) Preparing the Data and the Query Store
R1 – Data Versioning: Apply versioning to ensure that earlier states of the data can be retrieved
R2 – Timestamping: Ensure that operations on data are timestamped, i.e. any additions or deletions are marked with a timestamp
R3 – Query Store: Provide means to store the queries and the metadata needed to re-execute them in the future
Note: R1 & R2 are already pretty much standard in many (RDBMS-based) research databases; there are different ways to implement them, and it is a bit more challenging for some data types (XML, LOD, …).
Note: R3 – the query store is usually pretty small, even for extremely high query volumes.
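One common way to satisfy R1/R2 in an RDBMS is interval timestamping; a sketch, under the same illustrative names as above:

```sql
-- Each record version carries the interval during which it was valid
-- (R1 data versioning + R2 timestamping).
CREATE TABLE observations_versioned (
    record_id  bigint      NOT NULL,  -- logical record identity
    station    text        NOT NULL,
    value      numeric,
    valid_from timestamptz NOT NULL DEFAULT now(),
    valid_to   timestamptz            -- NULL = current version
);

-- An update never overwrites: close the old version, insert the new one.
UPDATE observations_versioned
SET    valid_to = now()
WHERE  record_id = 42 AND valid_to IS NULL;

INSERT INTO observations_versioned (record_id, station, value)
VALUES (42, 'A', 17.3);

-- A delete just closes the current version, preserving history.
UPDATE observations_versioned
SET    valid_to = now()
WHERE  record_id = 43 AND valid_to IS NULL;
```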
Data Citation – Recommendations
B) Persistently Identify Specific Data Sets (1/2)
When a data set should be persisted:
R4 – Query Uniqueness: Re-write the query to a normalized form so that identical queries can be detected. Compute a checksum of the normalized query to efficiently detect identical queries
R5 – Stable Sorting: Ensure an unambiguous sorting of the records in the data set
R6 – Result Set Verification: Compute fixity information (a checksum) of the query result set to enable verification of the correctness of a result upon re-execution
R7 – Query Timestamping: Assign a timestamp to the query based on the last update to the entire database (or the last update to the selection of data affected by the query, or the query execution time). This allows retrieving the data as it existed at query time
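R4–R6 can be illustrated directly in SQL. The whitespace/case folding below is a crude stand-in for real parser-based query normalization:

```sql
-- R4: checksum over a naively normalized query string.
SELECT md5(regexp_replace(lower(trim(
           'SELECT  * FROM observations WHERE station = ''A'''
       )), '\s+', ' ', 'g')) AS query_checksum;

-- R5 + R6: checksum over the result set, made deterministic by an
-- unambiguous ORDER BY on a unique key inside the aggregate.
SELECT md5(string_agg(t::text, ';' ORDER BY t.record_id)) AS result_hash
FROM (
    SELECT record_id, station, value
    FROM   observations_versioned
    WHERE  station = 'A' AND valid_to IS NULL
) AS t;
```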
Data Citation – Recommendations
B) Persistently Identify Specific Data Sets (2/2)
When a data set should be persisted:
R8 – Query PID: Assign a new PID to the query if either the query is new or the result set returned by an earlier identical query has changed due to changes in the data. Otherwise, return the existing PID
R9 – Store Query: Store the query and its metadata (e.g. PID, original and normalized query, query and result set checksums, timestamp, superset PID, data set description, …) in the query store
R10 – Citation Text: Provide a citation text including the PID, in the format prevalent in the designated community, to lower the barrier for citing data
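A sketch of the R8 decision logic as a PL/pgSQL function; mint_new_pid() is a hypothetical helper that would call out to a DOI/handle service:

```sql
CREATE FUNCTION cite_query(p_original_query text,
                           p_query_checksum text,
                           p_result_hash    text)
RETURNS text LANGUAGE plpgsql AS $$
DECLARE
    v_pid text;
BEGIN
    -- Identical query over identical data: reuse the existing PID.
    SELECT pid INTO v_pid
    FROM   query_store
    WHERE  query_checksum = p_query_checksum
      AND  result_hash    = p_result_hash;
    IF FOUND THEN
        RETURN v_pid;
    END IF;

    -- New query, or the data behind an old query has changed:
    -- assign a new PID (R8) and store the metadata (R9).
    v_pid := mint_new_pid();  -- hypothetical PID-minting helper
    INSERT INTO query_store (pid, original_query, query_checksum, result_hash, query_ts)
    VALUES (v_pid, p_original_query, p_query_checksum, p_result_hash, now());
    RETURN v_pid;
END;
$$;
```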
Data Citation – Recommendations
C) Resolving PIDs and Retrieving Data
R11 – Landing Page: Make the PIDs resolve to a human-readable landing page that provides the data (via query re-execution) and metadata, including a link to the superset (the PID of the data source) and a citation text snippet
R12 – Machine Actionability: Provide an API / machine-actionable landing page to access metadata and data via query re-execution
Data Citation – Recommendations
D) Upon Modifications to the Data Infrastructure
R13 – Technology Migration: When data is migrated to a new representation (e.g. a new database system, a new schema or a completely different technology), also migrate the queries and associated checksums
R14 – Migration Verification: Verify successful data and query migration, ensuring that queries can be re-executed correctly
Pilots / Adopters
Several adopters: different types of data, different settings, …
CSV & SQL reference implementation (SBA/TUW)
Pilots:
Biomedical Big Data Sharing, Electronic Health Records (Center for Biomedical Informatics, Washington University in St. Louis)
Marine Research Data: Biological & Chemical Oceanography Data Management Office (BCO-DMO)
Vermont Monitoring Cooperative: Forest Ecosystem Monitoring
Argo Buoy Network, British Oceanographic Data Centre (BODC)
Virtual Atomic and Molecular Data Centre (VAMDC)
UK Riverflow Archive, Centre for Ecology and Hydrology
WG Data Citation Pilot: WUSTL
Cynthia Hudson Vitale, Leslie McIntosh, Snehil Gupta, Washington University in St. Louis
Biomedical Adoption Project Goals
Specifically, through this project, we set out to: implement the RDA Data Citation WG recommendations in the local Washington University i2b2 instance; engage other i2b2 community adoptees in this conversation on reproducibility; contribute source code back to the i2b2 community.
Repository | Slides | Bibliography
RDA-MacArthur Grant Focus
[Diagram: the grant focuses on the data flow between the Data Broker and Researchers]
R1 and R2 Implementation
[Diagram: live table RDC.table and history table RDC.hist_table (stores the history of data changes), each with columns c1, c2, c3 and sys_period, connected by triggers of the PostgreSQL extension "temporal_tables"]
In our proof of concept (RDC), we used the temporal_tables extension for PostgreSQL for timestamping and versioning. Each live data table that needs logging was given an additional column named "sys_period", which holds a timestamp-range value identifying the start timestamp, end timestamp, and boundary condition of when the record is or was valid. The name "sys_period" was chosen to be consistent with the temporal_tables documentation. A second copy of each live data table was created to hold historical data. These history tables have the same structure as the live data tables, but without the primary keys, and are used for logging each data change. For consistency, all history tables were given the same name as the table being logged, prefixed with "hist_". The trigger provided by the temporal_tables extension was added to each live data table, to be called before insert, update, or delete transactions. The trigger inserts a copy of the existing row from the live data table into the history table, with the timestamp range modified to show the current date as the upper bound of the valid range.
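A sketch of that setup, following the temporal_tables documentation (table names are illustrative; the pilot's tables live in an RDC schema and use the hist_ prefix described above):

```sql
CREATE EXTENSION IF NOT EXISTS temporal_tables;

-- Live table with the validity-range column described above.
CREATE TABLE observation (
    obs_id     bigint PRIMARY KEY,
    c1         text, c2 text, c3 text,
    sys_period tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, NULL)
);

-- History table: same structure, no primary key, logs superseded rows.
CREATE TABLE hist_observation (LIKE observation);

-- Before each insert/update/delete, the extension's trigger copies the
-- old row into the history table with its sys_period closed off.
CREATE TRIGGER versioning_trigger
BEFORE INSERT OR UPDATE OR DELETE ON observation
FOR EACH ROW EXECUTE PROCEDURE versioning('sys_period', 'hist_observation', true);
```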
Return on Investment (ROI) - Estimated
20 hours to complete 1 study; $150/hr (unsubsidized); $3,000 per study; 115 research studies per year.
There are different ways to calculate the financial ROI. One is to determine the hours of time we potentially save (or do not waste) trying to replicate a study. Given our calculations, the investment pays for itself after about 14 replication studies (14 × $3,000 = $42,000 in avoided replication effort).
Adoption of Data Citation Outcomes by BCO-DMO Cynthia Chandler, Adam Shepherd
A story of success enabled by RDA
An existing repository (bco-dmo.org), curating marine research data since 2006. Faced with new challenges, but no new funding, e.g. data publication practices to support citation. Used the outcomes of the RDA Data Citation Working Group to improve data publication and citation services.
Adoption of Data Citation Outputs
Evaluation: evaluate recommendations (done December 2015); try implementation in the existing BCO-DMO architecture (work began 4 April 2016).
Trial: R1–R11 fit well with the current BCO-DMO architecture; R12 is doable and will be tested as part of DataONE node membership; R13–R14 are consistent with the Linked Data approach to data publication and sharing.
Timeline: redesign/prototype completed by 1 June 2016; new citation recommendation by 1 Sep 2016; report out at RDA P8 (Denver, CO) September 2016; final report by 1 December 2016.
Vermont Monitoring Cooperative James Duncan, Jennifer Pontius VMC
Ecosystem Monitoring Collaborator Network Data Archive, Access and Integration
User Workflow to Date
Modify a dataset: changes are tracked; the original data table is unchanged.
Commit to a version, assign a name: computes a result hash (table pkid, column names, first column of data) and a query hash; updates the data table to the new state; formalizes the version.
Restore a previous version: creates a new version table from the current data table state and walks it back using stored SQL; garbage-collected after a period of time.
UK Riverflow Archive Matthew Fry mfry@ceh.ac.uk
UK National River Flow Archive
Oracle relational database with time series and metadata tables: ~20M daily flow records, plus monthly / daily catchment rainfall series; metadata (station history, owners, catchment soils / geology, etc.); total size ~5GB.
Time series tables are automatically audited, but reconstruction is complex. Users generally download simple files, but a public API is in development and the R-NRFA package is out there. Fortunately, all access is via a single codeset.
Versioning / citation solution
Automated archiving of the entire database: version-controlled scripts defining tables and creating / populating archived tables (largely complete). Fits in with the data workflow (public / dev versions); this only works because we have irregular / occasional updates.
Simplification of the data model (complete).
API development (being undertaken independently of the dynamic citation requirements): allows subsetting of the dataset in a number of ways; initially we simply need to implement versioning (started) to ensure it will cope with changes to data structures.
Fit to the dynamic data citation recommendations? Largely; we still need a mechanism for users to request / create a citable version of a query. Resource required: an estimated ~2 person-months.
Reference Implementation for CSV Data (and SQL) Stefan Pröll, SBA
CSV/SQL Reference Implementation 1
Reference implementation available on GitHub.
Upload interface: upload CSV files; migrate each CSV file into an RDBMS; generate the table structure and identify the primary key; add metadata columns for versioning (transparent); add indices (transparent).
Dynamic data: upload a new version of the file; versioned updates / deletes of existing records in the RDBMS.
Access interface: track subset creation; store queries; PID + landing page.
CSV/Git Reference Implementation 2
Based on Git only (no SQL database). Upload CSV files to a Git repository; SQL-style queries operate on the CSV file via an API; data versioning with Git; the scripts are stored and versioned as well; makes subset creation reproducible.
CSV/Git Reference Implementation 2
Stefan Pröll, Christoph Meixner, Andreas Rauber. Precise Data Identification Services for Long Tail Research Data. Proceedings of the Intl. Conference on Preservation of Digital Objects (iPRES 2016), October 2016, Bern, Switzerland.
Source at GitHub: RDA-WGDC-CSV-Data-Citation-Prototype. Videos: login, upload, subset, resolver, update.
Benefits
Retrieval of the precise subset with low storage overhead: the subset as cited, or as it is now (including e.g. corrections).
The query provides provenance information; the query store supports analysis of data usage; checksums support verification.
The same principles are applicable across all settings: small and large data; static and dynamic data; different data representations (RDBMS, CSV, XML, LOD, …). They would also work for more sophisticated/general transformations on the data beyond select/project.
Interested in implementing?
If you have a dataset / are operating a data center that holds dynamic data and / or allows researchers to select subsets of this data for studies, and you would like to support precise identification / citability: let us know! Join the RDA WGDC / IG. We support adoption pilots, cooperate on implementing and deploying the recommendations, and collect feedback and identify potential issues.
Thank you!
The following set of slides was presented by the various adoption pilots at the RDA Plenary in Denver, CO, during the RDA WGDC break-out session, and is also available in the WGDC repository.
Reference Implementation for CSV Data (and SQL) Stefan Pröll, SBA
Large Scale Research Settings
RDA recommendations implemented in data infrastructures. Required adaptations: introduce versioning, if not already in place; capture the sub-setting process (queries); implement a dedicated query store; a bit of additional functionality (query re-writing, hash functions, …). Done!?
"Big data", database-driven, well-defined interfaces, trained experts available. "Complex, only for professional research infrastructures"?
Long Tail Research Data
[Chart: amount of data sets vs. data set size. The head of the distribution is big data: well organized, often used and cited. The long tail is less well organized and non-standardised, with no dedicated infrastructure: "dark data" [1]]
[1] Heidorn, P. Bryan. "Shedding light on the dark data in the long tail of science." Library Trends 57.2 (2008).
Prototype Implementations
A solution for small-scale data: CSV files, no "expensive" infrastructure, low overhead.
Reference implementations:
Git-based prototypes (Git is a widely used versioning system): A) using separate folders; B) using branches.
MySQL-based prototype: C) migrates CSV data into a relational database.
The data backend is responsible for versioning the data sets. Subsets are created with scripts or queries via an API or web interface. Transparent to the user: always CSV.
CSV Reference Implementation 2
Git Implementation 1: upload CSV files to a Git repository (versioning). Subsets are created via a scripting language (e.g. R): select rows/columns, sort; returns a CSV plus a metadata file. The metadata file with the script parameters is stored in Git (the scripts are stored in Git as well). A PID is assigned to the metadata file. Git is used to retrieve the proper data set version, and the script is re-executed on the retrieved file.
Git-based Prototype
Git Implementation 2 addresses issues with a common commit history and branching data.
Using the Git branching model: orphaned branches for queries and data keep the commit history clean and allow merging of data files. Web interface for queries (CSV2JDBC). The commit hash is used for identification. The assigned PID is hashed with SHA1, and the hash of the PID is used as the filename (ensures permissible characters).
Git-based Prototype
Prototype: https://github.com/Mercynary/recitable
Step 1: Select a CSV file in the repository. Step 2: Create a subset with a SQL query (on the CSV data). Step 3: Store the query script and metadata. Step 4: Re-execute!
MySQL-based Prototype
Data upload: the user uploads a CSV file into the system. Data migration from the CSV file into the RDBMS: generate the table structure; add metadata columns (versioning); add indices (performance).
Dynamic data: insert, update and delete records; events are recorded with a timestamp.
Subset creation: the user selects columns, filters and sorts records in a web interface; the system traces the selection process; exports CSV.
MySQL-based Prototype
Source at GitHub. Videos: login, upload, subset, resolver, update.
CSV Reference Implementation 2
Stefan Pröll, Christoph Meixner, Andreas Rauber. Precise Data Identification Services for Long Tail Research Data. Proceedings of the Intl. Conference on Preservation of Digital Objects (iPRES 2016), October 2016, Bern, Switzerland.
Source at GitHub: RDA-WGDC-CSV-Data-Citation-Prototype. Videos: login, upload, subset, resolver, update.
WG Data Citation Pilot: WUSTL
Cynthia Hudson Vitale, Leslie McIntosh, Snehil Gupta, Washington University in St. Louis
Moving Biomedical Big Data Sharing Forward: An Adoption of the RDA Data Citation of Evolving Data Recommendation for Electronic Health Records
Leslie McIntosh, PhD, MPH (@mcintold) and Cynthia Hudson Vitale, MA (@cynhudson)
RDA P8, Denver, USA, September 2016
Background
Millions of dollars are being put into data sharing and brokerage using clinical data, and clinical decisions are being made based upon these (and other) data. Hence, it is important to make this process rigorous. But why do I care?
Leslie McIntosh, PhD, MPH: Director, Center for Biomedical Informatics
Cynthia Hudson Vitale, MA: Data Services Coordinator
BDaaS: Biomedical Data as a Service
[Diagram: Researchers, the Data Broker, the i2b2 application, and the Biomedical Data Repository at the Center for Biomedical Informatics]
Move some of the responsibility for reproducibility
How do we facilitate moving some of the responsibility for reproducibility from the biomedical researcher into the biomedical pipeline?
RDA/MacArthur Grant We saw the RDA/MacArthur grant as an opportunity to move forward with our vision.
Biomedical Adoption Project Goals
Specifically, through this project, we set out to: implement the RDA Data Citation WG recommendations in the local Washington University i2b2 instance; engage other i2b2 adoptees in this conversation on reproducibility; contribute any code back to the i2b2 community.
RDA Data Citation WG Recommendations
R1 – Data Versioning: For retrieving earlier states of datasets, the data need to be versioned. Markers shall indicate inserts, updates, and deletes of data in the database.
R2 – Data Timestamping: Ensure that operations are timestamped, i.e. any additions or deletions are marked with a timestamp.
R3, R9 – Query Store: Provide means for storing queries and the associated metadata in order to re-execute them in the future.
R7 – Query Timestamping: Assign a timestamp to the query based on the last update to the entire database.
R8 – Query PID: The data shall be identified via a persistent identifier (PID) pointing to a timestamped query, resolving to a landing page.
R10 – Query Citation: Provide a recommended citation text and the PID to the user.
Internal Implementation Requirements
Scalable: must handle a large volume of data efficiently and relatively fast, because versioning requires storing the history of changes to the data, as well as scanning and updating large amounts of data.
Available for PostgreSQL: must be implementable for PostgreSQL, because this is the database system at CBMI for EHR data.
Actively supported: the software tools used must have regular updates and maintenance support from the community or a reliable commercial source; otherwise, they should be understood by in-house developers, so that support and updates can be provided by the CBMI team if no other source is available.
Easy to maintain: should be relatively easy to implement, to allow for easy maintenance and upkeep.
Easy for data brokers to use: data brokers need to run queries and access the query audit.
RDA-MacArthur Grant Focus
[Diagram: the grant focuses on the data flow between the Data Broker and Researchers]
R1 and R2 Implementation
[Diagram: live table RDC.table and history table RDC.hist_table (stores the history of data changes), each with columns c1, c2, c3 and sys_period, connected by triggers of the PostgreSQL extension "temporal_tables"]
In our proof of concept (RDC), we used the temporal_tables extension for PostgreSQL for timestamping and versioning. Each live data table that needs logging was given an additional column named "sys_period", which holds a timestamp-range value identifying the start timestamp, end timestamp, and boundary condition of when the record is or was valid. The name "sys_period" was chosen to be consistent with the temporal_tables documentation. A second copy of each live data table was created to hold historical data. These history tables have the same structure as the live data tables, but without the primary keys, and are used for logging each data change. For consistency, all history tables were given the same name as the table being logged, prefixed with "hist_". The trigger provided by the temporal_tables extension was added to each live data table, to be called before insert, update, or delete transactions. The trigger inserts a copy of the existing row from the live data table into the history table, with the timestamp range modified to show the current date as the upper bound of the valid range.
ETL Incrementals
During the incremental ETL from the clinical data source, the ETL identifies whether a row of data is an update of or an insert into existing RDC data. If it is an update, the data in RDC.table gets updated to the new values; however, the trigger set up before each update or delete first copies the original field values into RDC.hist_table, and only then allows the update of the data in RDC.table. In the live data tables, sys_period holds the start of the range during which the record is valid, with NULL as the end date; inside hist_table, sys_period holds both the start and end date of the record's validity (e.g. [2016-9-8 00:00, 2016-9-9 00:00)).
R3, R7, R8, R9, and R10 Implementation
[Diagram: RDC.table and RDC.hist_table are combined by the view RDC.table_with_history; functions and triggers write to query audit tables; PostgreSQL extension "temporal_tables"]
To implement the RDA recommendations in RDC for storing the query and query metadata, query timestamping, and storing the query PID (persistent identifier) and suggested query citation, we did the following. In addition to using the temporal_tables extension, we created a view as a union between each live data table and its corresponding history table, so that data can be queried as they were at any given time. These views were named after the live table with the suffix "_with_history". Each view contains the currently valid data and all historical versions of the same data, along with the sys_period ranges of when each version was valid. Several functions and triggers were set up to store user data and query data in the query audit tables at the time data brokers query the views.
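A sketch of such a view and an "as of" query over it, using the illustrative names from the earlier temporal_tables sketch:

```sql
-- All versions of every record: current rows plus superseded rows,
-- each carrying the sys_period range in which it was valid.
CREATE VIEW observation_with_history AS
    SELECT * FROM observation
    UNION ALL
    SELECT * FROM hist_observation;

-- Time travel: the rows exactly as they were valid at a given instant
-- (@> tests whether the range contains the timestamp).
SELECT *
FROM   observation_with_history
WHERE  sys_period @> timestamptz '2016-09-08 12:00:00+00';
```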
Data Reproducibility Workflow
Today: the data broker runs a query on the special view (a view over the union of the live data and the hist_table). The query run is audited automatically, storing the query, its metadata, the date of execution, the query PID and a suggested citation. The data broker gets back the data and a suggested citation containing the query date and query PID.
Some time later: the data broker can reproduce the results received for a query PID by providing that PID to a special function that re-runs the query. The function gets the query date range using the provided PID, and then operates over the union of the live data and all historical versions of that data, picking only the versions of the data valid at the time when the query with the provided PID was executed. The data broker gets the same dataset back as before.
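The re-execution step can be sketched as a single query, assuming a hypothetical query_audit table that stores the execution timestamp per PID:

```sql
-- Re-run a cited query: keep only the record versions that were valid
-- at the audited execution time of the given PID.
SELECT o.*
FROM   observation_with_history o
WHERE  o.sys_period @> (SELECT qa.executed_at
                        FROM   query_audit qa
                        WHERE  qa.query_pid = '10.1234/q-0001');
```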
Bonus Feature: Determine if a Change Occurred
A nice feature we added is a way for the data broker to find out whether the data behind a query with a certain PID have actually changed over time, by comparing the version at query time to the current data version. If the data have not changed, the function returns false. If the data have changed, the comparison is provided to the data broker.
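One way to sketch that comparison in plain SQL (the pilot wraps the equivalent logic in a function; run the EXCEPT in both directions for a full diff):

```sql
-- Rows present in the cited version but no longer in the current data.
SELECT c1, c2, c3
FROM   observation_with_history
WHERE  sys_period @> (SELECT executed_at FROM query_audit
                      WHERE query_pid = '10.1234/q-0001')
EXCEPT
SELECT c1, c2, c3 FROM observation;
```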
Future Developments
Develop a process for sharing the query PID with researchers in an automated way. Resolve query PIDs to a landing page with the query metadata. Implement research reproducibility requirements in other systems where possible.
Outcomes and Support
Obtained Outcomes
Implemented the WG recommendations. Engaged with other i2b2 adoptees (Harvard, Nationwide Children's Hospital).
Success of this project will be based on a number of factors, including: 1. the integration of the WGDC recommendations into the WU CBMI instance of i2b2; 2. feedback from existing i2b2 installations on the recommended changes and the RDA WGDC-compliant code.
Dissemination
Poster presentation (Harvard U, July 2016). A scientific manuscript based on our proof of concept submitted to the AMIA TBI/CRI 2017 conference. Sharing the code with the community.
Return on Investment (ROI) - Estimated
20 hours to complete 1 study; $150/hr (unsubsidized); $3,000 per study; 115 research studies per year.
There are different ways to calculate the financial ROI. One is to determine the hours of time we potentially save (or do not waste) trying to replicate a study. Given our calculations, the investment pays for itself after about 14 replication studies (14 × $3,000 = $42,000 in avoided replication effort).
Funding Support
MacArthur Foundation 2016 Adoption Seeds program, through a sub-contract with the Research Data Alliance. Washington University Institute of Clinical and Translational Sciences, NIH CTSA Grant Numbers UL1TR… and UL1TR…S1. Siteman Cancer Center at Washington University, NIH/NCI Grant P30 CA…
Center for Biomedical Informatics @WUSTL: Teams for Reproducible Research
NIH-NLM Supplement: Leslie McIntosh, Cynthia Hudson-Vitale, Anthony Juehne, Rosalia Alcoser, Xiaoyan 'Sean' Liu, Brad Evanoff
Research Data Alliance: Leslie McIntosh, Cynthia Hudson-Vitale, Anthony Juehne, Snehil Gupta, Connie Zabarovskaya, Brian Romine, Dan Vianello
RDA Collaborators: Andreas Rauber, Stefan Pröll
WashU CBMI Research Reproducibility Resources
Repository | Slides | Bibliography
Adoption of Data Citation Outcomes by BCO-DMO Cynthia Chandler, Adam Shepherd
A story of success enabled by RDA
An existing repository (bco-dmo.org), curating marine research data since 2006. Faced with new challenges, but no new funding, e.g. data publication practices to support citation. Used the outcomes of the RDA Data Citation Working Group to improve data publication and citation services.
BCO-DMO is a thematic, domain-specific repository funded by NSF Ocean Sciences and Polar Programs.
BCO-DMO curated data are: served at http://bco-dmo.org (URLs, URIs); published at an Institutional Repository (CrossRef DOI); archived at NCEI, a US National Data Center. WHOAS: Woods Hole Open Access Server. Data are also harvested by and discoverable from DataONE.
BCO-DMO Dataset Landing Page (Mar '16)
The larval krill dataset with measurements from 2001 and … The full dataset has been archived at NCEI and assigned an accession number. The data used for a publication may include only a portion of the dataset; ideally, the appropriate subset would be published out and assigned a DOI, enabling subsequent retrieval of the exact data used for the publication.
Initial Architecture Design Considerations (Jan 2016)
Modified Architecture (March 2016)
Opportunity to update our information model
BCO-DMO Data Publication System Components
BCO-DMO publishes data to WHOAS and a DOI is assigned. As of June 2016, the BCO-DMO architecture supports data versioning.
In the BCO-DMO data management system architecture, data are served via the BCO-DMO website, with parallel options for publishing data at the WHOAS Institutional Repository and archiving data at the appropriate NCEI national data center (NCEI is the appropriate national data center in the US for ocean research data). A Drupal content management system provides access to the metadata catalog and direct access to data via a URL; the Drupal MySQL content is published out in a variety of forms, one of which is Linked Data. When the data are declared 'final, no updates expected', a package of metadata and data can be exported from the BCO-DMO system and submitted to the Woods Hole Open Access Server (WHOAS), the local OAIS-compliant Institutional Repository (IR) hosted by the WHOI Data Library and Archives (DLA), part of the larger Marine Biological Laboratory and Woods Hole Oceanographic Institution (MBLWHOI) Library system in Woods Hole, MA. A DOI is assigned when the package is deposited in WHOAS, with reciprocal links entered at BCO-DMO and WHOAS to connect the package at WHOAS with the record in the BCO-DMO catalog. The DOI resolves to the WHOAS dataset landing page, and a Linked Data URI resolves to the BCO-DMO landing page for the same dataset. Previously, the system supported only one version of a dataset; the architecture now supports data versioning.
BCO-DMO Data Citation System Components
Data managed by BCO-DMO are published at WHOAS (the IR curated by the MBLWHOI Library Data Library and Archives), archived at NCEI, and harvested by DataONE (an NSF-funded harvesting system for environmental data).
A new data version is assigned a new DOI (the handle is versioned if only the metadata changes).
New capability (implemented). Procedure, when a BCO-DMO data set is updated: a copy of the previous version is preserved; a DOI is requested for the new version of the data; the data are published and a new landing page is created for the new version, with the new DOI assigned. The BCO-DMO database has links to all versions of the data (archived and published); both the archive and the published dataset landing pages link back to the best version of the full dataset at BCO-DMO; and the BCO-DMO data set landing page displays links to all archived and published versions.
BCO-DMO Data Set Landing Page
Versioned dataset example ("original" indicates that only the metadata changed). DOI: …/1912/6421. Dataset: Maas - Pteropod respiration rates. Data published at WHOAS: doi: …/1912/bco-dmo…
[Screenshot callouts: the published dataset DOI, NSF award numbers, and the BCO-DMO dataset URI (published out as Linked Data)]
BCO-DMO Data Set Landing Page
Linked from the BCO-DMO dataset landing page to the data published at WHOAS (doi: …/1912/bco-dmo…).
Linked to Publication via DOI
Linking data sets with publications using DOIs.
New Capabilities … BCO-DMO becoming a DataONE Member Node
New Capabilities … BCO-DMO Data Set Citation
The BCO-DMO dataset landing page has a new citation button: a CC-BY 4.0 license with suggested citation text based on the DOI. Data published at WHOAS: doi: …/1912/bco-dmo…
Using the DOI citation formatter service: it doesn't work yet for our DOIs (because we don't have enough DOI metadata), but when it does, it will return something like: Cite as: Twining, B. (2016). "Element Quotas of Individual Synechococcus Cells Collected During Bermuda Atlantic Time-Series Study (BATS) Cruises Aboard the R/V Atlantic Explorer Between Dates … and …". Version 05/06/2016. Biological and Chemical Oceanography Data Management Office (BCO-DMO) Dataset. doi: …/1912/bco-dmo… [access date]
Thank you … to the Data Citation Working Group for their efforts, and to RDA US for funding this adoption project.
Timeline: redesign/prototype completed by 1 June 2016; new citation recommendation by 1 Sep 2016; report out at RDA P8 (Denver, CO) September 2016; final report by 1 December 2016.
Cyndy Chandler, @cynDC @bcodmo
EXTRA SLIDES
Removed these to reduce the talk to … minutes.
Adoption of Data Citation Outputs
Evaluation: evaluate recommendations (done December 2015); try implementation in the existing BCO-DMO architecture (work began 4 April 2016).
Trial: R1–R11 fit well with the current BCO-DMO architecture; R12 is doable and will be tested as part of DataONE node membership; R13–R14 are consistent with the Linked Data approach to data publication and sharing.
NOTE: adoption grant received from RDA US (April 2016).
RDA Data Citation (DC) of Evolving Data
DC goals: to create identification mechanisms that allow us to identify and cite arbitrary views of data, from a single record to an entire data set, in a precise, machine-actionable manner, and to cite and retrieve that data as it existed at a certain point in time, whether the database is static or highly dynamic.
DC outcomes: 14 recommendations and associated documentation, ensuring that data are stored in a versioned and timestamped manner, and identifying data sets by storing and assigning persistent identifiers (PIDs) to timestamped queries that can be re-executed against the timestamped data store. More information is available from the RDA site.
In addition to BCO-DMO, 8 other pilot studies have already expressed interest in adopting the DC WG recommendations: ARGO; Austrian Centre for Digital Humanities; Center for Biomedical Informatics (CBMI), Washington University, St. Louis; Climate Change Centre Austria; ENVRIplus; Natural History Museum London; Ocean Networks Canada; Virtual Atomic and Molecular Data Centre.
RDA Data Citation WG Recommendations
Data Versioning: For retrieving earlier states of datasets, the data need to be versioned. Markers shall indicate inserts, updates and deletes of data in the database.
Data Timestamping: Ensure that operations on data are timestamped, i.e. any additions or deletions are marked with a timestamp.
Data Identification: The data used shall be identified via a PID pointing to a time-stamped query, resolving to a landing page.
Oct 2015 version with 14 recommendations. DC WG chairs: Andreas Rauber, Ari Asmi, Dieter van Uytvanck.
New Capability (Implemented)
Procedure, when a BCO-DMO data set is updated: a copy of the previous version is preserved; a DOI is requested for the new version of the data; the data are published and a new landing page is created for the new version, with the new DOI assigned. The BCO-DMO database has links to all versions of the data (archived and published); both the archive and the published dataset landing pages link back to the best version of the full dataset at BCO-DMO; and the BCO-DMO data set landing page displays links to all archived and published versions.
References: Extended Descriptions of the Recommendations
Altman, M., & Crosas, M. (2013). The evolution of data citation: from principles to implementation. IASSIST Quarterly, 37(1-4).
CODATA-ICSTI (2013). Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data. Data Science Journal, 12, CIDCR1-CIDCR75.
FORCE11 (2014). Joint Declaration of Data Citation Principles.
Duerr, R. E., Downs, R. R., Tilmes, C., Barkstrom, B., Lenhardt, W. C., Glassy, J., et al. (2011). On the utility of identification schemes for digital earth science data: an assessment and recommendations. Earth Science Informatics, 4(3).
Vermont Monitoring Cooperative James Duncan, Jennifer Pontius VMC
Implementation of Dynamic Data Citation
James Duncan and Jennifer Pontius, 9/16/2016
Ecosystem Monitoring Collaborator Network Data Archive, Access and Integration
Many Disciplines, Many Contributors
VMC houses any data related to forest ecosystem condition, regardless of affiliation or discipline. This gives us a unique advantage: we can house data that has nowhere else to go, and we have the staff and technical capacity to bring that data in. It also brings its own complexity in terms of data diversity.
Why We Need It
Continually evolving datasets: some errors are not caught until the next field season; frequent reporting and publishing.
Dynamic Data Citation – Features Needed
A light footprint on database resources. Works on top of the existing catalog and metadata. Works in an institutionally managed PHP/MySQL environment. User-driven control of what quantity of change constitutes a version. Integration with the management portal. Tracks granular changes in data.
User Workflow to Date
Modify a dataset: changes are tracked; the original data table is unchanged.
Commit to a version, assign a name: computes a result hash (table pkid, column names, first column of data) and a query hash; updates the data table to the new state; formalizes the version.
Restore a previous version: creates a new version table from the current data table state and walks it back using stored SQL; garbage-collected after a period of time.
User Workflow, Still to Come
V1.1: web-based data editing validation; DOI minting integration; public display; subsetting workflow; other methods of data modification?; upgrade to the rest of the system.
[Diagram: versions V1.1, V1.2, V1.3 with DOIs, a query, and an unsaved state (DOI?)]
Technical Details
Version Info Table:
Version PID | Dataset ID | Version Name | Version ID | Person ID | Query Hash | Result Hash | Timestamp
23456       | 3525       | Version 1.5  | 3          | …         | …          | …           | …
23574       | …          | Unsaved      | -1         | NULL      | …          | …           | …
Step Tracking Table (child of Version Info):
Step ID | Version PID | Forward                      | Backward                       | Order
983245  | 23574       | DELETE FROM…                 | INSERT INTO…                   | 1
…       | …           | UPDATE SET site="Winhall"…   | UPDATE SET site="Lye Brook"…   | 2
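A rough sketch of how the stored backward steps could be replayed to rebuild an earlier version. The VMC system is PHP/MySQL; this PL/pgSQL version is only an illustration of the idea, with hypothetical table and column names, and it assumes version PIDs increase monotonically:

```sql
CREATE FUNCTION restore_version(p_version_pid bigint) RETURNS void
LANGUAGE plpgsql AS $$
DECLARE
    step record;
BEGIN
    -- Work on a scratch copy so the live table stays untouched.
    EXECUTE 'CREATE TABLE restored_data AS TABLE live_data';

    -- Replay the stored "backward" SQL of every step made after the
    -- target version, newest first, to walk the copy back in time.
    FOR step IN
        SELECT backward_sql
        FROM   step_tracking
        WHERE  version_pid > p_version_pid
        ORDER  BY version_pid DESC, step_order DESC
    LOOP
        EXECUTE replace(step.backward_sql, 'live_data', 'restored_data');
    END LOOP;
END;
$$;
```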
Implementation Challenges and Questions
Challenges: large updates; re-creation of past versions, in terms of garbage collection and storage.
Questions: query-uniqueness checking and query normalization; efficient but effective result-hashing strategies; a linear progression of the data versus a branching network.
Acknowledgments
Adoption seed funding: the MacArthur Foundation and the Research Data Alliance. The US Forest Service State and Private Forestry program, for core operational funding of the VMC. Fran Berman, Yolanda Meleco and the other adopters who have been sharing their experiences. All the VMC cooperators that contribute.
Thank you!
ARGO Justin Buck, Helen Glaves BODC
WG Data Citation: Adoption Meeting, Argo DOI Pilot
RDA P8 2016, Denver, 16th September 2016
Justin Buck, National Oceanography Centre (UK); Thierry Carval, Ifremer (France); Thomas Loubrieu, Ifremer (France); Frederic Merceur, Ifremer (France)
300+ citations per year. How do we cite Argo data at a given point in time? Is it possible with a single DOI?
Argo Data System (Simplified)
Raw data flows from the Data Assembly Centres (DACs: 10 of them at national level, performing real-time processing and QC) to the GTS (plus the GTSPP archive) and to the Global Data Assembly Centres (GDACs: 2 of them, mirrored, FTP access, a repository of NetCDF files with no version control, serving the latest version). Delayed-mode quality control feeds back into the system, and data users access the GDACs.
Key Associated Milestones
2014: introduction of dynamic data into the DataCite metadata schema; snapshots were used as an interim solution for Argo.
2015: RDA recommendations on evolving data; the legacy GDAC architecture does not permit a full implementation.
How to Apply a DOI for Argo
The archive is a collection of snapshots/granules, each combining state time and observation time. To cite a particular snapshot, one can potentially cite a time slice of the archive, i.e. the snapshot at a given point in time.
New single DOI Argo (2000). Argo float data and metadata from Global Data Assembly Centre (Argo GDAC). SEANOE.
# key used to identify the snapshot
Steps Towards the RDA Recommendation
Archive snapshots enable R1 (Data Versioning) and R2 (Timestamping) at monthly granularity.
The Argo pilot effectively uses predetermined referencing of snapshots, removing the need for R3–R7 (Query Store Facilities, Query Uniqueness, Stable Sorting, Result Set Verification, Query Timestamping).
The # keys are PIDs for the snapshots and have associated citation texts (R8 – Query PID; R9 – Store Query; R10 – Automated Citation Texts).
The SEANOE landing page architecture means R11 (Landing Page) and R12 (Machine Actionability) are effectively met.
The final two requirements, R13 (Technology Migration) and R14 (Migration Verification), are untested at this stage.
Summary
There is now a single DOI for Argo. It takes account of the legacy GDAC architecture, with monthly temporal granularity. It enables reproducible research and simplifies the tracking of citations. Using '#' rather than '?' in the identifier takes account of the current DOI resolution architecture. The approach is extensible to other observing systems such as OceanSITES and EGO, and the concept allows for different subsets of Argo data, e.g. ocean basins or Bio-Argo data.
Progress on Data Citation within VAMDC
C.M. Zwölf and the VAMDC Consortium
From RDA Data Citation Recommendations to New Paradigms for Citing Data from VAMDC
C.M. Zwölf and the VAMDC consortium, RDA 8th Plenary, Denver
The Virtual Atomic and Molecular Data Centre
Federates 29 heterogeneous databases. The "V" of VAMDC stands for Virtual in the sense that the e-infrastructure does not contain data: the infrastructure is a wrapper that exposes a set of heterogeneous databases in a unified way. The consortium is politically organized around a Memorandum of Understanding (15 international members had signed the MoU as of 1 November 2014). High-quality scientific data come from different physical/chemical communities. VAMDC provides data producers with a large dissemination platform and removes the bottleneck between data producers and a wide body of users: plasma sciences, lighting technologies, atmospheric physics, environmental sciences, fusion technologies, health and clinical sciences, astrophysics. Single and unique access to heterogeneous A+M databases.
The VAMDC Infrastructure Technical Architecture
Each existing, independent A+M database is wrapped by a VAMDC node: a standard layer accepts queries submitted in a standard grammar and formats the results into a standard XML file (XSAMS), and the node's resources are registered in the registry layer. VAMDC clients (the portal, SpecView, SpectCol) ask the registry for the available resources and dispatch a unique A+M query to all registered resources, returning a set of standard files.
VAMDC is agnostic about the local data storage strategy on each node; each node implements the access/query/result protocols. There is no central management system: decisions about technical evolutions are made by consensus in the Consortium. It is both technically and politically challenging to implement the WG recommendations.
Let Us Implement the Recommendation!
The problem is more anthropological than technical…
Tagging and versioning data: we see technically how to do that. OK, but what is the data granularity for tagging? Naturally it is the dataset (A+M data have no meaning outside this given context), but each data provider defines differently what a dataset is.
What does data citation really mean? Everyone knows what it is! Yes, but everyone has their own definition: RDA cites database records or output files (an extracted data file may have an H-factor), whereas VAMDC cites all the papers used for compiling the content of a given output file.
Let Us Implement the Recommendation!
The implementation will be an overlay to the standard / output layer, and thus independent from any specific data node.
Tagging versions of data, a two-layer mechanism: (1) fine-grained granularity: an evolution of the XSAMS output standard for tracking data modifications; (2) coarse-grained granularity: at each data modification of a given data node, the version of the data node changes. With the second mechanism we know that something changed, in other words, that the result of an identical query may differ from one version to the other; the detail of which data changed is accessible using the first mechanism.
Query store: built over the versioning of data and plugged into the existing VAMDC data-extraction mechanisms. Due to the distributed VAMDC architecture, the query store architecture is similar to a log service.
Data Versioning: Overview of the Fine-Grained Mechanism
We adopted a change of paradigm (weak structuration): within an output XSAMS file, the content (elements, energy states, radiative and collisional processes) is partitioned into blocks, each tagged with a version according to the infrastructure state and updates.
This approach has several advantages: it solves the data-tagging granularity problem; it is independent of what is considered a dataset; the new files remain compliant with old libraries and processing programs; we add a new feature as an overlay to the existing structure; and we induce a structuration without changing the structure (weak structuration).
Technical details are described in: New model for datasets citation and extraction reproducibility in VAMDC, C.M. Zwölf, N. Moreau, M.-L. Dubernet, in press, J. Mol. Spectrosc. (2016); an arXiv version is available.
Let Us Focus on the Query Store
The difficulties we have to cope with: handling a query store in a distributed environment (RDA did not design it for these configurations), and integrating the query store with the existing VAMDC infrastructure.
The implementation of the query store is the goal of a joint collaboration between VAMDC and RDA-Europe: development started during spring 2016, and the final product will be released during 2017.
There is also a collaboration with Elsevier for embedding the VAMDC query store into the pages displaying the digital versions of papers: designing technical solutions for paper / data linking at paper submission (for authors) and at paper display (for readers).
Let Us Focus on the Query Store
Sketching the functioning, from the final user's point of view: the user submits a query through the VAMDC portal (query interface); the infrastructure computes the response; and the portal (result part) returns access to the output data file together with a Digital Unique Identifier (DUI) associated with the current extraction.
The identifier resolves to a landing page with the query metadata: the original query; the date and time the query was processed; the version of the infrastructure when the query was processed; the list of publications needed for answering the query; and, when supported by the VAMDC federated DB, retrieval of the output data file as it was computed (query re-execution).
Users can also manage their queries (with authorisation/authentication), group arbitrary sets of queries (with their related DUIs) and assign them a DOI to use in publications. Editors may follow the citation pipeline: credit delegation applies.
Let Us Focus on the Query Store
Sketching the functioning, from the technical (internal) point of view: when a node receives a user query, it notifies the query store listener service with the following information: the identity of the user (optional); the client software used; the identifier of the node receiving the query; the version (with related timestamp) of the node receiving the query; the version of the output standard used by the node for replying the results; the query submitted by the user; and the link to the result data.
For each received notification, the listener checks whether there is an already existing query having the same query text, submitted to the same node, and with the same node version. If not, it provides the query with a unique time-stamped identifier, follows the link to get the data and processes them to extract the relevant metadata (e.g. the BibTeX entries of the references used for compiling the output file), stores this metadata, and, if the identity of the user is provided, associates the user with the query ID. If such a query already exists, the listener takes the already existing unique identifier and (incrementally) associates the new query timestamp (and, if provided, the identifier of the user) with that query identifier.
Let Us Focus on the Query Store
Remark on query uniqueness: the query language supported by the VAMDC infrastructure is VSS2 (VAMDC SQL Subset 2). We are working on a specific VSS2 parser (based on ANTLR) which should identify, among queries expressed in different ways, the ones that are semantically identical. We are designing this analyzer as an independent module, hoping to extend it to all of SQL.
Final Remarks
Our aims: provide the VAMDC infrastructure with an operational query store; share our experience with other data providers; and provide data providers with a set of libraries/tools/methods for easy implementation of a query store. We will try to build a generic query store (i.e. using generic software blocks).
UK Riverflow Archive Matthew Fry mfry@ceh.ac.uk
UK National River Flow Archive
Curation and dissemination of regulatory river flow data for research and other access. The data are used for significant research outputs, with a large number of citations annually. Updated annually, but entire flow series are also regularly revised over time (e.g. when stations are resurveyed). There is internal auditing, but the history is not exposed to users.
UK National River Flow Archive
Oracle relational database with time series and metadata tables: ~20M daily flow records, plus monthly / daily catchment rainfall series; metadata (station history, owners, catchment soils / geology, etc.); total size ~5GB.
Time series tables are automatically audited, but reconstruction is complex. Users generally download simple files, but a public API is in development and the R-NRFA package is out there. Fortunately, all access is via a single codeset.
Our Data Citation Requirements
We cannot currently cite the whole dataset. Allow citation of a subset of the data, as it was at the time. Fit with the current workflow / update schedule, and with the requirements for reproducibility. Fit with current (file download) and future (API) user practices. Be resilient to gradual or fundamental changes in the technologies used. Allow tracking of citations in publications.
Options for Versioning / Citation
"Regulate" queries: limitations on the service provided. Enable any query to be timestamped / cited / reproducible: this does not readily allow verification (e.g. checksums) of queries (R7) or storage of queries (R9). Manage which queries can be citable: a limitation on the publishing workflow?
Versioning / citation solution
Automated archiving of the entire database: version-controlled scripts defining tables and creating / populating archived tables (largely complete). Fits in with the data workflow (public / dev versions); this only works because we have irregular / occasional updates.
Simplification of the data model (complete).
API development (being undertaken independently of the dynamic citation requirements): allows subsetting of the dataset in a number of ways; initially we simply need to implement versioning (started) to ensure it will cope with changes to data structures.
Fit to the dynamic data citation recommendations? Largely; we still need a mechanism for users to request / create a citable version of a query. Resource required: an estimated ~2 person-months.