VAMDC use-case for the RDA Data Citation Working Group. C.M. Zwölf and VAMDC consortium. 6th RDA Plenary, Paris, September 2015.


1 VAMDC use-case for the RDA Data Citation Working Group. C.M. Zwölf and VAMDC consortium. 6th RDA Plenary, Paris, September 2015

2 The Virtual Atomic and Molecular Data Centre (VAMDC): single and unique access to heterogeneous A+M databases
Fields: plasma sciences, lighting technologies, atmospheric physics, environmental sciences, fusion technologies, health and clinical sciences, astrophysics.
• Federates 28 heterogeneous databases (http://portal.vamdc.org/).
• The “V” of VAMDC stands for Virtual, in the sense that the e-infrastructure does not contain data. The infrastructure is a wrapper that exposes a set of heterogeneous databases in a unified way.
• The consortium is politically organized around a Memorandum of Understanding (15 international members have signed the MoU, 1 November 2014).
• High-quality scientific data come from different physical/chemical communities.
• Provides data producers with a large dissemination platform.
• Removes the bottleneck between data producers and a wide body of users.

3 The VAMDC infrastructure technical organization
[Diagram: four existing independent A+M databases.]

4 The VAMDC infrastructure technical organization
[Diagram: each existing independent A+M database sits behind a VAMDC wrapping layer, which turns it into a VAMDC node. A node accepts queries written in a standard vocabulary and returns results formatted as a standard XML file (XSAMS).]

5 The VAMDC infrastructure technical organization
[Diagram: the VAMDC nodes (wrapping layer over existing independent A+M databases, standard query vocabulary, XSAMS results) are registered as resources in the VAMDC Registry.]

6 The VAMDC infrastructure technical organization
[Diagram: the VAMDC clients (Portal, SpecView, SpectCol) ask the VAMDC Registry for the available resources and dispatch each query to all the registered resources. A unique A+M query goes in; a set of XSAMS files, produced by the VAMDC nodes, comes back.]
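The flow on this slide (a client asks the registry for the available nodes, sends the same query to each of them, and gathers one result per node) can be sketched as follows. This is only an illustration: the `Node`, `Registry` and `dispatch` names, the query string and the placeholder results are assumptions made for the example, not the VAMDC software or its API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a VAMDC node (a wrapped database)."""
    name: str
    run_query: Callable[[str], str]  # query text -> result document (placeholder string)


class Registry:
    """Hypothetical registry that remembers which nodes are registered."""

    def __init__(self) -> None:
        self._nodes: List[Node] = []

    def register(self, node: Node) -> None:
        self._nodes.append(node)

    def available_nodes(self) -> List[Node]:
        return list(self._nodes)


def dispatch(registry: Registry, query: str) -> Dict[str, str]:
    """Send one query to every registered node and collect one result per node."""
    return {node.name: node.run_query(query) for node in registry.available_nodes()}


# Two fake nodes returning placeholder documents instead of real XSAMS files.
registry = Registry()
registry.register(Node("node_a", lambda query: "<result node='node_a'/>"))
registry.register(Node("node_b", lambda query: "<result node='node_b'/>"))

# The query text is purely illustrative.
results = dispatch(registry, "example query")
```

A single user query therefore becomes one request per node, which is why the later slides need a way to recognize these requests as one logical query.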

7 Trying to implement the recommendations
• Query Store
• Tagging datasets with IDs (relational database case)

8 Trying to implement the recommendations (Query Store; tagging datasets with IDs, relational database case)
From spring 2014 to spring 2015
• A first study showed that the internal VAMDC standards (XSAMS format) and protocols could be extended to implement dataset identification (no blocking technological issues).
• Atomic and molecular data have no intrinsic meaning outside a given context (defining the zero energy state, the molecular symmetries, etc.). This context naturally defines the dataset perimeter.

9 Trying to implement the recommendations (Query Store; tagging datasets with IDs, relational database case)
From spring 2015 to present
• The issue is more “anthropological” than technological: each database provider (VAMDC node owner; recall that VAMDC federates 28 heterogeneous databases) has its own understanding and a priori idea of what a dataset is (some examples are on the Working Group wiki page, VAMDC use-case section).
• Indeed, a “DataSet” is not uniquely defined and understood by VAMDC members. Depending on the definition, the result of a single query may combine a multitude of datasets. In that case, how can datasets be used for citing data if one has to cite hundreds of different datasets for a single query?
• We need to find a common understanding.

10 Trying to implement the recommendations (Query Store; tagging datasets with IDs, relational database case)
The current proposal under discussion
• Keep in mind that this evolution is meant for reproducing a request at a later time and for sustainable data citation.
• We introduced the notion of Version rather than DataSet.
• A Version is the snapshot of an entire database at a given, timestamped moment.
• Each evolution of the database, even a minimal one, is associated with a new snapshot Version.
• All the data extracted from VAMDC will be attached to (i.e. will refer to) a specific Version.
• We are internally discussing this approach and evaluating its implementation cost.
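A minimal sketch of the Version idea, assuming a toy in-memory database: every change produces a new timestamped snapshot, and every extraction reports the Version it came from. The class and method names (`VersionedDatabase`, `commit`, `version_at`, `extract`) and the sample values are hypothetical and only illustrate the principle; a real node would need a far cheaper mechanism than copying the whole database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Version:
    number: int
    timestamp: datetime
    data: Tuple[Tuple[str, float], ...]  # immutable snapshot of the entire database


class VersionedDatabase:
    """Toy database in which every change creates a new snapshot Version."""

    def __init__(self) -> None:
        self._versions: List[Version] = []

    def commit(self, data: Dict[str, float], timestamp: datetime) -> Version:
        """Record a new snapshot; even a minimal change gets its own Version."""
        snapshot = tuple(sorted(data.items()))
        version = Version(len(self._versions) + 1, timestamp, snapshot)
        self._versions.append(version)
        return version

    def version_at(self, when: datetime) -> Optional[Version]:
        """Return the most recent Version whose timestamp is not after `when`."""
        candidates = [v for v in self._versions if v.timestamp <= when]
        return max(candidates, key=lambda v: v.timestamp) if candidates else None

    def extract(self, key: str, when: datetime) -> Tuple[float, int]:
        """Return a value together with the number of the Version it refers to."""
        version = self.version_at(when)
        if version is None:
            raise LookupError("no Version exists at the requested date")
        return dict(version.data)[key], version.number


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


db = VersionedDatabase()
db.commit({"level_0": 0.0}, utc(2015, 1, 1))
db.commit({"level_0": 0.0, "level_1": 1.5}, utc(2015, 6, 1))

value, version_number = db.extract("level_0", utc(2015, 3, 1))
```

Here an extraction dated March 2015 refers to the first Version, even though the database has since evolved.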

11 Trying to implement the recommendations (Query Store; tagging datasets with IDs, relational database case)
Considering the distributed architecture of VAMDC, many questions arose when trying to apply the Query Store (QS) strategy to VAMDC:
• Do we need a QS on each node?
• Do we need an additional QS on the central portal?
• Since the portal acts as a relay between the users and the nodes, how can we coordinate the generation of IDs for queries in this distributed context?

12 Trying to implement the recommendations (Query Store; tagging datasets with IDs, relational database case)
To address the questions raised by the distributed architecture of VAMDC, we are prototyping an implementation based on a central service that collects logs from each VAMDC infrastructure element.

13 Schema of the proposed architecture
[Diagram: Clients 1 to 3, VAMDC nodes (with versioning) and a Central Log Service.]
Message sent by Client 1 to the Central Log Service: a given user is using Client 1 at time t, from a given IP, to submit a given request to the infrastructure.

14 Schema of the proposed architecture
[Diagram: Clients 1 to 3, VAMDC nodes (with versioning) and a Central Log Service.]
Message sent by Client 1 to the Central Log Service: a given user is using Client 1 at time t, from a given IP, to submit a given request to the infrastructure.
Message sent by a VAMDC node to the Central Log Service: I am receiving at time t a given request from a user running a given client from a given IP.

15 Schema of the proposed architecture
[Diagram: Clients 1 to 3, VAMDC nodes (with versioning) and a Central Log Service.]
Communications with the Central Log Service are non-blocking, to avoid bottleneck effects.
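One common way to keep such notifications non-blocking is to hand them to a bounded in-memory queue drained by a background thread, so that a client or node never waits for the log service. The sketch below assumes this design; the slides do not say how the prototype achieves it, and `LogForwarder` and its methods are hypothetical names.

```python
import queue
import threading
from typing import Callable, Optional


class LogForwarder:
    """Forwards log events in the background so the caller never waits."""

    def __init__(self, send: Callable[[dict], None], max_pending: int = 1000) -> None:
        self._send = send  # function that actually delivers one event
        self._pending: "queue.Queue[Optional[dict]]" = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def notify(self, event: dict) -> bool:
        """Queue an event without waiting; return False if it had to be dropped."""
        try:
            self._pending.put_nowait(event)
            return True
        except queue.Full:
            return False

    def close(self) -> None:
        """Deliver what is still queued, then stop the background thread."""
        self._pending.put(None)  # sentinel telling the worker to stop
        self._worker.join()

    def _drain(self) -> None:
        while True:
            event = self._pending.get()
            if event is None:
                return
            self._send(event)


# Stand-in for the log service: events are simply appended to a list.
received = []
forwarder = LogForwarder(received.append)
accepted = [forwarder.notify({"request": n}) for n in range(3)]
forwarder.close()
```

With this choice, a slow or unreachable log service costs at most some dropped log events rather than slowing down the queries themselves.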

16 Schema of the proposed architecture
[Diagram: Clients 1 to 3, VAMDC nodes (with versioning) and a Central Log Service.]
From the raw information on the log service, we will be able to identify unique queries (which have been virtually multiplied by the infrastructure), give them unique IDs and assign timestamps.
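A sketch of how the raw log could be reduced to unique queries, under an assumed grouping rule: notifications that share the same user IP, client and query text, and that arrive within a short time window of the first one, are treated as copies of one logical query. The rule, the window length and all names (`LogEntry`, `identify_queries`) are assumptions for illustration; the slides do not specify the matching criteria.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogEntry:
    source: str   # client or node that sent the notification
    user_ip: str
    client: str
    query: str
    time: float   # seconds on a shared clock


def identify_queries(entries: List[LogEntry], window: float = 5.0) -> Dict[int, List[LogEntry]]:
    """Group raw notifications into logical queries keyed by a unique ID.

    Assumed rule: same (user_ip, client, query) and at most `window`
    seconds after the first notification of the group.
    """
    groups: Dict[int, List[LogEntry]] = {}
    latest_group: Dict[Tuple[str, str, str], int] = {}
    next_id = 1
    for entry in sorted(entries, key=lambda e: e.time):
        key = (entry.user_ip, entry.client, entry.query)
        group_id = latest_group.get(key)
        if group_id is None or entry.time - groups[group_id][0].time > window:
            group_id = next_id  # start a new logical query
            next_id += 1
            groups[group_id] = []
            latest_group[key] = group_id
        groups[group_id].append(entry)
    return groups


log = [
    LogEntry("client_1", "10.0.0.1", "portal", "query A", 0.0),
    LogEntry("node_a", "10.0.0.1", "portal", "query A", 0.2),
    LogEntry("node_b", "10.0.0.1", "portal", "query A", 0.3),
    LogEntry("client_2", "10.0.0.2", "portal", "query A", 0.1),
    LogEntry("client_1", "10.0.0.1", "portal", "query A", 100.0),
]
queries = identify_queries(log)

# The timestamp of a logical query is taken from its earliest notification.
timestamps = {query_id: group[0].time for query_id, group in queries.items()}
```

In this example the five notifications collapse into three logical queries: the first user's query (seen once by the client and once by each of two nodes), the second user's query, and the first user's repetition 100 seconds later.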

17 Architecture of the query store: a proposed API
[Diagram: the web services rely on the Central Log Service and on versioning on the databases.]
• Web service: takes a date and a query; returns a result identical to the one that would be obtained by submitting the query on the provided date.
• Web service: takes a query ID; returns the query and the associated timestamp.
• Web service: takes a query and a date; returns the associated query ID.

18 Architecture of the query store: a proposed API (continued)
[Diagram: the web services rely on the Central Log Service and on versioning on the databases.]
• Web service: takes the query ID; returns the associated results.
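The four proposed web services can be pictured as four methods on one in-memory object. This is a sketch under stated assumptions, not the proposed implementation: `QueryStore`, its method names and the ID format are hypothetical, dates are plain numbers, and the `run_at` callable stands in for the versioned databases that re-run a query as of a given date.

```python
from typing import Callable, Dict, Optional, Tuple


class QueryStore:
    """In-memory illustration of the four proposed query-store services."""

    def __init__(self, run_at: Callable[[str, float], str]) -> None:
        self._run_at = run_at  # (query, date) -> result as of that date
        self._by_id: Dict[str, Tuple[str, float]] = {}
        self._ids: Dict[Tuple[str, float], str] = {}

    def record(self, query: str, date: float) -> str:
        """Register a (query, date) pair, as the log service would, and return its ID."""
        key = (query, date)
        if key not in self._ids:
            query_id = f"q{len(self._ids) + 1}"
            self._ids[key] = query_id
            self._by_id[query_id] = key
        return self._ids[key]

    def result_at(self, query: str, date: float) -> str:
        """Service 1: date and query in, result as of that date out."""
        return self._run_at(query, date)

    def describe(self, query_id: str) -> Tuple[str, float]:
        """Service 2: query ID in, query and timestamp out."""
        return self._by_id[query_id]

    def find_id(self, query: str, date: float) -> Optional[str]:
        """Service 3: query and date in, query ID out (None if unknown)."""
        return self._ids.get((query, date))

    def result_for(self, query_id: str) -> str:
        """Service 4: query ID in, associated results out."""
        query, date = self._by_id[query_id]
        return self._run_at(query, date)


# Fake versioned backend: the "result" just echoes the query and the date.
store = QueryStore(lambda query, date: f"{query} as of {date}")
query_id = store.record("query A", 10.0)
```

Note that the fourth service is a composition of the second and the first, which is why it only needs the query ID.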

19 Concluding remarks / open questions about the query store
• How should we deal with the confidentiality of the information? Do we need an authentication/authorization policy on the query store? Is the sketched log service compliant with EU law on confidentiality?
• We are providing users with the tools to cite our dynamic data efficiently, but how can we be sure that they will use them for citing our data? In other words, how can we instill the ‘citation instinct’ in our end users? We are thinking of proposing a ‘reverse approach’: we may cite the users who access our data. They will accept these terms, which will be explained in the conditions of use of the VAMDC services.
• How can we prevent plagiarism? A user might extract data, modify them and cite them as the original extracted data. Do we have tools for preventing such behavior? An MD5 of the extracted data in the query store?
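The MD5 idea raised in the last question could look like the sketch below: the query store keeps a checksum of each extraction, and anyone can later check whether a file presented as that extraction is byte-for-byte unchanged. `ChecksumStore` and its methods are hypothetical names, and the sample data is invented; only `hashlib.md5` is a real library call.

```python
import hashlib
from typing import Dict


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the extracted data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


class ChecksumStore:
    """Keeps one checksum per query ID (hypothetical query-store extension)."""

    def __init__(self) -> None:
        self._checksums: Dict[str, str] = {}

    def record(self, query_id: str, extracted: bytes) -> str:
        checksum = md5_hex(extracted)
        self._checksums[query_id] = checksum
        return checksum

    def is_unmodified(self, query_id: str, candidate: bytes) -> bool:
        """True only if the candidate matches the recorded extraction exactly."""
        return self._checksums.get(query_id) == md5_hex(candidate)


checksums = ChecksumStore()
original = b"<extracted data for query q1>"
checksums.record("q1", original)

unchanged = checksums.is_unmodified("q1", original)
tampered = checksums.is_unmodified("q1", original + b" edited")
```

A checksum detects modification but does not prevent it, and MD5 is no longer considered collision-resistant, so a stronger hash such as SHA-256 (`hashlib.sha256`) may be preferable if deliberate tampering is the concern.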

