VAMDC use-case for the RDA Data Citation Working Group. C.M. Zwölf and VAMDC consortium. 6th RDA Plenary, Paris, September 2015.

Presentation transcript:


The Virtual Atomic and Molecular Data Centre

VAMDC: single and unique access to heterogeneous A+M databases

Fields: plasma sciences, lighting technologies, atmospheric physics, environmental sciences, fusion technologies, health and clinical sciences, astrophysics.

- Federates 28 heterogeneous databases.
- The "V" of VAMDC stands for Virtual, in the sense that the e-infrastructure does not contain data. The infrastructure is a wrapper that exposes a set of heterogeneous databases in a unified way.
- The consortium is politically organized around a Memorandum of Understanding (15 international members have signed the MoU, 1 November 2014).
- High-quality scientific data come from different physical/chemical communities.
- Provides data producers with a large dissemination platform.
- Removes the bottleneck between data producers and a wide body of users.

The VAMDC infrastructure technical organization

- Existing independent A+M databases: the starting point of the infrastructure.
- VAMDC wrapping layer: an existing independent A+M database plus the wrapping layer forms a VAMDC node. Queries are submitted using a standard vocabulary, and results are returned formatted as a standard XML file (XSAMS).
- VAMDC Registry: each node is a resource registered in the VAMDC Registry.
- VAMDC clients (Portal, SpecView, SpectCol): a client asks the registry for the available resources, dispatches a unique A+M query to all the registered resources, and receives a set of XSAMS files.
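The query flow described on this slide (ask the registry for resources, send one query to every registered node, collect one XSAMS file per node) can be sketched as follows. This is a minimal in-memory illustration, not the VAMDC client code: the `Registry` class, `make_node`, `dispatch`, the node names and the query string are all hypothetical, and the returned XML is a placeholder rather than valid XSAMS.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a VAMDC node: a function that takes a query
# string and returns an XSAMS document as text.
Node = Callable[[str], str]


@dataclass
class Registry:
    """Toy registry mapping a resource name to a node (assumption)."""
    resources: Dict[str, Node]

    def available_resources(self) -> List[str]:
        """Answer the client's 'which resources are available?' request."""
        return sorted(self.resources)


def make_node(name: str) -> Node:
    """Build a fake node that answers every query with a placeholder document."""
    def answer(query: str) -> str:
        # Placeholder only: a real node would return a full XSAMS file.
        return f"<XSAMSData><!-- placeholder from {name} --></XSAMSData>"
    return answer


def dispatch(registry: Registry, query: str) -> Dict[str, str]:
    """Send the same query to every registered resource and collect the results."""
    return {name: registry.resources[name](query)
            for name in registry.available_resources()}


registry = Registry({"node_a": make_node("node_a"),
                     "node_b": make_node("node_b")})
results = dispatch(registry, "SELECT ALL WHERE AtomSymbol = 'Fe'")
for name, xsams in results.items():
    print(name, xsams)
```

The single query is multiplied once per registered node, which is the effect the later slides on the query store have to account for.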

Trying to implement the recommendations
Query Store / Tagging Datasets with IDs (relational database case)

From spring 2014 to spring 2015
A first study showed that internal VAMDC standards (XSAMS format) and protocols could be extended to implement dataset identification (no blocking technological issues). Atomic and molecular data have no intrinsic meaning outside a given context (defining the zero energy state, the molecular symmetries, etc.). This context naturally defines the dataset perimeter.

From spring 2015 to present
The issue is more "anthropological" than technological, since each database provider (VAMDC node owner; recall that VAMDC federates 28 heterogeneous databases) has its own understanding and a priori idea of what a dataset is (some examples are on the Working Group wiki page, VAMDC use-case section). Indeed, a "DataSet" is not uniquely defined and understood by VAMDC members. Depending on the definition, the result of a single query may be a combination of a multitude of datasets. In that case, how can datasets be used for citing data if one has to cite hundreds of different datasets for a single query? We need to find a common understanding.

The current proposal under discussion
Keeping in mind that this evolution is meant for reproducing requests at a later time and for sustainable data citation:
- We introduced the notion of Version rather than DataSet.
- A Version is the snapshot of an entire database at a given (timestamped) time.
- Each evolution (even minimal) of the database will be associated with a new snapshot Version.
- All the data extracted from VAMDC will be attached to (i.e. will refer to) a specific Version.
- We are internally discussing this approach and evaluating its implementation cost.
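The Version idea (every change produces a new timestamped snapshot, and every extraction refers to one Version) can be illustrated with a toy in-memory database. This is a sketch under stated assumptions, not the VAMDC design: the `VersionedDatabase` class and its method names are hypothetical, versions are assumed to be recorded in chronological order, and storing a full copy per change is the naive strategy, chosen here only for clarity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Tuple


@dataclass
class VersionedDatabase:
    """Toy database that records a full snapshot after every change."""
    rows: Dict[str, float] = field(default_factory=dict)
    # Each entry: (version id, timestamp, copy of all rows at that time).
    versions: List[Tuple[int, datetime, Dict[str, float]]] = field(default_factory=list)

    def update(self, key: str, value: float, when: datetime) -> int:
        """Apply a change and register a new timestamped snapshot Version."""
        self.rows[key] = value
        version_id = len(self.versions) + 1
        self.versions.append((version_id, when, dict(self.rows)))
        return version_id

    def extract(self, key: str) -> Tuple[float, int]:
        """Return a value together with the Version it refers to."""
        version_id, _, snapshot = self.versions[-1]
        return snapshot[key], version_id

    def snapshot_at(self, when: datetime) -> Dict[str, float]:
        """Return the latest snapshot taken at or before the given time."""
        candidates = [snap for _, ts, snap in self.versions if ts <= when]
        if not candidates:
            raise LookupError("no Version exists at that time")
        return candidates[-1]


db = VersionedDatabase()
t1 = datetime(2015, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2015, 6, 1, tzinfo=timezone.utc)
db.update("line_1", 500.0, t1)   # Version 1
db.update("line_1", 500.5, t2)   # even a minimal change creates Version 2

value, version = db.extract("line_1")
old = db.snapshot_at(datetime(2015, 3, 1, tzinfo=timezone.utc))
print(value, version, old)
```

With this scheme a citation only needs the query and a Version (or a date that resolves to one), which sidesteps the need to agree on what a dataset is.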

Trying to implement the recommendations
Query Store / Tagging Datasets with IDs (relational database case)

Considering the distributed architecture of VAMDC, many questions arose when trying to apply the Query Store (QS) strategy to VAMDC:
- Do we need a QS on each node?
- Do we need an additional QS on the central portal?
- Since the portal acts as a relay between the users and the nodes, how can we coordinate the generation of IDs for queries in this distributed context?

We are prototyping an implementation based on a central service for collecting logs from each VAMDC infrastructure element...

Schema of the proposed architecture

Components: Client 1, Client 2, Client 3; VAMDC nodes (with versioning); Central Log Service.

- Message from a client to the Central Log Service: "A given user is using at time t the Client 1, from a given IP, for submitting a given request to the infrastructure."
- Message from a node to the Central Log Service: "I am receiving at time t a given request by a user running a given client from a given IP."
- Communications are non-blocking, to avoid bottleneck effects.
- From the raw information in the log service, we will be able to identify unique queries (that have been virtually multiplied by the infrastructure) with unique IDs and assign timestamps.
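One way to turn the raw notifications into unique queries is to group events that share the same client, IP and query text and arrive close together in time. The sketch below is an assumption about how such grouping could work, not the prototype mentioned in the slides: the `LogEvent` fields, the 5-second window and the hash-based IDs are all hypothetical choices, and the non-blocking transport is not modeled.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogEvent:
    reporter: str     # element that sent the notification, e.g. "client-1" or "node-a"
    timestamp: float  # seconds since the epoch, as reported
    client: str       # client software used to submit the request
    ip: str           # IP address the request came from
    query: str        # text of the request


def identify_queries(events: List[LogEvent],
                     window: float = 5.0) -> Dict[str, Tuple[float, str]]:
    """Group raw notifications into unique queries.

    Events with the same (client, ip, query) whose timestamps fall within
    `window` seconds of the first event of their group count as one query.
    Returns {query_id: (timestamp of first event, normalized query)}.
    """
    unique: Dict[str, Tuple[float, str]] = {}
    group_start: Dict[Tuple[str, str, str], float] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        # Collapse whitespace so trivially different spellings match.
        key = (ev.client, ev.ip, " ".join(ev.query.split()))
        start = group_start.get(key)
        if start is not None and ev.timestamp - start <= window:
            continue  # same query, multiplied by the infrastructure
        group_start[key] = ev.timestamp
        query_id = hashlib.sha256(repr((key, ev.timestamp)).encode()).hexdigest()[:16]
        unique[query_id] = (ev.timestamp, key[2])
    return unique


q = "SELECT ALL WHERE AtomSymbol = 'Fe'"
events = [
    LogEvent("client-1", 100.0, "portal", "192.0.2.1", q),
    LogEvent("node-a", 100.4, "portal", "192.0.2.1", q),
    LogEvent("node-b", 100.9, "portal", "192.0.2.1", q),
    LogEvent("client-1", 500.0, "portal", "192.0.2.1", q),  # resubmitted later
]
unique_queries = identify_queries(events)
for query_id, (ts, text) in unique_queries.items():
    print(query_id, ts, text)
```

The first three notifications collapse into a single query stamped at t = 100.0, and the later resubmission gets its own ID.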

Architecture of the query store: a proposed API for the query store

Building blocks: Central Log Service; versioning on databases.

Web services:
- Takes a query and a date; returns the associated query ID.
- Takes a query ID; returns the query and the associated timestamp.
- Takes a date and a query; returns a result identical to the one that would be obtained by submitting the query on the provided date.
- Takes a query ID; returns the associated results.
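The four web services can be written down as one small interface to make their inputs and outputs concrete. This is an illustrative in-process sketch, not the proposed API itself: the class and method names, the `register` step and the `execute_at` hook (standing in for the versioned databases) are hypothetical.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Tuple

# Hypothetical hook standing in for the versioned databases: given a query
# and a date, it returns the result the query would have produced then.
Executor = Callable[[str, datetime], str]


class QueryStore:
    """In-memory stand-in for the query store and its four services."""

    def __init__(self, execute_at: Executor) -> None:
        self._execute_at = execute_at
        self._by_id: Dict[str, Tuple[str, datetime]] = {}
        self._by_key: Dict[Tuple[str, datetime], str] = {}

    def register(self, query: str, date: datetime) -> str:
        """Record a query, as the log service would, and return its ID."""
        key = (query, date)
        if key not in self._by_key:
            query_id = f"q{len(self._by_id) + 1}"
            self._by_key[key] = query_id
            self._by_id[query_id] = key
        return self._by_key[key]

    def get_id(self, query: str, date: datetime) -> str:
        """Service: takes a query and a date, returns the associated query ID."""
        return self._by_key[(query, date)]

    def get_query(self, query_id: str) -> Tuple[str, datetime]:
        """Service: takes a query ID, returns the query and its timestamp."""
        return self._by_id[query_id]

    def results_at(self, query: str, date: datetime) -> str:
        """Service: takes a date and a query, returns the result as of that date."""
        return self._execute_at(query, date)

    def results_for(self, query_id: str) -> str:
        """Service: takes a query ID, returns the associated results."""
        query, date = self._by_id[query_id]
        return self._execute_at(query, date)


def fake_executor(query: str, date: datetime) -> str:
    # Placeholder: a real implementation would query a database Version.
    return f"result of {query!r} as of {date.isoformat()}"


store = QueryStore(fake_executor)
when = datetime(2015, 9, 23, tzinfo=timezone.utc)
qid = store.register("SELECT ALL WHERE AtomSymbol = 'Fe'", when)
print(qid, store.get_query(qid), store.results_for(qid))
```

Resolving an ID to results goes through the same date-based execution path, so the fourth service is a composition of the second and third.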

Concluding remarks / open questions about the query store

- How to deal with confidentiality of the information? Do we need an authentication/authorization policy on the query store? Is the sketched log service compliant with EU law on confidentiality?
- We are providing users with the tools to cite our dynamic data efficiently, but how can we be sure that they will use them for citing our data? In other words, how do we instill the "citation instinct" in our end users? We are thinking of proposing a "reverse approach": we may cite the users accessing our data. They will accept these terms, which will be explained in the conditions of use of the VAMDC services.
- How to prevent plagiarism? A user might extract data, modify them, and cite them as the original extracted data. Do we have tools for preventing such behavior? An MD5 of the extracted data in the query store?
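The MD5 idea raised in the last question can be sketched as follows: the query store keeps a digest of what was delivered for each query ID, and anyone can later check a file against it. The function names and the in-memory `digests` table are hypothetical. Two caveats apply: the comparison is byte-for-byte, so a harmless re-serialization of the XML would also change the digest, and MD5 is not collision-resistant, so a stronger hash such as SHA-256 may be preferable if deliberate tampering is the concern.

```python
import hashlib
from typing import Dict

# Hypothetical table kept by the query store: query ID -> MD5 of delivered data.
digests: Dict[str, str] = {}


def record_extraction(query_id: str, data: bytes) -> str:
    """Store and return the MD5 digest of the data delivered for a query."""
    digest = hashlib.md5(data).hexdigest()
    digests[query_id] = digest
    return digest


def is_unmodified(query_id: str, data: bytes) -> bool:
    """Check whether the data match what was originally delivered."""
    return digests.get(query_id) == hashlib.md5(data).hexdigest()


original = b"<XSAMSData><!-- placeholder content --></XSAMSData>"
record_extraction("q1", original)

tampered = original.replace(b"placeholder", b"modified")
print(is_unmodified("q1", original))   # data as delivered
print(is_unmodified("q1", tampered))   # data altered after extraction
```

A check of this kind detects modification after the fact; it does not by itself prevent a user from citing altered data.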