The OGSA-DAI Project: Databases and the Grid. Neil Chue Hong, Project Manager, EPCC, Edinburgh
What is OGSA-DAI? It is a project: –OGSA Data Access and Integration: funded by the UK eScience Grid Core Programme It is a vision: –From simple database access to truly virtualised data resources It is a standard: –The GridDataService Specification from the Data Access and Integration Working Group (DAIS-WG) of the Global Grid Forum (GGF) It is software that you can use: –Current version is R2.5
OGSA-DAI Objective To define uniform service interfaces, based on open standards and open source, for accessing heterogeneous data sources within the Open Grid Services Architecture (OGSA) framework Why? –Because we increasingly want to integrate data sources from different organisations –The Grid and OGSA appear to provide a framework for producing software to do this
Who are we? £3 million, 18 months, started February 2002. Funded by the Grid Core Programme. Contributing to the global grid computing community. Partners: EPCC & NeSC, IBM UK, IBM USA, Manchester e-SC, Newcastle e-SC, Oracle. 373 man months. [Map labels: IBM USA, Oxford, Glasgow, Cardiff, Southampton, London, Belfast, Daresbury Lab, RAL, EPCC & NeSC, Newcastle, IBM Hursley, Oracle, Manchester, Cambridge, Hinxton]
What are we doing? [Diagram labels: Grid Plumbing & Security Infrastructure; Scheduling, Accounting, Monitoring, Diagnosis, Logging; Data Intensive Applications; Data & Storage Resources; Distributed; Authorisation; Data Access; Data Integration; Structured Data; Scientific Data Mining & Integration Technology. Groups: Operations Team; App. Developers; Owners; Data Intensive Application Scientists; Data Providers; Data Curators; Tech. Developers.] Keep all the groups happy
Project Requirements Derived from project requirements survey –see DAIS WG Driven by Technical Authority and Early Adopters –AstroGrid –MyGrid Close relationship with many other projects
DAIS WG GridDatabaseService Specification –DAIS WG of the GGF –Aim to produce a V1.0 specification by early 2004 –Defines an interface for a GridDatabaseService –Many contributors, not just OGSA-DAI Project –OGSA-DAI (the software) seeks to be a reference implementation of this standard But does not necessarily track it exactly just now –Requirements and Overview Informational documents also published
The OGSA-DAI Approach Reuse existing technologies and standards –OGSA, Query languages, Java, transport Three key services: –GridDataService –GridDataServiceFactory –DAIServiceGroupRegistry Benefits: –Location independence –Hides heterogeneity –Scalable –Flexible –Dynamic
OGSA-DAI Positioning - Today [Diagram labels: Location; Meta Data; Notification; OGSA; Lifetime; Drivers; Query (Create, Retrieve, Update, Delete); Data Format; OGSA-DAI Basic Services; OGSA-DAI Distributed Query; Delivery; Database, Communication, OS… Technology; GDS; DAISGR; GDSF]
OGSA-DAI in one slide
OGSA-DAI To Date Assuming that OGSA becomes the standard framework –Have adopted the OGSA approach Have first concentrated on data access –Released software has only limited data integration so far –Distributed query processor prototype due in July Implementation provides focus on basic functionality first –But architecturally we have tried to answer many pertinent questions –Functionality will increase over subsequent releases
GDS in action
Databases: Xindice, MySQL, Oracle, DB2
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with SQL, XPath, XQuery etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
OR 3d. Results of query delivered to consumer as XML
[Diagram: Analyst, Registry (DAISGR), Factory (GDSF), Grid Data Service (GDS), Consumer, Database. Legend: SOAP/HTTP, service creation, API interactions]
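The steps above can be sketched as a small in-process simulation. This is only an illustration of the registry, factory and service roles, not the OGSA-DAI API: every class and method name here is hypothetical, an in-memory SQLite database stands in for the real database, plain Python object references stand in for service handles, and the table columns are invented.

```python
import sqlite3
from xml.etree import ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: manages access to one database."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, sql, params=()):
        # Step 3b: the service, not the client, talks to the database.
        cursor = self._connection.execute(sql, params)
        columns = [description[0] for description in cursor.description]
        # Step 3c: results go back to the client as XML.
        root = ET.Element("results")
        for row in cursor:
            row_element = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_element, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class GridDataServiceFactory:
    """Hypothetical stand-in for a GDSF: creates GDS instances on request."""

    def __init__(self, connection):
        self._connection = connection

    def create_service(self):
        # Steps 2b and 2c: create a GDS and hand it back to the client.
        return GridDataService(self._connection)


class Registry:
    """Hypothetical stand-in for a DAISGR: maps topics to factories."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories.setdefault(topic, []).append(factory)

    def find(self, topic):
        # Steps 1a and 1b: look up factories that serve data about a topic.
        return list(self._factories.get(topic, []))


def demo():
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE littleblackbook (id INTEGER, name TEXT)")
    connection.execute("INSERT INTO littleblackbook VALUES (10, 'Ada')")

    registry = Registry()
    registry.register("littleblackbook", GridDataServiceFactory(connection))

    factory = registry.find("littleblackbook")[0]   # steps 1a, 1b
    service = factory.create_service()              # steps 2a to 2c
    return service.perform(                         # steps 3a to 3c
        "SELECT * FROM littleblackbook WHERE id=?", (10,)
    )
```

Calling `demo()` walks through steps 1a to 3c in order. Delivery to a third-party consumer (step 3d) is left out of this sketch.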
Activities OGSA-DAI is structured around the concept of activities This framework allows new functionality to be added easily Three types of activity at present: –statement (e.g. SQLQuery, XUpdate) –transformation (e.g. XSL translation, compression) –delivery (e.g. GridFTP) OGSA-DAI provides implementations of common functionality, which others can extend
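One way to picture the three activity types is as a chain in which each activity consumes the output of the previous one. The sketch below is an assumption about the general shape only, not OGSA-DAI's activity framework: the class names and the `run` method are hypothetical, SQLite replaces the real database, and delivery appends to an in-memory list instead of using GridFTP.

```python
import sqlite3
import zlib


class SQLQueryActivity:
    """Statement activity (hypothetical): runs a query, emits CSV bytes."""

    def __init__(self, connection, sql, params=()):
        self._connection = connection
        self._sql = sql
        self._params = params

    def run(self, _data):
        rows = self._connection.execute(self._sql, self._params).fetchall()
        lines = [",".join(str(value) for value in row) for row in rows]
        return "\n".join(lines).encode("utf-8")


class CompressionActivity:
    """Transformation activity (hypothetical): compresses its input."""

    def run(self, data):
        return zlib.compress(data)


class ListDeliveryActivity:
    """Delivery activity (hypothetical): appends output to a list sink."""

    def __init__(self, sink):
        self._sink = sink

    def run(self, data):
        self._sink.append(data)
        return data


def run_activities(activities, data=None):
    # Feed each activity's output into the next one in the chain.
    for activity in activities:
        data = activity.run(data)
    return data
```

New functionality is added by writing another class with a `run` method and placing it in the chain, which is the extensibility property the slide describes.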
Documents Accessing a Grid Data Resource is done using Documents –caveat: this may change A document allows you to: –define parameters –execute activities –deliver results Written in XML, normally used by a client. [Example document; XML markup not preserved. Surviving content: 10 and SELECT * FROM littleblackbook WHERE id=?]
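A client could assemble such a document programmatically. The sketch below builds a small XML document carrying a parameter, a statement and a delivery instruction. The element and attribute names (`perform`, `sqlQueryStatement`, `sqlParameter`, `expression`, `deliver`) are assumptions chosen for illustration and should not be read as the actual OGSA-DAI document schema.

```python
from xml.etree import ElementTree as ET


def build_perform_document(sql, parameters):
    """Build an illustrative request document (hypothetical element names)."""
    root = ET.Element("perform")

    # Execute an activity: one SQL statement, named so others can refer to it.
    statement = ET.SubElement(root, "sqlQueryStatement", {"name": "statement"})

    # Define parameters: one element per '?' placeholder, numbered from 1.
    for position, value in enumerate(parameters, start=1):
        parameter = ET.SubElement(
            statement, "sqlParameter", {"position": str(position)}
        )
        parameter.text = str(value)

    ET.SubElement(statement, "expression").text = sql

    # Deliver results: point a delivery step at the statement's output.
    ET.SubElement(root, "deliver", {"from": "statement"})

    return ET.tostring(root, encoding="unicode")


document = build_perform_document(
    "SELECT * FROM littleblackbook WHERE id=?", [10]
)
```

Keeping the parameter separate from the statement text mirrors the `id=?` placeholder in the slide's example and avoids splicing values into the SQL string.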
OGSA-DAI Core Services OGSA-DAI Release 2.5 – out now –Java, Tomcat, Globus Toolkit 3 Beta –Supports MySQL, DB2, Xindice; SQL92, XPath, XUpdate OGSA-DAI Release 3 – end July –Java, Tomcat, Globus Toolkit 3.0 –Supports MySQL, DB2, Oracle, Xindice; SQL92, XPath, XUpdate –Adds Notification, Internationalisation, Transactions, Caching Continue to track Globus Toolkit 3 releases –Experimental, then production, GT3 grids will help
Data Resource Implementation Mapping
Activity Mapping
Asynchronous Delivery [Diagram: two panels, "Asynchronous delivery – Pull" and "Asynchronous delivery – Push". Each panel shows Client, Consumer, DB, GDS, GDT and a GDS Instance. Message labels in the Pull panel: Ra, Q, Rs, DT, GSH/R + data id, D + GDH. Message labels in the Push panel: Ra, Q + D + GSH/R, Rs, DT, GSH/R]
GDS Composition [Diagram: five numbered configurations (1 to 5) combining Client, GDS, Operation and DB elements]
Distributed Query Service A higher level service: –Extension of Polar* query processor, partitions and schedules queries –Sits on top of OGSA and OGSA-DAI Defines new portTypes and services –GridDistributedQuery (GDQ) PortType –GridDistributedQueryService (GDQS) – wraps Polar* –GridQueryEvaluatorService (GQES) – performs subqueries Currently based on OGSA-DAI Release 1.5
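The division of labour between a coordinating query service and its evaluators can be illustrated with a toy example. This is not Polar* and makes no attempt at query planning or scheduling: it simply sends the same subquery to every evaluator in turn and merges the partial results. The class names, the `samples` table and its columns are all hypothetical, and each evaluator uses its own in-memory SQLite database.

```python
import sqlite3


class QueryEvaluator:
    """Hypothetical stand-in for a GQES: runs a subquery on local data."""

    def __init__(self, rows):
        self._database = sqlite3.connect(":memory:")
        self._database.execute("CREATE TABLE samples (name TEXT, score REAL)")
        self._database.executemany("INSERT INTO samples VALUES (?, ?)", rows)

    def evaluate(self, sql, params=()):
        return self._database.execute(sql, params).fetchall()


class DistributedQueryCoordinator:
    """Hypothetical stand-in for a GDQS: fans out subqueries and merges."""

    def __init__(self, evaluators):
        self._evaluators = list(evaluators)

    def query(self, min_score):
        subquery = "SELECT name, score FROM samples WHERE score >= ?"
        partial_results = []
        # Sequential fan-out keeps the sketch simple; a real service
        # would schedule subqueries across machines.
        for evaluator in self._evaluators:
            partial_results.extend(evaluator.evaluate(subquery, (min_score,)))
        # Merge step: order the combined rows by score, highest first.
        return sorted(partial_results, key=lambda row: row[1], reverse=True)


coordinator = DistributedQueryCoordinator([
    QueryEvaluator([("a", 0.9), ("b", 0.2)]),
    QueryEvaluator([("c", 0.5)]),
])
merged = coordinator.query(0.4)
```

The client sees one query and one result set, while the filtering happens next to each data source.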
DQS Architecture
DQP in action
DQS: the future The GridDistributedQueryService –is an example of a higher level data integration service which utilises OGSA-DAI core services –Assumes that GDSF, GDQS Factory and client live in different containers –Really requires a well-defined meta-model for the physical schema of a database Being partially addressed in DAIS WG –Shows how a GDS can be both client and service Service hierarchy and composition DAIT (proposed follow-on to OGSA-DAI) would produce a robust reference implementation of the DQP components
Projects using OGSA-DAI Industry: –FirstDIG: business process analysis (with First Transport Group); OGSA-DAI with data mining. Collaborative: –Bridges: database integration over six geographically distributed genomics research sites (with IBM UK); OGSA-DAI with DiscoveryLink –eDIKT: porting OGSA-DAI to other platforms; OGSA-DAI with performance –DEISA: linking Europe’s HPC centres; OGSA-DAI with distributed accounting –MS.Net Grid: porting OGSA-DAI to the .NET framework (with Microsoft Research UK); OGSA-DAI with .NET
ODD Genes OGSA-DAI used to query gene expression data resources at GTI and HGU –One data resource: low spatial resolution, high gene resolution –Other resource: high spatial resolution, low gene resolution –Query one database and use data to find correct data resource to run more detailed query and produce visualisation –Simple example of data integration at work Client Query Render GTI GDS EPCC HGU
Project Timeline [Diagram: timeline from Feb ’02 to Sep ’03. Axis: Feb ’02, May ’02, Jul ’02, Sep ’02, Dec ’02, Feb ’03, May ’03, Sep ’03. Phases: Phase 1 Starts; Phase 2 Starts. Milestones: Design Documents & Demos for DAIS; XML + OGSA Prototype Available; RDB + GT2 / OGSA Prototypes Available; XML + OGSA Prototypes for Early Adopters; WS + GSI; WG Papers & Prototypes; Ship Release 1 (Jan 15th 2003); Release 1.5 (Feb 28th 2003); Release 2; Release 2.5; Release 3; UK support (> 100 downloads); today. GGF meetings: GGF5, GGF6, GGF7. Globus Toolkit: TP4, TP5, GT3 A1, GT3 A2, GT3 A3, GT3 A4, GT3 Beta, GT3 Final. Other labels: OGSADAI, NeSC, Early Adopters]
A DAIT for the Future DAIT (Data Access and Integration Two) –follow on project from OGSA-DAI, funded for two years –continue to research, prototype and productise –release every six months, R4 in December 2003 –R4: support for SQL Server and structured filesystems; extended DBMS management functionality (e.g. archive); bulk load operations (where supported); support for DFDL file access; triggers exposed through notification –R5: Distributed Query Processing; Distributed Transactions; Virtualised views across databases
Further information The OGSA-DAI Project Site: – The DAIS-WG site: – OGSA-DAI Users Mailing list –General discussion on grid data access and integration Formal support for OGSA-DAI releases – + OGSA-DAI training courses –