OGSA-DAI Status and Benchmarks
Neil Chue Hong, Project Manager, EPCC
All Hands Meeting 2005, Nottingham, 22 September 2005

Overview
– The all-new OGSA-DAI overview
– Benchmarking and profiling work
– Project collaboration
– Future plans

OGSA-DAI team
– IBM Development Team, Hursley
– NEReSC, Newcastle
– NeSC, Edinburgh
– EPCC Team, Edinburgh
– ESNW, Manchester
– IBM Dissemination Team

OGSA-DAI In One Slide
– An extensible framework for data access and integration.
– Expose heterogeneous data resources to a grid through web services.
– Interact with data resources:
  – Queries and updates.
  – Data transformation / compression.
  – Data delivery.
– Customise for your project using:
  – Additional Activities
  – Client Toolkit APIs
  – Data Resource handlers
– A base for higher-level services:
  – federation, mining, visualisation, …

The OGSA-DAI Framework
[Diagram. Labels: Application; Client Toolkit; OGSA-DAI service; Engine; Activities (SQLQuery, XPath, readFile, XSLT, GZip, GridFTP); Data Resources (JDBC, XMLDB, File); Databases (MySQL, DB2, SQL Server, XIndice, SWISS-PROT)]

Extensibility Example
[Diagram. Labels: OGSA-DAI service; Engine; SQLQuery; JDBC; MySQL; Multiple SQL GDS; several SQL/JDBC pairs]

Timeline
[Timeline figure. Labels as extracted: Release 1 interim Release 2 Release 2 interim Release 3 Release 3.1 Release 4 Release 5 OGSI Release 6 Release 1 OGSA-DAI WSRF 1.0 OGSA-DAI WS-I 1.0/ OGSA-DAI WS-I 1.1 (OMII)]

Release downloads
[Chart: release downloads. Data up to 28/07/05]

Geographical download profiles

OGSI             WSRF             WS-I
China (28%)      China (32%)      UK (30%)
UK (20%)         UK (19%)         China (28%)
US (12%)         Germany (8%)     US (8%)
Unknown (10%)    US (7%)          Japan (7%)

Data up to 29/07/05

Our stakeholders
– OMII
  – Current version of OGSA-DAI WS-I 1.0 distribution runs on OMII
  – Release 1.1 due out soon
  – Issues when security is introduced
– Globus
  – WSRF distribution bundled with GT4.0
  – WSRF 1.0 distribution bundled with GT4.0.1
– Projects
  – A number of projects have used, use, or will use OGSA-DAI:
    AstroGrid, Biogrid, BioSimGrid, Bridges, caGrid, DataMiningGrid, eDiamond, FirstDig, GEDDM, GeneGrid, GEON, GridMiner, INWA, IU RGRBench, LEAD, MCS, myGrid, N2Grid, ODD-Genes, OGSA-WebDB, SIMDAT, GOLD

Out with the old…
[Diagram. Tiers: Client, Server, Data. Labels: Client; Client Toolkit API; SOAP; DAISGR; GDSF; GDS; Relational; XML; Files]

… in with the new!
[Diagram. Tiers: Client, Server, Data. Labels: Client; Generic Client Toolkit API (WS-I, WSRF); SOAP; Data Service (WSRF, WS-I); DAI Core; DSR; Relational; XML; Files]

Changes in moving to WSRF/WS-I
– Registry component (DAISGR) no longer supported
  – Hope to leverage third-party registration services
  – GRIMOIRES
  – Others …
– GDS/GDSF roles combined
  – Use data services
  – Currently static services but
  – Reconfigurable services
– Improvements to the GDS
  – Data resource abstraction decoupled from the service
  – Renaming (consistent naming across platform versions)
  – Ability to enforce control flow constraints (ordering activities)
  – Refactored exception framework
– Temporary set-backs (we promise we’ll fix them)
  – No security model
  – No concurrency
  – Previously used GDSs for concurrency
  – Support now moving to the engine

The Client Toolkit (CTk)
– Provides programmatic abstraction for perform documents
  – Do not have to write XML explicitly
– Abstraction over WS-I and WSRF services at client side
  – don’t need to know what type of service is at the other end (almost)
  – security model is the remaining issue
– Currently only a Java version of the CTk
  – Stabilising API
  – Publish an API document
  – Allow third parties to develop CTk for other programming languages
[Diagram. Labels: Client; Generic Client Toolkit API; WS-I; WSRF]

The Server Side
– Presentation layer:
  – Deal with messaging differences
  – Get one version per distribution
– Core/Business Logic:
  – Common to all distributions
  – Data Service Resource (DSR)
– Data Layer:
  – Relational databases
  – XML document repositories
  – File based repositories
– New architecture being rolled out
  – see Malcolm’s talk in next session
  – concurrency, sessions and transactions
[Diagram. Labels: Data Service (WSRF, WS-I); DAI Core; DSR; Relational; XML; Files]

Benchmarking/Profiling
– Establish benchmark suite to:
  – Measure performance gains/losses between releases
  – Reveal implementation issues
  – Allow focused improvements
  – Establish best practice
  – Summer intern (Heather Kelly) produced results
– Profiling allows us to identify particular areas which are causing poor performance in the benchmarks
  – Summer intern (Radoslaw Ostrowski) extended Netlogger and did some profiling
– Most of the results are for OGSA-DAI R6
  – one slide showing what is happening in R7

Configuration
– Measure the time to:
  – Send SQL query to server
  – Return nRows
  – Sum the values in one of the columns
– Do this 30 times
  – Calculate mean and standard deviation
– Repeat the process having increased nRows by stepsize
– Try various different databases
– Notes:
  – Time to establish connection in JDBC runs not included
  – JDBC does not return results in WebRowSet format
  – Server is already running
– Data source: little blackbook
  – Test database included in distributions
– Test environment (diagram labels; each machine and software stack appears twice on the slide):
  – Windows XP Pro SP2, Intel PIII 863 MHz, 512 MB RAM
  – SunOS 5.9, UltraSPARC-IIe 502 MHz, 128 MB RAM
  – Tomcat, GT, OGSA-DAI OGSI R6.0, j2sdk 1.4.2_01
  – 10 Mbit network
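
The measurement loop described on this slide can be sketched as follows. This is a minimal illustration, not the benchmark suite used for the results: it assumes an in-memory SQLite table as a stand-in for the test database, and the table name `entries`, the column names, and the function names are all hypothetical. It times "send query, return nRows, sum one column" over a number of runs, reports mean and standard deviation, and then repeats with nRows increased by stepsize.

```python
import sqlite3
import statistics
import time


def make_test_db(total_rows):
    """Create an in-memory SQLite table (a stand-in, not the real test database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.executemany(
        "INSERT INTO entries (id, value) VALUES (?, ?)",
        ((i, i % 100) for i in range(1, total_rows + 1)),
    )
    conn.commit()
    return conn


def time_query(conn, n_rows, runs=30):
    """Time 'query, fetch n_rows, sum one column' over several runs.

    Returns (mean_seconds, stdev_seconds, column_sum).
    Connection set-up is deliberately outside the timed region.
    """
    timings = []
    total = 0
    for _ in range(runs):
        start = time.perf_counter()
        cursor = conn.execute(
            "SELECT id, value FROM entries WHERE id <= ?", (n_rows,)
        )
        total = sum(row[1] for row in cursor)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings), total


def run_benchmark(max_rows=10000, stepsize=500, runs=30):
    """Repeat the measurement, increasing n_rows by stepsize each time."""
    conn = make_test_db(max_rows)
    results = []
    for n_rows in range(stepsize, max_rows + 1, stepsize):
        mean, stdev, _ = time_query(conn, n_rows, runs)
        results.append((n_rows, mean, stdev))
    conn.close()
    return results


if __name__ == "__main__":
    for n_rows, mean, stdev in run_benchmark():
        print(f"{n_rows:6d} rows: mean {mean * 1e3:.3f} ms, stdev {stdev * 1e3:.3f} ms")
```

The defaults mirror the parameters quoted on the chart slides (nRows up to 10000, 30 runs, stepsize 500). Timings from a local SQLite database say nothing about OGSA-DAI itself; only the shape of the loop carries over.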

Some benchmarks
– Relational query
  – StreamServlet requires two communications; could improve this
  – FTP not iterating over result set
  – JDBC scales much better than SOAP
– ResultSet implementations
  – Forwards-backwards implementation builds DOM tree; larger memory footprint

[Chart: MySQL (nRows = 10000, number of runs = 30, stepsize = 500, blockSize = 200)]
[Chart: DB2 (nRows = 10000, number of runs = 30, stepsize = 500, blockSize = 200)]
[Chart: PostgreSQL (nRows = 10000, number of runs = 30, stepsize = 500, blockSize = 200)]
[Chart: SQL Server (nRows = 10000, number of runs = 30, stepsize = 500, blockSize = 200)]
[Chart: Oracle (nRows = 10000, number of runs = 30, stepsize = 500, blockSize = 200)]
[Chart: OGSA-DAI WS-I (nRows = 10000, number of runs = 30, stepsize = 500)]
[Chart: Database comparison (OGSA-DAI WSRF 1.0, nRows = 10000, number of runs = 30, stepsize = 500)]
[Chart: Platform comparison (MySQL database, nRows = 10000, number of runs = 30, stepsize = 500)]

Profiling: better RowSet conversion
[Chart: ResultSet to RowSet conversion]

R6 -> R7: removal of RowSet
[Chart]

Challenges
– Intermediate representation
  – between multiple models (relational, XML, …)
  – XML WebRowSet is flexible (cf. GridMiner) but expansive
  – DFDL and GridFTP/parallel HTTP?
– Query definition
  – translation of queries
– Data transport and workflow
  – workflow is typically compute driven
– Move computation to data
  – mobile code activities?
  – data services hosted on DBMS?
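
To make the "flexible but expansive" point concrete, the sketch below encodes the same rows once as a simple element-per-column XML document and once as CSV, then compares sizes with and without gzip. The XML layout here is a deliberately simplified stand-in and is not the WebRowSet schema; the function names and sample data are hypothetical.

```python
import csv
import gzip
import io
import xml.etree.ElementTree as ET


def rows_to_xml(columns, rows):
    """Encode rows as simple XML (illustrative layout, not the WebRowSet schema)."""
    root = ET.Element("rowset")
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            col_el = ET.SubElement(row_el, "column", {"name": name})
            col_el.text = str(value)
    return ET.tostring(root, encoding="utf-8")


def rows_to_csv(columns, rows):
    """Encode the same rows as CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def size_report(columns, rows):
    """Return byte counts for each encoding, raw and gzip-compressed."""
    xml_bytes = rows_to_xml(columns, rows)
    csv_bytes = rows_to_csv(columns, rows)
    return {
        "xml": len(xml_bytes),
        "csv": len(csv_bytes),
        "xml_gzip": len(gzip.compress(xml_bytes)),
        "csv_gzip": len(gzip.compress(csv_bytes)),
    }


if __name__ == "__main__":
    columns = ["id", "name", "phone"]
    rows = [(i, f"person{i}", f"555-{i:04d}") for i in range(1000)]
    for label, size in size_report(columns, rows).items():
        print(f"{label:9s} {size:8d} bytes")
```

The per-value markup is what makes the XML form larger; compression recovers much of that on the wire, which is one reason a compression transform is attractive before delivery.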

caBIG
– “Object-Oriented” view of data
  – Data types are well-defined and registered in a repository
  – Standardized metadata facilitates discovery
  – custom query language implemented as an activity

LEAD
[Diagram. Sites: IU; NCSA Illinois; UA Huntsville; Millersville; UCAR Unidata; Okla Univ; Master catalog]
– Each satellite replicates its contents to the master catalog

Users Group and DIALOGUE Workshops
– 3rd Users Group meeting
  – June 1st
– DIALOGUE Workshops
  – Data Integration Applications: Linking Organisations to Gain Understanding and Experience
  – Columbus, Edinburgh, Vienna, Indiana
  – Bringing together Data Integration middleware and application providers with users

Future plans
– A new version of the OGSA-DAI Engine
  – should look mostly the same externally
  – better support for concurrency, sessions and monitoring
  – see Architecture paper/talk presented on Monday
– Implementing new versions of specifications
  – DAIS Specifications
– Key things that we will be addressing after Release 7:
  – Performance
  – A Security Model which can be applied across platforms
  – Full Transactions provision, including implementation of compensatory activities, distributed transactions
  – More data integration facilities
  – Better abstraction over DBMS variation

Conclusions
– OGSA-DAI has had to undergo significant refactoring to keep stakeholders happy
– Refactoring has allowed us to create an extensible framework which can be used for many data related tasks
– We need to identify the components and improvements which will be useful to users
– There is obviously room for improvement on performance, and we are working on it

Further information
– The OGSA-DAI Project Site
– The DAIS-WG site
– OGSA-DAI Users Mailing list
  – General discussion on grid DAI matters
– Formal support for OGSA-DAI releases
– OGSA-DAI training courses

Core features of OGSA-DAI – I
– A framework for building applications
  – Supports data access, insert and update
    – Relational: MySQL, Oracle, DB2, SQL Server, Postgres
    – XML: Xindice, eXist
    – Files: CSV, BinX, EMBL, OMIM, SWISSPROT, …
  – Supports data delivery
    – SOAP over HTTP
    – FTP; GridFTP
    – Inter-service
  – Supports data transformation
    – XSLT
    – ZIP; GZIP
  – Supports security
    – X.509 certificate based security

Core features of OGSA-DAI – II
– A framework for building data clients
  – Client toolkit library for application developers
– A framework for developing functionality
  – Extend existing activities, or implement your own
  – Mix and match activities to provide functionality you need
– Highly extensible
  – Customise our out-of-the-box product
  – Provide your own services, client-side support and data-related functionality
– Comprehensive documentation and tutorials
– Latest release supports GT3.2 (to be deprecated), GT4.0, and Axis 1.2 / OMII_2 using Java 1.4

OGSA-DAI Design Principles – I
– Efficient client-server communication
  – Minimise where possible
  – One request specifies multiple operations
– No unnecessary data movement
  – Move computation to the data
  – Utilise third-party delivery
  – Apply transforms (e.g., compression)
– Build on existing standards
  – Fill in gaps where necessary

OGSA-DAI Design Principles – II
– Do not hide underlying data model
  – Users must know where to target queries
  – Data virtualisation is hard
– Extensible architecture
  – Modular and customisable
  – e.g., to accommodate stronger security
– Extensible activity framework
  – Cannot anticipate all desired functionality
  – Activity = unit of functionality
  – Allow users to plug in their own
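
The "activity = unit of functionality" idea, together with the principle that one request can specify several operations, can be illustrated with a small plug-in registry. This is a sketch of the general pattern only: OGSA-DAI itself is Java, and none of the names below (`register`, `perform`, the activity names) are taken from its API.

```python
import gzip
from typing import Callable, Dict, List

# An activity here is any function from bytes to bytes (an assumption
# made for this sketch; real activities have richer inputs and outputs).
Activity = Callable[[bytes], bytes]

REGISTRY: Dict[str, Activity] = {}


def register(name: str) -> Callable[[Activity], Activity]:
    """Decorator that plugs an activity into the registry under a name."""
    def decorator(func: Activity) -> Activity:
        REGISTRY[name] = func
        return func
    return decorator


@register("toUpper")
def to_upper(data: bytes) -> bytes:
    """A built-in transformation activity."""
    return data.upper()


@register("gzip")
def gzip_compress(data: bytes) -> bytes:
    """A built-in compression activity."""
    return gzip.compress(data)


def perform(activity_names: List[str], data: bytes) -> bytes:
    """Run the named activities in order, feeding each output to the next.

    The list plays the role of a single request that names several
    operations, so the client makes one call instead of one per step.
    """
    for name in activity_names:
        if name not in REGISTRY:
            raise KeyError(f"unknown activity: {name}")
        data = REGISTRY[name](data)
    return data


# A user-supplied activity is added the same way as the built-in ones.
@register("reverse")
def reverse(data: bytes) -> bytes:
    return data[::-1]
```

A client would call, for example, `perform(["reverse", "toUpper", "gzip"], payload)`; the engine needs no change when a new activity is registered.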

Data Integration challenges
– Metadata extraction
  – define a common model for e.g. database schema?
– Intermediate representation
  – between multiple models (relational, XML, …)
  – XML WebRowSet is flexible (cf. GridMiner) but expansive
  – DFDL and GridFTP/parallel HTTP?
– Query definition
  – translation of queries
– Data transport and workflow
  – workflow is typically compute driven
– Move computation to data
  – mobile code activities?
  – data services hosted on DBMS?

Contributing to OGSA-DAI
– Additional functionality:
  – Provide activities which implement specific functionality
  – Provide extra client functionality
  – Provide different security mechanisms
  – Provide higher level components and applications
– Different levels of contributions
  – Based on OGSA-DAI?
  – Works with OGSA-DAI?
  – Part of OGSA-DAI?

Distributed Query Processing
– Queries mapped to algebraic expressions for evaluation
– Parallelism represented by partitioning queries
  – Use exchange operators
– Prototype available
– Being integrated into OGSA-DAI
[Query plan diagram. Operators: table_scan (protein); table_scan termID=S92 (proteinTerm); reduce; hash_join (proteinId); op_call (Blast); reduce; exchange. Node labels: 3,4; 1; 2]
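
The role of the exchange operator can be sketched with a toy partitioned hash join. This is an illustration of the general technique and not the DQP prototype: the operator names are borrowed from the plan on the slide, but the function signatures, the row format (Python dicts) and the sample data are assumptions. The `op_call (Blast)` and `reduce` steps are omitted.

```python
from collections import defaultdict


def table_scan(rows, predicate=None):
    """Yield rows from a table, optionally filtered by a predicate."""
    for row in rows:
        if predicate is None or predicate(row):
            yield row


def exchange(rows, key, n_partitions):
    """Redistribute rows into partitions by hashing the join key.

    Rows with equal keys always land in the same partition, so each
    partition can be joined independently (for example on separate nodes).
    """
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[key]) % n_partitions].append(row)
    return partitions


def hash_join(build_rows, probe_rows, key):
    """Build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


def partitioned_join(left_rows, right_rows, key, n_partitions):
    """Exchange both inputs on the key, join per partition, merge results."""
    left_parts = exchange(left_rows, key, n_partitions)
    right_parts = exchange(right_rows, key, n_partitions)
    result = []
    for left_part, right_part in zip(left_parts, right_parts):
        result.extend(hash_join(left_part, right_part, key))
    return result


# Hypothetical sample tables, named after the plan on the slide.
PROTEINS = [
    {"proteinId": "P1", "sequence": "MKT"},
    {"proteinId": "P2", "sequence": "GAV"},
    {"proteinId": "P3", "sequence": "LLS"},
]
PROTEIN_TERMS = [
    {"proteinId": "P1", "termID": "S92"},
    {"proteinId": "P2", "termID": "S10"},
    {"proteinId": "P3", "termID": "S92"},
]
```

In this sketch the partitions are processed in a loop on one machine; in a distributed setting each partition pair would be handed to a different evaluator, which is where the parallelism comes from.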

FirstDIG
– Data mining with the First Transport Group, UK
  – Example: “When buses are more than 10 minutes late there is an 82% chance that revenue drops by at least 10%”
[Diagram. Labels: OGSA-DAI; OGSA-DAI Client Application; Data Mining Application]
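
A statement of the form "when A holds there is an X% chance of B" is the confidence of a rule A => B, that is, the fraction of records satisfying A that also satisfy B. The sketch below shows that calculation only. The records are synthetic and the field names are hypothetical; they are not FirstDIG data and do not reproduce the 82% figure.

```python
def rule_confidence(records, antecedent, consequent):
    """Return the fraction of records matching `antecedent` that also match `consequent`.

    Returns None when no record matches the antecedent.
    """
    matching = [record for record in records if antecedent(record)]
    if not matching:
        return None
    hits = sum(1 for record in matching if consequent(record))
    return hits / len(matching)


def is_late(record):
    """Antecedent: bus more than 10 minutes late."""
    return record["minutes_late"] > 10


def revenue_dropped(record):
    """Consequent: revenue down by at least 10%."""
    return record["revenue_change_pct"] <= -10.0


# Synthetic records for illustration only.
SAMPLE_RECORDS = [
    {"minutes_late": 15, "revenue_change_pct": -12.0},
    {"minutes_late": 12, "revenue_change_pct": -3.0},
    {"minutes_late": 2, "revenue_change_pct": -11.0},
    {"minutes_late": 20, "revenue_change_pct": -25.0},
]

if __name__ == "__main__":
    confidence = rule_confidence(SAMPLE_RECORDS, is_late, revenue_dropped)
    print(f"confidence: {confidence:.2f}")
```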

GridMiner
– Test application area: medical
  – traumatic brain injury treatment
  – Predicting the outcome of seriously ill patients
  – analytical part focuses on data mining and On-Line Analytical Processing (OLAP)
– Target:
  – provide tools to discover and access relevant knowledge and information from different distributed and heterogeneous data sources
  – building on and extending OGSA-DAI

GridMiner Scenario
– Heterogeneities:
  – Name in A is “First Last” (as the target format)
  – Name in C has to be combined
– Distribution:
  – 3 data sources
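
The name heterogeneity in this scenario amounts to a per-source mapping into the target format. A minimal sketch, assuming source A stores the full name in one field and source C stores first and last names separately; the field names are hypothetical, and the slide does not describe source B, so it is left out.

```python
def name_from_source_a(record):
    """Source A already uses the target 'First Last' format."""
    return record["name"].strip()


def name_from_source_c(record):
    """Source C keeps the parts separately, so they have to be combined."""
    return f'{record["first_name"].strip()} {record["last_name"].strip()}'


def unify_names(records_a, records_c):
    """Map records from both sources to the common target format."""
    names = [name_from_source_a(record) for record in records_a]
    names.extend(name_from_source_c(record) for record in records_c)
    return names
```

For example, `unify_names([{"name": "Ada Lovelace"}], [{"first_name": "Alan", "last_name": "Turing"}])` yields both names in "First Last" form.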

Software Process
[Process diagram. Groups: DEVELOPERS; USERS; REVIEW. Stages: Use Cases; Requests; Reqs.; Prioritisation; Prototype; Design; Implement; Testing; Fix Bugs; QA; Release; Support; Training; Dissem.; Contribs; Ingest. Review bodies: Programme Board; Technical Review Board; Technical Reviewer; Users’ Group. Testing: Test Cases; nightly unit + system tests; additional test cases; system tests based on reqs. Notes: continual process → deep track features; peer review and inspection]

[INWA diagram. Sites: Curtin, Australia; EPCC, UK. Labels: Grid Engine; Bank; Telco; OGSA-DAI; TOG; Data Browser; Telco data; Bank data; Australian property; UK property]