Web: OGSA-DAI 3.0 Ally Hume, Amy Krause OGSA-DAI Workshop 17th October 2007
Web: Overview What is OGSA-DAI? Sharing data in a grid Data-centric workflows Accessing OGSA-DAI Components and customisation Case study – SEE-GEO Performance
Web: What is OGSA-DAI? An extensible framework accessed via web services that executes data-centric workflows involving heterogeneous data resources for the purposes of data access, integration, transformation and delivery within a grid and is intended as a toolkit for building higher-level application-specific data services
Web: OGSA-DAI 3.0 OGSA-DAI has evolved constantly since February 2002 OGSA-DAI 2.2 released April 2006 As the number of users grew so did the requirements o More effective data streaming o Standardisation of activity inputs and outputs o Targeting multiple data resources in a single workflow o Supporting application-specific presentation layers OGSA-DAI 2.2 was not suitable for addressing these OGSA-DAI 3.0 o A complete re-design and re-implementation of OGSA-DAI o A stable framework for the future o Released September 2007
Web: Sharing data in a grid
Web: Motivation Grid is about sharing resources OGSA-DAI is about sharing structured data resources
Web: Sharing data via web site download ZIP up data and put it on a web site Pros o Easy distribution for providers o Easy access for consumers Cons o Consumers have to download all the data o Consumers have to load data into local databases to use it o Static snapshot o Security
Web: Sharing data via direct access Providers tell consumers o Database URL – mycomputer.epcc.ed.ac.uk:3306 o Username – userID o Password – password Pros o Consumers have direct access Cons o Firewall issues o User and password management is hard o No consistent security model o Hard to use in grid/web service workflows
Web: Sharing data via direct access Cons (continued) o No server-side layer in which to standardize database heterogeneities o Myriad drivers o Different APIs across different data types Relational and JDBC XML and XMLDB Indexed files and Lucene
Web: Domain-specific web services Manipulate data using domain-specific operations, e.g. o Book findByISBN(ISBN) o List findByAuthor(Author) o List findByKeyword(Word) Pros o Fits with grid/web service approach o Abstraction hides back-end database details o Web services are programming language neutral o Operations likely to map well to authorization policies
Web: Domain-specific web services Cons o Slower than direct access because of the web service layer and SOAP transport overhead, especially for large result sets o Domain-specific API prevents use of generic data exploration, mining and manipulation tools
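The domain-specific approach can be sketched as a small facade over a database. This is an illustration only: the `BookService` class, the `books` table and the demo rows are assumptions made up for the example, and an in-memory SQLite database stands in for the provider's back end.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    isbn: str
    author: str
    title: str


class BookService:
    """Hypothetical domain-specific facade: callers never see SQL or the schema."""

    def __init__(self, connection):
        self._db = connection

    def _query(self, where, value):
        # The WHERE fragment is always one of the fixed strings below;
        # only the value comes from the caller and it is bound as a parameter.
        rows = self._db.execute(
            "SELECT isbn, author, title FROM books WHERE " + where, (value,))
        return [Book(*row) for row in rows]

    def find_by_isbn(self, isbn):
        matches = self._query("isbn = ?", isbn)
        return matches[0] if matches else None

    def find_by_author(self, author):
        return self._query("author = ?", author)

    def find_by_keyword(self, word):
        return self._query("title LIKE ?", "%" + word + "%")


def make_demo_service():
    """Build a service over made-up demo rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE books (isbn TEXT, author TEXT, title TEXT)")
    db.executemany("INSERT INTO books VALUES (?, ?, ?)", [
        ("111", "A. Author", "Grid Data Services"),
        ("222", "B. Writer", "Databases on the Grid"),
        ("333", "A. Author", "Workflows"),
    ])
    return BookService(db)
```

The trade-off named in the cons is visible here: a generic exploration tool cannot ask this service anything beyond its three operations.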
Web: OGSA-DAI generic web services Manipulate data using OGSA-DAI’s generic web services Clients see the data in its ‘raw’ format, e.g. o Tables, columns, rows for relational data o Collections, elements etc. for XML data Clients can obtain the schema of the data Clients send queries in appropriate query language, e.g. SQL, XPath
Web: Getting away from SOAP – workflows
Web: Getting away from SOAP Aside from FTP there is also… SOAP attachments o Data comes along with, but external to, a SOAP message GridFTP …
Web: Data-centric workflows
Web: OGSA-DAI is not just about data access Example request (figure): SQLQuery (SELECT * FROM Bands WHERE name = 'Bangles';) -> tuples -> TupleToWebRowSetCharArrays -> WebRowSet XML -> XSLTransform -> HTML -> DeliverToURL (ftp://), with the XSL supplied to XSLTransform by ObtainFromHTTP (esheets/webRowSetToHTML.xsl) Stages: Access, Transform, Deliver Data streams between activities Figure labels: Activity, Request
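The access, transform, deliver chain can be approximated with chained Python generators. This is a sketch under stated assumptions, not OGSA-DAI code: SQLite stands in for the database, a hand-written HTML row formatter stands in for the WebRowSet and XSL transform steps, and appending to a list stands in for delivery to a URL. The function names and demo rows are hypothetical.

```python
import sqlite3
from html import escape


def sql_query(connection, sql, params=()):
    """Access: yield one tuple per result row."""
    for row in connection.execute(sql, params):
        yield row


def tuples_to_html_rows(tuples):
    """Transform: turn each tuple into an HTML table row."""
    for row in tuples:
        cells = "".join("<td>" + escape(str(value)) + "</td>" for value in row)
        yield "<tr>" + cells + "</tr>"


def deliver_to_list(blocks, destination):
    """Deliver: append each block to a list (stand-in for an upload)."""
    for block in blocks:
        destination.append(block)


def run_demo():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Bands (id INTEGER, name TEXT)")
    db.executemany("INSERT INTO Bands VALUES (?, ?)",
                   [(1, "Bangles"), (2, "Blondie")])
    delivered = []
    tuples = sql_query(db, "SELECT * FROM Bands WHERE name = ?", ("Bangles",))
    # Nothing runs until deliver_to_list pulls blocks through the chain,
    # so each row is transformed and delivered before the next is fetched.
    deliver_to_list(tuples_to_html_rows(tuples), delivered)
    return delivered
```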
Web: Data integration with OGSA-DAI workflows Using a single workflow
Web: Data integration with OGSA-DAI workflows Across OGSA-DAI services
Web: Distributed query processing
Web: Workflows in more detail
Web: Activities An activity is a named unit of functionality o A well defined workflow unit o Pluggable Example activities include o Execute an SQL query o ZIP a batch of data o List the files in a directory o Execute an XSL transform on an XML document o Deliver data to an FTP server Comprehensive and consistent standard activity set o Karasavvas, K., Atkinson, M.P. and Hume, A.C. OGSA-DAI – Redesigned and New Activities o AndNewActivitiesV1.9.pdf
Web: Activity inputs and outputs An activity can have o 0 or more named inputs o 0 or more named outputs Blocks of data flow from an activity’s output into another activity’s input
Web: Activity inputs and parameters No distinction between inputs and parameters Input literal o Special kind of input o Value is provided by client Client chooses whether input value is o Specified by the client in a request o Obtained from the output of another activity
Web: Activity input and output types and blocks Inputs expect blocks of specific types Outputs produce blocks of specific types
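The idea that an input is either a client-supplied literal or the output of another activity can be sketched as follows. The `Activity` class and its method names are hypothetical and are not the OGSA-DAI client toolkit API. For brevity the sketch passes single values rather than streams of typed blocks.

```python
class Activity:
    """Hypothetical activity: a named function with named inputs and one output."""

    def __init__(self, name, func, input_names):
        self.name = name
        self.func = func
        self.inputs = {input_name: None for input_name in input_names}

    def add_literal(self, input_name, value):
        # Input literal: the client supplies the value in the request.
        self._check(input_name)
        self.inputs[input_name] = ("literal", value)

    def connect(self, input_name, producer):
        # The value is obtained from the output of another activity.
        self._check(input_name)
        self.inputs[input_name] = ("activity", producer)

    def _check(self, input_name):
        if input_name not in self.inputs:
            raise KeyError(f"{self.name} has no input named {input_name!r}")

    def run(self):
        args = {}
        for input_name, source in self.inputs.items():
            if source is None:
                raise ValueError(f"input {input_name!r} of {self.name} is not set")
            kind, value = source
            args[input_name] = value if kind == "literal" else value.run()
        return self.func(**args)


def demo():
    upper = Activity("ToUpper", lambda text: text.upper(), ["text"])
    upper.add_literal("text", "bangles")

    repeat = Activity("Repeat", lambda text, times: text * times, ["text", "times"])
    repeat.connect("text", upper)      # fed by another activity
    repeat.add_literal("times", 2)     # fed by the client
    return repeat.run()
```

From the point of view of `Repeat`, both inputs look the same; only the client's wiring differs.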
Web: Block types Java’s basic types o Object, String, Integer, Long, Double, Number, Boolean Binary types o char[], byte[], Clob, Blob Tuple o OGSA-DAI representation of a row of relational data o One element per column MetadataWrapper o OGSA-DAI wrapper for any object to be treated as meta-data o Use application-specific meta-data within OGSA-DAI o Individual activities handle metadata blocks as they see fit Application-specific objects
Web: Blocks and binary data BLOBs o BLOBs obtained from databases are stored as BLOB objects within Tuples o References to entire BLOBs are passed between activities o Keep data grouped as a tuple Byte arrays o Data obtained from FTP o Fits pipeline streaming model used in OGSA-DAI All binary data processing activities can handle both representations
Web: Blocks and lists A list groups related blocks together o Special blocks are used to mark the beginning and the end of a list For example SQLQuery can dynamically take any number of SQL query expressions as input o Lists allow differentiation between the results of query1 and those of query2 Activities define the granularity of their inputs and outputs
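List markers in a block stream can be sketched as two sentinel objects. The names `LIST_BEGIN`, `LIST_END`, `sql_query_results` and `split_lists` are assumptions for this example, and the row data is made up. The sketch does not handle nested lists.

```python
LIST_BEGIN = object()  # special block marking the start of a list
LIST_END = object()    # special block marking the end of a list


def sql_query_results(results_per_query):
    """Emit one marked list of rows per query."""
    for rows in results_per_query:
        yield LIST_BEGIN
        yield from rows
        yield LIST_END


def split_lists(stream):
    """Regroup a flat block stream into one Python list per marked list."""
    current = None
    for block in stream:
        if block is LIST_BEGIN:
            if current is not None:
                raise ValueError("nested lists are not handled in this sketch")
            current = []
        elif block is LIST_END:
            if current is None:
                raise ValueError("list end without list begin")
            yield current
            current = None
        else:
            if current is None:
                raise ValueError("block outside a list")
            current.append(block)
    if current is not None:
        raise ValueError("stream ended inside a list")
```

Without the markers, a consumer could not tell where the rows of query1 stop and those of query2 start, or that a query returned no rows at all.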
Web: Activities and resources Activities can be targeted at OGSA-DAI resources Data resource o OGSA-DAI abstraction of a data resource Session o OGSA-DAI container for state Data source o Exposes data for asynchronous retrieval (pull) Data sink o Receives data for asynchronous delivery (push)
Web: Executing workflows – data streaming Activities in a workflow execute in parallel Data streams through activities in a pipeline-like way Each activity operates on a different portion of a data stream o If the activities are well defined
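Pipeline-style parallel execution can be sketched with one thread per activity and a bounded queue between neighbours. This is a generic illustration, not the OGSA-DAI workflow engine; `run_stage`, `run_pipeline` and the `capacity` parameter are hypothetical names.

```python
import queue
import threading

END = object()  # sentinel marking the end of a stream


def run_stage(func, inbox, outbox):
    """Apply func to each block as it arrives and pass the result on."""
    while True:
        block = inbox.get()
        if block is END:
            outbox.put(END)
            return
        outbox.put(func(block))


def run_pipeline(source_blocks, funcs, capacity=2):
    """Run one thread per stage, connected by small bounded queues."""
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()

    def feed():
        for block in source_blocks:
            queues[0].put(block)  # blocks while the first stage is busy
        queues[0].put(END)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        block = queues[-1].get()
        if block is END:
            break
        results.append(block)

    feeder.join()
    for thread in threads:
        thread.join()
    return results
```

Because the queues are small, at any moment each stage is working on a different portion of the stream, and a slow consumer slows the producers down rather than letting data pile up in memory.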
Web: Types of workflows Pipeline workflow o A set of chained activities executed in parallel with data flowing between the activities Sequence workflow o A set of sub-workflows each executed in sequence o For example Sub-workflow 1 – create a database table Sub-workflow 2 – bulk load data into the table Parallel workflow o A set of sub-workflows executed in parallel
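Sequence and parallel composition of sub-workflows can be sketched as two small container classes. The `Task`, `Sequence` and `Parallel` names are assumptions for illustration, and a dictionary stands in for the database table in the create-then-load example.

```python
from concurrent.futures import ThreadPoolExecutor


class Task:
    """A leaf sub-workflow: runs an action and records its name when done."""

    def __init__(self, name, action):
        self.name = name
        self.action = action

    def run(self, log):
        self.action()
        log.append(self.name)


class Sequence:
    """Runs each child only after the previous one has finished."""

    def __init__(self, *children):
        self.children = children

    def run(self, log):
        for child in self.children:
            child.run(log)


class Parallel:
    """Runs all children at the same time and waits for all of them."""

    def __init__(self, *children):
        self.children = children

    def run(self, log):
        with ThreadPoolExecutor(max_workers=len(self.children)) as pool:
            futures = [pool.submit(child.run, log) for child in self.children]
            for future in futures:
                future.result()  # re-raise any error from a child


def demo_sequence():
    table = {}
    log = []
    Sequence(
        Task("create table", lambda: table.setdefault("rows", [])),
        Task("bulk load", lambda: table["rows"].extend([1, 2, 3])),
    ).run(log)
    return log, table
```

The bulk load would fail if it ran before the table existed, which is why that pair needs a sequence rather than a parallel workflow.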
Web: Example workflows
Web: Query – Transform – Deliver
Web: Inter-database data transfer
Web: Get and deliver BLOBs
Web: Federate resources via resource groups
Web: Spawn sub-workflows
Web: Execute complex data-centric workflows Obtain scan data for scans since date d of embryos in stage s showing expression of gene g
Web: Using OGSA-DAI
Web: Accessing OGSA-DAI – executing workflows Client submits workflow (= request) to data request execution service Data request execution service (DRES) o Web service o Exposes a data request execution resource (DRER)
Web: Accessing OGSA-DAI – executing workflows Request status o Returned to client o Status of execution of each activity in the workflow Did it complete? Did it run into an error? o Status of execution of whole workflow Derived from status of individual activities Did they all complete? Did any run into errors? Was the workflow prematurely terminated by the client? o Data – depending upon the activities in the workflow
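Deriving a workflow status from the statuses of its activities can be sketched as a small function. The status names and the precedence chosen here (error first, then client termination, then completion) are assumptions for illustration and are not taken from OGSA-DAI.

```python
from enum import Enum


class Status(Enum):
    PROCESSING = "processing"
    COMPLETED = "completed"
    ERROR = "error"
    TERMINATED = "terminated"


def workflow_status(activity_statuses, terminated_by_client=False):
    """Combine per-activity statuses into one status for the whole workflow."""
    statuses = list(activity_statuses)
    if any(status is Status.ERROR for status in statuses):
        return Status.ERROR        # did any run into errors?
    if terminated_by_client:
        return Status.TERMINATED   # was it prematurely terminated?
    if all(status is Status.COMPLETED for status in statuses):
        return Status.COMPLETED    # did they all complete?
    return Status.PROCESSING
```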
Web: Accessing OGSA-DAI Data Request Execution Resource (DRER) o Workflow execution engine Parses workflow Creates activities Provides activities with target resources (if any) Executes workflow Builds request status Manages sessions Data resources o OGSA-DAI abstractions of data resources databases, file systems, web services,… o Provides access to the data resource e.g. via JDBC, XMLDB, Java File I/O,…
Web: More OGSA-DAI resources and services Data sources o Expose data for asynchronous retrieval (pull) Data sinks o Receive data for asynchronous delivery (push) Sessions o A state container associated with a set of workflows o Share state between workflows Requests o One per workflow submitted to a DRER o Access request status
Web: Resources and activities revisited Activities can be written to interact with any type of resource o SQLQuery – JDBC data resource o XPathQuery – XMLDB data resource o SQLBag – ResourceGroup data resource o ObtainFromDataSource – data source Some activities can create resources o CreateDataSource o CreateDataSink o CreateResourceGroup
Web: Components and customisation
Web: OGSA-DAI 3.0 Extension Points (figure) Labels: OMII, GT, Axis, UNICORE, gLite, WS-DAI?, Embedded; OGSA-DAI Core (Resource management, Activity management, Workflow engine); RDB, File, ?, XMLDB; SQLQuery, Transform, ?, DeliverToFTP, ObtainFromGFTP
Web: OGSA-DAI 3.0 Persistence and Configuration (figure) Labels: OMII, GT, Axis, UNICORE, gLite, WS-DAI?, Embedded; OGSA-DAI Core (Resource management, Activity management, Workflow engine); Activities; Data Resources
Web: Extending OGSA-DAI – activities Additional generic functionality o e.g. deliverToMessageQueue Additional resource-specific functionality o e.g. sqlStoredProcedure Application-specific functionality o e.g. transformToFasta
Web: Example – application-specific activities
Web: Extending OGSA-DAI – data resources A data resource can be anything… o Local or remote o Real or virtual o Persistent or in-memory For example o A view onto a relational database o A new XML database o Open Geospatial Consortium (OGC) data access services o Application specific web service
Web: Extending OGSA-DAI – presentation layers Expose workflows Hide OGSA-DAI behind domain-specific web services o Map service operations to “template” OGSA-DAI workflows o Assist in using OGSA-DAI within workflow engines e.g. Taverna
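Mapping a service operation to a "template" workflow can be sketched as a function that fills a fixed workflow description with the caller's argument. The dictionary layout, the `BookFacade` class, the `books` table and the final delivery step name are assumptions for this example and are not the real OGSA-DAI request format.

```python
def build_find_by_author_workflow(author):
    """Fill a fixed three-step workflow template with the caller's argument."""
    return [
        {"activity": "SQLQuery",
         "expression": "SELECT * FROM books WHERE author = ?",
         "parameters": [author]},
        {"activity": "TupleToWebRowSetCharArrays"},
        # Placeholder name for whatever delivery step the deployment uses.
        {"activity": "DeliverToRequestStatus"},
    ]


class BookFacade:
    """Domain-specific operations in front of a generic workflow service."""

    def __init__(self, submit_workflow):
        # submit_workflow is any callable that sends a workflow to a server
        # and returns its result; it is injected so the facade stays testable.
        self._submit = submit_workflow

    def find_by_author(self, author):
        return self._submit(build_find_by_author_workflow(author))
```

Callers of `find_by_author` never see the workflow, so the template can change without affecting them.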
Web: Example – OGSA-DAI and Globus Security Authorization on incoming SOAP request
Web: Example – OGSA-DAI and Globus Security SecurityContext object o One for each request o By default contains DN and credential Login provider plug-in objects o One for each relational resource o Maps SecurityContext -> database login ResourceAuthorizer plug-in object o Used by ResourceAuthorizerPDP o Listens to event from resource manager RuntimeWorkflowAuthorizer plug-in object o Authorizes dynamically created workflows at runtime
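The login provider idea, mapping a request's security context to a database login, can be sketched as a lookup keyed on the distinguished name. The class and field names below are hypothetical, as are the DN and credentials in the usage; the real plug-in interfaces are not shown in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityContext:
    distinguished_name: str  # DN taken from the caller's credential


@dataclass(frozen=True)
class DatabaseLogin:
    username: str
    password: str


class MappingLoginProvider:
    """Maps a SecurityContext to the database login for one relational resource."""

    def __init__(self, mapping):
        self._mapping = dict(mapping)

    def get_login(self, context):
        try:
            return self._mapping[context.distinguished_name]
        except KeyError:
            raise PermissionError(
                "no database login for " + context.distinguished_name) from None


def demo_provider():
    # Made-up DN and credentials, for illustration only.
    return MappingLoginProvider({
        "/C=UK/O=Example/CN=Alice": DatabaseLogin("alice_ro", "secret"),
    })
```

Keeping the mapping on the server means consumers never learn the database username or password, which addresses the password management problem of direct access.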
Web: Client Toolkit
Web: Clients and web services Clients interact with web services via SOAP over HTTP o Deduce service interface from service WSDL description o Construct SOAP request to invoke operation o Parse SOAP response from service
Web: OGSA-DAI client toolkit Client-side abstractions of o Activities o Workflows o Resources o Services Get client-side proxies for OGSA-DAI resources exposed by OGSA-DAI services Submit workflows to these proxies Client toolkit manages o Submission of workflow to OGSA-DAI service o Parsing of the request execution status and data from the service Focus on constructing applications
Web: OGSA-DAI case-study – SEE-GEO
Web: SEE-GEO SEcurE access to GEOspatial services o EDINA, NeSC, NCeSS, MIMAS o Access to geospatial information on a grid Open Geospatial Consortium (OGC) web services
Web: OGC Geolinking Interoperability Experiment OGSA-DAI being extended to offer integrated, distributed resource management for geo-spatial tools Using established open interoperability standards Web Feature Service (WFS) and Web Map Service (WMS) integrated into OGSA-DAI The IE is hardening candidate OGC specifications o Geolinked Data Access Service (GDAS) o GeoLinking Service (GLS) Validate Web Coverage Service (WCS) scheduled Extend to support secure access
Web: e-Social Science demonstrator Two data resources o Census statistics Attributes about a region e.g. the cost of a loaf of bread Geo-data access service (GDAS) o Borders data Unique regions encoded as polygons Web feature service (WFS) How to link the attributes to the regions? A geo-linking service o Execute a join across the two data sets Implemented as a Web Processing Service (WPS)
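The geo-linking join can be sketched as a hash join: cache the attribute rows by region, then stream the polygons past the cache. The `region_id` key, the field names and the demo values are assumptions made up for this example; they are not the GDAS or WFS data formats.

```python
def geolink(attributes, features, key="region_id"):
    """Join attribute rows to region features that share the same key."""
    # Cache the (smaller) attribute set in memory, keyed by region.
    by_region = {row[key]: row for row in attributes}
    # Stream the features and annotate each one that has attributes.
    for feature in features:
        row = by_region.get(feature[key])
        if row is not None:
            yield {**feature, **row}


def demo():
    census = [
        {"region_id": "R1", "bread_price": 0.95},
        {"region_id": "R2", "bread_price": 1.10},
    ]
    borders = [
        {"region_id": "R1", "polygon": [(0, 0), (0, 1), (1, 1)]},
        {"region_id": "R2", "polygon": [(1, 1), (1, 2), (2, 2)]},
        {"region_id": "R3", "polygon": [(2, 2), (2, 3), (3, 3)]},
    ]
    return list(geolink(census, borders))
```

Region `R3` has no census attributes in the demo data, so it is dropped from the output; only annotated polygons are streamed on.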
Web: Demographic forecasting (figure) Components: Census DB, Borders DB, WFS, GDAS, OGSA-DAI, GLS, Feature Portrayal, FPS, Portal, Map Server Operations: getData, getFeature, geoLink Interactions: Receive ticket for results; Retrieve annotated image; Store image on server; Send parameterised query; Call out to existing FP service; Cache attributes; Stream polygons; Request attributes; Request features; Run algorithm; Stream relevant annotated polygons Callouts: Concentrate on algorithm; Access domain-specific data sets; Utilise existing services; Efficient delivery methods
Web: What did OGSA-DAI give SEE-GEO? Could implement GLS service without OGSA-DAI But using OGSA-DAI allowed leverage of o Workflow engine o Out-of-the-box activities for queries and delivery o Security o Other grid technologies e.g. GridFTP
Web: What did OGSA-DAI give SEE-GEO? A toolkit to o Develop domain-specific activities o Develop support for domain-specific data resources o Ability to execute workflows using these o Build OGC Web Processing Services (WPS) Relatively little effort to o Choose different data resources dynamically o Merge GDAS XML into a relational data resource o Transfer data using GridFTP o Protect data using GSI o Experiment!
Web: What next for SEE-GEO? Deployment o Component integration o Bug fixing o Testing o Performance testing of OGSA-DAI as a GLS Complete participation in the OGC Interoperability Experiment Add Web Coverage Service (WCS) Look at security and OGC o Shibboleth o Grid Security Infrastructure (GSI) o PrivilEge and Role Management Infrastructure Standards Validation (PERMIS) OGSA-DAI:Z SRW/U bridge o Ordnance Survey Master Map delivery using a grid
Web: OGSA-DAI and Performance
Web: Performance OGSA-DAI is (another) component that sits between clients and the data they want How can we minimize the overhead of OGSA-DAI? How can OGSA-DAI be used effectively?
Web: Synchronous execution Client submits workflow to OGSA-DAI OGSA-DAI does not return until workflow has executed Request status is then returned to the client Pros o Good for interactive clients which need constant communication with an OGSA-DAI server for small operations Cons o Data is returned via SOAP/HTTP which incurs a performance hit o Not ideal for complex requests with lots of operations o Not ideal for requests which return large volumes of data o Not ideal for responsive clients o Not ideal for clients that need interim results during time-consuming workflows
Web: Asynchronous execution Client submits workflow to OGSA-DAI OGSA-DAI returns immediately with initial request status Client contacts OGSA-DAI later to retrieve final request status Pros o Can avoid returning data via SOAP/HTTP o Good for complex requests with lots of operations o Good for requests which produce large volumes of data o Good for responsive clients o Good for clients that need interim results during time-consuming workflows Cons o Client must poll OGSA-DAI to determine when execution is complete
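The polling that asynchronous execution requires of the client can be sketched as a loop with a timeout. The status strings, the `get_status` callable and the parameter names are assumptions for illustration; in practice `get_status` would wrap a call to the server.

```python
import time

FINAL_STATUSES = ("COMPLETED", "ERROR", "TERMINATED")


def wait_for_completion(get_status, poll_interval=0.5, timeout=30.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it reports a final status or the timeout passes."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in FINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError("workflow still running after %s seconds" % timeout)
        sleep(poll_interval)
```

`sleep` and `clock` are parameters only so the loop can be exercised without real waiting; the defaults use the standard library.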
Web: Request status Synchronous request – returned by data request execution service Asynchronous request – returned by request management service Pros o Get data directly from OGSA-DAI o Easy to manipulate client-side – client toolkit supports extraction of data and parsing into useful objects Cons o Is transferred from server to client via SOAP/HTTP and so incurs serialization/deserialization overhead o Performance (time and memory) quickly degrades if it contains large amounts of data
Web: Improving request status – aggregators Aggregator activities make request status more scalable Group character and byte arrays into larger chunks Improve performance by up to 50% in some scenarios
Web: Aggregators
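The aggregation idea, grouping many small character chunks into fewer larger ones before they are returned, can be sketched as a generator. The function name and the `target_size` parameter are hypothetical; the same pattern would apply to byte arrays.

```python
def aggregate_chars(chunks, target_size):
    """Group small string chunks into chunks of at least target_size characters."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    buffer = []
    buffered = 0
    for chunk in chunks:
        buffer.append(chunk)
        buffered += len(chunk)
        if buffered >= target_size:
            yield "".join(buffer)
            buffer, buffered = [], 0
    if buffered:
        yield "".join(buffer)  # final, possibly short, chunk
```

Fewer, larger chunks mean fewer elements to serialize and parse in the response, which is where the saving would be expected to come from.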
Web: File Transfer Protocol (FTP) Standard OGSA-DAI delivery option Limited by OGSA-DAI throughput and file transfer rate OGSA-DAI streaming model => elapsed time should be higher than for standard FTP alone, as data collection and processing are executed at the same time as the transfer o e.g. a 60MB data transfer takes 30s over FTP; with OGSA-DAI transforming the data to a CSV file there is a +15% hit Pros o Improved scalability over request status o Great for large data sets of 10,000,000+ rows o Useful if using OGSA-DAI in conjunction with third-party systems Cons o Requires an FTP server o Requires clients to pick up the data from the FTP server
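Reading the "+15% hit" in the example as 15% extra elapsed time relative to the plain FTP transfer (an assumption; the slide does not say what the percentage is measured against), the figures work out as:

```latex
\begin{align*}
t_{\text{FTP}} &= 30\ \text{s}, &
\frac{60\ \text{MB}}{30\ \text{s}} &= 2\ \text{MB/s} \\
t_{\text{OGSA-DAI}} &\approx 1.15 \times 30\ \text{s} = 34.5\ \text{s}, &
\frac{60\ \text{MB}}{34.5\ \text{s}} &\approx 1.74\ \text{MB/s}
\end{align*}
```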
Web: Data sources and data sinks Data sources o Data can be streamed from an activity into a data source o Clients pull the data from the data source via a data source service o Options to pull all the data back at once or a set number of blocks at a time Pros o No need for external components e.g. FTP o Can stream back data in small chunks o Client toolkit supports interaction with data sources and parsing data into a useful form o More performant than using request status o Used with asynchronous requests it can handle 1,000,000 row datasets Cons o Data is returned via SOAP/HTTP but aggregators can offset this o Limited storage capacity for synchronous requests – a workflow will block when capacity is reached Data sinks o Complement of data sources o Used for transferring data from clients to OGSA-DAI
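The pull model and the capacity limit can be sketched with a bounded queue: the workflow writes blocks in, the client pulls a set number out, and a write waits when the buffer is full. The `DataSource` class and its methods below are assumptions for illustration and are not the OGSA-DAI data source service interface.

```python
import queue


class DataSource:
    """Hypothetical bounded buffer between a workflow and a pulling client."""

    _END = object()  # internal marker: the workflow has finished writing

    def __init__(self, capacity):
        self._blocks = queue.Queue(maxsize=capacity)
        self._finished = False

    def write(self, block):
        """Called by the workflow; waits while the buffer is full."""
        self._blocks.put(block)

    def try_write(self, block, timeout):
        """Like write, but give up after timeout seconds; returns success."""
        try:
            self._blocks.put(block, timeout=timeout)
            return True
        except queue.Full:
            return False

    def close(self):
        self._blocks.put(self._END)

    def get_blocks(self, max_blocks):
        """Called by the client; returns (blocks, finished).

        Waits for the first block only, then takes whatever else is
        already buffered, up to max_blocks.
        """
        if max_blocks < 1:
            raise ValueError("max_blocks must be at least 1")
        blocks = []
        while not self._finished and len(blocks) < max_blocks:
            try:
                block = self._blocks.get(block=not blocks)
            except queue.Empty:
                break
            if block is self._END:
                self._finished = True
            else:
                blocks.append(block)
        return blocks, self._finished
```

With a synchronous request nobody pulls until the workflow returns, so a full buffer stalls the workflow, which is the limitation noted in the cons. With an asynchronous request the client can drain the buffer while the workflow is still writing.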
Web: Summary
Web: OGSA-DAI 3.0 Releases OGSA-DAI Project o OGSA-DAI 3.0 on Globus Toolkit o OGSA-DAI 3.0 on Apache Axis 1.4 or OMII-Europe (OMII-EU) Project o OGSA-DAI 3.0 on UNICORE 6 o OGSA-DAI 3.0 on gLite 3.0
Web: Future Events Training courses offered by OMII-Europe and OMII-UK o e-Science Institute, Edinburgh o Deploying Grid Data Services using OGSA-DAI o Thursday 1st to Friday 2nd November 2007
Web: Summary OGSA-DAI not just an out-of-the box application for data access OGSA-DAI is o an extensible framework o accessed via web services o that executes data-centric workflows o involving heterogeneous data resources o for the purposes of data access, integration, transformation and delivery o within a Grid o and is intended as a toolkit for building higher-level application-specific data services
Web: Further information Come and chat to us at the booth Grab anyone wearing our T-shirts OGSA-DAI o WWW site – o Info – o Users list – OMII-UK o WWW site – o Info –