OGSA-DAI data access and integration
Neil Chue Hong, Project Manager, EPCC
NERC GridGIS workshop, eSI, 1 February 2006

Overview
  The Data Deluge
    – challenges of increasing data availability
    – benefits of bringing data together
  OGSA-DAI
    – overview
    – use as a data integration base layer

The Data Deluge
  Entering an age of data
    – Data explosion
    – CERN: LHC will generate 1 GB/s = 10 PB/y
    – VLBA (NRAO) generates 1 GB/s today
    – Pixar generates 100 TB per movie
    – Storage getting cheaper
  Data stored in many different ways
    – Data resources
    – Relational databases
    – XML databases / files
    – Result files
  Need ways to facilitate
    – Data discovery
    – Data access
    – Data integration
  Empower e-Business and e-Science
    – The Grid is a vehicle for achieving this

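As a quick check of the LHC figures on this slide: a sustained 1 GB/s over a full calendar year would give about 31.5 PB, so 10 PB/y corresponds to roughly 10^7 seconds (about 116 days) of data-taking. The sketch below does that arithmetic. It assumes decimal units (1 PB = 10^6 GB), and the function names are illustrative only.

```python
GB_PER_PB = 1_000_000      # assumption: decimal units, 1 PB = 10^6 GB
SECONDS_PER_DAY = 86_400


def petabytes_produced(rate_gb_per_s: float, seconds_of_running: float) -> float:
    """Volume in PB produced at a constant rate over the given running time."""
    return rate_gb_per_s * seconds_of_running / GB_PER_PB


def seconds_needed(rate_gb_per_s: float, target_pb: float) -> float:
    """Running time in seconds needed to accumulate target_pb at a constant rate."""
    return target_pb * GB_PER_PB / rate_gb_per_s


# Volume if 1 GB/s were sustained for all 365 days of a year.
full_year_pb = petabytes_produced(1.0, 365 * SECONDS_PER_DAY)

# Running time implied by the slide's "1 GB/s = 10 PB/y".
running_seconds = seconds_needed(1.0, 10.0)
running_days = running_seconds / SECONDS_PER_DAY
```
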
Composing Observations in Astronomy
  Number and sizes of data sets as of mid-2002, grouped by wavelength
  12-waveband coverage of large areas of the sky
  Total about 200 TB of data
  Doubling every 12 months
  Largest catalogues near 1B objects
  Data and images courtesy of Alex Szalay, Johns Hopkins

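To make "doubling every 12 months" concrete, the sketch below extrapolates from the 200 TB quoted for mid-2002. It is a pure projection under the assumption that the doubling rate holds constant, not a measurement, and the function name is hypothetical.

```python
def projected_volume_tb(initial_tb: float, years_elapsed: float,
                        doubling_period_years: float = 1.0) -> float:
    """Exponential growth: the volume doubles once per doubling period."""
    return initial_tb * 2 ** (years_elapsed / doubling_period_years)


# Starting from the slide's 200 TB in mid-2002:
after_one_year = projected_volume_tb(200, 1)       # one doubling
after_three_years = projected_volume_tb(200, 3)    # three doublings
early_2006 = projected_volume_tb(200, 3.5)         # about 3.5 years later
```

Under this assumption the archive would pass 2 PB by early 2006, which is the scale of problem the following slides address.
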
Data Services: motives
  Key to Integration of Scientific Methods
    – Publication and sharing of results
    – Primary data from observation, simulation & experiment
    – Encourages novel uses
    – Allows validation of methods and derivatives
    – Enables discovery by combining data collected independently
  Key to Large-scale Collaboration
    – Economies: data production, publication & management
    – Sharing cost of storage, management and curation
    – Many researchers contributing increments of data
    – Pooling annotation leads to rapid incremental publication
    – Accommodates global distribution
    – Data & code travel faster and more cheaply
    – Accommodates temporal distribution
    – Researchers assemble data
    – Later (other) researchers access data

Data Services: challenges to management
  Scale
    – Many sites, large collections, many uses
  Longevity
    – Research requirements outlive technical decisions
  Diversity
    – No one-size-fits-all solution will work
    – Primary Data, Data Products, Meta Data, Administrative data, …
  Many Data Resources
    – Independently owned & managed
    – No common goals
    – No common design
    – Work hard for agreements on foundation types and ontologies
    – Autonomous decisions change data, structure, policy, …
    – Geographically distributed
  … and I haven't even mentioned security yet!

Small problems
  Not just Grand Challenges!
    – Also the small problems
  For instance:
    – What happens to data when a researcher leaves a team?
    – How can a research leader point to popular data when a new researcher joins?
    – How can you manage your data when you start to run out of local storage space?
    – How do I get my data from one format/database to another?
    – How do I combine my data with your data?
  You need to manage your data

What is a data service?
  An interface to a stored collection of data
    – e.g. Google and Amazon
    – web services
  But the data could be:
    – replicated
    – shared
    – federated
    – virtual
    – incomplete
  Don't care about the underlying representation
    – do care about the information it represents
  Adding a service layer to existing data sources can improve composability

Examples of Data Services
  Many Data Services and applications
    – Commercial databases
    – Web interfaces
    – Applications developed individually by groups and projects
  Also many places to get hold of public data
    – Publications and citation servers
    – Results servers
  But… no such thing as a free lunch
    – Things are not yet Plug and Play
    – You need to expend some effort to use these services effectively

Use Cases for Data Services
  Data Filtering:
    – single source producing large amounts of data distributed to many sites downstream
  Data Discovery:
    – many sources, many query entry points in a linked system
  Data Translation:
    – source to sink, conversion of data model / structure
  Data Federation:
    – many sources, linked to provide view as a single source
  Data Replication:
    – full or partial copies to improve throughput
  Data Integration (model aggregation):
    – e.g. integration of time variant data, streams, files
  Data Integration (knowledge expansion):
    – forming links between databases to increase knowledge

Trade Offs
  Speed vs completeness
    – do you require the exact answer or an answer?
  Application specific vs language specific queries
    – how will users interrogate a data service?
  Static system vs Dynamic Discovery
    – do you actually have dynamic resources?
  Static vs Dynamic data
    – READ only, READ/INSERT only, UPDATE permitted
  Static vs Dynamic queries
    – optimisation over flexibility
  Intranet vs Internet
    – speed over security
  Single data model vs mixed data models
    – ease/speed over integration
  Queries vs Questions
    – assume that we know the structure when we form the query

Requirements on Data Services?
  Common Data Model, e.g. RowSet
  Common Query Language(s), e.g. XQuery, SQL
  Standard access to
    – data resource schema information for schema mapping
    – physical data resource information for optimisation purposes
    – data resource descriptive information for discovery / integration
  Single, seamless security model
  Dynamic publication and discovery
  Multiple, efficient delivery methods
  Move computation towards data
  Data aggregation functionality
  Provenance information
  Replication information

OGSA-DAI In One Slide
  An extensible framework for data access and integration.
  Expose heterogeneous data resources to a grid through web services.
  Interact with data resources:
    – Queries and updates
    – Data transformation / compression
    – Data delivery
  Customise for your project using
    – Additional Activities
    – Client Toolkit APIs
    – Data Resource handlers
  A base for higher-level services
    – federation, mining, visualisation, …

OGSA-DAI team
  IBM Development Team, Hursley
  NEReSC, Newcastle
  NeSC, Edinburgh
  EPCC Team, Edinburgh
  ESNW, Manchester
  IBM Dissemination Team

OGSA-DAI Design Principles – I
  Efficient client-server communication
    – Minimise where possible
    – One request specifies multiple operations
  No unnecessary data movement
    – Move computation to the data
    – Utilise third-party delivery
    – Apply transforms (e.g., compression)
  Build on existing standards
    – Fill in gaps where necessary
    – DAIS specifications from DAIS WG at GGF

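The first two principles can be illustrated with a toy request that names several operations at once, so that the query, the serialisation and the compression all run next to the data and only the compressed result travels back. This is a minimal sketch and not the OGSA-DAI request format: the step names (`sqlQuery`, `toJson`, `gzip`), the `run_request` function and the `station` table are assumptions made for illustration, with an in-memory SQLite database standing in for a real data resource.

```python
import gzip
import json
import sqlite3


def run_request(conn, request):
    """Run every step of one request 'server side'; return only the final result."""
    data = None
    for step in request:
        if step["activity"] == "sqlQuery":
            data = conn.execute(step["sql"]).fetchall()
        elif step["activity"] == "toJson":
            data = json.dumps(data).encode("utf-8")
        elif step["activity"] == "gzip":
            data = gzip.compress(data)
        else:
            raise ValueError("unknown activity: " + step["activity"])
    return data


# Stand-in data resource (hypothetical table and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE station (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO station VALUES (?, ?)",
                 [(1, "Edinburgh"), (2, "Newcastle")])

# One request, three operations, one round trip.
request = [
    {"activity": "sqlQuery", "sql": "SELECT id, name FROM station ORDER BY id"},
    {"activity": "toJson"},
    {"activity": "gzip"},
]
payload = run_request(conn, request)

# The client only has to decompress and decode what it receives.
rows = json.loads(gzip.decompress(payload))
```
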
OGSA-DAI Design Principles – II
  Do not hide underlying data model
    – Users must know where to target queries
    – Data virtualisation is hard
  Extensible architecture
    – Modular and customisable
    – e.g., to accommodate stronger security
  Extensible activity framework
    – Cannot anticipate all desired functionality
    – Activity = unit of functionality
    – Allow users to plug in their own

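The idea of an activity as a pluggable unit of functionality can be sketched as a name-to-function registry that users extend without touching the engine. The sketch below is an analogy only: the `activity` decorator, the `ACTIVITIES` registry and both example activities are hypothetical and do not reflect OGSA-DAI's actual Java activity interfaces.

```python
from typing import Callable, Dict, List

# Registry mapping an activity name to the function that implements it.
ACTIVITIES: Dict[str, Callable[[List[str]], List[str]]] = {}


def activity(name: str):
    """Decorator that registers a function as a named activity."""
    def register(func):
        ACTIVITIES[name] = func
        return func
    return register


@activity("upperCase")
def upper_case(rows: List[str]) -> List[str]:
    """A 'built-in' activity: upper-case every value."""
    return [row.upper() for row in rows]


@activity("dropDuplicates")
def drop_duplicates(rows: List[str]) -> List[str]:
    """A 'user-supplied' activity: remove repeats, keeping first occurrences."""
    return list(dict.fromkeys(rows))


def run(names: List[str], data: List[str]) -> List[str]:
    """The engine looks activities up by name and chains them in order."""
    for name in names:
        data = ACTIVITIES[name](data)
    return data


result = run(["upperCase", "dropDuplicates"], ["a", "A", "b"])
```

The engine function `run` never changes when a new activity is added, which is the property the slide is pointing at.
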
The OGSA-DAI Framework
  [Diagram: an application uses the Client Toolkit to call an OGSA-DAI service; the service engine runs activities (SQLQuery, XPath, readFile, XSLT, GZip, GridFTP) against data resources: databases via JDBC (MySQL, DB2, SQL Server), XMLDB (Xindice) and files (SWISS-PROT).]

Intermediary
  Simple intermediary
    – potential to accelerate development, logging, or filtering
  Persistent intermediary
    – e.g. to allow efficient local indexing

Redirector, Coordinator, Network
  Allowing composition and decentralisation

Extensibility Example
  [Diagram: an OGSA-DAI service (engine, SQLQuery activity, JDBC, MySQL) shown alongside a "Multiple SQL" GDS whose SQLQuery activity spans several SQL/JDBC data resources.]

Map Retrieval: Current
  [Diagram labels: browser, Internet, OGC Service, GIS, Oracle, EDINA]

Map Retrieval: Grid Prototype
  Basic client to demonstrate proof of concept
  [Diagram labels: Client, OGSA-DAI 1, SO-OGC, OGC, GIS, Oracle, EDINA]

Map Retrieval: Security
  Exploit NGS infrastructure to provide secure access layer
  [Diagram labels: Portlet, NGS Authentication, Allowed users dn, ODS 1, SO-OGC, OGC, GIS, Oracle, EDINA]

Map Retrieval: Integration
  Exploit OGSA-DAI extensibility to add e.g. overlay
  [Diagram labels: Portlet, NGS Authentication, ODS 1, ODS 2, ODS 3, SO-OGC, JDBC, SQL/XML, OGC, GIS, Oracle, Census, Application data]

OGSA-DAI / EDINA prototyping work
  Stage 1: Using existing OGSA-DAI technology
  Stage 2: Extending OGSA-DAI
  [Diagram labels: GIS Client, OGSA-DAI service, DeliverFromURL, GIS Activities, HTTP Data Resource, WMS Server; flows: URL, Input Parameters, HTTP Request, HTTP Response, Image/XML File]

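The "URL" and "Input Parameters" flows in this slide suggest a map request being assembled and handed to a service that fetches the image over HTTP. The sketch below only builds such a URL. The slide does not say which WMS version or parameters the prototype used, so the parameter set follows the standard WMS 1.1.1 GetMap request as an assumption, and the host, layer name and bounding box are hypothetical. No network call is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_getmap_url(base_url, layers, bbox, width, height,
                     srs="EPSG:27700", image_format="image/png"):
    """Assemble a WMS 1.1.1 GetMap URL from its input parameters."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",                              # empty = server default styles
        "SRS": srs,                                # EPSG:27700 is the British National Grid
        "BBOX": ",".join(str(v) for v in bbox),    # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server and layer, for illustration only.
url = build_getmap_url("http://wms.example.org/wms", ["coastline"],
                       (300000, 600000, 350000, 650000), 512, 512)

# Parse the query string back to check what a server would receive.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
```
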
Core features of OGSA-DAI – I
  A framework for building applications
    – Supports data access, insert and update
    – Relational: MySQL, Oracle, DB2, SQL Server, Postgres
    – XML: Xindice, eXist
    – Files: CSV, BinX, EMBL, OMIM, SWISSPROT, …
    – Supports data delivery
    – SOAP over HTTP
    – FTP; GridFTP
    – Inter-service
    – Supports data transformation
    – XSLT
    – ZIP; GZIP
    – Supports security
    – X.509 certificate based security

Core features of OGSA-DAI – II
  A framework for building data clients
    – Client toolkit library for application developers
  A framework for developing functionality
    – Extend existing activities, or implement your own
    – Mix and match activities to provide functionality you need
  Highly extensible
    – Customise our out-of-the-box product
    – Provide your own services, client-side support and data-related functionality
  Comprehensive documentation and tutorials
  Latest release supports GT4.0 and Axis 1.2 / OMII_2 using Java 1.4

Distributed Query Processing
  Higher level services building on OGSA-DAI
    – specialised metadata extraction
  Execute queries in parallel over multiple data resources
  Queries mapped to algebraic expressions for evaluation
  Parallelism represented by partitioning queries
    – Use exchange operators
  Equality based joins in current release
    – supported types: long, integer, string, double and float
  [Diagram: example query plan using the operators table_scan (protein), table_scan termID=S92 (proteinTerm), reduce, hash_join (proteinId), op_call (Blast), reduce and exchange]

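The combination of exchange operators and equality-based hash joins can be sketched in a few lines: an exchange step routes rows to partitions by hashing the join key, and each partition is then joined independently. This is a minimal single-process illustration and not the DQP implementation. The table contents are made up, the `op_call (Blast)` step is omitted, and a thread pool stands in for separate evaluator nodes, so it shows the structure of the plan rather than real speed-up.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def exchange(rows, key, n_partitions):
    """Route each row to a partition chosen by hashing its join key."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key]) % n_partitions].append(row)
    return parts


def hash_join(left, right, key):
    """Equality join: build a hash table on the left input, probe with the right."""
    table = defaultdict(list)
    for l_row in left:
        table[l_row[key]].append(l_row)
    return [{**l_row, **r_row}
            for r_row in right
            for l_row in table.get(r_row[key], [])]


def parallel_join(left, right, key, n_partitions=2):
    """Partition both inputs on the key, join matching partitions concurrently."""
    left_parts = exchange(left, key, n_partitions)
    right_parts = exchange(right, key, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(hash_join, left_parts, right_parts,
                           [key] * n_partitions)
        return [row for part in results for row in part]


# Made-up rows shaped like the slide's protein / proteinTerm example.
proteins = [{"proteinId": 1, "sequence": "MKV"},
            {"proteinId": 2, "sequence": "GAT"},
            {"proteinId": 3, "sequence": "LLP"}]
protein_terms = [{"proteinId": 1, "termID": "S92"},
                 {"proteinId": 2, "termID": "T10"},
                 {"proteinId": 3, "termID": "S92"}]

# table_scan with the termID=S92 predicate, then hash_join on proteinId.
selected = [t for t in protein_terms if t["termID"] == "S92"]
joined = sorted(parallel_join(proteins, selected, "proteinId"),
                key=lambda row: row["proteinId"])
```

Because both inputs are partitioned with the same hash function, rows that can match always land in the same partition, which is what makes the per-partition joins independent.
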
DQP architecture

GridMiner: Data Mediation Service
  Principles
    – Tight Federation:
    – global (relational) schema
    – Virtual integration:
    – leave the data where it is
    – always up-to-date data
    – Build on data access from OGSA-DAI
    – Not bound to special architecture
  Supported data sources:
    – RDBMS (via JDBC), XMLDB (Xindice), CSV files
  Operators: Union all and inner join
  Operators are XQuery based (using SAXON)

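The two operators named on this slide can be illustrated over two of the listed source types, a relational database and a CSV file, once both are read into a common row format. GridMiner's operators are XQuery based; the sketch below swaps in plain Python purely to show the semantics of "union all" and "inner join". The table, column names and data are hypothetical, and SQLite stands in for an RDBMS reached via JDBC.

```python
import csv
import io
import sqlite3


def rows_from_sqlite(conn, sql):
    """Read a relational source into a list of dicts keyed by column name."""
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def rows_from_csv(text):
    """Read a CSV source into the same row format, casting id to int."""
    return [{**row, "id": int(row["id"])}
            for row in csv.DictReader(io.StringIO(text))]


def union_all(*sources):
    """Concatenate rows from every source, keeping duplicates."""
    return [row for src in sources for row in src]


def inner_join(left, right, key):
    """Keep only the row pairs whose key values are equal."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Source 1: relational (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id INTEGER, p_name TEXT)")
conn.execute("INSERT INTO patient VALUES (1, 'Ann Lee')")

# Source 2: CSV file contents (made-up data).
csv_text = "id,p_name\n2,Bob Ray\n"

patients = union_all(rows_from_sqlite(conn, "SELECT id, p_name FROM patient"),
                     rows_from_csv(csv_text))
visits = [{"id": 2, "ward": "B"}]
joined = inner_join(patients, visits, "id")
```
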
Data Integration Scenario
  Heterogeneities:
    – Name in A is "First Last" (as the target format)
    – Name in C has to be combined
  Distribution:
    – 3 data sources
  Java based schema mapping to global schema
    – types limited by WebRowSet

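The name heterogeneity described here amounts to one small mapping function per source, each producing a record in the global schema. The slide's mapping is Java based; the sketch below uses Python for brevity. All field names (`name`, `first_name`, `last_name`, `p_name`) and the sample records are assumptions, and only sources A and C are shown because the slide gives no detail for the third source.

```python
def map_source_a(record):
    """Source A already stores the name as 'First Last', the target format."""
    return {"id": record["id"], "p_name": record["name"]}


def map_source_c(record):
    """Source C stores the name in two parts, which have to be combined."""
    full_name = f'{record["first_name"]} {record["last_name"]}'
    return {"id": record["id"], "p_name": full_name}


# One mapper per source; each yields a row in the global schema.
MAPPERS = {"A": map_source_a, "C": map_source_c}


def to_global(source, record):
    """Translate a source-specific record into the global schema."""
    return MAPPERS[source](record)


from_a = to_global("A", {"id": 10, "name": "Ada Byron"})
from_c = to_global("C", {"id": 11, "first_name": "Alan", "last_name": "Kay"})
```
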
Data Integration Scenario (cont.)
  Query: SELECT p_name FROM patient WHERE id=10
  [Diagram labels: Standard, optimized]

caBIG
  Object-Oriented view of data
    – Data types are well-defined and registered in a repository
    – Standardized metadata facilitates discovery
    – custom query language implemented as an activity

LEAD
  Each satellite replicates its contents to the master catalog
  [Diagram labels: IU, NCSA Illinois, UA Huntsville, Millersville, UCAR Unidata, Okla Univ, Master catalog]

FirstDIG
  Data mining with the First Transport Group, UK
    – Example: When buses are more than 10 minutes late there is an 82% chance that revenue drops by at least 10%
    – "The results of this exercise will revolutionise the way we do things in the bus industry." Darren Unwin, Divisional Manager, First South Yorkshire
    – Client based joins, using temporary tables
  [Diagram labels: OGSA-DAI, OGSA-DAI, Client Application, Data Mining Application]

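A client-based join using temporary tables can be sketched as follows: the client pulls the relevant rows from one source, loads them into a temporary table at the other source, and lets that database perform the join. This is an illustration of the general technique, not FirstDIG's code. Two in-memory SQLite databases stand in for the separate data resources, and the table names, columns and values are all made up.

```python
import sqlite3

# Source 1: punctuality data (made-up rows).
delays_db = sqlite3.connect(":memory:")
delays_db.execute("CREATE TABLE delay (route TEXT, minutes_late INTEGER)")
delays_db.executemany("INSERT INTO delay VALUES (?, ?)",
                      [("X1", 12), ("X2", 3)])

# Source 2: revenue data held in a separate database (made-up rows).
revenue_db = sqlite3.connect(":memory:")
revenue_db.execute("CREATE TABLE revenue (route TEXT, change_pct REAL)")
revenue_db.executemany("INSERT INTO revenue VALUES (?, ?)",
                       [("X1", -11.0), ("X2", 1.5)])


def client_side_join(source, target, min_delay):
    """Fetch late routes from source, stage them in a temp table at target, join there."""
    late = source.execute(
        "SELECT route, minutes_late FROM delay WHERE minutes_late > ?",
        (min_delay,)).fetchall()
    target.execute(
        "CREATE TEMP TABLE late_routes (route TEXT, minutes_late INTEGER)")
    target.executemany("INSERT INTO late_routes VALUES (?, ?)", late)
    rows = target.execute(
        "SELECT l.route, l.minutes_late, r.change_pct "
        "FROM late_routes l JOIN revenue r ON r.route = l.route "
        "ORDER BY l.route").fetchall()
    target.execute("DROP TABLE late_routes")   # tidy up the staging table
    return rows


joined = client_side_join(delays_db, revenue_db, 10)
```

Filtering at the source before staging keeps the amount of data the client has to move small, at the cost of an extra write to the target database.
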
OGSA-DAI Challenges
  Metadata extraction
    – define a common model for e.g. database schema?
  Intermediate representation
    – between multiple models (relational, XML, …)
    – XML WebRowSet is flexible (c.f. GridMiner) but expansive
    – DFDL and GridFTP/parallel HTTP?
  Query definition
    – translation of queries
    – aggregation of results
  Data transport and workflow
    – workflow is typically compute driven
  Move computation to data
    – mobile code activities?
    – data services hosted on DBMS?

Contributing to OGSA-DAI
  Additional functionality:
    – Provide activities which implement specific functionality
    – Provide extra client functionality
    – Provide different security mechanisms
    – Provide higher level components and applications
  Different levels of contributions
    – Based on OGSA-DAI?
    – Works with OGSA-DAI?
    – Part of OGSA-DAI?

In the near future
  A new version of the OGSA-DAI Engine
    – should look mostly the same externally
    – better support for concurrency, sessions and monitoring
  Implementing new versions of specifications
    – DAIS Specifications
  Key things that we will be addressing:
    – Performance
    – A Security Model which can be applied across platforms
    – Full Transactions framework, distributed transactions
    – More data integration facilities
    – Better abstraction over DBMS variation
  Application centric queries
    – collaborating with other projects
  Research projects looking at:
    – schema mapping
    – extended data resources

Associated Meetings and Workshops
  DIALOGUE Workshops
    – Data Integration Applications: Linking Organisations to Gain Understanding and Experience
    – Bringing together Data Integration middleware and application providers with users
    – Next one at NeSC: 9-10th February 2006
  Next Generation Distributed Data Management (HPDC15, Paris)
  Data Management on Grids (VLDB06, Seoul)

Conclusions
  The benefits of trying to integrate data are hindered by challenges such as heterogeneity, scale and distribution
  A common data service layer should make data integration easier
  OGSA-DAI provides an extensible, data service based framework which makes it easier to implement data integration
  GIS data is amenable to integration using data services

Further information
  The OGSA-DAI Project Site:
  The DAIS-WG site:
  OGSA-DAI Users Mailing list
    – General discussion on grid DAI matters
  Formal support for OGSA-DAI releases
  OGSA-DAI training courses