Neil Chue Hong Project Manager, EPCC +44 131 650 5957 OGSA-DAI data access and integration NERC GridGIS workshop eSI, 1 February.

1 Neil Chue Hong, Project Manager, EPCC
N.ChueHong@epcc.ed.ac.uk
+44 131 650 5957
OGSA-DAI: data access and integration
NERC GridGIS workshop, eSI, 1 February 2006

2 NERC GridGIS workshop - 1 February 2006
Overview
The Data Deluge
– challenges of increasing data availability
– benefits of bringing data together
OGSA-DAI
– overview
– use as a data integration base layer

3 The Data Deluge
Entering an age of data
– Data explosion
– CERN: LHC will generate 1 GB/s = 10 PB/y
– VLBA (NRAO) generates 1 GB/s today
– Pixar generate 100 TB per movie
– Storage getting cheaper
Data stored in many different ways
– Data resources
– Relational databases
– XML databases / files
– Result files
Need ways to facilitate
– Data discovery
– Data access
– Data integration
Empower e-Business and e-Science
– The Grid is a vehicle for achieving this
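The CERN figure on this slide can be checked with a little arithmetic. The sketch below assumes decimal units (1 PB = 10^6 GB) and a 365-day year; the function name and the duty-cycle reading are illustrative assumptions, not something the slide states. A continuous 1 GB/s comes to roughly 31.5 PB per year, so 10 PB/y is consistent with data being produced for about a third of the year.

```python
# Back-of-envelope check of the "1 GB/s = 10 PB/y" figure.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds
GB_PER_PB = 1_000_000                  # decimal units: 1 PB = 10^6 GB


def petabytes_per_year(rate_gb_per_s: float, duty_cycle: float = 1.0) -> float:
    """Yearly volume in PB for a sustained rate and a fraction of time spent running."""
    return rate_gb_per_s * SECONDS_PER_YEAR * duty_cycle / GB_PER_PB


# Volume if 1 GB/s were sustained all year round.
continuous = petabytes_per_year(1.0)

# Fraction of the year at 1 GB/s that would yield the quoted 10 PB/y (an inferred reading).
implied_duty_cycle = 10.0 / continuous

print(f"continuous: {continuous:.1f} PB/y, implied duty cycle: {implied_duty_cycle:.2f}")
```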

4 Composing Observations in Astronomy
No. & sizes of data sets as of mid-2002, grouped by wavelength
12 waveband coverage of large areas of the sky
Total about 200 TB data
Doubling every 12 months
Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins

5 Data Services: motives
Key to Integration of Scientific Methods
– Publication and sharing of results
– Primary data from observation, simulation & experiment
– Encourages novel uses
– Allows validation of methods and derivatives
– Enables discovery by combining data collected independently
Key to Large-scale Collaboration
– Economies: data production, publication & management
– Sharing cost of storage, management and curation
– Many researchers contributing increments of data
– Pooling annotation leads to rapid incremental publication
– Accommodates global distribution
– Data & code travel faster and more cheaply
– Accommodates temporal distribution
– Researchers assemble data
– Later (other) researchers access data

6 Data Services: challenges to management
Scale
– Many sites, large collections, many uses
Longevity
– Research requirements outlive technical decisions
Diversity
– No one-size-fits-all solution will work
– Primary Data, Data Products, Meta Data, Administrative data, …
Many Data Resources
– Independently owned & managed
– No common goals
– No common design
– Work hard for agreements on foundation types and ontologies
– Autonomous decisions change data, structure, policy, …
– Geographically distributed
and I haven't even mentioned security yet!

7 Small problems
Not just Grand Challenges!
– Also the small problems
For instance:
– What happens to data when a researcher leaves a team?
– How can a research leader point to popular data when a new researcher joins?
– How can you manage your data when you start to run out of local storage space?
– How do I get my data from one format/database to another?
– How do I combine my data with your data?
You need to manage your data

8 What is a data service?
An interface to a stored collection of data
– e.g. Google and Amazon
– web services
But the data could be:
– replicated
– shared
– federated
– virtual
– incomplete
Don't care about the underlying representation
– do care about the information it represents
Adding a service layer to existing data sources can improve composability

9 Examples of Data Services
Many Data Services and applications
– Commercial databases
– Web interfaces
– Applications developed individually by groups and projects
Also many places to get hold of public data
– Publications and citation servers
– Results servers
But… no such thing as a free lunch
– Things are not yet Plug and Play
– You need to expend some effort to use these services effectively

10 Use Cases for Data Services
Data Filtering:
– single source producing large amounts of data distributed to many sites downstream
Data Discovery:
– many sources, many query entry points in a linked system
Data Translation:
– source to sink, conversion of data model / structure
Data Federation:
– many sources, linked to provide view as a single source
Data Replication:
– full or partial copies to improve throughput
Data Integration (model aggregation):
– e.g. integration of time variant data, streams, files
Data Integration (knowledge expansion):
– forming links between databases to increase knowledge

11 Trade Offs
Speed vs completeness
– do you require the exact answer or an answer?
Application specific vs language specific queries
– how will users interrogate a data service?
Static system vs Dynamic Discovery
– do you actually have dynamic resources?
Static vs Dynamic data
– READ only, READ/INSERT only, UPDATE permitted
Static vs Dynamic queries
– optimisation over flexibility
Intranet vs Internet
– speed over security
Single data model versus mixed data models
– ease/speed over integration
Queries vs Questions
– assume that we know the structure when we form the query

12 Requirements on Data Services?
Common Data Model e.g. RowSet
Common Query Language(s) e.g. XQuery, SQL
Standard access to
– data resource schema information for schema mapping
– physical data resource information for optimisation purposes
– data resource descriptive information for discovery / integration
Single, seamless security model
Dynamic publication and discovery
Multiple, efficient delivery methods
Move computation towards data
Data aggregation functionality
Provenance information
Replication information

13 OGSA-DAI In One Slide
An extensible framework for data access and integration.
Expose heterogeneous data resources to a grid through web services.
Interact with data resources:
– Queries and updates.
– Data transformation / compression.
– Data delivery.
Customise for your project using
– Additional Activities
– Client Toolkit APIs
– Data Resource handlers
A base for higher-level services
– federation, mining, visualisation, …

14 OGSA-DAI team
IBM Development Team, Hursley
NEReSC, Newcastle
NeSC, Edinburgh
EPCC Team, Edinburgh
ESNW, Manchester
IBM Dissemination Team

15 OGSA-DAI Design Principles – I
Efficient client-server communication
– Minimise where possible
– One request specifies multiple operations
No unnecessary data movement
– Move computation to the data
– Utilise third-party delivery
– Apply transforms (e.g., compression)
Build on existing standards
– Fill in gaps where necessary
– DAIS specifications from DAIS WG at GGF
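The principles "one request specifies multiple operations" and "apply transforms (e.g., compression)" can be illustrated with a small pipeline. This is a minimal sketch in Python, not the OGSA-DAI API: the function names (`sql_query`, `rows_to_csv`, `gzip_compress`, `perform`), the `station` table and its rows are all hypothetical, and an in-memory SQLite database stands in for a real data resource. The client sends one list of operations, and only the final compressed payload comes back.

```python
import gzip
import sqlite3
from typing import Any, Callable, List

# An "activity" here is just a function from the previous output to the next output.
Activity = Callable[[Any], Any]


def sql_query(db: sqlite3.Connection, sql: str) -> Activity:
    """Build an activity that ignores its input and returns the query's rows."""
    return lambda _input: db.execute(sql).fetchall()


def rows_to_csv(rows: List[tuple]) -> str:
    """Transform activity: turn rows into CSV text."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def gzip_compress(text: str) -> bytes:
    """Transform activity: compress the text before it leaves the server."""
    return gzip.compress(text.encode("utf-8"))


def perform(request: List[Activity]) -> Any:
    """Run all activities of one request in order, feeding each output to the next."""
    data: Any = None
    for act in request:
        data = act(data)
    return data


# Hypothetical data resource with invented contents.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE station (id INTEGER, name TEXT)")
db.executemany("INSERT INTO station VALUES (?, ?)", [(1, "Leith"), (2, "Dunbar")])

# One request, three operations: query, transform, compress.
request = [
    sql_query(db, "SELECT id, name FROM station ORDER BY id"),
    rows_to_csv,
    gzip_compress,
]
payload = perform(request)
print(gzip.decompress(payload).decode("utf-8"))
```

Chaining the steps inside a single request avoids one client-server round trip per step, and intermediate results never cross the network uncompressed.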

16 OGSA-DAI Design Principles – II
Do not hide underlying data model
– Users must know where to target queries
– Data virtualisation is hard
Extensible architecture
– Modular and customisable
– e.g., to accommodate stronger security
Extensible activity framework
– Cannot anticipate all desired functionality
– Activity = unit of functionality
– Allow users to plug in their own
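The idea of an extensible activity framework, where an activity is a unit of functionality and users can plug in their own, might look like the registry below. This is a hedged sketch in Python with invented names (`ACTIVITIES`, `activity`, `run`, and the three example activities); OGSA-DAI itself is a Java framework and its real extension points are not shown here.

```python
from typing import Callable, Dict, List

# Registry mapping an activity name to the function that implements it.
ACTIVITIES: Dict[str, Callable[[str], str]] = {}


def activity(name: str):
    """Decorator that registers a function as a named activity."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        ACTIVITIES[name] = fn
        return fn
    return register


# Activities shipped with the (hypothetical) framework.
@activity("trim")
def trim(data: str) -> str:
    return data.strip()


@activity("upperCase")
def upper_case(data: str) -> str:
    return data.upper()


def run(names: List[str], data: str) -> str:
    """Apply the named activities in order; unknown names are rejected."""
    for name in names:
        if name not in ACTIVITIES:
            raise KeyError(f"unknown activity: {name}")
        data = ACTIVITIES[name](data)
    return data


# A user-supplied plug-in: registered the same way, with no change to the framework code.
@activity("redact")
def redact(data: str) -> str:
    return data.replace("SECRET", "***")


result = run(["trim", "redact", "upperCase"], "  secret id: SECRET  ")
print(result)
```

The framework only needs to know the activity's name and calling convention, so functionality nobody anticipated can be added later.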

17 The OGSA-DAI Framework
[Figure. Labels: Application, Client Toolkit, OGSA-DAI service, Engine. Activities: SQLQuery, XPath, readFile, XSLT, GZip, GridFTP. Data Resources: databases via JDBC (MySQL, DB2, SQL Server), XMLDB (Xindice), File (SWISS-PROT).]

18 Intermediary
Simple intermediary
– potential to accelerate development, logging, or filtering
Persistent intermediary
– e.g. to allow efficient local indexing

19 Redirector, Coordinator, Network
Allowing composition and decentralisation

20 Extensibility Example
[Figure. Labels: OGSA-DAI service, Engine, GDS, SQLQuery, Multiple SQL, SQL and JDBC (repeated several times), MySQL.]

21 Map Retrieval: Current
[Figure. Labels: OGC, browser, Internet, Service, GIS, Oracle, EDINA.]

22 Map Retrieval: Grid Prototype
Basic client to demonstrate proof of concept
[Figure. Labels: Client, OGSA-DAI 1, SO-OGC, OGC, GIS, Oracle, EDINA.]

23 Map Retrieval: Security
Exploit NGS infrastructure to provide secure access layer
[Figure. Labels: Portlet, NGS Authentication, Allowed users dn, ODS 1, SO-OGC, OGC, GIS, Oracle, EDINA.]

24 Map Retrieval: Integration
Exploit OGSA-DAI extensibility to add e.g. overlay
[Figure. Labels: Portlet, NGS Authentication, ODS 1, ODS 2, ODS 3, SO-OGC (twice), JDBC, SQL/XML, OGC, GIS, Oracle (twice), Census, Application data.]

25 OGSA-DAI / EDINA prototyping work
Stage 1: Using existing OGSA-DAI technology
Stage 2: Extending OGSA-DAI
[Figure. Labels: GIS Client (twice), URL, Input Parameters, OGSA-DAI service, DeliverFromURL, GIS Activities, HTTP Data Resource, HTTP Request, HTTP Response, WMS Server, Image/XML File.]
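In the prototype, the service fetches a map from a WMS server with an HTTP request built from a URL and input parameters. The sketch below shows, in Python, what constructing such a request URL could look like. The parameter names follow the OGC WMS 1.1.1 GetMap request; the host `wms.example.org`, the layer name, the bounding box and the helper `build_getmap_url` are hypothetical and are not taken from the EDINA prototype. No network call is made.

```python
from typing import Sequence, Tuple
from urllib.parse import parse_qs, urlencode, urlsplit


def build_getmap_url(
    base_url: str,
    layers: Sequence[str],
    bbox: Tuple[float, float, float, float],
    width: int,
    height: int,
    srs: str = "EPSG:4326",
    image_format: str = "image/png",
) -> str:
    """Assemble a WMS 1.1.1 GetMap URL from its input parameters."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value requests the default style
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server, layer and extent.
url = build_getmap_url(
    "http://wms.example.org/wms", ["coastline"], (-8.0, 49.0, 2.0, 61.0), 400, 480
)

# Parse the query string back to confirm what the server would receive.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(url)
```

A delivery activity would then issue this URL as the HTTP request and pass the returned image or XML file on to the GIS client.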

26 Core features of OGSA-DAI – I
A framework for building applications
– Supports data access, insert and update
  – Relational: MySQL, Oracle, DB2, SQL Server, Postgres
  – XML: Xindice, eXist
  – Files: CSV, BinX, EMBL, OMIM, SWISSPROT, …
– Supports data delivery
  – SOAP over HTTP
  – FTP; GridFTP
  – E-mail
  – Inter-service
– Supports data transformation
  – XSLT
  – ZIP; GZIP
– Supports security
  – X.509 certificate based security

27 Core features of OGSA-DAI – II
A framework for building data clients
– Client toolkit library for application developers
A framework for developing functionality
– Extend existing activities, or implement your own
– Mix and match activities to provide functionality you need
Highly extensible
– Customise our out-of-the-box product
– Provide your own services, client-side support and data-related functionality
Comprehensive documentation and tutorials
Latest release supports GT4.0 and Axis 1.2 / OMII_2 using Java 1.4

28 Distributed Query Processing
Higher level services building on OGSA-DAI
– specialised metadata extraction
Execute queries in parallel over multiple data resources
Queries mapped to algebraic expressions for evaluation
Parallelism represented by partitioning queries
– Use exchange operators
Equality based joins in current release
– supported types: long, integer, string, double and float
[Figure: example query plan. Operators: table_scan (protein); table_scan termID=S92 (proteinTerm); hash_join (proteinId); op_call (Blast); reduce (twice); exchange. Other labels: 3,4; 12.]
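The plan on this slide combines table scans, an exchange operator and an equality-based hash join. The Python sketch below shows the general technique under stated assumptions; it is not the DQP implementation. The table contents, the choice of two partitions and the use of CRC32 as the partitioning hash are all invented for illustration, and the partitions are processed in a simple loop where a real system would evaluate them in parallel on different nodes. The `op_call (Blast)` and `reduce` steps are left out.

```python
import zlib
from typing import Dict, List

Row = Dict[str, str]


def exchange(rows: List[Row], key: str, n: int) -> List[List[Row]]:
    """Partition rows by a hash of the join key, so equal keys land in the same partition."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(row[key].encode("utf-8")) % n].append(row)
    return parts


def hash_join(build: List[Row], probe: List[Row], key: str) -> List[Row]:
    """Equality join: build a hash table on one input, then probe it with the other."""
    table: Dict[str, List[Row]] = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    return [{**b, **p} for p in probe for b in table.get(p[key], [])]


# Invented sample data for the two scanned tables.
protein = [
    {"proteinId": "P1", "sequence": "MKV"},
    {"proteinId": "P2", "sequence": "GAT"},
    {"proteinId": "P3", "sequence": "LLS"},
]
protein_term = [
    {"proteinId": "P1", "termID": "S92"},
    {"proteinId": "P2", "termID": "S10"},
    {"proteinId": "P3", "termID": "S92"},
]

# table_scan termID=S92 (proteinTerm): scan with a selection.
selected = [t for t in protein_term if t["termID"] == "S92"]

# exchange + hash_join (proteinId): join each pair of matching partitions independently.
PARTITIONS = 2
joined: List[Row] = []
for left, right in zip(
    exchange(protein, "proteinId", PARTITIONS),
    exchange(selected, "proteinId", PARTITIONS),
):
    joined.extend(hash_join(left, right, "proteinId"))

joined.sort(key=lambda r: r["proteinId"])
print(joined)
```

Because both inputs are partitioned with the same hash on the join key, no matching pair is ever split across partitions, which is what lets the partitions be joined independently.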

29 DQP architecture

30 GridMiner: Data Mediation Service
Principles
– Tight federation:
– global (relational) schema
– Virtual integration:
– leave the data where it is
– always up-to-date data
– Build on data access from OGSA-DAI
– Not bound to special architecture
Supported data sources:
– RDBMS (via JDBC), XMLDB (Xindice), CSV files
Operators: union all and inner join
Operators are XQuery based (using SAXON)

31 Data Integration Scenario
Heterogeneities:
– Name in A is First Last (as the target format)
– Name in C has to be combined
Distribution:
– 3 data sources
Java based schema mapping to global schema
– types limited by WebRowSet

32 Data Integration Scenario (cont.)
Query: SELECT p_name FROM patient WHERE id=10
[Figure: query plans. Labels: Standard, optimized.]
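The scenario on these two slides maps differently shaped sources onto one global `patient` relation and then answers a query against it. The Python sketch below illustrates that idea only; GridMiner's mediation uses Java schema mappings and XQuery-based operators, which are not reproduced here. The source column names and rows are invented, and the third source is omitted because the transcript does not describe its format.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[int, str]]

# Hypothetical source rows; the column names are assumptions made for this sketch.
SOURCE_A: List[Row] = [{"id": 10, "name": "Jane Smith"}]                 # already "First Last"
SOURCE_C: List[Row] = [{"id": 11, "first": "John", "last": "Doe"}]       # name must be combined


def map_a(row: Row) -> Row:
    """Source A already matches the target format; only the column is renamed."""
    return {"id": row["id"], "p_name": row["name"]}


def map_c(row: Row) -> Row:
    """Source C stores the name in two parts; combine them as 'First Last'."""
    return {"id": row["id"], "p_name": f"{row['first']} {row['last']}"}


def patient() -> List[Row]:
    """Global relation: UNION ALL of the mapped rows; the data stays in its sources."""
    return [map_a(r) for r in SOURCE_A] + [map_c(r) for r in SOURCE_C]


def select_p_name(patient_id: int) -> List[str]:
    """Equivalent of: SELECT p_name FROM patient WHERE id = <patient_id>."""
    return [str(row["p_name"]) for row in patient() if row["id"] == patient_id]


print(select_p_name(10))
print(select_p_name(11))
```

The view is computed at query time from the sources, which is what keeps the integrated result up to date without copying data.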

33 caBIG
Object-Oriented view of data
– Data types are well-defined and registered in a repository
– Standardized metadata facilitates discovery
– Custom query language implemented as an activity

34 LEAD
Each satellite replicates its contents to the master catalog
[Figure. Labels: IU, NCSA Illinois, UA Huntsville, Millersville, UCAR Unidata, Okla Univ, Master catalog.]

35 FirstDIG
Data mining with the First Transport Group, UK
– Example: When buses are more than 10 minutes late there is an 82% chance that revenue drops by at least 10%
– "The results of this exercise will revolutionise the way we do things in the bus industry." Darren Unwin, Divisional Manager, First South Yorkshire.
– Client based joins, using temporary tables
[Figure. Labels: OGSA-DAI (twice), Client Application, Data Mining Application.]
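A client-based join using temporary tables can be sketched as follows. This Python example is an illustration only: two in-memory SQLite databases stand in for two separately exposed data resources, and every table name, column and value is invented (none of it is FirstDIG data). The client pulls rows from each source, loads them into temporary tables in its own local database, and performs the join there.

```python
import sqlite3
from typing import List, Tuple


def fetch(db: sqlite3.Connection, sql: str) -> List[Tuple]:
    """Stand-in for querying a remote data service and receiving its rows."""
    return db.execute(sql).fetchall()


# Source 1 (hypothetical): punctuality records.
punctuality = sqlite3.connect(":memory:")
punctuality.execute("CREATE TABLE delays (route TEXT, day TEXT, minutes_late INTEGER)")
punctuality.executemany(
    "INSERT INTO delays VALUES (?, ?, ?)",
    [("52", "Mon", 15), ("52", "Tue", 2), ("7", "Mon", 4)],
)

# Source 2 (hypothetical): revenue records held in a different database.
revenue = sqlite3.connect(":memory:")
revenue.execute("CREATE TABLE takings (route TEXT, day TEXT, amount REAL)")
revenue.executemany(
    "INSERT INTO takings VALUES (?, ?, ?)",
    [("52", "Mon", 410.0), ("52", "Tue", 530.0), ("7", "Mon", 220.0)],
)

# Client side: copy each result set into a temporary table, then join locally.
client = sqlite3.connect(":memory:")
client.execute("CREATE TEMP TABLE tmp_delays (route TEXT, day TEXT, minutes_late INTEGER)")
client.execute("CREATE TEMP TABLE tmp_takings (route TEXT, day TEXT, amount REAL)")
client.executemany(
    "INSERT INTO tmp_delays VALUES (?, ?, ?)",
    fetch(punctuality, "SELECT route, day, minutes_late FROM delays"),
)
client.executemany(
    "INSERT INTO tmp_takings VALUES (?, ?, ?)",
    fetch(revenue, "SELECT route, day, amount FROM takings"),
)

joined = client.execute(
    """
    SELECT d.route, d.day, d.minutes_late, t.amount
    FROM tmp_delays AS d
    JOIN tmp_takings AS t ON d.route = t.route AND d.day = t.day
    WHERE d.minutes_late > 10
    ORDER BY d.route, d.day
    """
).fetchall()
print(joined)
```

This approach needs no cooperation between the two sources, at the cost of moving both result sets to the client before joining.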

36 OGSA-DAI Challenges
Metadata extraction
– define a common model for e.g. database schema?
Intermediate representation
– between multiple models (relational, XML, …)
– XML WebRowSet is flexible (cf. GridMiner) but expansive
– DFDL and GridFTP/parallel HTTP?
Query definition
– translation of queries
– aggregation of results
Data transport and workflow
– workflow is typically compute driven
Move computation to data
– mobile code activities?
– data services hosted on DBMS?

37 Contributing to OGSA-DAI
Additional functionality:
– Provide activities which implement specific functionality
– Provide extra client functionality
– Provide different security mechanisms
– Provide higher level components and applications
Different levels of contributions
– Based on OGSA-DAI?
– Works with OGSA-DAI?
– Part of OGSA-DAI?

38 In the near future
A new version of the OGSA-DAI Engine
– should look mostly the same externally
– better support for concurrency, sessions and monitoring
Implementing new versions of specifications
– DAIS Specifications
Key things that we will be addressing:
– Performance
– A Security Model which can be applied across platforms
– Full Transactions framework, distributed transactions
– More data integration facilities
– Better abstraction over DBMS variation
Application centric queries
– collaborating with other projects
Research projects looking at:
– schema mapping
– extended data resources

39 Associated Meetings and Workshops
DIALOGUE Workshops (http://www.datagrids.org)
– Data Integration Applications: Linking Organisations to Gain Understanding and Experience
– Bringing together Data Integration middleware and application providers with users
– Next one at NeSC: 9-10th February 2006
– http://www.nesc.ac.uk/esi/events/636/
Next Generation Distributed Data Management (HPDC15, Paris)
– http://www.isi.edu/~annc/distributedDataWorkshop.html
Data Management on Grids (VLDB06, Seoul)

40 Conclusions
The benefits of trying to integrate data are hindered by challenges such as heterogeneity, scale and distribution
A common data service layer should make data integration easier
OGSA-DAI provides an extensible, data service based framework which makes it easier to implement data integration
GIS data is amenable to integration using data services

41 Further information
The OGSA-DAI Project Site:
– http://www.ogsadai.org.uk
The DAIS-WG site:
– http://forge.gridforum.org/projects/dais-wg/
OGSA-DAI Users Mailing list
– users@ogsadai.org.uk
– General discussion on grid DAI matters
Formal support for OGSA-DAI releases
– http://bugs.ogsadai.org.uk/
OGSA-DAI training courses

