A Grid Data Integration Service (OGSA-DQP)
Paul Watson, University of Newcastle-upon-Tyne
based on the work of…
Norman Paton, Tasos Gounaris, Alvaro Fernandes, Rizos Sakellariou, University of Manchester
Jim Smith, Arijit Mukherjee, Paul Watson, University of Newcastle-upon-Tyne
2 The Problem
- Many grid applications would benefit from access to distributed data
- Data sources are scattered and autonomous
- Integration is often done by a tedious manual process or (recently) by hand-coded workflows
- We are interested in how to simplify the process of querying distributed data
  - Focussing initially on information held in (relational) databases
3 Distributed Query Processing
- Queries are expressed in OQL
  - allows computations to be included in the query
- A single query may reference data at multiple sites
  - the data locations may be transparent to the query author

    select p.proteinId, Blast(p.sequence)
    from protein p, proteinTerm t
    where t.termId = 'S92'
    and p.proteinId = t.proteinId
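To make the meaning of the query above concrete, here is a minimal Python sketch of what it computes, run against small in-memory lists rather than real databases. The sample rows and the `blast` function are invented placeholders (a real Blast call would run a sequence-similarity search); only the table and column names come from the query.

```python
# Toy stand-ins for the two tables; these rows are invented for illustration.
protein = [
    {"proteinId": "P1", "sequence": "MKTAYIAK"},
    {"proteinId": "P2", "sequence": "GAVLIPFW"},
    {"proteinId": "P3", "sequence": "MSTNPKPQ"},
]
protein_term = [
    {"proteinId": "P1", "termId": "S92"},
    {"proteinId": "P2", "termId": "S10"},
    {"proteinId": "P3", "termId": "S92"},
]


def blast(sequence):
    """Hypothetical placeholder for the Blast computation named in the query."""
    return f"blast-result({sequence})"


def run_query(protein, protein_term, term_id="S92"):
    """Nested-loop reading of the query: filter on termId, join on proteinId."""
    return [
        (p["proteinId"], blast(p["sequence"]))   # select p.proteinId, Blast(p.sequence)
        for p in protein                         # from protein p,
        for t in protein_term                    #      proteinTerm t
        if t["termId"] == term_id                # where t.termId = 'S92'
        and p["proteinId"] == t["proteinId"]     # and p.proteinId = t.proteinId
    ]
```

The point of the declarative form is that the author writes only the query; where each table lives and how the join is evaluated are left to the query processor.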
4 Query Compiler
[Figure: query compiler pipeline. OQL Parser; Logical Optimiser and Physical Optimiser (single-node optimiser); Partitioner and Scheduler (multi-node optimiser); Evaluator.]
- OGSA-DQP automatically compiles and executes the query on a set of Grid nodes, in parallel where possible
5 Execution Plan

    select p.proteinId, Blast(p.sequence)
    from protein p, proteinTerm t
    where t.termId = 'S92'
    and p.proteinId = t.proteinId

- The plan is split into a set of partitions
- Grid resources are acquired to execute the partitions in parallel where possible, required and affordable
[Figure: execution plan. Operators: table_scan (protein); table_scan (proteinTerm) with termID=S92; reduce; hash_join (proteinId); exchange; op_call (Blast); reduce. Node labels: 1, 2, 9,10, 3-8.]
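The partitioned plan can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OGSA-DQP evaluator: threads stand in for Grid nodes, `exchange` routes rows by a CRC32 hash of the join key (my choice of hash), `reduce` is taken to mean projection, and `blast` is a hypothetical placeholder that just returns the sequence length. The operator names follow the plan figure; the sample rows are invented.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Invented sample rows for illustration.
protein = [
    {"proteinId": "P1", "sequence": "MKTAYIAK"},
    {"proteinId": "P2", "sequence": "GAVLIPFW"},
    {"proteinId": "P3", "sequence": "MSTNPKPQ"},
]
protein_term = [
    {"proteinId": "P1", "termId": "S92"},
    {"proteinId": "P2", "termId": "S10"},
    {"proteinId": "P3", "termId": "S92"},
]


def blast(sequence):
    """Hypothetical placeholder for Blast; returns the sequence length."""
    return len(sequence)


def table_scan(rows, predicate=None):
    """Read a table, optionally keeping only rows that satisfy the predicate."""
    return [r for r in rows if predicate is None or predicate(r)]


def exchange(rows, key, n):
    """Route each row to one of n partitions by hashing its join key."""
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[zlib.crc32(str(r[key]).encode()) % n].append(r)
    return parts


def hash_join(build_rows, probe_rows, key):
    """Build a hash table on one input, then probe it with the other."""
    table = {}
    for b in build_rows:
        table.setdefault(b[key], []).append(b)
    return [{**b, **p} for p in probe_rows for b in table.get(p[key], [])]


def run_plan(protein, protein_term, n=2):
    terms = table_scan(protein_term, lambda t: t["termId"] == "S92")
    proteins = table_scan(protein)
    # Rows with equal proteinId land in the same partition on both sides.
    term_parts = exchange(terms, "proteinId", n)
    protein_parts = exchange(proteins, "proteinId", n)

    def run_partition(i):
        joined = hash_join(term_parts[i], protein_parts[i], "proteinId")
        # op_call (Blast), followed by the final projection.
        return [(r["proteinId"], blast(r["sequence"])) for r in joined]

    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(run_partition, range(n)))
    return sorted(row for part in results for row in part)
```

Because both inputs are partitioned on the join key, each partition can join its share independently, so the result does not depend on the number of partitions.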
6 Evaluation on the Grid
- OGSA-DQP builds on OGSA-DAI
  - accesses relational databases wrapped by OGSA-DAI
  - Oracle, DB2, MySQL
- Data streams between nodes
  - flow control
- All services are OGSI-compliant
  - built on GT3
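The slide does not say how flow control is implemented, so the following is only a generic sketch of the idea: a bounded buffer between a producing and a consuming operator, where the producer blocks once the buffer is full and so cannot run ahead of a slower consumer. It uses Python's standard `queue.Queue`; the function names and the `buffer_size` parameter are illustrative assumptions, and threads stand in for separate nodes.

```python
import queue
import threading

END = object()  # sentinel marking the end of the stream


def producer(rows, channel):
    """Send rows downstream; put() blocks while the buffer is full."""
    for row in rows:
        channel.put(row)
    channel.put(END)


def consumer(channel, received):
    """Pull rows until the end-of-stream sentinel arrives."""
    while True:
        row = channel.get()
        if row is END:
            break
        received.append(row)


def stream(rows, buffer_size=2):
    """Stream rows through a bounded buffer of at most buffer_size items."""
    channel = queue.Queue(maxsize=buffer_size)
    received = []
    sender = threading.Thread(target=producer, args=(rows, channel))
    receiver = threading.Thread(target=consumer, args=(channel, received))
    sender.start()
    receiver.start()
    sender.join()
    receiver.join()
    return received
```

A small `buffer_size` bounds the memory held in flight between the two operators, at the cost of more frequent stalls for the producer.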
7 Execution on the Grid
8 Mutual Benefit
The Grid needs DQP:
- Declarative, high-level resource integration with implicit parallelism
- Cost-based optimisation
DQP needs the Grid:
- Systematic access to remote data and computational resources
- Dynamic resource discovery and allocation
9 Summary
- DQP is a potentially important technology for the Grid
- OGSA-DQP supports:
  - declarative expression of queries
  - location transparency
  - access to both data and computational resources
  - dynamic deployment on Grid resources
  - implicit parallelism
- First release made in September 2003
  - available for download
- Dynamic adaptation now being investigated
  - fault-tolerance, performance, cost
10 Experiences and Issues
- Remote service deployment not yet available for Grids, but some work…
- PhD Project at Newcastle (Chris Fowler)
  - dynamically deploy individual services remotely
  - initial prototype by end of November 2003
  - working on security issues
  - WS only
- GridShed project (Newcastle + BT)
  - design of hosting environments for Grids
  - install execution images on nodes as required
11 Experiences & Issues
- DQP vs Workflow?
  - for what space of problems is each better?
- DQP advantages?
  - declarative expression of intent
  - cost-based choice of execution plans
  - implicit parallelisation
- Investigating with Bioinformatics applications in the myGrid project
  - DQP with workflows & workflows with DQP
12 Projects/Sponsors
Projects: OGSA-DAI, Polar, Polar*, myGrid
Sponsors: [logos in the original slide]