Extensible Framework for Data Access & Integration Malcolm Atkinson Director 10th November 2004
Database Growth PDB Content Growth
Wellcome Trust: Cardiovascular Functional Genomics Glasgow Edinburgh Leicester Oxford London Netherlands Shared data Public curated data BRIDGES IBM
Biochemical Pathway Simulator (Computing Science, Bioinformatics, Beatson Cancer Research Labs) DTI Bioscience Beacon Project Harnessing Genomics Programme Slide from Muffy Calder, Glasgow Now largest EU project in the Life Sciences – see Walter Kolch
eDiaMoND – Compute Mammograms have different appearances, depending on image settings and acquisition systems Standard Mammo Format Temporal mammography Computer Aided Detection 3D View Provided by eDiaMoND project: Prof. Sir Mike Brady et al.
Automatic registration technology Rigid registration of MR and CT images of the head Inter-subject image warping Provided by IXI project: Prof. Derek Hill et al.
Move Computation to Data
Code scale
  Depends on wet-ware
  No noticeable rate of improvement
Data scale
  Grows Moore’s Law or Moore’s Law 2
Analysis of data
  Extracts & derivatives used
  Often smaller: more value for current investigation
Implies move code to data
  SQL, XQuery, Java code, …
  Extensibility mechanisms used by OGSA-DAIers
  Java mobility (e.g. DataCutter), database procedures, …
Increasingly necessary
  Application control or higher-level service decisions
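The contrast between moving data and moving code can be sketched with a small, self-contained example. This is only an illustration under stated assumptions: it uses Python's standard `sqlite3` module with an in-memory database as a stand-in for a remote data resource, and the table `readings` and all function names are hypothetical, not part of OGSA-DAI.

```python
import sqlite3


def build_demo_db(n_rows=10_000):
    """Create an in-memory table standing in for a large remote data resource."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sample_id INTEGER, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        ((i % 10, float(i)) for i in range(n_rows)),
    )
    return conn


def mean_by_moving_data(conn, sample_id):
    """Move data to code: fetch every matching row, then average on the client."""
    rows = conn.execute(
        "SELECT value FROM readings WHERE sample_id = ?", (sample_id,)
    ).fetchall()
    mean = sum(value for (value,) in rows) / len(rows)
    return mean, len(rows)  # second item: number of rows shipped to the client


def mean_by_moving_code(conn, sample_id):
    """Move code to data: send the aggregate as SQL, receive a single row."""
    rows = conn.execute(
        "SELECT AVG(value) FROM readings WHERE sample_id = ?", (sample_id,)
    ).fetchall()
    return rows[0][0], len(rows)  # only one row crosses the boundary
```

Both functions return the same mean, but the second ships one derived row instead of every matching row, which is the "extracts & derivatives" point made on the slide.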
Integration is Everything
Motivation
  No business or research team is satisfied with one data resource
Data Curation Expertise Human Centred Integration
Human centred
  Domain-specialist driven
  Dynamic specification of combination function
  Iterative processes
    Revised request minutes later
    Revised request after months of thought
Sources inevitably heterogeneous
  Time-varying content, structure & policies
Robust, stable steerable integration services
  Higher-level services over multiple resources
  Fundamental requirements for (re)negotiation
Federation or Virtualisation preceding integration, or kit of integration tools to be interwoven with an application?
OGSA Infrastructure Architecture
Data Intensive X Scientists
Data Intensive Applications for Science X
Simulation, Analysis & Integration Technology for Science X
Generic Virtual Data Access and Integration Layer
Virtual Integration Architecture
OGSA: Job Submission, Brokering, Workflow, Registry, Banking, Authorisation, Data Transport, Resource Usage, Transformation
OGSA-DAI
Structured Data Integration
Structured Data Access
Structured Data: Relational, XML, Semi-structured
Grid or Web Service Infrastructure
Compute, Data & Storage Resources
Distributed
OGSA-DAI
Database (Xindice, MySQL, Oracle, DB2)
Request to Registry for sources of data about “x”
Registry responds with Factory handle
Request to Factory for access to database
Factory creates GridDataService
Factory returns handle of GDS to client
Client queries GDS with SQL, XPath, XQuery, etc.
GDS interacts with database
Query results returned as XML, OR delivered to consumer as XML
Participants: Analyst, Registry (GDSR), Factory (GDSF), Grid Data Service (GDS), Consumer
Interaction types: SOAP/HTTP, service creation, API interactions
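The registry, factory and data-service sequence on this slide can be mimicked in a few lines. The sketch below is not the OGSA-DAI API: the classes `Registry`, `Factory` and `GridDataService`, their methods, and the demo table are all hypothetical stand-ins, with an in-memory SQLite database playing the role of the backing store and plain Python calls replacing SOAP/HTTP messages.

```python
import sqlite3
from xml.etree import ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: runs a query, returns results as XML."""

    def __init__(self, conn):
        self._conn = conn

    def perform(self, sql, params=()):
        cursor = self._conn.execute(sql, params)  # GDS interacts with database
        columns = [d[0] for d in cursor.description]
        root = ET.Element("results")
        for row in cursor:
            row_el = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_el, name).text = str(value)
        return ET.tostring(root, encoding="unicode")  # results returned as XML


class Factory:
    """Hypothetical stand-in for a GDSF: creates a data service on request."""

    def __init__(self, conn):
        self._conn = conn

    def create_grid_data_service(self):
        return GridDataService(self._conn)


class Registry:
    """Hypothetical stand-in for a GDSR: maps a topic to a factory handle."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find_factory(self, topic):
        return self._factories[topic]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT, size INTEGER)")  # made-up data
    conn.executemany("INSERT INTO items VALUES (?, ?)", [("alpha", 300), ("beta", 100)])

    registry = Registry()
    registry.register("items", Factory(conn))

    factory = registry.find_factory("items")      # registry responds with factory
    gds = factory.create_grid_data_service()      # factory creates and returns GDS
    return gds.perform("SELECT name, size FROM items WHERE size > ?", (200,))
```

Calling `demo()` walks the same order of steps as the slide: look up a source, obtain a factory, create a data service, query it, and receive XML.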
OGSA-DAI Downloads
R4: 690 downloads since May 04
-Actual user downloads, not search engine crawlers
-Does not include downloads as part of GT3.2 releases
Total of 838 registered users

Release (date)     Downloads
R1.0 (Jan 03)      104
R1.5 (Feb 03)      108
R2.0 (Apr 03)      250
R2.5 (Jun 03)      291
R3.0 (Jul 03)      792
R3.1 (Feb 04)      630
Total (incl. R4)   2865

Downloads by Country – OGSA-DAI R4.0
United Kingdom   21%
China            26%
United States    13%
Japan            5%
Unknown          7%
Germany          5%
Italy            5%
Austria          2%
Australia        2%
France           3%
Taiwan           2%
Multiple tasks / request
[Figure: repeated entries, each with the fields Ident, Type, Value]
Be Direct
Double Handling costs too much
  Memory cycles, bus capacity, cache disruption, …
  Double Handling via discs pathologically bad
Data translation expensive
  Avoid
  Deliver as stored, …
  Compose
Stream
  Main memory is not big enough
  Stream or use Disk
Couple generator & consumer directly
  Stream from RAM to RAM
  Requires coupled computation execution
Breaks down boundaries and merges data, execution & transport requirements.
  Demands smart workflow enactment service & foundation services
  Models for process transformation and optimisation
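The difference between double handling and coupling a generator directly to its consumer can be illustrated with Python generators. This is a minimal sketch, not a description of any OGSA-DAI mechanism; the record layout and all function names are invented for the example.

```python
def generate_records(n):
    """Producer: yields records one at a time instead of building a full list."""
    for i in range(n):
        yield {"id": i, "value": i * i}


def keep_even_values(records):
    """Transformation stage: passes on only the even values, lazily."""
    for record in records:
        if record["value"] % 2 == 0:
            yield record["value"]


def total_with_double_handling(n):
    """Stage every intermediate result in memory before the next step runs."""
    staged = list(generate_records(n))          # first full copy
    filtered = list(keep_even_values(staged))   # second full copy
    items_held = len(staged) + len(filtered)    # items materialised in memory
    return sum(filtered), items_held


def total_with_streaming(n):
    """Couple generator and consumer directly; one record is in flight at a time."""
    return sum(keep_even_values(generate_records(n)))
```

Both paths compute the same total, but the streaming version never holds the intermediate collections, which matters once the data no longer fits in main memory.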
Take Home Message
Data Access & Integration
  Two Models: kit of parts, Virtualisation
Ubiquitous Needs
  Pervasive and growing number and diversity of data collections
  Opportunity and power to integrate and mine
OGSA-DAI
  Pioneering
  Talk by Amrey Krause - 5:15 Today
Growing Community
  Implementation
  Standards
  Users
Join the party of users, contributors & researchers