December 2009 Data Integration in Grid Environments Alex Poulovassilis, Birkbeck, U. of London
Lecture Overview Part I: Grid Computing Grid Architectures Grid Standards Part II: The ISPIDER project – a grid application in Bioinformatics The AutoMed project OGSA-DAI and OGSA-DQP middleware DAI/DQP/AutoMed interoperability in ISPIDER
Part I Grid Computing Grid Architectures Grid Standards
What is Grid Computing? the term first arose in the mid 1990s and it is also known as utility computing the world is full of computing resources connected by networks, but their distribution, heterogeneity and autonomy make it hard for such resources to be shared the development of grid computing has been motivated by the need for flexible, secure and coordinated resource sharing to solve large-scale computing problems this resource sharing is between dynamic collections of individuals, institutions and resources, which collectively form a Virtual Organisation (VO)
What is Grid Computing? resource sharing includes shared usage of hardware, software, data resources, sensor networks, etc. this sharing is necessary in order to solve in a collaborative fashion large scale computing problems arising in science, engineering and business the sharing is controlled, with providers of resources defining what may be shared and under what conditions
How did grid computing come about? arose in academia (e-science), with Ian Foster and Carl Kesselman leading the development of the Globus toolkit was then picked up by industry e.g. Sun, IBM, HP, Oracle driving forces were: (a) computationally-intensive scientific problems e.g. simulations (b) scientific problems involving huge quantities of data e.g. analysis of large data sets leading to so-called Computational Grids and Data Grids grid computing may be viewed as an extension of the WWW (information sharing) to sharing of general computing resources
The international grid community the Global Grid Forum (GGF) was formed in 2000 as a community of researchers, users and vendors aiming to exchange ideas on grid development and deployment, and to develop specifications for grid standards the Enterprise Grid Alliance (EGA) was formed in 2004 as a non-profit vendor organisation to develop grid computing in industry GGF and EGA merged in 2006 into the new Open Grid Forum there has been much funding of grid research and development in the EU, regionally, and nationally over the past decade
How is grid different from distributed computing? in grid computing, the focus is on large-scale resource sharing, requiring authentication, authorisation, resource discovery, resource scheduling and costing in grid applications, presentation services access the functionality provided by service-oriented grid middleware, which virtualises the dynamic deployment of the actual computing resources by contrast, in client-server or multi-tier applications, the presentation, application and back-end services and resources are separated, but are fixed and their deployment is known
What is a Grid Architecture?
was originally envisaged as being organised into layers:
Application
Collective: see next slide
Resource: protocols for initiation, monitoring and control of the computing and data resources
Connectivity: communication and authentication protocols
Fabric: the physical computing and data resources
the components in each layer share common characteristics and build on the services provided by lower layers
the Resource and Connectivity services can be implemented over lower-level resources at the Fabric layer and can in turn support a range of higher-level services at the Collective and Application layers
Collective Layer comprises global protocols and services that capture interactions across collections of resources e.g. directory services to search for resources by name or attributes such as type, availability, load allocation, scheduling and brokering services to request allocation of one or more resources for a specific task, and scheduling of tasks on resources monitoring and diagnostic services to monitor the execution of tasks workload management systems - for specifying and executing workflows consisting of multiple tasks accounting and payment services gathering resource usage information
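As an illustration of the directory services mentioned above, here is a minimal sketch of searching for resources by name or by attributes such as type, availability and load. The `Resource` and `Directory` classes and the node names are hypothetical and do not correspond to any particular grid middleware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str         # e.g. "cpu" or "storage"
    available: bool
    load: float       # fraction of capacity in use, 0.0 to 1.0


class Directory:
    """Toy directory service: register resources, then search by attributes."""

    def __init__(self):
        self._resources = {}

    def register(self, resource):
        self._resources[resource.name] = resource

    def lookup(self, name):
        # Search by name; returns None if the resource is unknown.
        return self._resources.get(name)

    def search(self, kind=None, max_load=1.0, only_available=True):
        # Search by attributes; least-loaded resources come first.
        hits = [r for r in self._resources.values()
                if (kind is None or r.kind == kind)
                and r.load <= max_load
                and (r.available or not only_available)]
        return sorted(hits, key=lambda r: r.load)


directory = Directory()
directory.register(Resource("node-a", "cpu", True, 0.80))
directory.register(Resource("node-b", "cpu", True, 0.20))
directory.register(Resource("store-1", "storage", True, 0.50))
directory.register(Resource("node-c", "cpu", False, 0.05))

candidates = directory.search(kind="cpu", max_load=0.9)
print([r.name for r in candidates])
```

A brokering service could take the first candidate returned by `search` as the resource to allocate for a task.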
Subsequent developments in Grid Architectures no longer a layered architecture but a service oriented architecture (SOA) e.g. the Open Grid Services Architecture services are loosely coupled peers that can interact with each other to achieve a given capability e.g. a service may extend the capabilities of another service in order to provide its own functionality a service may compose the capabilities of other services to provide higher-level functionality
This leads to a 3-tier Architecture Applications Service pool Resources the service pool is the grid middleware below it are the actual physical resources above it are the applications which access the service pool the location and nature of the actual resources is transparent to the applications thus the grid middleware allows resource virtualisation applications use services as and when needed and pay only for this usage, hence also the term utility computing
Different types of grids departmental grids: built on clusters or groups of clusters owned by one department of an enterprise enterprise grids: sharing of common resources by many departments of an enterprise partner grids: involving several partner institutions, known to each other and with common goals open grids: anyone can join and become a resource provider and/or resource user
Status of grid computing today numerous commercial departmental and enterprise grids are in operation today e.g. for drug discovery, stock market trading, integrated circuit design, enterprise resource planning also many international partner grids e.g. in high energy physics, life sciences, design and engineering, computational chemistry, astrophysics, earth sciences software to support open grids is also emerging
Grid Standards Numerous Grid products are available from various organisations and vendors these need to be able to interoperate, and this is made possible by the development of standards e.g. the Open Grid Services Architecture (OGSA) OGSA is based on Web Services web standards of relevance to the grid include: HTTP (transport), XML (data format), SOAP (message syntax), WSDL (web service definition), UDDI (web service registry), WS-Security, BPEL (workflow definition)
Grid Standards a key area in which Grid requirements have motivated new WS standards is in the representation and manipulation of state standard WSs are stateless from the point of view of the requester of the service OGSA assumes that service interfaces and service behaviours are defined as in WSRF (Web Service Resource Framework) WSRF defines how state should be modelled, accessed and managed; how services should be grouped; and how faults should be modelled Also, WS-Notification defines notification mechanisms that support subscription to, and notification of, changes to services and to their state
OGSA vs other Web Service environments apart from state, the other major difference of OGSA compared with other WS environments is that grid environments are not static: in contrast to standard WSs, grid services can be created and deployed dynamically the set of available resources and their load at any time may be highly variable, while still requiring application requirements and SLAs (service-level agreements) to be met failures to meet SLAs or occurrences of faults may require dynamic restart of executions on other alternative resources there is thus a need for monitoring of grid applications and for responding dynamically to their needs until completion
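The dynamic restart described above can be sketched as a simple failover loop. This is a toy model under stated assumptions: each "resource" is a hypothetical Python function that returns a simulated elapsed time and a result, and the SLA is reduced to a single deadline in seconds.

```python
def run_with_failover(task, resources, deadline_s):
    """Try resources in order; move on after a fault or a missed deadline."""
    attempts = []
    for resource in resources:
        try:
            elapsed, result = resource(task)
        except RuntimeError as err:
            # A fault occurred: record it and restart on the next resource.
            attempts.append((resource.__name__, f"fault: {err}"))
            continue
        if elapsed > deadline_s:
            # The SLA was not met: restart on the next resource.
            attempts.append((resource.__name__, "SLA missed"))
            continue
        attempts.append((resource.__name__, "ok"))
        return result, attempts
    raise RuntimeError(f"no resource met the SLA: {attempts}")


# Hypothetical resources with simulated behaviour.
def flaky_node(task):
    raise RuntimeError("node unreachable")


def slow_node(task):
    return 12.0, task.upper()


def fast_node(task):
    return 1.5, task.upper()


result, log = run_with_failover(
    "align", [flaky_node, slow_node, fast_node], deadline_s=5.0)
print(result)
print(log)
```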
Implementing Grid services associated with a Grid Service are a set of Service Data Elements (SDEs) these are XML documents and represent information about grid service instances, allowing their discovery and management each Grid Service port type has an associated set of SDEs different types of Grid Service are realised by providing different sets of port types
Background Reading for Part I: Grid Cafe; The EGEE project; Worldwide LHC (Large Hadron Collider) Computing Grid; Open Grid Forum; Globus toolkit
Part II The ISPIDER project – a grid application in Bioinformatics The AutoMed project OGSA-DAI and OGSA-DQP middleware DAI/DQP/AutoMed interoperability in ISPIDER
The ISPIDER Project Partners: Birkbeck, EBI, Manchester, UCL Requirements: There are vast amounts of heterogeneous proteomics data being produced via a variety of new techniques Proteomics is the study of the protein complement of the genome It is targeted at the elucidation of biological function from genomic data There is a need for interoperability between autonomous proteomics data resources And for complex analyses over integrated virtual resources
Biological data: genes, proteins, biological function [Figure: Genome: DNA sequences of 4 bases (A, C, G, T); RNA: copy of DNA sequence; Protein: sequence of 20 amino acids (each triple of RNA bases encodes an amino acid). Figure labels: a gene, permanent copy, temporary copy, product, function (job), biological processes.] This slide is adapted from Nigel Martin's Lecture Notes on Bioinformatics
Aims of ISPIDER Hence, the development of a Proteomics Grid Infrastructure, using existing proteomics resources and developing new ones; also developing new proteomics clients for querying, visualisation, workflow etc. The development of such a system is beneficial for a number of reasons: Access to more data sources yields more reliable analyses Integrating resources increases the breadth of information available for the biologist Enables new analyses to be undertaken which would have been prohibitively difficult or impossible with just the individual resources
ISPIDER Architecture
Some ISPIDER data resources gpmDB: a publicly available database with more than 2 million proteins and almost 470,000 unique peptide identifications; provides access to a wealth of peptide identifications from a range of different laboratories and instruments PEDRo: provides access to a collection of descriptions of experimental data sets in proteomics PepSeeker: developed as part of the ISPIDER project and targeted at the identification stage of the proteomics pipeline; currently holds over 50,000 proteins and 50,000 unique peptide identifications
myGrid / DQP / AutoMed Middleware myGrid: provides a workflow environment over web/grid services, allowing high-level integration of data and applications for in-silico experiments in biology OGSA-DQP: provides distributed query processing over Grid-enabled data resources AutoMed: provides heterogeneous data integration functionality over distributed data sources (the AutoMed project partners are Birkbeck and Imperial College) ISPIDER research: integration of AutoMed and DAI/DQP (topic of this lecture); also integration of AutoMed and myGrid workflows
Motivation for AutoMed Data Integration (DI) is the process of creating an integrated resource which combines data from a variety of autonomous data sources in order to support new queries and analyses the data sources may be heterogeneous in terms of their: data model, query interfaces, query processing capabilities, database schema or data exchange format, data types used, nomenclature adopted this poses several challenges, leading to several methodologies, architectures and systems being developed to support DI these aim to abstract out data transformation and aggregation logic from application programs into generic data integration software
AutoMed Supports a metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level modelling languages can be defined – so extensible with new modelling languages After a modelling language has been specified in terms of the HDM, a set of primitive schema transformations become available for schemas expressed in that language Schemas can be incrementally transformed and integrated by applying to them a sequence of primitive transformations Schemas may or may not have data associated with them: so virtual, materialised (data warehousing) or hybrid integration can be supported Transformations are accompanied by queries, allowing data and query translation between source and target schemas
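To make the idea of incremental transformation concrete, here is a minimal sketch in which a schema is a set of construct names and a pathway is a sequence of primitive steps. This is a simplification, not AutoMed's API: the primitive names (`add`, `delete`, `rename`), the construct names and the query text are illustrative assumptions, and the queries are carried as opaque strings rather than evaluated.

```python
def apply_pathway(schema, pathway):
    """Apply primitive steps in order.

    Returns the resulting set of constructs and the queries that
    accompanied the add/delete steps, keyed by (primitive, construct).
    """
    constructs = set(schema)
    queries = {}
    for op, name, arg in pathway:
        if op == "add":
            constructs.add(name)
            queries[(op, name)] = arg      # how to derive the new construct
        elif op == "delete":
            constructs.remove(name)
            queries[(op, name)] = arg      # how to recover the removed one
        elif op == "rename":
            constructs.remove(name)
            constructs.add(arg)            # arg is the new name
        else:
            raise ValueError(f"unknown primitive: {op}")
    return constructs, queries


# Hypothetical source schema and pathway towards an integrated schema.
source = {"proteinhit", "proteinhit.accession"}
pathway = [
    ("rename", "proteinhit", "protein"),
    ("rename", "proteinhit.accession", "protein.accession_num"),
    ("add", "protein.source", "[{k, 'pepseeker'} | k <- protein]"),
]

target, queries = apply_pathway(source, pathway)
print(sorted(target))
```

Keeping the query alongside each step is what later allows view definitions to be derived from the pathway.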
AutoMed Architecture Global Query Processor Global Query Optimiser Schema Evolution Tool Schema Transformation and Integration Tools Model Definition Tool Schema and Transformation Repository Model Definitions Repository Wrapper Distributed Data Sources
Global Query Processing in AutoMed We handle query language heterogeneity by translation into/from an intermediate query language, IQL A query Q expressed in a high-level query language such as SQL on a global schema S would first be translated into IQL For example, the following IQL query on a global schema retrieves all identifications for the protein with accession number ENSP : [id | {id,an} >; an=`ENSP '] View definitions are then derived from the transformation pathways between S and the data source schemas (in this case gpmDB, PEDRo and PepSeeker) These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs
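An IQL comprehension of this shape has a head (`id`), a generator that binds `{id,an}` pairs, and a filter on `an`. A rough Python analogue is a list comprehension, sketched below. The extent contents and the accession value `ENSP_EXAMPLE` are hypothetical placeholders, not data from the ISPIDER resources.

```python
# Hypothetical extent of a global-schema construct:
# (identification id, protein accession number) pairs.
extent = [
    ("ident-1", "ENSP_EXAMPLE"),
    ("ident-2", "ENSP_OTHER"),
    ("ident-3", "ENSP_EXAMPLE"),
]


def identifications_for(accession, pairs):
    # Generator: (ident, an) ranges over pairs; filter: an == accession;
    # head: project out ident.
    return [ident for (ident, an) in pairs if an == accession]


print(identifications_for("ENSP_EXAMPLE", extent))
```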
Global Query Processing in AutoMed (contd) E.g. for Q as above the reformulated query is: [id | {id,an} <- [{id2lsid [`pepseeker.proteinhit:', toString d], x}| {d,x}<- distinct [{k,x}|{k,x}<- >]] ++ [{id2lsid [`pedro.protein:', toString d], x}| {d,x} >] ++ [{id2lsid [`gpmdb.proseq:', toString d],x}| {d,x} >]; an=`ENSP ']
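The structure of the reformulated query can be mimicked in Python: the global construct becomes the concatenation (`++`) of one sub-comprehension per source, each prefixing its local keys to form a global identifier. This is a sketch under assumptions: `id2lsid` is modelled here as plain string concatenation, and the source rows and accession values are invented for illustration.

```python
def id2lsid(prefix, key):
    # Assumption: builds a global identifier from a source prefix
    # and the string form of the source-local key.
    return prefix + str(key)


def distinct(rows):
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(rows))


# Hypothetical source extents: (local key, accession number) pairs.
pepseeker = [(1, "ENSP_EXAMPLE"), (1, "ENSP_EXAMPLE"), (2, "ENSP_OTHER")]
pedro = [(7, "ENSP_EXAMPLE")]
gpmdb = [(42, "ENSP_OTHER"), (43, "ENSP_EXAMPLE")]

# View definition for the global construct, one branch per source.
global_extent = (
    [(id2lsid("pepseeker.proteinhit:", d), x) for d, x in distinct(pepseeker)]
    + [(id2lsid("pedro.protein:", d), x) for d, x in pedro]
    + [(id2lsid("gpmdb.proseq:", d), x) for d, x in gpmdb]
)

# The original query, now evaluated over source data only.
result = [ident for ident, an in global_extent if an == "ENSP_EXAMPLE"]
print(result)
```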
Global Query Processing (contd) Query optimisation then occurs One goal of this is to generate the largest possible sub-queries that can be submitted to data source Wrappers for translation into the data source query languages and evaluation by the data sources Query evaluation then follows, during which the AutoMed Evaluator submits to Wrappers sub-queries that they are able to translate into the data source query language (currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data resources) The Wrappers submit sub-queries to data sources, and translate sub-query results back into the IQL type system The Evaluator then undertakes any further necessary query evaluation to combine sub-query results
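The division of labour between evaluator and wrappers can be sketched as follows. Each wrapper translates a selection sub-query into its source's own query mechanism, so that filtering happens at the source, and the evaluator only combines the results. The class names, table name and data are hypothetical; an in-memory SQLite database and a CSV string stand in for a relational and a flat-file source.

```python
import csv
import io
import sqlite3


class SqlWrapper:
    def __init__(self, conn, table):
        self.conn = conn
        self.table = table   # trusted name; only the value is parameterised

    def select_ids(self, accession):
        # Translate the sub-query into SQL and let the source evaluate it.
        sql = f"SELECT id FROM {self.table} WHERE accession = ?"
        return [str(row[0]) for row in self.conn.execute(sql, (accession,))]


class FlatFileWrapper:
    def __init__(self, text):
        self.text = text

    def select_ids(self, accession):
        # A flat file has no query language, so the wrapper scans it.
        reader = csv.DictReader(io.StringIO(self.text))
        return [row["id"] for row in reader if row["accession"] == accession]


def evaluate(accession, wrappers):
    # Evaluator: submit the sub-query to each wrapper, then combine.
    results = []
    for wrapper in wrappers:
        results.extend(wrapper.select_ids(accession))
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proteinhit (id INTEGER, accession TEXT)")
conn.executemany("INSERT INTO proteinhit VALUES (?, ?)",
                 [(1, "ENSP_EXAMPLE"), (2, "ENSP_OTHER")])
flat = "id,accession\n7,ENSP_EXAMPLE\n8,ENSP_OTHER\n"

answer = evaluate("ENSP_EXAMPLE",
                  [SqlWrapper(conn, "proteinhit"), FlatFileWrapper(flat)])
print(answer)
```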
OGSA-DAI and OGSA-DQP OGSA-DAI (Data Access and Integration) delivers data access, transport and metadata services for the grid there are other OGSA services that focus on data derivation, consistency and replication services OGSA-DQP (Distributed Query Processing) provides services for the compilation, optimisation and distributed evaluation of queries over grid data resources accessed via OGSA-DAI
OGSA-DAI functionality provides a consistent interface to data resources regardless of the underlying technology e.g. relational (Oracle, DB2, MySQL) or XML (Xindice; eXist) OGSA-DAI extends standard Grid Services with several new port types, including Grid Data Service (GDS)
OGSA-DAI functionality Grid Data Service (GDS): accepts requests, in the form of XML documents, instructing the Grid Service instance to interact with a database in order to create, retrieve, update or delete data its primary operation is perform through which such requests are passed to the GS a request may consist of a collection of linked activities e.g. a data access, followed by a data translation, followed by a data delivery all of these can be bundled into one request in order to reduce the number of round trips required between the client and the service
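The bundling of linked activities into one request can be sketched with a small XML document. The element and attribute names below (`perform`, `dataAccess`, `from`, `to`, and so on) are hypothetical and are not the actual OGSA-DAI request schema; the sketch only shows how each activity's output can be named as the next activity's input so that one round trip carries the whole chain.

```python
import xml.etree.ElementTree as ET


def build_perform(activities):
    """Build one request from (name, input, output, body) tuples."""
    root = ET.Element("perform")
    for name, source, target, body in activities:
        element = ET.SubElement(root, name)
        if source:
            element.set("from", source)   # stream consumed by this activity
        if target:
            element.set("to", target)     # stream produced by this activity
        element.text = body
    return ET.tostring(root, encoding="unicode")


def linked(request_xml):
    """Check that each activity consumes what its predecessor produces."""
    activities = list(ET.fromstring(request_xml))
    return all(a.get("to") == b.get("from")
               for a, b in zip(activities, activities[1:]))


request = build_perform([
    ("dataAccess", None, "rows", "SELECT id, accession FROM proteinhit"),
    ("dataTranslation", "rows", "csv", "to-csv"),
    ("dataDelivery", "csv", None, "return-to-client"),
])
print(request)
print(linked(request))
```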
OGSA-DQP functionality This implements the GDS and GDT port types from OGSA-DAI and also adds two new port types: GDQS and GQES. Grid Distributed Query Service (GDQS): can interact with known registries to obtain the schemas of data resources and also information about computational resources this set-up phase occurs once in the lifetime of a GDQS instance clients can then submit a query to the GDQS via the GDS port type, using a perform call this is compiled, optimised and partitioned into a distributed query execution plan each of whose partitions will be scheduled for execution at different GQESs (see below) the GDQS uses this information to create the necessary GQES instances on their designated execution nodes, and hands over to each GQES the partition assigned to it
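The hand-over of plan partitions to evaluators can be sketched as below. This is a toy model: `Partition` and `Evaluator` are hypothetical stand-ins for plan partitions and GQES instances, and round-robin assignment is used purely for illustration in place of whatever scheduling the real optimiser performs.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Partition:
    pid: int
    operators: list


@dataclass
class Evaluator:          # stands in for a GQES instance
    node: str
    partition: Partition


def schedule(partitions, nodes):
    """Create one evaluator per partition, assigning nodes round-robin."""
    node_cycle = cycle(nodes)
    return [Evaluator(next(node_cycle), p) for p in partitions]


# Hypothetical three-partition plan and two execution nodes.
plan = [
    Partition(0, ["scan proteinhit", "select accession"]),
    Partition(1, ["scan protein", "select accession"]),
    Partition(2, ["union", "project id"]),
]
evaluators = schedule(plan, ["node-a", "node-b"])
print([(e.node, e.partition.pid) for e in evaluators])
```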
OGSA-DQP functionality Grid Query Evaluation Service (GQES): each GQES instance is an execution node in a distributed query execution plan it is responsible for that part of the execution plan allocated to it by the GDQS it implements a physical algebra over other Grid Data Services encapsulated within these other GDSs are the data resources whose schemas were imported during the GDQS set-up phase
DAI/DQP/AutoMed Interoperability Data sources are wrapped with OGSA-DAI AutoMed-DAI wrappers extract the data sources' metadata Semantic integration of data sources using transformation pathways IQL queries submitted to an integrated schema are reformulated to IQL queries on the data sources, using the transformation pathways The reformulated queries are submitted to DQP for evaluation (rather than being evaluated by AutoMed)
The AutoMed-DAI Wrapper The AutoMed-DAI wrapper requests the schema of the data source using an OGSA-DAI service The service replies with the source schema encoded in an XML response document The AutoMed-DAI wrapper creates the corresponding schema in the AutoMed repository
The AutoMed-DQP Wrapper The AutoMed-DQP wrapper has two responsibilities: it informs AutoMed of the subset of IQL that it is capable of translating into OQL it makes interactions with OGSA-DQP transparent to the remainder of the AutoMed infrastructure On receiving an IQL query, the AutoMed-DQP wrapper first translates it into the equivalent OQL query The OQL query is then sent to OGSA-DQP for evaluation The reply from OGSA-DQP is in the form of an XML response document containing the query results The AutoMed-DQP wrapper translates these results into the IQL type system, and returns the result to AutoMed's evaluator for any further necessary evaluation
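The three steps of the wrapper (translate, send, translate back) can be sketched as follows. Everything here is a simplifying assumption: the IQL query is represented as a Python dict describing a single selection, `fake_dqp` is a stub standing in for the call to OGSA-DQP, and the XML response layout (`response`, `tuple`) is invented rather than taken from the real response document format.

```python
import xml.etree.ElementTree as ET


def iql_to_oql(query):
    """Translate a restricted selection query (a dict) into an OQL string."""
    return ('select t.{project} from t in {collection} '
            'where t.{attr} = "{value}"').format(**query)


def parse_response(xml_text, field):
    """Turn the XML result document into a plain list of values."""
    root = ET.fromstring(xml_text)
    return [row.findtext(field) for row in root.iter("tuple")]


def evaluate(query, send):
    """Wrapper entry point: translate, send for evaluation, translate back."""
    oql = iql_to_oql(query)
    return parse_response(send(oql), query["project"])


def fake_dqp(oql):
    # Stub for the remote service; returns a canned response document.
    return ("<response><tuple><id>1</id></tuple>"
            "<tuple><id>43</id></tuple></response>")


query = {"collection": "proteinhit", "project": "id",
         "attr": "accession", "value": "ENSP_EXAMPLE"}
print(iql_to_oql(query))
print(evaluate(query, fake_dqp))
```

Passing the sender in as a parameter keeps the remote interaction hidden from the caller, which mirrors the transparency requirement stated above.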
Background Reading for Part II
OGSA Version 1.0 document, January 2005
"Service-Based Distributed Querying on the Grid" by Alpdemir et al., Proc. of the 1st International Conference on Service Oriented Computing, 2003, pp
"The design and implementation of grid database services in OGSA-DAI" by Antonioletti et al., Concurrency - Practice and Experience, Vol 17, No 2-4, 2005, pp