Published by Aubrey Fisher. Modified over 8 years ago.
1
Data and the Grid: From Databases to Global Knowledge Communities
Ian Foster
Argonne National Laboratory, University of Chicago
www.mcs.anl.gov/~foster
Keynote Talk, 15th Intl Conf on Scientific and Statistical Database Management, Boston, July 11, 2003
Image Credit: Electronic Visualization Lab, UIC
2
www.mcs.anl.gov/~foster ARGONNE CHICAGO
My Presentation
1) Data integration as a new opportunity
  – Driven by advances in technology & science
  – The need to discover, access, explore, analyze diverse distributed data sources
  – Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
  – The need to organize, archive, reuse, explain, and schedule scientific workflows
  – Virtual data as a unifying concept
3
It’s Easy to Forget How Different 2003 is From 1993
• Enormous quantities of data: Petabytes
  – For an increasing number of communities, gating step is not collection but analysis
• Ubiquitous Internet: 100+ million hosts
  – Collaboration & resource sharing the norm
• Ultra-high-speed networks: 10+ Gb/s
  – Global optical networks
• Huge quantities of computing: 100+ Top/s
  – Moore’s law gives us all supercomputers
4
Consequence: The Emergence of Global Knowledge Communities
• Teams organized around common goals
  – Communities: “Virtual organizations”
• With diverse membership & capabilities
  – Heterogeneity is a strength not a weakness
• And geographic and political distribution
  – No location/organization possesses all required skills and resources
• Must adapt as a function of the situation
  – Adjust membership, reallocate responsibilities, renegotiate resources
5
The Emergence of Global Knowledge Communities
6
Global Knowledge Communities Often Driven by Data: E.g., Astronomy
[Figure: number and sizes of data sets as of mid-2002, grouped by wavelength]
12 waveband coverage of large areas of the sky
Total about 200 TB data
Largest catalogs near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
7
Data Integration as a Fundamental Challenge
[Diagram: registries (R), resource managers (RM), security services, and policy services surrounding distributed data sources]
• Discovery: Many sources of data, services, computation
  – Registries organize services of interest to a community
• Access: Data integration activities may require access to, & exploration of, data at many locations
• Exploration & analysis may involve complex, multi-step workflows
• Resource management is needed to ensure progress & arbitrate competing demands
• Security & policy must underlie access & management decisions
8
Performance Requirements Demand Whole-System Management
• Assume
  – Remote data at 1 GB/s
  – 10 local bytes per remote
  – 100 operations per byte
[Diagram: wide area link (end-to-end switched lambda?) at 1 GB/s feeding a local network; parallel I/O: 10 GB/s; parallel computation: 1000 Gop/s]
Remote data >1 GByte/s achievable today (FAST, 7 streams, LA to Geneva)
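The I/O and compute figures on this slide follow from the three assumptions by multiplication. A minimal sketch of that arithmetic, reading "100 operations per byte" as operations per local byte (the variable names are illustrative, not from the talk):

```python
# Back-of-envelope sizing from the slide's three assumptions.
GB = 1e9

remote_rate = 1 * GB          # bytes/s arriving over the wide area link
local_bytes_per_remote = 10   # local bytes touched per remote byte
ops_per_byte = 100            # operations per local byte (assumed reading)

local_io_rate = remote_rate * local_bytes_per_remote   # bytes/s of parallel I/O
compute_rate = local_io_rate * ops_per_byte            # operations/s

print(f"parallel I/O: {local_io_rate / GB:.0f} GB/s")
print(f"computation:  {compute_rate / 1e9:.0f} Gop/s")
```

Both results match the diagram labels (10 GB/s and 1000 Gop/s), which is why the network, storage, and compute resources have to be provisioned together.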
9
Data Integration: Key Challenges
• Of course, familiar issues: data organization, schema definition/mediation, etc., etc.
• But also new challenges relating to dynamic, distributed communities
  – Establishment, negotiation, management, & evolution of multi-organizational federations
• And to the sheer number of resources, speed of networks, and volume of data
  – Coordination, management, provisioning, & monitoring of workflows & required resources
10
Enter Grid Technologies
• Infrastructure (“middleware”) for establishing, managing, and evolving multi-organizational federations
  – Dynamic, autonomous, domain independent
  – On-demand, ubiquitous access to computing, data, and services
• Mechanisms for creating and managing workflow within such federations
  – New capabilities constructed dynamically and transparently from distributed services
  – Service-oriented, virtualization
11
The Emergence of Open Grid Standards
[Timeline figure, 1990 to 2010 (ticks at 1990, 1995, 2000, 2005, 2010); vertical axis: increased functionality, standardization. Stage labels:]
• Custom solutions
• Globus Toolkit: de facto standard, single implementation (Internet standards)
• Open Grid Services Arch: real standards, multiple implementations (Web services, etc.)
• Managed shared virtual systems (computer science research)
12
Open Grid Services Architecture
• Service-oriented architecture
  – Key to virtualization, discovery, composition, local-remote transparency
• Leverage industry standards
  – In particular, Web services
• Distributed service management
  – A “component model for Web services”
• A framework for the definition of composable, interoperable services
“The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Foster, Kesselman, Nick, Tuecke, 2002
13
Web Services
• XML-based distributed computing technology
• Web service = a server process that exposes typed ports to the network
• Described in the Web Services Description Language (WSDL), as an XML document that contains
  – Type of message(s) the service understands & types of responses & exceptions it returns
  – “Methods” bound together as “port types”
  – Port types bound to protocols as “ports”
• A WSDL document completely defines a service and how to access it
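To make the "port type" idea concrete, here is a small sketch that lists the operations grouped under each port type in a WSDL 1.1 document. The WSDL fragment is hypothetical and heavily abbreviated (no types, bindings, or target namespace), and the service and operation names are invented for illustration.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace

# Hypothetical, abbreviated WSDL: one port type with a single operation.
WSDL_DOC = f"""
<definitions name="FileTransfer" xmlns="{WSDL_NS}">
  <message name="TransferRequest"/>
  <message name="TransferResponse"/>
  <portType name="TransferPortType">
    <operation name="startTransfer">
      <input message="TransferRequest"/>
      <output message="TransferResponse"/>
    </operation>
  </portType>
</definitions>
""".strip()


def port_types(wsdl_text):
    """Return {port type name: [operation names]} for a WSDL 1.1 document."""
    root = ET.fromstring(wsdl_text)
    ns = {"w": WSDL_NS}
    return {
        pt.get("name"): [op.get("name") for op in pt.findall("w:operation", ns)]
        for pt in root.findall("w:portType", ns)
    }


print(port_types(WSDL_DOC))
```

A full WSDL document would also carry `binding` and `service` elements, which tie each port type to a protocol and a network address.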
14
OGSA Structure
• A standard substrate: the Grid service
  – Standard interfaces and behaviors that address key distributed system issues: naming, service state, lifetime, notification
  – A Grid service is a Web service
• … supports standard service specifications
  – Agreement, data access & integration, workflow, security, policy, diagnostics, etc.
  – Target of current & planned GGF efforts
• … and arbitrary application-specific services based on these & other definitions
15
Open Grid Services Infrastructure
[Diagram: a client holds a Grid Service Handle, which handle resolution maps to a Grid Service Reference for a service implementation running in a hosting environment/runtime (“C”, J2EE, .NET, …)]
• GridService (required): data access, lifetime management
  – Explicit destruction
  – Soft-state lifetime
• Service data elements
  – Introspection: What port types? What policy? What state?
• Other standard interfaces: factory, notification, collections
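Two of the ideas on this slide, service data elements and soft-state lifetime, can be illustrated with a toy in-process model. This is a sketch only: the class and method names are invented here and are not the OGSI interface definitions, and a real Grid service would be reached over the network through its handle and reference.

```python
import time


class GridServiceSketch:
    """Toy model of a Grid service: service data plus soft-state lifetime."""

    def __init__(self, lifetime_s, clock=time.monotonic):
        self._clock = clock
        self._service_data = {}          # service data elements by name
        self._destroyed = False
        self._expires_at = clock() + lifetime_s

    def set_service_data(self, name, value):
        self._service_data[name] = value

    def find_service_data(self, name):
        """Introspection: look up one service data element by name."""
        return self._service_data.get(name)

    def keep_alive(self, lifetime_s):
        """Soft state: a client extends the lifetime, counted from now."""
        self._expires_at = self._clock() + lifetime_s

    def destroy(self):
        """Explicit destruction."""
        self._destroyed = True

    def is_alive(self):
        return not self._destroyed and self._clock() < self._expires_at


# Demo with a fake clock so the outcome is deterministic.
now = [0.0]
svc = GridServiceSketch(lifetime_s=30, clock=lambda: now[0])
svc.set_service_data("status", "pending")

now[0] = 20.0
svc.keep_alive(30)           # lifetime now runs until t = 50
now[0] = 45.0
print(svc.is_alive())        # True: the keepalive arrived in time
now[0] = 60.0
print(svc.is_alive())        # False: no further keepalive, so the service expires
```

The point of soft state is that a service whose clients disappear is reclaimed without anyone having to call `destroy`.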
16
Open Grid Services Infrastructure
GWD-R (draft-ggf-ogsi-gridservice-23)
Open Grid Services Infrastructure (OGSI)
http://www.ggf.org/ogsi-wg
Editors: S. Tuecke, ANL; K. Czajkowski, USC/ISI; I. Foster, ANL; J. Frey, IBM; S. Graham, IBM; C. Kesselman, USC/ISI; D. Snelling, Fujitsu Labs; P. Vanderbilt, NASA
February 17, 2003
“The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Foster, Kesselman, Nick, Tuecke, 2002
17
Example: Reliable File Transfer Service
[Diagram: a client requests and manages file transfer operations on a Grid service, and queries &/or subscribes to its service data; the service drives data transfer operations]
• Interfaces: Grid Service, File Transfer, Notf’n Source, Policy
• Service data elements: Performance, Policy, Faults
• Internal state: Pending
• Fault Monitor, Perf. Monitor
18
OGSA and Data Integration
• OGSI provides key enabling mechanisms for distributed data integration
  – Introspect on distributed system elements
  – Create and manage distributed state
• We need more than OGSI, of course, e.g.,
  – WS-Agreement: negotiate agreements between service provider and consumer
  – OGSA-DAI: Data Access and Integration
  – WS-Management: service management
  – Security and policy
19
OGSA Infrastructure Architecture
[Layered architecture diagram; labels:]
• Data Intensive X-ology Researchers
• Data Intensive Applications for X-ology Research
• Distributed Simulation, Analysis & Integration Technology for X-ology
• Virtual Integration Architecture
• Generic Virtual Data Access and Integration Layer
  – Structured Data Integration; Structured Data Access; Structured Data (Relational, XML, Semi-structured); Transformation
• OGSI: Interface to Grid Infrastructure
  – Registry, Job Submission, Data Transport, Resource Usage, Banking, Brokering, Workflow, Authorisation
• Compute, Data & Storage Resources
Slide Courtesy Malcolm Atkinson, UK eScience Center
20
Data as Service: OGSA Data Access & Integration
• Service-oriented treatment of data appears to have significant advantages
  – Leverage OGSI introspection, lifetime, etc.
  – Compatibility with Web services
• Standard service interfaces being defined
  – Service data: e.g., schema
  – Derive new data services from old (views)
  – Externalize to e.g. file/database format
  – Perform queries or other operations
21
Data Access & Integration Services
[Diagram: Client, Registry, Factory, Grid Data Service (GDS), and an XML / relational database, connected by SOAP/HTTP, service creation, and API interactions]
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
Slide Courtesy Malcolm Atkinson, UK eScience Center
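The registry, factory, and data service interaction can be walked through with an in-process sketch. Everything here is an assumption made for illustration: the class names, the `galaxies` table, and the use of SQLite as the backing database are not part of OGSA-DAI, and real interactions would go over SOAP/HTTP rather than direct method calls.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Manages access to one database and returns query results as XML."""

    def __init__(self, conn):
        self._conn = conn

    def query(self, sql):
        cur = self._conn.execute(sql)                 # 3b. interact with database
        cols = [d[0] for d in cur.description]
        root = ET.Element("results")
        for row in cur:
            row_el = ET.SubElement(root, "row")
            for col, value in zip(cols, row):
                ET.SubElement(row_el, col).text = str(value)
        return ET.tostring(root, encoding="unicode")  # 3c. results as XML


class Factory:
    def __init__(self, conn):
        self._conn = conn

    def create_service(self):
        return GridDataService(self._conn)            # 2b/2c. create GDS, return it


class Registry:
    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        return self._factories[topic]                 # 1b. respond with factory


# Set up a hypothetical data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE galaxies (name TEXT, redshift REAL)")
conn.execute("INSERT INTO galaxies VALUES ('NGC1', 0.02)")
registry = Registry()
registry.register("galaxies", Factory(conn))

factory = registry.find("galaxies")                   # 1a
gds = factory.create_service()                        # 2a
xml_result = gds.query("SELECT name FROM galaxies WHERE redshift < 0.1")  # 3a
print(xml_result)
```

The client never touches the database connection directly; it only sees the service created for it, which is what lets the provider apply lifetime and policy controls per client.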
22
Globus Toolkit v3 (GT3): Open Source OGSA Technology
• Implements and builds on OGSI interfaces
• Supports primary GT2 interfaces
  – Public key authentication
  – Scalable service discovery
  – Secure, reliable resource access
  – High-performance data movement (GridFTP)
• Numerous new services included or planned
  – SLA negotiation, service registry, community authorization, data access & integration, …
• Rapidly growing adoption and contributions
  – E.g., OGSA-DAI from U.K. eScience program
23
My Presentation
1) Data integration as a new opportunity
  – Driven by advances in technology & science
  – The need to discover, access, explore, analyze diverse distributed data sources
  – Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
  – The need to organize, archive, reuse, explain, & schedule scientific workflows
  – Virtual data as a unifying concept
24
Science as Workflow
• Data integration = the derivation of new data from old, via coordinated computation(s)
  – May be computationally demanding
  – The workflows used to achieve integration are often valuable artifacts in their own right
• Thus we must be concerned with how we
  – Build workflows
  – Share and reuse workflows
  – Explain workflows
  – Schedule workflows
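One common way to schedule a workflow is to treat it as a dependency graph and release steps in batches once their inputs are done. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the step names are hypothetical and the talk does not prescribe this particular scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "calibrate_a": {"fetch_a"},
    "calibrate_b": {"fetch_b"},
    "merge": {"calibrate_a", "calibrate_b"},
    "plot": {"merge"},
}


def ready_batches(deps):
    """Group steps into batches whose members can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()                      # also raises CycleError on cyclic input
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)                 # mark the whole batch as finished
    return batches


for number, batch in enumerate(ready_batches(workflow), start=1):
    print(number, batch)
```

Each batch could be dispatched to different sites; a production scheduler would additionally weigh data location, resource availability, and policy.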
25
Sloan Digital Sky Survey Production System
26
Virtual Data Concept
• Capture and manage information about relationships among
  – Data (of widely varying representations)
  – Programs (& their execution needs)
  – Computations (& execution environments)
• Apply this information to, e.g.
  – Discovery: Data and program discovery
  – Workflow: Structured paradigm for organizing, locating, specifying, & requesting data
  – Explanation: provenance
  – Planning and scheduling
  – Other uses we haven’t thought of
27
Motivations
[Diagram: Transformation, Derivation, and Data, linked by the relationships execution-of, created-by, and consumed-by/generated-by]
• “I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”
• “I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”
• “I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”
• “I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”
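The first two motivations are both graph queries over recorded derivations: one walks forward from a bad input to everything derived from it, the other walks backward from a result to the steps that produced it. A minimal sketch, assuming a hypothetical record format in which each derived dataset names its transformation and inputs (the file and transformation names are invented):

```python
from collections import deque

# Hypothetical derivation records: output -> (transformation, inputs).
derivations = {
    "calibrated.fits": ("calibrate", ["raw.fits"]),
    "catalog.tbl": ("detect", ["calibrated.fits"]),
    "clusters.tbl": ("find_clusters", ["catalog.tbl"]),
    "sky_map.png": ("render", ["raw.fits"]),
}


def downstream(dataset):
    """Everything derived, directly or indirectly, from `dataset`."""
    consumers = {}
    for output, (_, inputs) in derivations.items():
        for name in inputs:
            consumers.setdefault(name, []).append(output)
    affected, queue = [], deque([dataset])
    while queue:
        for output in consumers.get(queue.popleft(), []):
            if output not in affected:
                affected.append(output)
                queue.append(output)
    return affected


def provenance(dataset):
    """The (dataset, transformation) steps that led to `dataset`, nearest first."""
    steps = []
    if dataset in derivations:
        transformation, inputs = derivations[dataset]
        steps.append((dataset, transformation))
        for name in inputs:
            steps.extend(provenance(name))
    return steps


print(downstream("calibrated.fits"))   # what to recompute after a calibration fix
print(provenance("clusters.tbl"))      # how a result was constructed
```

The same records answer the other two motivations as lookups: whether a suitable transformation or an already computed result exists.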
28
Chimera Virtual Data System (www.griphyn.org/chimera)
• Virtual data catalog
  – Transformations, derivations, data
• Virtual data language
  – Catalog definitions
• Query tool
• Applications include browsers and data analysis applications
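The central behavior of a virtual data catalog is that a request for data is satisfied from existing results when possible and by running the recorded derivation otherwise. The toy catalog below illustrates that idea only; it is not Chimera's API or its virtual data language, and every name in it is hypothetical.

```python
class VirtualDataCatalog:
    """Toy catalog of transformations, derivations, and materialized data."""

    def __init__(self):
        self._transformations = {}   # transformation name -> callable
        self._derivations = {}       # dataset name -> (transformation name, input names)
        self._materialized = {}      # dataset name -> value
        self.runs = 0                # number of transformations actually executed

    def add_transformation(self, name, fn):
        self._transformations[name] = fn

    def add_derivation(self, dataset, transformation, *inputs):
        self._derivations[dataset] = (transformation, inputs)

    def add_data(self, dataset, value):
        self._materialized[dataset] = value

    def request(self, dataset):
        """Return the dataset, deriving it (and its inputs) only if needed."""
        if dataset in self._materialized:
            return self._materialized[dataset]
        transformation, inputs = self._derivations[dataset]
        args = [self.request(name) for name in inputs]
        self.runs += 1
        value = self._transformations[transformation](*args)
        self._materialized[dataset] = value
        return value


catalog = VirtualDataCatalog()
catalog.add_data("magnitudes", [3, 1, 2])
catalog.add_transformation("sort", sorted)
catalog.add_transformation("last", lambda values: values[-1])
catalog.add_derivation("sorted_magnitudes", "sort", "magnitudes")
catalog.add_derivation("largest", "last", "sorted_magnitudes")

print(catalog.request("largest"), catalog.runs)   # derives two datasets on demand
print(catalog.request("largest"), catalog.runs)   # second request reuses the result
```

Because derivations are recorded separately from results, the same catalog entry can serve discovery, on-demand generation, and provenance queries.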
29
Chimera Virtual Data Schema
[Schema diagram; surviving label: “Metadata describes”]
30
Virtual Data in CMS HEP Analysis
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
[Diagram: tree of virtual data products labeled by accumulated parameters: mass = 200; decay = WW, ZZ, or bb; stability = 1 or 3 (under decay = WW); with event = 8 and plot = 1 products derived at each level]
Define a virtual data space for exploration by other scientists
Knowledge capture
31
Virtual Data in CMS HEP Analysis
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
[Diagram: tree of virtual data products labeled by accumulated parameters: mass = 200; decay = WW, ZZ, or bb; stability = 1 or 3 (under decay = WW); with event = 8 and plot = 1 products derived at each level]
Search for WW decays of the Higgs Boson for which only stable, final state particles are recorded? stability = 1
Knowledge capture; On-demand data gen.; Workload mgmt
32
Virtual Data in CMS HEP Analysis
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
[Diagram: tree of virtual data products labeled by accumulated parameters: mass = 200; decay = WW, ZZ, or bb; stability = 1 or 3 (under decay = WW); with event = 8 and plot = 1 products derived at each level]
Search for WW decays of the Higgs Boson and where only stable, final state particles are recorded: stability = 1
Scientist discovers an interesting result – wants to know how it was derived.
Knowledge capture; On-demand data gen.; Workload mgmt; Explain provenance
33
Virtual Data in CMS HEP Analysis
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
[Diagram: tree of virtual data products labeled by accumulated parameters: mass = 200; decay = WW, ZZ, or bb; stability = 1 or 3 (under decay = WW); with event = 8 and plot = 1 products derived at each level; plus a new branch: mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000]
Search for WW decays of the Higgs Boson and where only stable, final state particles are recorded: stability = 1
Scientist discovers an interesting result – wants to know how it was derived.
The scientist adds a new derived data branch... and continues to investigate …
Knowledge capture; On-demand data gen.; Workload mgmt; Explain provenance; Collaboration
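The search and branching steps on these slides amount to attribute matching over a catalog of parameterized data products. A minimal sketch using the parameter labels shown in the diagram; the list-of-dicts representation and the `search` helper are assumptions for illustration, not the actual CMS or Chimera tooling.

```python
# Data products from the diagram, each described by its parameters.
products = [
    {"mass": 200, "decay": "WW", "stability": 1, "event": 8},
    {"mass": 200, "decay": "WW", "stability": 1, "plot": 1},
    {"mass": 200, "decay": "WW", "plot": 1},
    {"mass": 200, "decay": "WW", "event": 8},
    {"mass": 200, "decay": "WW", "stability": 1},
    {"mass": 200, "decay": "WW", "stability": 3},
    {"mass": 200, "decay": "WW"},
    {"mass": 200, "decay": "ZZ"},
    {"mass": 200, "decay": "bb"},
    {"mass": 200, "event": 8},
    {"mass": 200, "plot": 1},
]


def search(**criteria):
    """Return products whose parameters include every given key/value pair."""
    return [
        product for product in products
        if all(product.get(key) == value for key, value in criteria.items())
    ]


# WW decays for which only stable, final state particles are recorded.
stable_ww = search(decay="WW", stability=1)
print(len(stable_ww))

# The scientist adds a new derived branch with extra cuts, then searches again.
products.append(
    {"mass": 200, "decay": "WW", "stability": 1, "LowPt": 20, "HighPt": 10000}
)
print(len(search(decay="WW", stability=1)))
```

Because the new branch is just another catalog entry, collaborators who run the same search see it immediately.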
34
Virtual Data “Explorations” Can be Long-Lived Computations
• Production Run on the Integration Testbed
  – Simulate 1.5 million full CMS events for physics studies: ~500 sec per event on 850 MHz processor
  – 2 months continuous running across 5 testbed sites
  – Managed by a single person at the US-CMS Tier 1
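The scale of this run can be checked with quick arithmetic. The sketch assumes that two months is about 60 days and that every processor performs like the 850 MHz reference with no idle time; neither assumption comes from the slide, so the processor count is a rough estimate only.

```python
events = 1_500_000
seconds_per_event = 500          # on the 850 MHz reference processor
run_days = 60                    # assumption: "2 months" taken as about 60 days

cpu_seconds = events * seconds_per_event
cpu_days = cpu_seconds / 86_400
average_processors = cpu_days / run_days

print(f"total CPU time: {cpu_days:,.0f} processor-days")
print(f"average processors busy: {average_processors:.0f}")
```

Under these assumptions the run needs on the order of 8,700 processor-days, or roughly 145 processors kept busy continuously across the five sites.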
35
Virtual Data in Sloan Galaxy Cluster Analysis
[Figure panels: Sloan Data; DAG; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab, Michael Milligan, Yong Zhao, Chicago
36
Virtual Data in Genome Analysis
[Figure 1. GADU data flow; surviving label: DOESG Resource]
37
Bringing it All Together: A Virtual Data Grid
38
Also Very Relevant: Workflow & Web Services
[Architecture diagram; labels:]
• WF-Pilot: abstract workflow (AWF) and executable workflow (EWF) design, execution monitoring
• WF-Compiler: AWF to EWF translation, using AAV rules, semantic type checking, web service matching, query rewriting, data type conversion
• WF-Engine: scheduling and execution, web service invocation (e.g., Genbank, BLAST)
• Repositories: Abstract Task (AT) Repository; Executable Task (ET) Repository with ET schemas and user ETs; Datatype & Conversion Repository with conversion rules; Data & Parameter Ontologies
B. Ludäscher, I. Altintas, A. Gupta – http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-02-01.pdf
39
Summary
1) Data integration as a new opportunity
  – Driven by advances in technology & science
  – The need to discover, access, explore, analyze diverse distributed data sources
  – Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
  – The need to organize, archive, reuse, explain, and schedule scientific workflows
  – Virtual data as a unifying concept
40
For More Information
• The Globus Project™
  – www.globus.org
• Technical articles
  – www.mcs.anl.gov/~foster
• Open Grid Services Arch.
  – www.globus.org/ogsa
• Chimera
  – www.griphyn.org/chimera
• Global Grid Forum
  – www.ggf.org
2nd Edition: November 2003