Data and the Grid: From Databases to Global Knowledge Communities Ian Foster Argonne National Laboratory University of Chicago www.mcs.anl.gov/~foster.

Presentation transcript:

Data and the Grid: From Databases to Global Knowledge Communities
Ian Foster, Argonne National Laboratory & University of Chicago
Keynote Talk, 15th Intl Conf on Scientific and Statistical Database Management, Boston, July 11, 2003
Image Credit: Electronic Visualization Lab, UIC

2 ARGONNE  CHICAGO
My Presentation
1) Data integration as a new opportunity
   - Driven by advances in technology & science
   - The need to discover, access, explore, analyze diverse distributed data sources
   - Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
   - The need to organize, archive, reuse, explain, and schedule scientific workflows
   - Virtual data as a unifying concept

3 It's Easy to Forget How Different 2003 Is From 1993
- Enormous quantities of data: petabytes
  - For an increasing number of communities, the gating step is not collection but analysis
- Ubiquitous Internet: 100+ million hosts
  - Collaboration & resource sharing the norm
- Ultra-high-speed networks: 10+ Gb/s
  - Global optical networks
- Huge quantities of computing: 100+ Top/s
  - Moore's law gives us all supercomputers

4 Consequence: The Emergence of Global Knowledge Communities
- Teams organized around common goals
  - Communities: "virtual organizations"
- With diverse membership & capabilities
  - Heterogeneity is a strength, not a weakness
- And geographic and political distribution
  - No location/organization possesses all required skills and resources
- Must adapt as a function of the situation
  - Adjust membership, reallocate responsibilities, renegotiate resources

5 The Emergence of Global Knowledge Communities

6 Global Knowledge Communities Often Driven by Data: E.g., Astronomy
- Number & sizes of data sets as of mid-2002, grouped by wavelength
- 12-waveband coverage of large areas of the sky
- Total: about 200 TB of data
- Largest catalogs: near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins

7 Data Integration as a Fundamental Challenge
[Diagram: distributed resources (R) and services coordinated across a community]
- Discovery: many sources of data, services, computation; registries organize services of interest to a community
- Access: data integration activities may require access to, and exploration of, data at many locations
- Exploration & analysis: may involve complex, multi-step workflows
- Resource management (RM): needed to ensure progress & arbitrate competing demands
- Security & policy services: must underlie access & management decisions

8 Performance Requirements Demand Whole-System Management
- Assume:
  - Remote data at 1 GB/s
  - 10 local bytes per remote byte
  - 100 operations per byte
- Implied pipeline: wide-area link (end-to-end switched lambda?) at 1 GB/s; parallel I/O at 10 GB/s on the local network; parallel computation at 1000 Gop/s
- >1 GByte/s achievable today (FAST, 7 streams, LA to Geneva)
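The slide's assumptions determine the downstream requirements directly; a quick back-of-envelope check using only the numbers stated on the slide:

```python
# Whole-system sizing from the slide's assumptions.
remote_rate_gb_s = 1          # remote data arrives at 1 GB/s
local_bytes_per_remote = 10   # 10 local bytes touched per remote byte
ops_per_byte = 100            # 100 operations per byte processed

local_io_gb_s = remote_rate_gb_s * local_bytes_per_remote   # parallel I/O needed
compute_gop_s = local_io_gb_s * ops_per_byte                # parallel computation needed

print(local_io_gb_s)   # 10 GB/s of parallel I/O
print(compute_gop_s)   # 1000 Gop/s of parallel computation
```

The point of the slide is that no single component dominates: the network, storage, and compute tiers must be provisioned together.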

9 Data Integration: Key Challenges
- Of course, familiar issues: data organization, schema definition/mediation, etc.
- But also new challenges relating to dynamic, distributed communities
  - Establishment, negotiation, management, & evolution of multi-organizational federations
- And to the sheer number of resources, speed of networks, and volume of data
  - Coordination, management, provisioning, & monitoring of workflows & required resources

10 Enter Grid Technologies
- Infrastructure ("middleware") for establishing, managing, and evolving multi-organizational federations
  - Dynamic, autonomous, domain independent
  - On-demand, ubiquitous access to computing, data, and services
- Mechanisms for creating and managing workflow within such federations
  - New capabilities constructed dynamically and transparently from distributed services
  - Service-oriented, virtualization

11 The Emergence of Open Grid Standards
[Timeline through 2010; vertical axis: increased functionality & standardization]
- Custom solutions, built on Internet standards
- Globus Toolkit: de facto standard, single implementation
- Open Grid Services Architecture (building on Web services, etc.): real standards, multiple implementations
- Managed shared virtual systems: computer science research

12 Open Grid Services Architecture
- Service-oriented architecture
  - Key to virtualization, discovery, composition, local-remote transparency
- Leverage industry standards
  - In particular, Web services
- Distributed service management
  - A "component model for Web services"
- A framework for the definition of composable, interoperable services
"The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration", Foster, Kesselman, Nick, Tuecke, 2002

13 Web Services
- XML-based distributed computing technology
- Web service = a server process that exposes typed ports to the network
- Described by the Web Services Description Language (WSDL), an XML document that contains:
  - The type of message(s) the service understands & the types of responses & exceptions it returns
  - "Methods" bound together as "port types"
  - Port types bound to protocols as "ports"
- A WSDL document completely defines a service and how to access it
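To make the WSDL vocabulary concrete, here is a minimal, hypothetical WSDL fragment (the service name and operation are invented) and a sketch that lists its port types and operations using Python's standard XML parser:

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical WSDL fragment: one portType with one operation.
WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="Echo">
  <portType name="EchoPortType">
    <operation name="echo">
      <input message="tns:EchoRequest"/>
      <output message="tns:EchoResponse"/>
    </operation>
  </portType>
</definitions>"""

NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}
root = ET.fromstring(WSDL)
for pt in root.findall("wsdl:portType", NS):
    # Each portType groups "methods" (operations) with typed inputs/outputs.
    ops = [op.get("name") for op in pt.findall("wsdl:operation", NS)]
    print(pt.get("name"), ops)   # EchoPortType ['echo']
```

A real WSDL document would additionally define the message types and bind the port type to a concrete protocol (e.g., SOAP over HTTP) as a port.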

14 OGSA Structure
- A standard substrate: the Grid service
  - Standard interfaces and behaviors that address key distributed system issues: naming, service state, lifetime, notification
  - A Grid service is a Web service
- ... that supports standard service specifications
  - Agreement, data access & integration, workflow, security, policy, diagnostics, etc.
  - Target of current & planned GGF efforts
- ... and arbitrary application-specific services based on these & other definitions

15 Open Grid Services Infrastructure
[Diagram: a client resolves a Grid Service Handle to a Grid Service Reference, then invokes the service implementation]
- Required GridService port type, plus other standard interfaces: factory, notification, collections
- Service data elements expose state for introspection: What port types? What policy? What state?
- Lifetime management: explicit destruction and soft-state lifetime
- Hosting environment/runtime ("C", J2EE, .NET, ...)
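A toy Python sketch of the required GridService behaviors named on this slide — service data, introspection, soft-state lifetime, and explicit destruction. This is illustrative only: the class and method names are invented and do not reflect the actual OGSI interface definitions.

```python
import time

class GridService:
    """Toy sketch of OGSI's required GridService behaviors (names invented):
    service data elements, introspection, soft-state lifetime, destruction."""

    def __init__(self, ttl_seconds):
        self.service_data = {}                     # named service data elements
        self.expiry = time.time() + ttl_seconds    # soft-state lifetime lease

    def find_service_data(self, name):
        """Introspection: clients query named service data elements."""
        return self.service_data.get(name)

    def extend_lifetime(self, ttl_seconds):
        """Keep-alive: renewing the lease prevents soft-state expiry."""
        self.expiry = time.time() + ttl_seconds

    def is_alive(self):
        return time.time() < self.expiry

    def destroy(self):
        """Explicit destruction, as opposed to letting the lease lapse."""
        self.expiry = 0.0

svc = GridService(ttl_seconds=60)
svc.service_data["portTypes"] = ["GridService", "NotificationSource"]
print(svc.find_service_data("portTypes"))
svc.destroy()
print(svc.is_alive())   # False after explicit destruction
```

The soft-state idea is the key design choice: if a client crashes and stops renewing, the service eventually reclaims itself without any distributed garbage-collection protocol.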

16 Open Grid Services Infrastructure
Open Grid Services Infrastructure (OGSI), GWD-R (draft-ggf-ogsi-gridservice-23), February 17, 2003
Editors: S. Tuecke (ANL), K. Czajkowski (USC/ISI), I. Foster (ANL), J. Frey (IBM), S. Graham (IBM), C. Kesselman (USC/ISI), D. Snelling (Fujitsu Labs), P. Vanderbilt (NASA)
"The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration", Foster, Kesselman, Nick, Tuecke, 2002

17 Example: Reliable File Transfer Service
[Diagram: a client requests and manages file transfer operations; the service maintains internal "Pending File Transfer" state]
- Service data elements expose performance, policy, and fault information
- Clients (e.g., fault and performance monitors) query &/or subscribe to service data via a notification source
- Policy interfaces govern the underlying data transfer operations
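The query-and-subscribe pattern in this slide can be sketched in a few lines of Python. The class and status names below are invented for illustration; the real Reliable File Transfer service exposes this through OGSI service data and notifications.

```python
class FileTransferService:
    """Toy model of the RFT pattern: clients request transfers, then
    subscribe to status updates instead of polling (names invented)."""

    def __init__(self):
        self.status = "Pending"     # internal transfer state
        self.subscribers = []

    def subscribe(self, callback):
        """Client registers interest in status service data."""
        self.subscribers.append(callback)

    def set_status(self, status):
        """State change (e.g. Pending -> Active -> Done) notifies subscribers."""
        self.status = status
        for cb in self.subscribers:
            cb(status)

seen = []
svc = FileTransferService()
svc.subscribe(seen.append)       # a monitor subscribes to updates
svc.set_status("Active")
svc.set_status("Done")
print(seen)   # ['Active', 'Done']
```

The same structure serves both the fault monitor and the performance monitor in the diagram: each is just another subscriber to the relevant service data element.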

18 OGSA and Data Integration
- OGSI provides key enabling mechanisms for distributed data integration
  - Introspect on distributed system elements
  - Create and manage distributed state
- We need more than OGSI, of course, e.g.:
  - WS-Agreement: negotiate agreements between service provider and consumer
  - OGSA-DAI: Data Access and Integration
  - WS-Management: service management
  - Security and policy

19 OGSA Infrastructure Architecture
[Layered diagram: data-intensive X-ology researchers and applications sit atop a generic virtual data access and integration layer (structured data access, integration, and transformation over relational, XML, and semi-structured data); OGSI provides the interface to Grid infrastructure services (registry, job submission, data transport, resource usage, banking, brokering, workflow, authorisation) running over compute, data & storage resources]
Slide courtesy Malcolm Atkinson, UK eScience Center

20 Data as Service: OGSA Data Access & Integration
- Service-oriented treatment of data appears to have significant advantages
  - Leverage OGSI introspection, lifetime, etc.
  - Compatibility with Web services
- Standard service interfaces being defined:
  - Service data: e.g., schema
  - Derive new data services from old (views)
  - Externalize to, e.g., file/database format
  - Perform queries or other operations

21 ARGONNE  CHICAGO 1a. Request to Registry for sources of data about “x” 1b. Registry responds with Factory handle 2a. Request to Factory for access to database 2c. Factory returns handle of GDS to client 3a. Client queries GDS with XPath, SQL, etc 3b. GDS interacts with database 3c. Results of query returned to client as XML SOAP/HTTP service creation API interactions RegistryFactory 2b. Factory creates GridDataService to manage access Grid Data Service Client XML / Relationa l database Data Access & Integration Services Slide Courtesy Malcolm Atkinson, UK eScience Center

22 Globus Toolkit v3 (GT3): Open Source OGSA Technology
- Implements and builds on OGSI interfaces
- Supports primary GT2 interfaces
  - Public key authentication
  - Scalable service discovery
  - Secure, reliable resource access
  - High-performance data movement (GridFTP)
- Numerous new services included or planned
  - SLA negotiation, service registry, community authorization, data access & integration, ...
- Rapidly growing adoption and contributions
  - E.g., OGSA-DAI from the U.K. eScience program

23 My Presentation
1) Data integration as a new opportunity
   - Driven by advances in technology & science
   - The need to discover, access, explore, analyze diverse distributed data sources
   - Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
   - The need to organize, archive, reuse, explain, & schedule scientific workflows
   - Virtual data as a unifying concept

24 Science as Workflow
- Data integration = the derivation of new data from old, via coordinated computation(s)
  - May be computationally demanding
  - The workflows used to achieve integration are often valuable artifacts in their own right
- Thus we must be concerned with how we:
  - Build workflows
  - Share and reuse workflows
  - Explain workflows
  - Schedule workflows
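Scheduling, the last concern on this slide, reduces at its core to ordering a DAG of tasks so that each runs only after its inputs exist. A minimal sketch using Python's standard library (the task names are invented):

```python
from graphlib import TopologicalSorter

# A workflow as a DAG: each task maps to the tasks it depends on.
# (Toy example; task names are invented.)
workflow = {
    "calibrate": [],
    "extract":   ["calibrate"],
    "analyze":   ["extract"],
    "plot":      ["analyze", "extract"],
}

# A valid schedule is any topological order of the DAG.
order = list(TopologicalSorter(workflow).static_order())
print(order)   # calibrate first, plot last
```

A real workflow planner (e.g., one targeting a Grid) layers resource selection and data movement on top of this ordering, but the dependency structure is the same.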

25 Sloan Digital Sky Survey Production System

26 Virtual Data Concept
- Capture and manage information about relationships among:
  - Data (of widely varying representations)
  - Programs (& their execution needs)
  - Computations (& execution environments)
- Apply this information to, e.g.:
  - Discovery: data and program discovery
  - Workflow: structured paradigm for organizing, locating, specifying, & requesting data
  - Explanation: provenance
  - Planning and scheduling
  - Other uses we haven't thought of

27 Motivations
[Diagram: a Transformation is executed as a Derivation; Data is created by the execution of a derivation, and consumed by / generated by further derivations]
- "I've detected a calibration error in an instrument and want to know which derived data to recompute."
- "I've come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes."
- "I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won't have to write one from scratch."
- "I want to apply an astronomical analysis program to millions of objects. If the results already exist, I'll save weeks of computation."

28 Chimera Virtual Data System
- Virtual data catalog
  - Transformations, derivations, data
- Virtual data language
  - Catalog definitions
- Query tool
- Applications include browsers and data analysis applications
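A toy sketch of the virtual data catalog idea: record which data each derivation consumed and produced, then answer the first motivation above ("which derived data to recompute after a calibration error?") as a transitive-closure query. The data and transformation names are invented; Chimera's actual catalog and virtual data language are far richer.

```python
# Toy virtual-data catalog: each entry records a derivation's
# transformation, inputs, and outputs (all names invented).
derivations = [
    {"transform": "calibrate", "inputs": ["raw"],     "outputs": ["calib"]},
    {"transform": "extract",   "inputs": ["calib"],   "outputs": ["catalog"]},
    {"transform": "plot",      "inputs": ["catalog"], "outputs": ["figure"]},
]

def downstream(bad):
    """All data transitively derived from `bad` - i.e., what must be
    recomputed if `bad` turns out to be wrong."""
    tainted, frontier = set(), {bad}
    while frontier:
        nxt = set()
        for d in derivations:
            if frontier & set(d["inputs"]):
                nxt.update(o for o in d["outputs"] if o not in tainted)
        tainted |= nxt
        frontier = nxt
    return tainted

print(sorted(downstream("raw")))   # ['calib', 'catalog', 'figure']
```

Run in the other direction (from outputs back to inputs), the same catalog answers the provenance question: how was this result derived?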

29 Chimera Virtual Data Schema
[Schema diagram: metadata describes the transformations, derivations, and data in the catalog]

30-33 Virtual Data in CMS HEP Analysis
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
[Animated diagram: a space of virtual data products, each labeled by parameters such as mass = 200; decay = WW, ZZ, or bb; stability = 1 or 3; event = 8; plot = 1]
- Define a virtual data space for exploration by other scientists: knowledge capture
- Search for WW decays of the Higgs boson for which only stable, final-state particles are recorded (stability = 1): on-demand data generation & workload management
- A scientist discovers an interesting result and wants to know how it was derived: explain provenance
- The scientist adds a new derived data branch (e.g., mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = ...) and continues to investigate: collaboration

34 Virtual Data "Explorations" Can Be Long-Lived Computations
- Production run on the integration testbed
  - Simulate 1.5 million full CMS events for physics studies: ~500 sec per event on an 850 MHz processor
  - 2 months of continuous running across 5 testbed sites
  - Managed by a single person at the US-CMS Tier 1

35 Virtual Data in Sloan Galaxy Cluster Analysis
[Diagram: Sloan data feeds a galaxy cluster size distribution DAG]
Jim Annis, Steve Kent, Vijay Sehkri (Fermilab); Michael Milligan, Yong Zhao (Chicago)

36 Virtual Data in Genome Analysis
[Figure 1: GADU data flow, using DOESG resources]

37 Bringing It All Together: A Virtual Data Grid

38 Also Very Relevant: Workflow & Web Services
B. Ludäscher, I. Altintas, A. Gupta
[Diagram: workflow design and execution pipeline. A WF-Pilot supports abstract workflow (AWF) design against an Abstract Task (AT) repository with data & parameter ontologies; a WF-Compiler translates AWF to executable workflow (EWF) via web service matching, query rewriting, semantic type checking, and data type conversion, drawing on Executable Task (ET) and datatype/conversion repositories (e.g., Genbank, BLAST invoked as web services); a WF-Engine handles scheduling, execution, and monitoring]

39 Summary
1) Data integration as a new opportunity
   - Driven by advances in technology & science
   - The need to discover, access, explore, analyze diverse distributed data sources
   - Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
   - The need to organize, archive, reuse, explain, and schedule scientific workflows
   - Virtual data as a unifying concept

40 For More Information
- The Globus Project™
- Technical articles
- Open Grid Services Architecture
- Chimera
- Global Grid Forum
2nd Edition: November 2003