The Global Storage Grid
Or, Managing Data for “Science 2.0”
Ian Foster, Computation Institute, Argonne National Lab & University of Chicago

2 “Web 2.0”
- Software as services: data- & computation-rich network services
- Services as platforms: easy composition of services to create new capabilities (“mashups”), which themselves may be made accessible as new services
- Enabled by massive infrastructure buildout:
  - Google projected to spend $1.5B on computers, networks, and real estate in 2006
  - Dozens of others are spending substantially
- Paid for by advertising
(Declan Butler, Nature)

3 Science 2.0: E.g., Cancer Bioinformatics Grid (caBIG)
Figure: a BPEL engine at uchicago.edu takes a <BPEL Workflow Doc> and <Workflow Inputs>, invokes analytic services at osu.edu and duke.edu against a data service, and returns <Workflow Results>.
BPEL work: Ravi Madduri et al.
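The composition pattern in this figure (a workflow engine feeding data from one hosted analytic service to the next) can be sketched in a few lines. The endpoints below are hypothetical placeholders, and the real caBIG services are WSRF/BPEL-based rather than plain JSON-over-HTTP; this only illustrates the call sequence.

```python
# Sketch of the call sequence in the caBIG figure: a client-side "workflow
# engine" feeds the output of one analytic service into the next.
# The endpoint URLs are hypothetical placeholders, not real caBIG services.
import json
import urllib.request

def invoke(service_url: str, payload: dict) -> dict:
    """POST a JSON document to an analytic service and return its JSON reply."""
    req = urllib.request.Request(
        service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def run_workflow(workflow_inputs: dict) -> dict:
    # Step 1: fetch the input data set from the data service.
    data = invoke("https://data.example.uchicago.edu/query", workflow_inputs)
    # Step 2: first analytic step (hosted at osu.edu in the figure).
    step1 = invoke("https://analytic1.example.osu.edu/run", data)
    # Step 3: second analytic step (hosted at duke.edu in the figure).
    step2 = invoke("https://analytic2.example.duke.edu/run", step1)
    return step2  # the <Workflow Results> document

if __name__ == "__main__":
    print(run_workflow({"gene": "TP53"}))
```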

4 Science 2.0: E.g., Virtual Observatories
Figure: users reach data archives, analysis tools, and discovery tools through a gateway. (S. G. Djorgovski)

5 Science 1.0 → Science 2.0: For Example, Digital Astronomy
- Science 1.0: “Tell me about this star”
- Science 2.0: “Tell me about these 20K stars”; support 1000s of users
- E.g., Sloan Digital Sky Survey, ~40 TB; others much bigger soon

6 Global Data Requirements
- Service consumer:
  - Discover data
  - Specify compute-intensive analyses
  - Compose multiple analyses
  - Publish results as a service
- Service provider:
  - Host services enabling data access/analysis
  - Support remote requests to services
  - Control who can make requests
  - Support time-varying, data- and compute-intensive workloads

7 Analyzing Large Data: “Move Computation to the Data”
- But:
  - The amount of computation can be enormous
  - Load can vary tremendously
  - Users want to compose distributed services, so data must sometimes be moved anyway
- Fortunately:
  - Networks are getting much faster (in parts)
  - Workloads can have significant locality of reference
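A rough way to frame the tradeoff is to compare the time to ship the data against the time to run the analysis at the data site. A toy cost model follows; all numbers are purely illustrative assumptions, not figures from the slide.

```python
# Toy cost model for "move computation to the data": compare shipping the data
# and computing locally against computing at the (possibly slower) data site.
# All figures here are illustrative assumptions.

def prefer_moving_computation(data_bytes: float, network_bps: float,
                              compute_seconds: float,
                              remote_slowdown: float = 2.0) -> bool:
    """True if computing at the data site (assumed remote_slowdown times
    slower) still beats transferring the data and computing locally."""
    transfer_seconds = 8 * data_bytes / network_bps
    local_total = transfer_seconds + compute_seconds
    remote_total = compute_seconds * remote_slowdown
    return remote_total < local_total

# A 40 TB survey over a 1 Gb/s path, with 10 hours of analysis:
print(prefer_moving_computation(40e12, 1e9, 10 * 3600))   # True: move the code
# A 1 GB subset over a 10 Gb/s path, same analysis:
print(prefer_moving_computation(1e9, 10e9, 10 * 3600))    # False: move the data
```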

8 “Move Computation to the Core”
Figure: a highly connected “core” of well-provisioned sites, surrounded by a poorly connected “periphery”.

9 Highly Connected “Core”: For Example, TeraGrid
- Sites include SDSC, TACC, PU, IU, ORNL, NCSA, ANL, and PSC, interconnected via Starlight, LA, and Atlanta hubs
- 16 supercomputers of 9 different types and multiple sizes; 75 teraflops (trillion calculations/s), roughly 12,500 times faster than all 6 billion humans on Earth each doing one calculation per second
- World’s fastest network: 30 gigabits/s to large sites, far beyond major university links, 30,000 times my home broadband, about one full-length feature film per second
- Globus Toolkit and other middleware provide single login, application management, data movement, and web services
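The slide’s back-of-the-envelope comparisons can be checked directly. A small worked example, assuming roughly 1 Mb/s home broadband and a ~4 GB feature film (figures not stated on the slide):

```python
# Rough arithmetic behind the slide's comparisons. The home-broadband and
# film-size figures are assumptions, not from the slide.
backbone_bps = 30e9            # 30 gigabit/s TeraGrid link
home_broadband_bps = 1e6       # assume ~1 Mb/s home broadband (2006-era)
film_bytes = 4e9               # assume a ~4 GB full-length feature film

print(backbone_bps / home_broadband_bps)   # ~30,000 x home broadband
print(backbone_bps / 8 / film_bytes)       # ~1 film per second

teraflops = 75e12              # 75 trillion calculations/s
humans = 6e9                   # 6 billion people, one calculation each per second
print(teraflops / humans)      # ~12,500 x all humans computing at once
```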

10 Open Science Grid
Figure: jobs running across Open Science Grid sites, May 7-14.

11 The Two Dimensions of Science 2.0
- Decompose across the network: clients integrate dynamically
  - Select & compose services
  - Select “best of breed” providers
  - Publish the result as new services
- Decouple resource & service providers
Figure: function vs. resource; users, data archives, analysis tools, and discovery tools linked across the network. (S. G. Djorgovski)

12 Provisioning Technology Requirements: Integration & Decomposition
- Service-oriented Grid infrastructure
  - Provision physical resources to support application workloads
- Service-oriented applications
  - Wrap applications & data as services
  - Compose services into workflows
Figure: users invoke application services, composed into workflows, over provisioned infrastructure.
“The Many Faces of IT as Service”, ACM Queue, Foster, Tuecke, 2005
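As a toy illustration of “wrap applications as services”, the sketch below puts a command-line analysis tool behind a small HTTP endpoint so it can be invoked and composed remotely. The tool name and port are hypothetical, and real GT4 services would use WSRF rather than bare HTTP; this only shows the wrapping idea.

```python
# Minimal sketch of wrapping an existing command-line tool as a service.
# "my_analysis" is a hypothetical binary; the port is arbitrary.
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class AnalysisService(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        # Run the wrapped application on the requested input.
        result = subprocess.run(
            ["my_analysis", request.get("input", "")],
            capture_output=True, text=True,
        )
        body = json.dumps({"stdout": result.stdout}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), AnalysisService).serve_forever()
```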

13 Technology Requirements: Within the Core …
- Provide “service hosting services” that allow consumers to negotiate the hosting of arbitrary data analysis services
- Dynamically manage resources (compute, storage, network) to meet diverse computational demands
- Provide strong internal security, bridging to diverse external sources of attributes

14 Globus Software Enables Grid Infrastructure
- Web service interfaces for behaviors relating to integration and decomposition
  - Primitives: resources, state, security
  - Services: execution, data movement, …
- Open source software that implements those interfaces
  - In particular, the Globus Toolkit (GT4)
- All standard Web services
  - “Grid is a use case for Web services, focused on resource management”

15 Open Source Grid Software: Globus Toolkit v4
- Execution Mgmt: Grid Resource Allocation & Management, Community Scheduling Framework, Workspace Management, Grid Telecontrol Protocol
- Data Mgmt: GridFTP, Reliable File Transfer, Data Access & Integration, Data Replication, Replica Location
- Security: Authentication & Authorization, Community Authorization, Delegation, Credential Mgmt
- Info Services: Index, Trigger, WebMDS
- Common Runtime: Java Runtime, C Runtime, Python Runtime
“Globus Toolkit Version 4: Software for Service-Oriented Systems”, LNCS 3779, 2-13, 2005

16 GT4 Data Services
- Data movement
  - GridFTP: secure, reliable, performant
  - Reliable File Transfer: managed transfers
  - Data Replication Service: managed replication
- Replica Location Service
  - Scales to 100s of millions of replicas
- Data Access & Integration services
  - Access to, and server-side processing of, structured data
Figure: disk-to-disk GridFTP performance on TeraGrid.
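One way to drive such a transfer from a script is to shell out to the GT4 globus-url-copy client. A minimal sketch, assuming the GT4 client tools and a valid proxy credential are already in place; the endpoints are placeholders:

```python
# Sketch of scripting a GridFTP transfer with the GT4 globus-url-copy client.
# Assumes GT4 client tools and a valid GSI proxy credential are installed;
# the gsiftp:// endpoints below are placeholders.
import subprocess

def gridftp_copy(src_url: str, dst_url: str) -> None:
    # globus-url-copy takes a source URL and a destination URL;
    # gsiftp:// URLs select GSI-authenticated GridFTP transfers.
    subprocess.run(["globus-url-copy", src_url, dst_url], check=True)

if __name__ == "__main__":
    gridftp_copy(
        "gsiftp://dtn.site-a.example.org/data/run42/output.dat",
        "gsiftp://dtn.site-b.example.org/replica/run42/output.dat",
    )
```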

17 Security Services
- Attribute Authority (ATA)
  - Issues signed attribute assertions (incl. identity, delegation & mapping)
- Authorization Authority (AZA)
  - Makes decisions based on assertions & policy
Figure: a VO member presents attribute and delegation assertions issued by VO attribute authorities (e.g., “User B can use Service A”, a VO-A attribute to VO-B attribute mapping), which the resource’s authorization authority evaluates against admin policy.
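A minimal illustration of the decision flow in this figure: a trusted attribute authority’s assertions (membership, delegation) are checked against local policy. The dataclass below is a stand-in for a signed SAML assertion, not the actual format.

```python
# Illustrative authorization decision in the style of the slide's figure.
# AttributeAssertion stands in for a signed SAML assertion; not the real format.
from dataclasses import dataclass

@dataclass
class AttributeAssertion:
    issuer: str        # e.g., the VO attribute authority (ATA)
    subject: str       # the user the attribute is about
    attribute: str     # e.g., "vo-member" or "can-use:ServiceA"

def authorize(request_user: str, service: str,
              assertions: list[AttributeAssertion],
              trusted_issuers: set[str]) -> bool:
    """Grant access if a trusted ATA asserts VO membership or a delegation."""
    for a in assertions:
        if a.issuer not in trusted_issuers or a.subject != request_user:
            continue
        if a.attribute in ("vo-member", f"can-use:{service}"):
            return True
    return False

# Example: User B presents a delegation assertion issued by VO A's ATA.
assertions = [AttributeAssertion("vo-a-ata", "user-b", "can-use:ServiceA")]
print(authorize("user-b", "ServiceA", assertions, {"vo-a-ata"}))  # True
```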

18 Service Hosting
Figure: a client negotiates with a resource provider across a policy interface to allocate/provision, configure, initiate, monitor, and control an activity in a hosted environment; realized via WSRF (or WS-Transfer/WS-Man, etc.), Globus GRAM, and Virtual Workspaces.
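The lifecycle verbs in this figure can be read as an abstract client-side interface. A sketch follows, with method names mirroring the slide rather than any concrete GRAM or workspace API:

```python
# Abstract sketch of the hosting lifecycle named on the slide:
# allocate/provision, configure, initiate, monitor, control.
# Method names follow the slide, not a real GT4 interface.
from abc import ABC, abstractmethod

class HostedActivity(ABC):
    @abstractmethod
    def allocate(self, cpus: int, storage_gb: int) -> None:
        """Provision resources from the resource provider."""

    @abstractmethod
    def configure(self, environment: dict) -> None:
        """Set up the execution environment for the activity."""

    @abstractmethod
    def initiate(self, activity_spec: dict) -> str:
        """Start the activity; return an identifier for later calls."""

    @abstractmethod
    def monitor(self, activity_id: str) -> str:
        """Return the current state (e.g., 'Pending', 'Active', 'Done')."""

    @abstractmethod
    def control(self, activity_id: str, action: str) -> None:
        """Apply a control action such as 'suspend' or 'terminate'."""
```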

19 Virtual OSG Clusters
Figure: an OSG virtual cluster deployed on Xen hypervisors running on a TeraGrid cluster.
“Virtual Clusters for Grid Communities”, Zhang et al., CCGrid 2006

20 Managed Storage: GridFTP with NeST (Demoware)
Figure: a GT4 GridFTP server with a NeST module in front of disk storage; file transfers arrive over GSI-FTP, while a custom application performs lot operations against the NeST server via the Chirp protocol.
Bill Allcock, Nick LeRoy, Jeff Weber, et al.

21 Science Services
1) Integrate services from other sources
   - Virtualize external services as VO services
2) Coordinate & compose
   - Create new services from existing ones
Figure: a community services provider draws on content services and hosting capacity from capacity providers.
“Service-Oriented Science”, Science, 2005

22 Virtualizing Existing Services
- Establish a service agreement with the service
  - E.g., WS-Agreement
- Delegate use to community (“VO”) users
Figure: a VO admin establishes agreements with existing services and delegates their use to VO users A and B.

23 The Globus-Based LIGO Data Grid
- LIGO Gravitational Wave Observatory
- Replicating >1 terabyte/day to 8 sites (including Birmingham, Cardiff, and AEI/Golm)
- >40 million replicas so far
- MTBF = 1 month

24 Data Replication Service
- Pulls “missing” files to a storage system
Figure: given a list of required files, the Data Replication Service consults the Local Replica Catalog and Replica Location Index (data location), drives the Reliable File Transfer Service and GridFTP (data movement), and registers the new replicas (data replication).
“Design and Implementation of a Data Replication Service Based on the Lightweight Data Replicator System”, Chervenak et al., 2005
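A simplified sketch of that pull loop, with plain dictionaries and a generic transfer callback standing in for the Replica Location Service and Reliable File Transfer interfaces (these are not the real APIs):

```python
# Simplified sketch of the Data Replication Service "pull" loop from the slide.
# The catalog/transfer arguments are placeholders for RLS and RFT interfaces.

def replicate_missing(required_files, local_catalog, replica_index,
                      transfer, local_storage_url):
    """Pull files not yet held locally and register the new replicas.

    required_files   : iterable of logical file names (LFNs)
    local_catalog    : dict LFN -> local physical URL (Local Replica Catalog)
    replica_index    : dict LFN -> list of remote physical URLs (RLI view)
    transfer(src,dst): performs the data movement (e.g., RFT/GridFTP)
    """
    for lfn in required_files:
        if lfn in local_catalog:          # 1. already have a local replica
            continue
        candidates = replica_index.get(lfn, [])
        if not candidates:                # 2. no known source; report and skip
            print(f"no replica found for {lfn}")
            continue
        src = candidates[0]               # 3. pick a source (first, for simplicity)
        dst = f"{local_storage_url}/{lfn}"
        transfer(src, dst)                # 4. move the data
        local_catalog[lfn] = dst          # 5. register the new local replica
    return local_catalog
```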

25 Data Replication Service: Dynamic Deployment
Figure: each layer is deployed dynamically: procure hardware (physical machine), deploy the hypervisor/OS, deploy a virtual machine, deploy a container (JVM), then deploy the DRS and supporting services (GridFTP, LRC, VO services). State is exposed and accessed uniformly at all levels; provisioning, management, and monitoring happen at all levels.

26 Decomposition Enables Separation of Concerns & Roles
- User: “Provide access to data D at S1, S2, S3 with performance P”
- Service provider: replica catalog, user-level multicast, …
- Resource provider: “Provide storage with performance P1, network with P2, …”

27 Example: Biology
- Public PUMA Knowledge Base: information about proteins analyzed against ~2 million gene sequences
- Back-office analysis on the Grid: millions of BLAST, BLOCKS, etc. runs on OSG and TeraGrid
Natalia Maltsev et al.

28 Genome Analysis & Database Update (GADU) on OSG
Figure: GADU running 3,000 jobs across OSG sites, April 24, 2006.

29 Example: Earth System Grid
- Climate simulation data
  - Per-collection access control
  - Different user classes
  - Server-side processing
- Implementation (Globus Toolkit)
  - Portal-based User Registration (PURSE)
  - PKI, SAML assertions
  - GridFTP, GRAM, SRM
- >2,000 users; >100 TB downloaded
DOE OASCR

30 “Inside the Core” of ESG

31 Example: Astro Portal Stacking Service
- Purpose
  - On-demand “stacks” of random locations within a ~10 TB dataset
- Challenge
  - Rapid access to K “random” files
  - Time-varying load
- Solution
  - Dynamic acquisition of compute & storage
Figure: a stacking request arrives via a web page or web service and is satisfied against the Sloan data.
Joint work with Ioan Raicu & Alex Szalay
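The co-add at the heart of the stacking service can be sketched directly. A minimal NumPy sketch follows, in which fetch_cutout is a hypothetical stand-in for the service’s access to the ~10 TB survey store:

```python
# Sketch of the "stacking" operation: co-add small image cutouts centred on
# K requested sky locations. fetch_cutout is a placeholder for the service's
# data-access step against the survey store.
import numpy as np

def fetch_cutout(ra: float, dec: float, size: int = 64) -> np.ndarray:
    """Placeholder: return a size x size cutout around (ra, dec)."""
    rng = np.random.default_rng(int(ra * 1000) ^ int(dec * 1000))
    return rng.normal(loc=0.0, scale=1.0, size=(size, size))

def stack(locations: list[tuple[float, float]]) -> np.ndarray:
    """Mean-combine cutouts so faint sources below the single-image noise
    floor become visible in the co-added result."""
    cutouts = [fetch_cutout(ra, dec) for ra, dec in locations]
    return np.mean(cutouts, axis=0)

if __name__ == "__main__":
    positions = [(180.0 + i * 0.01, 0.5) for i in range(100)]
    stacked = stack(positions)
    print(stacked.shape, stacked.std())  # noise drops roughly as 1/sqrt(K)
```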

32 Preliminary Performance (TeraGrid, LAN GPFS) Joint work with Ioan Raicu & Alex Szalay

33 Example: CyberShake
- Calculate hazard curves by generating synthetic seismograms from an estimated rupture forecast
- Pipeline: rupture forecast → strain Green tensor → synthetic seismogram → spectral acceleration → hazard curve → hazard map
Tom Jordan et al., Southern California Earthquake Center
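A schematic sketch of chaining the stages named above. Every stage is a stub that shows only the data flow, not the seismology, and all names and numbers are illustrative:

```python
# Schematic of a CyberShake-style pipeline. Every stage is a stub that
# illustrates the data flow only; the physics is not modeled.

def rupture_forecast(site):
    """Return a list of candidate ruptures (stub; site unused)."""
    return [{"magnitude": 6.5 + 0.1 * i, "rate": 1e-3 / (i + 1)} for i in range(5)]

def strain_green_tensor(site):
    """Site response operator (stub)."""
    return {"site": site, "sgt": "..."}

def synthetic_seismogram(rupture, sgt):
    """Seismogram for one rupture at one site (stub)."""
    return {"peak": rupture["magnitude"] * 0.1}

def spectral_acceleration(seismogram):
    return seismogram["peak"]

def hazard_curve(site):
    """Annual exceedance rate vs. ground-motion level for one site."""
    sgt = strain_green_tensor(site)
    ruptures = rupture_forecast(site)
    sas = [(spectral_acceleration(synthetic_seismogram(r, sgt)), r["rate"])
           for r in ruptures]
    levels = [0.2, 0.4, 0.6, 0.8]
    return {level: sum(rate for sa, rate in sas if sa >= level)
            for level in levels}

print(hazard_curve("USC"))
```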

34 Enlisting TeraGrid Resources
- 20 TB of data, 1.8 CPU-years of computation
Figure: a workflow scheduler/engine draws on a VO service catalog, provenance catalog, and data catalog, while a VO scheduler dispatches work across SCEC storage, TeraGrid compute, and TeraGrid storage.
Ewa Deelman, Carl Kesselman, et al., USC Information Sciences Institute

35 dev.globus: Community-Driven Improvement of Globus Software
- Guidelines (Apache)
- Infrastructure (CVS, bugzilla, Wiki)
- Projects include …
NSF OCI

36 Summary
- “Science 2.0” (science as service, and service as platform) demands:
  - New infrastructure: service hosting
  - New technology: hosting & management
  - New policy: hierarchically controlled access
- Data & storage management cannot be separated from computation management
  - And these increasingly become community roles
- A need for new technologies, skills, & roles
  - Creating, publishing, hosting, discovering, composing, archiving, explaining … services

37 Science 1.0 → Science 2.0
- Gigabytes → Terabytes
- Tarballs → Services
- Journals → Wikis
- Individuals → Communities
- Community codes → Science gateways
- Supercomputer centers → TeraGrid, OSG, campus
- Makefile → Workflow
- Computational science → Science as computation
- Physical sciences → All sciences (& humanities)
- Computational scientists → All scientists
- NSF-funded → NSF-funded

38 Science 2.0 Challenges
- A need for new technologies, skills, & roles
  - Creating, publishing, hosting, discovering, composing, archiving, explaining … services
- A need for substantial software development
  - “30-80% of modern astronomy projects is software” (S. G. Djorgovski)
- A need for more & different infrastructure
  - Computers & networks to host services
- Can we leverage commercial spending?
  - To some extent, but not straightforwardly

39 Acknowledgements
- Carl Kesselman, for many discussions
- Many colleagues, including those named on slides, for research collaborations and/or slides
- Colleagues involved in the TeraGrid, Open Science Grid, Earth System Grid, caBIG, and other Grid infrastructures
- Globus Alliance members for Globus software R&D
- DOE OASCR, NSF OCI, & NIH for support

40 For More Information
- Globus Alliance
- dev.globus: dev.globus.org
- Open Science Grid
- TeraGrid
- Background: 2nd Edition
Thanks to DOE, NSF, and NIH for research support!