Published by Prudence Barker. Modified over 9 years ago.
1
The Grid - Multi-Domain Distributed Computing. Kai Rasmussen and Paul Ruggieri.
2
Topic Overview. The Grid: types, virtual organizations, security, real examples. Grid tools: Condor, Cactus, Cactus-G, Globus, OGSA.
3
The Grid. What is a Grid system? A highly heterogeneous set of resources that may or may not be maintained by multiple administrative domains. Early idea: computational resources would be as universally available as electric power.
4
“A hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities” - Ian Foster. Resources are distributed across sites and organizations with no centralized point of control. What constitutes a Grid? Resources are coordinated without being subject to centralized control. It uses standard, open protocols and interfaces. It delivers non-trivial qualities of service.
5
Grid Types. Computational Grids: the resource is pure CPU; their strength is computationally intensive applications. Data Grids: shared storage and data; terabytes of storage space; sharing of data among collaborators; fault tolerance. Equipment Grids: a set of resources that surrounds shared equipment, such as a telescope.
6
Virtual Organizations. Grids are multi-domain: resources are administered by separate departments or institutions, all of which wish to maintain individual control. A cross-site grouping of collaborators sharing resources is a “Virtual Organization” (VO).
7
Virtual Organizations. Users of a VO share a common goal and trust. A VO is a collection of resources, users and rules governing sharing. Sharing is highly controlled: What is shared? Who is sharing? How can resources be used? One global domain acts over the individual collaborating domains.
8
Grid Security. Highly distributed nature: VOs are spread over many security domains. Authentication: proving identity. Authorization: obtaining privileges. Confidentiality and integrity: identity and privileges can be trusted.
9
Authentication. Certificate Authority (CA): an entity that signs a certificate proving a user’s identity; the certificate is then used as a credential to use the system. There are typically several CAs, to prevent a single point of failure or attack. Globus Grid Security Infrastructure (GSI): Globus’s authentication component. A global security credential is later mapped to local Kerberos tickets or a local username and password. Users typically generate a short-term proxy certificate from a long-term certificate.
10
Authentication. Certification Authority Coordination Group: maintains a global infrastructure of trusted CA agents. A CA must meet standards: it must be physically secure; it must validate identity with Registration Authorities using official documents or photographic identification; private keys must be a minimum of 1024 bits and have a maximum lifetime of 1 year. There are 28 approved CAs in the European Union.
11
Security Issues. Delegation: a user entrusts a separate entity to perform a task. The entity must be given a certificate and trusted to behave. Limit the proxy’s strength: endow the proxy with a specific purpose.
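The idea of limiting a proxy's strength can be illustrated with a toy model. The sketch below is not GSI and does not use real X.509 certificates; the `Credential` class, the operation names and the lifetimes are all hypothetical. It only shows the two restrictions named on the slide: a proxy gets a specific purpose (a subset of the parent's rights) and it never outlives the credential it was derived from.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Credential:
    """Toy stand-in for a certificate; not a real X.509 object."""
    subject: str
    not_after: datetime
    allowed: frozenset  # operations this credential may perform


def delegate(parent, purpose, lifetime, now):
    """Issue a proxy that is never stronger than its parent."""
    granted = frozenset(purpose) & parent.allowed     # only rights the parent holds
    expires = min(parent.not_after, now + lifetime)   # never outlives the parent
    return Credential(parent.subject + "/proxy", expires, granted)


def permits(cred, operation, now):
    """A request is honoured only while the credential is valid and in scope."""
    return now <= cred.not_after and operation in cred.allowed


now = datetime(2005, 1, 1, 12, 0)
user = Credential("CN=Jane Doe", now + timedelta(days=365),
                  frozenset({"submit_job", "read_data", "write_data"}))
# The user asks for a 12-hour proxy; "delete_data" is dropped because
# the long-term credential never had that right.
proxy = delegate(user, {"read_data", "delete_data"}, timedelta(hours=12), now)
```

With this model, a compromised proxy can only read data, and only for twelve hours, which is the motivation for short-term, purpose-specific proxies.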
12
Grid Projects. EGEE (Enabling Grids for eScience): 70 sites in over 27 countries, mostly European; 40 Virtual Organizations. The GENIUS Grid portal is used for submission. Individual collaborators use their own middleware tools to group resources.
13
LCG: Large Hadron Collider Computing Grid. Develops the distributed systems needed to support the computation and data needs of the LHC physics experiments. EGEE collaborator. 100 sites. World’s largest Grid.
15
Grid 2003. US effort: 27 national sites; 28000 processors, 13000 simultaneous jobs. Infrastructure for the Particle Physics Grid and the Virtual Data Grid Laboratory. Develops an application Grid laboratory, Grid3: a platform for experimental CS research. Built on the Virtual Data Toolkit, a collection of Globus, Condor and other middleware tools.
16
TeraGrid. 40 teraflops of computational power. 8 national sites with a strong backbone. Used for NSF-sponsored high-performance computing: mapping the human arterial tree model; TeraShake earthquake simulation.
18
Applications. Climate monitoring and simulation: Network Weather Service and Climate Data-Analysis Tool, both run on the Earth System Grid, which runs on Globus. MEANDER nowcast meteorology: run on the Hungarian Supergrid. ATLAS Challenge: simulates high-energy proton-proton collisions. Computational science simulations: biology, fluid dynamics.
19
Grid Tools. Many middleware implementations: Globus, Condor, Condor-G, Cactus-G, OGSA. They solve common Grid problems: resource discovery, management and allocation; security and authentication.
20
Condor. Initially developed in 1983 at the University of Wisconsin. A pre-Grid tool: a local resource management system. Allows the creation of communities with distributed resources. Communities should grow naturally, sharing as much or as little as they care to. Sounds like Virtual Organizations.
21
Condor Responsibilities. Job management and scheduling. Resource monitoring and management. Checkpointing and migration. Utilizing idle CPU cycles (“scavenging”).
22
Condor Pool. The full set of users and resources in a community. Composed of three entities. Agent: finds resources and executes jobs. Resource: advertises itself and how it can be used in the pool. Matchmaker: knows of all agents and resources and puts together compatible pairs. A pool is defined by a single matchmaker.
24
Matchmaking. The problem with centralized scheduling: resources have multiple owners with unique use requirements. Matchmaking finds a balance between user and resource needs. ClassAds: agents advertise their requirements; resources advertise how they can be used.
25
Matchmaking. The matchmaker scans all known ClassAds, creates matching pairs of agents and resources, and informs both parties. The parties are individually responsible for negotiating and initiating execution of the job. Separation of matching and claiming: the matchmaker is unaware of complicated allocation; stale information may exist, so a resource can deny a match.
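The matching step can be sketched in a few lines. This is a toy model, not Condor's ClassAd language: the ads are plain Python dictionaries, the `requirements` and `rank` entries are Python callables, and every attribute name and value is made up for illustration. It shows the two-way nature of a match (each side must accept the other) and stops at the point where the real system would notify both parties and leave claiming to them.

```python
def matches(job, machine):
    """Both sides must accept the other (two-way matching)."""
    return job["requirements"](machine) and machine["requirements"](job)


def matchmake(jobs, machines):
    """Pair each job with its best-ranked compatible, still-unmatched machine."""
    free = list(machines)
    pairs = []
    for job in jobs:
        candidates = [m for m in free if matches(job, m)]
        if candidates:
            best = max(candidates, key=job["rank"])
            free.remove(best)
            pairs.append((job["name"], best["name"]))
    return pairs


# Resource ads: each owner states how the machine may be used.
machines = [
    {"name": "lab1", "memory_mb": 512, "os": "LINUX",
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "lab2", "memory_mb": 2048, "os": "LINUX",
     "requirements": lambda job: True},
]
# Agent ads: each job states what it needs and what it prefers.
jobs = [
    {"name": "sim", "owner": "alice",
     "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 256,
     "rank": lambda m: m["memory_mb"]},
    {"name": "render", "owner": "guest",
     "requirements": lambda m: m["memory_mb"] >= 256,
     "rank": lambda m: m["memory_mb"]},
]
pairs = matchmake(jobs, machines)
```

Here `sim` takes `lab2` because it ranks machines by memory, and `render` stays unmatched because the only remaining machine refuses jobs owned by `guest`. A returned pair is only a suggestion: in the scheme described above, the resource may still deny the match when the agent tries to claim it.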
26
Condor Flocking. Linking Condor pools is necessary for collaboration: sharing of resources beyond the organizational level; individuals belonging to multiple communities. Gateway flocking: entire communities are linked. Direct flocking: individual collaborators belong to many pools.
27
Gateway Flocking. A gateway entity serves as a single point of access for cross-pool communication. Matchmakers talk to gateways; gateways talk to gateways. Transparent to the user. Organizational-level sharing. Powerful, but difficult to set up and maintain.
28
Gateway Flocking
29
Direct Flocking. Agents report to multiple matchmakers. Individual collaboration. A natural idea for users. Less powerful, but simpler to build and deploy. Eventually used in favor of gateway flocking.
30
Direct Flocking
31
Cactus. A general-purpose, open-source parallel computation framework. Developed for the numerical solution of Einstein’s equations. Two main components, flesh and thorns. Flesh: the central core. Thorns: application modules. Provides a simple abstract API that hides the MPI parallel driver and I/O (both implemented as thorns).
32
Cactus-G. “Grid-enabled” Cactus: combines Cactus and MPICH-G2 (more later). Layered approach: application thorns; Grid-aware infrastructure thorns; Grid-enabled communication library (MPICH-G2 in this case).
33
Globus. Condor: a pre-Grid tool applied to Grid systems; multi-domain use is possible but limited; no security; focuses primarily on resource management. Globus: a set of Grid-specific tools; extensible and hierarchical.
34
The Toolkit. Globus Toolkit: components for basic security, resource management, etc. Well-defined interfaces in an “hourglass” architecture: local services sit behind an API, and global services are built on top of these local services. The interfaces are useful for managing heterogeneity. The information service is an integral component: an information-rich environment is needed.
35
Globus Services
36
Resource Management. Globus Resource Allocation Manager (GRAM): responsible for a set of local resources in a single domain. Implemented with a set of local RM tools: Condor, NQE, Fork, Easy-LL, etc. Resource requests are expressed in the Resource Specification Language (RSL).
37
Resource Broker. Manages RSL requests. Uses information services to discover GRAMs. Transforms abstract RSL requests into more specific requirements. Sends allocation requests to the appropriate GRAM.
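A rough sketch of the broker's role, under stated assumptions: the directory records, host names and selection policy below are hypothetical, and the output string only imitates the `&(attribute=value)` conjunction style of RSL rather than covering the full language. The point is the flow: look up candidate resource managers, narrow an abstract request to a concrete one, and decide which GRAM should receive it.

```python
def to_rsl(attrs):
    """Render attributes in the '&(name=value)...' conjunction style of RSL."""
    return "&" + "".join(f"({k}={v})" for k, v in attrs.items())


def broker(request, directory):
    """Refine an abstract request into a concrete one and pick a GRAM contact."""
    suitable = [r for r in directory
                if r["arch"] == request["arch"]
                and r["free_cpus"] >= request["count"]]
    if not suitable:
        return None                                     # nothing can run the job
    site = max(suitable, key=lambda r: r["free_cpus"])  # illustrative policy only
    concrete = {"executable": request["executable"], "count": request["count"]}
    return site["gram_contact"], to_rsl(concrete)


# Hypothetical records, standing in for what an information service might return.
directory = [
    {"gram_contact": "gram.site-a.example.org", "arch": "x86", "free_cpus": 8},
    {"gram_contact": "gram.site-b.example.org", "arch": "x86", "free_cpus": 32},
    {"gram_contact": "gram.site-c.example.org", "arch": "sparc", "free_cpus": 64},
]
request = {"executable": "/bin/hostname", "arch": "x86", "count": 16}
result = broker(request, directory)
```

The abstract requirement ("any x86 resource with 16 free CPUs") is resolved here, so the chosen GRAM only sees a request it can satisfy locally.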
39
Information Service. The Grid is always in flux. An information-rich system produces information that users find useful; it enhances flexibility and performance and is a necessity for administration. Globus Metacomputing Directory Service (MDS): stores Grid information and makes it accessible. Lightweight Directory Access Protocol (LDAP): an extensible representation for information; stores component information in a directory information tree.
40
Security. Local heterogeneity: resources are operated in multiple security domains, all using different authentication techniques. N-way authentication: a job may be any number of processes on any number of resources, but it is one logical entity, so the user should only have to authenticate once.
41
Security. Grid Security Infrastructure (GSI): a modular design constructed on top of local services; solves local heterogeneity. A Globus identity is mapped into local user identities by the local GSI. Allows for n-way authentication.
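The mapping from one global identity to per-site local accounts can be sketched as a lookup table. The format below, a quoted distinguished name followed by a local user name, is modeled on the Globus grid-mapfile; the entries themselves, the site names and the helper functions are hypothetical, and real deployments involve certificate validation that this sketch leaves out entirely.

```python
import shlex


def parse_gridmap(text):
    """Parse lines of the form: "<distinguished name>" <local user>."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        dn, local_user = shlex.split(line)
        mapping[dn] = local_user
    return mapping


def local_identity(mapping, global_dn):
    """Return the local account for a global identity, or None if unmapped."""
    return mapping.get(global_dn)


# Two sites map the same global identity to different local accounts.
SITE_A = parse_gridmap('''
# hypothetical entries
"/O=Grid/OU=Example/CN=Jane Doe" jdoe
"/O=Grid/OU=Example/CN=Raj Patel" rpatel
''')
SITE_B = parse_gridmap('"/O=Grid/OU=Example/CN=Jane Doe" u1042')
```

Because each site keeps its own table, the user authenticates once with the global identity while every site keeps control over which local account, if any, that identity receives.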
43
OGSA. Open Grid Services Architecture: defines a Grid Service; provides a standard interface for naming, creating and discovering a Grid Service; location transparent. Globus Toolkit: GRAM (resource allocation and management), MDS-2 (information discovery), GSI (authentication, single sign-on). Web services: widely used; language and system independent.
44
OGSA – Grid Service Interface
45
OGSA – VO Structure
46
Condor-G. A hybrid Condor-Globus system. A local Condor agent (Condor-G) communicates with Globus GRAM, MDS, GSI, etc. Globus’s GRAM was optimized to work better with Condor.
47
Specific Testbed: Grid2003. Organized into 6 VOs (one for each application). At each VO site, middleware is installed with grid certificate databases. GSI, GRAM and GridFTP are used from Globus. MDS. MonALISA: agent-based monitoring used in conjunction with MDS.
48
MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface. Nicholas Karonis, Brian Toonen, Ian Foster.
49
Abstract. A Grid-enabled MPI implementation that extends MPICH. Utilizes the Globus Toolkit for authentication, authorization, resource allocation, executable staging, I/O, and process management (creation and control). Hides or exposes critical aspects of the heterogeneous environment.
50
The Problem. Grids are difficult to program for: heterogeneous and highly distributed. Build on an existing MPI API, specifically MPICH. Can we implement MPI constructs in a highly heterogeneous environment efficiently and transparently? Yes, use Globus! Can we also allow users to manage heterogeneity? Yes, with the existing MPI communicator construct!
51
MPICH-G2. Grid Security Infrastructure (GSI): single sign-on authentication. Monitoring and Discovery Service (MDS): selects nodes to execute on. Resource Specification Language (RSL): generated by mpirun; specifies job resource requirements. Dynamically-Updated Request Online Coallocator (DUROC).
52
MPICH-G2 Flow Diagram
53
MPICH-G2 Improvements. Replaces MPICH-G. Replaces the use of Nexus (Globus) for all communication with optimized code. Increased bandwidth: cuts out the extra layer (Nexus). Reduced intra-machine vendor MPI messaging latency: eliminates unnecessary polling based on source rank information (for Recv), using three categories: specified, specified-pending and multimethod (more later). TCP, which is expensive to poll, is polled only when necessary (i.e. when using TCP rather than vendor MPI).
54
MPICH-G2 Improvements 2. More efficient use of sockets: uses one socket for both directions. Multilevel topology-aware collective operations: collective operations were originally implemented assuming all processes are equidistant, which is not likely in a Grid scenario.
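Why the equidistance assumption hurts can be seen by counting messages that cross site boundaries during a broadcast. The sketch below is a simplified model, not MPICH-G2's actual algorithm: it compares a binomial-tree broadcast that ignores topology with a two-level scheme in which the root sends once to a leader at every other site and the remaining traffic stays local. The rank-to-site layout is made up for illustration.

```python
def binomial_edges(n):
    """(sender, receiver) pairs of a binomial-tree broadcast rooted at rank 0."""
    edges, step = [], 1
    while step < n:
        # Every rank that already holds the data forwards it `step` ranks ahead.
        edges += [(r, r + step) for r in range(step) if r + step < n]
        step *= 2
    return edges


def wan_messages_flat(site_of):
    """Count tree edges that cross sites when topology is ignored."""
    return sum(site_of[a] != site_of[b]
               for a, b in binomial_edges(len(site_of)))


def wan_messages_two_level(site_of):
    """Root sends once to a leader at every other site; the rest stays local."""
    return len(set(site_of)) - 1


# Eight ranks spread over three sites (hypothetical layout).
site_of = ["A", "A", "A", "B", "B", "B", "C", "C"]
flat = wan_messages_flat(site_of)
two_level = wan_messages_two_level(site_of)
```

For this layout the topology-unaware tree sends 5 of its 7 messages across sites, while the two-level scheme needs only 2 wide-area messages.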
55
App Heterogeneity Management. Topology discovery: a method of discovering topology is needed to minimize expensive transfers (intra-site communication vs. intra-machine communication). Uses the existing MPI communicator construct: associates attributes with communicators (topology depths and colors); allows MPI developers to create communicators that group processes topologically.
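Grouping processes by color can be simulated without MPI. In the sketch below, each rank carries one color per topology level, and ranks that share a color at a level can talk to each other at that level. The level layout and color values are assumptions for illustration; in a real MPI program the grouping would be done by passing the color to `MPI_Comm_split`, which this pure-Python function only imitates.

```python
from collections import defaultdict


def split_by_color(colors, level):
    """Group ranks that share a color at `level`, as a communicator split would."""
    groups = defaultdict(list)
    for rank, rank_colors in enumerate(colors):
        groups[rank_colors[level]].append(rank)
    return sorted(groups.values())


# Hypothetical colors per rank: (wide-area level, site level, machine level).
colors = [
    (0, 0, 0), (0, 0, 0),   # ranks 0-1: machine 0 at site 0
    (0, 0, 1),              # rank 2:    machine 1 at site 0
    (0, 1, 2), (0, 1, 2),   # ranks 3-4: machine 2 at site 1
]
same_site = split_by_color(colors, 1)
same_machine = split_by_color(colors, 2)
```

An application can then keep its chattiest communication inside the `same_machine` groups and reserve cross-site messages for the few exchanges that need them.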
56
Example MPICH-G2 App
57
Performance Groupings. Specified: MPI_Recv explicitly specifies a process on the same machine; no outstanding asynchronous operations; explicitly calls vendor MPI. Specified-pending: MPI_Recv explicitly specifies a process on the same machine; outstanding recv requests on the same machine; forced to continuously poll vendor MPI. Multimethod: MPI_Recv source rank is MPI_ANY_SOURCE, or there are outstanding recv requests that may require TCP; forced to continuously poll vendor MPI and TCP.
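The three categories amount to a small decision procedure, sketched below. This is an interpretation of the slide, not code from MPICH-G2: the function name, the `ANY_SOURCE` stand-in and the data shapes are hypothetical. One case the slide does not cover, a receive that names a specific process on another machine, is treated here as needing TCP, which is an assumption.

```python
ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE in this sketch


def classify_recv(source, same_machine, pending):
    """Return the polling category for one receive.

    source       -- sender rank, or ANY_SOURCE
    same_machine -- set of ranks reachable through vendor MPI
    pending      -- sources of receives that are still outstanding
    """
    def may_need_tcp(src):
        # Assumption: anything not known to be on this machine may arrive via TCP.
        return src == ANY_SOURCE or src not in same_machine

    if may_need_tcp(source) or any(may_need_tcp(p) for p in pending):
        return "multimethod"          # poll vendor MPI and TCP
    if pending:
        return "specified-pending"    # poll vendor MPI only
    return "specified"                # call the vendor MPI receive directly


local = {0, 1, 2}
examples = [
    classify_recv(1, local, []),            # one local sender, nothing outstanding
    classify_recv(1, local, [2]),           # local sender, local receive outstanding
    classify_recv(ANY_SOURCE, local, []),   # sender unknown
    classify_recv(1, local, [7]),           # outstanding receive from a remote rank
]
```

The cheaper categories are the ones where TCP can be ruled out, which is why naming the source rank explicitly helps latency.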
58
Vendor MPI Results. Increased performance compared to MPICH-G. Performance relatively close to straight vendor MPI.
59
Vendor MPI Results
60
TCP/IP Results. Similar results to vendor MPI (less interesting). The authors explicitly say they did not attempt to modify the TCP code.
61
TCP/IP Results
62
Conclusions. Good performance: improved over the previous version; “good enough” to justify use. Eases the transition of MPI applications to the context of a Grid: it just works. Provides developers with a relatively simple means of writing “smart” apps that are aware of their topology.
63
P-GRADE Portal
64
MTA SZTAKI: Computer and Automation Research Institute of the Hungarian Academy of Sciences. Laboratory of Parallel and Distributed Computing. Peter Kacsuk, Joszef Patvarczki. HunGrid. Member of both SEE-Grid and EGEE.
65
Two Grid Problems. Middleware tools are built together into a Grid: too many complex parts; confusing for users with little experience, who are mostly research scientists. PVM and MPI allow for parallel execution: execution within a single Globus or Condor site shows good performance, but performance decreases when executing across multiple sites.
66
P-GRADE Portal. A web-based portal for accessing the Grid. High-level tools hide the complexity of the middleware. Can be accessed from anywhere. Workflow solution: complex problems are broken into several parts that are treated as a single framework and executed as an acyclic graph. Parallelism at two levels: independent branches run on several grid sites; individual nodes can be parallel programs (MPI or PVM).
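Executing a workflow as an acyclic graph means a job can start as soon as all of its inputs exist, and jobs with no dependency between them can run at the same time. The sketch below shows that scheduling idea with Python's standard `graphlib` module (Python 3.9 or later); it is not P-GRADE code, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group workflow jobs into stages; jobs within a stage are independent."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                    # raises CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # every job whose inputs are complete
        stages.append(ready)
        sorter.done(*ready)                  # pretend the whole stage has finished
    return stages


# Each job maps to the set of jobs whose output files it consumes.
workflow = {
    "mesh": set(),
    "solve_a": {"mesh"},
    "solve_b": {"mesh"},
    "visualize": {"solve_a", "solve_b"},
}
stages = parallel_stages(workflow)
```

The two `solve` jobs land in the same stage, so they could be sent to different grid sites; each of them may itself be a parallel MPI or PVM program, which gives the second level of parallelism.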
67
Portal. Fully functional; built upon middleware tools. Grid certificate management. Setting up the Grid environment. Creation and modification of workflow apps. Management and parallel execution of workflow apps on grid resources. Visualization of workflow progress.
68
Grid Certificate. Security is done through Globus GSI. Connect to the proxy server; download the certificate. Monitor status.
69
Resource Management. Uses Globus tools to attach jobs to resources. Two strategies. Static allocation: connect directly to GRAM servers. Dynamic allocation: connect to the MDS service and allocate through a Grid resource broker.
70
Workflow Creation & Monitoring. P-GRADE Java app for creating parallel workflows. Directed input and output files.
71
Parameter Study. A single job is run under varying input parameters. The outputs are later compared against each other. A logical Grid application: each job is independent and can be run in parallel.
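A parameter study expands one job description into many independent jobs, one per combination of parameter values. The sketch below shows that expansion with a plain text template; the parameter names, values and input-file format are invented for illustration and are not taken from P-GRADE.

```python
from itertools import product


def expand_study(template, parameters):
    """Yield one concrete input text per combination of parameter values."""
    names = sorted(parameters)
    for values in product(*(parameters[n] for n in names)):
        binding = dict(zip(names, values))
        yield binding, template.format(**binding)


template = "viscosity={viscosity}\ngrid={grid}\n"
parameters = {"viscosity": [0.1, 0.2], "grid": [64, 128, 256]}
jobs = list(expand_study(template, parameters))
```

The two viscosities and three grid sizes give six input files. Since no job reads another job's output, all six can be submitted at once and their results compared afterwards.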
72
P-GRADE Portal w/ PStudy. Adapted the portal to create and manage parametric studies. New workflow editor: creation of parameterized input files; managing parameter values. Workflow management: submit workflows by parameter ranges; compare outputs; monitor individual job status.
73
Pstudy Manager
74
Visualization
75
P-GRADE Demo: http://hgportal.hpcc.sztaki.hu:8080/gridsphere/gridsphere