Introduction to the Grid Peter Kacsuk MTA SZTAKI www.lpds.sztaki.hu
Agenda From Metacomputers to the Grid Grid Applications Job Managers in the Grid - Condor Grid Middleware – Globus Grid Application Environments © Peter Kacsuk
Grid Computing in the News © Peter Kacsuk Credit to Fran Berman
Real World Distributed Applications: SETI@home. 3.8M users in 226 countries; 1200 CPU years/day; 38 TF sustained (the Japanese Earth Simulator is 40 TF peak); 1.7 zettaFLOP over the last 3 years (10^21, beyond peta and exa …); highly heterogeneous: >77 different processor types. © Peter Kacsuk Credit to Fran Berman
Progress in Grid Systems (figure): high-performance computing (supercomputing with PVM/MPI, cluster computing) and high-throughput computing (Condor), network computing (sockets, client/server), object-oriented computing (CORBA, Object Web) and Web computing (scripts, Web Services) converge via Globus and OGSA into Grid Systems and the Semantic Grid. © Peter Kacsuk
Progress to the Grid (figure): from single-processor computers through GFlops supercomputers and clusters to the meta-computer built from supercomputers and clusters. © Peter Kacsuk
Original motivation for metacomputing: grand challenge problems run for weeks or months even on supercomputers and clusters. Various supercomputers/clusters must be connected by wide-area networks in order to solve grand challenge problems in a reasonable time. © Peter Kacsuk
Original meaning of metacomputing = supercomputing + wide-area network. Original goal of metacomputing: distributed supercomputing, to achieve higher performance than individual supercomputers/clusters can provide. © Peter Kacsuk
Distributed Supercomputing: SF-Express Distributed Interactive Simulation (Caltech, USC/ISI), coupling the Caltech Exemplar, NCSA Origin, Maui SP and Argonne SP. Issues: resource discovery and scheduling, configuration, multiple communication methods, message passing (MPI), scalability, fault tolerance. © Peter Kacsuk
Technologies for metacomputers (figure): supercomputing + WAN technology + distributed computing lead to metacomputers. © Peter Kacsuk
What is a Metacomputer? A metacomputer is a collection of computers that are heterogeneous in every aspect, geographically distributed, connected by a wide-area network, and form the image of a single computer. Metacomputing means: network-based distributed supercomputing. © Peter Kacsuk
Further motivations for metacomputing: better usage of computing and other resources accessible via wide-area networks. Various computers must be connected by wide-area networks in order to exploit their spare cycles. Various special devices must be accessed via wide-area networks for collaborative work. © Peter Kacsuk
Motivations for Grid computing: to form a computational grid, similar to the way information is accessed on the web. Any computers/devices must be connected by wide-area networks in order to form a universal source of computing power. Grid = generalised metacomputing. © Peter Kacsuk
Technologies that led to the Grid (figure): supercomputing + network technology + Web technology lead to the Grid. © Peter Kacsuk
What is a Grid? A Grid is a collection of computers, storage and other devices that are heterogeneous in every aspect, geographically distributed, connected by a wide-area network, and form the image of a single computer. Generalised metacomputing means: network-based distributed computing. © Peter Kacsuk
Application areas of the Grid: distributed supercomputing; high-throughput computing (parameter studies); virtual laboratory; collaborative design; data-intensive applications (sky surveys, particle physics); geographic information systems; teleimmersion; enterprise architectures. © Peter Kacsuk
Distributed Supercomputing: SF-Express Distributed Interactive Simulation (Caltech, USC/ISI), coupling the Caltech Exemplar, NCSA Origin, Maui SP and Argonne SP. Issues: resource discovery and scheduling, configuration, multiple communication methods, message passing (MPI), scalability, fault tolerance. © Peter Kacsuk
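The message-passing style such distributed runs rely on can be sketched with the mpi4py binding; this is only an illustrative toy (the workload is invented), not the SF-Express code, and the cross-site startup details (e.g. a Grid-enabled MPI) are omitted.

# Toy message-passing sketch with mpi4py; illustrative only, not SF-Express.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id within the distributed run
size = comm.Get_size()   # total number of processes across all machines

# Each process computes its share of the work ...
local_result = sum(i * i for i in range(rank * 1000, (rank + 1) * 1000))

# ... and the partial results are gathered by process 0.
results = comm.gather(local_result, root=0)
if rank == 0:
    print(f"{size} processes, combined result: {sum(results)}")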
High-Throughput Computing: schedule many independent tasks, e.g. parameter studies and data analysis (Nimrod-G, Monash University). Issues: resource discovery, data access, scheduling over the available machines under deadline and cost constraints, reservation, security, accounting, code management. © Peter Kacsuk
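A parameter study of the kind scheduled by high-throughput systems can be pictured as a set of fully independent tasks; the sketch below (the executable name and parameter ranges are hypothetical) simply enumerates the combinations and launches them locally, whereas a broker such as Nimrod-G would farm each task out to a Grid resource under deadline and cost constraints.

# Hedged sketch: a parameter study as independent tasks.
# "./simulate" and the parameter ranges are placeholders.
import itertools
import subprocess

pressures = [0.9, 1.0, 1.1]
temperatures = [280, 300, 320]

for i, (p, t) in enumerate(itertools.product(pressures, temperatures)):
    # One independent job per parameter combination; a Grid broker would
    # dispatch these to remote machines instead of starting them here.
    subprocess.Popen(["./simulate", f"--pressure={p}",
                      f"--temperature={t}", f"--out=result_{i}.dat"])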
High-throughput Computing: Condor. Goal: exploit the spare cycles of computers in the Grid. Realization steps (1): turn your desktop workstation into a personal Condor machine. © Peter Kacsuk Credit to Miron Livny
High-throughput Computing: Condor. Realization steps (2): create your institute-level Condor pool (e.g. the SZTAKI cluster Condor pool) and connect your personal Condor machine to it. © Peter Kacsuk Credit to Miron Livny
High-throughput Computing: Condor. Realization steps (3): connect “friendly” Condor pools, e.g. your workstation's personal Condor and the SZTAKI cluster Condor pool with the friendly BME Condor pool. © Peter Kacsuk Credit to Miron Livny
High-throughput Computing: Condor. Realization steps (4): temporary exploitation of Grid resources: Condor jobs glide in to Hungarian Grid resources managed by PBS, LSF or Condor, extending the personal Condor machine, the SZTAKI cluster Condor pool and the friendly BME Condor pool. © Peter Kacsuk Credit to Miron Livny
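One way to hand a job to such a Condor pool is a submit description file passed to condor_submit; the sketch below (the executable my_app and the file names are placeholders) writes a minimal vanilla-universe description and submits it, assuming the Condor command-line tools are installed on the submit machine.

# Hedged sketch: submitting a job to a Condor pool.
# "my_app" and the file names are placeholders; the keywords are standard Condor.
import subprocess
import textwrap

submit_description = textwrap.dedent("""\
    universe   = vanilla
    executable = my_app
    arguments  = input.dat
    output     = my_app.out
    error      = my_app.err
    log        = my_app.log
    queue
""")

with open("my_app.submit", "w") as f:
    f.write(submit_description)

subprocess.run(["condor_submit", "my_app.submit"], check=True)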
NUG30 - Solved!!! Solved in 7 days instead of 10.9 years (figure: number of workers over the first 600K seconds). © Peter Kacsuk Credit to Miron Livny
The Condor model (figure): the resource requestor publishes its resource requirement and the resource provider publishes its configuration description as ClassAds to the match-maker; once matched, requestor and provider talk directly over TCP/IP and your program moves to the resource(s). Security is a serious problem! © Peter Kacsuk
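The match-making idea can be illustrated with a toy sketch: requests and offers are reduced to plain dictionaries, and a match requires the provider's offer to satisfy the requestor's requirements. Real ClassAds are far richer (arbitrary expressions, ranking, two-sided requirements), so this is only a conceptual illustration with made-up attribute values.

# Toy match-maker: ClassAd-style matching reduced to dictionaries.
def matches(request, offer):
    # The provider must satisfy every requirement the requestor advertises.
    return (offer["Memory"] >= request["MinMemory"]
            and offer["Arch"] == request["Arch"]
            and offer["OpSys"] == request["OpSys"])

request = {"MinMemory": 512, "Arch": "INTEL", "OpSys": "LINUX"}
offers = [
    {"Name": "slot1@node1", "Memory": 256, "Arch": "INTEL", "OpSys": "LINUX"},
    {"Name": "slot1@node2", "Memory": 1024, "Arch": "INTEL", "OpSys": "LINUX"},
]

matched = [o["Name"] for o in offers if matches(request, o)]
print("matched resources:", matched)   # the job would move to one of these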
Generic Grid Architecture (layers): Application Environments (application development environments, analysis & visualisation, collaboratories, problem solving environments, Grid portals); Application Support (MPI, Condor, CORBA, Java/Jini, OLE/DCOM, other); Grid Common Services (information services, global scheduling, data access, caching, resource co-allocation, authentication, authorisation, monitoring, fault management, policy, accounting, resource management); Grid Fabric - local resources (CPUs, tertiary storage, online storage, communications, scientific instruments). © Peter Kacsuk
Middleware concepts. Goal of the middleware: to turn a radically heterogeneous environment into a virtually homogeneous one. Three main concepts: the toolkit (mix-and-match) approach (Globus); the object-oriented approach (Legion, Globe); the commodity Internet/WWW approach (Web services). © Peter Kacsuk
Globus Layered Architecture: Applications; Application Toolkits (GlobusView, Testbed Status, DUROC, MPI, Condor-G, HPC++, Nimrod/G, globusrun); Grid Services (Nexus, GRAM, I/O, MDS-2, GSI, GSI-FTP, HBM, GASS); Grid Fabric (Condor, MPI, LSF, PBS, NQE, TCP, UDP, Linux, NT, Solaris, DiffServ). © Peter Kacsuk
Globus Approach: Hourglass. High-level services (resource brokers, resource co-allocators) sit above the narrow neck of the GRAM protocol, which maps requests onto many local resource managers (Condor, LSF, NQE, PBS, etc.); the analogy is the Internet protocol hourglass, where IP connects TCP, FTP, HTTP, etc. above with Ethernet, ATM, FDDI, etc. low-level tools below. © Peter Kacsuk
Globus hierarchical resource management architecture (figure): the application passes an RSL request ("Run DIS with 100K entities") to brokers, which consult the information service (MDS-2) and refine it into ground RSL ("80 nodes on the Argonne SP-2, 256 nodes on the CIT Exemplar"); co-allocators split this into simple ground RSL requests ("Run SF-Express on 80 nodes", "Run SF-Express on 256 nodes") that are submitted through GRAM to the local resource managers (Argonne Resource Manager, SDSC Resource Manager). © Peter Kacsuk
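A simple ground RSL request of this kind can be composed as a string and handed to the globusrun tool (Globus Toolkit 2 style); in the sketch below the gatekeeper contact string is a placeholder, and the exact attributes accepted depend on the local job manager.

# Hedged sketch: a simple ground RSL request submitted via globusrun.
import subprocess

rsl = "&(executable=/bin/hostname)(count=80)"          # run on 80 nodes
contact = "gatekeeper.example.org/jobmanager-pbs"      # hypothetical contact

# -r names the resource (GRAM gatekeeper); -o streams the job output back.
subprocess.run(["globusrun", "-o", "-r", contact, rsl], check=True)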
The Globus Model (figure): the resource provider publishes its configuration description to the information system (MDS-2); the resource requestor queries MDS-2 through the MDS-2 API, then contacts the provider through the GRAM API and your program moves to the resource(s). Security is a serious problem! © Peter Kacsuk
“Standard” MDS Architecture (MDS-2). Resources run a standard information service (GRIS) which speaks LDAP and provides information about the resource (no searching). The GIIS provides a “caching” service much like a web search engine: resources register with the GIIS, and the GIIS pulls information from them when a client request arrives and the cache has expired. The GIIS provides the collective-level indexing/searching function; index nodes can be designed and optimized for various requirements. In the figure, clients 1 and 2 request information directly from the GRIS of resources A and B, while client 3 uses the GIIS, whose cache contains info from A and B, for searching collective information. © Peter Kacsuk
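Because a GRIS is simply an LDAP server, a client can query it with any LDAP library; the sketch below uses the Python ldap3 package with a placeholder host name and assumes the usual MDS-2 defaults (port 2135, base DN mds-vo-name=local,o=grid), which a given deployment may change.

# Hedged sketch: querying a GRIS over LDAP with the ldap3 library.
from ldap3 import ALL, Connection, Server

server = Server("grid-node.example.org", port=2135, get_info=ALL)
conn = Connection(server, auto_bind=True)        # anonymous bind

# Pull every entry under the (assumed) default MDS base DN.
conn.search(search_base="mds-vo-name=local,o=grid",
            search_filter="(objectclass=*)",
            attributes=["*"])

for entry in conn.entries:
    print(entry)    # resource description published by the GRIS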
Grid Security Infrastructure (GSI): proxies and delegation (GSI extensions) for secure single sign-on; PKI (CAs and certificates) for credentials; SSL (Secure Socket Layer) for authentication and message protection. © Peter Kacsuk
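In day-to-day use, single sign-on means creating a short-lived proxy credential once and letting later tools authenticate and delegate with it; a minimal sketch using the standard GSI command-line tools (which must be installed and configured with a user certificate) is shown below.

# Hedged sketch: single sign-on with a GSI proxy credential.
import subprocess

# Sign a short-lived proxy certificate with the user's long-term key
# (prompts once for the private-key passphrase).
subprocess.run(["grid-proxy-init"], check=True)

# Inspect the proxy: subject, issuer, remaining lifetime.
subprocess.run(["grid-proxy-info"], check=True)

# Jobs submitted afterwards (e.g. via globusrun) authenticate with this proxy,
# and it can be delegated to remote services without exposing the long-term key.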
Grid application environments: integrated environments (Cactus; P-GRADE, the Parallel Grid Run-time and Application Development Environment); application-specific environments (NetSolve); problem solving environments; Grid portals. © Peter Kacsuk
A Collaborative Grid Environment based on Cactus (figure): simulations are launched from the Cactus Portal and Grid-enabled Cactus runs on distributed machines (T3E at Garching, Origin at NCSA) via Globus; data flows over DataGrid/DPSS, HTTP and HDF5 with downsampling and isosurface extraction; remote visualisation of data from previous simulations in a Vienna café, remote visualisation in St Louis, remote steering and monitoring from an airport, and remote visualisation and steering from Berlin. © Peter Kacsuk Credit to Ed Seidel
P-GRADE: Software Development and Execution (figure): editing, debugging, performance analysis and execution on the Grid. © Peter Kacsuk
Nowcast Meteorology Application in P-GRADE (figure: the application's process groups are replicated 25x, 10x, 25x and 5x). © Peter Kacsuk
Performance visualisation in P-GRADE © Peter Kacsuk
Nowcast Meteorology Application in P-GRADE (figure: the same workflow partitioned into five jobs, 1st to 5th, with process groups replicated 25x, 10x, 25x and 5x). © Peter Kacsuk
Layers of TotalGrid (from top to bottom): P-GRADE, PERL-GRID, Condor or SGE, PVM or MPI, Internet, Ethernet. © Peter Kacsuk
PERL-GRID: a thin layer for Grid-level job management between P-GRADE and various local job managers like Condor, SGE, etc., and for file staging; used for applications in the Hungarian Cluster Grid. © Peter Kacsuk
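What such a thin layer does can be sketched in a few lines (here in Python rather than the original Perl): stage the input files to the selected cluster, hand the job to that cluster's own job manager, and stage the results back. The host name, paths and submit command below are placeholders, not PERL-GRID itself.

# Conceptual sketch of a thin staging/job-management layer (not PERL-GRID itself).
import subprocess

cluster = "cluster.example.hu"       # hypothetical cluster front-end
jobdir = "/home/griduser/job42"      # hypothetical working directory

# 1. File staging: copy the executable, submit file and inputs to the cluster.
subprocess.run(["scp", "my_app", "my_app.submit", "input.dat",
                f"{cluster}:{jobdir}/"], check=True)

# 2. Grid-level job management: submit through the cluster's own job manager
#    (condor_submit here; qsub for SGE at another site).
subprocess.run(["ssh", cluster, f"cd {jobdir} && condor_submit my_app.submit"],
               check=True)

# 3. Later, stage the output back for P-GRADE to analyse and visualise.
subprocess.run(["scp", f"{cluster}:{jobdir}/output.dat", "."], check=True)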
Hungarian Cluster Grid Initiative Goal: To connect 99 new clusters of the Hungarian higher education institutions into a Grid Each cluster contains 20 PCs and a network server PC. Day-time: the components of the clusters are used for education At night: all the clusters are connected to the Hungarian Grid by the Hungarian Academic network (2.5 Gbit/sec) Total Grid capacity by the end of 2003: 2079 PCs Current status: About 400 PCs are already connected at 8 universities Condor-based Grid system VPN (Virtual Private Network) Open Grid: other clusters can join at any time © Peter Kacsuk
Structure of the Hungarian Cluster Grid (figure): in 2003, 99 x 21-PC Linux clusters (2079 PCs in total), each running Condor => TotalGrid, connected by 2.5 Gb/s Internet. © Peter Kacsuk
Problem Solving Environments. Examples: a problem solving environment for computational chemistry (ECCE’, Pacific Northwest National Laboratory), application web portals. Issues: remote job submission, monitoring, and control; resource discovery; distributed data archives; security; accounting. © Peter Kacsuk
Grid Portals GridPort (https://gridport.npaci.edu) Grid Resource Broker (GRB) (http://sara.unile.it/grb) Grid Portal Development Kit (GPDK) (http://www.doesciencegrid.org/Grid) Genius (http://www.infn.it/grid) © Peter Kacsuk
GPDK © Peter Kacsuk
Genius © Peter Kacsuk
Summary: the Grid is a new technology which integrates supercomputing, wide-area network technology and WWW technology. The computational Grid will lead to a new infrastructure similar to the electrical grid. This infrastructure will have a tremendous influence on the Information Society. © Peter Kacsuk