Albert-Einstein-Institut
Exploring Distributed Computing Techniques with Cactus and Globus
Solving Einstein’s Equations, Black Holes, and Gravitational Wave Astronomy
Cactus, a new community simulation code framework: Grid-enabling capabilities
Previous metacomputing experiments
What we learned from those
Current work, improvements
–The present state
Future development, goals
Thomas Dramlitsch, Albert-Einstein-Institut, MPI-Gravitationsphysik (and AEI-ANL-NCSA-LBL team)
What is Cactus?
A new concept in community-developed simulation code infrastructure
Numerical/computational infrastructure to solve PDEs
Freely available, open community source code: spirit of GNU/Linux
Developed as a response to the needs of these projects
It’s production software
Cactus is divided into “Flesh” (core) and “Thorns” (modules or collections of subroutines)
–User choice between Fortran, C, C++; automated interface between them
–Parallelism largely automatic and hidden (if desired) from user
–Checkpointing / restart capabilities
Many parallel utilities / features enabled by Cactus
–Parallel I/O: FlexIO, HDF5; data streaming, remote visualization/steering
–Elliptic solvers: PETSc
–And of course metacomputing
A vision: any application can plug into Cactus to be Grid-enabled
Demo tomorrow night at HPDC
Modularity of Cactus...
[Diagram labels: Application 1a, Application 1b, Application 2, ..., Sub-app; Cactus Flesh; AMR (GrACE, etc.); MPI layer 1; I/O layer 2; Remote Steer 3; Globus Metacomputing Services]
User selects desired functionality...
Metacomputing: harnessing power when and where it is needed
Einstein equations typical of apps that require extreme memory, speed
–Many Flops per grid zone (~ )
–Finite differences on regular grids
–Communication of variables through derivatives: ghost zones
Largest supercomputers too small!
Networks very fast!
–OC-12 and higher very common in US
–G-Win: 622 Mbit/s Potsdam-Berlin-Garching, connect multiple supercomputers
–Gigabit networking to US possible
“Seamless computing and visualization from anywhere”
Many metacomputing experiments in progress
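The ghost-zone idea on this slide can be illustrated with a minimal sketch. It is not Cactus code: the function names are hypothetical, the grid is 1D, and the "exchange" is a plain copy standing in for a message between two processors. It shows that a centered finite difference on two subdomains gives the same result as on the undivided grid once each subdomain has received its neighbour's boundary value.

```python
def second_derivative(u, dx):
    """Centered second difference on the interior points of u."""
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) / dx**2
            for i in range(1, len(u) - 1)]

def split_with_ghosts(u, cut):
    """Split u at index cut; pad each piece with one (empty) ghost zone."""
    left = u[:cut] + [0.0]    # ghost zone on the right edge
    right = [0.0] + u[cut:]   # ghost zone on the left edge
    return left, right

def exchange_ghosts(left, right):
    """Fill each ghost zone with the neighbour's outermost owned value.
    In a real run this copy would be a message between processors."""
    left[-1] = right[1]
    right[0] = left[-2]

def distributed_second_derivative(u, dx, cut):
    """Hypothetical two-subdomain version of second_derivative."""
    left, right = split_with_ghosts(u, cut)
    exchange_ghosts(left, right)
    return second_derivative(left, dx) + second_derivative(right, dx)

if __name__ == "__main__":
    u = [float(i * i) for i in range(8)]
    print(second_derivative(u, 1.0))
    print(distributed_second_derivative(u, 1.0, cut=4))
```

Only one value per subdomain edge crosses the boundary here; a wider stencil would need a correspondingly wider ghost zone.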
High performance: Full 3D Einstein Equations solved on NCSA NT Supercluster, Origin 2000, T3E
Excellent scaling on many architectures
–Origin up to 256 processors
–T3E up to 1024
–NCSA NT cluster up to 128 processors
Achieved 142 Gflop/s on 1024-node T3E-1200 (benchmarked for NASA NS Grand Challenge)
But, of course, we want much more… metacomputing, meaning connected computers...
Metacomputing the Einstein Equations: Connecting T3Es in Berlin, Garching, San Diego
Want to migrate this technology to the generic user...
Scaling of Cactus on two T3Es on different continents
[Plots: San Diego & Berlin; Berlin & Munich]
Scaling of Cactus on Multiple SGIs at Remote Sites
[Plot: Argonne & NCSA]
Analysis of previous metacomputing experiments
It worked! (That’s the main thing we wanted at SC98…)
Cactus was not optimized for metacomputing: messages too small, latency, etc.
Mpich-G could perform better, e.g. intra-machine communication one order of magnitude slower than native MPI
–Mpich-G2 improves this...
Communication is non-trivial (not “embarrassingly parallel”) and very intensive
Experiments showed:
–For some problems, this is feasible
–We need to improve performance significantly with work on optimization of Cactus and Mpich-G
–That’s what we did!
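Why "messages too small" hurts on a wide-area link can be seen from a simple cost model. This is a back-of-the-envelope sketch, not a measurement from these experiments. It assumes messages are sent one after another, each paying the full latency, and the 50 ms latency and 1 MB payload below are illustrative values only.

```python
def transfer_time(total_bytes, n_messages, latency_s, bandwidth_bytes_per_s):
    """Estimated time to send total_bytes split evenly into n_messages.
    Assumes sequential sends, each paying the full latency (no pipelining)."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s

if __name__ == "__main__":
    bandwidth = 622e6 / 8   # 622 Mbit/s link, in bytes per second
    latency = 0.05          # assumed wide-area latency of 50 ms
    payload = 1_000_000     # assumed 1 MB of ghost-zone data per step
    for n in (1, 10, 100, 1000):
        print(n, transfer_time(payload, n, latency, bandwidth))
```

Under this model the bandwidth term is fixed by the payload, so the only way to reduce the total is to send fewer, larger messages, which is the direction of the optimizations described on the next slide.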
Optimizing Cactus Communication Layers for Metacomputing
Made the communication layer(s) much more flexible:
–Can specify size and number of messages, in order to achieve best performance with the underlying network (bandwidth, latency)
–Reduced communication to a bare minimum
–Overlapping of communication with other CPUs
–Overlapping of communication and computation
Made the load balancing of Cactus more flexible (Matei Ripeanu):
–Cactus can now decompose the total problem into pieces of different size, according to CPU power, number of CPUs used on one machine, etc.
Cactus compiles (out of the box) with Globus and MPICH on most common architectures (T3E, Irix, SP-2,…?)
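One way to picture the flexible load balancing is a weighted split of the grid. The sketch below is an assumption about how such a decomposition could be computed, not the algorithm implemented in Cactus. The function name and the weights (relative CPU power times number of CPUs per machine) are hypothetical, and leftover points are handed out by largest fractional remainder.

```python
def decompose(n_points, weights):
    """Split n_points grid points into len(weights) pieces whose sizes are
    proportional to the (positive) weights, e.g. relative machine speed."""
    total = sum(weights)
    shares = [n_points * w / total for w in weights]
    sizes = [int(s) for s in shares]   # round every share down first
    leftover = n_points - sum(sizes)   # points lost to rounding
    # Give the leftover points to the pieces with the largest remainders.
    order = sorted(range(len(weights)),
                   key=lambda i: shares[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

if __name__ == "__main__":
    # Three hypothetical machines; the third is twice as powerful.
    print(decompose(100, [1, 1, 2]))
```

The pieces always add up to the full grid, so no point is dropped or duplicated by rounding.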
Optimizing Mpich-G: Used Mpich-G2
MPICH-G2 is a completely rewritten communication layer
Can distinguish between inter- and intra-machine communication
–It uses the vendor-supplied MPI for intra-machine communication
–Uses TCP/IP between machines
This means optimal performance in a metacomputing environment
Works with Cactus and Globus on all major Unix systems
[Diagram labels: TCP/IP, MPI_COMM_WORLD]
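The selection rule on this slide can be written as a toy sketch. It only illustrates the idea and says nothing about how MPICH-G2 is implemented internally. The function names, the rank-to-machine map, and the transport labels are all hypothetical.

```python
def choose_transport(machine_of, src, dst):
    """Pick a transport label for a message from rank src to rank dst."""
    if machine_of[src] == machine_of[dst]:
        return "vendor MPI"   # both ranks live on the same machine
    return "TCP/IP"           # ranks live on different machines

def wide_area_pairs(machine_of):
    """List neighbouring rank pairs (r, r+1) whose messages cross machines."""
    ranks = sorted(machine_of)
    return [(a, b) for a, b in zip(ranks, ranks[1:])
            if choose_transport(machine_of, a, b) == "TCP/IP"]

if __name__ == "__main__":
    # Four ranks of one job spread over two hypothetical machines.
    machine_of = {0: "machine-A", 1: "machine-A", 2: "machine-B", 3: "machine-B"}
    print(wide_area_pairs(machine_of))
```

In a 1D chain of ranks like this one, only the pair straddling the machine boundary uses the slow link, which is why the layout of ranks across machines matters.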
Current experiments and future plans
Current experiment
–Complete testing and production of tightly coupled simulation between different sites in the USA (NCSA, NERSC, ANL, SDSC and others)
–Want to use advanced software (Portal, co-scheduling systems, etc.)
–Want to run across as many sites and nodes as possible
More general Grid computing problems
–Distribution of multiple grids
–Dynamic resource acquisition
  Acquiring more memory when needed (AMR)
  Spawning off connected jobs on remote machines
  Cactus thorn would have access to MDS …
Cactus Computational Toolkit: Science, Autopilot, AMR, PETSc, HDF, MPI, GrACE, Globus, Remote Steering...
A Portal to Computational Science: The Cactus Collaboratory
1. User has science idea
Selects appropriate resources
Collaborators log in to monitor
Steers simulation, monitors performance
Composes/builds code components w/interface...
Want to integrate and migrate this technology to the generic user...
German Gigabit Project supported by DFN-Verein
Developing techniques to exploit high-speed networks
Focus on remote steering and visualization
OC-12 testbed between AEI, ZIB, RZG with built-in application groups ready to use it!
Already closely connected to ANL, NCSA, KDI projects
Metacomputing Experiments, Production
SC93: remote CM-5 simulation with live viz in CAVE
SC95: Heroic I-Way experiments lead to development of Globus. Cornell SP-2, Power Challenge, with live viz in San Diego CAVE
SC97: Garching 512-node T3E, launched, controlled, visualized in San Jose
SC98: HPC Challenge. SDSC, ZIB, and Garching T3E compute collision of 2 neutron stars, controlled from Orlando
SC99: Colliding black holes using Garching, ZIB T3Es, with remote collaborative interaction and viz at ANL and NCSA booths
April 2000: Attempting to use LANL, NCSA, NERSC, SDSC, ZIB, Garching, NASA-Ames, Maui?, +…? for single simulation!
All this technology is available in the main production code for different applications!