Scenarios for Grid Applications Ed Seidel Max Planck Institute for Gravitational Physics
Outline
PHYSICS TOMORROW
- Remote visualization, monitoring and steering
- Distributed computing
- Grid portals
- More dynamic scenarios

Why Grids? (1) eScience
- A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
- 1,000 physicists worldwide pool resources for petaop analyses of petabytes of data
- Civil engineers collaborate to design, execute, & analyze shake table experiments
- Climate scientists visualize, annotate, & analyze terabyte simulation datasets
- An emergency response team couples real time data, weather model, population data
Source: Ian Foster

Why Grids? (2) eBusiness
- Engineers at a multinational company collaborate on the design of a new product
- A multidisciplinary analysis in aerospace couples code and data in four companies
- An insurance company mines data from partner hospitals for fraud detection
- An application service provider offloads excess load to a compute cycle provider
- An enterprise configures internal & external resources to support eBusiness workload
Source: Ian Foster

Current Grid App Types
- Community Driven
  - Serving the needs of distributed communities
  - Video Conferencing
  - Virtual Collaborative Environments
    - Code sharing to “experiencing each other” at a distance…
- Data Driven
  - Remote access of huge data, data mining
  - Weather Information systems
  - Particle Physics
- Process/Simulation Driven
  - Demanding Simulations of Science and Engineering
  - Get less attention in the Grid World, yet drive HPC!
- Remote, steered, etc…
Present Examples: “simple but very difficult”
Future Examples: Dynamic Interacting combinations of all types

From Telephone Conference Calls to Access Grid Intern’l Video Meetings
- Access Grid Lead-Argonne
- NSF STARTAP Lead-UIC’s Elec. Vis. Lab
- Creating a Virtual Global Research Lab
- Internet Linked Pianos
- New Cyber Arts
- Humans Interacting with Virtual Realities
Source: Smarr

NSF’s EarthScope -- USArray
Explosions of Data! Typical!
- High Resolution of Crust & Upper Mantle Structure
- Transportable Array
  - Broadband Array
    - 400 Broadband Seismometers, ~70 km Spacing, ~1500 x 1500 km Grid
  - ~2 Year Deployments at Each Site
    - Rolling Deployment Over More Than 10 Years
- Permanent Reference Network
  - Geodetic Quality GPS Receivers
- All Data to Community in Near Real Time
  - Bandwidth Will Be Driven by Visual Analysis in Repositories
  - Realtime simulations work as data come in
Source: Smarr/Frank Vernon (IGPP SIO, UCSD)

EarthScope Rollout
Rollout Over 14 Years Starting With Existing Broadband Stations
Source: Smarr

Common Infrastructure
Wide range of scientific disciplines will require a common infrastructure:
Common Needs Driven by the Science/Engineering
- Large Number of Sensors / Instruments
  - Data to community in real time!
- Daily Generation of Large Data Sets
  - Growth in Computing power from TB ---> PB machines
  - Experimental data
  - Data is on Multiple Length and Time Scales
- Automatic Archiving in Distributed Repositories
- Large Community of End Users
- Multi-Megapixel and Immersive Visualization
- Collaborative Analysis From Multiple Sites
- Complex Simulations Needed to Interpret Data
Source: Smarr

Common Infrastructure
Wide range of scientific disciplines will require a common infrastructure
Some will need Optical Networks
- Communications: Dedicated Lambdas
- Data: Large Peer-to-Peer Lambda Attached Storage
Source: Smarr

Huge Black Hole Collision Simulation
- 3000 frames of volume rendering
- TB of data

Issues for Complex Simulations
- Huge amounts of data needed/generated across different machines
  - How to retrieve, track, manage data across Grid
  - In this case, had to fly Berlin-NCSA, bring data back on disks!
- Many components developed by distributed collaborations
  - How to bring communities together?
  - How to find/load/execute different components?
- Many computational resources available
  - How to find best ones to start?
  - How to distribute work effectively?
- Needs of computations change with time!
  - How to adapt to changes? How to monitor system?
- How to interact with Experiments? Coming!

Future View: Much is Here
- Scale of computations much larger: Need Grids to keep up…
  - Much larger processes, task farm many more processes (e.g. Monte Carlo)
  - Also: how to simply make better use of current resources
- Complexity approaching that of Nature
  - Simulations of the Universe and its constituents
    - Black holes, neutron stars, supernovae, human genome, human behavior
- Teams of computational scientists working together: the future of science and engineering
  - Must support efficient, high level problem description
  - Must support collaborative computational science: look at Grand Challenges
  - Must support all different languages
- Ubiquitous Grid Computing
  - Resources, applications, etc. replaced by abstract notion: Grid Services
    - E.g., OGSA
  - Very dynamic simulations, even deciding their own future
  - Apps may find the services themselves: distributed, spawned, etc...
  - Must be tolerant of dynamic infrastructure
  - Monitored, viz’ed, controlled from anywhere, with colleagues elsewhere

Future View
Much already here:
- Scale of computations much larger: Need Grids to keep up…
  - Much larger processes, task farm many more processes (e.g. Monte Carlo)
  - Also: how to simply make better use of current resources
- Complexity approaching that of Nature
  - Simulations of the Universe and its constituents
    - Black holes, neutron stars, supernovae
    - Human genome, human behavior
    - Climate modeling, environmental effects

Future View
- Teams of computational scientists working together: the future of science and engineering
  - Must support efficient, high level problem description
  - Must support collaborative computational science: Grand Challenges
  - Must support all different languages
- Ubiquitous Grid Computing
  - Resources, applications, etc. replaced by abstract notion: Grid Services
    - E.g., OGSA
  - Very dynamic simulations, even deciding their own future
  - Apps may find services themselves: distributed, spawned, etc...
  - Must be tolerant of dynamic infrastructure
  - Monitor, viz, control from anywhere, with colleagues elsewhere

Grid Applications So Far
- SC93 - SC2000
- Typical scenario
  - Find remote resource (often using multiple computers)
  - Launch job (usually static, tightly coupled)
  - Visualize results (usually in-line, fixed)
- Need to go far beyond this
  - Make it much, much easier
    - Portals, Globus, standards
  - Make it much more dynamic, adaptive, fault tolerant
  - Migrate this technology to general user
Metacomputing Einstein’s Equations: Connecting T3Es in Berlin, Garching, SDSC

The Vision, Part I: “ASC”
Cactus Computational Toolkit: Science, Autopilot, AMR, PETSc, HDF, MPI, GrACE, Globus, Remote Steering
- User has Science idea
- Selects Appropriate Resources
- Collaborators log in to monitor
- Steers simulation, monitors performance
- Composes/Builds Code Components w/Interface...
Want to integrate and migrate this technology to the generic user…

Start simple: sit here, compute there
Accounts for one user (real case):
- berte.zib.de
- denali.mcs.anl.gov
- golden.sdsc.edu
- gseaborg.nersc.gov
- harpo.wustl.edu
- horizon.npaci.edu
- loslobos.alliance.unm.edu
- mcurie.nersc.gov
- modi4.ncsa.uiuc.edu
- ntsc1.ncsa.uiuc.edu
- origin.aei-potsdam.mpg.de
- pc.rzg.mpg.de
- pitcairn.mcs.anl.gov
- quad.mcs.anl.gov
- rr.alliance.unm.edu
- sr8000.lrz-muenchen.de
16 machines, 6 different usernames, 16 passwords, ...
This is hard, but it gets much worse from here…
- Portals provide:
  - Single access to all resources
  - Locate/build executables
  - Central/collaborative parameter files
  - Job submission/tracking
  - Access to new Grid Technologies
- Use any Web Browser!!

Distributed Computation: Harnessing Multiple Computers
- Why would anyone want to do this?
  - Capacity: computers can’t keep up with needs
  - Throughput
- Issues
  - Bandwidth (increasing faster than computation)
  - Latency
  - Communication needs, Topology
  - Communication/computation
  - Techniques to be developed
    - Overlapping communication/computation
    - Extra ghost zones to reduce latency
    - Compression
    - Algorithms to do this for scientist

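The "extra ghost zones to reduce latency" technique can be illustrated with a toy example. The sketch below is not Cactus code: it assumes a 1D three-point averaging stencil split across two subdomains, and the function names (`step`, `run_split`) are made up for illustration. With `ghosts` layers copied from the neighbour, each side can take `ghosts` steps before it has to communicate again, so fewer (larger) messages cross the slow link.

```python
def step(u):
    """One 3-point averaging step; the two end points are held fixed."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def run_split(u, steps, ghosts):
    """Evolve u on two subdomains, exchanging `ghosts` layers every `ghosts` steps.

    Returns the reassembled field and the number of messages sent.
    """
    mid = len(u) // 2
    left, right = u[:mid], u[mid:]
    messages = 0
    done = 0
    while done < steps:
        k = min(ghosts, steps - done)
        # Exchange: each side receives k ghost layers from its neighbour.
        l_ext = left + right[:k]
        r_ext = left[-k:] + right
        messages += 2
        # Each step invalidates one more ghost layer, so k layers allow k steps.
        for _ in range(k):
            l_ext = step(l_ext)
            r_ext = step(r_ext)
        left = l_ext[:mid]    # drop the (now stale) ghost layers
        right = r_ext[k:]
        done += k
    return left + right, messages


field = [float(i % 5) for i in range(20)]
_, msgs_one_ghost = run_split(field, steps=12, ghosts=1)
_, msgs_three_ghosts = run_split(field, steps=12, ghosts=3)
```

The trade-off is redundant computation in the ghost layers and larger messages in exchange for fewer latency-bound round trips.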
Remote Visualization & Steering
- Remote Viz data
  - HTTP
  - Streaming HDF5
  - Autodownsample
- Any Viz Client: Amira, LCA Vision, OpenDX
- Changing any steerable parameter
  - Parameters
  - Physics, algorithms
  - Performance

Remote Data Storage
- TB Data File
- Remote File Access
  - grr, psi on timestep 2
  - Lapse for r<1, every other point
- HDF5 VFD / GridFTP: Clients use file URL (downsampling, hyperslabbing)
- Network Monitoring Service
- NCSA (USA)
- Analysis of Simulation Data
- AEI (Germany)
- Visualization
- SC02 (Baltimore)
- More Bandwidth Available

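A request such as "Lapse for r<1, every other point" combines a region selection with downsampling, so only a small part of the remote file needs to travel. The sketch below shows the idea on an in-memory list. It does not use HDF5 or GridFTP, and the `select` helper and the sample values are assumptions made for illustration.

```python
def select(radii, values, r_max, stride):
    """Return (r, value) pairs with r < r_max, keeping every `stride`-th match."""
    inside = [(r, v) for r, v in zip(radii, values) if r < r_max]
    return inside[::stride]


# Made-up stand-in for a stored grid function sampled at r = 0.0, 0.25, ..., 3.75.
radii = [i * 0.25 for i in range(16)]
lapse = [r / (1.0 + r) for r in radii]

subset = select(radii, lapse, r_max=1.0, stride=2)
fraction = len(subset) / len(radii)  # share of the data actually transferred
```

In a real setting the selection would be evaluated on the server side, so that the client never pulls the full dataset.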
Vision, Part II: Dynamic Distributed Computing
- Many new ideas
  - Consider: the Grid IS your computer:
    - Networks, machines, devices come and go
    - Dynamic codes, aware of their environment, seek out resources
    - Distributed and Grid-based thread parallelism
  - Begin to change the way you think about problems: think global, solve much bigger problems
- Many old ideas
  - 1960’s all over again
  - How to deal with dynamic processes
  - Processor management
  - Memory hierarchies, etc.
Make apps able to respond to changing Grid environment...

New Paradigms for Dynamic Grids
- Code/User/Infrastructure should be aware of environment
  - What Grid Services are available?
    - Discover resources available NOW, and their current state?
    - What is my allocation?
    - What is the bandwidth/latency between sites?
- Code/User/Infrastructure should be able to make decisions
  - A slow part of my simulation can run asynchronously…spawn it off!
  - New, more powerful resources just became available…migrate there!
  - Machine went down…reconfigure and recover!
  - Need more memory (or less!)…get it by adding (dropping) machines!

New Paradigms for Dynamic Grids
- Code/User/Infrastructure should be able to publish to central server for tracking, monitoring, steering…
  - Unexpected event…notify users!
  - Collaborators from around the world all connect, examine simulation.
- Rethink your Algorithms: Task farming, Vectors, Pipelines, etc. all apply on Grids…
The Grid IS your Computer!

New Grid Applications: examples
- Intelligent Parameter Surveys, Monte Carlos
  - May control other simulations!
- Dynamic Staging: move to faster/cheaper/bigger machine (“Grid Worm”)
  - Need more memory? Need less?
- Multiple Universe: create clone to investigate steered parameter (“Grid Virus”)
- Automatic Component Loading
  - Needs of process change, discover/load/execute new calc. component on appropriate machine
- Automatic Convergence Testing
  - From initial data or initiated during simulation
- Look Ahead
  - Spawn off and run coarser resolution to predict likely future
- Spawn Independent/Asynchronous Tasks
  - Send to cheaper machine, main simulation carries on
- Routine Profiling
  - Best machine/queue, choose resolution parameters based on queue
- Dynamic Load Balancing: inhomogeneous loads, multiple grids

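Automatic convergence testing usually means running the same problem at three resolutions and estimating the observed order from the differences between the results. The sketch below shows that standard three-level estimate. A trapezoid-rule integral stands in for a simulation (an assumption made only to keep the example self-contained), and on a Grid the three runs could be spawned to separate machines.

```python
import math


def trapezoid(f, a, b, n):
    """Stand-in 'simulation': trapezoid-rule integral of f on [a, b] with n cells."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))


def convergence_order(coarse, medium, fine):
    """Observed order from results at grid spacings h, h/2 and h/4."""
    return math.log2(abs(coarse - medium) / abs(medium - fine))


results = [trapezoid(lambda x: x * x, 0.0, 1.0, n) for n in (4, 8, 16)]
order = convergence_order(*results)  # close to 2 for the trapezoid rule
```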
Issues Raised by Grid Scenarios
- Infrastructure:
  - Is it ubiquitous? Is it reliable? Does it work?
- Security:
  - How does a user or process authenticate as it moves from site to site?
  - Firewalls? Ports?
- How does user/application get information about Grid?
  - Need reliable, ubiquitous Grid information services
  - Portal, Cell phone, PDA
- What is a file? Where does it live?
  - Crazy Grid apps will leave pieces of files all over the world
- Tracking
  - How does user track the Grid simulation hierarchies?

What can be done now?
- Some current examples that work now: building blocks for the future
  - Dynamic, Adaptive Distributed Computing
    - Increase scaling from %
  - Migration: Cactus Worm
  - Spawning
  - Task Farm/Steering Combination
- If these can be done now, think what you could do tomorrow
- We are developing tools to enable such scenarios for any application

Dynamic Adaptive Distributed Computation
(T. Dramlitsch, with Argonne/U. Chicago)
- SDSC IBM SP: 1024 procs, 5x12x17 = 1020
- NCSA Origin Array: x12x(4+2+2) = 480
- OC-12 line (but only 2.5 MB/sec)
- GigE: 100 MB/sec
These experiments:
- Einstein Equations (but could be any Cactus application)
Achieved:
- First runs: 15% scaling
- With new techniques: 70-85% scaling, ~250 GF
Won “Gordon Bell Prize” (Supercomputing 2001, Denver)
Dynamic Adaptation: Number of ghostzones, compression, …

Dynamic Adaptation
[Plot labels: Adapt; 2 ghosts; 3 ghosts; Compress on!]
- Automatically Load Balance at t=0
- Automatically adapt to bandwidth/latency issues
- Application has NO KNOWLEDGE of machine(s) it is on, networks, etc.
- Adaptive techniques make NO assumptions about network
- Issues: if network conditions change faster than adaptation…

Cactus Worm: Basic Scenario
- Cactus simulation (could be anything) starts, launched from a portal
- (Queries a Grid Information Server, finds available resources)
- Migrates itself to next site, according to some criterion
- Registers new location to GIS, terminates old simulation
- User tracks/steers, using http, streaming data, etc…
- Continues around Europe…
- If we can do this, much of what we want can be done!

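The worm scenario is essentially a loop: run, checkpoint, pick the next site, restart there, register the new location. The sketch below mimics that loop in memory. It is not the Cactus Worm implementation: the resource table stands in for a Grid Information Server, the site names are hypothetical, and "most free CPUs" is just one possible migration criterion.

```python
def pick_site(resources, current):
    """Choose the site (other than the current one) with the most free CPUs."""
    candidates = {site: cpus for site, cpus in resources.items() if site != current}
    return max(candidates, key=candidates.get)


def worm(resources, start, hops, steps_per_hop):
    """Run, checkpoint and migrate `hops` times; return visited sites and final state."""
    registry = [start]            # stands in for locations registered with the GIS
    state = {"iteration": 0}
    site = start
    for _ in range(hops):
        state["iteration"] += steps_per_hop   # evolve on the current site
        checkpoint = dict(state)              # write a checkpoint
        site = pick_site(resources, site)     # query for the next site
        state = checkpoint                    # restart from the checkpoint there
        registry.append(site)                 # register the new location
    return registry, state


# Hypothetical sites and free-CPU counts.
free_cpus = {"site-a": 8, "site-b": 32, "site-c": 16}
visited, final_state = worm(free_cpus, start="site-a", hops=3, steps_per_hop=10)
```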
Determining When to Migrate: Contract Monitor
- GrADS project activity: Foster, Angulo, Cactus
- Establish a “Contract”
  - Driven by user-controllable parameters
    - Time quantum for “time per iteration”
    - % degradation in time per iteration (relative to prior average) before noting violation
    - Number of violations before migration
- Potential causes of violation
  - Competing load on CPU
  - Computation requires more processing power: e.g., mesh refinement, new sub-computation
  - Hardware problems
  - Going too fast! Using too little memory? Why waste a resource??

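The contract parameters above translate fairly directly into a small monitor. The sketch below is an illustration, not the GrADS contract monitor: the class name, default thresholds, and the choice to count only successive violations and to keep violating samples out of the running average are all assumptions. The time quantum parameter is left out for brevity.

```python
class ContractMonitor:
    """Flags migration after too many successive slow iterations."""

    def __init__(self, max_degradation=0.25, max_violations=3):
        self.max_degradation = max_degradation  # allowed slowdown, e.g. 0.25 = 25%
        self.max_violations = max_violations    # violations before migration
        self.history = []                       # iteration times that met the contract
        self.violations = 0

    def record(self, seconds):
        """Record one time-per-iteration sample; return True if migration is due."""
        if not self.history:
            self.history.append(seconds)
            return False
        average = sum(self.history) / len(self.history)
        if seconds > average * (1.0 + self.max_degradation):
            self.violations += 1                # slower than the prior average allows
        else:
            self.violations = 0                 # contract met again: reset the count
            self.history.append(seconds)
        return self.violations >= self.max_violations


monitor = ContractMonitor()
decisions = [monitor.record(t) for t in [1.0, 1.0, 1.1, 2.0, 2.0, 2.0]]
```

Here the three slow iterations at the end trigger a migration request, while a single slow iteration followed by a normal one would not.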
Migration Experiments: Contract Based
(Foster, Angulo, GrADS, Cactus Team…)
[Figure: timeline of a migration experiment (migration time not to scale)]
- Running at UIUC
- Load applied
- 3 successive contract violations
- Resource discovery & migration
- Running at UC

Spawning: SC2001 Demo
- Black hole collision simulation
  - Every n timesteps, time consuming analysis tasks done
  - Process output data, find gravitational waves, horizons
  - Take much time
  - Processes do not run well in parallel
- Solution: User invokes “Spawner”
- Analysis tasks outsourced
  - Globus enabled resource
  - Resource Discovery (only mocked up last year)
  - Login, data transfer
  - Remote jobs started up
- Main simulation can keep going without pausing
  - Except to spawn: may be time consuming itself
- It worked!

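The spawning pattern can be sketched with a local worker pool standing in for remote Globus-enabled resources. This is only an analogy: the function names, the toy "evolution" and the averaging "analysis" are made up, and no resource discovery, login or data transfer is modelled. The point is that the main loop hands off a snapshot and carries on without waiting.

```python
from concurrent.futures import ThreadPoolExecutor


def analyse(step, snapshot):
    """Stand-in for an expensive analysis task (e.g. wave extraction)."""
    return step, sum(snapshot) / len(snapshot)


def run(total_steps, every):
    """Evolve a toy state, spawning an analysis task every `every` steps."""
    data = [1.0, 2.0, 3.0]
    pending = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for step in range(1, total_steps + 1):
            data = [x * 1.01 for x in data]      # main evolution keeps going
            if step % every == 0:
                # Hand a copy of the data to the worker so later steps cannot alter it.
                pending.append(pool.submit(analyse, step, list(data)))
        return [future.result() for future in pending]


analysis_results = run(total_steps=10, every=5)
```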
Spawning on ARG Testbed
- Main Cactus BH Simulation starts here
- User only has to invoke “Spawner” thorn…
- All analysis tasks spawned automatically to free resources worldwide

Task Farming/Steering Combo
- Need to run large simulation, with dozens of input parameters
  - Selection is trial and error (resolution, boundary, etc.)
- Remember look ahead scenario? Run at lower resolution to predict likely outcome!
  - Task farm dozens of smaller jobs across grid to set initial parameters for big run
  - Task farm manager sends out jobs across resources, collects results
  - Lowest error parameter set chosen
- Main simulation starts with “best” parameters
- If possible, low resolution jobs with different parameters can be cloned off at various intervals, run for a while, and results returned to steer coordinate parameters for next phase of evolution

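A task farm manager of this kind can be sketched in a few lines. The example below uses a local thread pool in place of Grid resources, and the error model inside `low_res_job` is invented purely so the example runs; in practice each job would be a coarse simulation returning a measured error.

```python
from concurrent.futures import ThreadPoolExecutor


def low_res_job(params):
    """Stand-in for a coarse run; returns (error, params). The error model is made up."""
    resolution, damping = params
    error = (damping - 0.3) ** 2 + 1.0 / resolution
    return error, params


def task_farm(parameter_sets, workers=4):
    """Send out all jobs, collect the results, and return the lowest-error parameter set."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(low_res_job, parameter_sets))
    return min(results)[1]


# Hypothetical survey: fixed coarse resolution, four candidate damping values.
candidates = [(32, d) for d in (0.1, 0.2, 0.3, 0.4)]
best = task_farm(candidates)  # parameters the main run would start with
```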
SC2002 Demo
- Main Cactus BH Simulation starts here
- Dozens of low resolution jobs sent out to test parameters
- Data returned for main job
- Huge job generates remote data to be visualized in Baltimore

Future Dynamic Grid Computing
Physicist has new idea!
[Diagram: Brill Wave, with labels S, S1, S2, P1, P2]
We see something, but too weak. Please simulate to enhance signal!

Future Dynamic Grid Computing
[Diagram labels:]
- Found a black hole, Load new component
- Look for horizon
- Calculate/Output Grav. Waves
- Calculate/Output Invariants
- Find best resources
- Free CPUs!!
- NCSA, SDSC, RZG, LRZ
- Archive data
- SDSC
- Add more resources
- Clone job with steered parameter
- Queue time over, find new machine
- Further Calculations
- AEI
- Archive to LIGO experiment

Grid Application Toolkit (GAT)
- Application developer should be able to build simulations with tools that easily enable dynamic grid capabilities
- Want to build programming API to easily allow:
  - Query information server (e.g. GIIS)
    - What’s available for me? What software? How many processors?
  - Network Monitoring
  - Decision Routines
    - How to decide? Cost? Reliability? Size?
  - Spawning Routines
    - Now start this up over here, and that up over there
  - Authentication Server
    - Issues commands, moves files on your behalf
  - Data Transfer
    - Use whatever method is desired (Gsi-ftp, Streamed HDF5, scp…)
  - Etc…Apps themselves find, become, services

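To make the wish list concrete, the sketch below strings together query, decision, data transfer and spawning against an in-memory stand-in. It is emphatically not the GAT API: the `ToyGrid` class, its method names, the resource table and the cost-based decision rule are all hypothetical, chosen only to show how an application might call such services without knowing which machine it ends up on.

```python
class ToyGrid:
    """In-memory stand-in for grid services; hypothetical, NOT the GAT API."""

    def __init__(self, resources):
        self.resources = resources   # site name -> {"cpus": int, "cost": float}
        self.jobs = []
        self.transfers = []

    def query(self, min_cpus):
        """Information service: which sites have enough processors?"""
        return [name for name, r in self.resources.items() if r["cpus"] >= min_cpus]

    def decide(self, candidates):
        """Decision routine: here, simply the cheapest candidate."""
        return min(candidates, key=lambda name: self.resources[name]["cost"])

    def transfer(self, filename, site):
        """Data transfer: record that a file was moved to a site."""
        self.transfers.append((filename, site))

    def spawn(self, task, site):
        """Spawning routine: record the job and return its id."""
        self.jobs.append((task, site))
        return len(self.jobs) - 1


def spawn_analysis(grid, task, filename, min_cpus):
    """Find a suitable site, move the input data there, and start the task."""
    site = grid.decide(grid.query(min_cpus))
    grid.transfer(filename, site)
    return site, grid.spawn(task, site)


# Hypothetical sites, processor counts and costs.
grid = ToyGrid({
    "site-a": {"cpus": 64, "cost": 3.0},
    "site-b": {"cpus": 16, "cost": 1.0},
    "site-c": {"cpus": 128, "cost": 2.0},
})
chosen_site, job_id = spawn_analysis(grid, "horizon-finder", "output.h5", min_cpus=32)
```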
Grid Related Projects
- GridLab
  - Enabling these scenarios
- ASC: Astrophysics Simulation Collaboratory
  - NSF Funded (WashU, Rutgers, Argonne, U. Chicago, NCSA)
  - Collaboratory tools, Cactus Portal
- Global Grid Forum (GGF)
  - Applications Working Group
- GrADS: Grid Application Development Software
  - NSF Funded (Rice, NCSA, U. Illinois, UCSD, U. Chicago, U. Indiana...)
- TIKSL/GriKSL
  - German DFN funded: AEI, ZIB, Garching
  - Remote online and offline visualization, remote steering/monitoring
- Cactus Team
  - Dynamic distributed computing …

Summary
- Grid computing has many promises
  - Making today’s computing easier by managing resources better
  - Creating an environment for advanced apps of tomorrow
- In order to do this, your applications must be portable!
- Rethink your problems for the Grid Computer
  - Distributed computing, grid threads, etc.
  - Other forms of parallelism: Task farming, spawning, migration
  - Data access, data mining, etc.: all much more powerful
- Next, we’ll show you
  - Demos of what you can do now
  - How to prepare for the future