Slide 1: Computer Science Overview
Laxmikant (Sanjay) Kale
Slide 2: Computer Science Projects: Posters
- Rocketeer
  - Home-grown visualizer
  - John, Fiedler
- Rocpanda
  - Parallel I/O
  - Winslett et al.
- Novel linear system solvers
  - de Sturler, Heath, Saylor
- Performance monitoring
  - Campbell, Zheng, Lee
- Parallel mesh support
  - "FEM" Framework
  - Parallel remeshing
  - Parallel solution transfer
  - Adaptive mesh refinement
[Figure: Compute / I/O / Disk]
Slide 3: Computer Science Projects: Talks
- Kale:
  - Processor virtualization via migratable objects
- Jiao:
  - Integration Framework
  - Surface propagation
  - Mesh adaptation
Slide 4: Migratable Objects and Charm++
- Charm++
  - Parallel C++
  - "Arrays" of objects
  - Automatic load balancing
  - Prioritization
  - Mature system
  - Available on all parallel machines we know of
- Rocket Center collaborations
  - It was clear that Charm++ would not be adopted by the whole application community
  - It was equally clear to us that it was a unique technology that would improve programmer productivity substantially
- Led to the development of AMPI
  - Adaptive MPI
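To make the "arrays of objects" model concrete, here is a minimal Charm++ chare-array sketch (not taken from the talk; the module and class names are hypothetical, and the interface file it assumes is shown as a comment). An array of Hello elements is over-decomposed relative to the number of physical processors, each element reports where it runs, and an empty reduction ends the program.

```cpp
// hello.ci (assumed Charm++ interface file for this sketch):
//   mainmodule hello {
//     mainchare Main { entry Main(CkArgMsg *m); };
//     array [1D] Hello { entry Hello(); entry void sayHi(); };
//   };

#include "hello.decl.h"   // generated from hello.ci by charmc

class Main : public CBase_Main {
 public:
  Main(CkArgMsg *m) {
    delete m;
    // Over-decompose: create 8 array elements per physical processor.
    const int nElems = 8 * CkNumPes();
    CProxy_Hello arr = CProxy_Hello::ckNew(nElems);
    arr.sayHi();  // broadcast to all elements
  }
};

class Hello : public CBase_Hello {
 public:
  Hello() {}
  Hello(CkMigrateMessage *) {}  // required so the object can be migrated
  void sayHi() {
    CkPrintf("Element %d is on processor %d\n", thisIndex, CkMyPe());
    // Empty reduction: exit once every element has contributed.
    contribute(CkCallback(CkCallback::ckExit));
  }
};

#include "hello.def.h"    // generated definitions
```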
Slide 5: Processor Virtualization
Benefits:
- Software engineering
  - The number of virtual processors can be controlled independently
  - Separate VPs for different modules
- Message-driven execution
  - Adaptive overlap of communication
  - Predictability: enables automatic out-of-core execution
  - Asynchronous reductions
- Dynamic mapping
  - Heterogeneous clusters: vacate, adjust to speed, share
  - Automatic checkpointing
  - Change the set of processors used
  - Automatic dynamic load balancing
  - Communication optimization: collectives
[Figure: real processors, MPI processes, and virtual processors (user-level migratable threads). The programmer [over]decomposes the computation into virtual processors; the runtime assigns VPs to processors, which enables adaptive runtime strategies. Implementations: Charm++, AMPI]
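For MPI users, virtualization means an ordinary MPI program whose ranks become user-level migratable threads under AMPI, so it can run with many more ranks (virtual processors) than physical processors. The sketch below is such a program; the build/run commands in the comment reflect the usual AMPI workflow (ampicc, charmrun, a +vp-style option), but the exact flags vary by version and should be treated as illustrative.

```cpp
// Build and run (illustrative AMPI workflow; exact flags vary by version):
//   ampicc -o vp_demo vp_demo.cpp
//   ./charmrun +p4 ./vp_demo +vp32   # 32 virtual processors on 4 physical ones
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Each "rank" here is a virtual processor: a user-level thread that the
  // AMPI runtime maps onto, and may later migrate among, real processors.
  double local = 1.0 * rank, sum = 0.0;
  MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("%d virtual processors, reduction result %g\n", nranks, sum);
  MPI_Finalize();
  return 0;
}
```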
Slide 6: Highly Agile Dynamic Load Balancing
- Needed, for example, to handle the onset of plasticity around a crack
- A simple example: plasticity in a bar
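The runtime's measurement-based load balancers respond to this kind of shift by redistributing migratable objects when measured loads become uneven. As a rough, self-contained illustration (generic code, not the actual Charm++ strategy implementation), the sketch below applies a greedy heuristic similar in spirit to Charm++'s greedy balancers: assign the heaviest remaining object to the currently least-loaded processor.

```cpp
// Minimal sketch of greedy, measurement-based load balancing: objects carry
// measured loads; each is assigned to the least-loaded processor so far.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

int main() {
  const int numProcs = 4;
  // Hypothetical measured loads (e.g., per-object CPU time from the last phase).
  std::vector<double> objLoad = {9.0, 1.0, 7.5, 3.0, 2.0, 8.0, 0.5, 4.0};

  // Sort object indices by decreasing load.
  std::vector<int> order(objLoad.size());
  for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return objLoad[a] > objLoad[b]; });

  // Min-heap of (processor load, processor id).
  using Proc = std::pair<double, int>;
  std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> procs;
  for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

  std::vector<int> assignment(objLoad.size());
  for (int obj : order) {
    Proc least = procs.top();
    procs.pop();
    assignment[obj] = least.second;
    least.first += objLoad[obj];
    procs.push(least);
  }

  for (size_t i = 0; i < assignment.size(); ++i)
    std::printf("object %zu (load %.1f) -> processor %d\n",
                i, objLoad[i], assignment[i]);
  return 0;
}
```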
Slide 7: Optimizing All-to-All via Mesh
- Organize processors in a 2D (virtual) grid
- Phase 1: each processor sends messages within its row
- Phase 2: each processor sends messages within its column
- A message from (x1, y1) to (x2, y2) goes via (x1, y2)
- Each processor sends 2(√P − 1) messages instead of P − 1
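The sketch below spells out the two-phase routing rule just described (hypothetical helper names, assuming P is a perfect square): a message from (x1, y1) to (x2, y2) is relayed through (x1, y2), so each processor originates √P − 1 combined messages per phase, i.e. 2(√P − 1) in total instead of P − 1.

```cpp
// Sketch of the two-phase mesh all-to-all routing rule (hypothetical helpers).
// Processors form a sqrt(P) x sqrt(P) virtual grid; a message from (x1,y1) to
// (x2,y2) is relayed through (x1,y2): phase 1 moves along the row, phase 2
// along the column.
#include <cmath>
#include <cstdio>

struct Coord { int x, y; };

Coord toCoord(int pe, int side) { return {pe / side, pe % side}; }
int toPe(Coord c, int side) { return c.x * side + c.y; }

// Intermediate processor for a message from src to dst.
int viaProcessor(int src, int dst, int side) {
  Coord s = toCoord(src, side), d = toCoord(dst, side);
  return toPe({s.x, d.y}, side);   // (x1, y2)
}

int main() {
  const int P = 64;  // assume P is a perfect square
  const int side = static_cast<int>(std::lround(std::sqrt(P)));

  // Each processor combines the data for a whole row/column into one message,
  // so it originates (side - 1) messages per phase: 2*(sqrt(P)-1) in total,
  // versus P-1 for a direct all-to-all.
  std::printf("P = %d: direct all-to-all sends %d messages per processor, "
              "mesh strategy sends %d\n", P, P - 1, 2 * (side - 1));

  // Example route: a message from processor 5 to processor 34.
  std::printf("route 5 -> 34 goes via processor %d\n", viaProcessor(5, 34, side));
  return 0;
}
```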
Slide 8: Optimized All-to-All "Surprise"
[Chart: 76-byte all-to-all on Lemieux, completion time vs. computation overhead]
- The CPU is free during most of the time taken by a collective operation
- Led to the development of asynchronous collectives, now supported in AMPI
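The sketch below shows the overlap pattern that asynchronous collectives enable: start the all-to-all, do independent computation while it is in flight, then wait. It uses the MPI-3 style MPI_Ialltoall for illustration; AMPI's original asynchronous-collective interface predates MPI-3 and may differ in names and details.

```cpp
// Overlapping computation with an all-to-all, the pattern that asynchronous
// collectives enable. Shown with the MPI-3 style MPI_Ialltoall; AMPI's
// original interface predates MPI-3 and may differ.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int nranks;
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  const int chunk = 19;  // number of doubles per destination (illustrative)
  std::vector<double> sendbuf(chunk * nranks, 1.0), recvbuf(chunk * nranks);

  MPI_Request req;
  MPI_Ialltoall(sendbuf.data(), chunk, MPI_DOUBLE,
                recvbuf.data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD, &req);

  // The CPU is free while the collective is in flight: do independent work here.
  double busy = 0.0;
  for (int i = 0; i < 1000000; ++i) busy += 1e-6 * i;

  MPI_Wait(&req, MPI_STATUS_IGNORE);  // results needed beyond this point
  (void)busy;
  MPI_Finalize();
  return 0;
}
```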
Slide 9: Latency Tolerance: Multi-Cluster Jobs
- Job co-scheduled to run across two clusters to provide access to large numbers of processors
- But cross-cluster latencies are large!
- Virtualization within Charm++ masks the high inter-cluster latency by allowing overlap of communication with computation
[Figure: Cluster A and Cluster B; intra-cluster latency in microseconds, inter-cluster latency in milliseconds]
Slide 10: Hypothetical Timeline of a Multi-Cluster Computation
[Figure: timeline for processors A, B, and C with a cross-cluster boundary]
- Processors A and B are on one cluster, processor C on a second cluster
- Communication between clusters goes over a high-latency WAN
- Processor virtualization allows the latency to be masked
Slide 11: Multi-Cluster Experiments
- Experimental environment
  - Artificial latency environment: a VMI "delay device" adds a pre-defined latency between arbitrary pairs of nodes
  - TeraGrid environment: experiments run between NCSA and ANL machines (~1.725 ms one-way latency)
- Experiments
  - Five-point stencil (2D Jacobi) for matrix sizes 2048x2048 and 8192x8192
  - LeanMD molecular dynamics code running a 30,652-atom system
Slide 12: Five-Point Stencil Results (P = 64)
Slide 13: Fault Tolerance
- Automatic checkpointing for AMPI and Charm++
  - Migrate objects to disk!
  - Automatic fault detection and restart
  - Now available in the distribution version of AMPI and Charm++
- New work
  - In-memory checkpointing
  - Scalable fault tolerance
- "Impending fault" response
  - Migrate objects to other processors
  - Adjust processor-level parallel data structures
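Checkpointing "by migrating objects to disk" reuses the same serialization machinery that ordinary object migration uses. In Charm++ this is the PUP (pack/unpack) framework; the hypothetical class below sketches a pup routine that the runtime can invoke both to migrate the object between processors and to write its state into a checkpoint.

```cpp
// Sketch of a Charm++ PUP routine for a hypothetical migratable object. The
// same pup() is used when the runtime migrates the object between processors
// and when it checkpoints the object's state to disk or memory.
#include <vector>
#include "pup.h"      // Charm++ PUP framework
#include "pup_stl.h"  // operator| support for STL containers

class CrackElement {  // hypothetical chare-array element state
  int step;
  double stress;
  std::vector<double> displacement;

 public:
  // Called by the runtime to pack state (migrate out, checkpoint) and to
  // unpack it again (migrate in, restart).
  void pup(PUP::er &p) {
    p | step;
    p | stress;
    p | displacement;
  }
};
```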
Slide 14: In-Memory Double Checkpoint
- In-memory checkpoint
  - Faster than disk
- Coordinated checkpoint
  - Simple
  - User can decide what makes up useful state
- Double checkpointing
  - Each object maintains 2 checkpoints:
    - on the local physical processor
    - on a remote "buddy" processor
- For jobs with large memory
  - Use local disks!
[Figure: 32 processors with 1.5 GB memory each]
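A rough sketch of the double-checkpoint idea follows (generic MPI-style code, not the Charm++ implementation): each processor serializes its objects' state, keeps one copy locally, and places a second copy on a "buddy" processor, so the checkpoint survives the loss of either one.

```cpp
// Generic sketch of double in-memory checkpointing (not the Charm++ code):
// every rank keeps one copy of its checkpoint locally and stores a second
// copy on a "buddy" rank, so a single failure never loses both copies.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Hypothetical serialized object state for this rank.
  std::vector<char> myState(1024, static_cast<char>(rank));

  // Checkpoint 1: local copy.
  std::vector<char> localCheckpoint = myState;

  // Checkpoint 2: send my state to my buddy, and hold my neighbor's state
  // on its behalf (a simple ring of buddies).
  const int buddy = (rank + 1) % nranks;
  const int from  = (rank - 1 + nranks) % nranks;
  std::vector<char> heldForNeighbor(myState.size());
  MPI_Sendrecv(myState.data(), (int)myState.size(), MPI_CHAR, buddy, 0,
               heldForNeighbor.data(), (int)heldForNeighbor.size(), MPI_CHAR,
               from, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  // If this rank fails, rank `buddy` still holds a copy of myState; if a
  // neighbor fails, its state is restored from heldForNeighbor here.
  MPI_Finalize();
  return 0;
}
```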
Slide 15: Scalable Fault Tolerance
- Motivation:
  - When one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints!
- How?
  - Sender-side message logging
  - Latency tolerance mitigates the costs
  - Restart can be sped up by spreading the failed processor's objects across other processors
- Long-term project
- Current progress
  - Basic scheme implemented and tested in simple programs
  - General-purpose implementation in progress
Only the failed processor's objects recover from their checkpoints, while the others "continue".
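A minimal sketch of sender-side message logging follows (generic code, not the actual Charm++ protocol): every outgoing message is appended to a log on the sender, so when a receiver fails and restarts from its checkpoint, the senders replay their logged messages to it and no other processor has to roll back.

```cpp
// Generic sketch of sender-side message logging (not the Charm++ protocol):
// the sender keeps every message it has sent since the destination's last
// checkpoint, so a restarted receiver can have them replayed.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct LoggedMessage {
  int dest;          // destination processor
  int seq;           // per-destination sequence number
  std::string data;  // payload
};

class MessageLogger {
  std::vector<LoggedMessage> log_;
  std::map<int, int> nextSeq_;

 public:
  // Record a message at send time.
  LoggedMessage send(int dest, const std::string &data) {
    LoggedMessage m{dest, nextSeq_[dest]++, data};
    log_.push_back(m);
    return m;
  }

  // After `dest` restarts from its checkpoint, replay what it had received.
  void replayTo(int dest) const {
    for (const LoggedMessage &m : log_)
      if (m.dest == dest)
        std::printf("replaying seq %d to processor %d: %s\n",
                    m.seq, m.dest, m.data.c_str());
  }

  // Logged messages can be discarded once `dest` takes a new checkpoint.
  void truncateFor(int dest) {
    std::vector<LoggedMessage> kept;
    for (const LoggedMessage &m : log_)
      if (m.dest != dest) kept.push_back(m);
    log_.swap(kept);
  }
};

int main() {
  MessageLogger logger;
  logger.send(3, "boundary data, step 10");
  logger.send(3, "boundary data, step 11");
  logger.send(5, "boundary data, step 10");
  logger.replayTo(3);  // processor 3 failed and restarted: only it rolls back
  return 0;
}
```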
Slide 16: Parallel Objects, Adaptive Runtime System, Libraries and Tools
The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE:
- Molecular dynamics
- Crack propagation
- Space-time meshes
- Computational cosmology
- Rocket simulation
- Protein folding
- Dendritic growth
- Quantum chemistry (QM/MM)
Approach: develop abstractions in the context of full-scale applications.
Slide 17: Next…
- Jim Jiao:
  - Integration Framework
  - Surface propagation
  - Mesh adaptation