Parallel Objects: Virtualization & In-Process Components
Orion Sky Lawlor
Univ. of Illinois at Urbana-Champaign
POHLL-2002
Introduction
Parallel programming is hard:
- Communication takes time: message startup cost, bandwidth & contention
- Synchronization and race conditions
- Parallelism breaks abstractions: data structures get flattened, and control must be handed off between modules
- Overall, harder than serial programming
Motivation
Parallel applications fall into three categories:
- Embarrassingly parallel: trivial, roughly one RA-week of effort. E.g. Monte Carlo, parameter sweeps, SETI@home. Communication is totally irrelevant to performance.
- Excruciatingly parallel: massive, one or more RA-years of effort. E.g. “pure” MPI codes of 10k+ lines. Communication and synchronization totally determine performance.
- “We’ll be done in 6 months…”: several parallel libraries, codes, and groups; dynamic and adaptive. E.g. multiphysics simulation.
Serial Solution: Abstract!
Build layers of software:
- High level: libc, C++ STL, …
- Mid level: the OS kernel, which silently schedules processes, keeps the CPU busy even when some processes block, and allows a process to ignore other processes
- Low level: assembler
Parallel Solution: Abstract!
The middle layer is missing:
- High level: ScaLAPACK, POOMA, …
- Mid level: ? (a “kernel” that silently schedules components, keeps the CPU busy even when some components block, and allows a component to ignore other components)
- Low level: MPI
The missing middle layer:
- Provides dynamic overlap of computation and communication, even across separate modules (see the conceptual sketch below)
- Handles inter-module handoff
- Pipelines communication
- Improves cache utilization, since components are smaller
- Provides a natural layer for advanced features such as process migration
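To make the overlap idea concrete, here is a purely conceptual sketch (not the actual runtime code) of a message-driven scheduler: the middle layer keeps the CPU busy by delivering whichever queued message is ready, no matter which module the target component belongs to; a component waiting on communication simply has no message queued, so others run in its place.

    #include <queue>
    #include <functional>

    // A pending message, already bound to its target component and method.
    struct Message { std::function<void()> deliver; };

    class Scheduler {
      std::queue<Message> ready;   // arrivals from the network plus local sends
    public:
      void enqueue(Message m) { ready.push(m); }
      void run() {
        while (!ready.empty()) {            // as long as any component has work
          Message m = ready.front();
          ready.pop();
          m.deliver();                      // hand control to that component
        }
      }
    };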
Examples: Multiprogramming
Examples: Pipelining
Middle Layer: Implementation
- Real OS processes/threads: robust, reliable, and already implemented, but with a high performance penalty and no parallel features (migration!)
- Converse/Charm++: efficient in-process components with piles of advanced features, including AMPI (an MPI interface to Charm++) and application frameworks
Charm++
- Parallel library for object-oriented C++ applications
- Messaging via method calls on communication “proxy” objects (see the sketch below)
- Methods are called by a scheduler; the system determines who runs next
- Multiple objects per processor
- Object migration fully supported, even with broadcasts and reductions
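A minimal sketch of the programming model, assuming a made-up module “hello” with an 8-element chare array (none of these names are from the talk): entry methods are invoked asynchronously through a proxy, and the scheduler decides which object’s method runs next.

    // hello.ci (Charm++ interface file):
    //   mainmodule hello {
    //     mainchare Main   { entry Main(CkArgMsg *m); };
    //     array [1D] Hello { entry Hello(); entry void sayHi(int from); };
    //   };

    // hello.C
    #include "hello.decl.h"

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        CProxy_Hello arr = CProxy_Hello::ckNew(8);   // 8 objects; the runtime
        arr[0].sayHi(-1);                            //   maps them to processors.
      }                                              // The call is asynchronous:
    };                                               //   it just enqueues a message.

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}                  // used when the object migrates
      void sayHi(int from) {                         // runs when the scheduler
        CkPrintf("Hi from element %d\n", thisIndex); //   delivers the message
        if (thisIndex + 1 < 8) thisProxy[thisIndex + 1].sayHi(thisIndex);
        else CkExit();
      }
    };

    #include "hello.def.h"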
Mapping Work to Processors
[Figure: “User View” vs. “System implementation”]
AMPI
- MPI interface, implemented on Charm++
- Multiple “virtual processors” per physical processor, implemented as user-level threads with very fast context switching
- MPI_Recv blocks only the virtual processor, not the physical one (see the example below)
- All the benefits of Charm++
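For example, ordinary MPI code like the ring below needs no source changes: compiled against AMPI, each rank becomes a user-level thread, so the MPI_Recv suspends only that virtual processor while other ranks sharing the physical processor keep computing. (The ring pattern itself is just an illustration.)

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      int rank, size, token = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      if (rank != 0)  // blocks this virtual processor only
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      token++;
      MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
      if (rank == 0) {  // rank 0 closes the ring
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("token passed through %d virtual processors\n", token);
      }
      MPI_Finalize();
      return 0;
    }

Such a program can then be launched with many more virtual processors than physical ones (for instance, something like charmrun ./ring +p4 +vp64, though the exact options depend on the installation).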
Application Frameworks
- Domain-specific interfaces: unstructured grids, structured grids, particle-in-cell
- Provide a natural interface for application scientists (Fortran!)
- “Encapsulate” communication
- Built on Charm++; the most popular interfaces to Charm++
Charm++ Features: Migration
- Automatic load balancing: balance load by migrating objects; application-independent, with built-in data collection (CPU, network) and pluggable “strategy” modules (see the sketch below)
- Adaptive job scheduler: shrink or expand a parallel job by migrating objects, for a dramatic improvement in utilization
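A sketch of what migration support asks of the programmer, assuming a hypothetical “Block” chare array holding a vector of state (the names and the iteration structure are invented for illustration): the object describes its data once in a pup routine, and the runtime uses that routine to pack it up, move it, and rebuild it wherever the load balancer decides it should live.

    #include <vector>
    #include <pup_stl.h>       // PUP support for STL containers
    #include "block.decl.h"    // hypothetical generated header for a "Block" array

    class Block : public CBase_Block {
      std::vector<double> data;              // per-object state
    public:
      Block() { usesAtSync = true; }         // opt in to measurement-based load balancing
      Block(CkMigrateMessage *m) {}          // constructor run on the destination processor
      void pup(PUP::er &p) { p | data; }     // pack/unpack state for migration
      void step() {
        // ... compute on data, exchange boundaries with neighbors ...
        AtSync();                            // pause; the load balancer may migrate us here
      }
      void ResumeFromSync() { step(); }      // next iteration, possibly on a new processor
    };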
Examples: Load Balancing
1. Adaptive refinement
2. Load balancer invoked
3. Chunks migrated
Examples: Expanding Job
Examples: Virtualization
Conclusions
- Parallel applications need something like a “kernel”: a neutral party to mediate CPU use, giving significant utilization gains
- The kernel is an easy place to put good tools: work migration support, load balancing
- Consider using Charm++: http://charm.cs.uiuc.edu/