HPX-5 ParalleX in Action
Martin Swany
Associate Chair and Professor, Intelligent Systems Engineering
Deputy Director, Center for Research in Extreme Scale Technology (CREST)
Indiana University
ParalleX Execution Model
- Core tenets
  - Fine-grained parallelism
  - Hide latency with concurrency
  - Runtime introspection and adaptation
- Formal components
  - Global address space (shared-memory programming)
  - Processes
  - Compute complexes
  - Lightweight control objects
  - Parcels
- Fully flexible, but promotes fine-grained dataflow programs
- HPX-5 is based on ParalleX and is part of the Center for Shock-Wave Processing of Advanced Reactive Materials (C-SWARM) effort in PSAAP-II
Model: Global Address Space
- Flat, byte-addressable global addresses
  - Put/get with local and remote completion
  - Active-message targets
  - Array collectives
- Controls thread distribution and load balance
- Current implementation
  - Block-based allocation
  - Malloc/free with distribution (local, cyclic, user-defined, etc.)
  - Traditional PGAS or directory-based AGAS
  - High-performance local allocation (for high-frequency LCO allocation)
  - Soft core affinity for NUMA
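A minimal sketch (not from the slides) of how these pieces surface in the HPX-5 C API: write a buffer into a global block with separate local and remote completion. It assumes hpx_gas_memput(to, from, size, lsync, rsync), hpx_lco_future_new, hpx_lco_wait, and hpx_lco_delete_sync with the signatures shown; check the release headers before relying on them.

#include <hpx/hpx.h>

/* Write one block into the global address space and wait for both
 * local completion (source buffer reusable) and remote completion
 * (data visible at the target block). */
static void put_block(hpx_addr_t block, const void *src, size_t bytes) {
  hpx_addr_t lsync = hpx_lco_future_new(0);   // local-completion LCO
  hpx_addr_t rsync = hpx_lco_future_new(0);   // remote-completion LCO
  hpx_gas_memput(block, src, bytes, lsync, rsync);
  hpx_lco_wait(lsync);                        // source buffer may be reused
  hpx_lco_wait(rsync);                        // data is visible at 'block'
  hpx_lco_delete_sync(lsync);
  hpx_lco_delete_sync(rsync);
}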
Model: Parcels
- Active messages with continuations
  - Target: action, global address, immediate data
  - Continuation: action, global address (lco_set, lco_delete, memput, free, etc.)
- Execute local to the target address
- Unified local and remote execution model: send() is equivalent to thread_create()
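A hedged sketch of this interface: build a parcel targeting a global address, attach a continuation that sets a future back at the caller, and send it. The hpx_parcel_* calls and the predefined hpx_lco_set_action continuation are assumed to match the released HPX-5 API; my_action and the argument layout are illustrative only.

#include <hpx/hpx.h>

HPX_ACTION_DECL(my_action);                   // registered elsewhere; illustrative

static void spawn_with_continuation(hpx_addr_t target, int arg) {
  hpx_addr_t done = hpx_lco_future_new(sizeof(int));

  hpx_parcel_t *p = hpx_parcel_acquire(NULL, sizeof(arg));
  hpx_parcel_set_target(p, target);           // executes local to 'target'
  hpx_parcel_set_action(p, my_action);
  hpx_parcel_set_data(p, &arg, sizeof(arg));  // immediate data
  hpx_parcel_set_cont_target(p, done);        // continuation: lco_set on 'done'
  hpx_parcel_set_cont_action(p, hpx_lco_set_action);
  hpx_parcel_send(p, HPX_NULL);               // send() ~ remote thread_create()

  int result;
  hpx_lco_get(done, sizeof(result), &result); // block until the action continues a value
  hpx_lco_delete_sync(done);
}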
Model: User-level threads
- Cooperative threads
  - Block on dynamic dependencies (lco_get, memput, etc.)
- Continuation-passing style
  - Progenitor parcel specifies the continuation target and action
  - A thread "continues" a value
  - Call/cc "pushes" a continuation parcel
- Isomorphic with parcels
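A sketch of the call/cc idea, assuming hpx_call_cc(addr, action, args...) forwards the current parcel's continuation to the callee (older HPX-5 releases take additional cleanup arguments); the child action is hypothetical, and the calling convention mirrors the Fibonacci slide below.

#include <hpx/hpx.h>

HPX_ACTION_DECL(child);                       // hypothetical action

/* Instead of blocking on the child's result and then re-continuing it,
 * this handler "pushes" its own continuation onto the child's parcel:
 * whatever was waiting on this thread's value now waits on the child. */
static int forward_handler(int n) {
  return hpx_call_cc(HPX_HERE, child, n);     // tail-call with current continuation
}
HPX_ACTION(HPX_DEFAULT, 0, forward, forward_handler, HPX_INT);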
Model: Local Control Objects
- Abstract synchronization interface
- Unified local/remote access
  - Threads can get, set, wait, reset, and perform compound operations
  - Parcel sends can be made dependent on them
- Built-in classes: futures, reductions, generation counts, semaphores, ...
- User-defined classes: initialize, set handler, predicate
- Colocates data with control and synchronization
- Implements dataflow with parcel continuations
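A small sketch of LCO fan-in, assuming hpx_lco_and_new and hpx_lco_wait from the HPX-5 API and the hpx_call convention used on the Fibonacci slide; 'work' is a hypothetical action whose continuation sets the gate whether it ran locally or remotely.

#include <hpx/hpx.h>

HPX_ACTION_DECL(work);                        // hypothetical action

static void join_n(const hpx_addr_t *targets, int n) {
  hpx_addr_t done = hpx_lco_and_new(n);       // and-gate: fires after n sets

  for (int i = 0; i < n; ++i) {
    // The gate is the continuation target: each task's completion sets it,
    // regardless of which locality the task actually ran on.
    hpx_call(targets[i], work, done);
  }

  hpx_lco_wait(done);                         // block this lightweight thread
  hpx_lco_delete_sync(done);
}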
Control: Parallel Parcels and Threads
- Serial work: thread_continue and thread_call/cc establish happens-before ordering (Thread 1 < Thread 2)
- Parallel work: parcel_send is unordered (Thread 1 <> Thread 4)
- Higher level: hpx_call, local parfor, hierarchical parfor
[Figure: Thread 1 issues thread_continue(x) to Thread 2 (ordered) and parcel_send(p), parcel_send(q), parcel_send(r) to spawn unordered threads such as Thread 4.]
Control: LCO Synchronization
- Thread-thread synchronization
  - Traditional monitor-style synchronization
  - Dynamic output dependencies
  - Blocked threads as continuations
- Data-flow execution
  - Pending parcels as continuations
  - Execution "consumes" the output; it can be manually regenerated for iterative execution
- Generic user-defined LCOs
  - Any set of continuations
  - Any function and predicate
  - Lazy evaluation of the function
[Figure: a future mediates lco_set/lco_get between threads; a user-defined LCO collects lco_set(a), lco_set(b), ..., lco_set(x), evaluates f(a, b, ..., x) when pred() holds, and releases pending continuations such as parcel_send(p).]
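A sketch of the "pending parcel as continuation" pattern, assuming hpx_call_when(gate, addr, action, result) registers a parcel that the runtime releases when the gate LCO is set; next_stage is a hypothetical action.

#include <hpx/hpx.h>

HPX_ACTION_DECL(next_stage);                  // hypothetical action

static void chain_after(hpx_addr_t gate, hpx_addr_t target) {
  // The parcel for 'next_stage' stays pending on 'gate'; when producers set
  // the LCO, the send is released and the action runs local to 'target'.
  // No thread blocks here, giving data-flow execution.
  hpx_call_when(gate, target, next_stage, HPX_NULL);
}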
Data Structures, Distribution
- Global linked data structures: graphs, trees, DAGs
- Global cyclic block arrays: locality(block address)
- Global user-defined distributions: locality[block address]
- Active GAS
  - A distributed directory allows blocks to be dynamically remapped from their home localities
  - Application-specific explicit load balancing
  - Automatic load balancing through GAS tracing and graph partitioning (slow)
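A hedged sketch of a global cyclic block array: blocks are placed round-robin across localities and addressed per block. It assumes hpx_gas_alloc_cyclic(n, bsize, boundary) and hpx_addr_add(base, byte_offset, bsize) from the HPX-5 API; init_block is a hypothetical per-block action.

#include <hpx/hpx.h>

HPX_ACTION_DECL(init_block);                  // hypothetical per-block action

static hpx_addr_t make_cyclic_array(size_t nblocks, size_t bsize) {
  // Cyclic distribution: the runtime resolves locality(block address) when a
  // parcel targets a block, so the calls below run local to each block.
  hpx_addr_t base = hpx_gas_alloc_cyclic(nblocks, bsize, 0);

  hpx_addr_t done = hpx_lco_and_new(nblocks);
  for (size_t i = 0; i < nblocks; ++i) {
    hpx_addr_t block = hpx_addr_add(base, i * bsize, bsize);
    hpx_call(block, init_block, done);        // executes at the block's locality
  }
  hpx_lco_wait(done);
  hpx_lco_delete_sync(done);
  return base;
}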
Fibonacci: fib(n) = fib(n-1) + fib(n-2)

HPX_ACTION_DECL(fib);

int fib_handler(int n) {
  if (n < 2) {
    return HPX_THREAD_CONTINUE(n);                     // sequential
  }
  int l = n - 1;
  int r = n - 2;
  hpx_addr_t lhs = hpx_lco_future_new(sizeof(int));    // GAS malloc
  hpx_addr_t rhs = hpx_lco_future_new(sizeof(int));    // GAS malloc
  hpx_call(HPX_HERE, fib, lhs, l);                     // parallel
  hpx_call(HPX_HERE, fib, rhs, r);                     // parallel
  hpx_lco_get(lhs, sizeof(int), &l);                   // LCO synchronization
  hpx_lco_get(rhs, sizeof(int), &r);                   // LCO synchronization
  hpx_lco_delete_sync(lhs);                            // GAS free
  hpx_lco_delete_sync(rhs);                            // GAS free
  int fn = l + r;
  return HPX_THREAD_CONTINUE(fn);                      // sequential
}
HPX_ACTION(HPX_DEFAULT, 0, fib, fib_handler, HPX_INT);
Networking / Comms
- Internal interfaces: Photon and Isend/Irecv
  - Preferred: put/get with remote completion
  - Legacy: parcel send
- Photon
  - RDMA put/get with remote-completion operations
  - Native PSM (libfabric), IB verbs, uGNI, sockets (libfabric)
  - Parcel emulation through eager buffers, synchronized with fine-grained point-to-point locking
- Isend/Irecv
  - MPI_THREAD_FUNNELED implementation
  - PWC emulated through Isend/Irecv
  - Portability and a legacy upgrade path
Networking / Comms
- A key idea in the Photon library: put/get with completion (PWC)
- Minimal overhead to trigger a waiting thread via an LCO
- A useful paradigm when combined with an "unexpected active message" capability
- Essentially, attach parcel continuations (either already-running threads or yet-to-be-instantiated parcels) to both local and remote completion operations
Networking / Comms
- One of the key lessons from HPX-5 is the power of memget/memput-with-completion primitives (backed by the low-level photon_pwc and photon_gwc operations); they provide a very powerful abstraction
- One-sided operations in AMTs are not, by themselves, that useful
- The ability to continue threads or spawn parcels on completion is what provides the performance-improving functionality
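A sketch of that idea at the HPX-5 level: a memput whose remote completion both signals an LCO and releases a pending parcel at the target, so the completion does useful work rather than just flipping a flag. It assumes hpx_gas_memput and hpx_call_when as in the earlier sketches; consume_block is a hypothetical action.

#include <hpx/hpx.h>

HPX_ACTION_DECL(consume_block);               // hypothetical action at the target

static void put_and_notify(hpx_addr_t dst, const void *src, size_t bytes) {
  hpx_addr_t lsync = hpx_lco_future_new(0);   // local completion
  hpx_addr_t rsync = hpx_lco_future_new(0);   // remote completion

  // A parcel continuation attached to remote completion: it stays pending
  // until the data is visible at 'dst', then runs local to that block.
  hpx_call_when(rsync, dst, consume_block, HPX_NULL);

  hpx_gas_memput(dst, src, bytes, lsync, rsync);

  hpx_lco_wait(lsync);                        // source buffer may be reused
  hpx_lco_delete_sync(lsync);
  // 'rsync' should be reclaimed after the pending call fires, e.g. from
  // consume_block or a separate cleanup continuation.
}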
Thank you hpx.crest.iu.edu