DISC: A Domain-Interaction Based Programming Model With Support for Heterogeneous Execution
Mehmet Can Kurt, The Ohio State University
Gagan Agrawal, The Ohio State University
Heterogeneity in HPC - Present and Future
Present:
- Use of accelerators, e.g., a CPU + MIC cluster
Future:
- Decreasing feature sizes will increase process variation
- Power-efficient technologies such as NTV (near-threshold voltage) will compound process variation
- Local power and thermal optimizations
- Relative speeds are application-specific
- Variations can even be dynamic
Application Development for Heterogeneous HPC
Existing programming models (MPI, PGAS):
- Designed (largely) for homogeneous settings
- Require explicit partitioning and communication
Explicit partitioning:
- Must know the relative speed of CPU and MIC cores
- Code is not portable
- Handles only static variations
Task models:
- Not suitable/popular for communication-oriented applications
Our Work
DISC: a high-level programming model
- Built on the notion of a domain and interactions between domain elements
- Suitable for most classes of popular scientific applications
- Abstractions hide data distribution and communication, captured through a domain-interaction API
Key features:
- Automatic partitioning and communication
- Heterogeneous execution support with work redistribution
- Automated resilient execution (ongoing work)
Scientific Applications
- Structured and unstructured grids, N-body simulations
Similarities:
- Iterative structure
- A domain and interactions among domain elements
- Interactions drive the computation
Programming involves bookkeeping:
- Partitions and task assignment
- Identifying data to send/receive
- Preparing input/output buffers
DISC Abstractions: Domain
- The input space is represented as a multidimensional domain
- Data points are domain elements
- Domain initialization through the API leverages automatic partitioning
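A minimal sketch of how a 2D grid domain might be declared through a domain-interaction style API. The disc_domain type and the disc_create_grid_domain call are illustrative assumptions rather than the actual DISC interface; the real runtime would choose the local partition boundaries automatically from the number of processes.

    /* Hypothetical sketch of declaring a 2D grid domain; disc_* names here
     * are illustrative only, not the actual DISC API. */
    #include <stdio.h>

    typedef struct {
        int ndims;          /* dimensionality of the domain           */
        int size[3];        /* extent of the domain in each dimension */
        int lo[3], hi[3];   /* bounds of the locally owned subdomain  */
    } disc_domain;

    /* Stand-in for the runtime call that registers the global domain and
     * returns the automatically chosen local partition. */
    static disc_domain disc_create_grid_domain(int nx, int ny) {
        disc_domain d = { .ndims = 2, .size = { nx, ny, 1 } };
        /* In the real runtime the partition boundaries would be computed
         * from the process count; here the single process owns everything. */
        d.lo[0] = 0; d.lo[1] = 0;
        d.hi[0] = nx - 1; d.hi[1] = ny - 1;
        return d;
    }

    int main(void) {
        disc_domain grid = disc_create_grid_domain(1024, 1024);
        printf("local subdomain: [%d..%d] x [%d..%d]\n",
               grid.lo[0], grid.hi[0], grid.lo[1], grid.hi[1]);
        return 0;
    }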
DISC Abstractions: Interaction between Domain Elements
- Grid-based interactions (inferred from the domain type)
- Radius-based interactions (specified by a cutoff distance)
- Explicit-list based interactions (specified by point connectivity)
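A sketch of how these interaction types could be registered. disc_add_interaction() is named later in the talk for explicit-list interactions, but its exact signature, and the radius-based disc_set_cutoff() helper, are assumptions made purely for illustration.

    /* Sketch of registering interactions between domain elements.
     * Signatures are illustrative; only the disc_add_interaction() name
     * comes from the talk. */
    #include <stdio.h>

    #define MAX_EDGES 16

    typedef struct { int src, dst; } disc_edge;

    static disc_edge interaction_list[MAX_EDGES];
    static int       interaction_count = 0;
    static double    cutoff_radius     = 0.0;

    /* Explicit-list interaction: point `src` reads attributes of point `dst`. */
    static void disc_add_interaction(int src, int dst) {
        if (interaction_count < MAX_EDGES)
            interaction_list[interaction_count++] = (disc_edge){ src, dst };
    }

    /* Radius-based interaction: every pair of points within `rc` interacts. */
    static void disc_set_cutoff(double rc) { cutoff_radius = rc; }

    int main(void) {
        /* Unstructured mesh: edges of the mesh define the interactions. */
        disc_add_interaction(0, 1);
        disc_add_interaction(1, 2);

        /* Molecular dynamics: interactions given by a cutoff distance. */
        disc_set_cutoff(2.5);

        printf("%d explicit interactions, cutoff = %.1f\n",
               interaction_count, cutoff_radius);
        return 0;
    }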
Compute-Function and Computation-Space
Compute-function:
- Calculates new values for point attributes
- Invoked by the runtime at each iteration
Computation-space (one per subdomain):
- Updates are performed on the computation-space
- Leverages automatic repartitioning
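A sketch of a user-supplied compute-function for a 5-point Jacobi stencil. The signature is an assumption (the real API presumably passes the subdomain and computation-space through runtime handles), but it illustrates the stated roles: invoked by the runtime once per iteration, reading the domain and writing updates into a separate computation-space.

    /* Sketch of a compute-function for a 5-point Jacobi stencil; the
     * signature is illustrative, not the actual DISC API. */
    #include <stdio.h>

    #define NX 6
    #define NY 6

    /* Updates go to a separate computation-space (out) so the runtime can
     * repartition it independently of the input domain (in). */
    static void jacobi_compute(double in[NX][NY], double out[NX][NY]) {
        for (int i = 1; i < NX - 1; ++i)
            for (int j = 1; j < NY - 1; ++j)
                out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] +
                                    in[i][j-1] + in[i][j+1]);
    }

    int main(void) {
        static double in[NX][NY], out[NX][NY];
        for (int i = 0; i < NX; ++i)
            for (int j = 0; j < NY; ++j)
                in[i][j] = (i == 0 || j == 0 || i == NX-1 || j == NY-1) ? 1.0 : 0.0;

        jacobi_compute(in, out);   /* one iteration; the runtime would loop */
        printf("out[1][1] = %.2f\n", out[1][1]);
        return 0;
    }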
Runtime Communication Generation from the Domain-Interaction API
- Each subdomain needs updated attributes of the elements it interacts with in other subdomains
- The DISC runtime knows the partitioning (boundaries of each subdomain) and the nature of interaction among points
- Automatic communication generation identifies which elements should be sent where and places received values in runtime structures
Runtime Communication Generation from the Domain-Interaction API
Grid-based interactions:
- Seen in stencil patterns; acquire ghost rows and columns
- Single exchange with immediate neighbors (east, west, north, south); sketched below
Radius-based interactions:
- Seen in molecular dynamics (cutoff distance r_c); acquire all elements inside a sphere of radius r_c
- One or more exchanges (depending on r_c) with immediate neighbors
Explicit-list based interactions:
- Specified explicitly by the disc_add_interaction() routine
- Exchanges with any subdomains (not just immediate neighbors)
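The grid-based exchange below is a hand-written MPI illustration of what the DISC runtime generates automatically for stencil patterns: each rank trades one ghost row with its immediate north/south neighbors in a 1D row-wise decomposition. The decomposition and buffer layout are assumptions for the example, not DISC internals.

    /* Illustration of the ghost-row exchange the runtime automates for
     * grid-based interactions (1D row-wise decomposition). */
    #include <mpi.h>
    #include <stdio.h>

    #define NY 8   /* columns per row */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Rows 1..2 are owned; rows 0 and 3 are ghost rows to be filled. */
        double local[4][NY];
        for (int j = 0; j < NY; ++j) {
            local[0][j] = -1.0;        /* north ghost row */
            local[1][j] = rank;        /* first owned row */
            local[2][j] = rank + 0.5;  /* last owned row  */
            local[3][j] = -1.0;        /* south ghost row */
        }

        int north = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int south = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send first owned row north, receive the south ghost row, and
         * vice versa; MPI_PROC_NULL makes boundary ranks no-ops. */
        MPI_Sendrecv(local[1], NY, MPI_DOUBLE, north, 0,
                     local[3], NY, MPI_DOUBLE, south, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(local[2], NY, MPI_DOUBLE, south, 1,
                     local[0], NY, MPI_DOUBLE, north, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d: north ghost = %.1f, south ghost = %.1f\n",
               rank, local[0][0], local[3][0]);
        MPI_Finalize();
        return 0;
    }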
Work Redistribution for Heterogeneity
Main idea: shrinking/expanding a subdomain changes a processor's workload
- t_i: unit-processing time of processor i, defined as t_i = T_i / n_i
- T_i: total time spent on compute-functions
- n_i: number of local points in the assigned subdomain
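A sketch of how each process could measure its unit-processing time t_i = T_i / n_i as defined above: accumulate the time spent in compute-functions over the iterations and divide by the number of locally owned points. The timing loop body and the point count are placeholders.

    /* Sketch of measuring t_i = T_i / n_i on each process. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int n_local = 100000;       /* points in this rank's subdomain (n_i) */
        double T_compute = 0.0;     /* total time in compute-functions (T_i) */

        for (int iter = 0; iter < 10; ++iter) {
            double t0 = MPI_Wtime();
            /* ... invoke the compute-function over the local subdomain ... */
            volatile double s = 0.0;
            for (int p = 0; p < n_local; ++p) s += p * 1e-9;
            T_compute += MPI_Wtime() - t0;
        }

        double t_unit = T_compute / n_local;   /* t_i = T_i / n_i */
        printf("rank %d: unit-processing time = %g s/point\n", rank, t_unit);

        MPI_Finalize();
        return 0;
    }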
Work Redistribution for Heterogeneity
1D case:
- Size of each subdomain is made inversely proportional to its unit-processing time (sketched below)
2D/3D case: expressed as a non-linear optimization problem
    min T_max
    s.t.  x_r1 * y_r1 * t_1 <= T_max
          x_r2 * y_r1 * t_2 <= T_max
          ...
          x_r1 + x_r2 + x_r3 = x_r
          y_r1 + y_r2 = y_r
- Each constraint bounds one processor's time (the points in its x_ri-by-y_rj rectangle times its unit-processing time); the partition widths and heights must sum to the domain extents x_r and y_r
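A sketch of the 1D repartitioning rule stated above: each processor's new subdomain size is made inversely proportional to its unit-processing time, so all processors need roughly equal time for their shares. The example unit-processing times and the naive rounding are assumptions.

    /* Sketch of the 1D rule: subdomain size inversely proportional to t_i. */
    #include <stdio.h>

    int main(void) {
        const int    N = 1200;                      /* total points in the domain */
        const double t[] = { 1.0, 1.0, 1.4, 1.4 };  /* unit-processing times t_i  */
        const int    P = sizeof t / sizeof t[0];

        double inv_sum = 0.0;
        for (int i = 0; i < P; ++i) inv_sum += 1.0 / t[i];

        int assigned = 0;
        for (int i = 0; i < P; ++i) {
            /* share_i = N * (1/t_i) / sum_j (1/t_j); last rank absorbs rounding */
            int share = (i == P - 1) ? N - assigned
                                     : (int)(N * (1.0 / t[i]) / inv_sum);
            assigned += share;
            printf("processor %d: t_i = %.1f, new subdomain size = %d\n",
                   i, t[i], share);
        }
        return 0;
    }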
Example Scenario (figures: subdomains before repartitioning and after repartitioning)
Implementation: Putting it All Together
Other Benefits of DISC
Can support restart with a different number of nodes, i.e., partition to a different number of processes
Why?
- Failure with no replacement node
- Performance within a power budget
- Exploiting cloud elasticity
- More flexible scheduling on HPC platforms
- Switching off nodes/cores for power/thermal reasons
DISC and Automated Resilient Execution
- Supports automated application-level checkpointing using the notion of domains and computation-spaces
- Can also help with soft errors:
  - Separates data and control, so communication and synchronization can be protected
  - Exposes the iterative structure
  - The applicable technique can depend on the nature of the interactions
- Ongoing work
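One plausible illustration (not the actual DISC mechanism, which is ongoing work) of application-level checkpointing over a computation-space: each rank periodically serializes its locally owned points together with the iteration number, so a restart can rebuild the domain, possibly on a different number of processes. The file layout and the disc_checkpoint name are assumptions.

    /* Hypothetical sketch of checkpointing a computation-space to disk. */
    #include <stdio.h>
    #include <stdlib.h>

    static int disc_checkpoint(int rank, int iter, const double *space, int n) {
        char path[64];
        snprintf(path, sizeof path, "checkpoint_rank%d.bin", rank);
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        /* Header: iteration number and local point count, then the data. */
        fwrite(&iter, sizeof iter, 1, f);
        fwrite(&n, sizeof n, 1, f);
        fwrite(space, sizeof *space, (size_t)n, f);
        fclose(f);
        return 0;
    }

    int main(void) {
        const int n = 1000, rank = 0;
        double *space = malloc(n * sizeof *space);
        for (int i = 0; i < n; ++i) space[i] = i * 0.5;  /* fake attribute values */

        for (int iter = 1; iter <= 50; ++iter) {
            /* ... compute-function updates `space` here ... */
            if (iter % 10 == 0 && disc_checkpoint(rank, iter, space, n) != 0)
                fprintf(stderr, "checkpoint failed at iteration %d\n", iter);
        }
        free(space);
        return 0;
    }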
Experiments
- Implemented in C on MPICH2
- Each node has two quad-core 2.53 GHz Intel(R) Xeon(R) processors with 12 GB RAM
- Up to 128 nodes (using a single core on each node)
Applications:
- Stencil (Jacobi, Sobel)
- Unstructured grid (Euler)
- Molecular dynamics (MiniMD)
Homogeneous Configurations (charts: MiniMD, Euler)
- Comparison against MPI implementations
- Average overheads: 2.7% (MiniMD), < 1% (Euler)
Homogeneous Configurations (charts: Jacobi, Sobel)
- Average overheads: 0.5% (Jacobi), 3.8% (Sobel)
Heterogeneous Configurations (varying number of cores slowed by 40%; charts: MiniMD, Euler)
- Slowdown reduction: 54%, 10-15%, 67-73%, 41-47%
Heterogeneous Configurations (varying number of cores slowed by 40%; charts: Jacobi, Sobel)
- Slowdown reduction: 47-51%, 8-25%, 56%, 14%
Heterogeneous Configurations (64 cores slowed by varying percentages; charts: MiniMD, Euler)
- disc-perfect: T_disc x (P_homogeneous / P_heterogeneous), i.e., the ideal execution time if the lost processing capacity were perfectly absorbed
- Chart annotations (25% and 25-50% slowdown cases): 25%, 9%, 83%, 18%, 36%, 25%, 111%, 55%
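Assuming P_homogeneous and P_heterogeneous denote the aggregate processing power of the homogeneous and heterogeneous configurations, a worked instance of the disc-perfect baseline for a hypothetical run of 128 cores with 64 of them slowed by 25% would read:

    T_{\mathrm{disc\text{-}perfect}} = T_{\mathrm{disc}} \cdot \frac{P_{\mathrm{homogeneous}}}{P_{\mathrm{heterogeneous}}}
                                     = T_{\mathrm{disc}} \cdot \frac{128}{64 + 64 \cdot 0.75}
                                     = T_{\mathrm{disc}} \cdot \frac{128}{112} \approx 1.14 \, T_{\mathrm{disc}}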
Charm++ Comparison: Euler (4 nodes slowed down out of 16)
- Different load-balancing strategies for Charm++ (RefineLB); load-balance once at the beginning
- (a) Homogeneous: Charm++ % slower than DISC
- (c) Heterogeneous with load balancing: Charm++, at 64 chares (best case), 14.5% slower than DISC
Decomposition across CPU and Accelerator (chart)
- Process I (CPU), Process II (GPU)
- Asterisks (*) mark DISC's decision
Conclusion
- A parallel programming model for scientific applications
- Automatic work partitioning and communication
- Automatic repartitioning for heterogeneity support
Thank you. Questions?