
1 DISC: A Domain-Interaction Based Programming Model With Support for Heterogeneous Execution
Mehmet Can Kurt, The Ohio State University
Gagan Agrawal, The Ohio State University

2 Heterogeneity in HPC - Present and Future
Present:
- use of accelerators, e.g., a CPU + MIC cluster
Future:
- decreasing feature sizes will increase process variation
- power-efficient technologies such as NTV (near-threshold voltage) will compound process variation
- local power and thermal optimizations
Relative speeds are application-specific, and variations can even be dynamic.

3 Application Development for Heterogeneous HPC
Existing programming models (MPI, PGAS):
- designed largely for homogeneous settings
- require explicit partitioning and communication
Explicit partitioning:
- must know the relative speed of CPU and MIC cores
- code is not portable
- handles only static variations
Task models: not suitable/popular for communication-oriented applications

4 Our Work
DISC: a high-level programming model
- built on the notion of a domain and interactions between domain elements
- suitable for most classes of popular scientific applications
- abstractions to hide data distribution and communication, captured through a domain-interaction API
Key features:
- automatic partitioning and communication
- heterogeneous execution support with work redistribution
- automated resilient execution (ongoing work)

5 Scientific Applications
Structured and unstructured grids, N-body simulations
Similarities:
- iterative structure
- a domain and interactions among domain elements
- interactions drive computations
Programming involves bookkeeping:
- partitioning and task assignment
- identifying data to send/receive
- preparing input/output buffers

6 DISC Abstractions: Domain
- input space is expressed as a multidimensional domain
- data points are domain elements
- domain initialization through the API leverages automatic partitioning (see the sketch below)
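The slides do not show the API itself; the following is a minimal sketch of what declaring a domain might look like. All names here (disc_domain, disc_domain_create) are invented for illustration, not the actual DISC interface:

    #include <stdio.h>

    /* Hypothetical DISC-style domain descriptor: the application declares
     * the shape of its input space and lets the runtime partition it. */
    typedef struct {
        int ndims;    /* dimensionality of the domain */
        int size[3];  /* extent in each dimension */
    } disc_domain;

    /* Assumed initialization call: registers the domain so that the
     * runtime can derive a partitioning across processes automatically. */
    disc_domain disc_domain_create(int ndims, const int *size) {
        disc_domain d;
        d.ndims = ndims;
        for (int i = 0; i < ndims; i++)
            d.size[i] = size[i];
        return d;
    }

    int main(void) {
        int size[2] = {1024, 1024};  /* a 1024 x 1024 grid of elements */
        disc_domain d = disc_domain_create(2, size);
        printf("domain: %dD, %d x %d\n", d.ndims, d.size[0], d.size[1]);
        return 0;
    }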

7 DISC Abstractions: Interaction between Domain Elements
- grid-based interactions (inferred from the domain type)
- radius-based interactions (defined by a cutoff distance)
- explicit-list based interactions (defined by point connectivity)

8 compute-function and computation-space
compute-function:
- calculates new values for point attributes
- invoked by the runtime at each iteration
computation-space (one per subdomain):
- updates are performed on the computation-space
- leverages automatic repartitioning
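A sketch of a possible compute-function for a Jacobi-style stencil; the callback signature and buffer layout are assumptions, shown only to illustrate the idea of reading current attribute values (including runtime-fetched ghost elements) and writing updates into the computation space:

    /* Hypothetical compute-function: the runtime invokes this once per
     * iteration for each local point (i, j) of the assigned subdomain.
     * `in` holds the current attribute values, including ghost elements
     * acquired from neighboring subdomains; `out` is the computation
     * space where the new values are written. */
    void jacobi_compute(const double *in, double *out,
                        int i, int j, int width) {
        /* 5-point stencil average; neighbors may lie in ghost regions */
        out[i * width + j] = 0.25 * (in[(i - 1) * width + j] +
                                     in[(i + 1) * width + j] +
                                     in[i * width + (j - 1)] +
                                     in[i * width + (j + 1)]);
    }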

9 Runtime Communication Generation from the Domain-Interaction API
Each subdomain needs updated attributes of the elements it interacts with in other subdomains.
The DISC runtime knows:
- the partitioning (boundaries of each subdomain)
- the nature of the interaction among points
Automatic communication:
- identifies which elements should be sent where
- places received values in runtime structures

10 Runtime Communication Generation from the Domain-Interaction API
Grid-based interactions:
- seen in stencil patterns
- acquire ghost rows and columns
- a single exchange with immediate neighbors (east, west, north, south)
Radius-based interactions:
- seen in molecular dynamics (cutoff distance r_c)
- acquire all elements inside a sphere
- one or more exchanges (depending on r_c) with immediate neighbors
Explicit-list based interactions:
- specified explicitly by the disc_add_interaction() routine
- exchanges with any subdomains (not just immediate neighbors)
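For grid-based interactions, the communication the runtime generates amounts to a ghost exchange. The MPI sketch below illustrates such an exchange along one dimension of a row-decomposed 2D grid; DISC derives this automatically from the domain and interaction declarations, and the function and variable names here are invented:

    #include <mpi.h>

    /* Exchange one ghost row with the north and south neighbors in a
     * 1D decomposition over the rows of a 2D grid. `local` holds `rows`
     * interior rows of `width` doubles, plus one ghost row above
     * (row 0) and one below (row rows + 1). Boundary processes pass
     * MPI_PROC_NULL as the missing neighbor. */
    void exchange_ghost_rows(double *local, int rows, int width,
                             int north, int south, MPI_Comm comm) {
        /* send first interior row north, receive into south ghost row */
        MPI_Sendrecv(&local[1 * width], width, MPI_DOUBLE, north, 0,
                     &local[(rows + 1) * width], width, MPI_DOUBLE, south, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send last interior row south, receive into north ghost row */
        MPI_Sendrecv(&local[rows * width], width, MPI_DOUBLE, south, 1,
                     &local[0 * width], width, MPI_DOUBLE, north, 1,
                     comm, MPI_STATUS_IGNORE);
    }

Radius-based interactions generalize this to one or more such exchanges (enough to cover r_c), while explicit-list interactions can induce messages between arbitrary pairs of subdomains.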

11 Work Redistribution for Heterogeneity
Main idea: shrinking/expanding a subdomain changes a processor's workload.
t_i: unit-processing time of processor i
    t_i = T_i / n_i
where T_i is the total time spent on compute-functions and n_i is the number of local points in the assigned subdomain.
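A sketch of how t_i could be measured in practice; MPI_Wtime is a real MPI call, but where DISC actually places its timers around compute-function invocations is an assumption here:

    #include <mpi.h>

    /* Measure the unit-processing time t_i = T_i / n_i for this process:
     * T_i accumulates the wall-clock time spent in compute-function calls
     * over an iteration; n_local is the number of local points. */
    double unit_processing_time(void (*compute)(int), int n_local) {
        double start = MPI_Wtime();
        for (int p = 0; p < n_local; p++)
            compute(p);                   /* compute-function per point */
        double T_i = MPI_Wtime() - start; /* total compute time T_i */
        return T_i / n_local;             /* t_i */
    }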

12 Work Redistribution for Heterogeneity
1D case:
- the size of each subdomain is inversely proportional to its unit-processing time
2D/3D case: expressed as a non-linear optimization problem

    min T_max
    s.t. x_r1 * y_r1 * t_1 <= T_max
         x_r2 * y_r1 * t_2 <= T_max
         ...
         x_r1 + x_r2 + x_r3 = x_r
         y_r1 + y_r2 = y_r
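For the 1D case, the new subdomain sizes follow directly from making each size inversely proportional to t_i. A sketch, assuming the total point count N is split by the weights (1/t_i) / sum_j (1/t_j):

    /* 1D repartitioning sketch: assign each of the `p` processors a
     * number of points inversely proportional to its unit-processing
     * time, so that n_i * t_i is (approximately) equal across processors. */
    void repartition_1d(const double *t, int p, int N, int *n_new) {
        double inv_sum = 0.0;
        for (int i = 0; i < p; i++)
            inv_sum += 1.0 / t[i];
        int assigned = 0;
        for (int i = 0; i < p; i++) {
            n_new[i] = (int)(N * (1.0 / t[i]) / inv_sum);
            assigned += n_new[i];
        }
        n_new[p - 1] += N - assigned;  /* give rounding leftovers to last */
    }

For example, with two processors, t = {1.0, 2.0} time units per point, and N = 300 points, the faster processor receives 200 points and the slower one 100, so both finish in about 200 time units.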

13 Example Scenario
(figures: subdomain layout before and after repartitioning)

14 Implementation: Putting it All Together
(figure: how the @application and @runtime components fit together)

15 Other Benefits of DISC
Can support restart with a different number of nodes, i.e., partition to a different number of processes.
Why?
- failure with no replacement node
- performance within a power budget
- exploiting cloud elasticity
- more flexible scheduling on HPC platforms
- switching off nodes/cores for power/thermal reasons

16 DISC and Automated Resilient Execution
Supports automated application-level checkpointing:
- uses the notion of domains and computation spaces
Can also help with soft errors:
- separates data and control, so communication and synchronization can be protected
- exposes the iterative structure; the applicable technique can depend on the nature of the interactions
Ongoing work.

17 Experiments
Implemented in C on MPICH2
- each node has two quad-core 2.53 GHz Intel Xeon processors and 12 GB RAM
- up to 128 nodes (using a single core on each node)
Applications:
- stencil (Jacobi, Sobel)
- unstructured grid (Euler)
- molecular dynamics (MiniMD)

18 Homogeneous Configurations
Comparison against MPI implementations.
Average overheads: 2.7% (MiniMD), < 1% (Euler)
(figures: MiniMD and Euler)

19 Homogeneous Configurations
Average overheads: 0.5% (Jacobi), 3.8% (Sobel)
(figures: Jacobi and Sobel)

20 Heterogeneous Configurations (varying number of cores slowed by 40%)
Slowdown reduction: 54% → 10-15% (MiniMD), 67-73% → 41-47% (Euler)
(figures: MiniMD and Euler)

21 Heterogeneous Configurations (varying number of cores slowed by 40%)
Slowdown reduction: 47-51% → 8-25% (Jacobi), 56% → 14% (Sobel)
(figures: Jacobi and Sobel)

22 Heterogeneous Configurations (64 cores slowed by varying percentages)
disc-perfect baseline: T_disc x (P_homogeneous / P_heterogeneous)
Slowdown reduction:
- at 25% slowdown: 25% → 9% and 36% → 25%
- at 25-50% slowdown: 83% → 18% and 111% → 55%
(figures: MiniMD and Euler)

23 Charm++ Comparison
Euler (4 nodes slowed down out of 16)
Different load-balancing strategies for Charm++ (RefineLB); load-balance once at the beginning.
(a) Homogeneous: Charm++ is 17.8% slower than DISC
(c) Heterogeneous with LB: Charm++ at 64 chares (best case) is 14.5% slower than DISC

24 Decomposition across CPU and Accelerator
Process I (CPU), Process II (GPU)
(*, *, * mark DISC's decision in the figure)

25 Conclusion
- A parallel programming model for scientific applications
- Automatic work partitioning and communication
- Automatic repartitioning for heterogeneity support

26 Thank you. Questions?

