Published by Cynthia McCormick. Modified over 9 years ago.
1
Multilevel Parallelism using Processor Groups
Bruce Palmer, Jarek Nieplocha, Manoj Kumar Krishnan, Vinod Tipparaju
Pacific Northwest National Laboratory
2
Processor Groups
Many parallel applications require the execution of a large number of independent tasks. Examples include:
- Numerical evaluation of gradients
- Monte Carlo sampling over initial conditions or uncertain parameter sets
- Free energy perturbation calculations (chemistry)
- Nudged elastic band calculations (chemistry and materials science)
- Sparse matrix-vector operations (NAS CG benchmark)
3
Processor Groups
If the individual calculations are small enough, each processor can execute one of the tasks (embarrassingly parallel algorithms).
If the individual tasks are so large that each must be distributed among several processors, the only option (usually) is to run the tasks one after another, each across multiple processors.
This limits the total number of processors that can be applied to the problem, since parallel efficiency degrades as the number of processors increases.
[Figure: speedup versus number of processors]
4
Processor Groups
Alternatively, the collection of processors can be decomposed into processor groups. These processor groups can be used to execute parallel algorithms independently of one another. This requires:
- global operations that are restricted in scope to a particular group instead of spanning the entire domain of processors (the world group)
- distributed data structures that are restricted to a particular group
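
The two requirements above can be illustrated with a minimal sketch in plain Python. It only simulates ranks with a dictionary and is not the Global Arrays or MPI API; the helper names `make_groups` and `group_sum` are hypothetical. In a real code the split would typically be done by the communication library (for example `MPI_Comm_split` in MPI, or the processor-group calls in Global Arrays).

```python
def make_groups(world_size, group_size):
    """Split ranks 0..world_size-1 into contiguous groups (hypothetical helper)."""
    if world_size % group_size != 0:
        raise ValueError("world_size must be a multiple of group_size")
    return [list(range(start, start + group_size))
            for start in range(0, world_size, group_size)]


def group_sum(values_by_rank, group):
    """A 'global' sum whose scope is one group rather than the world group."""
    return sum(values_by_rank[rank] for rank in group)


world_size = 8
groups = make_groups(world_size, group_size=4)

# Assumption for illustration: rank r holds the value r + 1.
values = {rank: rank + 1 for rank in range(world_size)}

group_sums = [group_sum(values, g) for g in groups]  # one result per group
world_sum = sum(values.values())                     # reduction over the world group

print(groups)
print(group_sums, world_sum)
```

Each group obtains its own reduction result without involving ranks outside the group, which is what lets the groups run independent parallel algorithms side by side.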
5
Processor Groups (Schematic)
[Figure: the world group, with Group A and Group B inside it]
6
Communication between Groups
- Copying to and from data structures in nested groups is well defined. The world group can be used as a mechanism for communicating between groups (hierarchical data model).
- Copying directly between groups is more complicated. What is the programming model?
[Figure: the world group, with Group A and Group B inside it]
7
Sequential Parallel Tasks
[Figure: tasks 1 to 10 passing through the processors to produce results]
8
Embarrassingly Parallel
[Figure: tasks 1 to 10 passing through the processors to produce results]
9
Concurrent Execution on Groups
[Figure: tasks 1 to 10 passing through the processors to produce results]
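
The three execution strategies can be compared with a toy cost model. This is a sketch under loudly stated assumptions, not a measurement from the talk: the time for one task on p processors is assumed to be work / p plus an overhead that grows linearly with p, and the numbers (10 tasks, 16 processors, work = 64, overhead = 1) are chosen only for illustration. A group size equal to the processor count corresponds to sequential parallel tasks, and a group size of 1 to the embarrassingly parallel case.

```python
import math


def task_time(work, p, overhead):
    """Assumed cost model: ideal scaling plus overhead that grows with p."""
    return work / p + overhead * (p - 1)


def total_time(n_tasks, n_procs, group_size, work, overhead):
    """Time to finish all tasks when processors are split into equal groups."""
    n_groups = n_procs // group_size
    rounds = math.ceil(n_tasks / n_groups)  # each group handles one task per round
    return rounds * task_time(work, group_size, overhead)


n_tasks, n_procs, work, overhead = 10, 16, 64.0, 1.0  # illustrative values only

times = {g: total_time(n_tasks, n_procs, g, work, overhead)
         for g in (1, 2, 4, 8, 16)}
best_group_size = min(times, key=times.get)

for g, t in times.items():
    print(f"group size {g:2d}: total time {t}")
print("best group size:", best_group_size)
```

Under these assumed parameters an intermediate group size gives the shortest total time; where the optimum falls in practice depends on how the real task scales.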
10
MD Example
[Figure: simulation domain divided among processors P0 to P7]
Spatial Decomposition Algorithm:
- Partition particles among processors
- Update coordinates at every step
- Update partitioning after a fixed number of steps
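
The three steps of the spatial decomposition algorithm can be sketched as follows. This is a simplified illustration, not the MD code used in the talk: it assumes a one-dimensional periodic box cut into equal slabs, constant velocities in place of a force calculation, and hypothetical function names.

```python
def owner(x, box_length, n_procs):
    """Rank that owns position x when the box is cut into equal slabs."""
    slab = box_length / n_procs
    return min(int(x // slab), n_procs - 1)


def partition(positions, box_length, n_procs):
    """Return, for each rank, the indices of the particles it owns."""
    owned = [[] for _ in range(n_procs)]
    for i, x in enumerate(positions):
        owned[owner(x, box_length, n_procs)].append(i)
    return owned


def run(positions, velocities, box_length, n_procs, n_steps, dt, repartition_every):
    positions = list(positions)
    owned = partition(positions, box_length, n_procs)       # initial partition
    for step in range(1, n_steps + 1):
        for i in range(len(positions)):                     # update every step
            positions[i] = (positions[i] + velocities[i] * dt) % box_length
        if step % repartition_every == 0:                   # update partitioning
            owned = partition(positions, box_length, n_procs)
    return positions, owned


final_positions, final_owned = run(
    positions=[0.5, 1.5, 2.5, 3.5], velocities=[1.0, 0.0, 0.0, 0.0],
    box_length=4.0, n_procs=4, n_steps=2, dt=0.5, repartition_every=2)
print(final_positions, final_owned)
```

Repartitioning only every few steps trades some staleness in the ownership lists for less redistribution work.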
11
MD Parallel Scaling
12
MD Performance on Groups
13
Task Completion Times
14
Multilevel Parallelism Using CCA and Global Arrays (GA)
- Combining SPMD and MPMD paradigms
- MCMD: Multi Component Multiple Data
  - MPMD + Component
- The MCMD Driver launches multiple instances of NWChem components on subsets of processors (CCA)
- Each NWChem QM component does multiple energy computations on subgroups (GA)
[Figure: component diagram connecting the MCMD Hessian Driver to NWChem_QM_0, NWChem_QM_1, NWChem_QM_2, ..., NWChem_QM_n; labels include Go, cProps, ModelFactory, Param Port, Energy]
15
Numerical Hessian Scalability - I
Single-level Parallelism
- Native parallel code
16
Numerical Hessian Scalability - II
Two-level Parallelism
- Native parallel code: energy level
- Group-based single energy calculations: gradient level, using GA groups
[Figure label: Single Energy Calculations]
17
Numerical Hessian Scalability - III
Three-level Parallelism
- Native parallel code: energy level
- Group-based single energy calculations: gradient level, using GA groups
- Task-based gradient calculations: using CCA
Application efficiency improved 10x on 256 CPUs
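
Why a numerical Hessian decomposes into independent tasks can be seen in a small sketch. It is not the NWChem, CCA, or GA implementation: the energy function is a hypothetical two-variable quadratic with an analytic gradient, the Hessian columns come from central differences of displaced gradients, and the groups are simulated by looping over round-robin task lists that real processor groups would work through concurrently.

```python
def gradient(x):
    """Analytic gradient of the toy energy E = x0^2 + 3*x0*x1 + 2*x1^2."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 4.0 * x[1]]


def hessian_tasks(x, h):
    """One independent task per displaced geometry: (column, sign, coordinates)."""
    tasks = []
    for j in range(len(x)):
        for sign in (+1, -1):
            displaced = list(x)
            displaced[j] += sign * h
            tasks.append((j, sign, displaced))
    return tasks


def assign_round_robin(tasks, n_groups):
    """Task k goes to group k % n_groups."""
    return [tasks[g::n_groups] for g in range(n_groups)]


def numerical_hessian(x, h, n_groups):
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for group_tasks in assign_round_robin(hessian_tasks(x, h), n_groups):
        # Simulated here in sequence; each group's list is independent of the others.
        for j, sign, displaced in group_tasks:
            g = gradient(displaced)
            for i in range(n):
                hess[i][j] += sign * g[i] / (2.0 * h)  # central difference
    return hess


hess = numerical_hessian([1.0, -2.0], h=1e-3, n_groups=2)
print(hess)
```

For this toy energy the exact Hessian is [[2, 3], [3, 4]], so the result can be checked directly. In the multilevel scheme described on the slides, each gradient could in turn be built from single energy calculations run on subgroups.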
18
New Opportunities
- Overlapping functionality (IO, visualization)
  - Assign functions such as IO or visualization to separate processor groups so that these tasks do not block the main computation
- Multiphysics simulations
  - Separate physics modules run on independent processor groups and are loosely coupled on the world group
19
Overlapping Functionality
- Minimize effects of latency and blocking
- Achieve optimal bandwidth to other resources (e.g., file system or graphics device) by minimizing competition between processors
- Simplify programming (less need to schedule access to limited resources)
- Potential applications to the Global Cloud Resolving Model SciDAC program and to other applications that create large amounts of data for permanent storage
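
The idea of handing IO to a separate group can be sketched by analogy on a single machine. This is only an analogy: a background thread stands in for the auxiliary processor group, a queue for the hand-off through the world group, and a Python list for the disk. All names are hypothetical.

```python
import queue
import threading


def writer(inbox, sink):
    """Auxiliary worker: drains the queue so the compute loop never waits on IO."""
    while True:
        item = inbox.get()
        if item is None:      # sentinel: computation has finished
            break
        sink.append(item)     # stand-in for writing to disk or a graphics device


def simulate(n_steps):
    inbox = queue.Queue()
    sink = []
    aux = threading.Thread(target=writer, args=(inbox, sink))
    aux.start()

    state = 0
    for step in range(n_steps):
        state += step               # stand-in for the main computation
        inbox.put((step, state))    # hand off the snapshot and keep computing

    inbox.put(None)
    aux.join()
    return sink


snapshots = simulate(4)
print(snapshots)
```

Because only one worker touches the output resource, the compute loop needs no logic for scheduling access to it.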
20
Overlapping Functionality
[Figure: the world group divided into a computation group and an auxiliary group; the schematic also shows disk, graphics device, etc.]
21
Multiphysics Simulations
- Different physical models are run as separate parallel simulations on independent groups
- Loose coupling of models through the world group
- Applications to subsurface modeling through the Hybrid Multiscale Modeling of Subsurface Simulations SciDAC program; also applicable to climate modeling (combined ocean, atmospheric, surface water models, etc.)
22
Multiphysics Simulations
[Figure: the world group with groups labeled Microscopic Model, Pore Scale Model, Darcy Flow Model, and "??"]