Presentation is loading. Please wait.

Presentation is loading. Please wait.

1CPSD NSF/DARPA OPAAL Adaptive Parallelization Strategies using Data-driven Objects Laxmikant Kale First Annual Review 27-28 October 1999, Iowa City.

Similar presentations


Presentation on theme: "1CPSD NSF/DARPA OPAAL Adaptive Parallelization Strategies using Data-driven Objects Laxmikant Kale First Annual Review 27-28 October 1999, Iowa City."— Presentation transcript:

1 1CPSD NSF/DARPA OPAAL Adaptive Parallelization Strategies using Data-driven Objects Laxmikant Kale First Annual Review 27-28 October 1999, Iowa City

2 2CPSD Outline Quench and solidification codes Coarse grain parallelization of the quench code Adaptive parallelization techniques Dynamic variations Adaptive load balancing Finite element framework with adaptivity Preliminary results

3 3CPSD Coarse grain parallelization Structure of current sequential quench code: 2-D array of elements (each independently refined) Within row dependence Independent rows, but… —share global variables Parallelization using Charm++: 3 hours effort (after a false start) about 20 lines of change to F90 code A 100 line Charm++ wrapper Observations: —Global variables that are defined and used within inner loop iterations are easily dealt with in Charm++, in contrast to OpenMP —Dynamic load balancing is possible, but was unnecessary

4 4CPSD Performance results Contributors: Engineering: N. Sobh, R. Haber Computer Science: M. Bhandarkar, R. Liu, L. Kale

5 5CPSD OpenMP experience Work by: J. Hoeflinger, D. Padua, with N. Sobh, R. Haber, J. Dantzig, N. Provatas Solidification code: Parallelized using openMp Relatively straightforward, after a key decision —Parallelize by rows only

6 6CPSD OpenMP experience Quench code on Origin2000 Privatization of variables is needed —as outer loop was parallelized Unexpected initial difficulties with OpenMP —Led initially to large slowdown in parallelized code —Traced to unnecessary locking in MATMUL intrinsic

7 7CPSD Adaptive Strategies Advanced codes model dynamic and irregular behavior Solidification: adaptive grid refinement Quench: —Complex dependencies, —Parallelization within elements To parallelize these effectively, —adaptive runtime strategies are necessary

8 8CPSD Multi-partition decomposition: Idea: decompose the problem into a number of partitions, independent of the number of processors # Partitions > # Processors The system maps partitions to processors The system should be able to map and re-map objects as needed

9 9CPSD Charm++ A parallel C++ library Supports data driven objects —singleton objects, object arrays, groups, Many objects per processor, with method execution scheduled with availability of data System supports automatic instrumentation and object migration Works with other paradigms: MPI, openMP,..

10 10CPSD Data driven execution in Charm++ Scheduler Message Q

11 11CPSD Load Balancing Framework Aimed at handling... Continuous (slow) load variation Abrupt load variation (refinement) Workstation clusters in multi-user mode Measurement based Exploits temporal persistence of computation and communication structures Very accurate (compared with estimation) instrumentation possible via Charm++/Converse

12 12CPSD Object balancing framework

13 13CPSD Utility of the framework: workstation clusters Cluster of 8 machines, One machine gets another job Parallel job slows down on all machines Using the framework: Detection mechanism Migrate objects away from overloaded processor Restored almost original throughput!

14 14CPSD Performance on timeshared clusters Another user logged on at about 28 seconds into a parallel run on 8 workstations. Throughput dipped from 10 steps per second to 7. The load balancer intervened at 35 seconds,and restored throughput to almost its initial value.

15 15CPSD Utility of the framework: Intrinsic load imbalance To test the abilities of the framework A simple problem: Gauss-Jacobi iterations Refine selected sub-domains ConSpector: web based tool Submit parallel jobs Monitor performance and application behavior Interact with running jobs via GUI interfaces

16 16CPSD AppSpector view of Load balancer on the synthetic Jacobi relaxation benchmark. Imbalance is introduced by interactively refining a subset of cells around 9 seconds.. The resultant load imbalance brings the utilization down to 80% from the peak of 96%. The load balancer kicks in around t = 16, and restores utilization to around 94%.

17 17CPSD Charm++ Converse Load database + balancer MPI-on-CharmIrecv+ Automatic Conversion from MPI FEM Structured Cross module interpolation Migration path Framework path Using the Load Balancing Framework

18 18CPSD Example application: Crack propagation (P. Geubelle et al) Similar in structure to Quench components 1900 lines of F90 Rewritten using FEM framework in C++ 1200 lines of C++ code Framework: 500 lines of code, —reused by all applications Parallelization completely by the framework

19 19CPSD Crack Propagation Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld, and P. Geubelle

20 20CPSD “Overhead” of multi-partition method

21 21CPSD Overhead study on 8 processors When running on 8 processors, the effect of using multiple partitions per processor is also beneficial, due to cache behavior.

22 22CPSD Cross-approach comparison MPI-F90 original Charm++ framework(all C++) F90 + charm++ library

23 23CPSD Load balancer in action

24 24CPSD Summary and Planned Research Use the adaptive FEM framework To parallelize Quench code further Quad tree based solidification code: —First phase: parallelize each phase separately —Parallelize across refinement phases Refine the FEM framework Use feedback from applications Support for implicit solvers and multigrid


Download ppt "1CPSD NSF/DARPA OPAAL Adaptive Parallelization Strategies using Data-driven Objects Laxmikant Kale First Annual Review 27-28 October 1999, Iowa City."

Similar presentations


Ads by Google