Slide 1: Optimizing Quantum Chemistry using Charm++
Eric Bohm, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu
Slide 2: Overview
- CPMD: 9 phases
- Charm++ applicability: overlap, decomposition, portability, communication optimization
- Decomposition: state planes, 3-D FFT, 3-D matrix multiply
- Utilizing Charm++: prioritized non-local computation, Commlib, Projections
Slide 3: Quantum Chemistry: LeanCP
- Collaboration: Glenn Martyna (IBM TJ Watson), Mark Tuckerman (NYU), Nick Nystrom (PSU); PPL: Kale, Shi, Bohm, Pauli, Kumar (now at IBM), Vadali
- CPMD method: plane-wave QM for 100s of atoms
- Charm++ parallelization
- PINY MD physics engine
Slide 4: CPMD on Charm++
- 11 Charm++ arrays, 4 Charm++ modules, 13 Charm++ groups, 3 Commlib strategies
- BLAS, FFTW, PINY MD
- Adaptive overlap
- Prioritized computation for the phased application
- Communication optimization
- Load balancing
- Group caches
- Rth threads
Slide 5: Practical Scaling
- Single-wall carbon nanotube field-effect transistor
- BG/L performance
Slide 6: Computation Flow
Slide 7: Charm++
- Uses the approach of virtualization: divide the work into virtual processors (VPs), typically many more than the number of physical processors, and schedule each VP for execution (a minimal sketch follows below)
- Advantages: computation and communication can be overlapped (between VPs), and the number of VPs can be independent of the number of processors
- Other benefits: load balancing, checkpointing, etc.
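To make the virtualization idea concrete, here is a minimal, hedged sketch of a Charm++ 1-D chare array with many more elements than processors. The module, class, and method names (slab, Slab, doWork) are illustrative, not the actual LeanCP code.

```cpp
// Minimal Charm++ chare-array sketch (illustrative names, not LeanCP's).
//
// slab.ci (interface file):
//   module slab {
//     array [1D] Slab {
//       entry Slab();
//       entry void doWork();
//     };
//   };

#include "slab.decl.h"

class Slab : public CBase_Slab {
 public:
  Slab() {}
  Slab(CkMigrateMessage *m) {}   // required for migration / load balancing
  void doWork() {
    // Each virtual processor (VP) handles one slice of the work; the runtime
    // schedules VPs so one VP's communication overlaps another's computation.
    CkPrintf("Slab %d running on PE %d of %d\n",
             thisIndex, CkMyPe(), CkNumPes());
  }
};

#include "slab.def.h"

// Creation (e.g. from a main chare): far more elements than PEs.
//   CProxy_Slab slabs = CProxy_Slab::ckNew(16 * CkNumPes());
//   slabs.doWork();   // broadcast to every element
```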
Slide 8: Decomposition
- A higher degree of virtualization is better for Charm++
- Real-space state planes, G-space state planes, Rho real-space and Rho G-space, plus S-calculators for each G-space state plane
- Tens of thousands of chares for a 32-molecule problem
- Careful scheduling to maximize efficiency
- Most of the computation is in FFTs and matrix multiplies
Slide 9: 3-D FFT Implementation
- "Sparse" 3-D FFT
- "Dense" 3-D FFT
Slide 10: Parallel FFT Library
- Slab-based parallelization
- We do not re-implement the sequential routine; we use the 1-D and 2-D FFT routines provided by FFTW (see the sketch below)
- Allows multiple 3-D FFTs simultaneously: multiple data sets within the same set of slab objects
- Useful because 3-D FFTs are used frequently in CP computations
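As a hedged sketch of the slab scheme, assuming the FFTW3 interface (the deck does not say which FFTW version was used) and made-up sizes and function names: each slab performs a sequential 2-D FFT of its plane, and after the transpose a batched set of 1-D FFTs completes the transform.

```cpp
#include <fftw3.h>

// Phase 1: each slab chare transforms its own NX x NY plane in place.
void slabFFT2D(fftw_complex *plane, int NX, int NY) {
  // In a real code the plan would be created once and reused.
  fftw_plan p = fftw_plan_dft_2d(NX, NY, plane, plane,
                                 FFTW_FORWARD, FFTW_ESTIMATE);
  fftw_execute(p);
  fftw_destroy_plan(p);
}

// Phase 2: after the all-to-all transpose, transform nLines contiguous
// lines of length NZ along the remaining axis in one batched call.
void pencilFFT1D(fftw_complex *lines, int NZ, int nLines) {
  fftw_plan p = fftw_plan_many_dft(1, &NZ, nLines,
                                   lines, NULL, 1, NZ,
                                   lines, NULL, 1, NZ,
                                   FFTW_FORWARD, FFTW_ESTIMATE);
  fftw_execute(p);
  fftw_destroy_plan(p);
}
```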
Slide 11: Multiple Parallel 3-D FFTs
Slide 12: Matrix Multiply
- Also known as the S-calculator or pair calculator
- Decompose state-plane values into smaller objects
- Use DGEMM on the smaller sub-matrices (see the sketch below)
- Sum the partial products via a reduction back to G-space
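A hedged sketch of one pair-calculator step, assuming CBLAS for DGEMM and a Charm++ sum reduction; the member names (m, n, k, C, cb) and the method name are illustrative, not the application's actual interface.

```cpp
#include <cblas.h>

// Fragment from a Charm++ array element (class PairCalc : public CBase_PairCalc).
// Assumed members: int m, n, k (sub-block dimensions); double *C (m*n result
// buffer); CkCallback cb (delivers the summed result back to G-space).
void PairCalc::multiplyAndReduce(const double *left, const double *right) {
  // C (m x n) = left (m x k) * right (k x n), row-major sub-blocks
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, n, k,
              1.0, left, k,
                   right, n,
              0.0, C, n);
  // Partial products from all pair calculators are summed element-wise by
  // the Charm++ reduction system and delivered to the G-space plane.
  contribute(m * n * sizeof(double), C, CkReduction::sum_double, cb);
}
```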
Slide 13: Matrix Multiply: VP-Based Approach
Slide 14: Charm++ Tricks and Tips
- Message-driven execution and a high degree of virtualization present tuning challenges
- Flow of control using Rth threads
- Prioritized messages
- The Commlib framework
- Charm++ arrays vs. groups
- Problem identification with Projections
- Problem isolation techniques
Slide 15: Flow Control in Parallel
- Rth threads are user-level threads based on Duff's device, with negligible overhead
- Essentially goto and return, without losing readability
- Allow an event-loop style of programming
- Make the flow of control explicit
- Use a familiar threading semantic (see the illustrative sketch below)
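Rth itself is not shown in the deck's text, but the Duff's-device trick it builds on fits in a few lines: a switch on a saved line number lets a message-driven method return to the scheduler and later resume exactly where it left off. The macros and class below are purely illustrative (compare protothreads), not the Rth API.

```cpp
// Illustrative Duff's-device "thread": resume a function at the statement
// after its last suspend, with only an int of state and no stack switch.
#define TH_BEGIN(th)   switch ((th)->line) { case 0:
#define TH_SUSPEND(th) do { (th)->line = __LINE__; return; case __LINE__:; } while (0)
#define TH_END(th)     } (th)->line = 0

struct Thread { int line = 0; };

struct GSpacePlane {
  Thread th;
  void driver() {                 // re-entered once per incoming message
    TH_BEGIN(&th);
    startFFT();                   // phase 1
    TH_SUSPEND(&th);              // wait for FFT data to arrive
    startNonLocal();              // phase 2
    TH_SUSPEND(&th);              // wait for the non-local reduction
    finishIteration();            // phase 3
    TH_END(&th);                  // reset for the next iteration
  }
  void startFFT() {}
  void startNonLocal() {}
  void finishIteration() {}
};
```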
Slide 16: Rth Threads for Flow Control
Slide 17: Prioritized Messages for Overlap
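This slide is a figure in the original deck. As a hedged illustration of the Charm++ mechanism it refers to, a marshalled entry-method call can carry an integer priority through CkEntryOptions so that, for example, non-local work is delivered ahead of default-priority bulk messages; the proxy and entry names here are assumptions.

```cpp
// Hedged sketch: give the non-local computation's messages higher priority
// than bulk data, so they are scheduled first whenever both are available.
// (In Charm++, a smaller integer priority value means higher priority.)
void requestNonLocal(CProxy_NonLocal &nonlocalProxy, int plane) {
  CkEntryOptions opts;
  opts.setPriority(-100);                       // ahead of default (0) priority
  nonlocalProxy[plane].computeNonLocal(&opts);  // marshalled entry; options passed last
}
```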
Slide 18: Communication Library
- Fine-grained decomposition can result in many small messages
- Message combining via the Commlib framework in Charm++ addresses this problem
- The streaming protocol optimizes many-to-many personalized communication
- Forwarding protocols such as Ring or Multiring can be beneficial, but not on BG/L
Slide 19: Commlib Strategy Selection
Slide 20: Streaming Commlib Saves Time (610 ms vs. 480 ms)
Slide 21: Bound Arrays
- Why? Efficiency and clarity of expression
- Two arrays of the same dimensionality whose like indices are co-placed
- G-space and the non-local computation both perform plane-based computations and share many data elements
- Use ckLocal() to access elements, enabling local functions and ordinary local function calls (see the sketch below)
- The two arrays remain distinct parallel objects
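A hedged sketch of the bound-array pattern, with placeholder class names standing in for the G-space and non-local (particle-plane) arrays; ckLocal() returns a plain pointer because the bound partner is guaranteed to live on the same processor.

```cpp
// Creation: bind the second array to the first so that like-indexed
// elements are always co-placed (and migrate together).
void buildBoundArrays(int numPlanes) {
  CkArrayOptions opts(numPlanes);
  CProxy_GSpacePlane gProxy = CProxy_GSpacePlane::ckNew(opts);
  opts.bindTo(gProxy);                          // co-place like indices
  CProxy_ParticlePlane pProxy = CProxy_ParticlePlane::ckNew(opts);
  // ... store the proxies somewhere accessible, e.g. readonly globals ...
}

// Inside GSpacePlane element i, the bound partner is guaranteed local, so
// ckLocal() returns a usable pointer and we can call it like any local object.
// (particleProxy is assumed to be a member or readonly proxy.)
void GSpacePlane::useParticlePlane() {
  ParticlePlane *partner = particleProxy[thisIndex].ckLocal();
  if (partner != NULL)
    partner->computeSharedData();               // ordinary local function call
}
```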
Slide 22: Group Caching Techniques
- Group objects have one element per processor, making them excellent cache points for arrays that may have many chares per processor
- Place low-volatility data in the group; array elements use ckLocalBranch() to access it (see the sketch below)
- In CPMD: all chares that hold plane P use the same in-memory copy of the structure factor
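A hedged sketch of the group-cache idiom with placeholder names: exactly one branch of the group lives on each PE, it holds the low-volatility structure-factor data, and every chare on that PE reaches it through ckLocalBranch().

```cpp
#include <map>
#include <vector>

// .ci fragment (illustrative):  group SFCache { entry SFCache(); };
class SFCache : public CBase_SFCache {
 public:
  // plane index -> structure factor for that plane, shared by all chares
  // on this processor (exactly one SFCache branch exists per PE).
  std::map<int, std::vector<double> > sfByPlane;
  SFCache() {}
};

// From any array element on the same PE (sfCacheProxy and planeIndex are
// assumed members of the calling chare):
void RealSpacePlane::applyStructureFactor() {
  SFCache *cache = sfCacheProxy.ckLocalBranch();   // per-PE branch, plain pointer
  std::vector<double> &sf = cache->sfByPlane[planeIndex];
  // ... every chare holding this plane on this PE reads this single copy ...
  (void)sf;
}
```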
Slide 23: Charm++ Performance Debugging
- Complex parallel applications are hard to debug
- The event-based model with a high degree of virtualization presents new challenges
- Tools: Projections and the Charm++ debugger
- Bottleneck identification using the Projections Usage Profile tool
Slide 24: Old S->T Orthonormalization
Slide 25: After Parallel S->T
Slide 26: Problem Isolation Techniques
- With Rth threads it is easy to isolate phases by adding a barrier: contribute to a reduction and suspend; the reduction's broadcast client resumes
- In the following example, the G-space IFFT is broken into separate computation and communication entry methods, with a barrier inserted between them to highlight a specific performance problem (a sketch of the barrier follows)
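A hedged sketch of that isolation barrier, assuming an empty array-wide reduction whose broadcast client kicks off the next phase; the entry-method names are illustrative, and the real code resumes its Rth driver rather than calling a plain method.

```cpp
// After the IFFT's compute entry method finishes, every G-space element
// contributes to an empty reduction; this acts as a barrier over the array.
void GSpacePlane::doneIFFTCompute() {
  CkCallback cb(CkReductionTarget(GSpacePlane, ifftBarrierDone), thisProxy);
  contribute(cb);            // empty contribution: pure synchronization
  // the element's driver suspends here until ifftBarrierDone() arrives
}

// Broadcast client of the reduction: delivered to every element, resuming
// the communication phase in clean isolation from the computation phase.
// (Declared as: entry [reductiontarget] void ifftBarrierDone(); in the .ci.)
void GSpacePlane::ifftBarrierDone() {
  startIFFTCommunication();
}
```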
Slide 27: Projections Timeline Analysis
Slide 28: Optimizations Motivated by BG/L
- Finer decomposition: the structure factor and non-local computation now operate on groups of atoms within a plane, improving scaling
- Avoid creating network bottlenecks: BG/L's torus network has no DMA or communication offload
- Workarounds for the MPI progress engine: set the eager threshold below 1000, and add network-progress probes inside inner loops
- Shift communication to avoid interference across computation phases
Slide 29: After the Fixes
Slide 30: Future Work
- Scaling to 20k processors on BG/L: density pencil FFTs
- Rho-space real-to-complex doublepack optimization
- New FFT-based algorithm for the structure factor
- More systems
- Topology-aware chare mapping
- HLL orchestration expression
Slide 31: What Time Is It in Scotland?
- There is a 1024-node BG/L in Edinburgh; the time there is 6 hours ahead of Central Time
- During that non-production time we can run on the full rack at night
- Thank you, EPCC!