Slide 1 – Parallelization of CPAIMD (Car-Parrinello ab initio molecular dynamics) using Charm++
Parallel Programming Lab
Slide 2 – CPAIMD
Collaboration with Glenn Martyna and Mark Tuckerman
- MPI code – PINY
  - Scalability problems when #procs >= #orbitals
- Charm++ approach
  - Better scalability using virtualization
  - Further divide orbitals
Slide 3 – The Iteration
Slide 4 – The Iteration (contd.)
- Start with 128 "states"
  - State – spatial representation of an electron
- FFT each of the 128 states, in parallel
  - Planar decomposition => transpose
- Compute densities (DFT)
- Compute energies using the density
- Compute forces and move the electrons
- Orthonormalize the states
- Start over
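A minimal sketch of one such iteration, in NumPy, purely to make the sequence of steps concrete; the grid size, the placeholder energy/force/update steps, and the function name `iterate_once` are assumptions, not the actual PINY/Charm++ implementation.

```python
import numpy as np

n_states, N = 128, 16                      # 128 states on a small N^3 grid (grid size assumed)
rng = np.random.default_rng(0)

# Stand-in reciprocal-space coefficients for the 128 states.
psi_g = rng.standard_normal((n_states, N, N, N)) + 1j * rng.standard_normal((n_states, N, N, N))

def iterate_once(psi_g):
    # FFT each state from reciprocal space to real space (done in parallel in the real
    # code, with a planar decomposition and a transpose between the 1D and 2D phases).
    psi_r = np.fft.ifftn(psi_g, axes=(1, 2, 3))

    # Compute the electron density from the real-space states.
    density = np.sum(np.abs(psi_r) ** 2, axis=0)

    # Energies and forces would come from the density (DFT); placeholders only here.
    energy = density.sum()
    psi_g_new = psi_g - 0.001 * psi_g      # "move the electrons" (placeholder update)

    # Orthonormalization of the states is the all-pairs step discussed on slide 7.
    return psi_g_new, energy

psi_g, energy = iterate_once(psi_g)
print("placeholder energy:", energy)
```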
Slide 5 – Parallel View
Slide 6 – Optimized Parallel 3D FFT
- To perform the 3D FFT: 1D followed by 2D, instead of 2D followed by 1D
  - Less computation
  - Less communication
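The reordering rests on the separability of the 3D transform, which a few lines of NumPy can check; the grid size and the spherical cutoff radius below are assumptions used only to illustrate why transforming the (mostly empty) 1D pencils first saves work.

```python
import numpy as np

N = 32
rng = np.random.default_rng(1)
x = rng.standard_normal((N, N, N)) + 1j * rng.standard_normal((N, N, N))

# A 3D FFT factors into a 1D FFT along one axis followed by a 2D FFT over the other two.
assert np.allclose(np.fft.fft2(np.fft.fft(x, axis=2), axes=(0, 1)), np.fft.fftn(x))

# Why the 1D-first order helps here: in reciprocal space the data lie inside a sphere
# (|g| <= g_cut), so many of the N*N pencils along the 1D axis are entirely zero.  Doing
# the 1D transforms first touches only the non-empty pencils, and only those have to be
# transposed, which reduces both computation and communication before the dense 2D FFTs.
g = np.fft.fftfreq(N) * N
gx, gy, gz = np.meshgrid(g, g, g, indexing="ij")
inside = gx**2 + gy**2 + gz**2 <= (N / 4) ** 2      # assumed cutoff radius N/4
print("non-empty pencils:", np.count_nonzero(inside.any(axis=2)), "of", N * N)
```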
Slide 7 – Orthonormalization
- All-pairs operation: the data of each state has to meet the data of every other state
- Our approach (picture follows):
  - A virtual processor (VP) acts as the meeting point for several pairs of states
  - Create lots of these VPs
  - The number of pairs meeting at a VP: n
    - Communication decreases with n
    - Computation increases with n
    - A balance is required
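A small sketch of the meeting-point idea, computing the overlap matrix tile by tile; the tiling of pairs into square blocks, the array sizes, and the helper `overlap_by_tiles` are illustrative assumptions, not the actual Charm++ pair-calculator code.

```python
import numpy as np

S, M = 128, 512                      # number of states and points per state (sizes assumed)
rng = np.random.default_rng(2)
C = rng.standard_normal((S, M)) + 1j * rng.standard_normal((S, M))

def overlap_by_tiles(C, g):
    """Compute the overlap matrix S_ij = <psi_i | psi_j> with one 'virtual processor'
    per g x g block of state pairs, i.e. one meeting point for n = g*g pairs."""
    S_mat = np.zeros((C.shape[0], C.shape[0]), dtype=complex)
    n_vps = messages = 0
    for i0 in range(0, C.shape[0], g):
        for j0 in range(0, C.shape[0], g):
            rows, cols = C[i0:i0 + g], C[j0:j0 + g]     # each VP receives 2*g states...
            messages += rows.shape[0] + cols.shape[0]   # ...so total communication ~ S^2/g
            S_mat[i0:i0 + g, j0:j0 + g] = rows.conj() @ cols.T  # ...but does g^2 pair products
            n_vps += 1
    return S_mat, n_vps, messages

for g in (8, 16, 32):
    S_mat, n_vps, msgs = overlap_by_tiles(C, g)
    assert np.allclose(S_mat, C.conj() @ C.T)           # sanity check against direct product
    print(f"pairs per VP n={g*g:5d}: VPs={n_vps:4d}, state messages={msgs:6d}, work per VP={g*g}")
```

As n grows, the total number of state messages drops while the work per VP grows, which is the balance the slide refers to.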
Slide 8 – VP-based approach
Slide 9 – Performance
- Existing MPI code (PINY):
  - Does not scale beyond 128 processors
  - Best per-iteration time: 1.7 s
- Our performance:

  Processors   Time per iteration (s)
  128          2.07
  256          1.18
  512          0.65
  1024         0.48
  1536         0.39
Slide 10 – Load balancing
- Load imbalance due to the distribution of data in the orbitals
  - Planes are sections of a sphere, hence the imbalance
  - Computation – some planes have more points
  - Communication – some planes have more data to send
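Counting reciprocal-space points per plane under a spherical cutoff shows the imbalance directly; the grid size and cutoff radius below are assumed for illustration only.

```python
import numpy as np

N = 32
g = np.fft.fftfreq(N) * N                           # integer g components
gx, gy, gz = np.meshgrid(g, g, g, indexing="ij")
inside = gx**2 + gy**2 + gz**2 <= (N / 4) ** 2      # assumed spherical cutoff |g| <= N/4

# Each g_z plane is a circular section of the sphere: planes near g_z = 0 hold many points
# (more computation, more data to send in the transpose), planes near the cutoff hold few.
points_per_plane = inside.sum(axis=(0, 1))
for gz_val, count in sorted(zip(g.astype(int), points_per_plane)):
    if count:
        print(f"g_z = {gz_val:+3d}: {count:4d} points")
```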
Slide 11 – Load Imbalance
- Iteration time: 900 ms on 1024 processors
Slide 12 – Improvement I
- Pair heavily loaded planes with lightly loaded planes (see the sketch below)
- Iteration time: 590 ms
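A sketch of that pairing rule, heaviest remaining plane with lightest remaining plane; the per-plane loads are made-up example numbers.

```python
def pair_heavy_with_light(loads):
    """Pair the heaviest remaining plane with the lightest remaining one, so that the
    combined load of each pair (mapped together) evens out across processors."""
    order = sorted(range(len(loads)), key=lambda p: loads[p])
    return [(order[i], order[len(order) - 1 - i]) for i in range(len(order) // 2)]

# Example per-plane loads (made up, shaped like the circular sections from slide 10).
loads = [5, 19, 48, 110, 180, 240, 278, 298, 298, 278, 240, 180, 110, 48, 19, 5]
for light, heavy in pair_heavy_with_light(loads):
    print(f"plane {heavy:2d} (load {loads[heavy]:3d}) + plane {light:2d} (load {loads[light]:3d})"
          f" = {loads[heavy] + loads[light]}")
```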
Slide 13 – Charm++ Load Balancing
- Load balancing provided by the runtime system
- Iteration time: 600 ms
Slide 14 – Improvement II
- Use a load-vector-based scheme to map planes to processors (see the sketch below)
- A processor holding "heavy" planes is assigned correspondingly fewer planes than one holding "light" planes
- Iteration time: 480 ms
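A minimal sketch of a load-vector style mapping, here a greedy heaviest-first placement onto the least-loaded processor; the greedy rule and the example loads are assumptions, not necessarily the exact scheme used.

```python
import heapq

def map_planes_to_procs(loads, n_procs):
    """Greedy load-vector mapping: take the planes heaviest-first and place each one on
    the processor whose accumulated load is currently smallest, so a processor ends up
    holding fewer planes when the planes it holds are heavy."""
    heap = [(0, p) for p in range(n_procs)]           # (accumulated load, processor id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_procs)}
    for plane in sorted(range(len(loads)), key=lambda i: -loads[i]):
        load, proc = heapq.heappop(heap)
        assignment[proc].append(plane)
        heapq.heappush(heap, (load + loads[plane], proc))
    return assignment

# Example per-plane loads (made up); map 16 planes onto 4 processors.
loads = [5, 19, 48, 110, 180, 240, 278, 298, 298, 278, 240, 180, 110, 48, 19, 5]
for proc, planes in map_planes_to_procs(loads, 4).items():
    print(f"proc {proc}: planes {planes}, total load {sum(loads[p] for p in planes)}")
```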
Slide 15 – Scope for Improvement
- Load balancing
  - The Charm++ load balancer shows encouraging results on 512 PEs
  - Combine automated and manual load balancing
- Avoid copying when sending messages
  - In the FFTs
  - When sending large read-only messages
- The FFTs can be made more efficient
  - Use double packing (see the sketch below)
  - Make assumptions about the data distribution when performing the FFTs
- Alternative implementation of orthonormalization
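For the double-packing item above, the standard trick packs two real sequences into one complex FFT and separates the results by conjugate symmetry; a quick NumPy check (everything beyond the trick itself is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 64
x, y = rng.standard_normal(N), rng.standard_normal(N)

# Double packing: one complex FFT of z = x + i*y yields both real-input transforms,
# recovered via the conjugate symmetry between Z[k] and Z[(N - k) % N].
Z = np.fft.fft(x + 1j * y)
Zr = np.conj(np.roll(Z[::-1], 1))       # Zr[k] == conj(Z[(N - k) % N])
X = (Z + Zr) / 2
Y = (Z - Zr) / 2j
assert np.allclose(X, np.fft.fft(x)) and np.allclose(Y, np.fft.fft(y))
print("two real FFTs recovered from a single complex FFT")
```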