Published by Harvey Griffith. Modified over 9 years ago.
1
Performance Evaluation of OhHelp'ed PIC Simulation
Hiroshi Nakashima (ACCMS, Kyoto U.)
in cooperation with Yohei Miyake (ACCMS, Kyoto U.), Hideyuki Usui (Kobe U.), Yoshiharu Omura (RISH, Kyoto U.)
2
Contents
Introduction: what I said last June
PIC simulation: overview & problems
OhHelp: load balancer for PIC simulation
  algorithm overview
  some details of load balancing
  total view of OhHelp'ed PIC simulation
  performance evaluation
Conclusion
3
Introduction: I Said Last June...
Why Plasma Simulation?
power/money-hungry, large-scale (128 cores, 1 TB, 1.28 TFlops) shared-memory nodes
A big user group of plasma simulation insisted that our new system should include this power/money-hungry subsystem for their memory-hungry SM-parallel application.
I failed to persuade them to build an Open-Supercomputer-only system.
So I swore revenge on them by coding a much more efficient DM-parallel program to run on the Open Supercomputer.
Now we are on very friendly terms with each other and are pursuing closely collaborative research.
4
Introduction: Also I Showed Last June...
How Efficient on SMP & Small T2K
[Figure: performance @ 16-128 proc on HPC2500 and on T2K Open Supercomputer, 4 nodes (64 cores); series: balanced, unbalanced, original; speedup labels: x3.20, x11.71, x8.76, x10.7, x1.66, x4.02]
was: 2D benchmark, up to 128 proc, strong scaling, mild imbalance
is: 3D production, up to 1024 proc, + weak scaling, extreme imbalance
5
PIC Simulation: What to Do?
a large number of (e.g. > 1 trillion) charged particles
a large-scale (e.g. 1000x1000x1000 grid) electromagnetic field (e.g. magnetosphere)
simulate particle movement by [equations on slide]
6
PIC Simulation: What Is the Problem?
Inherently has much parallelism, but is believed to be hardly scalable because...
Particle decomposition, which copies the fields, cannot sustain a large space domain and the global operations on it.
Domain decomposition cannot work when particles are distributed non-uniformly.
Dynamic domain decomposition also fails when particles populate a small subdomain too densely.
Need a new idea!!
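The failure of static domain decomposition can be illustrated with a small sketch. It is not taken from the slides: the 2D domain of 64x64, the 4x4 block layout, the particle count and the helper names `count_per_block` and `imbalance` are all assumptions chosen for illustration.

```python
import random

def count_per_block(positions, domain, blocks):
    """Count particles per block of a uniform blocks x blocks decomposition."""
    size = domain / blocks
    counts = [0] * (blocks * blocks)
    for x, y in positions:
        bx = min(int(x / size), blocks - 1)
        by = min(int(y / size), blocks - 1)
        counts[bx * blocks + by] += 1
    return counts

def imbalance(counts):
    """Heaviest block's load divided by the average load (1.0 is perfect)."""
    return max(counts) / (sum(counts) / len(counts))

rng = random.Random(0)  # fixed seed, illustrative data only
n = 16000
# Particles spread over the whole 64x64 domain.
uniform = [(rng.uniform(0, 64), rng.uniform(0, 64)) for _ in range(n)]
# Particles packed into a single 16x16 corner, i.e. one block out of 16.
dense = [(rng.uniform(0, 16), rng.uniform(0, 16)) for _ in range(n)]

uniform_ratio = imbalance(count_per_block(uniform, 64, 4))
dense_ratio = imbalance(count_per_block(dense, 64, 4))
print(uniform_ratio, dense_ratio)
```

In the dense case every particle lands in one of the 16 blocks, so the heaviest node carries 16 times the average load while the other 15 nodes sit idle.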
7
OhHelp: Overview (OhHelp: One-handed Help)
primary subdomain: uniform block decomposition
  well-balanced: #particles-in-subdomain ≤ (#p / #nodes)(1 + α)
  simulate primary particles, neighboring comm. only
secondary subdomain: each node helps another node having a dense subdomain
  balanced #particles, balanced subdomain size
  well-balanced, stable subdomain assignment
[Figure: subdomains labeled 00 to 33 with their primary and secondary subdomain assignments]
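The well-balanced condition can be written as a one-line check. This is a minimal sketch, assuming that the factor in (1 + ...) is a tolerance parameter (called `alpha` here) and that the comparison is "at most"; the function name `is_well_balanced` is hypothetical.

```python
def is_well_balanced(loads, alpha):
    """True if no node holds more than (1 + alpha) times the average #particles.

    loads: number of particles handled by each node.
    alpha: tolerance on top of the average (assumed meaning of the slide's factor).
    """
    average = sum(loads) / len(loads)
    return max(loads) <= average * (1 + alpha)

# Mild imbalance: average 100, heaviest 110, limit 120 with alpha = 0.2.
mild = is_well_balanced([100, 110, 90, 100], alpha=0.2)
# Extreme imbalance: one node holds everything.
extreme = is_well_balanced([400, 0, 0, 0], alpha=0.2)
print(mild, extreme)
```

While the check holds, each node can keep simulating only its primary subdomain; once it fails, helpers take over part of the dense subdomains.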
8
OhHelp: Load Balancing
Secondary Subdomain Assignment
move p from the heaviest node to the lightest so that the lightest has av. #p
the heaviest gives p even if it becomes less than average; it gets p from somebody afterward
[Figure: nodes 33, 00, 32, 01, 30, 10, 13, 03, 23, 20, 31, 02, 11, 21, 12, 22 with the av. #p level marked]
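The heaviest-to-lightest rule can be sketched as a greedy loop. This is one simplified reading of the slide, not the actual OhHelp code: the function `assign_secondary`, its return values and the tie-breaking by node index are assumptions, and the total is assumed to divide evenly among the nodes.

```python
def assign_secondary(primary):
    """Greedy sketch: the lightest unassigned node helps the heaviest one.

    primary: #particles in each node's primary subdomain.
    Returns (helps, loads): helps[h] is the node that h helps, and
    loads[i] is node i's final count (kept primary + received secondary).
    """
    n = len(primary)
    average = sum(primary) // n
    kept = list(primary)   # primary particles each node still holds
    received = [0] * n     # secondary particles taken over from the helped node
    helps = {}
    pool = set(range(n))   # nodes that have no secondary subdomain yet
    while len(pool) > 1:
        lightest = min(pool, key=lambda i: (kept[i], i))
        heaviest = max(pool, key=lambda i: (kept[i], i))
        moved = max(0, average - kept[lightest])
        kept[heaviest] -= moved      # may drop below average; refilled later
        received[lightest] = moved
        helps[lightest] = heaviest
        pool.remove(lightest)        # lightest now sits exactly at the average
    loads = [kept[i] + received[i] for i in range(n)]
    return helps, loads

helps, loads = assign_secondary([7, 6, 0, 3])
print(helps, loads)
```

In this run node 0 first gives 4 particles to node 2 and drops to 3, below the average of 4, then receives 1 particle from node 1 afterward. The last node left in the pool takes no secondary subdomain.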
9
OhHelp: Helper-Tree
The helper-tree is traversed at each time step (i.e. at each particle movement).
bottom-up: does the helper assignment sustain the load variance?
top-down: how to (re)distribute particles among family members?
[Figure: helper tree of nodes 00 to 33]
10
OhHelp: Balancing Check (2/2)
[Figure: helper tree of nodes 00 to 33]
11
OhHelp: Balancing Check (1/2)
[Figure: helper tree of nodes 00 to 33]
12
OhHelp: Simulation Flow
steps: load balance & particle transfer, particle push, current scatter, field solve
[Figure: flow diagram; labels: particle transfer, secondary, primary, +, all-reduce, broadcast]
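One reading of the "primary", "secondary", "+" and "all-reduce" labels is that the node owning a subdomain and its helper each scatter the current of their own share of particles, and the partial currents are then summed. The sketch below illustrates only that idea on a 1D grid. The nearest-grid-point deposit, the `(x, v, q)` particle tuples and the function names are assumptions, and `all_reduce_sum` is a plain-Python stand-in rather than an MPI call.

```python
def scatter_current(particles, n_cells):
    """Nearest-grid-point deposit: add q*v to the cell containing each particle."""
    current = [0.0] * n_cells
    for x, v, q in particles:
        current[int(x)] += q * v
    return current

def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum over family members."""
    return [sum(cell) for cell in zip(*partials)]

# Particles of one subdomain as (position, velocity, charge) tuples.
particles = [(0.5, 1.0, 1.0), (1.5, 2.0, 1.0), (1.25, -1.0, 1.0), (3.75, 0.5, 2.0)]
kept_by_primary = particles[:2]   # share simulated by the owning node
sent_to_helper = particles[2:]    # share simulated by the helper node

partials = [scatter_current(kept_by_primary, 4),
            scatter_current(sent_to_helper, 4)]
combined = all_reduce_sum(partials)
reference = scatter_current(particles, 4)   # what a single node would compute
print(combined, reference)
```

The summed current matches what a single node would compute for the whole subdomain, which is why the field solve can proceed as if no particles had been handed to a helper.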
13
OhHelp: Evaluation Setup
[Figure: domain layouts for weak scaling and strong scaling; labels: #proc=1, #proc=512, #proc=1, #proc=1024, 64, 32x32x32, 256, 512]
14
OhHelp: Performance
[Figure: performance in 10^6 particles/s vs. # of processes; speedup labels: x293, x253, x721, x626, x35]
strong scaling: domain size = 64^3, #particles = 32 x 2^20
weak scaling: domain size = 32^3 x #proc, #particles = 8 x 2^20 x #proc
uniformly distributed: good scalability
particles concentrated in a 32^3 area: also well scalable
almost linear speed-up over 16 proc.: perf(1024) > perf(16) x 60
particle decomp.: not scalable
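The "almost linear" claim can be checked with simple arithmetic: going from 16 to 1024 processes is a 64-fold increase, so a performance ratio above 60 means a relative efficiency above 60/64. The sketch below does that calculation; the function name `parallel_efficiency` is hypothetical, and the only slide figure used is perf(1024) > perf(16) x 60.

```python
def parallel_efficiency(speedup, process_ratio):
    """Achieved speedup divided by the ideal (linear) speedup."""
    return speedup / process_ratio

# Ideal speedup from 16 to 1024 processes.
ideal = 1024 / 16
# Lower bound on relative efficiency implied by perf(1024) > perf(16) x 60.
relative = parallel_efficiency(60, ideal)
print(ideal, relative)
```

The bound works out to 0.9375, i.e. better than 93% of linear scaling between 16 and 1024 processes.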
15
Conclusion
We confirmed that the OhHelp'ed PIC simulator is scalable.
  600-700x speedup with 1024 processes
(Very near) future work is to build the OhHelp library.
  lev. 1: load balancing
  lev. 2: particle transfer
  lev. 3: inter-subdomain communication
  lev. 4: semi-automatic PIC code transformation for OhHelp'ing
16
Backup
17
Breakdown: Strong Scaling @ 256
[Figure: exec. time/step and comm. time/step, in ms]
18
Breakdown: Weak Scaling @ 256
[Figure: exec. time/step and comm. time/step, in ms]