1
Massively Parallel Cosmological Simulations with ChaNGa
Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn
2
Simulations and Scientific Discovery
● Help reconcile observation and theory
– Calculate the final states predicted by theories of structure formation
● Direct observational programs
– What should we look for in space?
● Help determine underlying structures and masses
3
Computational Challenges
● N ~ 10^12 particles
– Direct-summation forces would take ~10^10 Teraflop-years (rough check below)
– Need efficient, scalable algorithms
● Large dynamic ranges
– Need multiple timestepping
● Irregular domains
– Must balance load across processors
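A rough consistency check of that figure (our own arithmetic, assuming ~30 floating-point operations per pairwise interaction and on the order of 10^4 timesteps, neither stated on the slide):

```latex
% direct summation: N^2 pair interactions per step
(10^{12})^2 \,\text{pairs} \times 30 \,\tfrac{\text{flops}}{\text{pair}} \times 10^{4}\,\text{steps}
  \approx 3\times10^{29}\ \text{flops},
\qquad
\frac{3\times10^{29}\ \text{flops}}
     {10^{12}\ \tfrac{\text{flops}}{\text{s}} \times 3.15\times10^{7}\ \tfrac{\text{s}}{\text{yr}}}
  \approx 10^{10}\ \text{Teraflop-years}.
```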
4
ChaNGa
● Uses the Barnes-Hut algorithm
● Based on Charm++ (minimal example below)
– Processor virtualization
– Asynchronous, message-driven model: computation and communication overlap
– Intelligent, adaptive runtime system: load balancing
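A minimal Charm++ sketch of this model (the module, chare names, and entry methods here are illustrative, not ChaNGa's actual interface). Entry-method invocations are asynchronous messages, and creating many more chares than processors gives the runtime room to overlap communication with computation and to balance load by migrating chares:

```cpp
// ---- sim.ci: Charm++ interface file (illustrative names) ----
// mainmodule sim {
//   mainchare Main { entry Main(CkArgMsg *m); };
//   array [1D] TreePiece {
//     entry TreePiece();
//     entry void computeForces();  // invoking this sends an async message
//   };
// };

// ---- sim.cpp ----
#include "sim.decl.h"

class Main : public CBase_Main {
 public:
  Main(CkArgMsg *m) {
    delete m;
    // Overdecomposition: several TreePieces per processor, so the runtime
    // always has other work to schedule while messages are in flight.
    CProxy_TreePiece pieces = CProxy_TreePiece::ckNew(4 * CkNumPes());
    pieces.computeForces();  // broadcast to all elements; returns immediately
  }
};

class TreePiece : public CBase_TreePiece {
 public:
  TreePiece() {}
  void computeForces() { /* walk the tree, request remote data, ... */ }
};

#include "sim.def.h"
```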
5
Barnes-Hut Algorithm Overview
● Space divided into cells
● Cells form the nodes of the Barnes-Hut tree (build sketched below)
– Particles grouped into buckets
– Buckets assigned to TreePieces
[Figure: spatial cells colored by owning TreePiece (TreePiece 1, 2, 3)]
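A minimal sketch of that structure (the names, cubic cells, and fixed bucket size are our assumptions): space is split recursively until a cell holds at most a bucket's worth of particles, and those leaf cells are the buckets dealt out to TreePieces.

```cpp
#include <vector>

struct Particle { double pos[3]; double mass; };

struct TreeNode {
  double center[3];                 // cell center
  double halfSide;                  // half the cell's side length
  std::vector<Particle*> bucket;    // non-empty only at leaves
  TreeNode* child[8] = {nullptr};   // octants
  double totalMass = 0, com[3] = {0, 0, 0};  // moments used during the walk
};

const unsigned BUCKET_SIZE = 12;    // tunable granularity knob

void build(TreeNode* node, std::vector<Particle*> parts) {
  if (parts.size() <= BUCKET_SIZE) {       // small enough: this cell is a bucket
    node->bucket = std::move(parts);
    return;
  }
  std::vector<Particle*> oct[8];           // partition particles by octant
  for (Particle* p : parts) {
    int idx = (p->pos[0] > node->center[0] ? 1 : 0)
            + (p->pos[1] > node->center[1] ? 2 : 0)
            + (p->pos[2] > node->center[2] ? 4 : 0);
    oct[idx].push_back(p);
  }
  for (int i = 0; i < 8; ++i) {            // recurse into non-empty octants
    if (oct[i].empty()) continue;
    TreeNode* c = new TreeNode;
    c->halfSide = node->halfSide / 2;
    for (int d = 0; d < 3; ++d)
      c->center[d] = node->center[d] + ((i & (1 << d)) ? c->halfSide : -c->halfSide);
    node->child[i] = c;
    build(c, std::move(oct[i]));
  }
}
```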
6
Computing Forces
● Collect relevant nodes/particles at each TreePiece
● Traverse the global tree to get the force on each bucket
– Nodes are "opened" (too close) or used whole (far enough); see the sketch below
[Figure: tree with nodes marked as involved or not involved in the computation]
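Continuing the sketch above, the open-or-not decision is the standard Barnes-Hut opening-angle test (the θ value is a typical choice, not ChaNGa's; we also assume totalMass and com were accumulated for every node after the build, and use a bare monopole where a real code would use a fuller multipole expansion):

```cpp
#include <cmath>

const double THETA = 0.7;  // open a node when size/distance > THETA

void accumulateForce(const TreeNode* n, const Particle* p, double acc[3]) {
  double dx[3], d2 = 1e-30;            // tiny epsilon guards against d = 0
  for (int i = 0; i < 3; ++i) { dx[i] = n->com[i] - p->pos[i]; d2 += dx[i] * dx[i]; }
  double d = std::sqrt(d2);
  bool leaf = !n->bucket.empty();
  if (!leaf && 2 * n->halfSide >= THETA * d) {
    for (const TreeNode* c : n->child)      // too close: "open" the node
      if (c) accumulateForce(c, p, acc);
  } else if (leaf) {                        // bucket: direct summation
    for (Particle* q : n->bucket) {
      if (q == p) continue;
      double e[3], r2 = 1e-30;
      for (int i = 0; i < 3; ++i) { e[i] = q->pos[i] - p->pos[i]; r2 += e[i] * e[i]; }
      double r = std::sqrt(r2);
      for (int i = 0; i < 3; ++i) acc[i] += q->mass * e[i] / (r2 * r);
    }
  } else {                                  // far enough: use the node whole
    double f = n->totalMass / (d2 * d);
    for (int i = 0; i < 3; ++i) acc[i] += f * dx[i];
  }
}
```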
7
Processor Algorithm Overview
[Figure: pipelined local and global work on one processor, alternating prefetch and compute phases Pref(n-1), Comp(n-1), Pref(n), Comp(n), Pref(n+1), Comp(n+1). When a TreePiece needs remote particles it asks the CacheManager: if they are in the cache it replies with the particles at once; otherwise it requests them from the owning TreePiece and forwards them on receipt.]
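A sketch of the hit/miss logic in that diagram (class and method names are ours, not ChaNGa's CacheManager interface). The key point is that a miss does not block: the waiting bucket's continuation is parked, one request goes to the owner, and the processor keeps computing on other buckets until the reply arrives.

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

using Reply = std::function<void(const std::vector<Particle>&)>;

class CacheManager {
  std::unordered_map<long, std::vector<Particle>> cache_;  // node key -> particles
  std::unordered_map<long, std::vector<Reply>> waiting_;   // continuations per miss
 public:
  void request(long key, Reply cont) {
    auto hit = cache_.find(key);
    if (hit != cache_.end()) { cont(hit->second); return; } // have in cache: reply
    bool firstMiss = waiting_[key].empty();
    waiting_[key].push_back(std::move(cont));
    if (firstMiss) sendToOwner(key);        // only one message per missing key
  }
  void receive(long key, std::vector<Particle> data) {      // reply from owner
    cache_[key] = std::move(data);
    for (Reply& c : waiting_[key]) c(cache_[key]);          // resume waiters
    waiting_.erase(key);
  }
 private:
  void sendToOwner(long /*key*/) { /* async message to the owning TreePiece */ }
};
```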
8
Major Optimizations
● Pipelined computation
– Prefetch the next tree chunk before starting its traversal
● Tree-in-Cache
– Aggregate the trees of all chares on a processor
● Tunable computation granularity
– Trades response time for data requests against scheduling overhead
9
Experimental Setup
● Datasets:
– lambs: 3 million particles
– hrwh_LCDMs: 16 million particles
– dwarf: 5 and 50 million particles
– drgas: 700 million particles
10
Experimental Setup (contd.)
● Platforms: Tungsten, Cray XT3 and IBM BG/L
11
Parallel Performance
[Figure: comparison of parallel performance with PKDGRAV; 'dwarf' dataset on Tungsten]
12
Scaling Tests
[Figures: scaling on the Cray XT3 and IBM BG/L, showing poor scaling at larger processor counts]
13
Towards Greater Scalability
● Load imbalance causes the poor scaling
● Static balancing is not good enough
– An even number of particles does not mean an even distribution of work
● Must balance both computation and communication
14
Balancing Load to Improve Performance
● LB algorithms must consider both computation and communication
[Figure: processor activity over time; balancing computation alone increases communication, while accounting for both yields greater balance]
15
Accounting for Communication: OrbRefineLB
● Based on Charm++'s OrbLB
– ORB along the object-identifier line (sketched below)
● OrbRefineLB 'refines' the placement by exchanging load between processors within a shifting window
[Figure: processor timelines under OrbLB; dwarf dataset on 1024 BG/L processors]
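A sketch of the ORB half of this (names ours; OrbRefineLB's shifting refinement window is omitted). Objects stay sorted by identifier, and the range is cut where the measured load splits in proportion to the processors on each side:

```cpp
#include <vector>

struct Obj { int id; double load; };  // load comes from runtime instrumentation

// Assign objs[lo..hi) (already sorted by id) to processors [pLo, pHi);
// `assign` is indexed by object id, assumed to run 0..N-1.
void orb(const std::vector<Obj>& objs, int lo, int hi, int pLo, int pHi,
         std::vector<int>& assign) {
  if (pHi - pLo == 1) {
    for (int i = lo; i < hi; ++i) assign[objs[i].id] = pLo;
    return;
  }
  double total = 0;
  for (int i = lo; i < hi; ++i) total += objs[i].load;
  int pMid = (pLo + pHi) / 2;
  double target = total * double(pMid - pLo) / (pHi - pLo);  // left side's share
  int cut = lo;
  double acc = 0;
  while (cut < hi && acc + objs[cut].load <= target) acc += objs[cut++].load;
  orb(objs, lo, cut, pLo, pMid, assign);   // bisect recursively on both sides
  orb(objs, cut, hi, pMid, pHi, assign);
}
```

Cutting along the identifier line keeps objects that are adjacent in the tree on the same or nearby processors, which is what lets this balancer account for communication as well as computation.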
16
Results with OrbRefineLB
[Figures: performance with OrbRefineLB on the different datasets]
17
Multistepped Simulations for Greater Efficiency
● Group particles into 'rungs' (bookkeeping sketched below)
– A lower rung means higher acceleration
– Different rungs are active at different times
● Particles on higher rungs are updated less frequently
● Less work is done than with singlestepping
[Figure: processor activity over time; computation is split into phases, where a phase labeled 0 updates rung 0 only, 1 updates rungs 0 and 1, and 2 updates rungs 0-2]
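A sketch of the rung bookkeeping, using this slide's convention that rung 0 has the smallest timestep (the function names and the power-of-two step hierarchy are our assumptions):

```cpp
#include <algorithm>
#include <cmath>

// Rung r steps with dtMin * 2^r, so higher rungs are updated less often.
int chooseRung(double dtWanted, double dtMin, int maxRung) {
  int r = (int)std::floor(std::log2(dtWanted / dtMin));
  return std::clamp(r, 0, maxRung);  // high acceleration -> small dt -> low rung
}

// Rung r is active every 2^r substeps; a phase's label is its highest
// active rung, which is why different phases do different amounts of work.
bool rungActive(int rung, long substep) {
  return substep % (1L << rung) == 0;
}
```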
18
Imbalance in MS Simulations
● But load imbalance is even more severe
– Different particles are active during different phases
[Fig: execution profile of 32 processors during a MS simulation, with the computation split into phases]
19
Balancing Load in MS Runs
● Different strategies for different phases
● Multiphase instrumentation
● Model-based load estimation from the first few small steps (sketched below)
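A sketch of what model-based estimation could look like here (entirely our construction): per-phase cost coefficients are measured during the first few small steps, and an object's load in a phase is then predicted from its active-particle count rather than from stale whole-step timings.

```cpp
#include <vector>

// activeParticles[p]: how many of this TreePiece's particles are active in phase p.
struct PieceStats { std::vector<int> activeParticles; };

// costPerParticle[p]: measured during the first few small steps
// (multiphase instrumentation), one coefficient per phase.
double predictLoad(const PieceStats& s, int phase,
                   const std::vector<double>& costPerParticle) {
  return s.activeParticles[phase] * costPerParticle[phase];
}
```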
20
Preliminary Results
● Dwarf dataset, 32 BG/L processors, different timestepping schemes:
– Singlestepped: 613 s
– Multistepped: 429 s
– Multistepped with load balancing: 228 s
21
Preliminary Results
● ~50% reduction in execution time with multistepping and load balancing
– lambs dataset, 512 and 1024 BG/L processors
– Singlestepped vs load-balanced multistepped
● Overdecomposition: more TreePieces → greater load balance
– lambs dataset, 1024 BG/L processors, varying number of TreePieces
22
Future Work
● SPH
● Alternative decomposition schemes
● Runtime optimizations to reduce communication costs
● More sophisticated load-balancing algorithms, accounting for:
– the complete simulation-space topology
– the processor topology, to reduce hop-bytes (defined below)
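For reference, hop-bytes is the usual metric here: each message's size weighted by the number of network hops it travels, summed over all messages.

```latex
\text{hop-bytes} \;=\; \sum_{m \,\in\, \text{messages}} \mathrm{bytes}(m) \times \mathrm{hops}(m)
```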
23
Conclusions
● Introduced ChaNGa
● Presented optimizations that reduce simulation time
● Tackled load-imbalance issues
● Showed that multiple timestepping is beneficial
● Balanced load in multistepped simulations