Massively Parallel Cosmological Simulations with ChaNGa Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn
Simulations and Scientific Discovery
● Help reconcile observation and theory
  – Calculate final states of theories of structure formation
● Direct observational programs
  – What should we look for in space?
● Help determine underlying structures and masses
Computational Challenges
● N ~ 10^12
  – Direct summation of forces would take ~10^10 Teraflop-years
  – Need efficient, scalable algorithms
● Large dynamic ranges
  – Need multiple timestepping
● Irregular domains
  – Balance load across processors
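As a rough illustration of the first bullet (the scaling argument is standard; the log factor is an estimate and constants per interaction are ignored), a tree code reduces the per-step force cost by roughly:

    \[
      \frac{C_{\mathrm{direct}}}{C_{\mathrm{tree}}}
      \;\sim\; \frac{N^{2}}{N\log_{2}N}
      \;=\; \frac{N}{\log_{2}N}
      \;\approx\; \frac{10^{12}}{40}
      \;\approx\; 2.5\times 10^{10}
    \]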
ChaNGa
● Uses Barnes-Hut algorithm
● Based on Charm++
  – Processor virtualization
  – Asynchronous message-driven model
    ● Computation and communication overlap
  – Intelligent, adaptive runtime system
    ● Load balancing
Barnes-Hut Algorithm Overview
● Space divided into cells
● Cells form nodes of Barnes-Hut tree
  – Particles grouped into buckets
  – Buckets assigned to TreePieces
Fig: Spatial domain and tree partitioned among TreePieces 1, 2 and 3
Computing Forces
● Collect relevant nodes/particles at TreePiece
● Traverse global tree to get force on each bucket
  – Nodes “opened” (too close)
  – or not (far enough)
Fig: Tree nodes involved vs. not involved in the computation for a given bucket
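A minimal sketch of the node-opening decision during the tree walk (illustrative only: the names, the opening-angle test, and the monopole-only force are assumptions, not ChaNGa's actual code; bucket-level particle-particle terms and self-interactions are omitted):

    #include <cmath>
    #include <vector>

    struct Particle { double x, y, z, mass; double ax = 0, ay = 0, az = 0; };

    // One cell of the Barnes-Hut tree: total mass, center of mass, cell size.
    struct TreeNode {
        double mass;                    // total mass of particles under this node
        double cx, cy, cz;              // center of mass
        double size;                    // side length of the cell
        std::vector<TreeNode*> kids;    // empty for a leaf (bucket)
        std::vector<Particle*> bucket;  // particles, only at leaves
    };

    // Opening criterion: a node is "far enough" if size / distance < theta.
    bool farEnough(const TreeNode& n, const Particle& p, double theta) {
        double dx = n.cx - p.x, dy = n.cy - p.y, dz = n.cz - p.z;
        double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
        return n.size < theta * dist;
    }

    // Walk the tree for one particle: use the node's monopole if it is far
    // enough, otherwise "open" the node and recurse into its children.
    void walk(const TreeNode& n, Particle& p, double theta, double eps2) {
        if (n.kids.empty() || farEnough(n, p, theta)) {
            double dx = n.cx - p.x, dy = n.cy - p.y, dz = n.cz - p.z;
            double r2 = dx * dx + dy * dy + dz * dz + eps2;   // softened distance
            double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            p.ax += n.mass * dx * inv_r3;
            p.ay += n.mass * dy * inv_r3;
            p.az += n.mass * dz * inv_r3;
            return;
        }
        for (const TreeNode* c : n.kids) walk(*c, p, theta, eps2);
    }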
Processor Algorithm Overview
Fig: Per-processor pipeline overlapping local and global work: prefetch of chunk n+1 proceeds while computation on chunk n runs. When a TreePiece needs remote particles, the CacheManager replies from its cache if it has them; otherwise it requests the particles from the owning TreePiece and delivers them when they are received.
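The request path in the figure can be sketched as follows (a conceptual plain-C++ sketch, not ChaNGa's Charm++ entry methods; the class name, keys and callbacks are assumptions):

    #include <functional>
    #include <map>
    #include <vector>

    struct ParticleData { /* positions, masses, ... */ };
    using NodeKey  = unsigned long long;
    using Callback = std::function<void(const std::vector<ParticleData>&)>;

    // Conceptual per-processor cache of remote tree data.
    class CacheManager {
        std::map<NodeKey, std::vector<ParticleData>> cache;
        std::map<NodeKey, std::vector<Callback>> pending;  // walks stalled on a miss
    public:
        // Called by a TreePiece whose walk needs particles it does not own.
        void request(NodeKey key, Callback resume) {
            auto hit = cache.find(key);
            if (hit != cache.end()) { resume(hit->second); return; }   // cache hit
            bool firstMiss = pending[key].empty();
            pending[key].push_back(std::move(resume));
            if (firstMiss) sendRequestToOwner(key);  // one remote message per key
        }
        // Called when the owning TreePiece's reply arrives; resumes stalled walks.
        void receive(NodeKey key, std::vector<ParticleData> data) {
            cache[key] = std::move(data);
            for (auto& cb : pending[key]) cb(cache[key]);
            pending.erase(key);
        }
    private:
        void sendRequestToOwner(NodeKey) { /* async message to the remote TreePiece */ }
    };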
Major Optimizations
● Pipelined computation
  – Prefetch tree chunk before starting traversal
● Tree-in-Cache
  – Aggregate trees from all chares on a processor
● Tunable computation granularity
  – Response time for data requests vs. scheduling overhead
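The pipelining idea in the first bullet, written as an explicit loop for clarity (ChaNGa drives this asynchronously through Charm++ messages; the helper functions below are hypothetical stand-ins):

    // Hypothetical helpers standing in for ChaNGa's asynchronous machinery.
    void prefetchChunk(int /*chunk*/) { /* issue async requests for the chunk's remote data */ }
    void waitForChunk(int /*chunk*/)  { /* block or yield until the chunk has arrived */ }
    void computeChunk(int /*chunk*/)  { /* walk the tree for local buckets against the chunk */ }

    // Schematic pipeline: prefetch chunk n+1 while computing on chunk n, so the
    // latency of remote data requests is hidden behind useful work.
    void remoteGravity(int numChunks) {
        if (numChunks <= 0) return;
        prefetchChunk(0);                                  // warm up the pipeline
        for (int n = 0; n < numChunks; ++n) {
            if (n + 1 < numChunks) prefetchChunk(n + 1);   // asynchronous request
            waitForChunk(n);                               // usually already resident
            computeChunk(n);                               // traverse and accumulate forces
        }
    }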
Experimental Setup
● Datasets:
  – lambs: 3 million particles
  – hrwh_LCDMs: 16 million particles
  – dwarf: 5 and 50 million particles
  – drgas: 700 million particles
Experimental Setup (contd.)
● Platforms: Tungsten, Cray XT3, IBM BG/L
Parallel Performance
Fig: A comparison of parallel performance with PKDGRAV (`dwarf' dataset on Tungsten)
Scaling Tests
Fig: Scaling on the Cray XT3 and IBM BG/L, showing poor scaling at larger processor counts
Towards Greater Scalability
● Load imbalance causes poor scaling
● Static balancing not good enough
  – Even number of particles != even work distribution
● Must balance both computation & communication
Balancing Load to Improve Performance
● LB algorithms must consider both computation and communication
Fig: Processor activity over time (computation vs. communication), showing that greater balance of computation can come at the cost of increased communication
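One way to make "both computation and communication" concrete is a weighted per-object cost (an illustrative model only; the weights and the metric are assumptions, not the actual Charm++ balancer objective):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Per-chare (TreePiece) measurements gathered by the runtime.
    struct ObjectLoad {
        double cpuTime;      // measured computation time
        double bytesSent;    // communication volume attributed to the object
        int    currentPe;    // processor it currently lives on
    };

    // Illustrative combined cost: computation plus a penalty for off-processor
    // communication. alpha and betaPerByte are tunable weights (assumed).
    double cost(const ObjectLoad& o, int candidatePe,
                double alpha = 1.0, double betaPerByte = 1e-8) {
        double comm = (candidatePe == o.currentPe) ? 0.0 : betaPerByte * o.bytesSent;
        return alpha * o.cpuTime + comm;
    }

    // Worst per-processor cost under a candidate assignment (object i -> assign[i]);
    // a balancer would try to keep this maximum as small as possible.
    double maxLoad(const std::vector<ObjectLoad>& objs,
                   const std::vector<int>& assign, int numPes) {
        std::vector<double> perPe(numPes, 0.0);
        for (std::size_t i = 0; i < objs.size(); ++i)
            perPe[assign[i]] += cost(objs[i], assign[i]);
        double worst = 0.0;
        for (double l : perPe) worst = std::max(worst, l);
        return worst;
    }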
Accounting for Communication: OrbRefineLB
● Based on Charm++ OrbLB
  – ORB along the object identifier line
● OrbRefineLB `refines' placement by exchanging load between processors in a shifting window
Fig: Execution profile (time per processor) under OrbLB: 1024 BG/L processors, dwarf dataset
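The underlying idea, recursive bisection of the objects along their identifier line so that each half carries load proportional to its processor count, can be sketched as follows (illustrative; OrbLB's actual implementation and tie-breaking differ):

    #include <vector>

    struct Obj { int id; double load; };  // objects assumed sorted by identifier

    // Recursively assign objs[lo, hi) to processors pe0 .. pe0 + numPes - 1,
    // splitting the identifier line so each side gets a proportional load share.
    void orb1D(const std::vector<Obj>& objs, int lo, int hi,
               int pe0, int numPes, std::vector<int>& assignment) {
        if (numPes == 1) {
            for (int i = lo; i < hi; ++i) assignment[objs[i].id] = pe0;
            return;
        }
        int leftPes = numPes / 2;
        double total = 0.0;
        for (int i = lo; i < hi; ++i) total += objs[i].load;
        double target = total * leftPes / numPes;   // load for the left half

        double acc = 0.0;
        int split = lo;
        while (split < hi - 1 && acc + objs[split].load <= target)
            acc += objs[split++].load;

        orb1D(objs, lo, split, pe0, leftPes, assignment);
        orb1D(objs, split, hi, pe0 + leftPes, numPes - leftPes, assignment);
    }

OrbRefineLB would then start from such an assignment and swap objects between neighboring processors in a shifting window to smooth out residual imbalance.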
Results with OrbRefineLB
Fig: Performance with OrbRefineLB across the different datasets
Multistepped Simulations for Greater Efficiency
● Group particles into `rungs'
  – Lower rung means higher acceleration
  – Different rungs active at different times
● Update particles on higher rungs less frequently
● Less work done than singlestepping
Fig: Time per processor; computation split into phases (0: rung 0; 1: rungs 0,1; 2: rungs 0,1,2)
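Under this slide's convention (lower rung = higher acceleration = smaller timestep, so rung 0 is updated every substep), the set of rungs active at each substep can be sketched as follows (an assumed scheme for illustration, not necessarily ChaNGa's exact bookkeeping):

    #include <vector>

    // Highest rung active at a given substep of a big step, assuming rung r
    // takes a timestep 2^r times larger than rung 0.
    int highestActiveRung(int substep, int maxRung) {
        if (substep == 0) return maxRung;   // start of the big step: all rungs active
        int r = 0;
        while (r < maxRung && substep % (1 << (r + 1)) == 0) ++r;
        return r;
    }

    // Active rungs are 0 .. highestActiveRung(substep, maxRung); e.g. with
    // maxRung = 2 the phases over substeps 0..3 are {0,1,2}, {0}, {0,1}, {0}.
    std::vector<int> activeRungs(int substep, int maxRung) {
        std::vector<int> rungs;
        for (int r = 0; r <= highestActiveRung(substep, maxRung); ++r)
            rungs.push_back(r);
        return rungs;
    }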
Imbalance in MS Simulations
● But load imbalance is even more severe
  – Different particles active during different phases
Fig: Execution profile of 32 processors during an MS simulation; computation split into phases (0, 0, 1, 2)
Balancing Load in MS Runs
● Different strategies for different phases
● Multiphase instrumentation
● Model-based load estimation (first few small steps)
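A sketch of what multiphase instrumentation with a model-based fallback might look like (illustrative data structure only; the names and the fallback rule are assumptions, not ChaNGa's implementation):

    #include <map>
    #include <vector>

    // Record each object's measured load separately for every phase, so the
    // balancer for phase p uses phase-p data, or a model-based estimate while
    // measurements are still being gathered during the first few small steps.
    struct PhaseLoadDB {
        // phaseLoad[phase][objectId] = accumulated wall time in that phase
        std::map<int, std::vector<double>> phaseLoad;

        void record(int phase, int objectId, double seconds, int numObjects) {
            auto& v = phaseLoad[phase];
            if (v.empty()) v.assign(numObjects, 0.0);
            v[objectId] += seconds;
        }

        // Load estimate used when balancing 'phase'; falls back to a
        // caller-supplied model prediction if the phase was never timed.
        double estimate(int phase, int objectId, double modelPrediction) const {
            auto it = phaseLoad.find(phase);
            if (it == phaseLoad.end()) return modelPrediction;
            return it->second[objectId];
        }
    };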
Preliminary Results
● Dwarf dataset, 32 BG/L processors, different timestepping schemes:
  – Singlestepped: 613 s
  – Multistepped: 429 s
  – Multistepped with load balancing: 228 s
Preliminary Results (contd.)
● ~50% reduction in execution time with multistepping and overdecomposition
  – Fig: Lambb dataset, 512 and 1024 BG/L processors, singlestepped vs. load-balanced multistepped
● More TreePieces → greater load balance
  – Fig: Lambb dataset, 1024 BG/L processors, varying number of TreePieces
Future Work
● SPH
● Alternative decomposition schemes
● Runtime optimizations to reduce communication cost
● More sophisticated load balancing algorithms
  – Account for:
    ● Complete simulation space topology
    ● Processor topology (reduce hop-bytes)
Conclusions
● Introduced ChaNGa
● Optimizations to reduce simulation time
● Load imbalance issues tackled
● Multiple timestepping beneficial
● Balancing load in multistepped simulations