Massively Parallel Cosmological Simulations with ChaNGa Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn.



1 Massively Parallel Cosmological Simulations with ChaNGa Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn

2 Simulations and Scientific Discovery ● Help reconcile observation and theory – Calculate final states of theories of structure formation ● Direct observational programs – What should we look for in space? ● Help determine underlying structures and masses

3 Computational Challenges ● N ~10^12 – Direct summation forces would take ~10^10 Teraflop years – Need efficient, scalable algorithms ● Large dynamic ranges – Need multiple timestepping ● Irregular domains – Balance load across processors
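The ~10^10 Teraflop-year figure for direct summation can be sanity-checked with a back-of-the-envelope estimate. The flops-per-interaction count and the number of timesteps below are illustrative assumptions, not numbers from the talk:

```python
# Rough cost of direct-summation gravity for N ~ 10^12 particles.
N = 1e12                    # particles (from the slide)
flops_per_pair = 30         # assumed cost of one pairwise gravity kernel
steps = 1e4                 # assumed number of timesteps in a full run

pair_interactions = N * N   # direct summation is O(N^2) per step
total_flops = pair_interactions * flops_per_pair * steps

# Flops delivered by a sustained 1 Tflop/s machine over one year.
teraflop_year = 1e12 * 3600 * 24 * 365

print(total_flops / teraflop_year)   # on the order of 10^10 Teraflop-years
```

Under these assumptions the estimate lands near the slide's 10^10 Teraflop-years, which is why an O(N log N) tree algorithm is essential.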

4 ChaNGa ● Uses Barnes-Hut algorithm ● Based on Charm++ – Processor virtualization – Asynchronous message-driven model ● Computation and communication overlap – Intelligent, adaptive runtime system ● Load balancing

5 Barnes-Hut Algorithm Overview ● Space divided into cells ● Cells form nodes of Barnes-Hut tree – Particles grouped into buckets – Buckets assigned to TreePieces [Figure: Barnes-Hut tree partitioned among TreePieces 1-3]

6 Computing Forces ● Collect relevant nodes/particles at TreePiece ● Traverse global tree to get force on each bucket – Nodes “opened” (too close) or not (far enough) [Figure: tree with nodes marked as involved / not involved in the computation]
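The "opened or not" decision above is the classic Barnes-Hut multipole acceptance test: a node is opened when it subtends too large an angle from the point of evaluation. A minimal sketch, where the opening angle theta=0.7 is a typical choice rather than a value from the talk:

```python
import math

def should_open(node_size, node_com, point, theta=0.7):
    """Barnes-Hut opening test: open a node (descend into its children)
    when size / distance > theta, i.e. the node is too close to be
    approximated by a single multipole. theta is an assumed default."""
    dist = math.dist(node_com, point)
    return dist == 0 or node_size / dist > theta

# A distant node is used as a single multipole; a nearby one is opened.
assert not should_open(1.0, (100.0, 0.0, 0.0), (0.0, 0.0, 0.0))
assert should_open(1.0, (1.0, 0.0, 0.0), (0.0, 0.0, 0.0))
```

Smaller theta opens more nodes, trading accuracy for extra work.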

7 Processor Algorithm Overview [Figure: per-processor pipeline overlapping local and global work: Prefetch(n-1)/Compute(n-1), Prefetch(n)/Compute(n), ...; a TreePiece needing remote particles asks the CacheManager, which replies with the particles on a cache hit, or requests and receives them from the owner on a miss]
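The CacheManager logic on this slide can be sketched as a simple per-processor software cache. This is a toy model of the control flow, not ChaNGa's API; the names `fetch_remote` and `request` are illustrative:

```python
# Toy sketch of the slide's CacheManager flow: check the local cache,
# and on a miss fetch from the owning processor and cache the reply.
class CacheManager:
    def __init__(self, fetch_remote):
        self.cache = {}
        # Stand-in for an asynchronous Charm++ message to the owner.
        self.fetch_remote = fetch_remote

    def request(self, node_id):
        if node_id in self.cache:           # "Have in cache? Yes"
            return self.cache[node_id]      # "Reply with Particles"
        data = self.fetch_remote(node_id)   # "No -> Request Particles"
        self.cache[node_id] = data          # "Receive Particles"
        return data

remote_calls = []
cm = CacheManager(lambda nid: remote_calls.append(nid) or f"particles:{nid}")
cm.request(42)
cm.request(42)
assert remote_calls == [42]   # second request is served from the cache
```

In ChaNGa the miss is non-blocking: the TreePiece continues other work and the computation resumes when the reply arrives, which is what enables the computation/communication overlap mentioned on slide 4.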

8 Major Optimizations ● Pipelined computation – Prefetch tree chunk before starting traversal ● Tree-in-Cache – Aggregate trees from all chares on processor ● Tunable computation granularity – Response time for data requests vs Scheduling overhead

9 Experimental Setup ● lambs: 3 million particles ● hrwh_LCDMs: 16 million particles ● dwarf: 5 and 50 million particles ● drgas: 700 million particles

10 Experimental Setup (contd.) ● Platforms

11 Parallel Performance [Figure: comparison of parallel performance with PKDGRAV; `dwarf' dataset on Tungsten]

12 Scaling Tests [Figure: scaling on Cray XT3 and IBM BG/L, showing poor scaling at high processor counts]

13 Towards Greater Scalability ● Load Imbalance causes poor scaling ● Static balancing not good enough – Even number of particles != Even work distribution ● Must balance both computation & communication

14 Balancing Load to Improve Performance ● LB algorithms must consider both computation and communication [Figure: processor-activity timelines of computation and communication, showing greater balance but increased communication after load balancing]

15 Accounting for Communication: OrbRefineLB ● Based on Charm++ OrbLB – ORB along object identifier line ● OrbRefineLB `refines' placement by exchanging load between processors in a shifting window [Figure: processor timelines under OrbLB; `dwarf' dataset on 1024 BG/L processors]
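The ORB idea underlying OrbLB can be sketched in a few lines: order the objects along one coordinate and cut where the accumulated load reaches half the total, then recurse on each half. A toy model of one bisection level under that assumption, not OrbLB's actual code:

```python
# One level of Orthogonal Recursive Bisection (ORB) over weighted objects.
def orb_split(objects):
    """objects: list of (coord, load) pairs along the chosen axis.
    Returns (left, right) halves with roughly equal total load."""
    objs = sorted(objects)                  # order along the axis
    total = sum(load for _, load in objs)
    acc, cut = 0.0, 0
    for i, (_, load) in enumerate(objs):
        if acc + load > total / 2:          # crossing the half-load point
            break
        acc += load
        cut = i + 1
    return objs[:cut], objs[cut:]

left, right = orb_split([(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)])
assert len(left) == 2 and len(right) == 2   # equal loads split evenly
```

Because each cut keeps spatially adjacent objects together, ORB tends to preserve communication locality, which is why the slide pairs it with a refinement pass rather than a from-scratch reassignment.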

16 Results with OrbRefineLB ● Different datasets ● OrbRefineLB

17 Multistepped Simulations for Greater Efficiency ● Group particles into `rungs' – Lower rung means higher acceleration – Different rungs active at different times ● Update particles on higher rungs less frequently ● Less work done than singlestepping [Figure: processor timeline with computation split into phases; phase 0 updates rung 0 only, phase 1 rungs 0-1, phase 2 rungs 0-2]
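A common way to assign particles to rungs is a power-of-two timestep hierarchy; the slide does not spell out the scheme, so the function below is a hedged sketch under that assumption, following the slide's convention that rung 0 holds the most accelerated (smallest-timestep) particles:

```python
import math

def rung_of(dt, dt_min):
    """Map a particle's desired timestep dt onto a rung, assuming rung r
    takes steps of dt_min * 2**r. The most accelerated particles
    (smallest dt) land on rung 0 and are updated in every sub-step;
    higher rungs are updated less frequently. Illustrative, not ChaNGa's
    exact code."""
    return max(0, math.floor(math.log2(dt / dt_min)))

assert rung_of(1.0, 1.0) == 0   # smallest timestep: updated every phase
assert rung_of(4.0, 1.0) == 2   # 4x larger timestep: rung 2
```

Particles on rung 0 then do the singlestepping-equivalent amount of work, while every higher rung does exponentially less, which is where the savings over singlestepping come from.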

18 Imbalance in MS Simulations ● But load imbalance is even more severe – Different particles active during different phases [Figure: execution profile of 32 processors during a multistepped simulation, with computation split into phases]

19 Balancing Load in MS Runs ● Different strategies for different phases ● Multiphase instrumentation ● Model-based load estimation (first few small steps)

20 Preliminary Results ● dwarf dataset, 32 BG/L processors, different timestepping schemes – Singlestepped: 613 s – Multistepped: 429 s – Multistepped with load balancing: 228 s

21 Preliminary Results ● ~50% reduction in execution time from multistepping and overdecomposition – More TreePieces → greater load balance [Figures: singlestepped vs load-balanced multistepped, lambb dataset on 512 and 1024 BG/L processors; varying number of TreePieces, lambb dataset on 1024 BG/L processors]

22 Future Work ● SPH ● Alternative decomposition schemes ● Runtime optimizations to reduce communication cost ● More sophisticated load balancing algorithms – Account for: ● Complete simulation space topology ● Processor topology (reduce hop-bytes)

23 Conclusions ● Introduced ChaNGa ● Optimizations to reduce simulation time ● Load imbalance issues tackled ● Multiple timestepping beneficial ● Balancing load in multistepped simulations


