ChaNGa: Design Issues in High Performance Cosmology. Pritish Jetley, Parallel Programming Laboratory
Overview: Why Barnes-Hut? Domain decomposition; tree construction; tree traversal; overlapping remote and local work; remote data caching; prefetching remote data; increasing local work; efficient sequential traversal; load balancing; multistepping
Why Barnes-Hut? Gravity is a long-range force: every particle interacts with every other. We do not need all N(N-1)/2 interactions; groups of distant particles ≈ point masses, giving O(N lg N) interactions. (Figure: the source particles are collapsed into an equivalent point mass, yielding a single interaction with the target particle.)
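To make the idea concrete, here is a minimal sketch (the Particle type and function name are illustrative, not ChaNGa's code): a distant group of source particles is collapsed into a single equivalent point mass at its center of mass, so the target particle performs one interaction instead of one per source.

```cpp
#include <vector>

struct Particle { double mass, x, y, z; };

// Collapse a group of distant source particles into one equivalent point
// mass located at their center of mass (assumes a non-empty group with
// positive total mass).
Particle equivalentPointMass(const std::vector<Particle>& sources) {
    Particle p{0.0, 0.0, 0.0, 0.0};
    for (const Particle& s : sources) {
        p.mass += s.mass;
        p.x += s.mass * s.x;
        p.y += s.mass * s.y;
        p.z += s.mass * s.z;
    }
    p.x /= p.mass;  // center of mass
    p.y /= p.mass;
    p.z /= p.mass;
    return p;
}
```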
Parallel Barnes-Hut: decomposition. Distribute particles among objects. To lower communication costs: keep particles that are close to each other on the same object, make spatial partitions regularly shaped, and balance the number of particles per partition.
Decomposition strategies. SFC: linearize particle coordinates by converting floats/doubles to integers and interleaving their bits. Example: particle (-0.49, 0.29, 0.41) is scaled to 21-bit unsigned integers (x: 0x4E20, y: 0x181BE0, z: 0x1BC560); the bits are interleaved and a 1 is prepended, giving the key 0x16C12685AE69F0000.
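A minimal sketch of this key construction, assuming coordinates in [-0.5, 0.5) and a 64-bit key holding three 21-bit fields plus a leading 1 bit (the helper names are illustrative, not ChaNGa's):

```cpp
#include <cstdint>

// Scale a coordinate in [-0.5, 0.5) to a 21-bit unsigned integer.
static uint64_t scaleTo21Bits(double c) {
    return static_cast<uint64_t>((c + 0.5) * (1ULL << 21)) & 0x1FFFFFULL;
}

// Interleave the bits of x, y, z (Morton order) and prepend a 1 bit.
uint64_t makeSFCKey(double x, double y, double z) {
    uint64_t xi = scaleTo21Bits(x), yi = scaleTo21Bits(y), zi = scaleTo21Bits(z);
    uint64_t key = 0;
    for (int b = 20; b >= 0; --b) {          // most significant bit first
        key = (key << 3) | (((xi >> b) & 1ULL) << 2)
                         | (((yi >> b) & 1ULL) << 1)
                         |  ((zi >> b) & 1ULL);
    }
    return key | (1ULL << 63);               // prepend the leading 1
}
```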
SFC: interleaving produces a jagged, space-filling line through the particles; the line is split among objects (TreePieces), e.g. TreePiece 0, TreePiece 1, and TreePiece 2 in the illustration.
Oct: recursively divide a partition into octants (quadrants in the 2-D illustration) if it contains more than τ particles; particle counts are gathered by iterative histogramming. In the illustration, τ = 3.
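A recursive sketch of the splitting rule (the production code histograms particle counts iteratively across processors instead; Box, Particle, and the helpers below are hypothetical):

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Particle { double x, y, z; };
struct Box { double cx, cy, cz, half; };     // center and half-width

// Which of the eight octants of box b contains particle p.
int octantOf(const Box& b, const Particle& p) {
    return (p.x > b.cx) | ((p.y > b.cy) << 1) | ((p.z > b.cz) << 2);
}

Box childBox(const Box& b, int oct) {
    double h = b.half / 2.0;
    return { b.cx + ((oct & 1) ? h : -h),
             b.cy + ((oct & 2) ? h : -h),
             b.cz + ((oct & 4) ? h : -h), h };
}

// Split a box into octants whenever it holds more than tau particles.
void splitOct(const Box& box, const std::vector<Particle>& parts,
              std::vector<Box>& leaves, std::size_t tau) {
    if (parts.size() <= tau) { leaves.push_back(box); return; }
    std::array<std::vector<Particle>, 8> byOctant;
    for (const Particle& p : parts) byOctant[octantOf(box, p)].push_back(p);
    for (int o = 0; o < 8; ++o)
        splitOct(childBox(box, o), byOctant[o], leaves, tau);
}
```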
Tree construction: TreePieces construct the trees beneath themselves independently of each other; multipole moment information is passed up the tree so that every processor has it.
Tree construction issues: TreePieces must be distributed evenly across processors. Particles are stored as a structure of arrays, which is (possibly) more cache friendly and makes the accessing code easier to vectorize. Tree data structure layout: calling new for each node is bad; better is to allocate all children together; better still is to allocate nodes in a DFS manner (the numbers in the illustration show the resulting node ordering).
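A sketch of the two layout ideas, with illustrative field names rather than ChaNGa's actual classes: particle data as a structure of arrays, and tree nodes packed into one contiguous array in depth-first order so a DFS traversal touches memory nearly sequentially.

```cpp
#include <cstddef>
#include <vector>

// Particles kept as a structure of arrays rather than an array of structs:
// (possibly) more cache friendly and easier for the compiler to vectorize.
struct ParticleSoA {
    std::vector<double> x, y, z, mass;
};

// All tree nodes live in one contiguous vector filled in depth-first order,
// instead of a separate `new` per node.
struct TreeNode {
    std::size_t firstChild;   // in DFS order, this is the node's own index + 1
    std::size_t subtreeEnd;   // one past this node's subtree; the next sibling starts here
    double comX, comY, comZ, totalMass;   // monopole moment (center of mass, total mass)
};

std::vector<TreeNode> nodes;  // built once, in DFS order
```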
Tree traversal: each TreePiece performs a depth-first traversal of the tree for each bucket of particles. For each node encountered: if the node is far enough away, compute the forces on the bucket due to the node and pop it from the stack; if the node is too close, push its children onto the stack.
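A hedged sketch of this traversal, with hypothetical types and helpers (the real code interleaves the walk with the remote-node machinery described on the following slides):

```cpp
#include <stack>
#include <vector>

struct Bucket;                       // a small group of target particles (defined elsewhere)

struct TreeNode {
    std::vector<const TreeNode*> children;
    // ... multipole moments, bounding box, etc.
};

bool isFarEnough(const TreeNode& node, const Bucket& bucket);   // opening criterion
void computeForces(const TreeNode& node, Bucket& bucket);       // one node-on-bucket interaction

// Depth-first walk driven by an explicit stack, repeated for every bucket.
void walkTree(const TreeNode* root, Bucket& bucket) {
    std::stack<const TreeNode*> work;
    work.push(root);
    while (!work.empty()) {
        const TreeNode* node = work.top();
        work.pop();
        if (isFarEnough(*node, bucket)) {
            computeForces(*node, bucket);            // far enough: single interaction
        } else {
            for (const TreeNode* child : node->children)
                work.push(child);                    // too close: open the node
        }
    }
}
```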
Illustration: the yellow circles represent opening-criterion checks.
Tree traversal: the entire tree cannot be held on every processor, so nodes are either local or remote. Remote nodes must be requested from other TreePieces, which generates communication. Give high priority to remote work, and do local work while waiting for remote nodes to arrive: overlap.
Overlapping remote and local work. (Figure: activity across processors over time, showing when remote requests are sent and received, remote work, and local work.)
Remote data caching reduces communication: cache requested remote data on the processor and reuse it to reduce the number of requests. Data requested by one TreePiece is used by others, so there are fewer messages and less overhead for processing remote data requests. The optimal cache line size (the depth of the tree fetched beneath a requested node) is about 2 for octrees.
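A minimal sketch of such a per-processor software cache; the types and the request/reply interface are invented for illustration, not ChaNGa's actual cache:

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

struct TreeNode;                        // cached copy of a remote node
using NodeKey = std::uint64_t;          // key identifying a tree node

class RemoteNodeCache {
public:
    // Returns the cached node, or nullptr after issuing (at most one)
    // request for the node and the depth-`lineDepth_` subtree beneath it.
    const TreeNode* lookup(NodeKey key, int ownerPE) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;       // hit: reused by all TreePieces here
        if (outstanding_.insert(key).second)
            sendRequest(ownerPE, key, lineDepth_);       // miss: one message per key
        return nullptr;                                  // caller falls back to local work
    }

    // Called when the reply arrives; inserts the whole cache line.
    void receive(NodeKey key, const TreeNode* subtreeRoot) {
        cache_[key] = subtreeRoot;
        outstanding_.erase(key);
    }

private:
    void sendRequest(int ownerPE, NodeKey key, int depth);  // assumed messaging layer
    std::unordered_map<NodeKey, const TreeNode*> cache_;
    std::unordered_set<NodeKey> outstanding_;
    int lineDepth_ = 2;                                     // ~optimal for octrees
};
```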
Remote data prefetching: estimate the remote data requirements of each TreePiece and prefetch before the traversal, reducing the latency of node access during the traversal.
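A hedged sketch of the prefetch step, with hypothetical helpers (estimateRemoteNeeds, requestRemoteNode) standing in for ChaNGa's actual prefetch walk: requests are issued up front so the replies populate the remote-data cache before the per-bucket traversal asks for those nodes.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using NodeKey = std::uint64_t;

// Predicted (key, owner PE) pairs, e.g. derived from the TreePiece's bounds.
std::vector<std::pair<NodeKey, int>> estimateRemoteNeeds();
void requestRemoteNode(NodeKey key, int ownerPE);   // fills the remote-data cache asynchronously

void prefetchRemoteData() {
    for (const auto& [key, owner] : estimateRemoteNeeds())
        requestRemoteNode(key, owner);              // fire-and-forget; overlaps with other work
}
```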
Increasing local work: dividing the tree into TreePieces reduces the amount of local work per piece. Combining the TreePieces on one processor increases the amount of local work: without combination, 16% of the work per TreePiece is local; with combination, 58%.
Algorithmic efficiency: normally the entire tree is walked once for each bucket. However, proximal buckets have similar interactions with the rest of the universe, so interaction lists are shared between buckets as far as possible. The distance check is made between a remote tree node and a local ancestor of the buckets (instead of the buckets themselves). This improves on the normal traversal by 7-10%.
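A hedged sketch of the ancestor-level check, with hypothetical types and helpers: a remote node accepted against a local ancestor goes onto a list shared by every bucket beneath that ancestor, saving one opening-criterion test per bucket.

```cpp
#include <vector>

struct TreeNode;  // covers both local ancestors and remote nodes

bool farEnough(const TreeNode& remote, const TreeNode& localAncestor);  // opening criterion

// Partition candidate remote nodes into a list shared by all buckets under
// the ancestor, and a list that must be re-tested at a finer level.
void filterAgainstAncestor(const TreeNode& localAncestor,
                           const std::vector<const TreeNode*>& candidates,
                           std::vector<const TreeNode*>& sharedList,
                           std::vector<const TreeNode*>& recheckPerBucket) {
    for (const TreeNode* r : candidates) {
        if (farEnough(*r, localAncestor))
            sharedList.push_back(r);           // valid for every bucket under the ancestor
        else
            recheckPerBucket.push_back(r);     // descend / re-test closer to the buckets
    }
}
```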
ChaNGa: a recent result
Clustered dataset: Dwarf. Animation 1 shows the volume; animation 2 shows the time profile. Idle time is due to message delays, and also to load imbalances (solved by hierarchical balancers). The dataset is highly clustered: the maximum number of requests per processor exceeds 30K.
Solution: replication. Replicate tree nodes across PEs to distribute requests; a requester randomly selects a replica.
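The selection step itself is simple; a minimal sketch, assuming each replicated node carries a list of the PEs holding its copies:

```cpp
#include <random>
#include <vector>

// Pick one replica at random, spreading request load across the PEs that
// hold copies instead of hammering a single owner.
int chooseReplica(const std::vector<int>& replicaPEs, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, replicaPEs.size() - 1);
    return replicaPEs[pick(rng)];   // PE to send the request to
}
```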
Replication impact: replication distributes requests, reducing the maximum number of requests per processor from 30K to 4.5K. Gravity time is reduced from 2.4 s to 1.7 s on 8K cores and from 2.1 s to 0.99 s on 16K cores.
Multistepping: to push scaling further, we turn to algorithmic improvements in the computation itself. Using multiple time scales, or multistepping, can yield significant performance benefits. Particles are grouped into rungs according to their accelerations; a faster rung holds faster-moving particles. Different rungs are active at different times, and slower-rung particles are updated less frequently, so less computation is done than with singlestepping. The computation is thus split into phases. (Figure: processor load over time; some phases activate only rung 0, others rungs 0 and 1, others rungs 0, 1, and 2.)
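A sketch of rung assignment under an assumed timestep criterion dt = eta * sqrt(softening / |a|) with power-of-two rung spacing; the slide only says particles are grouped by acceleration, so the exact criterion here is an assumption. Rung r uses timestep dtBase / 2^r, so higher rungs are updated more often.

```cpp
#include <algorithm>
#include <cmath>

// Assign a particle to a rung from its acceleration magnitude (accel > 0).
// Rung 0 is the slowest rung (largest timestep).
int assignRung(double accel, double softening, double eta, double dtBase) {
    double dt = eta * std::sqrt(softening / accel);              // per-particle timestep
    int rung = static_cast<int>(std::ceil(std::log2(dtBase / dt)));
    return std::max(0, rung);
}
```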
Load imbalance with multistepping (Dwarf dataset, 32 BG/L processors, different timestepping schemes): singlestepped, 613 s; multistepped, 429 s; multistepped with load balancing, 228 s. Putting these principles into practice in our multiphase load balancer gave significant speedups over both plain multistepped and singlestepped runs.
Multistepping: the load (for the same object) changes across rungs, yet there is persistence within the same rung, so specialized phase-aware balancers were developed.
Multistepping tradeoff: parallel efficiency is lower, but performance is improved significantly. (Figure: comparison of singlestepping vs. multistepping.)
Thank you Questions?