1
An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm
Paper by Martin Burtscher and Keshav Pingali
Presented by Jason Wengert
2
Barnes Hut Algorithm
O(n log n)
The space is recursively divided into cells
Groups of distant bodies in the same cell can be treated as a single body in the force calculation
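The "treat distant groups as one body" rule is usually expressed as an opening criterion: a cell of side s at distance d is approximated by its center of mass when s/d falls below a threshold θ. A minimal sketch (θ and the function name are illustrative assumptions, not from the paper):

```python
# Barnes-Hut opening criterion: a cell far enough away (relative to its
# size) is treated as a single point mass at its center of mass.
# theta is a tunable accuracy parameter; ~0.5 is a common choice.

def can_approximate(cell_size, distance, theta=0.5):
    """True if a cell of side `cell_size` at `distance` may be
    approximated by a single point mass."""
    return cell_size / distance < theta
```

Smaller θ means more tree nodes are opened, giving higher accuracy at higher cost.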
3
Steps of the Algorithm
4
Global Optimizations
Multiple field arrays are used instead of one array of body objects
Bodies are allocated at the beginning of the arrays, cells at the end
Constant kernel parameters (e.g. array starting addresses) are written to the GPU's constant memory
No data is transferred between CPU & GPU except at the beginning and end
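The field-array layout (structure-of-arrays) can be sketched as below; the array names and sizes are illustrative, not the paper's actual variables:

```python
# Structure-of-arrays layout sketch: one array per field rather than one
# array of body objects, so consecutive GPU threads touching consecutive
# elements make coalesced memory accesses.
n_bodies, n_cells = 4, 2
n_nodes = n_bodies + n_cells

# Bodies occupy indices [0, n_bodies); cells are allocated from the end.
mass = [0.0] * n_nodes
pos_x = [0.0] * n_nodes
pos_y = [0.0] * n_nodes

first_cell = n_nodes - n_cells  # cells live in [first_cell, n_nodes)
```

Keeping bodies and cells in disjoint regions of the same arrays lets a kernel tell node kinds apart by index alone.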
5
Kernel 1
Finds the root of the octree (the bounding box of the space)
The data is split into equal chunks, one per block
Each block finds the min and max along each dimension and writes them to main memory
The last block to finish combines these partial results and calculates the bounding box of all bodies
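The per-block reduction can be sketched sequentially, one dimension at a time (the chunking here stands in for the GPU blocks; function names are illustrative):

```python
# Sequential sketch of Kernel 1 along one dimension: each "block"
# reduces its chunk of positions to a (min, max) pair, then a final
# step combines the partial results into the global bounds.

def chunk_minmax(xs, chunk_size):
    partials = []
    for i in range(0, len(xs), chunk_size):
        chunk = xs[i:i + chunk_size]
        partials.append((min(chunk), max(chunk)))
    return partials

def combine(partials):
    return (min(lo for lo, _ in partials),
            max(hi for _, hi in partials))
```

Running the same reduction over x, y, and z yields the bounding box that becomes the root cell of the octree.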
6
Kernel 2
Constructs the Barnes Hut tree
Round-robin assignment of bodies to blocks & threads
For each body, the tree is traversed until a null or body pointer is found, and that pointer is locked
If null, the body is inserted there
If a body is already there, a new cell is created containing both that body and the new one, and the cell is inserted in its place
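The insertion logic can be sketched single-threaded with a 2-D quadtree (the paper uses a 3-D octree, and the lightweight locking is omitted here since there is no concurrency in this sketch; all names are illustrative):

```python
# Sequential sketch of Kernel 2's insertion: descend until a null slot
# or a body is hit. A null slot is claimed directly; a body triggers a
# split, with new cells created until the two bodies separate.

class Node:
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # cell center, half-width
        self.children = [None] * 4                  # None plays the null pointer

    def quadrant(self, x, y):
        return (1 if x >= self.cx else 0) + (2 if y >= self.cy else 0)

    def child_center(self, q):
        h = self.half / 2
        return (self.cx + (h if q & 1 else -h),
                self.cy + (h if q & 2 else -h), h)

def insert(root, body):  # body is an (x, y) tuple
    node, q = root, root.quadrant(*body)
    while True:
        slot = node.children[q]
        if slot is None:                        # null pointer: claim it
            node.children[q] = body
            return
        if isinstance(slot, tuple):             # a body is already there:
            cell = Node(*node.child_center(q))  # split with a new cell
            node.children[q] = cell
            cell.children[cell.quadrant(*slot)] = slot
            node, q = cell, cell.quadrant(*body)  # may collide again; loop
        else:                                   # an inner cell: descend
            node, q = slot, slot.quadrant(*body)
```

Note that two bodies at identical coordinates would split forever; the real implementation has to guard against that case.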
7
Kernel 3
Fills in the center of mass & total mass of each cell node
In Kernel 2, cells were allocated from the end of the array, so stepping forward through the array visits children before their parents
It also accelerates later kernels by:
Counting the bodies in each subtree and storing the count in that subtree's root cell
Moving null child pointers to the end of each cell's child array
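Because children always precede their parents in the cell region, one forward sweep suffices to compute every cell's summary bottom-up. A 1-D sketch (array names and the `children` map are illustrative, not the paper's layout):

```python
# Sequential sketch of Kernel 3: sweep the cell region of the arrays in
# index order. Since cells were allocated downward from the end, each
# cell's children are already summarized when the cell is reached.

def summarize(mass, pos, children, first_cell):
    """children[c] lists the node indices (bodies or cells) under cell c."""
    for c in range(first_cell, len(mass)):  # children before parents
        m = sum(mass[k] for k in children[c])
        x = sum(mass[k] * pos[k] for k in children[c]) / m
        mass[c], pos[c] = m, x              # total mass and center of mass
```

The same sweep is where the body counts and compacted child pointers mentioned above would be written.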
8
Kernel 4
A top-down traversal places the bodies into an array in the order of an in-order tree traversal
This puts spatially close bodies near each other in the array, which speeds up Kernel 5
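The effect of this sort can be sketched with a recursive traversal that writes body indices into an output array (the nested-list tree encoding is an assumption for the sketch):

```python
# Sketch of Kernel 4's result: a depth-first traversal of the tree
# appends body indices in tree order, so bodies in the same subtree
# (i.e. spatially close bodies) end up adjacent in memory.

def sort_bodies(node, out):
    if node is None:
        return
    if isinstance(node, int):   # leaf: a body index
        out.append(node)
        return
    for child in node:          # inner cell: a list of children
        sort_bodies(child, out)
```

Adjacent bodies then tend to traverse nearly the same tree nodes in Kernel 5, which is what shrinks each warp's combined traversal.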
9
Kernel 5
Force calculations
Each thread traverses its portion of the tree, gathering the cells & bodies needed to calculate the forces on its bodies
Each CUDA warp (a group of threads that execute in lockstep) must traverse the union of all its threads' relevant portions
The sorting in Kernel 4 minimizes this union, improving Kernel 5's performance by an order of magnitude
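The per-body traversal can be sketched as a recursion that either takes a cell's far-field contribution or opens the cell, reusing the opening criterion from earlier. Gravity is simplified to 1-D magnitudes with unit constants to keep the sketch short (node layout and names are assumptions):

```python
# Sequential sketch of Kernel 5's traversal for one body: a node that is
# a body, or a cell that passes the opening criterion, contributes as a
# point mass; otherwise its children are visited.

def force_on(body_x, node, theta=0.5):
    # node = (mass, x, size, children); children is None for a body
    m, x, size, children = node
    d = abs(x - body_x)
    if children is None or (d > 0 and size / d < theta):
        return 0.0 if d == 0 else m / (d * d)  # point-mass contribution
    return sum(force_on(body_x, c, theta) for c in children)
```

In the real kernel this recursion is flattened into an iterative loop, since deep recursion is a poor fit for GPU threads.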
10
Kernel 5 (cont.)
Memory accesses:
Older GPU hardware doesn't cache multiple reads of the same main-memory location
So all 32 threads in a warp might issue 32 separate reads of the same data
Solution: have one thread read the data and store it in shared memory for the rest of the warp
11
Kernel 6
Updates the bodies' positions and velocities based on the forces calculated in Kernel 5
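The update step itself is a simple integration over all bodies; a minimal sketch (the time step `dt` and the Euler-style scheme are assumptions, since the slide doesn't show the paper's integrator):

```python
# Sketch of Kernel 6: advance each body's velocity by its acceleration
# and its position by the new velocity, for one time step dt.

def update(pos, vel, acc, dt):
    for i in range(len(pos)):
        vel[i] += acc[i] * dt
        pos[i] += vel[i] * dt
```

On the GPU this loop is trivially parallel: each thread updates its own bodies with no communication, making Kernel 6 the simplest kernel of the six.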
12
Optimization Principles
Maximize parallelism & load balance
Minimize thread divergence
Minimize accesses to main memory
Use lightweight locking
Combine operations
Maximize memory coalescing
Avoid CPU/GPU transfers
13
Results
14
References
M. Burtscher and K. Pingali. "An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm." Chapter 6 in GPU Computing Gems Emerald Edition, pp. 75-92, January 2011.