1
An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm
Paper by Martin Burtscher and Keshav Pingali
Presented by Jason Wengert
2
Barnes Hut Algorithm
O(n log n)
The space is recursively divided into cells
Groups of distant bodies in the same cell can be treated as a single body in the force calculation
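The "treat distant groups as one body" rule is usually expressed as an opening criterion: a cell of side s at distance d is approximated by its center of mass when s/d falls below a threshold θ. A minimal sketch (θ and the function name are illustrative assumptions, not from the paper):

```python
# Barnes-Hut opening criterion: a cell far enough away (relative to its
# size) is treated as a single point mass at its center of mass.
# theta is a tunable accuracy parameter; ~0.5 is a common choice.

def can_approximate(cell_size, distance, theta=0.5):
    """True if a cell of side `cell_size` at `distance` may be
    approximated by a single point mass."""
    return cell_size / distance < theta
```

Smaller θ means more tree nodes are opened, giving higher accuracy at higher cost.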
3
Steps of the Algorithm
4
Global Optimizations
Multiple field arrays are used instead of one array of body objects
Bodies are allocated at the beginning of the arrays, cells at the end
Constant kernel parameters (e.g. array starting addresses) are written to the GPU's constant memory
No data is transferred between CPU & GPU except at the beginning and end
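The field-array layout (structure-of-arrays) can be sketched as below; the array names and sizes are illustrative, not the paper's actual variables:

```python
# Structure-of-arrays layout sketch: one array per field rather than one
# array of body objects, so consecutive GPU threads touching consecutive
# elements make coalesced memory accesses.
n_bodies, n_cells = 4, 2
n_nodes = n_bodies + n_cells

# Bodies occupy indices [0, n_bodies); cells are allocated from the end.
mass = [0.0] * n_nodes
pos_x = [0.0] * n_nodes
pos_y = [0.0] * n_nodes

first_cell = n_nodes - n_cells  # cells live in [first_cell, n_nodes)
```

Keeping bodies and cells in disjoint regions of the same arrays lets a kernel tell node kinds apart by index alone.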
5
Kernel 1
Finds the root of the octree (the bounding box of the space)
The data is split into equal chunks, one per block
Each block finds the min and max along each dimension and writes them to main memory
The last block to finish combines these partial results and calculates the bounding box of all bodies
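The per-block reduction can be sketched sequentially, one dimension at a time (the chunking here stands in for the GPU blocks; function names are illustrative):

```python
# Sequential sketch of Kernel 1 along one dimension: each "block"
# reduces its chunk of positions to a (min, max) pair, then a final
# step combines the partial results into the global bounds.

def chunk_minmax(xs, chunk_size):
    partials = []
    for i in range(0, len(xs), chunk_size):
        chunk = xs[i:i + chunk_size]
        partials.append((min(chunk), max(chunk)))
    return partials

def combine(partials):
    return (min(lo for lo, _ in partials),
            max(hi for _, hi in partials))
```

Running the same reduction over x, y, and z yields the bounding box that becomes the root cell of the octree.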
6
Kernel 2
Constructs the Barnes Hut tree
Round-robin assignment of bodies to blocks & threads
For each body, the tree is traversed until a null or body pointer is found, and that pointer is locked
If null, the body is inserted there
If a body is already there, a new cell is created containing both that body and the new one, and the cell is inserted in its place
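The insertion logic can be sketched single-threaded with a 2-D quadtree (the paper uses a 3-D octree, and the lightweight locking is omitted here since there is no concurrency in this sketch; all names are illustrative):

```python
# Sequential sketch of Kernel 2's insertion: descend until a null slot
# or a body is hit. A null slot is claimed directly; a body triggers a
# split, with new cells created until the two bodies separate.

class Node:
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # cell center, half-width
        self.children = [None] * 4                  # None plays the null pointer

    def quadrant(self, x, y):
        return (1 if x >= self.cx else 0) + (2 if y >= self.cy else 0)

    def child_center(self, q):
        h = self.half / 2
        return (self.cx + (h if q & 1 else -h),
                self.cy + (h if q & 2 else -h), h)

def insert(root, body):  # body is an (x, y) tuple
    node, q = root, root.quadrant(*body)
    while True:
        slot = node.children[q]
        if slot is None:                        # null pointer: claim it
            node.children[q] = body
            return
        if isinstance(slot, tuple):             # a body is already there:
            cell = Node(*node.child_center(q))  # split with a new cell
            node.children[q] = cell
            cell.children[cell.quadrant(*slot)] = slot
            node, q = cell, cell.quadrant(*body)  # may collide again; loop
        else:                                   # an inner cell: descend
            node, q = slot, slot.quadrant(*body)
```

Note that two bodies at identical coordinates would split forever; the real implementation has to guard against that case.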
7
Kernel 3
Fills in the center of mass & total mass of each cell node
In Kernel 2, cells were allocated from the end of the array, so stepping forward through the array visits children before their parents
It also accelerates later kernels by:
Counting the bodies in each subtree and storing the count in that subtree's root cell
Moving null child pointers to the end of each cell's child array
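Because children always precede their parents in the cell region, one forward sweep suffices to compute every cell's summary bottom-up. A 1-D sketch (array names and the `children` map are illustrative, not the paper's layout):

```python
# Sequential sketch of Kernel 3: sweep the cell region of the arrays in
# index order. Since cells were allocated downward from the end, each
# cell's children are already summarized when the cell is reached.

def summarize(mass, pos, children, first_cell):
    """children[c] lists the node indices (bodies or cells) under cell c."""
    for c in range(first_cell, len(mass)):  # children before parents
        m = sum(mass[k] for k in children[c])
        x = sum(mass[k] * pos[k] for k in children[c]) / m
        mass[c], pos[c] = m, x              # total mass and center of mass
```

The same sweep is where the body counts and compacted child pointers mentioned above would be written.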
8
Kernel 4
A top-down traversal places the bodies into an array in the order of an in-order tree traversal
This puts spatially close bodies near each other in the array, which speeds up Kernel 5
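The effect of this sort can be sketched with a recursive traversal that writes body indices into an output array (the nested-list tree encoding is an assumption for the sketch):

```python
# Sketch of Kernel 4's result: a depth-first traversal of the tree
# appends body indices in tree order, so bodies in the same subtree
# (i.e. spatially close bodies) end up adjacent in memory.

def sort_bodies(node, out):
    if node is None:
        return
    if isinstance(node, int):   # leaf: a body index
        out.append(node)
        return
    for child in node:          # inner cell: a list of children
        sort_bodies(child, out)
```

Adjacent bodies then tend to traverse nearly the same tree nodes in Kernel 5, which is what shrinks each warp's combined traversal.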
9
Kernel 5
Force calculations
Each thread traverses its portion of the tree, gathering the cells & bodies needed to calculate the forces on its bodies
Each CUDA warp (a group of threads that execute in lockstep) must traverse the union of all its threads' relevant portions
The sorting in Kernel 4 minimizes this union, improving Kernel 5's performance by an order of magnitude
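The per-body traversal can be sketched as a recursion that either takes a cell's far-field contribution or opens the cell, reusing the opening criterion from earlier. Gravity is simplified to 1-D magnitudes with unit constants to keep the sketch short (node layout and names are assumptions):

```python
# Sequential sketch of Kernel 5's traversal for one body: a node that is
# a body, or a cell that passes the opening criterion, contributes as a
# point mass; otherwise its children are visited.

def force_on(body_x, node, theta=0.5):
    # node = (mass, x, size, children); children is None for a body
    m, x, size, children = node
    d = abs(x - body_x)
    if children is None or (d > 0 and size / d < theta):
        return 0.0 if d == 0 else m / (d * d)  # point-mass contribution
    return sum(force_on(body_x, c, theta) for c in children)
```

In the real kernel this recursion is flattened into an iterative loop, since deep recursion is a poor fit for GPU threads.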
10
Kernel 5 (cont.)
Memory accesses:
Older GPU hardware doesn't cache multiple reads of the same main-memory location
So all 32 threads in a warp might issue 32 separate reads of the same data
Solution: have one thread read the data and store it in shared memory for the rest of the warp
11
Kernel 6
Updates the bodies' positions and velocities based on the forces calculated in Kernel 5
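The update step itself is a simple integration over all bodies; a minimal sketch (the time step `dt` and the Euler-style scheme are assumptions, since the slide doesn't show the paper's integrator):

```python
# Sketch of Kernel 6: advance each body's velocity by its acceleration
# and its position by the new velocity, for one time step dt.

def update(pos, vel, acc, dt):
    for i in range(len(pos)):
        vel[i] += acc[i] * dt
        pos[i] += vel[i] * dt
```

On the GPU this loop is trivially parallel: each thread updates its own bodies with no communication, making Kernel 6 the simplest kernel of the six.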
12
Optimization Principles
Maximize parallelism & load balance
Minimize thread divergence
Minimize accesses to main memory
Use lightweight locking
Combine operations
Maximize memory coalescing
Avoid CPU/GPU transfers
13
Results
14
References
M. Burtscher and K. Pingali. "An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm." Chapter 6 in GPU Computing Gems Emerald Edition, pp. 75-92, January 2011.