An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm, by Martin Burtscher and Keshav Pingali. Presented by Jason Wengert.


Barnes Hut Algorithm: O(n log n). The space is recursively divided into cells. Groups of distant bodies in the same cell can be treated as a single body in the force calculation.
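The "distant bodies as one" test can be sketched as a single predicate; the function name and the default theta are illustrative, not taken from the chapter:

```cpp
// Opening criterion that makes Barnes Hut O(n log n): a cell of side
// length cell_size at distance `distance` from a body may be approximated
// by its center of mass when cell_size / distance < theta. The threshold
// theta (often around 0.5) trades accuracy for speed; theta = 0 degenerates
// to the exact O(n^2) all-pairs computation.
bool can_approximate(double cell_size, double distance, double theta = 0.5) {
    return cell_size < theta * distance;
}
```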

Steps of the Algorithm

Global Optimizations: Multiple field arrays instead of one object array. Bodies are allocated to the beginning of the array, cells to the end. Constant kernel parameters (e.g. array starting addresses) are written to the GPU's constant memory. No data is transferred between CPU and GPU except at the beginning and end.
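The "multiple field arrays" layout (structure of arrays) can be sketched as below; all names here are illustrative, not the authors' code:

```cpp
#include <vector>

// Structure-of-arrays layout: each field lives in its own array, so
// consecutive threads reading, say, posx[i] touch consecutive memory,
// which the GPU can coalesce into few transactions.
struct NodesSoA {
    std::vector<float> posx, posy, posz, mass;
    // Bodies occupy the low indices; cells are allocated downward from
    // the end, so one index space covers both kinds of node.
};
```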

Kernel 1: Find the root of the octree (the cell covering the whole space). The data is split into equal chunks and assigned to the blocks. Each block finds the min and max along each dimension and writes them to main memory. The last block combines the results and calculates the bounding box of all bodies.
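A sequential sketch of what kernel 1 computes (on the GPU the loop is split into per-block chunks whose partial results the last block combines); names and layout are illustrative:

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Axis-aligned bounding box of all bodies; its cube becomes the root cell.
struct Box { std::array<float, 3> lo, hi; };

Box bounding_box(const std::vector<std::array<float, 3>>& pos) {
    Box b{pos[0], pos[0]};
    for (const auto& p : pos)
        for (int d = 0; d < 3; ++d) {
            b.lo[d] = std::min(b.lo[d], p[d]);
            b.hi[d] = std::max(b.hi[d], p[d]);
        }
    return b;
}
```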

Kernel 2: Constructs the BH tree. Bodies are assigned to blocks and threads round-robin. For each body, the tree is traversed until a null or body pointer is found, and that pointer is locked. If it is null, the body is inserted there. If a body is already there, a new cell containing both that body and the new one is created and inserted in its place.
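A sequential sketch of the insertion logic (the GPU kernel adds per-pointer locking so many threads can insert concurrently). This is an illustration under assumed names and does not handle coincident bodies:

```cpp
#include <array>
#include <memory>
#include <utility>

struct Node {
    bool is_body = false;
    std::array<float, 3> pos{};
    std::array<std::unique_ptr<Node>, 8> child;  // used only by cells
};

// Which of the 8 octants of a cell centered at `center` contains p?
int octant(const std::array<float, 3>& p, const std::array<float, 3>& center) {
    return (p[0] > center[0]) | ((p[1] > center[1]) << 1) | ((p[2] > center[2]) << 2);
}

// Traverse until a null or body pointer is reached: claim a null slot,
// or split a body slot into a cell that holds both the old and new body.
void insert(std::unique_ptr<Node>& slot, std::unique_ptr<Node> body,
            std::array<float, 3> center, float half) {
    if (!slot) { slot = std::move(body); return; }   // null pointer: insert here
    if (slot->is_body) {                             // body pointer: split
        auto old = std::move(slot);
        slot = std::make_unique<Node>();             // new cell takes the slot
        insert(slot, std::move(old), center, half);  // re-insert displaced body
    }
    int o = octant(body->pos, center);               // descend toward the new body
    for (int d = 0; d < 3; ++d)
        center[d] += ((o >> d) & 1) ? half / 2 : -half / 2;
    insert(slot->child[o], std::move(body), center, half / 2);
}
```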

Kernel 3: Fills in the center of mass (CoM) and mass of each cell node. In K2, cells were allocated from the end of the array, so stepping through the array forward always visits children before their parents. It also accelerates later kernels by: counting the bodies in each subtree and storing the count in that subtree's root cell; moving null child pointers to the end of each cell's child array.
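The child-before-parent sweep can be sketched sequentially over an array layout like the following; names, the 1-D positions, and the -1-as-null convention are assumptions of this sketch:

```cpp
#include <array>
#include <vector>

// Per-node arrays; bodies occupy the low indices, cells the high ones.
// A child entry of -1 means "null"; bodies have no children.
struct Nodes {
    std::vector<float> mass, posx;            // 1-D positions for brevity
    std::vector<std::array<int, 8>> child;
};

// Because cells were allocated downward from the end of the arrays,
// every cell's children have lower indices than the cell itself, so one
// forward sweep over the cells sees children before parents.
void summarize(Nodes& n, int first_cell) {
    for (int i = first_cell; i < (int)n.mass.size(); ++i) {
        float m = 0, weighted = 0;
        for (int c : n.child[i])
            if (c >= 0) { m += n.mass[c]; weighted += n.mass[c] * n.posx[c]; }
        n.mass[i] = m;
        n.posx[i] = weighted / m;             // center of mass of the subtree
    }
}
```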

Kernel 4: A top-down traversal that places the bodies into an array in the order they would appear in an in-order tree traversal. This places spatially close bodies near each other in the array, which speeds up kernel 5.
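The effect of the sort can be sketched as a depth-first walk that emits body indices in tree order (the GPU version computes destinations top-down in parallel rather than recursively). The conventions here — indices below nbodies are bodies, -1 means null — are assumptions of this sketch:

```cpp
#include <array>
#include <vector>

// Emit body indices in depth-first tree order. Nearby bodies share deep
// subtrees, so they come out adjacent in `order`.
void collect(const std::vector<std::array<int, 8>>& child, int node,
             int nbodies, std::vector<int>& order) {
    if (node < nbodies) { order.push_back(node); return; }  // reached a body
    for (int c : child[node])
        if (c >= 0) collect(child, c, nbodies, order);
}
```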

Kernel 5: Force calculations. Each thread traverses some portion of the tree to get the cells and bodies it needs for its force calculation. Each CUDA warp (a group of threads that execute together in lockstep) has to traverse the union of all its threads' relevant portions. The sorting in K4 minimizes this union, improving K5 by an order of magnitude.
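The innermost interaction each traversal step performs can be sketched as follows (1-D, G = 1, with a softening term; this is an illustration, not the chapter's code):

```cpp
#include <cmath>

// Acceleration on a body at px due to a node of mass node_mass at node_x.
// The node is either another body or a cell far enough away to be treated
// as its center of mass. The softening term eps keeps very close pairs
// from producing huge forces.
double accel_1d(double px, double node_x, double node_mass, double eps) {
    double dx = node_x - px;
    double r2 = dx * dx + eps * eps;
    return node_mass * dx / (r2 * std::sqrt(r2));  // G = 1 units
}
```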

Kernel 5 (cont.): Memory accesses. Older GPU hardware does not optimize multiple reads of the same memory location, so all 32 threads in a warp might make 32 separate accesses to fetch the same data. Solution: have one thread read the data and store it in a cache in shared memory, where the rest of the warp can access it.

Kernel 6: Updates the bodies' positions and velocities based on the forces calculated in K5.
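Kernel 6's update can be sketched sequentially; the step size and the exact integration scheme here are assumptions:

```cpp
#include <cstddef>
#include <vector>

// Advance velocities from accelerations, then positions from velocities
// (a simple semi-implicit Euler / leapfrog-style step).
void integrate(std::vector<float>& pos, std::vector<float>& vel,
               const std::vector<float>& acc, float dt) {
    for (std::size_t i = 0; i < pos.size(); ++i) {
        vel[i] += acc[i] * dt;
        pos[i] += vel[i] * dt;
    }
}
```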

Optimization Principles: Maximize parallelism and load balance. Minimize thread divergence. Minimize main memory accesses. Use lightweight locking. Combine operations. Maximize coalescing. Avoid CPU/GPU transfers.

Results

References: M. Burtscher and K. Pingali. An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm. Chapter 6 in GPU Computing Gems Emerald Edition, January 2011.