L15: Tree-Structured Algorithms on GPUs
CS6963, L15: Tree Algorithms


Administrative
- STRSM due March 17 (EXTENDED)
- Midterm coming
  - In class April 4, open notes
  - Review notes, readings and review lecture (before break)
  - Will post prior exams
- Design Review
  - Intermediate assessment of progress on project, oral and short
  - Tentatively April 11 and 13
- Final projects
  - Poster session, April 27 (dry run April 25)
  - Final report, May 4

Outline
- Mapping trees to data-parallel architectures
- Sources:
  - Parallel scan from Lin and Snyder, Principles of Parallel Programming
  - “An Effective GPU Implementation of Breadth-First Search,” Lijuan Luo, Martin Wong and Wen-mei Hwu, DAC '10, June
  - “Inter-block GPU communication via fast barrier synchronization,” S. Xiao and W. Feng, ?2009 Va. Tech TR?
  - “Stackless KD-Tree Traversal for High Performance GPU Ray Tracing,” S. Popov, J. Gunther, H-P Seidel, P. Slusallek, Eurographics 2007, 26(3)

Mapping Challenge
From this: [figure not preserved]
To this: [figure not preserved]

Simple Example
Parallel Prefix Sum: Compute every partial sum of A[0],…,A[n-1]
Standard way to express it:
    sum = 0;
    for (i=0; i<n; i++) {
        sum += A[i];
        y[i] = sum;   /* y[i] = A[0] + ... + A[i] */
    }
Semantics require: (…((sum+A[0])+A[1])+…)+A[n-1]
That is, sequential
Can it be executed in parallel?

Graphical Depiction of Sum Code
[Figure panels: Original Order, Pairwise Order]
Which decomposition is better suited for parallel execution?
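The pairwise order can be sketched on the host as follows. This is a minimal illustration, not code from the lecture: the name pairwise_sum is hypothetical, and the regrouping is only exact when addition is associative (integers here; floating-point results may differ slightly from the sequential order). Every iteration of the inner loop touches disjoint elements, so one step of the outer loop could run as one parallel step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum the elements in pairwise (tree) order instead of left-to-right.
// After the step with a given stride, v[i] holds the sum of the block of
// 2*stride elements starting at i (for every i that is a multiple of 2*stride).
long pairwise_sum(std::vector<long> v) {
    assert(!v.empty());
    for (std::size_t stride = 1; stride < v.size(); stride *= 2) {
        // Iterations of this loop are independent of each other.
        for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride) {
            v[i] += v[i + stride];
        }
    }
    return v[0];
}
```

The outer loop runs about log2(n) times, which is the depth of the pairwise tree, whereas the original order needs n dependent additions.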

Parallelization Strategy
1. Map tree to flat array
2. Two passes through tree:
   a. Bottom up: Compute sum for each subtree and propagate all the way up to root node
   b. Top down: Each non-leaf node receives a value from its parent for the sum up to the start of its subtree. It passes that value unchanged to the left child and sends the right child the parent's value plus the left child's sum computed on the bottom-up pass. Leaves add the prefix value from parent and saved value to compute final result.
3. Solution can be found on my website from CS4961 last fall
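The two passes can be sketched on the host as below. This is an illustrative reading of the strategy, not the posted CS4961 solution: the function name tree_prefix_sum, the heap-style array layout (root at index 1, children of node i at 2i and 2i+1, leaves at n..2n-1), and the power-of-two restriction on n are all assumptions made to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum via a binary tree stored in a flat array.
// Assumes a.size() is a power of two (a simplification for this sketch).
std::vector<long> tree_prefix_sum(const std::vector<long>& a) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    std::vector<long> sum(2 * n, 0);        // subtree sums
    std::vector<long> from_left(2 * n, 0);  // sum of everything left of the subtree

    for (std::size_t i = 0; i < n; ++i) sum[n + i] = a[i];

    // Bottom up: one tree level per outer iteration; nodes within a level
    // are independent, so each level could be one parallel step.
    for (std::size_t width = n / 2; width >= 1; width /= 2) {
        for (std::size_t i = width; i < 2 * width; ++i) {
            sum[i] = sum[2 * i] + sum[2 * i + 1];
        }
    }

    // Top down: the root receives 0; each node forwards its value to the left
    // child and its value plus the left child's subtree sum to the right child.
    from_left[1] = 0;
    for (std::size_t width = 1; width < n; width *= 2) {
        for (std::size_t i = width; i < 2 * width; ++i) {
            from_left[2 * i] = from_left[i];
            from_left[2 * i + 1] = from_left[i] + sum[2 * i];
        }
    }

    // Leaves add their own saved value to the prefix received from above.
    std::vector<long> y(n);
    for (std::size_t i = 0; i < n; ++i) y[i] = from_left[n + i] + a[i];
    return y;
}
```

For the input 3, 1, 7, 0, 4, 1, 6, 3 this produces 3, 4, 11, 11, 15, 16, 22, 25, the same values as the sequential loop, in about 2·log2(n) level steps.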

Solution (Figure 1.4 from Lin and Snyder) Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

Breadth First Search
Definition: Beginning at a node s in a connected component, explore all the neighboring nodes. Then for each of the neighbors, explore their unexplored neighbors until all nodes have been visited. The result of this search is the set of nodes reachable from s.
Input: G=(V,E) and distinguished vertex s
Output: a breadth first spanning tree with root s that contains all reachable vertices
Key data structure: a frontier queue of nodes that have been visited at the current level of the tree
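A sequential, level-by-level version of this definition might look like the sketch below. The CsrGraph type and the bfs_tree name are hypothetical and chosen only for illustration; the compressed-row adjacency layout is an assumption that suits sparse graphs. The spanning tree is returned as a parent array, with -1 marking vertices not reachable from s.

```cpp
#include <cassert>
#include <vector>

// Sparse graph in compressed-row form: the neighbors of vertex u are
// neighbors[row_start[u]] .. neighbors[row_start[u + 1] - 1].
struct CsrGraph {
    std::vector<int> row_start;  // size |V| + 1
    std::vector<int> neighbors;  // size |E| (each undirected edge stored twice)
};

// Returns parent[v] in the breadth-first spanning tree rooted at s
// (parent[s] == s, and parent[v] == -1 if v is unreachable).
std::vector<int> bfs_tree(const CsrGraph& g, int s) {
    const int n = static_cast<int>(g.row_start.size()) - 1;
    assert(s >= 0 && s < n);

    std::vector<int> parent(n, -1);
    parent[s] = s;

    std::vector<int> frontier{s};  // vertices visited at the current level
    while (!frontier.empty()) {
        std::vector<int> next;     // frontier for the following level
        for (int u : frontier) {
            for (int e = g.row_start[u]; e < g.row_start[u + 1]; ++e) {
                const int v = g.neighbors[e];
                if (parent[v] == -1) {  // first visit: u becomes v's parent
                    parent[v] = u;
                    next.push_back(v);
                }
            }
        }
        frontier.swap(next);
    }
    return parent;
}
```

The loop over the frontier is where a parallel version finds its work: each frontier vertex can be expanded independently, provided the "first visit" check and the append to the next frontier are made safe against concurrent updates.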

GPU Challenges
- Very little work at the root node. Much more work as the algorithm progresses to leaves.
- Managing global frontier queue can lead to high overhead.
- Sparse matrix implementation (L multiplications corresponding to L levels) can be slower than sequential CPU algorithms for large graphs.
- Assumption for this paper: graph is sparse

Overview of Strategy
- Parallelism comes from propagation from all frontier vertices in parallel.
- With sparse graph, searching each neighbor of a frontier vertex in parallel will not have as much work associated with it.
- Vary the amount of GPU that is being used depending on threshold of profitability
  - A single warp as the baseline
  - A single block as the next level
  - The entire device as the outermost level

Overview of Data Decomposition
- Corresponding mapping of frontier queue to levels in implementation
- Hierarchical frontier queue leads to infrequent global synchronization.
- Split the frontier queue into levels according to position in the tree
  - Lowest level is for a single warp
  - Next level for per block shared memory
  - Outermost level is for a larger global memory structure.

A Few Details
- Warp level writes to W-Frontier atomically.
  - Add a single element to queue and update end of queue
  - Different warps write to different W-Frontier queues.
- A B-Frontier is the union of 8 W-Frontiers.
  - A single thread walks the W-Frontiers to derive indices of frontier nodes.
- A G-Frontier is shared across the device.
  - Copies from B-Frontier to G-Frontier are done atomically. (Can use coalescing.)
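The queue hierarchy can be modeled on the host with std::atomic standing in for GPU atomics. This is a rough sketch of one plausible reading of the details above, not the paper's implementation: the type and function names, the queue capacity of 64, and the choice to reserve one contiguous range of the G-Frontier per block with a single atomic add are all assumptions.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kWarpsPerBlock = 8;       // from the slide: 8 W-Frontiers per B-Frontier
constexpr int kWFrontierCapacity = 64;  // assumed capacity, not from the source

// Lowest level: one small queue per warp.
struct WFrontier {
    std::array<int, kWFrontierCapacity> items{};
    std::atomic<int> tail{0};

    // Atomically claim the next slot, then store the vertex there.
    void push(int vertex) {
        const int slot = tail.fetch_add(1);
        assert(slot < kWFrontierCapacity);
        items[slot] = vertex;
    }
};

// Block level: the union of the block's W-Frontiers.
struct BFrontier {
    std::array<WFrontier, kWarpsPerBlock> warps;
};

// Device level: one large queue shared by all blocks.
struct GFrontier {
    std::vector<int> items;
    std::atomic<int> tail{0};
    explicit GFrontier(std::size_t capacity) : items(capacity) {}
};

// One thread per block walks the W-Frontiers, reserves a contiguous range of
// the G-Frontier with a single atomic add, and copies the entries into it.
// Returns the number of vertices moved and empties the W-Frontiers.
int flush_to_global(BFrontier& block, GFrontier& global) {
    int total = 0;
    for (const WFrontier& w : block.warps) total += w.tail.load();

    const int base = global.tail.fetch_add(total);
    assert(static_cast<std::size_t>(base + total) <= global.items.size());

    int out = base;
    for (WFrontier& w : block.warps) {
        const int count = w.tail.exchange(0);
        for (int i = 0; i < count; ++i) global.items[out++] = w.items[i];
    }
    return total;
}
```

Reserving a whole range at once means each block performs one atomic operation on the shared tail rather than one per vertex, and the copied entries land in consecutive addresses, which is the access pattern coalescing favors.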

Global Synchronization Approach
- Global synchronization across blocks follows the approach in the Va. Tech TR (Xiao and Feng).
- Approach depends on:
  - Using the entire device, all SMs
  - Exact match of blocks to SMs
- All SMs must be included in “global barrier” to prevent deadlock.

Another Algorithm: K-Nearest Neighbor
- Ultimate goal: identify the k nearest neighbors to a distinguished point in space from a set of points
- Create an acceleration data structure, called a KD-tree, to represent the points in space.
- Given this tree, neighbors can be identified for any distinguished point
- One application of K-Nearest Neighbor is ray tracing

Constructing a KD-Tree
- Hierarchically partition a set of points (2D example)
Slide source: spring/TA/manuals/CGAL/ref-manual2/SearchStructures/Chapter_main.html
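One common way to do the hierarchical partition is to split at the median point, alternating between the x and y axes at each depth. The sketch below builds such a tree on the host and stores the nodes in a flat array with explicit child indices. It is only an illustration: the names Point2, KdNode, and build_kd_tree are hypothetical, the median-split rule is an assumption (the slide does not say which split rule it uses), and the construction is recursive host code rather than a GPU kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point2 { double x, y; };

// One node per point; children are indices into the node array (-1 if absent).
struct KdNode {
    Point2 point;
    int axis;   // 0: split on x, 1: split on y
    int left;
    int right;
};

// Builds the subtree for pts[lo, hi) and returns its root index, or -1 if empty.
int build_kd(std::vector<Point2>& pts, int lo, int hi, int depth,
             std::vector<KdNode>& nodes) {
    if (lo >= hi) return -1;
    const int axis = depth % 2;
    const int mid = lo + (hi - lo) / 2;

    // Partition so that pts[mid] is the median along the current axis.
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point2& a, const Point2& b) {
                         return axis == 0 ? a.x < b.x : a.y < b.y;
                     });

    const int id = static_cast<int>(nodes.size());
    nodes.push_back({pts[mid], axis, -1, -1});

    // Compute child indices first: the recursive calls may reallocate `nodes`.
    const int left = build_kd(pts, lo, mid, depth + 1, nodes);
    const int right = build_kd(pts, mid + 1, hi, depth + 1, nodes);
    nodes[id].left = left;
    nodes[id].right = right;
    return id;
}

// Returns the flattened tree; the root is node 0 when the input is non-empty.
std::vector<KdNode> build_kd_tree(std::vector<Point2> pts) {
    std::vector<KdNode> nodes;
    nodes.reserve(pts.size());
    build_kd(pts, 0, static_cast<int>(pts.size()), 0, nodes);
    return nodes;
}
```

Because the child links are stored as integer indices rather than pointers, the resulting array can be copied to device memory unchanged.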

Representing the KD-Tree
- Like the parallel prefix sum, we flatten the tree data structure to represent KD-Tree in memory.
- Difference: Tree is not fully populated. So, cannot use linearized structure of parallel prefix sum.
- An auxiliary structure, called “ropes”, provides a link between neighboring cells.
- Called a “stackless” KD-tree traversal because it is not recursive.

Summary of Lecture
To map trees to data parallel architectures with little or no support for recursion:
- Flatten hierarchical data structures into dense vectors
- Use hierarchical storage corresponding to hierarchical algorithm decomposition to avoid costly global synchronization and minimize global memory accesses.
- Possibly vary amount of device participating in computation for different levels of the tree.