©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture 8: Dealing with Dynamic Data

Dynamic Data The data to be processed in each phase of computation must be dynamically determined and extracted from a bulk data structure –This is harder when the bulk data structure is not organized for massively parallel access, as with graphs Graph algorithms are popular examples of dealing with dynamic data –Widely used in EDA and large-scale optimization applications –We will use Breadth-First Search (BFS) as the running example

Main Challenges of Dynamic Data Input data need to be organized for locality, coalescing, and contention avoidance as they are extracted during execution The amount of work and level of parallelism often grow and shrink during execution –As more or less data is extracted during each phase –Hard to efficiently fit into one CUDA kernel configuration, which cannot be changed once launched –Different kernel strategies fit different data sizes ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Breadth-First Search (BFS) [Figure: an example graph with vertices r, s, t, u, v, w, x, y traversed level by level from source s; the frontier vertices of Levels 1–4 are highlighted and marked as visited once processed]

Sequential BFS Store the frontier in a queue Complexity: O(V+E) [Figure: the frontier queue across Levels 1–4; vertices are dequeued from the front while newly discovered neighbors are enqueued at the back]
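A minimal sketch of the sequential queue-based BFS described above, assuming the graph is stored in CSR form; the names (row_offsets, column_indices) are illustrative assumptions. Every vertex is enqueued and dequeued at most once, which is where the O(V+E) bound comes from.

```cuda
#include <vector>
#include <queue>

// Sequential BFS over a CSR graph: row_offsets[v]..row_offsets[v+1] indexes the
// neighbors of v in column_indices. Returns the BFS level of every vertex.
std::vector<int> bfs_sequential(int num_vertices,
                                const std::vector<int>& row_offsets,
                                const std::vector<int>& column_indices,
                                int source) {
  std::vector<int> level(num_vertices, -1);   // -1 means "not visited yet"
  std::queue<int> frontier;
  level[source] = 0;
  frontier.push(source);                      // enqueue the source

  while (!frontier.empty()) {
    int v = frontier.front();                 // dequeue the next frontier vertex
    frontier.pop();
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
      int nbr = column_indices[e];
      if (level[nbr] == -1) {                 // first time nbr is reached
        level[nbr] = level[v] + 1;
        frontier.push(nbr);                   // enqueue it for the next level
      }
    }
  }
  return level;
}
```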

Parallelism in BFS Parallel propagation within each level One GPU kernel launch per level [Figure: the example graph unrolled into Levels 1–4; each level's frontier is expanded in parallel by one kernel (e.g. Kernel 3, Kernel 4), with a global barrier between consecutive levels]
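To make the one-kernel-per-level scheme concrete, here is a hedged sketch (not the authors' code): each launch expands one frontier level, and the host relaunches the kernel once per level, so kernel termination itself acts as the global barrier. The CSR arrays, the next_size counter, and all other names are illustrative assumptions.

```cuda
// One launch per BFS level: threads expand the current frontier and build the
// next one with a global atomic counter (simple baseline, not yet optimized).
__global__ void bfs_level_kernel(const int* row_offsets, const int* column_indices,
                                 const int* frontier, int frontier_size,
                                 int* next_frontier, int* next_size,
                                 int* level, int current_level) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= frontier_size) return;
  int v = frontier[tid];                               // one thread per frontier vertex
  for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
    int nbr = column_indices[e];
    if (atomicCAS(&level[nbr], -1, current_level + 1) == -1) {
      int idx = atomicAdd(next_size, 1);               // claim a slot in the next frontier
      next_frontier[idx] = nbr;
    }
  }
}

// Host loop (sketch): swap frontier buffers and relaunch until the frontier is empty.
// for (int lvl = 0; frontier_size > 0; ++lvl) {
//   cudaMemset(d_next_size, 0, sizeof(int));
//   bfs_level_kernel<<<blocks, 256>>>(d_row, d_col, d_frontier, frontier_size,
//                                     d_next, d_next_size, d_level, lvl);
//   cudaMemcpy(&frontier_size, d_next_size, sizeof(int), cudaMemcpyDeviceToHost);
//   std::swap(d_frontier, d_next);
// }
```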

BFS in VLSI CAD Maze Routing [Figure: a routing grid showing blockages, a net, and its terminals]

BFS in VLSI CAD Logic Simulation/Timing Analysis

BFS in VLSI CAD In formal verification, for reachability analysis In clustering, for finding connected components In logic synthesis, and more

Potential Pitfall of Parallel Algorithms A greatly accelerated O(n^2) algorithm is still slower than an O(n log n) algorithm for large inputs. Always keep the fast sequential algorithm as the baseline. [Figure: running time vs. problem size, with the accelerated O(n^2) curve eventually overtaking the O(n log n) curve]

Node-Oriented Parallelization IIIT-BFS –P. Harish et al., “Accelerating large graph algorithms on the GPU using CUDA” –Each thread is dedicated to one node –Every thread examines its node's neighbors to determine whether the node becomes a frontier node in the next phase –Complexity O(VL+E), where L is the number of levels (compared with O(V+E)) –Slower than the sequential version for large graphs, especially sparsely connected ones
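For comparison, a minimal sketch of this node-oriented formulation, assuming a CSR graph and illustrative names (not the code from the cited paper): one thread is launched per vertex on every level, which is where the extra O(VL) term comes from.

```cuda
// Node-oriented propagation (sketch): one thread per vertex on *every* level,
// so most threads find their vertex is not in the frontier and do nothing.
__global__ void bfs_node_oriented(const int* row_offsets, const int* column_indices,
                                  bool* in_frontier, bool* in_next_frontier,
                                  bool* visited, bool* any_next, int num_vertices) {
  int v = blockIdx.x * blockDim.x + threadIdx.x;
  if (v >= num_vertices || !in_frontier[v]) return;    // O(V) threads launched per level
  in_frontier[v] = false;
  for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
    int nbr = column_indices[e];
    if (!visited[nbr]) {                               // benign race: duplicates are harmless
      visited[nbr] = true;
      in_next_frontier[nbr] = true;
      *any_next = true;                                // tells the host to launch another level
    }
  }
}
```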

Matrix-Based Parallelization Yangdong Deng et al., “Taming Irregular EDA Applications on GPUs” Propagation is done through matrix-vector multiplication –For sparsely connected graphs, the connectivity matrix is a sparse matrix Complexity O(V+EL), where L is the number of levels (compared with O(V+E)) –Slower than sequential for large graphs
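One way to picture the matrix-based formulation is a masked Boolean sparse matrix–vector product per level. The sketch below is an illustrative pull-style version with assumed names, not the cited implementation.

```cuda
// BFS level as a masked Boolean SpMV (sketch): next = A * frontier over the
// (OR, AND) semiring, with already-visited vertices masked out.
__global__ void bfs_spmv_level(const int* row_offsets, const int* column_indices,
                               const char* frontier, char* next_frontier,
                               const char* visited, int num_vertices) {
  int v = blockIdx.x * blockDim.x + threadIdx.x;       // one thread per matrix row
  if (v >= num_vertices || visited[v]) return;
  char reached = 0;
  for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
    reached |= frontier[column_indices[e]];            // OR-reduce over row v's neighbors
  next_frontier[v] = reached;                          // v joins the next frontier if any
}                                                      // neighbor is in the current one
```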

Need a More General Technique We need to efficiently handle most graph types, using more specialized formulations only as optimizations where appropriate Our approach: efficient queue-based parallel algorithms –Hierarchical, scalable queue implementation –Hierarchical kernel arrangement ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

An Initial Attempt Manage the queue structure on the GPU –Complexity: O(V+E) –Dequeue in parallel: each frontier node gets a thread –Enqueue in sequence: poor coalescing, poor scalability –Result: no speedup [Figure: frontier vertices dequeued in parallel, while newly discovered vertices are enqueued sequentially]

Parallel Insert-Compact Queues C. Lauterbach et al., “Fast BVH Construction on GPUs” Parallel enqueue followed by a compaction pass, which adds cost Not suitable for light-node problems [Figure: each thread writes its discovered vertices (or Φ placeholders) to fixed slots, and a compaction step removes the Φ entries before the next propagation]

Basic Ideas Each thread processes one or more frontier nodes Find the index of each new frontier node Build the queue of the next frontier hierarchically [Figure: local queues q1, q2, q3 are concatenated into a global queue; a node's global index = offset of its local queue (e.g. for q2, the number of nodes in q1) + its index within that local queue]
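A small sketch of that index computation, with assumed names: a node's global index is the offset of its local queue (the total size of the preceding local queues) plus its index within that queue.

```cuda
// Global index of a node sitting at position local_index inside local queue q:
// sum the sizes of the preceding local queues, then add the local index.
// (On the GPU this offset is typically reserved with one atomicAdd per queue.)
__host__ __device__ int global_queue_index(const int* local_queue_sizes,
                                           int q, int local_index) {
  int offset = 0;
  for (int i = 0; i < q; ++i)
    offset += local_queue_sizes[i];   // e.g. q2's nodes start right after all of q1
  return offset + local_index;
}
```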

Two-Level Hierarchy Block queue (b-queue) –Inserted into by all threads in a block –Resides in shared memory Global queue (g-queue) –Inserted into only when a block completes Problem: –Collisions on the b-queue: threads in the same block can cause heavy contention [Figure: per-block b-queues in shared memory feed into the single g-queue in global memory]
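A hedged sketch of the two-level scheme, with assumed names and sizes: threads insert into a shared-memory b-queue through an atomic counter, and the whole b-queue is flushed to the g-queue once per block. The single shared counter is exactly where the heavy contention noted above arises.

```cuda
#define B_QUEUE_CAPACITY 512   // assumed shared-memory queue size per block

__global__ void expand_with_block_queue(const int* row_offsets, const int* column_indices,
                                        const int* frontier, int frontier_size,
                                        int* level, int current_level,
                                        int* g_queue, int* g_queue_size) {
  __shared__ int b_queue[B_QUEUE_CAPACITY];   // block-level queue in shared memory
  __shared__ int b_count;                     // single counter -> heavy contention
  __shared__ int g_offset;
  if (threadIdx.x == 0) b_count = 0;
  __syncthreads();

  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < frontier_size) {
    int v = frontier[tid];
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
      int nbr = column_indices[e];
      if (atomicCAS(&level[nbr], -1, current_level + 1) == -1) {
        int idx = atomicAdd(&b_count, 1);     // every thread in the block hits this counter
        if (idx < B_QUEUE_CAPACITY) b_queue[idx] = nbr;   // overflow handling omitted
      }
    }
  }
  __syncthreads();

  // One reservation in the g-queue per block, then a coalesced copy of the b-queue.
  int n = b_count < B_QUEUE_CAPACITY ? b_count : B_QUEUE_CAPACITY;
  if (threadIdx.x == 0) g_offset = atomicAdd(g_queue_size, n);
  __syncthreads();
  for (int i = threadIdx.x; i < n; i += blockDim.x)
    g_queue[g_offset + i] = b_queue[i];
}
```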

Warp-Level Queue [Figure: thread-scheduling timeline showing how the threads of warps i and j are issued in groups of 8 onto the SPs of an SM] Divide threads into 8 groups (for GTX280) –8 is the number of SPs per SM in general –One queue is written by each SP in the SM Each group writes to one warp-level queue Atomic operations are still needed –But with a much lower level of contention
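A minimal sketch of the warp-level insertion, with assumed names and capacities: spreading threads over 8 sub-queues divides the contention on each atomic counter by roughly 8. The [slot][queue] layout anticipates the interleaved, bank-conflict-free layout described two slides below.

```cuda
#define NUM_W_QUEUES 8          // assumed: one queue per SP in an SM (GTX280)
#define W_QUEUE_CAPACITY 64     // assumed per-queue capacity in shared memory

// Caller is expected to declare, e.g.:
//   __shared__ int w_queues[W_QUEUE_CAPACITY][NUM_W_QUEUES];
//   __shared__ int w_counts[NUM_W_QUEUES];
__device__ void w_queue_push(int w_queues[][NUM_W_QUEUES],   // interleaved: slot-major, queue-minor
                             int* w_counts, int vertex) {
  int q = threadIdx.x % NUM_W_QUEUES;            // spread threads over the 8 queues
  int idx = atomicAdd(&w_counts[q], 1);          // still atomic, but ~8x less contention
  if (idx < W_QUEUE_CAPACITY)
    w_queues[idx][q] = vertex;                   // neighboring queues fall in different banks
}                                                // (overflow handling omitted in this sketch)
```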

Three-Level Hierarchy [Figure: w-queues feed into per-block b-queues, which feed into the single g-queue]

Hierarchical Queue Management Shared memory: –Interleaved queue layout (W-queues[][8], i.e. W-queue0 … W-queue7 interleaved across banks), no bank conflicts Global memory: –Coalescing when releasing a b-queue to the g-queue –Moderate contention across blocks Texture memory: –Stores the graph structure (random access, no coalescing) –The Fermi cache may help

Hierarchical Queue Management Advantage and limitation –The technique can be applied to any inherently sequential data structure –The w-queues and b-queues are limited by the capacity of shared memory. If we know the upper limit of the degree, we can adjust the number of threads per block accordingly.

Kernel Arrangement Creating global barriers through kernel termination requires frequent kernel launches, which adds too much overhead Solution: –Partially use GPU synchronization –A three-layer kernel arrangement [Figure: the example graph with one kernel call per BFS level]

Hierarchical Kernel Arrangement Customize kernels based on the size of the frontier; use GPU synchronization when the frontier is small Kernel 1: intra-block synchronization Kernel 2: inter-block synchronization Kernel 3: kernel re-launch, one level of parallel propagation per launch

Kernel Arrangement Kernel 1: small frontiers –Launch only one block –Use the CUDA barrier function (__syncthreads) –Propagate through multiple levels within one launch –Saves global memory accesses
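A hedged sketch of what such a small-frontier kernel could look like, with assumed names and capacities: a single block keeps the frontier in shared memory and iterates over several levels inside one launch, using __syncthreads() as the level barrier.

```cuda
#define MAX_FRONTIER 512   // assumed: Kernel 1 runs only when the frontier fits in shared memory

// Kernel 1 (sketch): one block, several BFS levels per launch, __syncthreads() as the barrier.
__global__ void bfs_small_frontier(const int* row_offsets, const int* column_indices,
                                   int* level, const int* frontier_in, int frontier_size,
                                   int start_level) {
  __shared__ int frontier[MAX_FRONTIER];
  __shared__ int next[MAX_FRONTIER];
  __shared__ int count, next_count;

  if (threadIdx.x == 0) count = frontier_size;           // caller guarantees <= MAX_FRONTIER
  for (int i = threadIdx.x; i < frontier_size; i += blockDim.x)
    frontier[i] = frontier_in[i];                        // load the initial frontier once
  __syncthreads();

  for (int lvl = start_level; count > 0 && count <= MAX_FRONTIER; ++lvl) {
    if (threadIdx.x == 0) next_count = 0;
    __syncthreads();
    for (int i = threadIdx.x; i < count; i += blockDim.x) {
      int v = frontier[i];
      for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int nbr = column_indices[e];
        if (atomicCAS(&level[nbr], -1, lvl + 1) == -1) {
          int idx = atomicAdd(&next_count, 1);
          if (idx < MAX_FRONTIER) next[idx] = nbr;
        }
      }
    }
    __syncthreads();                                     // intra-block level barrier
    int n = next_count < MAX_FRONTIER ? next_count : MAX_FRONTIER;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
      frontier[i] = next[i];                             // copy the next frontier in place
    if (threadIdx.x == 0) count = next_count;            // > MAX_FRONTIER ends the loop
    __syncthreads();                                     // (hand-off to Kernel 2/3 omitted)
  }
}
```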

Kernel Arrangement Kernel 2: mid-sized frontiers –Launch m blocks (m = #SMs) –Each block is assigned to one SM and stays active –Use global memory to implement inter-block synchronization –Global synchronization across blocks is feasible in CUDA when there is only one resident block per SM –Propagate through multiple levels within one launch Kernel 3: large frontiers –Use kernel re-launch to implement synchronization –The kernel launch overhead is acceptable relative to the time needed to propagate a huge frontier
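For Kernel 2, here is a heavily simplified sketch of an inter-block barrier in the spirit of the fast barrier synchronization cited in the references (Xiao and Feng). It is only safe when every launched block is resident on its own SM, as Kernel 2 assumes, and all names are assumptions rather than the authors' implementation.

```cuda
// Simple inter-block barrier (sketch). arrived is a global counter that only grows;
// goal is the number of barriers this block has reached so far (1, 2, 3, ...), so
// the counter never needs to be reset between barriers.
__device__ void global_barrier(volatile int* arrived, int goal) {
  __syncthreads();
  __threadfence();                               // make this block's prior writes visible
  if (threadIdx.x == 0) {
    atomicAdd((int*)arrived, 1);                 // announce this block's arrival
    while (*arrived < goal * gridDim.x) { }      // spin until all blocks reach barrier `goal`
  }
  __syncthreads();                               // release the remaining threads in the block
}

// Kernel 2 (sketch): m = #SM cooperating blocks process several levels in one launch,
// calling global_barrier(arrived, ++barrier_count) between levels instead of relaunching.
```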

Kernel Arrangement for GTX280 Kernel 1: intra-block sync., frontier ≤ 512 nodes Kernel 2: inter-block sync., frontier ≤ 30*512 nodes Kernel 3: kernel-termination sync., frontier > 30*512 nodes Assumptions: #SMs = 30, #threads/block = 512

Experimental Setup CPU implementation –The classical BFS algorithm (O(V+E)) –Dual-socket, dual-core 2.4 GHz Opteron processor with 8 GB memory GPU: NVIDIA GeForce GTX280 Benchmarks –Near-regular graphs (degree = 6) –Real-world graphs (avg. degree = 2, max degree = 9) –Scale-free graphs: 0.1% of the vertices have degree = 1000, the remaining vertices have degree = 6

Results on near-regular graphs [Table: #Vertex (M), IIIT-BFS (ms), CPU-BFS (ms), UIUC-BFS (ms), Speedup (CPU/UIUC); numeric entries not recoverable from the transcript]

Results on real-world graphs [Table: #Vertex, IIIT-BFS (ms), CPU-BFS (ms), UIUC-BFS (ms), Speedup (CPU/UIUC) for four real-world graphs; numeric entries not recoverable from the transcript]

Results on scale-free graphs As of today, parallelism does not work as well on such imbalanced problems [Table: #Vertex (M), IIIT-BFS (ms), CPU-BFS (ms), UIUC-BFS (ms); numeric entries not recoverable from the transcript]

Concluding Remarks Effectively accelerated BFS, considering the upper limit on performance for such memory-bound applications –Non-uniform data distributions still need to be addressed Hierarchical queue management and multi-layer kernel arrangement are potentially applicable to other algorithms with dynamic data (work) To fully exploit the power of GPU computation, more effort is needed at the data-structure and algorithmic level

References
Lijuan Luo, Martin Wong, and Wen-mei Hwu, “An Effective GPU Implementation of Breadth-First Search”, Design Automation Conference (DAC), 2010.
Pawan Harish and P. J. Narayanan, “Accelerating Large Graph Algorithms on the GPU Using CUDA”, IEEE International Conference on High Performance Computing (HiPC), LNCS 4873, 2007.
C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, “Fast BVH Construction on GPUs”, Computer Graphics Forum, Vol. 28, No. 2, 2009.
Yangdong (Steve) Deng, Bo David Wang, and Shuai Mu, “Taming Irregular EDA Applications on GPUs”, ICCAD, 2009.
Shucai Xiao and Wu-chun Feng, “Inter-Block GPU Communication via Fast Barrier Synchronization”, Technical Report TR-09-19, Department of Computer Science, Virginia Tech.

ANY FURTHER QUESTIONS? ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010