Weak Execution Ordering - Exploiting Iterative Methods on Many-Core GPUs
Jianmin Chen, Zhuo Huang, Feiqi Su, Jih-Kwon Peir and Jeff Ho, University of Florida
Lu Peng, Louisiana State University
Outline
CUDA review & inter-block communication and synchronization
Host synchronization overhead
Applications with iterative PDE solvers
Optimizations on inter-block communication
Performance results
Conclusion
CUDA Programming Model
The host invokes kernels/grids to execute on the GPU
Hierarchy: kernel/grid, blocks, threads
[Figure: an application alternates host execution with kernel 0 (Block 0, Block 1, Block 2, Block 3, ...) and kernel 1 (Block 0 ... Block N)]
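A minimal CUDA sketch of this hierarchy (the kernel name and sizes are illustrative, not from the slides): the host launches a grid of blocks, each block contains threads, and each thread picks one element from its block and thread indices.

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; blocks tile the 1-D grid.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    dim3 threads(256);                             // threads per block
    dim3 blocks((N + threads.x - 1) / threads.x);  // blocks per grid

    // Host execution, then kernel 0; a later kernel 1 would only observe
    // kernel 0's results after the host-side synchronization in between.
    scaleKernel<<<blocks, threads>>>(d_data, 2.0f, N);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```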
CUDA GPU Architecture
Blocks are assigned to Stream Multiprocessors (SMs), each composed of 8 Stream Processors (SPs) and shared (local) memory
Number of concurrently running blocks is limited by resources; the scheduler holds the waiting blocks
Blocks can communicate through global memory (GM), but data is lost when control returns to the host
No synchronization among blocks: block synchronization must go through the host!
[Figure: GPU with SM 0 ... SM 29, each with SPs and shared memory, connected by an interconnect network to global memory; waiting blocks (e.g., Block 58-61) queue in the scheduler]
Example: Breadth-First Search (BFS)
Given G(V,E) and a source S, compute the number of steps to reach all other nodes
Each thread computes one node
Initially all nodes are inactive except the source node
When activated, a node is visited and activates its unvisited neighbors
Nodes visited in the nth iteration need n-1 steps to reach
Keep iterating until no node is active
Synchronization is needed after each iteration
[Figure: example graph with nodes S, A, B, C, D, E showing inactive/active/visited states over three iterations]
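A hedged sketch of one BFS iteration as described above, assuming a CSR graph layout (rowPtr/colIdx) and a level array initialized to -1 except level[S] = 0; all names are illustrative. The host relaunches this kernel once per iteration and reads back the changed flag, which is exactly the per-iteration synchronization overhead the talk is concerned with.

```cuda
// One BFS iteration: one thread per node; nodes activated in the previous
// iteration (level == curLevel) visit and activate their unvisited neighbors.
__global__ void bfsStep(const int *rowPtr, const int *colIdx, int *level,
                        int curLevel, int numNodes, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numNodes || level[v] != curLevel) return;  // only active nodes work

    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int u = colIdx[e];
        if (level[u] == -1) {          // unvisited neighbor
            level[u] = curLevel + 1;   // activate it for the next iteration
            *changed = 1;              // tells the host to keep iterating
        }
    }
}
```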
No-Host vs. Host Synchronization
Limit the number of nodes to fit in one block to avoid host synchronization
Host synchronization can be replaced by __syncthreads()
Avoids multiple kernel-initiation overheads
Data can stay in shared memory, reducing global accesses for save/restore
Reduces transfers of intermediate partial data or the termination flag to the host during host synchronization
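A hedged sketch of the no-host variant, assuming the whole graph fits in one block (one thread per node): the per-iteration barrier becomes __syncthreads() and the termination flag lives in shared memory, so no kernel relaunch or flag transfer to the host is needed. For brevity the graph itself stays in global memory here, though the slide notes data can also be kept in shared memory; the benign races on level[] and changed are tolerated, as in the host-synchronized version.

```cuda
__global__ void bfsNoHost(const int *rowPtr, const int *colIdx,
                          int *level, int numNodes) {
    __shared__ int changed;
    int v = threadIdx.x;            // single block: thread id == node id
    int curLevel = 0;

    do {
        if (threadIdx.x == 0) changed = 0;
        __syncthreads();

        if (v < numNodes && level[v] == curLevel) {
            for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
                int u = colIdx[e];
                if (level[u] == -1) { level[u] = curLevel + 1; changed = 1; }
            }
        }
        __syncthreads();            // replaces the host synchronization
        ++curLevel;
    } while (changed);
}
```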
No-Host vs. Host Results
Graph generated by GTgraph with 3K nodes
No-host uses __syncthreads() in each iteration
Result: 67% host-synchronization overhead
Applications with Iterative PDE Solvers
Partial differential equation (PDE) solvers are widely used
Weak execution ordering / chaotic PDE solving using iterative methods
Accuracy of the solver is NOT critical
Examples: Poisson image editing, 3D shape from shading
Basic 3D-Shape in CUDA
New[x,y] = f(Old[x-1,y], Old[x,y-1], Old[x,y+1], Old[x+1,y])
Each block computes a sub-grid
Nodes from neighboring blocks are needed to compute boundary nodes
Host synchronization: return to the host after each iteration
But no exact ordering is needed!
[Figure: grid in global memory partitioned into Blocks 0-5; Block 2 and Block 5 hold their sub-grids in shared memory]
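A hedged sketch of one host-synchronized iteration, assuming a row-major grid, a 32x20 sub-grid per block (launched with dim3(32, 20) threads), and a placeholder averaging rule standing in for the application's f(); the halo nodes owned by neighboring blocks are read straight from global memory.

```cuda
#define BLOCK_W 32
#define BLOCK_H 20

// Placeholder update rule standing in for the application's f().
__device__ float f(float left, float up, float down, float right) {
    return 0.25f * (left + up + down + right);
}

__global__ void solveOneIteration(const float *oldGrid, float *newGrid,
                                  int width, int height) {
    __shared__ float tile[BLOCK_H + 2][BLOCK_W + 2];   // sub-grid plus halo

    int gx = blockIdx.x * BLOCK_W + threadIdx.x;
    int gy = blockIdx.y * BLOCK_H + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;

    // Interior nodes of this block's sub-grid.
    if (gx < width && gy < height)
        tile[ly][lx] = oldGrid[gy * width + gx];

    // Boundary (halo) nodes owned by neighboring blocks, loaded by edge threads.
    if (threadIdx.x == 0 && gx > 0 && gy < height)
        tile[ly][0] = oldGrid[gy * width + gx - 1];
    if (threadIdx.x == BLOCK_W - 1 && gx < width - 1 && gy < height)
        tile[ly][BLOCK_W + 1] = oldGrid[gy * width + gx + 1];
    if (threadIdx.y == 0 && gy > 0 && gx < width)
        tile[0][lx] = oldGrid[(gy - 1) * width + gx];
    if (threadIdx.y == BLOCK_H - 1 && gy < height - 1 && gx < width)
        tile[BLOCK_H + 1][lx] = oldGrid[(gy + 1) * width + gx];
    __syncthreads();

    // New[x,y] = f(Old[x-1,y], Old[x,y-1], Old[x,y+1], Old[x+1,y])
    if (gx > 0 && gx < width - 1 && gy > 0 && gy < height - 1)
        newGrid[gy * width + gx] = f(tile[ly][lx - 1], tile[ly - 1][lx],
                                     tile[ly + 1][lx], tile[ly][lx + 1]);
    // The host relaunches this kernel (swapping oldGrid/newGrid) every iteration.
}
```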
Coarse Synchronization
Host synchronization every n iterations
Between host synchronizations, blocks communicate through global memory with neighboring blocks to exchange updated boundary nodes
[Figure: Block 2 and Block 5 sub-grids in shared memory exchanging boundary nodes through the grid in global memory]
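A hedged host-side sketch contrasting the two schemes: solveOneIteration is the per-iteration kernel sketched earlier, and solveNIterations stands for an assumed kernel that keeps its sub-grid in shared memory for n iterations, exchanging only boundary nodes through global memory in between. All names and parameters are illustrative.

```cuda
#include <cuda_runtime.h>
#include <utility>

__global__ void solveOneIteration(const float *oldGrid, float *newGrid, int width, int height);
__global__ void solveNIterations(float *grid, int width, int height, int n);

// Fine-grained: the host is the barrier after every single iteration.
void runFine(float *d_old, float *d_new, int W, int H, int totalIters,
             dim3 grid, dim3 block) {
    for (int it = 0; it < totalIters; ++it) {
        solveOneIteration<<<grid, block>>>(d_old, d_new, W, H);
        cudaDeviceSynchronize();          // launch + sync overhead on every iteration
        std::swap(d_old, d_new);
    }
}

// Coarse: one launch covers n iterations; inside the kernel, blocks exchange
// boundary nodes through global memory with no exact ordering guarantee, so
// convergence may take more iterations but far fewer host synchronizations.
void runCoarse(float *d_grid, int W, int H, int totalIters, int n,
               dim3 grid, dim3 block) {
    for (int it = 0; it < totalIters; it += n) {
        solveNIterations<<<grid, block>>>(d_grid, W, H, n);
        cudaDeviceSynchronize();
    }
}
```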
Coarse vs. Fine Host Synchronization
Coarse synchronization: less synchronization overhead, but more iterations needed to converge due to imprecise boundary updates through inter-block communication
Reducing inter-block communication overhead:
  Overlap communication with computation
  Neighbor communication: upper/lower only vs. all 4 neighbors
  Block scheduling strategy: square vs. stripe
Overlap Communication with Computation
Separate communication threads to overlap with computation; no precise ordering is needed

Computation threads:
  Initialization phase: load 32x20 data nodes into shared memory; __syncthreads()
  Main phase: while < n iterations { compute iterations (no-host) }
  Ending phase: store new 32x20 data nodes to global memory; return

Communication threads:
  Initialization phase: load the boundary nodes into shared memory
  Main phase: store and load boundary values to/from global memory
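A hedged CUDA sketch of this split, assuming a row-major grid whose dimensions are multiples of the 32x20 tile, and a block launched with dim3(32, 21) threads so that the extra row (one full warp) acts as the communication warp. Left/right halo columns are loaded once and not refreshed (the upper/lower-neighbor scheme discussed on a later slide), and the convergence check is omitted; all names are illustrative.

```cuda
#define BLOCK_W 32
#define BLOCK_H 20

__global__ void solveOverlapped(float *grid, int width, int height, int nIters) {
    __shared__ float tile[BLOCK_H + 2][BLOCK_W + 2];   // sub-grid plus halo

    int gx    = blockIdx.x * BLOCK_W + threadIdx.x;
    int gy    = blockIdx.y * BLOCK_H + threadIdx.y;    // valid for computation threads
    int gyTop = blockIdx.y * BLOCK_H;                  // first row owned by this block
    bool comm = (threadIdx.y == BLOCK_H);              // last warp: communication threads
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;

    // Initialization phase: computation threads load the 32x20 sub-grid plus the
    // left/right halo columns; communication threads load the top/bottom halo rows.
    if (!comm) {
        tile[ly][lx] = grid[gy * width + gx];
        if (threadIdx.x == 0 && gx > 0)
            tile[ly][0] = grid[gy * width + gx - 1];
        if (threadIdx.x == BLOCK_W - 1 && gx + 1 < width)
            tile[ly][BLOCK_W + 1] = grid[gy * width + gx + 1];
    } else {
        if (gyTop > 0)                tile[0][lx]           = grid[(gyTop - 1) * width + gx];
        if (gyTop + BLOCK_H < height) tile[BLOCK_H + 1][lx] = grid[(gyTop + BLOCK_H) * width + gx];
    }
    __syncthreads();

    // Main phase: communication overlaps with computation every iteration.
    for (int it = 0; it < nIters; ++it) {
        float v = 0.0f;
        if (comm) {
            // Publish this block's boundary rows, then fetch whatever the upper and
            // lower neighbors have published so far. The computation warps may read
            // the halo rows while they are being refreshed; that race is tolerated
            // because no precise ordering is needed.
            grid[gyTop * width + gx]                 = tile[1][lx];
            grid[(gyTop + BLOCK_H - 1) * width + gx] = tile[BLOCK_H][lx];
            if (gyTop > 0)                tile[0][lx]           = grid[(gyTop - 1) * width + gx];
            if (gyTop + BLOCK_H < height) tile[BLOCK_H + 1][lx] = grid[(gyTop + BLOCK_H) * width + gx];
        } else {
            // 4-point update into a register; domain-boundary nodes keep their value.
            bool interior = (gx > 0 && gx + 1 < width && gy > 0 && gy + 1 < height);
            v = interior ? 0.25f * (tile[ly][lx - 1] + tile[ly - 1][lx] +
                                    tile[ly + 1][lx] + tile[ly][lx + 1])
                         : tile[ly][lx];
        }
        __syncthreads();                 // every thread reaches this barrier
        if (!comm) tile[ly][lx] = v;     // then publish the new values
        __syncthreads();
    }

    // Ending phase: store the updated 32x20 sub-grid back to global memory.
    if (!comm) grid[gy * width + gx] = tile[ly][lx];
}
```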
Overlap Communication with Computation
Communication frequency trade-off: Execution Time = Time per Iteration × Number of Iterations
Neighbor Communication
Communicate with upper and lower neighbors only:
  Less data communicated through global memory
  Coalesced memory moves
  Incomplete data communication, so slower convergence
Communicate with all four neighbors:
  More data and uncoalesced memory moves
  May converge faster
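A hedged sketch of the memory-access difference, assuming a row-major grid and 32x20 tiles: exchanging a boundary row lets the 32 threads of a warp touch 32 consecutive floats (coalesced), while exchanging a boundary column strides by the grid width (uncoalesced). Function and parameter names are illustrative; the column helper assumes it is called by BLOCK_H threads.

```cuda
#define BLOCK_W 32
#define BLOCK_H 20

// Upper/lower exchange: threadIdx.x = 0..31 writes consecutive addresses,
// so the warp's stores coalesce into a few wide transactions.
__device__ void publishBoundaryRow(float *grid, const float *tileRow,
                                   int rowY, int blockX, int width) {
    grid[rowY * width + blockX * BLOCK_W + threadIdx.x] = tileRow[threadIdx.x];
}

// Left/right exchange: successive threads write addresses `width` floats apart,
// so the same warp generates many scattered (uncoalesced) transactions.
__device__ void publishBoundaryColumn(float *grid, const float *tileCol,
                                      int colX, int blockY, int width) {
    grid[(blockY * BLOCK_H + threadIdx.x) * width + colX] = tileCol[threadIdx.x];
}
```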
Block Scheduling
Blocks are scheduled in groups due to limited resources; no updated data comes from inactive blocks
Try to minimize the boundary nodes of the whole active group
[Figure: block numbering of the grid under stripe scheduling vs. square scheduling]
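A hedged back-of-the-envelope comparison, assuming a group of 4 concurrently scheduled blocks of 32x20 nodes each (the group size is illustrative, not from the slides):

```latex
\begin{align*}
\text{boundary nodes of a } W \times H \text{ group} &= 2W + 2H - 4 \\[4pt]
\text{stripe } (1 \times 4 \text{ blocks}):\; W = 128,\ H = 20 &\;\Rightarrow\; 2(128) + 2(20) - 4 = 292 \\
\text{square } (2 \times 2 \text{ blocks}):\; W = 64,\ H = 40 &\;\Rightarrow\; 2(64) + 2(40) - 4 = 204
\end{align*}
```

Under these assumptions the square group exposes roughly 30% fewer nodes to inactive neighbors, so fewer updates rely on stale global-memory data.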
Performance Results
[Chart: Base: 95.35 s]
Conclusion
Inter-block synchronization is not supported on GPUs and has a significant impact on asynchronous PDE solvers
Coarse synchronization and optimizations improve overall performance:
  Separate communication threads to overlap with computation
  Block scheduling and inter-block communication strategies
Speedup of 4-5 times compared with fine-granularity host synchronization
Thank You!! Questions?