Portable Inter-workgroup Barrier Synchronisation for GPUs


Portable Inter-workgroup Barrier Synchronisation for GPUs. Tyler Sorensen, Alastair Donaldson (Imperial College London, UK); Mark Batty (University of Kent, UK); Ganesh Gopalakrishnan, Zvonimir Rakamaric (University of Utah). OOPSLA 2016.

Barriers are provided as primitives in MPI, OpenMP, Pthreads, and on GPUs (intra-workgroup): threads T0, T1, T2, T3 each wait at the barrier until all of them have arrived, then all proceed.

GPU programming: threads are organised into workgroups (Workgroup 0 through Workgroup n), each with its own local memory (local memory for WG0 through WGn); all workgroups share global memory.

Inter-workgroup barrier: not provided as a primitive, but the building blocks are provided: inter-workgroup shared (global) memory.

Experimental setup

  Chip         Vendor  Compute Units  OpenCL Version  Type
  GTX 980      Nvidia  16             1.1             Discrete
  Quadro K500  Nvidia  12             1.1             Discrete
  Iris 6100    Intel   47             2.0             Integrated
  HD 5500      Intel   24             2.0             Integrated
  Radeon R9    AMD     28
  Radeon R7    AMD     8
  T628-4       ARM     4              1.2             Integrated
  T628-2       ARM     2              1.2             Integrated

Challenges Scheduling Memory consistency

Occupancy bound execution. A program with 5 workgroups is launched on a GPU with 3 compute units; workgroups wait in a workgroup queue. The first three workgroups (w0, w1, w2) occupy the compute units. Each workgroup runs to completion; as one finishes, the next queued workgroup takes its place on a free compute unit, until all workgroups have finished.

Occupancy bound execution. Now suppose the occupant workgroups (w0, w1, w2) try to synchronise with the workgroups still in the queue: they cannot, because a queued workgroup only starts once an occupant runs to completion. A barrier across all 5 workgroups therefore deadlocks! A barrier is possible, however, if we know the occupancy.

Occupancy bound execution. Can we simply launch as many workgroups as there are compute units? No: depending on resource usage, multiple workgroups can execute on a single compute unit, so the compute-unit count may underestimate the true occupancy.

Recall of occupancy discovery

  Chip         Compute Units  Occupancy Bound
  GTX 980      16             32
  Quadro K500  12
  Iris 6100    47             6
  HD 5500      24             3
  Radeon R9    28             48
  Radeon R7    8
  T628-4       4
  T628-2       2

Our approach (scheduling): launch a program with many workgroups (w0 through wa); only the occupant workgroups actually execute on the compute units, and we dynamically estimate the occupancy, i.e. the set of workgroups that are co-resident. The barrier then synchronises exactly those workgroups.

Finding occupant workgroups Executed by 1 thread per workgroup Two phases: Polling Closing

Finding occupant workgroups Executed by 1 thread per workgroup Three global variables: Mutex: m Bool poll flag: poll_open Integer counter: count

Finding occupant workgroups. Executed by 1 thread per workgroup. Three global variables: mutex m, bool poll flag poll_open, integer counter count. Returns true iff this workgroup is counted as an occupant:

  // polling phase
  lock(m);
  if (poll_open) {
    count++;
    unlock(m);
  } else {
    unlock(m);
    return false;   // poll already closed: not an occupant
  }

  // closing phase
  lock(m);
  poll_open = false;
  unlock(m);
  return true;      // occupant; count workgroups discovered so far

Recall of occupancy discovery (occupant workgroups discovered using a spin-lock vs. a ticket-lock mutex):

  Chip         Compute Units  Occupancy Bound  Spin Lock  Ticket Lock
  GTX 980      16             32               3.1        32.0
  Quadro K500  12                              2.3        12.0
  Iris 6100    47             6                3.5        6.0
  HD 5500      24             3                2.7        3.0
  Radeon R9    28             48               7.7        48.0
  Radeon R7    8                               4.6        16.0
  T628-4       4                                          4.0
  T628-2       2                               2.0

Challenges Scheduling Memory consistency

Our approach (memory consistency): a write m1 made before the barrier must be ordered happens-before (HB) any read of it after the barrier. The barrier implementation creates this HB web using OpenCL 2.0 atomic operations.

Barrier implementation: start with an implementation from Xiao and Feng (2010), written in CUDA (ported here to OpenCL), which has no formal memory consistency properties.

Master/slave barrier (one flag x_i per slave workgroup i):

  Slave workgroup i:
    barrier();        // intra-workgroup barrier
    x_i = 1;          // signal arrival to the master
    spin(x_i != 0);   // wait to be released
    barrier();        // intra-workgroup barrier

  Master workgroup (thread i handles slave i):
    spin(x_i != 1);   // wait for slave i to arrive
    barrier();        // intra-workgroup barrier: all slaves have arrived
    x_i = 0;          // release slave i

Barrier memory consistency: the device release-acquire rule. For T0 and T1 in different workgroups, T0's release write (Wrel y = 1) synchronises with T1's acquire read (Racq y = 1), establishing happens-before between the two threads.

Annotating the barrier with OpenCL 2.0 atomics, each flag operation becomes a release write (Wrel) or acquire read (Racq):

  Slave workgroup i:            Wrel x_i = 1;  spin(Racq x_i != 0)
  Master workgroup (thread i):  spin(Racq x_i != 1);  Wrel x_i = 0

Each release write synchronises with the acquire read that observes it, creating the happens-before web the barrier needs.

Barrier vs. multi-kernel: the Pannotia benchmark suite: 4 graph-algorithm applications, written in OpenCL 1.x, targeting the AMD Radeon HD 7000 series. https://github.com/pannotia/pannotia

Barrier vs. multi-kernel

  Device (one persistent kernel), loop until a fixed point is reached:
    GPU_linear_algebra_routine1; global_barrier();
    GPU_linear_algebra_routine2; global_barrier();
    GPU_linear_algebra_routine3; global_barrier();

  Host (multi-kernel), loop until a fixed point is reached:
    GPU_linear_algebra_routine1;
    GPU_linear_algebra_routine2;
    GPU_linear_algebra_routine3;

Barrier vs. multi-kernel results: up to a 2.2x speedup (ARM); always a speedup on Intel (1.25x median); broke even on Nvidia (1.0x median); in some cases a slowdown (0.22x on ARM). A fine-grained performance model is needed for tuning!

Portable GPU Barrier Synchronisation: we designed a GPU inter-workgroup barrier that is portable under occupancy bound execution and the OpenCL memory model; evaluated it on 8 GPUs from 4 vendors; achieved nearly perfect occupancy recall on modern GPUs; and showed it can provide substantial performance gains in applications. Try it out now! https://github.com/mc-imperial/gpu_discovery_barrier