Presentation is loading. Please wait.

Presentation is loading. Please wait.

Portable Inter-workgroup Barrier Synchronisation for GPUs

Similar presentations


Presentation on theme: "Portable Inter-workgroup Barrier Synchronisation for GPUs"— Presentation transcript:

1 Portable Inter-workgroup Barrier Synchronisation for GPUs
Tyler Sorensen, Alastair Donaldson Imperial College London, UK Mark Batty University of Kent, UK Ganesh Gopalakrishnan, Zvonimir Rakamaric University of Utah OOPSLA 2016

2 Barriers Provided as primitives MPI OpenMP Pthreads
GPU (intra-workgroup) T0 T1 T2 T3 …… barrier ……

3 GPU programming Threads Global Memory

4 GPU programming Global Memory Workgroup 0 Workgroup 1 Workgroup n
Threads Local memory for WG0 Local memory for WG1 Local memory for WGn Global Memory

5 Inter-workgroup barrier
Not provided as primitive Building blocks provided: Inter-workgroup shared memory

6 Experimental results Chip Vendor Compute Units OpenCL Version Type
GTX 980 Nvidia 16 1.1 Discrete Quadro K500 12 Iris 6100 Intel 47 2.0 Integrated HD 5500 24 Radeon R9 AMD 28 Radeon R7 8 T628-4 ARM 4 1.2 T628-2 2 integrated

7 Challenges Scheduling Memory consistency

8 Occupancy bound execution
Program with 5 workgroups w5 w4 w2 w1 w0 workgroup queue CU CU CU GPU with 3 compute units

9 Occupancy bound execution
Program with 5 workgroups w5 w4 w2 w1 w0 workgroup queue CU CU CU GPU with 3 compute units

10 Occupancy bound execution
Program with 5 workgroups w5 w4 w4 workgroup queue w0 w1 w2 CU CU CU GPU with 3 compute units

11 Occupancy bound execution
Program with 5 workgroups w5 w4 w4 workgroup queue w0 w1 CU CU CU w2 GPU with 3 compute units finished workgroups

12 Occupancy bound execution
Program with 5 workgroups w5 w4 workgroup queue w0 w1 w4 CU CU CU w2 GPU with 3 compute units finished workgroups

13 Occupancy bound execution
Program with 5 workgroups w5 w4 workgroup queue w1 w4 CU CU CU w2 w0 GPU with 3 compute units finished workgroups

14 Occupancy bound execution
Program with 5 workgroups w4 workgroup queue w5 w1 w4 CU CU CU w2 w0 GPU with 3 compute units finished workgroups

15 Occupancy bound execution
Program with 5 workgroups w4 workgroup queue Finished! CU CU CU w2 w0 w5 w1 w4 GPU with 3 compute units finished workgroups

16 Occupancy bound execution
Program with 5 workgroups w5 w4 w2 w1 w0 workgroup queue CU CU CU GPU with 3 compute units

17 Occupancy bound execution
Program with 5 workgroups w5 w4 w2 w1 w0 workgroup queue CU CU CU GPU with 3 compute units

18 Occupancy bound execution
Program with 5 workgroups w5 w4 w4 workgroup queue w0 w1 w2 Cannot synchronise with workgroups in queue CU CU CU Barrier gives deadlock! GPU with 3 compute units

19 Occupancy bound execution
Program with 5 workgroups w5 w4 w4 workgroup queue w0 w1 w2 Barrier is possible if we know the occupancy CU CU CU GPU with 3 compute units

20 Occupancy bound execution
Launch as many workgroups as compute units?

21 Occupancy bound execution
Launch as many workgroups as compute units? Program with 5 workgroups w5 w4 w2 w1 w0 workgroup queue CU CU CU GPU with 3 compute units

22 Occupancy bound execution
Launch as many workgroups as compute units? Program with 5 workgroups w4 workgroup queue w0 w1 w4 w2 w5 Depending on resources, multiple wgs can execute on CU CU CU CU GPU with 3 compute units

23 Recall of occupancy discovery
Chip Compute Units Occupancy Bound GTX 980 16 Quadro K500 12 Iris 6100 47 HD 5500 24 Radeon R9 28 Radeon R7 8 T628-4 4 T628-2 2

24 Recall of occupancy discovery
Chip Compute Units Occupancy Bound GTX 980 16 32 Quadro K500 12 Iris 6100 47 6 HD 5500 24 3 Radeon R9 28 48 Radeon R7 8 T628-4 4 T628-2 2

25 Our approach (scheduling)
wa w9 w8 w7 w6 Program w5 w4 w2 w1 w0 workgroup queue CU CU CU GPU with 3 compute units

26 Our approach (scheduling)
w8 w6 Program w4 w5 w1 workgroup queue w2 w9 w7 w0 w4 wa CU CU CU GPU with 3 compute units

27 Our approach (scheduling)
w8 w6 Program w4 w5 w1 workgroup queue w2 w9 w7 w0 w4 wa Dynamically estimate occupancy CU CU CU GPU with 3 compute units

28 Our approach (scheduling)
w8 w6 Program w4 w5 w1 workgroup queue w2 w9 w7 w0 w4 wa Dynamically estimate occupancy CU CU CU GPU with 3 compute units

29 Finding occupant workgroups
Executed by 1 thread per workgroup Two phases: Polling Closing

30 Finding occupant workgroups
Executed by 1 thread per workgroup Three global variables: Mutex: m Bool poll flag: poll_open Integer counter: count

31 Finding occupant workgroups
lock(m) if (poll_open) { count++; unlock(m); } else { return false; poll_open = false unlock(m) return true; Executed by 1 thread per workgroup Three global variables: Mutex: m Bool poll flag: poll_open Integer counter: count

32 Finding occupant workgroups
Polling phase lock(m) if (poll_open) { count++; unlock(m); } else { return false; poll_open = false unlock(m) return true; Executed by 1 thread per workgroup Three global variables: Mutex: m Bool poll flag: poll_open Integer counter: count

33 Finding occupant workgroups
Polling phase lock(m) if (poll_open) { count++; unlock(m); } else { return false; poll_open = false unlock(m) return true; Executed by 1 thread per workgroup Three global variables: Mutex: m Bool poll flag: poll_open Integer counter: count Closing phase

34 Finding occupant workgroups
lock(m) if (poll_open) { count++; unlock(m); } else { return false; poll_open = false unlock(m) return true; Executed by 1 thread per workgroup Three global variables: Mutex: m Bool poll flag: poll_open Integer counter: count

35 Finding occupant workgroups
lock(m) if (poll_open) { count++; unlock(m); } else { return false; poll_open = false unlock(m) return true; Executed by 1 thread per workgroup Three global variables: Mutex: m Bool poll flag: poll_open Integer counter: count

36 Recall of occupancy discovery
Chip Compute Units Occupancy Bound Spin Lock Ticket Lock GTX 980 16 32 Quadro K500 12 Iris 6100 47 6 HD 5500 24 3 Radeon R9 28 48 Radeon R7 8 T628-4 4 T628-2 2

37 Recall of occupancy discovery
Chip Compute Units Occupancy Bound Spin Lock Ticket Lock GTX 980 16 32 3.1 Quadro K500 12 2.3 Iris 6100 47 6 3.5 HD 5500 24 3 2.7 Radeon R9 28 48 7.7 Radeon R7 8 4.6 T628-4 4 T628-2 2 2.0

38 Recall of occupancy discovery
Chip Compute Units Occupancy Bound Spin Lock Ticket Lock GTX 980 16 32 3.1 32.0 Quadro K500 12 2.3 12.0 Iris 6100 47 6 3.5 6.0 HD 5500 24 3 2.7 3.0 Radeon R9 28 48 7.7 48.0 Radeon R7 8 4.6 16.0 T628-4 4 4.0 T628-2 2 2.0

39 Challenges Scheduling Memory consistency

40 Our approach (memory consistency)
…… barrier m1 ……

41 Our approach (memory consistency)
…… barrier HB m1 ……

42 Our approach (memory consistency)
…… Barrier implementation creates HB web Using OpenCL 2.0 atomic operations ……

43 Barrier implementation
Start with an implementation from Xiao and Feng (2010) Written in CUDA (ported to OpenCL) No formal memory consistency properties

44 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1

45 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 barrier barrier spin(x0 != 1) spin(x1 != 1)

46 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 barrier barrier spin(x0 != 1) spin(x1 != 1) barrier barrier

47 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 barrier barrier x0 = 1 x1 = 1 spin(x0 != 1) spin(x1 != 1) barrier barrier

48 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 barrier barrier x0 = 1 x1 = 1 spin(x0 != 1) spin(x1 != 1) barrier spin(x0 != 0) spin(x1 != 0) barrier barrier

49 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 barrier barrier x0 = 1 x1 = 1 spin(x0 != 1) spin(x1 != 1) barrier x0 = 0 x1 = 0 spin(x0 != 0) spin(x1 != 0) barrier barrier

50 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 barrier barrier x0 = 1 x1 = 1 spin(x0 != 1) spin(x1 != 1) barrier x0 = 0 x1 = 0 spin(x0 != 0) spin(x1 != 0) barrier barrier

51 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 x0 = 1 x1 = 1 spin(x0 != 1) spin(x1 != 1) x0 = 0 x1 = 0 spin(x0 != 0) spin(x1 != 0)

52 Barrier memory consistency
Device release-acquire rule T0 and T1 in different workgroups T0 T1 Wrel y = 1 Racq y = 1

53 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 x0 = 1 x1 = 1 spin(x0 != 1) spin(x1 != 1) x0 = 0 x1 = 0 spin(x0 != 0) spin(x1 != 0)

54 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 Wrel x0 = 1 Wrel x1 = 1 spin(x0 != 1) spin(x1 != 1) Wrel x0 = 0 Wrel x1 = 0 spin(x0 != 0) spin(x1 != 0)

55 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 Wrel x0 = 1 Wrel x1 = 1 spin(Racqx0 != 1) spin(Racqx1 != 1) Wrel x0 = 0 Wrel x1 = 0 spin(Racqx0 != 0) spin(Racqx1 != 0)

56 Slave Workgroup 0 Master Workgroup Slave Workgroup 1 T1 T0 T0 T1 T0 T1 Wrel x0 = 1 Wrel x1 = 1 spin(Racqx0 != 1) spin(Racqx1 != 1) Wrel x0 = 0 Wrel x1 = 0 spin(Racqx0 != 0) spin(Racqx1 != 0)

57 Barrier vs. multi-kernel
Pannotia Target AMD Radeon HD 7000 Written in OpenCL 1.x 4 graph algorithms applications

58 Barrier vs. multi-kernel
Device Host GPU_linear_algebra_routine1; global_barrier(); GPU_linear_algebra_routine2; global_barrier(); GPU_linear_algebra_routine3; global_barrier(); GPU_linear_algebra_routine1; GPU_linear_algebra_routine2; GPU_linear_algebra_routine3; Loop until a fixed point is reached. Loop until a fixed point is reached.

59 Barrier vs. multi-kernel
Up to 2.2x speedup (ARM) Always observed a speedup on Intel (1.25x median) Broke even with Nvidia (1.0x median) In some cases a slowdown (0.22x on ARM)

60 Barrier vs. multi-kernel
Up to 2.2x speedup (ARM) Always observed a speedup on Intel (1.25x median) Broke even with Nvidia (1.0x median) In some cases a slowdown (0.22x on ARM) Fine grained performance model for tuning needed!

61 Portable GPU Barrier Synchronisation
Designed a GPU inter-workgroup barrier portable under occupancy bound execution and OpenCL memory model Evaluated on 8 GPUs (4 vendors) Nearly perfect recall on modern GPUs Can provide substantial performance gains in applications Try it out now!


Download ppt "Portable Inter-workgroup Barrier Synchronisation for GPUs"

Similar presentations


Ads by Google