GPU Concurrency: Weak Behaviours and Programming Assumptions

1 GPU Concurrency: Weak Behaviours and Programming Assumptions
Tyler Sorensen, Adviser: Jade Alglave, University College London
WPLI 2015, April 12, 2015

2 Based on our ASPLOS '15 paper:
GPU Concurrency: Weak Behaviours and Programming Assumptions
Jade Alglave (1,2), Mark Batty (3), Alastair F. Donaldson (4), Ganesh Gopalakrishnan (5), Jeroen Ketema (4), Daniel Poetzl (6), Tyler Sorensen (1,5), John Wickerson (4)
(1) University College London, (2) Microsoft Research, (3) University of Cambridge, (4) Imperial College London, (5) University of Utah, (6) University of Oxford

3 Intel Core i CPU

5 Nvidia Tesla C2075 GPU

6 Roadmap
what happened to the pony (background)
how we found the bug (methodology)
how we are able to fix the pony (contribution)

7 What happened to the pony?
the visualization bugs are due to weak memory behaviours on GPUs

8 Weak memory models
consider the test known as message passing (mp); an instance of this test appears in the pony code
the test has four parts:
initial state: x and y are memory locations
thread ids
program: for each thread id
assertion: question about the final state of registers

13 Message passing (mp) test
Tests how to implement a handshake idiom
Data: one thread writes the data, the other thread reads it
Flag: the writer then sets a flag, which the reader checks before reading the data
Stale Data: the assertion asks whether the reader can see the flag set and still read stale data

21 executing the threads as an interleaving is known as Lamport’s sequential consistency (or SC)
the assertion cannot be satisfied by interleavings

22 Weak memory models
can we assume the assertion will never pass? No!

24 Weak memory models
Alglave and Maranget report that this assertion passes 41 million times out of 5 billion test runs on a Tegra2 ARM processor (1)
(1) http://diy.inria.fr/cats/tables.html

25 Weak memory models
what happened?
architectures implement weak memory models, where the hardware is allowed to re-order certain memory instructions
weak memory models can allow weak behaviours (executions that do not correspond to an interleaving)
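
The difference between interleavings and weak behaviours can be checked by brute force. The sketch below is not from the talk: it models the mp test in plain C++ (no GPU involved), assuming that x and y start at 0 and that the two threads run the store/load pairs shown in the concrete litmus test later in the deck. The names `Op`, `mp_writer`, `mp_reader`, `swapped` and `count_weak_outcomes` are made up for illustration. Hardware re-ordering is modelled, very crudely, by swapping the two instructions of one thread before enumerating the interleavings.

```cpp
#include <array>
#include <cassert>
#include <utility>

enum class Kind { Store, Load };

struct Op {
    Kind kind;
    int loc;  // 0 = x, 1 = y
    int reg;  // loads only: 0 = r1, 1 = r2 (unused for stores)
};

using Thread = std::array<Op, 2>;

// T0: store 1 to x (data), then store 1 to y (flag).
Thread mp_writer() { return {{{Kind::Store, 0, -1}, {Kind::Store, 1, -1}}}; }
// T1: load y into r1 (flag), then load x into r2 (data).
Thread mp_reader() { return {{{Kind::Load, 1, 0}, {Kind::Load, 0, 1}}}; }

// Models a re-ordering of a thread's two instructions.
Thread swapped(Thread t) { std::swap(t[0], t[1]); return t; }

// Counts the interleavings of t0 and t1 whose final state satisfies the
// mp assertion r1 = 1 /\ r2 = 0. Each thread runs its two operations in
// the order given.
int count_weak_outcomes(const Thread& t0, const Thread& t1) {
    int weak = 0;
    // Each 4-bit mask with exactly two bits set is one interleaving:
    // a set bit means "T0 takes this step", a clear bit means T1 does.
    for (int mask = 0; mask < 16; ++mask) {
        int bits = 0;
        for (int s = 0; s < 4; ++s) bits += (mask >> s) & 1;
        if (bits != 2) continue;

        int mem[2] = {0, 0};   // assumed initial state: x = 0, y = 0
        int regs[2] = {0, 0};  // r1, r2
        int next0 = 0, next1 = 0;
        for (int s = 0; s < 4; ++s) {
            const Op& op = ((mask >> s) & 1) ? t0[next0++] : t1[next1++];
            if (op.kind == Kind::Store) mem[op.loc] = 1;
            else regs[op.reg] = mem[op.loc];
        }
        assert(next0 == 2 && next1 == 2);
        if (regs[0] == 1 && regs[1] == 0) ++weak;
    }
    return weak;
}
```

With both threads in program order, `count_weak_outcomes(mp_writer(), mp_reader())` finds no interleaving that satisfies the assertion. Swapping either the writer's stores or the reader's loads makes exactly one of the six interleavings satisfy it, which is the stale-data outcome.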

26 GPU memory models
what type of memory model do current GPUs implement?
documentation is sparse: CUDA has 1 page + 1 example, PTX has 1 page + 0 examples, both given in English prose
we need to know this if we are to write correct GPU programs!

27 GPU programming
threads are grouped into CTAs (CTA 0, CTA 1, ..., CTA n)
each CTA has its own shared memory (Shared Memory For CTA 0, ..., Shared Memory For CTA n)
all CTAs access the same global memory
within CTAs, threads are grouped into warps (32 threads per warp in Nvidia GPUs)

33 Methodology
[diagram: GPU litmus tests are run on GPU hardware and checked against a formal model; the results are compared]

34 GPU tests
GPU litmus test considerations:
PTX instructions
what memory region (shared or global) are x and y in?
are T0 and T1 in the same CTA or different CTAs?
Scope Tree (device (cta T0) (cta T1))
x: global, y: global

40 Running tests
we extend the litmus CPU testing tool of Alglave and Maranget to run GPU tests
given a GPU litmus test, it generates executable CUDA or OpenCL code for the test

41 Heuristics
memory stress: extra threads read and write to scratch memory
T0: run T0 test program
T1: run T1 test program
extra threads 1 to n: loop: read or write to scratchpad

42 Heuristics
random threads: randomize the location of the testing threads T0 and T1
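
The two heuristics can be combined into one role assignment. The sketch below is a host-side illustration only, not the generated test code: `Role` and `assign_roles` are hypothetical names, and the sketch ignores the scope tree (a real harness would also have to keep T0 and T1 in the same or in different CTAs as the test requires).

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

enum class Role { Stress, T0, T1 };

// Assigns a role to each of `num_threads` thread slots: two distinct,
// randomly chosen slots run the test programs T0 and T1 (random threads),
// and every other slot runs the scratchpad loop (memory stress).
std::vector<Role> assign_roles(int num_threads, std::mt19937& rng) {
    assert(num_threads >= 2);
    std::vector<int> ids(num_threads);
    std::iota(ids.begin(), ids.end(), 0);       // 0, 1, ..., num_threads - 1
    std::shuffle(ids.begin(), ids.end(), rng);  // random placement
    std::vector<Role> roles(num_threads, Role::Stress);
    roles[ids[0]] = Role::T0;
    roles[ids[1]] = Role::T1;
    return roles;
}
```

Calling `assign_roles` again with the same generator before each run gives a fresh placement per run, so the testing threads do not keep landing in the same slots.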

46 Heuristics
# of weak behaviours in 100,000 runs for different heuristics on an Nvidia Tesla C2075
test none random threads memory stress + gpu-mp 139 522

50 How we found the pony bug
This is the idiom, and these are the heuristics, that caused the bug!
test none random threads memory stress + gpu-mp 139 522

52 GPU fences
PTX gives 2 fences to disallow reading stale data:
membar.cta: gives ordering intra-CTA
membar.gl: gives ordering over the device
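
The two one-line descriptions above can be written down as a small decision function. This is only a sketch of what the slide says, not of the PTX specification or of the formal model; `Fence`, `fence_covers` and `weakest_sufficient_fence` are hypothetical names.

```cpp
#include <cassert>

enum class Fence { None, MembarCta, MembarGl };

// Does the fence give ordering between two communicating threads,
// going only by the slide's descriptions of the two fences?
bool fence_covers(Fence f, bool same_cta) {
    switch (f) {
        case Fence::MembarGl:  return true;      // ordering over the device
        case Fence::MembarCta: return same_cta;  // ordering intra-CTA only
        case Fence::None:      return false;
    }
    return false;
}

// Picks the narrower fence whenever it is enough for the scope tree.
Fence weakest_sufficient_fence(bool same_cta) {
    return same_cta ? Fence::MembarCta : Fence::MembarGl;
}
```

For the scope tree used in the fence experiments, where T0 and T1 are in different CTAs, `weakest_sufficient_fence(false)` selects `membar.gl`, and `fence_covers(Fence::MembarCta, false)` is false.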

53 GPU fences Test amended with a parameterizable fence
Scope Tree (device (cta T0) (cta T1) ) x: global, y: global

54 GPU fences
# of weak behaviours in 100,000 runs for different fences on an Nvidia Tesla C2075
test none membar.cta membar.gl gpu-mp 3380 2

57 How do we fix the pony
Tesla C2075 Nvidia GPU

58 How do we fix the pony adding fences to the code
Tesla C2075 Nvidia GPU (with fences)

59 GPU testing campaign
we extend the diy CPU litmus test generation tool of Alglave and Maranget to generate GPU tests
it generates litmus tests based on cycles
it enumerates the tests over the GPU thread and memory hierarchy
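
Enumerating one test over the thread and memory hierarchy can be pictured as follows. This is a simplified sketch, not the diy extension itself: it covers only the two questions raised for litmus tests earlier (same CTA or different CTAs, shared or global memory for x and y), ignores warps and grids, and uses the hypothetical names `Variant`, `is_valid` and `enumerate_variants`.

```cpp
#include <vector>

enum class Region { Shared, Global };
enum class Placement { SameCta, DifferentCtas };

struct Variant {
    Placement placement;  // where T0 and T1 sit in the scope tree
    Region x;             // memory region of location x
    Region y;             // memory region of location y
};

// Shared memory belongs to a single CTA, so threads in different CTAs
// cannot communicate through a location that lives in shared memory.
bool is_valid(const Variant& v) {
    bool uses_shared = v.x == Region::Shared || v.y == Region::Shared;
    return v.placement == Placement::SameCta || !uses_shared;
}

// All valid combinations of thread placement and memory regions.
std::vector<Variant> enumerate_variants() {
    std::vector<Variant> out;
    for (Placement p : {Placement::SameCta, Placement::DifferentCtas})
        for (Region x : {Region::Shared, Region::Global})
            for (Region y : {Region::Shared, Region::Global}) {
                Variant v{p, x, y};
                if (is_valid(v)) out.push_back(v);
            }
    return out;
}
```

Under these assumptions a single two-location test expands into five variants: four with both threads in the same CTA, and one with the threads in different CTAs and both locations in global memory.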

60 GPU testing campaign
Using our tools, we generated and ran tests over 5 Nvidia chips:
chip | year | architecture
GTX 750 Ti | 2014 | Maxwell
GTX Titan | 2013 | Kepler
GTX 660 | 2012 | Kepler
GTX 540m | 2011 | Fermi
Tesla C2075 | 2011 | Fermi

61 GPU testing campaign Results are hosted at:

62 Modeling
we extended herd, the CPU axiomatic memory modeling tool of Alglave and Maranget, to GPUs
we developed an axiomatic memory model for PTX which is able to simulate all of our tests
our model is sound with respect to all of our hardware observations

63 Modeling Demo of web interface

64 More results
surprising and buggy behaviours observed:
GPU mutex implementations allow stale data to be read (found in the CUDA by Example book and in academic papers (1,2)); this led to an erratum issued by Nvidia
hardware re-orders loads from the same address on Nvidia Fermi and Kepler
some testing on AMD GPUs
(1) J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs", CoRR, 2011
(2) B. He and J. X. Yu, "High-throughput transaction executions on graphics processors", PVLDB 2011

65 Related work (CPU memory models)
Alglave et al. have done extensive work on testing and modeling CPUs (notably IBM Power and ARM) and created the tools diy, litmus, and herd, which we extended for this work
Collier tested CPU memory models using the ARCHTEST tool

66 Related work (GPU memory models)
Hower et al. have proposed several SC-for-race-free language-level memory models for GPUs

67 Questions?
project page:
[images: Intel Core i CPU; Nvidia Tesla C2075 GPU; Nvidia Tesla C2075 GPU (with fences)]

68 CUDA by Example Intel Core i CPU

69 CUDA by Example Nvidia Tesla C2075 GPU

70 CUDA by Example Nvidia Tesla C2075 GPU (with fences)

71 Read-after-Read Hazard

72 Ignore after this

73 Results Surprising and buggy behaviours observed:
SC-per-location violations on NVIDIA Fermi and Kepler architectures (todo: add CORR test)

74 Limitations
warps: we do not test intra-warp behaviours, as the lock-step behaviour of warps is not compatible with some of our heuristics
grids: we do not test inter-grid behaviours, as we did not find any examples in the literature

75 GPU programming GPUs are SIMT (Single Instruction, Multiple Thread)
Nvidia GPUs may be programmed using CUDA or OpenCL

76 Roadmap
background and motivation
approach
GPU tests
running tests
modeling

77 Heuristics
two additional heuristics:
synchronization: testing threads synchronize immediately before running the test program
general bank conflicts: generate memory accesses that conflict with the accesses in the memory stress heuristic

78 Challenges
the PTX optimizing assembler may reorder or remove instructions
we developed a tool, optcheck, which compares the litmus test with the binary and checks for optimizations
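
The kind of comparison involved can be illustrated with a toy check. This is not how optcheck is implemented: it assumes that the memory operations of the litmus test and of the disassembled binary have already been extracted into comparable strings, which is the hard part and is not shown. `OptResult` and `compare_memory_ops` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

enum class OptResult { Unchanged, Reordered, Changed };

// Compares the memory operations expected from the litmus test with those
// found in the binary, both given in program order for a single thread.
// Unchanged: same operations in the same order.
// Reordered: same operations, different order.
// Changed:   an operation was removed, added, or altered.
OptResult compare_memory_ops(std::vector<std::string> expected,
                             std::vector<std::string> actual) {
    if (expected == actual) return OptResult::Unchanged;
    // Sorting discards the order, leaving only the collection of operations.
    std::sort(expected.begin(), expected.end());
    std::sort(actual.begin(), actual.end());
    return expected == actual ? OptResult::Reordered : OptResult::Changed;
}
```

A test would only be trusted when every thread reports `Unchanged`; any other result means the assembler has altered the instruction sequence that the litmus test was meant to exercise.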

80 GPU tests
concrete GPU test
T0               | T1               ;
st.cg.s32 [x], 1 | ld.cg.s32 r1,[y] ;
st.cg.s32 [y], 1 | ld.cg.s32 r2,[x] ;
ScopeTree (grid (cta (warp T0) (warp T1)))
x: shared, y: global
exists (1:r1=1 /\ 1:r2=0)

83 GPU programming
explicit hierarchical concurrency model
thread hierarchy: thread, warp, CTA (Cooperative Thread Array), grid
memory hierarchy: shared memory, global memory

84 GPU background
the GPU is a highly parallel co-processor
currently found in devices from tablets to top supercomputers
not just used for visualization anymore!
Images from Wikipedia [15,16,17]

85 References
[1] L. Lamport, "How to make a multiprocessor computer that correctly executes multi-process programs", Trans. Comput.
[2] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Litmus: Running tests against hardware", TACAS
[3] J. Alglave, L. Maranget, and M. Tautschnig, "Herding cats: modelling, simulation, testing, and data-mining for weak memory", TOPLAS
[4] NVIDIA, "CUDA C programming guide, version 6 (July 2014)"
[5] NVIDIA, "Parallel Thread Execution ISA: Version 4.0 (Feb. 2014)"
[6] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Fences in weak memory models (extended version)", FMSD 2012
[7] J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Addison-Wesley Professional, 2010

86 References
[8] J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs", CoRR, 2011
[9] B. He and J. X. Yu, "High-throughput transaction executions on graphics processors", PVLDB
[10] W. W. Collier, "Reasoning About Parallel Architectures", Prentice-Hall, Inc.
[11] D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Sequential consistency for heterogeneous-race-free", MSPC
[12] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models", ASPLOS 2014
[13] T. Sorensen, G. Gopalakrishnan, and V. Grover, "Towards shared memory consistency models for GPUs", ICS 2013
[14] W.-m. W. Hwu, "GPU Computing Gems Jade Edition", Morgan Kaufmann Publishers Inc., 2011

87 References [15] [16] [17]

89 Message passing (mp) test
Tests how to implement a handshake idiom Found in Octree code for the pony visualization

92 Methodology
empirically explore the hardware memory model implemented on deployed NVIDIA and AMD GPUs
develop hardware memory model testing tools for GPUs
analyze classic (i.e. CPU) memory model properties and communication idioms in CUDA applications
run large families of tests on GPUs as a basis for modeling and bug hunting

94 Running tests
however, unlike on CPUs, simply running the tests did not yield any weak memory behaviours on Nvidia chips!
we developed heuristics to run tests under a variety of stress to expose weak behaviours

