GPU Concurrency: Weak Behaviours and Programming Assumptions
Tyler Sorensen
Adviser: Jade Alglave
University College London
WPLI 2015, April 12, 2015
Based on our ASPLOS ‘15 paper: GPU Concurrency: Weak Behaviours and Programming Assumptions Jade Alglave1,2, Mark Batty3, Alastair F. Donaldson4, Ganesh Gopalakrishnan5, Jeroen Ketema4, Daniel Poetzl6, Tyler Sorensen1,5, John Wickerson4 1 University College London, 2 Microsoft Research, 3 University of Cambridge, 4 Imperial College London, 5 University of Utah, 6 University of Oxford
[Figures: pony visualization rendered correctly on the Intel Core i7 4500 CPU, and with visual corruption on the Nvidia Tesla C2075 GPU]
Roadmap
what happened to the pony (background)
how we found the bug (methodology)
how we are able to fix the pony (contribution)
What happened to the pony? the visualization bugs are due to weak memory behaviours on GPUs
Weak memory models
consider the test known as message passing (mp); an instance of this test appears in the pony code
initial state: x and y are memory locations (both initially 0)
thread ids: T0 and T1
program: one column of instructions per thread id
assertion: a question about the final state of the registers
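Reconstructed in the litmus-style notation used later in this talk, the mp test has roughly this shape (a sketch of the well-known test; r1 and r2 are registers of T1):

  mp
  { x=0; y=0 }
   T0     | T1      ;
   x = 1  | r1 = y  ;
   y = 1  | r2 = x  ;
  exists (1:r1=1 /\ 1:r2=0)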
Message passing (mp) test
tests how to implement a handshake idiom: T0 writes the data, then sets a flag; T1 waits for the flag, then reads the data
if the memory accesses are reordered, T1 can see the flag set yet still read stale data; a CUDA-style sketch follows
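A rough CUDA sketch of the handshake idiom (variable names and the value 42 are illustrative; volatile only stops the compiler caching the values, it does not stop the hardware reordering the accesses):

  __device__ volatile int data;  // initially 0
  __device__ volatile int flag;  // initially 0

  __device__ void producer() {   // runs as "T0"
    data = 42;                   // write the data
    flag = 1;                    // set the flag: "data is ready"
  }

  __device__ int consumer() {    // runs as "T1"
    while (flag == 0) { }        // wait for the flag
    return data;                 // on a weak model this may still be stale (0)!
  }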
if executions are restricted to interleavings of the threads' instructions, the assertion cannot be satisfied; this model is known as Lamport's sequential consistency (or SC)
Weak memory models
can we assume the assertion will never pass? No!
Weak memory models
Alglave and Maranget report that this assertion is observed 41 million times out of 5 billion test runs on a Tegra 2 ARM processor¹
¹http://diy.inria.fr/cats/tables.html
Weak memory models
what happened? architectures implement weak memory models, where the hardware is allowed to re-order certain memory instructions
weak memory models can allow weak behaviours (executions that do not correspond to any interleaving)
GPU memory models
what type of memory model do current GPUs implement?
documentation is sparse:
- CUDA has 1 page + 1 example
- PTX has 1 page + 0 examples
- given in English prose
we need to know this if we are to write correct GPU programs!
GPU programming
threads are partitioned into CTAs: CTA 0, CTA 1, ..., CTA n
each CTA has its own shared memory; all threads on the device share global memory
within CTAs, threads are grouped into warps (32 threads per warp in Nvidia GPUs)
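For orientation, a minimal CUDA sketch of the hierarchy (the kernel and buffer names are illustrative; assumes CTAs of at most 256 threads):

  __global__ void hierarchy_kernel(int *global_buf) {
    __shared__ int cta_buf[256];  // shared memory: private to this CTA
    int tid = threadIdx.x;        // thread index within the CTA
    cta_buf[tid] = tid;           // visible only to threads in this CTA
    __syncthreads();              // barrier across the CTA
    global_buf[blockIdx.x * blockDim.x + tid] = cta_buf[tid];  // global memory: visible to all CTAs
  }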
Roadmap
what happened to the pony (background)
how we found the bug (methodology)
how we are able to fix the pony (contribution)
Methodology
run GPU litmus tests on GPU hardware and in a formal model; compare the results
GPU tests
GPU litmus test considerations:
- PTX instructions
- what memory region (shared or global) are x and y in?
- are T0 and T1 in the same CTA or different CTAs?
this is recorded in a scope tree, e.g.: Scope Tree (device (cta T0) (cta T1)); x: global, y: global
a concrete instance is sketched below
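Putting these considerations together, an inter-CTA instance of mp in this format might look as follows (a sketch combining the scope tree above with the PTX instructions shown in the backup slides; not necessarily the exact test we ran):

  gpu-mp
   T0                | T1                 ;
   st.cg.s32 [x], 1  | ld.cg.s32 r1, [y]  ;
   st.cg.s32 [y], 1  | ld.cg.s32 r2, [x]  ;
  ScopeTree (device (cta T0) (cta T1))
  x: global, y: global
  exists (1:r1=1 /\ 1:r2=0)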
Running tests
we extend the litmus CPU testing tool of Alglave and Maranget to run GPU tests
given a GPU litmus test, it generates executable CUDA or OpenCL code for the test
Heuristics
memory stress: while T0 and T1 run the test program, extra threads 1..n loop, reading and writing scratchpad memory; a sketch follows
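A rough CUDA sketch of the memory stress heuristic (the scratchpad size of 4096 and the access pattern are illustrative assumptions):

  __global__ void stress_kernel(int *scratch, int iters) {
    if (blockIdx.x == 0 && threadIdx.x < 2) {
      // T0 and T1 run the test program here (elided)
    } else {
      // extra threads: loop, reading and writing scratchpad memory
      unsigned idx = (blockIdx.x * blockDim.x + threadIdx.x) % 4096;
      for (int n = 0; n < iters; n++) {
        int v = scratch[idx];               // read
        scratch[(idx + n) % 4096] = v + 1;  // write; the stride is illustrative
      }
    }
  }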
Heuristics
random threads: randomize the locations of the testing threads T0 and T1 on the device; a host-side sketch follows
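One way to realize this on the host side (a sketch; the tool's actual randomization scheme may differ, and num_threads is assumed to be at least 2):

  #include <cstdlib>

  // Choose two distinct random slots for the testing threads T0 and T1;
  // the remaining slots run stress code.
  void pick_test_threads(int num_threads, int &t0, int &t1) {
    t0 = std::rand() % num_threads;
    do { t1 = std::rand() % num_threads; } while (t1 == t0);
  }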
Heuristics
# of weak behaviours in 100,000 runs for different heuristics on an Nvidia Tesla C2075:

test   | none | random threads | memory stress + random threads
gpu-mp | 0    | 139            | 522
How we found the pony bug
this is the idiom (gpu-mp) and these are the heuristics that exposed the bug!
Roadmap
what happened to the pony (background)
how we found the bug (methodology)
how we are able to fix the pony (contribution)
GPU fences
PTX gives 2 fences to disallow reading stale data:
- membar.cta – gives ordering intra-CTA
- membar.gl – gives ordering over the whole device
GPU fences
test amended with a parameterizable fence between the two accesses in each thread (see the instantiation below)
Scope Tree (device (cta T0) (cta T1)); x: global, y: global
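For example, instantiating the fence with membar.gl gives the following (a sketch in the same format; the membar.cta instantiation replaces the fence instruction):

  gpu-mp+membar.gl
   T0                | T1                 ;
   st.cg.s32 [x], 1  | ld.cg.s32 r1, [y]  ;
   membar.gl         | membar.gl          ;
   st.cg.s32 [y], 1  | ld.cg.s32 r2, [x]  ;
  ScopeTree (device (cta T0) (cta T1))
  x: global, y: global
  exists (1:r1=1 /\ 1:r2=0)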
GPU fences
# of weak behaviours in 100,000 runs for different fences on an Nvidia Tesla C2075:

test   | none | membar.cta | membar.gl
gpu-mp | 3380 | 2          | 0
How do we fix the pony
adding fences to the code; a source-level sketch follows
[Figures: pony visualization on the Tesla C2075 Nvidia GPU, corrupted without fences and rendered correctly with fences]
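In CUDA source, a sketch of the fix applied to the earlier handshake idiom; __threadfence() corresponds to membar.gl at the PTX level (and __threadfence_block() to membar.cta):

  __device__ volatile int data;  // initially 0
  __device__ volatile int flag;  // initially 0

  __device__ void producer() {
    data = 42;
    __threadfence();  // orders the data write before the flag write, device-wide
    flag = 1;
  }

  __device__ int consumer() {
    while (flag == 0) { }
    __threadfence();  // orders the flag read before the data read
    return data;      // should now observe 42
  }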
GPU testing campaign
we extend the diy CPU litmus test generation tool of Alglave and Maranget to generate GPU tests:
- generates litmus tests based on cycles
- enumerates the tests over the GPU thread and memory hierarchy
GPU testing campaign
using our tools, we generated and ran 10930 tests on 5 Nvidia chips:

chip        | year | architecture
GTX 750 Ti  | 2014 | Maxwell
GTX Titan   | 2013 | Kepler
GTX 660     | 2012 | Kepler
GTX 540m    | 2011 | Fermi
Tesla C2075 | 2011 | Fermi
GPU testing campaign Results are hosted at: http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/flat.html
Modeling
we extended the CPU axiomatic memory modeling tool herd of Alglave and Maranget for GPUs
we developed an axiomatic memory model for PTX which is able to simulate all of our tests
our model is sound with respect to all of our hardware observations
Modeling Demo of web interface
More results
surprising and buggy behaviours observed:
- GPU mutex implementations allow stale data to be read (found in the CUDA by Example book and other academic papers¹,²); this led to an erratum issued by Nvidia
- hardware re-orders loads from the same address on Nvidia Fermi and Kepler
- some testing on AMD GPUs
¹J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs," CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf
²B. He and J. X. Yu, "High-throughput transaction executions on graphics processors," PVLDB 2011
Related work (CPU memory models)
Alglave et al. have done extensive work on testing and modeling CPUs (notably IBM Power and ARM) and created the tools diy, litmus, and herd, which we extended for this work
Collier tested CPU memory models using the ARCHTEST tool
Related work (GPU memory models)
Hower et al. have proposed several SC-for-race-free language-level memory models for GPUs
Questions?
project page: http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/
[Figures: pony visualization on the Intel Core i7 4500 CPU, the Nvidia Tesla C2075 GPU, and the Nvidia Tesla C2075 GPU with fences]
CUDA by Example
[Figures: CUDA by Example test run on the Intel Core i7 4500 CPU, the Nvidia Tesla C2075 GPU, and the Nvidia Tesla C2075 GPU with fences]
Read-after-Read Hazard
Backup slides (ignore after this)
Results
surprising and buggy behaviours observed:
SC-per-location violations (read-after-read hazards) on the Nvidia Fermi and Kepler architectures; the CoRR test sketched below illustrates the shape
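For reference, the standard CoRR (coherence of read-read) test in the same litmus format (a sketch of the well-known shape, not necessarily the exact test we ran); the weak outcome r1=1, r2=0 shows T1's two loads of x appearing out of order:

  corr
   T0                | T1                 ;
   st.cg.s32 [x], 1  | ld.cg.s32 r1, [x]  ;
                     | ld.cg.s32 r2, [x]  ;
  ScopeTree (device (cta T0) (cta T1))
  x: global
  exists (1:r1=1 /\ 1:r2=0)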
Limitations
warps: we do not test intra-warp behaviours, as the lock-step execution of warps is not compatible with some of our heuristics
grids: we do not test inter-grid behaviours, as we did not find any examples in the literature
GPU programming
GPUs are SIMT (Single Instruction, Multiple Thread)
Nvidia GPUs may be programmed using CUDA or OpenCL
Roadmap
background and motivation
approach
GPU tests
running tests
modeling
Heuristics
two additional heuristics:
synchronization: testing threads synchronize immediately before running the test program
general bank conflicts: generate memory accesses that conflict with the accesses in the memory stress heuristic
Challenges
the PTX optimizing assembler may reorder or remove instructions
we developed a tool, optcheck, which compares the litmus test against the compiled binary and checks for such optimizations
Roadmap
background and motivation
approach
GPU tests
running tests
modeling
GPU tests
concrete GPU test:

 T0                | T1                 ;
 st.cg.s32 [x], 1  | ld.cg.s32 r1, [y]  ;
 st.cg.s32 [y], 1  | ld.cg.s32 r2, [x]  ;
ScopeTree (grid (cta (warp T0) (warp T1)))
x: shared, y: global
exists (1:r1=1 /\ 1:r2=0)
GPU programming
explicit hierarchical concurrency model
thread hierarchy: thread, warp, CTA (Cooperative Thread Array), grid
memory hierarchy: shared memory, global memory
GPU background
a GPU is a highly parallel co-processor, currently found in devices from tablets to top supercomputers
not just used for visualization anymore!
Images from Wikipedia [15,16,17]
References
[1] L. Lamport, "How to make a multiprocessor computer that correctly executes multi-process programs," Trans. Comput. 1979.
[2] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Litmus: Running tests against hardware," TACAS 2011.
[3] J. Alglave, L. Maranget, and M. Tautschnig, "Herding cats: modelling, simulation, testing, and data-mining for weak memory," TOPLAS 2014.
[4] NVIDIA, "CUDA C programming guide, version 6 (July 2014)," http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
[5] NVIDIA, "Parallel Thread Execution ISA: Version 4.0 (Feb. 2014)," http://docs.nvidia.com/cuda/parallel-thread-execution
[6] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Fences in weak memory models (extended version)," FMSD 2012.
[7] J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming," Addison-Wesley Professional, 2010.
References
[8] J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs," CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf.
[9] B. He and J. X. Yu, "High-throughput transaction executions on graphics processors," PVLDB 2011.
[10] W. W. Collier, "Reasoning About Parallel Architectures," Prentice-Hall, Inc., 1992.
[11] D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Sequential consistency for heterogeneous-race-free," MSPC 2013.
[12] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models," ASPLOS 2014.
[13] T. Sorensen, G. Gopalakrishnan, and V. Grover, "Towards shared memory consistency models for GPUs," ICS 2013.
[14] W.-m. W. Hwu, "GPU Computing Gems Jade Edition," Morgan Kaufmann Publishers Inc., 2011.
References [15] http://en.wikipedia.org/wiki/Samsung_Galaxy_S5 [16] http://en.wikipedia.org/wiki/Titan_(supercomputer) [17] http://en.wikipedia.org/wiki/Barnes_Hut_simulation
Roadmap
what happened to the pony (background)
how we found the bug (methodology)
how we are able to fix the pony (contribution)
Message passing (mp) test
tests how to implement a handshake idiom; found in the Octree code for the pony visualization
Methodology
empirically explore the hardware memory model implemented on deployed NVIDIA and AMD GPUs
develop hardware memory model testing tools for GPUs
analyze classic (i.e. CPU) memory model properties and communication idioms in CUDA applications
run large families of tests on GPUs as a basis for modeling and bug hunting
Running tests
however, unlike on CPUs, simply running the tests did not yield any weak memory behaviours on Nvidia chips!
we developed heuristics to run tests under a variety of stress to expose weak behaviours