TESTING AND EXPOSING WEAK GPU MEMORY MODELS


TESTING AND EXPOSING WEAK GPU MEMORY MODELS MS Thesis Defense by Tyler Sorensen Advisor : Ganesh Gopalakrishnan May 30, 2014

Joint Work with: Jade Alglave (University College London), Daniel Poetzl (University of Oxford), Luc Maranget (Inria), Alastair Donaldson, John Wickerson, (Imperial College London), Mark Batty (University of Cambridge)

Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion


GPU Background GPU is a highly parallel co-processor Currently found in devices from tablets to top super computers (Titan) Not just used for visualization anymore! Images from Wikipedia [16,17,18]

GPU Programming Explicit Hierarchical concurrency model Thread Hierarchy: Thread Warp CTA (Cooperative Thread Array) Kernel (GPU program) Memory Hierarchy: Shared Memory Global Memory
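
To make the hierarchy concrete, here is a minimal CUDA sketch (the kernel and buffer names are hypothetical, not taken from the thesis): each thread belongs to a warp of 32 threads, warps are grouped into a CTA, shared memory is per-CTA, and global memory is visible to the whole kernel.

#include <cuda_runtime.h>

// Hypothetical kernel illustrating the thread and memory hierarchy.
__global__ void hierarchy_demo(int *global_buf) {
    __shared__ int cta_buf[64];                   // shared memory: per-CTA
    int tid_in_cta = threadIdx.x;                 // thread index within the CTA
    int warp_id    = tid_in_cta / 32;             // warp index within the CTA
    int global_tid = blockIdx.x * blockDim.x + tid_in_cta;

    cta_buf[tid_in_cta] = warp_id;                // intra-CTA communication
    __syncthreads();                              // intra-CTA barrier
    global_buf[global_tid] = cta_buf[tid_in_cta]; // global memory: device-wide
}

int main() {
    int *buf;
    cudaMalloc((void **)&buf, 2 * 64 * sizeof(int));
    hierarchy_demo<<<2, 64>>>(buf);               // kernel: 2 CTAs of 64 threads (2 warps each)
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}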

GPU Programming

GPU Programming GPUs are SIMT (Single Instruction, Multiple Thread) NVIDIA GPUs may be programmed using CUDA or OpenCL

GPU Programming

Weak Memory Models Consider the test known as Store Buffering (SB)

Weak Memory Models Consider the test known as Store Buffering (SB) Initial State: x and y are memory locations

Weak Memory Models Consider the test known as Store Buffering (SB) Thread IDs

Weak Memory Models Consider the test known as Store Buffering (SB) Program: for each thread ID

Weak Memory Models Consider the test known as Store Buffering (SB) Assertion: question about the final state of registers

Weak Memory Models Consider the test known as Store Buffering (SB) Can this assertion be satisfied?
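
The litmus-test figure itself is not reproduced in this transcript. As an illustration only, the SB shape can be sketched as a CUDA kernel (thread placement, names, and the host-side assertion are assumptions of the sketch, not the thesis' test harness):

// Store Buffering (SB) shape sketched as a CUDA kernel; thread 0 of CTA 0 and
// thread 0 of CTA 1 act as the two testing threads (launch with <<<2, 1>>>).
__global__ void sb_test(volatile int *x, volatile int *y, int *r0, int *r1) {
    if (threadIdx.x != 0) return;
    if (blockIdx.x == 0) {          // T0
        *x = 1;                     // store to x
        *r0 = *y;                   // then load y
    } else if (blockIdx.x == 1) {   // T1
        *y = 1;                     // store to y
        *r1 = *x;                   // then load x
    }
}
// Assertion checked on the host afterwards: can r0 == 0 && r1 == 0 ?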

The assertion cannot be satisfied by any interleaving of the instructions. This is known as sequential consistency (or SC) [1]

Weak Memory Models Can we assume assertion will never pass?

Weak Memory Models Can we assume assertion will never pass? No!

Weak Memory Models Executing this test with the Litmus tool [2] on an Intel i7 x86 processor for 1000000 iterations, we get the following histogram of results:

Weak Memory Models What Happened? Architectures implement weak memory models where the hardware is allowed to re-order certain memory instructions. On x86 architectures, the hardware is allowed to re-order write instructions with program-order later read instructions [3]

GPU Memory Models What type of memory model do current GPUs implement? Documentation is sparse CUDA has 1 page + 1 example [4] PTX has 1 page + 0 examples [5] No specifics about which instructions are allowed to be re-ordered We need to know if we are to write correct GPU programs!

Our Approach Empirically explore the memory model implemented on deployed NVIDIA GPUs Achieved by developing a memory model testing tool for NVIDIA GPUs with specialized heuristics We analyze classic memory model properties and CUDA applications in this framework with unexpected results We test large families of tests on GPUs as a basis for modeling and bug hunting

Our Approach Disclaimer: Testing is not guaranteed to reveal all behaviors

Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion

Prior Work Testing Memory Models: Pioneered by Bill Collier in ARCHTEST in 1992 [6] TSOTool in 2004 [7] Litmus in 2011 [2] We extend this tool

Prior Work (GPU Memory Models) June 2013: Hower et al. proposed an SC-for-race-free memory model for GPUs [8] Sorensen et al. proposed an operational weak GPU memory model based on available documentation [9] 2014: Hower et al. proposed two SC-for-race-free memory models for GPUs, HRF-direct and HRF-indirect [10] It remains unclear what memory model deployed GPUs implement

Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion

Testing Framework GPU litmus test

Testing Framework GPU litmus test PTX instructions

Testing Framework GPU litmus test What memory region (shared or global) are x and y in?

Testing Framework GPU litmus test Are T0 and T1 in the same CTA? Or different CTAs?

Testing Framework We consider three different GPU configurations for tests: D-warp:S-cta-Shared: Different warp, Same CTA, targeting shared memory D-warp:S-cta-Global: Different warp, Same CTA, targeting global memory D-cta:S-ker-Global: Different CTA, Same kernel, targeting global memory

Testing Framework Given a GPU litmus test, produce an executable (CUDA or OpenCL)

Testing Framework Host (CPU) generated code

Testing Framework Kernel generated code

Testing Framework The basic framework shows NO weak behaviors. We develop heuristics (which we dub incantations) to encourage weak behaviors to appear

Testing Framework General bank conflict incantation Each access in test is exclusively one of: Optimal

Testing Framework General bank conflict incantation Each access in test is exclusively one of: Optimal Broadcast

Testing Framework General bank conflict incantation Each access in test is exclusively one of: Optimal Broadcast Bank Conflict

Testing Framework General Bank Conflict Heuristic Given this test:

Testing Framework General Bank Conflict Heuristic One possible general bank conflict scheme: Bank Conflict Optimal Optimal Broadcast
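
For intuition only, here is a sketch of what the three access kinds look like for shared memory under the usual 32-bank, 4-byte-word layout. This illustrates the access patterns themselves; it is not the thesis' exact incantation or address assignment.

// Illustrative access patterns for one warp over a 32-bank shared array.
__global__ void access_patterns(int *out) {
    __shared__ int buf[32 * 32];
    int lane = threadIdx.x % 32;      // lane index within the warp

    // The values read are irrelevant here; only the address patterns matter.
    int optimal   = buf[lane];        // consecutive words: one word per bank
    int broadcast = buf[0];           // every lane reads the same word
    int conflict  = buf[lane * 32];   // stride of 32 words: all lanes hit bank 0

    out[blockIdx.x * blockDim.x + threadIdx.x] = optimal + broadcast + conflict;
}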

Testing Framework Two critical incantations (without them we observe no weak executions): General Bank Conflicts (shown previously) Memory Stress: All non-testing threads read/write to memory
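
A rough sketch of the memory-stress idea (the scratch buffer, iteration count, and indexing below are assumptions, not the thesis' exact code): every thread that is not one of the testing threads keeps loading from and storing to a scratch region of global memory while the test runs.

// Non-testing threads call this to keep the memory system busy during the test.
__device__ void memory_stress(volatile int *scratch, int scratch_size, int iters) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) % scratch_size;
    for (int i = 0; i < iters; ++i) {
        scratch[idx] = i;                              // repeated stores...
        idx = (idx + scratch[idx] + 1) % scratch_size; // ...and loads to varying addresses
    }
}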

Testing Framework Two extra incantations: Sync: testing threads synchronize before test Randomization: testing thread IDs are randomized

Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion

Traditional Tests We show the results for these tests, which have been studied for CPUs in [3]: MP (Message Passing): can stale values be read in a handshake idiom? SB (Store Buffering): can stores be buffered after loads? LD (Load Delaying): can loads be delayed after stores? Results are from running 100,000 iterations on 3 chips: Tesla C2075 (Fermi), GTX Titan (Kepler), and GTX 750 (Maxwell)

Message Passing Tests how to implement a handshake idiom

Message Passing Tests how to implement a handshake idiom Flag Flag

Message Passing Tests how to implement a handshake idiom Data Data

Message Passing Tests how to implement a handshake idiom Stale Data

Message Passing

Message Passing How do we disallow reading stale data? PTX provides 2 fences for intra-device ordering [5, p. 165]: membar.cta – gives ordering properties intra-CTA membar.gl – gives ordering properties across the device

Message Passing Test amended with a parameterizable fence
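
The amended test appears as a figure in the slides. As a sketch only, the MP shape with a fence can be written in CUDA using inline PTX (the kernel and macro names are hypothetical; membar.cta and membar.gl correspond to the CUDA intrinsics __threadfence_block() and __threadfence() respectively):

// Parameterizable fence: swap the PTX string for membar.cta or membar.gl.
#define FENCE() asm volatile("membar.gl;" ::: "memory")

// Message Passing (MP) shape with fences; launch with <<<2, 1>>>.
__global__ void mp_test(volatile int *data, volatile int *flag, int *r0, int *r1) {
    if (threadIdx.x != 0) return;
    if (blockIdx.x == 0) {          // T0: producer
        *data = 1;                  // write the data
        FENCE();                    // order the data write before the flag write
        *flag = 1;                  // raise the flag
    } else if (blockIdx.x == 1) {   // T1: consumer
        *r0 = *flag;                // read the flag
        FENCE();                    // order the flag read before the data read
        *r1 = *data;                // read the data
    }
}
// Weak outcome under test: r0 == 1 && r1 == 0 (flag observed set, data stale).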

Message Passing

Store Buffering Can stores be delayed after loads?

Store Buffering

Load Delaying Can loads be delayed after stores?

Load Delaying

CoRR Test Coherence is SC per memory location [11, p. 14] Modern processors (ARM, POWER, x86) implement coherence All language models require coherence (C++11, OpenCL 2.0) Coherence violations have been observed and confirmed as bugs in ARM chips [3, 12]

CoRR Test Coherence of Read-Read test Can loads from the same location return stale values?
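
Again the test figure is not reproduced here; as an illustrative sketch (names and thread placement are assumptions), the CoRR shape in CUDA is:

// Coherence of Read-Read (CoRR) shape; launch with <<<2, 1>>>.
__global__ void corr_test(volatile int *x, int *r0, int *r1) {
    if (threadIdx.x != 0) return;
    if (blockIdx.x == 0) {          // T0
        *x = 1;                     // single store to x
    } else if (blockIdx.x == 1) {   // T1
        *r0 = *x;                   // first load of x
        *r1 = *x;                   // second load of the same location
    }
}
// Coherence violation under test: r0 == 1 && r1 == 0, i.e. the second load
// returns an older value than the first load of the same location.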

CoRR Test

CoRR Test Coherence of Read-Read test Test amended with a parameterized fence

CoRR Test

Results Take Away Current GPUs implement observably weak memory models with scoped properties. Without formal docs, how can developers know what behaviors to rely on? This is biting developers even now (discussed next)

Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion

GPU Spin Locks Inter-CTA lock presented in the book CUDA By Example [13]
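
The book's code is shown as a figure in the slides; roughly, it is an atomicCAS spin loop of the following shape (a sketch from memory, not verbatim from [13]). Note that there are no fences around the critical section.

// Inter-CTA spin lock in the style of CUDA By Example [13] (sketch, not verbatim).
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0)   // spin until we atomically flip 0 -> 1
        ;
}

__device__ void unlock(int *mutex) {
    atomicExch(mutex, 0);                 // release by writing 0 back
}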

GPU Spin Locks Distilled to a litmus test (y is mutex, x is data):

GPU Spin Locks Distilled to a litmus test (y is mutex, x is data): Initially Locked by T0

GPU Spin Locks Distilled to a litmus test (y is mutex, x is data): CS* Unlock *CS = Critical Section

GPU Spin Locks Distilled to a litmus test (y is mutex, x is data): CS* *CS = Critical Section

GPU Spin Locks Distilled to a litmus test (y is mutex, x is data): T1 Observes Stale Value *CS = Critical Section

GPU Spin Locks Distilled to a litmus test (y is mutex, x is data): *CS = Critical Section

GPU Spin Locks Do we observe stale data in the Critical Section?

GPU Spin Locks Do we observe stale data in the Critical Section? Yes!

GPU Spin Locks Spin lock test amended with fences

GPU Spin Locks Now test with fences:

GPU Spin Locks Now test with fences: Is membar.cta enough?

GPU Spin Locks Now test with fences: Is membar.cta enough? No! It is an inter-CTA lock! Is membar.gl enough? Is membar.cta enough?

GPU Spin Locks Now test with fences: Is membar.cta enough? No! It is an inter-CTA lock! Is membar.gl enough? Yes!
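
In CUDA source, membar.gl corresponds to the __threadfence() intrinsic (and membar.cta to __threadfence_block()). A sketch of a fence placement consistent with the result above, with hypothetical names:

__device__ void lock_with_fence(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0)
        ;
    __threadfence();   // membar.gl: order the acquire before critical-section accesses
}

__device__ void unlock_with_fence(int *mutex) {
    __threadfence();   // membar.gl: order critical-section accesses before the release
    atomicExch(mutex, 0);
}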

GPU Spin Lock More examples without fences, which have similar issues: Mutex in Efficient Synchronization Primitives for GPUs [14] Non-blocking GPU deque in GPU Computing Gems Jade Edition [15] GPU applications must use fences!!!

Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion

Bulk Testing Daniel Poetzl (University of Oxford) is developing GPU extensions to DIY test generation [3] Test generation is based on critical cycles Used for validating models, finding bugs, gaining intuition about observable behaviors Image used with permission from [3]

Bulk Testing We have generated over 8000 tests across intra/inter-CTA interactions, targeting both shared and global memory Tests include memory barriers (e.g. membar.{cta,gl,sys}) and dependencies (data, address, and control) Tested 5 chips across 3 generations: GTX 540m (Fermi), Tesla C2075 (Fermi), GTX 660 (Kepler), GTX Titan (Kepler), GTX 750 Ti (Maxwell)

Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion

Future Work Test more complicated GPU configurations (e.g. both shared and global in the same test) Example: Intra-CTA Store Buffering (SB) test is observable on Maxwell only with mixed shared and global memory locations.

Future Work Axiomatic memory model in Herd [3] New scoped relations: Internal–CTA: Contains pairs of instructions that are in the same CTA Can easily compare model to observations Based on acyclic relations Image used with permission from [3]

Conclusion Current GPUs have observably weak memory models which are largely undocumented GPU programming is proceeding without adequate guidelines, which results in buggy code (development of reliable GPU code is impossible without specs) Rigorous documentation, testing, and verification of GPU programs based on formal tools is the way forward for developing reliable GPU applications

References [1] L. Lamport, "How to make a multiprocessor computer that correctly executes multi-process programs," IEEE Trans. Comput., pp. 690-691, Sep. 1979. [2] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Litmus: Running tests against hardware," ser. TACAS'11. Springer-Verlag, pp. 41-44. [3] J. Alglave, L. Maranget, and M. Tautschnig, "Herding cats: modelling, simulation, testing, and data-mining for weak memory," 2014, to appear in TOPLAS. [4] NVIDIA, "CUDA C programming guide, version 6," http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf, July 2014. [5] NVIDIA, "Parallel Thread Execution ISA: Version 4.0 (Feb. 2014)," http://docs.nvidia.com/cuda/parallel-thread-execution. [6] W. W. Collier, Reasoning About Parallel Architectures. Prentice-Hall, Inc., 1992. [7] S. Hangal, D. Vahia, C. Manovit, and J.-Y. J. Lu, "TSOtool: A program for verifying memory systems using the memory consistency model," ser. ISCA '04. IEEE Computer Society, 2004, pp. 114.

References [8] D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Sequential consistency for heterogeneous-race-free," ser. MSPC'13. ACM, 2013. [9] T. Sorensen, G. Gopalakrishnan, and V. Grover, "Towards shared memory consistency models for GPUs," ser. ICS'13. ACM, 2013, pp. 489-490. [10] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models," ser. ASPLOS'14. ACM, 2014, pp. 427-440. [11] D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence, ser. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2011. [12] ARM, "Cortex-A9 MPCore, programmer advice notice, read-after-read hazards," ARM Reference 761319. http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A a9 read read.pdf, accessed: May 2014. [13] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.

References [14] J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs," CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf. [15] W.-m. W. Hwu, GPU Computing Gems Jade Edition. Morgan Kaufmann Publishers Inc., 2011. [16] http://en.wikipedia.org/wiki/Samsung_Galaxy_S5 [17] http://en.wikipedia.org/wiki/Titan_(supercomputer) [18] http://en.wikipedia.org/wiki/Barnes_Hut_simulation

Acknowledgements Advisor: Ganesh Gopalakrishnan Committee: Zvonimir Rakamaric, Mary Hall UK Group: Jade Alglave (University College London), Daniel Poetzl (University of Oxford), Luc Maranget (Inria), John Wickerson, Alastair Donaldson (Imperial College London), Mark Batty (University of Cambridge) Mohammed for feedback on practice runs

Thank You

Prior Work (GPU Memory Models) June 2010: Feng and Xiao revisit their GPU device-wide synchronization method [?] to repair it with fences [?] Speaking about weak behaviors, they state: In practice, it is infinitesimally unlikely that this will ever happen given the amount of time that is spent spinning at the barrier, e.g., none of our thousands of experimental runs ever resulted in an incorrect answer. Furthermore, no existing literature has been able to show how to trigger this type of error.

Testing Framework Evaluate inter-CTA incantations using these tests: MP: checks if stale values can be read in a handshake idiom LD: checks if loads can be delayed after stores SB: checks if stores can be delayed after loads Results show the average of running 100,000 iterations on 3 chips: Tesla C2075 (Fermi), GTX Titan (Kepler), and GTX 750 (Maxwell)

Inter-CTA interactions

Without Critical Incantations, No Weak Behaviors Are Observed Inter-CTA interactions

Inter-CTA interactions

Most Effective Incantations Inter-CTA interactions

Testing Framework Evaluate intra-CTA incantations using these tests*: MP-Global: Message Passing tests targeting the global memory region MP-Shared: Message Passing tests targeting the shared memory region * The previous tests (LD, SB) are not observable intra-CTA

Intra-CTA interactions

Without Critical Incantations, No Weak Behaviors Are Observed Intra-CTA interactions

Intra-CTA interactions

Most Effective Incantations Intra-CTA interactions

Bulk Testing Invalidated GPU memory model from [?] Model disallows behaviors observed on hardware Gives too strong of orderings to load operations inter-CTA

GPU Hardware Multiple SMs (Streaming Multiprocessors) SMs contain CUDA cores Each SM has an L1 cache All SMs share an L2 cache and DRAM The warp scheduler executes threads in groups of 32

GPU Hardware

GPU Programming to Hardware Threads in same CTA are mapped to same SM Shared memory is in L1 (Maxwell is an Exception) Global memory is in DRAM and cached in L2 (Fermi is an Exception) Warp scheduler executes threads in groups of 32

Testing Framework

Testing Framework Initial value of shared memory locations

Testing Framework Thread IDs

Testing Framework Programs (written in NVIDIA PTX)

Testing Framework Assertion about final state of system

GPU Terminology We Use