Verification of Producer-Consumer Synchronization in GPU Programs
Rahul Sharma (Stanford), Michael Bauer (NVIDIA Research), Alex Aiken (Stanford)
June 15, 2015

Presentation transcript:

Outline
- GPU background
- Motivating examples
- Verification algorithm and implementation
- Results

GPU background
- A GPU has off-chip global memory and a set of streaming multiprocessors (SMs); each SM has on-chip ALUs and shared memory (up to 48 KB).
- A threadblock (CTA) contains ~100s of threads; a warp is 32 threads.
- Typical kernel structure: load data from global to shared memory, __syncthreads() barrier, compute on the data in shared memory, __syncthreads() barrier, store data from shared back to global.
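
A minimal sketch of that load/compute/store pattern in CUDA; the kernel name, tile size, and the doubling computation are illustrative assumptions, not code from the talk:

    #define TILE 256

    __global__ void scale_tile(const float *in, float *out, int n) {
        __shared__ float buf[TILE];                 // on-chip shared memory
        int i = blockIdx.x * TILE + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // global -> shared
        __syncthreads();                            // all loads visible
        float v = 2.0f * buf[threadIdx.x];          // compute on shared data
        __syncthreads();                            // compute done before buffer reuse
        if (i < n) out[i] = v;                      // shared -> global
    }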

Named barriers
- Synchronization primitive built into the hardware: 16 named barriers per SM.
- Two instructions: Sync (blocking) and Arrive (non-blocking); both specify a participating thread count.
- __syncthreads() is a special case of named barriers: Sync 0, N.
- Named barriers encode producer-consumer patterns: e.g., a producer warp issues Arrive 0, 64 on named barrier 0 while a consumer warp issues Sync 0, 64.
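
In CUDA source these instructions are reachable through inline PTX (bar.arrive / bar.sync). A minimal sketch of the producer-consumer handoff above, assuming the kernel is launched with exactly two warps (64 threads); the buffer and values are illustrative:

    __global__ void handoff(float *out) {
        __shared__ float buf[32];
        int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
        if (warp == 0) {                                     // producer warp
            buf[lane] = (float)lane;                         // produce into shared memory
            asm volatile("bar.arrive 0, 64;" ::: "memory");  // signal barrier 0, don't block
        } else {                                             // consumer warp
            asm volatile("bar.sync 0, 64;" ::: "memory");    // block until producer arrives
            out[lane] = buf[lane];                           // safe to read produced data
        }
    }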

CudaDMA library (SC 2011)
- Simple library to abstract data movement between GPU memories: global to shared, shared to global.
- Specializes warps: compute warps do math, DMA warps move data.
- Uses named barriers to synchronize transfers, and more named barriers for double buffering.
- Per-iteration protocol: compute warps issue start_xfer (arrive 0,N), then wait_finish (sync 1,N), then compute on the shared buffer; DMA warps issue wait_start (sync 0,N), load data into the shared buffer, then finish_xfer (arrive 1,N).
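
A sketch of this warp specialization. The helper names mirror the slide's primitives, but their bodies, the 96-thread configuration (two compute warps plus one DMA warp), and the kernel itself are assumptions for illustration, not the actual CudaDMA API:

    // Hypothetical helpers modeled on the slide's primitives; not the CudaDMA API.
    __device__ void start_xfer()  { asm volatile("bar.arrive 0, 96;" ::: "memory"); }
    __device__ void wait_start()  { asm volatile("bar.sync   0, 96;" ::: "memory"); }
    __device__ void finish_xfer() { asm volatile("bar.arrive 1, 96;" ::: "memory"); }
    __device__ void wait_finish() { asm volatile("bar.sync   1, 96;" ::: "memory"); }

    __global__ void specialized(const float *in, float *out, int iters) {
        __shared__ float buf[64];
        if (threadIdx.x >= 64) {                          // DMA warp (warp 2)
            int lane = threadIdx.x - 64;
            for (int it = 0; it < iters; ++it) {
                wait_start();                             // buffer free to overwrite?
                buf[2 * lane]     = in[it * 64 + 2 * lane];
                buf[2 * lane + 1] = in[it * 64 + 2 * lane + 1];
                finish_xfer();                            // signal: data ready
            }
        } else {                                          // compute warps (warps 0-1)
            int lane = threadIdx.x;
            for (int it = 0; it < iters; ++it) {
                start_xfer();                             // signal: buffer may be refilled
                wait_finish();                            // wait for this tile
                out[it * 64 + lane] = 2.0f * buf[lane];   // compute on shared data
            }
        }
    }

With a single buffer, the compute warps' next start_xfer doubles as the "buffer consumed" signal; double buffering would use two more named barriers, as the slide notes.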

Singe compiler (PPoPP 2014)
- DSL compiler for combustion chemistry; up to 4X speedup; kernels contain 10K lines.
- Maps static dataflow graphs onto warps, using shared memory for communication.
- Assigns synchronization points to named barriers (analogous to register allocation) and manages the passing of data through shared memory.
(Figure: a static dataflow graph with nodes A through J mapped across warps 0-3.)

Named barrier challenges
Three challenges:
- Named barrier reuse: must prove that it is safe to recycle named barriers; requires a happens-before relationship that is self-consistent.
- Deadlock.
- Shared memory races: two accesses to the same location with at least one being a write.
(Figure: the dataflow graph again, plus a two-warp deadlock where warp 0 executes sync 0; arrive 1 while warp 1 executes sync 1; arrive 0.)
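
The deadlock in that figure is easy to reproduce. A sketch, again assuming a two-warp (64-thread) launch: each warp blocks on a barrier that only the other warp, now also blocked, would release:

    __global__ void deadlock() {
        int warp = threadIdx.x / 32;
        if (warp == 0) {
            asm volatile("bar.sync   0, 64;" ::: "memory");  // blocks: needs warp 1's arrive on 0
            asm volatile("bar.arrive 1, 64;" ::: "memory");  // never reached
        } else {
            asm volatile("bar.sync   1, 64;" ::: "memory");  // blocks: needs warp 0's arrive on 1
            asm volatile("bar.arrive 0, 64;" ::: "memory");  // never reached
        }
    }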

WEFT architecture
- A GPU kernel for a threadblock of n threads is compiled into n straight-line thread programs (0 through n-1).
- WEFT computes a happens-before relation over them and checks for deadlocks, improper barrier recycling, and shared memory data races.

Thread programs
- Omit statements irrelevant to the checked properties.
- Straight-line programs: sequences of commands.
- Commands: sync b [m], arrive b [m], read a, write a.
- Restrictive, but followed by the majority of GPU code.
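
As a host-side sketch, a thread program can be represented as a plain command list; these types are assumptions for illustration, not WEFT's actual data structures:

    #include <vector>

    enum class Op { Sync, Arrive, Read, Write };

    struct Command {
        Op  op;
        int barrier;   // sync/arrive: named barrier id b (0-15)
        int count;     // sync/arrive: participating count m
        int addr;      // read/write: shared memory address a
    };

    using ThreadProgram = std::vector<Command>;  // straight-line: no branches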

Well synchronization
- "The synchronization pattern is deterministic": the same commands always synchronize together, and no barrier does double duty.
- Commands obey barrier generations.
- Subsumes deadlock freedom and safe barrier recycling.
Example: a producer executes sync 0; write a; arrive 1 while a consumer executes sync 1; read a; sync 0. The opening sync 0 belongs to generation 1 of barrier 0, the arrive 1 / sync 1 pair forms generation 1 of barrier 1, and the closing sync 0 belongs to generation 2 of barrier 0.

Check well synchronization
- Need to know which commands synchronize together, and what the generation of the corresponding barrier is.
- First challenge: how to infer this information?
- Generations are invariant over all executions.
- So: statically emulate one execution, record the synchronization, then check that all executions respect the recorded generations.
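
A minimal host-side sketch of that emulation, reusing the Op/Command/ThreadProgram types sketched above; the scheduling loop and the per-barrier bookkeeping are assumptions for illustration, not WEFT's implementation. Each thread runs until it blocks at a sync; a barrier fires (completing one generation) once its registered threads reach the participating count:

    #include <vector>
    using std::vector;

    struct Generation { int barrier; vector<int> threads; };

    // Returns false if the emulation gets stuck (deadlock or count mismatch).
    bool emulate(const vector<ThreadProgram>& progs, vector<Generation>& gens) {
        int n = (int)progs.size();
        vector<size_t> pc(n, 0);
        vector<int> blockedOn(n, -1);        // barrier a thread is sync'd on, or -1
        vector<vector<int>> reg(16);         // threads registered per named barrier
        vector<int> need(16, 0);             // participating count (mismatches not validated here)
        for (;;) {
            bool progress = false;
            for (int t = 0; t < n; ++t) {
                if (blockedOn[t] >= 0) continue;
                while (pc[t] < progs[t].size()) {
                    const Command& c = progs[t][pc[t]++];
                    if (c.op == Op::Read || c.op == Op::Write) continue;
                    reg[c.barrier].push_back(t);                   // register at barrier
                    need[c.barrier] = c.count;
                    progress = true;
                    if (c.op == Op::Sync) { blockedOn[t] = c.barrier; break; }
                }
            }
            for (int b = 0; b < 16; ++b) {
                if (reg[b].empty() || (int)reg[b].size() != need[b]) continue;
                gens.push_back(Generation{b, reg[b]});             // one generation completes
                for (int t : reg[b])
                    if (blockedOn[t] == b) blockedOn[t] = -1;      // release the syncers
                reg[b].clear();
                progress = true;
            }
            bool done = true;
            for (int t = 0; t < n; ++t)
                if (pc[t] < progs[t].size() || blockedOn[t] >= 0) done = false;
            if (done) return true;
            if (!progress) return false;                           // stuck: report deadlock
        }
    }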

Happens before
- The HB relation is reachability: A happens before B if there is a path from A to B, and the path has at least one synchronization edge (a "black edge" in the slide's figure; the remaining edges are program order).
- Check that successive generations of each barrier have an HB relationship.
- Main result: the HB relation is sound and precise.
(Figure: the producer commands write a; arrive 1; sync 0 and consumer commands sync 1; read a; sync 0 drawn as a graph, with a synchronization edge from arrive 1 to sync 1 and the two generations gen 1 and gen 2 marked.)
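
A sketch of that reachability check as a DFS; the HBGraph encoding below (program-order versus synchronization edges over command nodes) is an assumed representation for illustration. Each node is explored in at most two states, depending on whether the path so far has crossed a synchronization edge:

    #include <utility>
    #include <vector>

    struct HBGraph {
        int n;                                                // number of command nodes
        std::vector<std::vector<std::pair<int, bool>>> adj;   // (target, isSyncEdge)
    };

    bool happensBefore(const HBGraph& g, int a, int b) {
        std::vector<char> seen(2 * g.n, 0);                   // state = (node, flag)
        std::vector<std::pair<int, bool>> stack = {{a, false}};
        while (!stack.empty()) {
            auto [u, sawSync] = stack.back();
            stack.pop_back();
            if (u == b && sawSync) return true;               // path with a sync edge
            int id = 2 * u + (sawSync ? 1 : 0);
            if (seen[id]) continue;
            seen[id] = 1;
            for (auto [v, isSync] : g.adj[u])
                stack.push_back({v, sawSync || isSync});
        }
        return false;
    }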

Data races
- For every two commands that can race, check for an HB relationship.
- This is sound and complete for race detection.
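
Continuing the previous sketch: two commands to the same location, at least one of them a write, race exactly when neither happens before the other:

    // Conflicting accesses race unless ordered by HB in one direction or the other.
    bool races(const HBGraph& g, int cmdA, int cmdB) {
        return !happensBefore(g, cmdA, cmdB) &&
               !happensBefore(g, cmdB, cmdA);
    }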

Implementation

Evaluation (Singe kernels)

Discovered bugs
- Write-after-read bugs
- Benign data races
- All kernels were well synchronized.

Conclusion
- GPUs are much more flexible than people realize; named barriers let GPUs be used in new ways.
- Named barriers can create many complications: deadlock, improper recycling, data races.
- Providing good software verification is important, and necessary to make named barriers easy to use.
- WEFT verifies code with named barriers; the algorithm is both sound and complete, and it handles real production code efficiently.