
GRace: A Low-Overhead Mechanism for Detecting Data Races in GPU Programs
Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal
Dept. of Computer Science and Engineering, The Ohio State University

GPU Programming Gets Popular
- Many domains are using GPUs for high performance, e.g. GPU-accelerated molecular dynamics and GPU-accelerated seismic imaging
- Available in both high-end and low-end systems
  - 3 of the top 10 supercomputers were based on GPUs [Meuer+:10]
  - Many desktops and laptops are equipped with GPUs

Data Races in GPU Programs
A typical mistake:
  __shared__ int s[];
  int tid = threadIdx.x; …
  s[tid] = 3;                      //W
  result[tid] = s[tid+1] * tid;    //R
[Timing diagram: in Case 1, T31's R(s+31+1) executes before T32's W(s+32); in Case 2, the write executes first. The result is undeterminable!]
- May lead to severe problems later, e.g. crash, hang, silent data corruption
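A minimal, compilable sketch of this racy pattern (the kernel name, block size, and host driver below are hypothetical additions, not code from the slides):

  #include <cuda_runtime.h>

  #define BLOCK_SIZE 512                         // hypothetical launch configuration

  __global__ void neighbor_read(int *result) {
      __shared__ int s[BLOCK_SIZE + 1];          // +1 so the last thread's read stays in bounds
      int tid = threadIdx.x;
      s[tid] = 3;                                // W: thread tid writes s[tid]
      // Without a __syncthreads() here, thread tid races with thread tid+1,
      // which writes the element read on the next line.
      result[tid] = s[tid + 1] * tid;            // R: thread tid reads s[tid+1]
      // Note: s[BLOCK_SIZE] is never written in this sketch, so the last thread reads
      // an uninitialized value; that is beside the point of the race being shown.
  }

  int main() {
      int *d_result;
      cudaMalloc((void **)&d_result, BLOCK_SIZE * sizeof(int));
      neighbor_read<<<1, BLOCK_SIZE>>>(d_result);
      cudaDeviceSynchronize();
      cudaFree(d_result);
      return 0;
  }

Adding the commented-out __syncthreads() between the write and the read removes the race, which is exactly the barrier discipline the rest of the talk assumes.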

Why Data Races Occur in GPU Programs
- Inexperienced programmers: GPUs are accessible and affordable to developers who have never used parallel machines before
- More and more complicated applications: e.g. programs running on a cluster of GPUs involve other programming models such as MPI (Message Passing Interface)
- Implicit kernel assumptions broken by kernel users: e.g. "max # of threads per block will be 256" or "initialization values of the matrix should be within a certain range"; when such assumptions are violated, different threads may end up with overlapping memory indices

State-of-the-art Techniques
- Data race detection for multithreaded CPU programs
  - Lockset [Savage+:97] [Choi+:02]
  - Happens-before [Dinning+:90] [Perkovic+:96] [Flanagan+:09] [Bond+:10]
  - Hybrid [O'Callahan+:03] [Pozniansky+:03] [Yu+:05]
  - Inapplicable or unnecessarily expensive for barrier-based GPU programs
- Data race detection for GPU programs
  - SMT (Satisfiability Modulo Theories)-based verification [Li+:10]: false positives & state explosion
  - Dynamically tracking all shared-variable accesses [Boyer+:08]: false positives & huge overhead

Our Contributions
- Statically-assisted dynamic approach: simple static analysis significantly reduces overhead
- Exploiting GPU's thread scheduling and execution model: identifies the key difference between data races on GPUs and CPUs, and avoids false positives
- Making full use of GPU's memory hierarchy: reduces overhead further
- Precise: no false positives in our evaluation
- Low-overhead: as low as 1.2x on a real GPU

Outline
- Motivation
- What's new in GPU programs
- GRace
- Evaluation
- Conclusions

Execution Model
GPU architecture and SIMT (Single-Instruction Multiple-Thread)
[Diagram: streaming multiprocessors SM 1 … SM n, each with a scheduler, cores, and shared memory, connected to device memory; a block of threads T0, T1, …, T31, … is divided into warps]
- Streaming Multiprocessors (SMs) execute blocks of threads
- Threads inside a block use a barrier for synchronization
- A block of threads is further divided into groups called warps
  - 32 threads per warp
  - the warp is the scheduling unit in an SM
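A small device-side sketch (hypothetical helper name) of the warp structure described above: since warps are fixed groups of 32 consecutive threads, each thread can derive its warp and its position inside the warp from its thread index.

  // Assumes a 1-D thread block, as in the slides' examples.
  __device__ void warp_coords(int *warp_id, int *lane_id) {
      int tid = threadIdx.x;         // linear thread index within the block
      *warp_id = tid / warpSize;     // warpSize is CUDA's built-in constant (32 here)
      *lane_id = tid % warpSize;     // position of the thread within its warp
  }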

Our Insights
Two different types of data races between barriers:
- Intra-warp races: threads within a warp can only cause data races by executing the same instruction
- Inter-warp races: threads across different warps can have data races by executing the same or different instructions
  __shared__ int s[];
  int tid = threadIdx.x; …
  s[tid] = 3;                      //W
  result[tid] = s[tid+1] * tid;    //R
[Diagram: T31 (Warp 0) reads s+32 while T32 (Warp 1) writes s+32; whichever order they execute in, this is an inter-warp race. Within Warp 0, T31's W(s+31) and T30's R(s+31) come from different instructions, so interleaving them is impossible: no intra-warp race here.]

Memory Hierarchy
[Diagram: each SM has its own scheduler, cores, and shared memory (fast but small, 16KB); all SMs share device memory (large, 4GB, but ~100x slower)]
- Memory constraints
- Performance-critical: frequently accessed variables are usually stored in shared memory
- A dynamic tool should also try to use shared memory whenever possible
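A minimal kernel sketch (hypothetical names) illustrating the trade-off above: frequently reused data is staged from large-but-slow device memory into the small-but-fast per-SM shared memory.

  // Assumes blockDim.x <= 256; dev_in/dev_out live in device memory (large but ~100x slower).
  __global__ void stage_into_shared(const float *dev_in, float *dev_out, int n) {
      __shared__ float tile[256];                        // shared memory: fast, but only 16KB per SM
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) tile[threadIdx.x] = dev_in[i];          // stage one element per thread
      __syncthreads();                                   // barrier before the staged data is reused
      if (i < n) dev_out[i] = 2.0f * tile[threadIdx.x];  // reuse the staged copy instead of re-reading device memory
  }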

Outline
- Motivation
- What's new in GPU programs
- GRace
- Evaluation
- Conclusions

GRace: Design Overview
Statically-assisted dynamic analysis
[Workflow: Original GPU Code → Static Analyzer → Annotated GPU Code → Dynamic Checker → execute the code; the Static Analyzer reports statically detected bugs, and the Dynamic Checker reports dynamically detected bugs]

Simple Static Analysis Helps a Lot
Observation I: many conflicts can be easily determined by a static technique
  __shared__ int s[];
  int tid = threadIdx.x; …
  s[tid] = 3;                      //W
  result[tid] = s[tid+1] * tid;    //R
With tid: 0 ~ 511, W(s[tid]) covers (s+0) ~ (s+511) and R(s[tid+1]) covers (s+1) ~ (s+512): overlapped!
❶ Statically detect certain data races
❷ Prune memory access pairs that cannot be involved in data races
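A brute-force host-side sketch of this overlap check (GRace itself feeds linear constraints to a solver, as the workflow slide below shows; the function here is only an illustration for affine indices of the form a*tid+b and is not GRace's code):

  #include <stdbool.h>

  // Returns true if write index aw*tid+bw and read index ar*tid+br can touch the
  // same shared-memory element for two different thread ids in [0, max_tid).
  bool indices_may_overlap(int aw, int bw, int ar, int br, int max_tid) {
      for (int t1 = 0; t1 < max_tid; ++t1)
          for (int t2 = 0; t2 < max_tid; ++t2)
              if (t1 != t2 && aw * t1 + bw == ar * t2 + br)
                  return true;          // two different threads reach the same element
      return false;
  }

  // For W(s[tid]) vs R(s[tid+1]) with 512 threads, indices_may_overlap(1, 0, 1, 1, 512)
  // returns true (e.g. t1 = 1, t2 = 0), matching the "Overlapped!" conclusion above.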

Static Analysis Can Help in Other Ways
How about this…
Observation II: some accesses are loop-invariant
  __shared__ float s[]; …
  for(r=0; …; …)
  { …
    for(c=0; …; …)
    { …
      temp = s[input[c]]; //R
    }
  }
R(s+input[c]) is irrelevant to r: no need to monitor it in every r iteration
Observation III: some accesses are tid-invariant
R(s+input[c]) is irrelevant to tid: no need to monitor it in every thread
❸ Further reduce runtime overhead by identifying loop-invariant & tid-invariant accesses
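A sketch of how Observation II can cut the monitoring cost (the hook grace_record_read, the loop bounds R_ITERS/C_ITERS, and the function name are hypothetical, and this is not necessarily GRace's exact instrumentation scheme): since s + input[c] does not depend on r, the access only needs to be recorded in one r iteration.

  #define R_ITERS 100                  // hypothetical loop bounds
  #define C_ITERS 64

  __device__ void grace_record_read(const float *addr) {
      (void)addr;                      // stub standing in for the dynamic checker's recording logic
  }

  __device__ float sum_invariant(const float *s, const int *input) {
      float temp = 0.0f;
      for (int r = 0; r < R_ITERS; r++) {
          for (int c = 0; c < C_ITERS; c++) {
              if (r == 0)                           // address is invariant in r:
                  grace_record_read(&s[input[c]]);  // record it once, not R_ITERS times
              temp += s[input[c]];                  // R: same address in every r iteration
          }
      }
      return temp;
  }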

Static Analyzer: Workflow
[Flowchart: for each pair of memory accesses, ask "statically determinable?" If yes, gen_constraints(tid, iter_id, addr) and add_params(tid, iter_id, max_tid, max_iter) feed a linear constraint solver: a non-empty soln_set means ❶ race detected, an empty soln_set means ❷ no race, and such pairs don't need to be monitored. If not, the pair is marked to be monitored at runtime by the Dynamic Checker, and check_loop_invariant(addr, iter_id) / check_tid_invariant(addr) ❸ mark loop-invariant and tid-invariant accesses.]

GRace: Design Overview
Statically-assisted dynamic analysis
[Workflow: Original GPU Code → Static Analyzer → Annotated GPU Code → Dynamic Checker → execute the code; the Static Analyzer reports statically detected bugs, and the Dynamic Checker reports dynamically detected bugs]

Dynamic Checker
Reminder:
- Intra-warp races: caused by threads within a warp
- Inter-warp races: caused by threads across different warps
Components:
- Intra-warp Checker: checks intra-warp data races
- GRace-stmt and GRace-addr: check inter-warp data races, two design choices with different trade-offs

Intra-warp Race Detection
Check conflicts among the threads within a warp; detection is performed immediately after each monitored memory access:
❶ Record the access type (R/W)
❷ Record the addresses (Address 0 … Address 31 for threads T0 … T31) in a warpTable kept for each warp
❸ If it is a write, check for conflicts in parallel
The warpTable is recyclable and small enough to reside in shared memory: fast & scalable
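A device-side sketch of the idea (hypothetical names, table layout, and warp bound; not GRace's exact implementation): each lane logs the address it touches in a per-warp table in shared memory, and on a write every lane scans the other entries for a match. Because the 32 lanes of a warp execute the instruction in lockstep on this hardware, the table is fully populated before the scan.

  #define WARP_SIZE 32
  #define MAX_WARPS_PER_BLOCK 16                   // hypothetical upper bound

  __device__ bool intra_warp_conflict(unsigned int addr, bool is_write) {
      __shared__ unsigned int warpTable[MAX_WARPS_PER_BLOCK][WARP_SIZE];  // one small, recyclable table per warp
      int warp = threadIdx.x / WARP_SIZE;
      int lane = threadIdx.x % WARP_SIZE;
      warpTable[warp][lane] = addr;                // (1)(2) record the address for this lane
      bool conflict = false;
      if (is_write) {                              // (3) only a write can race within one instruction
          for (int i = 0; i < WARP_SIZE; ++i)
              if (i != lane && warpTable[warp][i] == addr)
                  conflict = true;                 // another lane of the same warp hit the same address
      }
      return conflict;
  }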

Dynamic Checker
Reminder:
- Intra-warp races: caused by threads within a warp
- Inter-warp races: caused by threads across different warps
Components:
- Intra-warp Checker: checks intra-warp data races in shared memory
- GRace-stmt and GRace-addr: check inter-warp data races, two design choices with different trade-offs

GRace-stmt: Inter-warp Race Detection I
Check conflicts among the threads from different warps:
- After each monitored memory access, record its info (stmt#, WarpID, R/W, and the addresses Addr0 … Addr31 from each warp's WarpTable) to the BlockStmtTable in device memory, shared by all warps; writing and checking are done in parallel
- At each synchronization call, check conflicts between entries whose warpIDs differ
Provides accurate diagnostic info
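A sketch of the kind of record GRace-stmt appends (hypothetical struct; the field set is inferred from the table columns on this slide): one entry per monitored access per warp, stored in the BlockStmtTable in device memory and cross-checked at each __syncthreads() against entries from other warps.

  #define WARP_SIZE 32

  struct StmtRecord {
      int stmt_id;                      // which source statement performed the access (stmt#)
      int warp_id;                      // which warp executed it (WarpID)
      int is_write;                     // access type (R/W)
      unsigned int addr[WARP_SIZE];     // address touched by each of the 32 lanes (Addr0 .. Addr31)
  };
  // Two records conflict if their warp_ids differ, at least one is a write,
  // and some address appears in both addr arrays.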

GRace-addr: Inter-warp Race Detection II
Check conflicts among the threads from different warps:
- After each monitored memory access, update the corresponding counters
- At each synchronization call, infer races from the local/global counters
[Diagram: each warp keeps local tables rWarpShmMap/wWarpShmMap in device memory, with one counter mapped one-to-one to each shared-memory word; global tables rBlockShmMap/wBlockShmMap aggregate all warps within the block; the checks run in parallel]
Simple & scalable
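A sketch of the counter scheme (the function name and the inference rule below are hypothetical; the table names follow the slide): each warp keeps read/write counter arrays with one slot per shared-memory word, mirrored by block-wide totals, and races are inferred at the barrier by comparing the two.

  __device__ void grace_addr_record(int word, int is_write,
                                    unsigned int *rWarpShmMap, unsigned int *wWarpShmMap,    // this warp's local tables (device memory)
                                    unsigned int *rBlockShmMap, unsigned int *wBlockShmMap)  // global tables for the whole block
  {
      if (is_write) {
          atomicAdd(&wWarpShmMap[word], 1u);    // this warp's write count for the word
          atomicAdd(&wBlockShmMap[word], 1u);   // block-wide write count shared by all warps
      } else {
          atomicAdd(&rWarpShmMap[word], 1u);
          atomicAdd(&rBlockShmMap[word], 1u);
      }
  }
  // At the barrier, word w is reported for this warp if, for example,
  //   wWarpShmMap[w] > 0 && (rBlockShmMap[w] > rWarpShmMap[w] || wBlockShmMap[w] > wWarpShmMap[w])
  //   (this warp wrote w and some other warp also touched it), or
  //   rWarpShmMap[w] > 0 && wBlockShmMap[w] > wWarpShmMap[w]
  //   (this warp read w and some other warp wrote it).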

Outline
- Motivation
- What's new in GPU programs
- GRace
  - Static analyzer
  - Dynamic checker
- Evaluation
- Conclusions

Methodology
- Hardware
  - GPU: NVIDIA Tesla C1060, 240 cores (30×8), 1.296GHz, 16KB shared memory per SM, 4GB device memory
  - CPU: 2× AMD Opteron 2.6GHz, 8GB main memory
- Software: Linux kernel 2.6.18, CUDA SDK 3.0, PipLib (linear constraint solver)
- Applications: co-cluster, em, scan

Overall Effectiveness
- Accurately reports races in all three applications
- No false positives reported

GRace results:
  Apps        GRace(w/-stmt)                          GRace(w/-addr)
              R-Stmt#  R-Mem#  R-Thd#     FP#         R-Wp#  FP#
  co-cluster  1        10      1,310,720  0           8      0
  em          14       384     22,023     0           3      0
  scan        3 pairs of racing statements are detected by the Static Analyzer

Comparison with B-tool:
  Apps        RP#       FP#
  co-cluster  1         –
  em          200,445   45,870
  scan        Error

Legend: R-Stmt: pairs of conflicting accesses; R-Mem: memory addresses involved in data races; R-Thd: pairs of racing threads; R-Wp: pairs of racing warps; FP: false positives; RP: number of races reported by B-tool.

Runtime Overhead
- GRace(w/-addr): very modest
- GRace(w/-stmt): higher overhead with diagnostic info, but still faster than the previous tool
[Chart: runtime overheads of 1.2x for GRace-addr and 19x for GRace-stmt, vs. 103,850x for the previous tool]
- "Is there any data race in my kernel?" GRace-addr can answer quickly
- "Where is it?" GRace-stmt can tell exactly

Benefits from Static Analysis
Simple static analysis can significantly reduce overhead.
Number of monitored statements (Stmt) and memory accesses (MemAcc) during execution:
  Apps        Without Static Analyzer      With Static Analyzer
  co-cluster  10,524,416                   41,216
  em          19,070,976  54,460,416       20,736  10,044

Conclusions and Future Work
- Statically-assisted dynamic analysis
- Architecture-based approach: intra-/inter-warp race detection
- Precise and low-overhead
Future work:
- Detect races in device memory
- Rank races
Thanks!