Hardware Transactional Memory for GPU Architectures*
Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
*In Proc. ACM/IEEE Int'l Symp. on Microarchitecture (MICRO-44)
Talk Outline
- What we mean by "GPU" in this work
- Data Synchronization on GPUs
- What is Transactional Memory (TM)?
- TM is compatible with OpenCL... but is TM compatible with GPU hardware?
- KILO TM: A Hardware TM for GPUs
- Results
Wilson Fung, Inderpeet Singh, Andrew Brownsword, Tor Aamodt. Hardware TM for GPU Architectures.
What is a GPU (in this work)?
- GPU is an NVIDIA/AMD-like compute accelerator: SIMD hardware + aggressive memory subsystem => high compute throughput and efficiency
- Non-graphics APIs: OpenCL, DirectCompute, CUDA
- Programming model: a hierarchy of scalar threads: a kernel launches work groups / thread blocks, each made of wavefronts / warps of scalar threads; shared (local) memory and barriers within a block, global memory across blocks
- Today: limited communication & synchronization
Baseline GPU Architecture
SIMT cores connect through an interconnection network to memory partitions. Each SIMT core has a SIMT front end (fetch, decode, schedule, branch) feeding a SIMD datapath, plus a memory subsystem: shared memory, a non-coherent L1 data cache, and texture and constant caches. Each memory partition holds a last-level cache bank, an atomic operation unit, and an off-chip DRAM channel.
Stack-Based SIMD Reconvergence ("SIMT") (Levinthal SIGGRAPH '84, Fung MICRO '07)
A warp of scalar threads (here 4) shares a common PC. Each entry of the per-warp SIMT stack holds a reconvergence PC, a next PC, and an active mask. For control flow A -> B -> {C, D} -> E -> G: the warp runs A and B with mask 1111; at B's divergent branch, the hardware sets the top entry's next PC to the reconvergence point E and pushes one entry per path (C with mask 1001, D with mask 0110). Each path executes serially and is popped when it reaches E; the warp then reconverges and runs E and G with the full mask 1111.
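The stack mechanism above can be sketched as a small model (an illustrative Python simulation, not the hardware; the CFG A -> B -> {C, D} -> E -> G, the 4-thread warp, and all names follow the figure):

```python
# Minimal model of a stack-based SIMT reconvergence stack.
# Each stack entry is [reconv_pc, next_pc, active_mask].
def run_warp(branch_taken):
    """branch_taken[i]: does thread i take the branch at B (to C)?"""
    n = len(branch_taken)
    succ = {'A': 'B', 'C': 'E', 'D': 'E', 'E': 'G', 'G': None}
    trace = []
    stack = [[None, 'A', [True] * n]]
    while stack:
        top = stack[-1]
        rpc, pc, mask = top
        if pc == rpc or pc is None:      # path reached reconvergence: pop
            stack.pop()
            continue
        trace.append((pc, ''.join('1' if m else '0' for m in mask)))
        if pc == 'B':                    # divergent branch, reconverges at E
            taken = [m and b for m, b in zip(mask, branch_taken)]
            not_taken = [m and not b for m, b in zip(mask, branch_taken)]
            top[1] = 'E'                 # resume here with the full mask
            if any(not_taken):
                stack.append(['E', 'D', not_taken])
            if any(taken):
                stack.append(['E', 'C', taken])
        else:
            top[1] = succ[pc]
    return trace

# Threads 1 and 4 take the branch (C/1001), threads 2 and 3 do not (D/0110):
print(run_warp([True, False, False, True]))
```

Running this reproduces the figure's serialization: A and B with mask 1111, C with 1001, D with 0110, then E and G with 1111 after reconvergence.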
Data Synchronizations on GPUs
Motivation: solve a wider range of problems on the GPU; data races demand data synchronization.
Current solution: atomic read-modify-write (32-bit/64-bit). Is that the best solution?
Why Transactional Memory? E.g., N-Body with 5M bodies (traditional sync, not TM):
- CUDA SDK: O(n^2), 1640 s (barrier)
- Barnes-Hut: O(n log n), 5.2 s (atomics, harder to get right)
TM makes efficient algorithms easier to write and debug. Practical efficiency: we want the efficiency of the GPU with reasonable (not superhuman) effort and time.
Data Synchronizations on GPUs
Writing deadlock-free code with fine-grained locks and 10,000+ hardware-scheduled threads is hard: the number of possible global lock states grows with (# locks) x (# sharing threads). Which of these states are deadlocks?!
Other general problems with lock-based synchronization:
- Implicit relationship between locks and the objects they protect
- Code is not composable
Data Synchronization Problems Specific to GPUs
The interaction between locks and SIMT control flow can cause deadlocks.

Naive spin lock (deadlocks under SIMT):
A: while (atomicCAS(lock, 0, 1) == 1);
B: // Critical Section ...
C: lock = 0;
If one thread in a warp acquires the lock, its warp-mates keep spinning at A; lockstep execution prevents the winner from reaching C to release the lock, so the warp deadlocks.

SIMT-safe version (every thread keeps iterating the outer loop, so the lock holder can reach the release):
A: done = 0;
B: while (!done) {
C:   if (atomicCAS(lock, 0, 1) == 0) {
D:     // Critical Section ...
E:     lock = 0;
F:     done = 1;
G:   }
H: }
Transactional Memory
The program specifies atomic code blocks called transactions [Herlihy '93].

Lock version (potential deadlock if lock ordering is inconsistent!):
Lock(X[a]); Lock(X[b]); Lock(X[c]);
X[c] = X[a] + X[b];
Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);

TM version:
atomic { X[c] = X[a] + X[b]; }
Transactional Memory
Non-conflicting transactions may run in parallel: TX1 and TX2 touching disjoint data (A, B vs. C, D) both commit.
Conflicting transactions are automatically serialized: one commits, the other aborts, retries, and commits afterwards.
Programmers' view: the result is as if TX1 ran before TX2, or TX2 ran before TX1.
Transactional Memory
Each transaction has 3 phases:
1. Execution: track all memory accesses (read-set and write-set)
2. Validation: detect any conflicting accesses between transactions; resolve conflicts if needed (abort/stall)
3. Commit: update global memory
Transactional Memory on OpenCL
A natural extension to the OpenCL programming model: a program can launch many more threads than the hardware can execute concurrently. GPU-TM? Threads currently running transactions do not need to wait for future, unscheduled threads.
Are TM and GPUs Incompatible?
The problem with GPUs (from a TM perspective):
- 1000s of concurrent threads
- Inter-thread spatial locality is common
- No cache coherence
- No private cache per thread (where to buffer transactional state? how to abort?)
- Control flow divergence
Hardware TM for GPUs. Challenge: Conflict Detection
CPU HTMs lean on scalable cache coherence: each thread has a private data cache plus a signature summarizing its read/write-set, and coherence (bus invalidation) traffic is checked against the signatures to flag conflicts (e.g., one transaction's W(C) conflicts with another's R(C)). GPUs break both assumptions: there is no coherence, there is no private cache per scalar thread, and a 1024-bit signature per thread would cost 3.8 MB for 30K threads.
Hardware TM for GPUs. Challenge: Transaction Rollback
A CPU core has 10s of registers; checkpointing them at transaction start and restoring on abort is cheap. A GPU core (SM) has a 32K-register file shared by many warps; checkpointing it would require about 2 MB of additional on-chip storage across the GPU.
Hardware TM for GPUs. Challenge: Access Granularity and Write Buffer
On a CPU, a 32 kB L1 data cache serving 1-2 threads can buffer transactional writes. Fermi's 48 kB L1 data cache is only 384 lines of 128 B, shared by 1536 threads. Problem: 384 lines / 1536 threads < 1 line per thread, so the L1 cannot buffer per-thread write sets; commits to global memory must be organized differently (commit at warp granularity).
Hardware TM on GPUs. Challenge: SIMT Hardware
On GPUs, scalar threads in a warp/wavefront execute in lockstep:
TxBegin
LD r2,[B]
ADD r2,r2,2
ST r2,[A]
TxCommit
At TxCommit, some threads of a warp (here 8 threads) may commit while others abort and must re-execute. How do the committed and aborted threads reconverge?
Goal
We take it as a given that most programmers trying lock-based programming on a GPU will give up before they manage to get their application working. Hence, our goal was to find the most efficient approach to implementing TM on GPUs.
KILO TM
- Supports 1000s of concurrent transactions
- Transaction-aware SIMT stack
- No cache coherence protocol dependency
- Word-level conflict detection
Captures 59% of fine-grained lock performance; 128x faster than serialized transaction execution.
KILO TM: Design Highlights
- Value-based conflict detection: self-validation + abort gives simple communication and no cache coherence dependence
- Speculative validation: increases commit parallelism
High Level GPU Architecture + KILO TM Implementation Overview
KILO TM: SIMT Core Changes
SW register checkpoint. Observation: most registers overwritten inside a transaction are not needed again on restart; compiler analysis can identify the few that must be checkpointed.
Transaction abort behaves like a do-while loop: extend the SIMT stack with special entries to track aborted transactions in each warp.
TxBegin
LD r2,[B]
ADD r2,r2,2   // r2 is overwritten inside the Tx
ST r2,[A]
TxCommit
Transaction-Aware SIMT Stack
Example:
A: t = tid.x;
   if (...) {
B:   tx_begin;
C:   x[t%10] = y[t] + 1;
D:   if (s[t])
E:     y[t] = 0;
F:   tx_commit;
G:   z = y[t];
   }
H: w = y[t+1];
Stack entries carry a type: N (normal), R (transaction retry), and T (transaction top). tx_begin starts an implicit loop: the active mask is copied into a new T entry, with an empty R entry beneath it to collect aborting threads. Branch divergence inside the transaction (at D) pushes ordinary divergence entries above the T entry. At tx_commit, threads that fail validation (e.g., threads 6 and 7) have their mask bits and restart PC copied into the R entry; the R entry then becomes the T entry for the retry pass, restarting the transaction for just those threads. Once all threads have committed, the transaction entries are popped and the warp continues at G.
KILO TM: Value-Based Conflict Detection
Each transaction executes against a private read-log and write-log in its private memory. Example, with global memory initially A=1, B=0:
TX1: atomic { B = A + 1 }   (TxBegin; LD r1,[A]; ADD r1,r1,1; ST r1,[B]; TxCommit)
TX2: atomic { A = B + 2 }   (TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit)
TX1 logs the read A=1 and the pending write B=2; TX2 logs the read B=0 and the pending write A=2. At commit, a transaction self-validates by re-reading its read-log and comparing values; if they still match global memory, it publishes its write-log. Self-validation + abort only detects the existence of a conflict (not the identity of the other transaction), so no Tx-to-Tx messages are needed: simple communication.
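The read-log/write-log protocol above can be sketched as a tiny single-threaded model (illustrative Python, assuming validations are serialized as on the next slide; class and method names are ours, not the paper's):

```python
# Minimal model of value-based conflict detection with read/write logs.
class Tx:
    def __init__(self, mem):
        self.mem = mem          # shared "global memory" (a dict)
        self.read_log = {}      # addr -> value observed during execution
        self.write_log = {}     # addr -> value to publish on commit

    def load(self, addr):
        if addr in self.write_log:           # read-own-write
            return self.write_log[addr]
        val = self.mem[addr]
        self.read_log.setdefault(addr, val)  # log first observed value
        return val

    def store(self, addr, val):
        self.write_log[addr] = val

    def try_commit(self):
        # Self-validation: re-read the read-set and compare values.
        # A mismatch means some committed transaction changed the data.
        for addr, logged in self.read_log.items():
            if self.mem[addr] != logged:
                return False                 # conflict exists -> self-abort
        self.mem.update(self.write_log)      # commit: publish write-log
        return True

mem = {'A': 1, 'B': 0}
tx1, tx2 = Tx(mem), Tx(mem)
tx1.store('B', tx1.load('A') + 1)   # TX1: atomic { B = A + 1 }
tx2.store('A', tx2.load('B') + 2)   # TX2: atomic { A = B + 2 }
# Serialized validation/commit: TX1 goes first, so TX2's logged B=0 is stale.
assert tx1.try_commit() is True
assert tx2.try_commit() is False    # B changed 0 -> 2: TX2 aborts, retries
```

Note the aborting transaction learns only that *some* conflict exists, never who caused it, which is exactly why no transaction-to-transaction communication is required.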
Parallel Validation?
With global memory initially A=1, B=0, suppose TX1 (atomic { B = A + 1 }) and TX2 (atomic { A = B + 2 }) validate in parallel. Each re-reads its read-log against pre-commit memory, so both pass (TX1 sees A=1, TX2 sees B=0) and both commit: A=2, B=2. But the only serializable outcomes are TX1 then TX2 (B=2, A=4) or TX2 then TX1 (A=2, B=3). Data race!?!
Serialize Validation?
Let transactions validate and commit one at a time through a commit unit in front of global memory (TX2 stalls while TX1 validates and commits).
Benefit #1: no data race.
Benefit #2: no livelock (a generic lazy-TM problem).
Drawback: serializes non-conflicting transactions ("collateral damage").
Identifying Non-Conflicting Tx. Step 1: Leverage Parallelism
Global memory is split across memory partitions, each with its own commit unit. Transactions whose accesses fall in different partitions (TX1 vs. TX2) validate and commit in parallel; a transaction touching multiple partitions (TX3) occupies a commit unit in each.
Solution: Speculative Validation
Key idea: split validation into two parts:
- Part 1: check against recently committed transactions
- Part 2: check against concurrently committing transactions
KILO TM: Speculative Validation
The memory subsystem is deeply pipelined and highly parallel. Each commit unit processes transactions in stages: hazard detection, log transfer, speculative validation, validation wait, finalize outcome, and commit. Example: TX1 (R(C), W(D)), TX2 (R(A), W(B)), and TX3 (R(D), W(E)) arrive in a partition's validation queue.
KILO TM: Speculative Validation
Each commit unit keeps a last writer history: a small lookup table mapping an address to the commit ID (CID) of the latest transaction that wrote it, with evicted entries falling into a recency Bloom filter. Walking the validation queue: TX1's read of C finds no recent writer (Nil), so TX1 validates speculatively and records D -> TX1; TX2's read of A also misses and records B -> TX2; TX3's read of D hits TX1 in the history, so TX3 stalls until TX1's outcome is final, then records E -> TX3.
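The hazard-detection step can be sketched as follows (a simplified Python model of one commit unit's last writer history; it omits the Bloom-filter backing store and uses our own names):

```python
# Minimal model of last-writer-history hazard detection in a commit unit.
class CommitUnit:
    def __init__(self):
        self.last_writer = {}   # addr -> CID of latest in-flight writer

    def speculative_validate(self, cid, read_set, write_set):
        # A hazard exists if an earlier, still-unresolved transaction
        # wrote an address we read: we must wait for its final outcome.
        hazards = {self.last_writer[a] for a in read_set
                   if a in self.last_writer}
        for a in write_set:     # record ourselves as the last writer
            self.last_writer[a] = cid
        return hazards          # empty set -> validate against memory now

cu = CommitUnit()
# TX1 R(C),W(D) and TX2 R(A),W(B): no hazards, validate speculatively.
assert cu.speculative_validate(1, read_set={'C'}, write_set={'D'}) == set()
assert cu.speculative_validate(2, read_set={'A'}, write_set={'B'}) == set()
# TX3 R(D),W(E): D was written by the still-committing TX1 -> stall on TX1.
assert cu.speculative_validate(3, read_set={'D'}, write_set={'E'}) == {1}
```

In the real design the table is finite and backed by a recency Bloom filter, so a lookup can conservatively report a hazard that does not exist, which costs a stall but never correctness.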
Log Storage
Transaction logs are stored in the private memory of each thread: located in DRAM, cached in the L1 and L2 caches. When a wavefront issues a transactional load (e.g., threads T0-T3 loading addresses A-D with values 3, 4, 9, 7), each thread appends an (address, value) entry at its read-log pointer; stores append entries at the write-log pointer likewise. From each thread's view of private memory, its log entries occupy consecutive physical addresses.
Log Transfer
Log entries heading to the same memory partition are grouped into a larger packet before being sent to that partition's commit unit (e.g., the entries for A and C travel together to one commit unit, while B and D go to theirs).
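The grouping step amounts to bucketing a thread's log by destination partition; a minimal sketch (illustrative Python; the modulo partition function is our assumption, not the paper's mapping):

```python
# Group one thread's (addr, value) log entries into per-partition packets.
def group_by_partition(log, num_partitions=4):
    """log: list of (addr, value) entries from a thread's read/write log."""
    packets = {p: [] for p in range(num_partitions)}
    for addr, value in log:
        packets[addr % num_partitions].append((addr, value))
    # Only non-empty packets are actually sent to commit units.
    return {p: ents for p, ents in packets.items() if ents}

log = [(0, 3), (1, 4), (4, 9), (5, 7)]
# Entries for addresses 0 and 4 share partition 0; 1 and 5 share partition 1.
assert group_by_partition(log) == {0: [(0, 3), (4, 9)], 1: [(1, 4), (5, 7)]}
```

Fewer, larger packets amortize interconnect overhead compared with sending each log entry to its commit unit individually.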
Distributed Commit / HW Org.
ABA Problem?
Classic example: a linked-list-based stack (top -> A -> B -> C -> Null).
Thread 0 - pop():
while (true) {
  t = top;
  next = t->next;
  // meanwhile, thread 2: pop A, pop B, push A  =>  top -> A -> C
  if (atomicCAS(&top, t, next) == t)
    break;  // succeeds!
}
The CAS succeeds because top again holds A, yet next still points at the popped node B, corrupting the stack.
ABA Problem?
atomicCAS protects only a single word: just one part of the data structure. Value-based conflict detection protects all relevant parts of the data structure: both top and A->next are in the read-log, so the interleaved change is detected and the transaction aborts.
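The contrast can be made concrete with a small model (illustrative Python, not GPU code; `cas` models a single-word atomicCAS, and the node names follow the figure):

```python
# Why value-based validation catches ABA where single-word CAS does not.
def cas(cell, expect, new):
    """Model of a single-word compare-and-swap on a one-element list."""
    if cell[0] == expect:
        cell[0] = new
        return expect
    return cell[0]

# Stack as a top pointer plus next links: top -> A -> B -> C
nxt = {'A': 'B', 'B': 'C', 'C': None}
top = ['A']

# Thread 0 begins pop(): reads top = A and next = B ...
t, t_next = top[0], nxt[top[0]]
# ... thread 2 interleaves: pop A, pop B, push A  =>  top -> A -> C
nxt['A'] = 'C'
# The CAS sees top still equal to A and succeeds, installing stale next = B:
assert cas(top, t, t_next) == 'A'
assert top[0] == 'B'        # corrupt: B was already popped

# Value-based validation logs EVERY word read (top AND A.next); re-reading
# A.next (logged 'B', now 'C') reveals the conflict, so the Tx aborts.
read_log = {'top': 'A', 'A.next': 'B'}
current  = {'top': 'A', 'A.next': 'C'}
assert any(current[a] != v for a, v in read_log.items())
```

The single word guarding the CAS returned to its old value, so the CAS cannot see the intervening pops; validating the whole read-set by value can.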
Evaluation Methodology
GPGPU-Sim 3.0 (BSD license); detailed: IPC correlation of 0.93 vs. GT200. KILO TM modeled with timing-driven memory accesses.
GPU TM applications:
- Hash Table (HT-H, HT-L)
- Bank Account (ATM)
- Cloth Physics (CL)
- Barnes-Hut (BH)
- CudaCuts (CC)
- Data Mining (AP)
GPGPU-Sim 3.0.x running SASS (via decuda): 0.976 IPC correlation on the subset of the CUDA SDK that decuda correctly disassembles. Note: the rest of the data uses PTX instead of SASS (0.93 correlation). We believe GPGPU-Sim is a reasonable proxy.
Performance (vs. Serializing Tx)
Absolute Performance (IPC)
TM on the GPU performs well for applications with low contention; it performs poorly under memory divergence, low parallelism, or a high conflict rate (tackle through algorithm design/tuning?).
CPU vs. GPU? CC: the fine-grained-lock version is 400x faster than its CPU version. BH: the fine-grained-lock version is 2.5x faster than its CPU version.
Performance (Exec. Time)
Captures 59% of fine-grained lock performance; 128x faster than serialized transaction execution.
KILO TM Scaling
Abort/Commit Ratio
Increasing the number of concurrent transactions increases the probability of conflict. Two possible solutions (future work):
- Solution 1: application performance tuning (easier with TM than with fine-grained locks)
- Solution 2: transaction scheduling
Thread Cycle Breakdown
Status of a thread at each cycle. Categories:
- TC: in a warp stalled by concurrency control
- TO: in a warp committing its transactions
- TW: has passed commit, waiting for other threads in the warp to pass
- TA: executing an eventually-aborted transaction
- TU: executing an eventually-committed transaction (useful work)
- AT: acquiring a lock or doing an atomic operation
- BA: waiting at a barrier
- NL: doing non-transactional (normal) work
Thread Cycle Breakdown
(Chart: per-benchmark thread cycle breakdown for HT-H, HT-L, ATM, CL, BH, CC, AP.)
Core Cycle Breakdown
Action performed by a core at each cycle. Categories:
- EXEC: issuing a warp for execution
- STALL: stalled by a downstream warp
- SCRB: all warps blocked by the scoreboard due to data hazards, concurrency control, or pending commits (or any combination thereof)
- IDLE: none of the warps in the instruction buffer are ready
Core Cycle Breakdown
Read-Write Buffer Usage
# In-Flight Buffers
Implementation Complexity
Transaction logs live in the private L1 data cache. Each commit unit adds a 5 kB last writer history unit, 19 kB of transaction status, and a 32 kB read-set and write-set buffer. CACTI (40 nm) estimate: 0.40 mm^2 x 6 memory partitions, about 0.5% of a 520 mm^2 GPU.
Summary
KILO TM:
- 1000s of concurrent transactions
- Value-based conflict detection
- Speculative validation for commit parallelism
- 59% of fine-grained locking performance
- 0.5% area overhead
Backup Slides
Logical Stage Organization
Execution Time Breakdown