Hardware Transactional Memory for GPU Architectures
Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
In Proc. ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44)

Motivation
- Lifetime of GPU application development
  [Figure: functionality and performance plotted against development time, comparing fine-grained locking with transactional memory]
- E.g. N-body with 5M bodies
  - CUDA SDK: O(n^2), 1640 s (barrier)
  - Barnes Hut: O(n log n), 5.2 s (locks)

Are TM and GPUs Incompatible?
- GPUs are different from multi-core CPUs
  - 1000s of concurrent scalar threads
  - Challenges (from the TM perspective)
- Our solution: KILO TM, a hardware TM for GPUs

Hardware TM for GPUs: Challenge #1 (SIMD Hardware)
- On GPUs, scalar threads in a warp/wavefront execute in lockstep

    ...
    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit
    ...

- [Figure: a warp with 4 scalar threads (T0-T3); some threads commit while others abort]
- Branch divergence!

KILO TM: Solution to Challenge #1 (SIMD Hardware)
- A transaction abort is like a loop
- Extend the SIMT stack to handle aborts

    ...
    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit
    ...

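The "abort is like a loop" idea can be illustrated in software. The following is only a minimal sketch, not the KILO TM hardware mechanism: `run_warp_transaction` and `try_commit` are hypothetical names, and the per-lane conflict pattern is made up. It shows how lanes that abort stay active and re-execute, just as lanes that remain inside a divergent loop would.

```python
def run_warp_transaction(num_threads, try_commit, max_rounds=100):
    """Re-execute a transaction for the lanes that aborted, like a loop.

    try_commit(tid, attempt) -> True if the lane commits on this attempt.
    Returns the active mask seen at the start of each round.
    """
    active = [True] * num_threads      # lanes that still have to commit
    masks = []
    attempt = 0
    while any(active) and attempt < max_rounds:
        masks.append(list(active))
        for tid in range(num_threads):
            if active[tid] and try_commit(tid, attempt):
                active[tid] = False    # committed lanes wait at TxCommit
        attempt += 1
    return masks


# Hypothetical conflict pattern: lane t needs t % 2 + 1 attempts.
masks = run_warp_transaction(4, lambda tid, attempt: attempt >= tid % 2)
print(masks)
```

In this made-up pattern, lanes 0 and 2 commit in the first round, and only lanes 1 and 3 are active in the second round.
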
Hardware TM for GPUs: Challenge #2 (Transaction Rollback)
- CPU core: register file with 10s of registers
  - Checkpoint the register file at TX begin; restore it on TX abort
- GPU core (SM): register file with 32k registers
  - 2MB total on-chip storage
  - Checkpoint?

KILO TM: Solution to Challenge #2 (Transaction Rollback)
- Software register checkpoint
  - Most TX: registers are overwritten at first use
  - TX in Barnes Hut: checkpoint 2 registers

    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit

- [Figure annotations: "Abort" on the transaction and "Overwritten" on register r2]

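To make "overwritten at first use" concrete, here is a minimal sketch of how a compiler pass might decide which registers need a software checkpoint. It is an illustration under stated assumptions, not the analysis used by KILO TM: the function name and the `(dest, sources)` instruction encoding are hypothetical, and only straight-line code is handled. A register needs a checkpoint only if the transaction reads its pre-transaction value and also modifies it.

```python
def registers_to_checkpoint(instructions):
    """Return registers that are read before being overwritten in the
    transaction and are also written somewhere in it.

    Each instruction is (dest, sources); dest is None for stores.
    """
    written = set()
    live_in = set()
    for dest, sources in instructions:
        for reg in sources:
            if reg not in written:
                live_in.add(reg)   # value from before TxBegin is needed
        if dest is not None:
            written.add(dest)
    return live_in & written


tx = [
    ("r2", []),        # LD  r2,[B]   : r2 overwritten before any use
    ("r2", ["r2"]),    # ADD r2,r2,2
    (None, ["r2"]),    # ST  r2,[A]
]
print(registers_to_checkpoint(tx))   # set()
```

For the transaction on the slide, nothing needs to be saved, because `r2` is loaded before it is read. A transaction whose first instruction is `ADD r1,r1,1` would instead need `r1` checkpointed.
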
Hardware TM for GPUs: Challenge #3 (Conflict Detection)
- Existing HTMs use the cache coherence protocol
  - Not available on GPUs
  - No private data cache per thread
- Signatures?
  - 1024 bits per thread
  - 3.8MB for 30k threads

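The signature storage figure can be checked with a short calculation. This sketch assumes "30k threads" means exactly 30,000 threads and that MB means 10^6 bytes; neither is stated on the slide.

```python
SIGNATURE_BITS = 1024          # per-thread signature size from the slide
THREADS = 30_000               # "30k threads", assumed to be exactly 30,000

bytes_per_thread = SIGNATURE_BITS // 8
total_bytes = bytes_per_thread * THREADS
print(bytes_per_thread, total_bytes / 1e6)   # 128 3.84
```
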
Hardware TM for GPUs: Challenge #4 (Write Buffer)
- Fermi’s L1 data cache (48kB) = 384 x 128B lines
- Problem: 384 lines / 1536 threads < 1 line per thread!
- [Figure: GPU core (SM) whose L1 data cache is shared by the threads of its warps]

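The per-thread buffering budget follows directly from the numbers on the slide. A minimal worked calculation, assuming kB means 1024 bytes:

```python
L1_BYTES = 48 * 1024      # Fermi L1 data cache, 48 kB
LINE_BYTES = 128          # cache line size
THREADS_PER_SM = 1536     # threads sharing one L1 data cache

lines = L1_BYTES // LINE_BYTES
print(lines, lines / THREADS_PER_SM)   # 384 0.25
```

Each thread gets a quarter of a cache line on average, so the L1 cache alone cannot serve as a per-thread write buffer.
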
KILO TM: Value-Based Conflict Detection
- TX1: atomic {B=A+1}

    TxBegin
    LD  r1,[A]
    ADD r1,r1,1
    ST  r1,[B]
    TxCommit

- TX2: atomic {A=B+2}

    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit

- Each transaction keeps a read-log and a write-log in private memory
- [Figure: global memory starts at A=1, B=0. TX1 logs read A=1 and write B=2; TX2 logs read B=0 and write A=2. After TX1 commits, global memory holds A=1, B=2]
- Self-validation + abort: only detects the existence of a conflict (not its identity)

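The read-log/write-log scheme can be sketched as a small software model. This is an illustration only: the `Transaction` class and its method names are hypothetical, the two transactions commit one after the other, and nothing here models the hardware. The initial values A=1, B=0 are the ones implied by the serial outcomes listed on the next slide.

```python
class Transaction:
    def __init__(self, memory):
        self.memory = memory
        self.read_log = {}    # address -> value observed during execution
        self.write_log = {}   # address -> value to publish at commit

    def load(self, addr):
        if addr in self.write_log:          # read own buffered write
            return self.write_log[addr]
        value = self.memory[addr]
        self.read_log.setdefault(addr, value)
        return value

    def store(self, addr, value):
        self.write_log[addr] = value        # buffered, not yet visible

    def validate(self):
        # Self-validation: compare logged values against global memory.
        return all(self.memory[a] == v for a, v in self.read_log.items())

    def commit(self):
        if not self.validate():
            return False                    # abort; caller re-executes
        self.memory.update(self.write_log)
        return True


def tx1_body(tx):   # atomic { B = A + 1 }
    tx.store("B", tx.load("A") + 1)


def tx2_body(tx):   # atomic { A = B + 2 }
    tx.store("A", tx.load("B") + 2)


memory = {"A": 1, "B": 0}
t1, t2 = Transaction(memory), Transaction(memory)
tx1_body(t1)
tx2_body(t2)
first = t1.commit()     # True: A is still 1
second = t2.commit()    # False: B changed from 0 to 2
retry = Transaction(memory)
tx2_body(retry)
third = retry.commit()  # True on re-execution
print(memory)           # {'A': 4, 'B': 2}
```

TX2 learns only that some logged value changed, not which transaction changed it, which is the "existence, not identity" point on the slide.
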
Parallel Validation?
- TX1: atomic {B=A+1}; TX2: atomic {A=B+2}
- [Figure: both transactions validate their read-logs (A=1 and B=0) against global memory (A=1, B=0) in parallel, both pass, and both write-logs (B=2 and A=2) are committed, leaving A=2, B=2]
- Tx1 then Tx2: A=4, B=2 OR Tx2 then Tx1: A=2, B=3
- The result matches neither serial order: data race!?!

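The race can be reproduced with a few lines of code. This is a minimal sketch with hypothetical names; it assumes the initial state A=1, B=0 and models "parallel" validation simply as both validations finishing before either commit becomes visible.

```python
memory = {"A": 1, "B": 0}

# Logs after both transactions have executed against the initial state.
tx1 = {"read": {"A": 1}, "write": {"B": 2}}   # atomic { B = A + 1 }
tx2 = {"read": {"B": 0}, "write": {"A": 2}}   # atomic { A = B + 2 }


def validate(tx):
    return all(memory[a] == v for a, v in tx["read"].items())


# Both validations run before either commit is visible.
ok1, ok2 = validate(tx1), validate(tx2)
if ok1:
    memory.update(tx1["write"])
if ok2:
    memory.update(tx2["write"])

serial_outcomes = [{"A": 4, "B": 2}, {"A": 2, "B": 3}]
print(memory, memory in serial_outcomes)   # {'A': 2, 'B': 2} False
```
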
Serialize Validation?
- A commit unit at global memory validates and commits one transaction at a time
- Benefit #1: no data race
- Benefit #2: no live lock
- Drawback: serializes non-conflicting transactions (“collateral damage”)
- [Figure: timeline in which TX1 performs V + C while TX2 stalls, then TX2 performs V + C. V = validation, C = commit]

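The cost of serialization can be seen in a toy timing model. This is a sketch under simple assumptions: all transactions arrive at the commit unit at time 0, every transaction has the same validation and commit latency, and the latencies used below are made up, since the slides give no concrete numbers.

```python
def serialized_finish_times(num_tx, validate_time, commit_time):
    """Finish time of each TX when the commit unit handles one at a time."""
    per_tx = validate_time + commit_time
    return [(i + 1) * per_tx for i in range(num_tx)]


# Hypothetical latencies in cycles.
print(serialized_finish_times(3, validate_time=10, commit_time=5))
# [15, 30, 45]
```

Even if the three transactions touch disjoint data, the last one waits for the first two, which is the "collateral damage" named on the slide.
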
Solution: Speculative Validation
- Key idea: split conflict detection into two parts
  1. Recently committed TX, in parallel
  2. Concurrently committing TX, in commit order
- [Figure: commit unit at global memory; timeline in which TX1, TX2, and TX3 overlap their V+C phases, with one stall. Annotations: Approximate; RS Conflict Rare; Good Commit Parallelism. V = validation, C = commit]

KILO TM Implementation
- SIMT stacks
- Commit unit
- TX log unit
- Minimal modification to the existing GPU architecture

Evaluation Methodology
- GPGPU-Sim 3.0 (BSD license)
  - Detailed: IPC correlation of 0.93 vs. GT 200
  - KILO TM (timing-driven memory accesses)
- GPU TM applications
  - Hash Table (HT-H, HT-L)
  - Bank Account (ATM)
  - Cloth Physics (CL)
  - Barnes Hut (BH)
  - CudaCuts (CC)
  - Data Mining (AP)

Performance (vs. Serializing TX)
- Serializing TX ≈ coarse-grained locks
- [Figure: performance relative to serializing TX; higher is better]

Performance (Exec. Time)
- Captures 59% of FG lock performance
- [Figure: normalized execution time of Ideal TM, KILO TM, and FG Lock for HT-H, HT-L, ATM, CL, BH, CC, and AP; lower is better]

Implementation Complexity
- Logs in private L1 data cache
- Commit unit
  - 5kB last writer history unit
  - 19kB transaction status
  - 32kB read-set and write-set buffer
- CACTI, 40nm
  - 0.40mm^2 x 6 memory partitions
  - 0.5% of 520mm^2

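The area overhead can be verified from the figures on the slide. This worked calculation assumes that the 0.40mm^2 figure is per memory partition and that the three listed buffers make up one commit unit's storage, which is how the slide reads.

```python
commit_unit_kb = 5 + 19 + 32        # LWH unit + TX status + RS/WS buffer
area_mm2 = 0.40 * 6                 # 0.40 mm^2 in each of 6 memory partitions
overhead_pct = 100 * area_mm2 / 520 # share of the 520 mm^2 reference area
print(commit_unit_kb, round(area_mm2, 2), round(overhead_pct, 2))
# 56 2.4 0.46
```

The 0.46% result rounds to the 0.5% quoted on the slide.
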
Summary
- KILO TM: hardware TM for GPUs
  - 1000s of concurrent scalar TXs
  - Handles scalar TX abort
  - No dependency on a cache coherence protocol
  - Word-level conflict detection
  - Unbounded transactions
- 59% of fine-grained locking performance
- 128X faster than serializing TX execution
- 0.5% area overhead
- Questions?

Backup Slides

ABA Problem?
- Classic example: linked-list-based stack
- Thread 0, pop():

    while (true) {
        t = top;
        next = t->Next;
        // thread 2: pop A, pop B, push A
        if (atomicCAS(&top, t, next) == t)
            break; // succeeds!
    }

- [Figure: stack states. Initially top -> A -> B -> C -> Null. After thread 2 runs, top -> A -> C -> Null, while thread 0 still holds t = A and next = B. After thread 0's CAS, top -> B, a node that was already popped]

ABA Problem?
- atomicCAS protects only a single word
  - Only part of the data structure
- Value-based conflict detection protects all relevant parts of the data structure

    while (true) {
        t = top;
        next = t->Next;
        if (atomicCAS(&top, t, next) == t)
            break; // succeeds!
    }

- [Figure: stack top -> A -> B -> C -> Null]
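
The difference between the two checks can be shown with a small single-threaded replay of the interleaving. This is a sketch, not GPU code: the `Node` class, the `stack` dictionary, and the log layout are hypothetical, and thread 2's operations are simply executed in between thread 0's read and its check.

```python
class Node:
    def __init__(self, name, nxt=None):
        self.name = name
        self.next = nxt


c = Node("C")
b = Node("B", c)
a = Node("A", b)
stack = {"top": a}                        # top -> A -> B -> C

# Thread 0 starts pop(): it reads top and top.next, then is delayed.
t = stack["top"]
nxt = t.next
read_log = {"top": t, "A.next": nxt}      # what a transaction would log

# Thread 2 runs: pop A, pop B, push A.
stack["top"] = b                          # pop A
stack["top"] = c                          # pop B
a.next = c                                # push A
stack["top"] = a                          # top -> A -> C

# Single-word CAS check: only `top` is compared, and it is A again.
cas_would_succeed = stack["top"] is t

# Value-based validation: every logged location is compared.
current = {"top": stack["top"], "A.next": a.next}
tx_would_validate = all(current[k] is v for k, v in read_log.items())

print(cas_would_succeed, tx_would_validate)   # True False
```

The CAS succeeds and would install the stale pointer to B, whereas validation of the full read-log notices that A's next pointer changed and aborts.
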