Hardware Transactional Memory for GPU Architectures*

1 Hardware Transactional Memory for GPU Architectures*
Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt, University of British Columbia. *In Proc. ACM/IEEE Int’l Symp. on Microarchitecture (MICRO-44)

2 Hardware TM for GPU Architectures
Motivation: the lifetime of GPU application development, time spent on functionality vs. performance. E.g., N-Body with 5M bodies: CUDA SDK, O(n^2), 1640 s (barrier); Barnes Hut, O(n log n), 5.2 s (locks). (Slide chart: development timelines with fine-grained locking vs. transactional memory.) Usually, developers can get an application running on the GPU after a little initial effort and spend the rest of their time tuning it for performance. For applications that require fine-grained data synchronization, the developer may spend significant effort just to get the code working. As in the usual case, the first working code performs poorly, so the effort to tune the application is still required. If developers do spend the effort to tune the code, the reward can be great: the O(n log n) implementation of N-Body is orders of magnitude faster than the naive implementation, but it is harder to get right because of locks. Many developers will give up long before the code is working, and even if they do not, the foreseeable tuning effort is enough to scare them off. With TM, the developer can get the code working much sooner and spend the effort tuning it for performance instead. In reality, there is a performance overhead for using TM, but overall it is a gain for the developer: better-performing code for the same effort.

3 Hardware TM for GPU Architectures
Talk Outline What we mean by “GPU” in this work. Data Synchronization on GPUs. What is Transactional Memory (TM)? TM is compatible with OpenCL. … but is TM compatible with GPU hardware? KILO TM: A Hardware TM for GPUs. Results

4 What is a GPU (in this work)?
GPU is NVIDIA/AMD-like, a compute accelerator: SIMD HW + aggressive memory subsystem => high compute throughput and efficiency. Non-graphics APIs: OpenCL, DirectCompute, CUDA. Programming model: hierarchy of scalar threads. Today: limited communication & synchronization. GPUs are prominent examples of commodity compute accelerators. They feature SIMD hardware and an aggressive memory subsystem to achieve this compute bandwidth efficiently. The hardware is exposed via non-graphics APIs (OpenCL, DirectCompute, CUDA), whose programming model lets a CPU thread launch a hierarchy of scalar threads onto the GPU: a kernel is a grid of work groups / thread blocks, and each block is a set of wavefronts / warps of scalar threads. The threads have limited communication and synchronization: threads within a block can communicate via shared (local) memory and synchronize at a barrier, but communication between blocks can only go through global memory.
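To make the thread hierarchy concrete, here is a minimal CUDA sketch (ours, not from the slides; the kernel and array names are made up): a kernel launched as a grid of thread blocks, with intra-block communication through shared memory and a barrier, and inter-block communication through global memory.

__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];               // shared (local) memory, visible to one block
    int t = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + t;   // each scalar thread handles one element
    buf[t] = in[gid];
    __syncthreads();                         // barrier: all threads of the block wait here
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) buf[t] += buf[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = buf[0];    // inter-block communication goes through global memory
}
// Host side, launching far more threads than run concurrently:
//   blockSum<<<numBlocks, 256>>>(d_in, d_out);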

5 Baseline GPU Architecture
This is our view of a generic GPU architecture. Highlights: SIMT cores (each running 100s of threads) share a common memory subsystem through an interconnection network; each core has a SIMT front end (fetch, decode, schedule, branch) feeding a SIMD datapath, plus shared memory, a non-coherent L1 D-cache, and texture and constant caches; each memory partition contains a last-level cache bank, an atomic operation unit, and an off-chip DRAM channel.

6 Stack-Based SIMD Reconvergence (“SIMT”) (Levinthal SIGGRAPH’84, Fung MICRO’07)
(Slide figure: a warp of threads with a common PC diverges through basic blocks A through G; the reconvergence stack holds, per entry, a reconvergence PC, a next PC, and an active mask, e.g. C/1001 and D/0110 for the taken and not-taken paths, which execute serially under partial masks and reconverge at E and then G.)
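For readers new to the mechanism, a small sketch (ours, following the stack-based scheme the slide cites, with simplified field and function names) of how a per-warp reconvergence stack can be represented and updated at a divergent branch:

#include <cstdint>
#include <vector>

// One entry per outstanding control-flow path of a warp.
struct SimtStackEntry {
    uint32_t nextPC;      // where this path resumes execution
    uint32_t reconvPC;    // immediate post-dominator where the paths rejoin
    uint32_t activeMask;  // one bit per thread in the warp
};

// On a divergent branch: retarget the current entry to the reconvergence point,
// then push the not-taken and taken paths with their subsets of the active mask.
void onDivergentBranch(std::vector<SimtStackEntry> &stack,
                       uint32_t takenPC, uint32_t notTakenPC,
                       uint32_t reconvPC, uint32_t takenMask) {
    uint32_t fullMask = stack.back().activeMask;
    stack.back().nextPC = reconvPC;
    stack.push_back({notTakenPC, reconvPC, fullMask & ~takenMask});
    stack.push_back({takenPC, reconvPC, takenMask});
}

// Each cycle the warp executes at stack.back().nextPC under stack.back().activeMask;
// when nextPC reaches reconvPC the entry is popped and the merged mask below resumes.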

7 Data Synchronizations on GPUs
Motivation: solve a wider range of problems on the GPU; data races => need for data synchronization. Current solution: atomic read-modify-write (32-bit/64-bit). Best solution? Why Transactional Memory? E.g., N-Body with 5M bodies (traditional synchronization, not TM): CUDA SDK, O(n^2), 1640 s (barrier); Barnes Hut, O(n log n), 5.2 s (atomics, harder to get right). Easier to write/debug, efficient algorithms, practical efficiency: we want the efficiency of the GPU with reasonable (not superhuman) effort and time. The Barnes Hut version is from GPU Computing Gems, Chapter 6.
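To illustrate the current solution (this CUDA sketch is ours; the kernel and array names are invented): a single-location update maps directly onto a 32-bit atomic read-modify-write, but anything touching more than one location does not.

__global__ void deposit(int *balances, const int *account, const int *amount) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&balances[account[t]], amount[t]);   // one atomic RMW per update: race-free
}
// A transfer between two accounts cannot be expressed as a single atomic RMW;
// it needs a pair of locks (deadlock-prone, see the next slides) or a transaction.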

8 Data Synchronizations on GPUs
Deadlock-free code with fine-grained locks and 10,000+ hardware-scheduled threads is hard: (# locks) x (# sharing threads) => an explosion in the number of possible global lock states. Which of these states are deadlocks?! Non-determinism is the worst problem here: every time the programmer runs the program to try to reproduce a deadlock, the deadlock occurs in some different way. For a start, one needs to look at the right set of threads to understand how the deadlock occurs, and that in itself is not simple. This problem is well known in the supercomputing community, which has created special debugging tools to analyze traces from 1000s of compute nodes and summarize them into categories of behavior. Other general problems with lock-based synchronization: the relationship between locks and the objects they protect is implicit, and the code is not composable.

9 Data Synchronization Problems Specific to GPUs
Interaction between locks and SIMT control flow can cause deadlocks.

A: while (atomicCAS(&lock, 0, 1) == 1);
B: // Critical Section …
C: lock = 0;

A: done = 0;
B: while (!done) {
C:   if (atomicCAS(&lock, 0, 1) == 0) {
D:     // Critical Section …
E:     lock = 0;
F:     done = 1;
G:   }
H: }

This is specific to GPUs (SIMT control flow). In this example, two threads share a wavefront/warp. In the first version, T0 obtains the lock and exits the loop; it then waits at the start of the critical section for T1 (for reconvergence), but T1 is stuck in the loop forever because T0 holds the lock. In the second version the critical section sits inside the loop body, so the thread that acquired the lock finishes the critical section and releases it before the warp reconverges, and the other thread acquires it on a later iteration.
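A compilable CUDA rendering of the second (deadlock-free) pattern, as a sketch under our own naming; the lock and counter variables are invented, and the volatile access and fence reflect the usual caveat that data guarded by a spin lock must not be served from the non-coherent L1:

__global__ void lockedIncrement(int *lock, int *counter) {
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {    // acquired: old value was 0
            volatile int *c = counter;       // bypass the non-coherent L1 for guarded data
            *c = *c + 1;                     // critical section
            __threadfence();                 // make the update visible before releasing
            atomicExch(lock, 0);             // release
            done = true;
        }
    }
}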

10 Hardware TM for GPU Architectures
Transactional Memory: the program specifies atomic code blocks called transactions [Herlihy’93].

Lock version:
  Lock(X[a]); Lock(X[b]); Lock(X[c]);
  X[c] = X[a] + X[b];
  Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);
  Potential deadlock (if another thread acquires these locks in a different order)!

TM version:
  atomic { X[c] = X[a] + X[b]; }

11 Hardware TM for GPU Architectures
Transactional Memory, programmers’ view: transactions appear to execute atomically and in some order (TX1 then TX2, or TX2 then TX1). Non-conflicting transactions may run in parallel; conflicting transactions are automatically serialized. The underlying runtime system can optimistically run non-conflicting transactions in parallel, and a conflict may be resolved by aborting one of the transactions (while the other commits) and restarting it later.

12 Hardware TM for GPU Architectures
Transactional Memory: each transaction has 3 phases. Execution: track all memory accesses (read-set and write-set). Validation: detect any conflicting accesses between transactions and resolve conflicts if needed (abort/stall). Commit: update global memory. Read-set = all memory locations read by the transaction; write-set = all memory locations written by the transaction. A conflict exists between two transactions when one updates a memory location that has previously been read by the other.

13 Transactional Memory on OpenCL
A natural extension to the OpenCL programming model: a program can launch many more threads than the hardware can execute concurrently. GPU-TM? Threads currently running transactions do not need to wait for future, unscheduled threads. The GPU hardware executes the threads group by group, i.e., threads in the same launch may not coexist on the hardware. If a thread requires synchronization with another thread that has yet to be spawned (a future, unscheduled thread), the result is a deadlock. A transaction is supposed to run on its own, so a thread does not need to wait for another thread to make progress.

14 Are TM and GPUs Incompatible?
The problem with GPUs (from a TM perspective): 1000s of concurrent threads; inter-thread spatial locality is common; no cache coherence; no private cache for each thread (so where to buffer speculative state?); a transaction abort => control-flow divergence.

15 Hardware TM for GPUs Challenge: Conflict Detection
1024-bit signature per thread => 3.8MB for 30k threads. A key challenge in supporting hardware TM on GPUs is identifying conflicts among 30k concurrent threads. Typical CPU-based HTMs assume each transaction has a private data cache that records what the transaction has read or written; a committing transaction (e.g. TX1 with write-set W(C)) broadcasts its writes on the bus, and every other transaction checks its cache to detect a read-write conflict (e.g. TX4, which has read C). This naive broadcast generates a lot of wasteful traffic; scalable coherence can reduce it, but the GPU has no coherence, nor does it have a private data cache for each scalar thread. Transactions can instead use signatures to record the read-set and write-set, with a committing transaction broadcasting its signature for conflict detection. We had to use a 1024-bit signature per thread (4 x 256-bit sub-signatures with an H3-family hash) to keep the false-conflict rate below 20%; the signature storage is so large that we did not bother measuring the traffic overhead of sending the signatures around.
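For background, a minimal sketch (ours; KILO TM ultimately rejects signatures in favour of value-based detection, and the hash here is a simple stand-in rather than a real H3 function) of a 1024-bit per-transaction signature split into 4 x 256-bit sub-signatures, with insertion and a conservative intersection test:

#include <cstdint>

struct Signature {
    uint64_t bits[16] = {};                        // 1024 bits = 4 sub-signatures x 256 bits

    static uint32_t hash(uint64_t addr, uint32_t seed) {
        uint64_t x = (addr + seed) * 0x9E3779B97F4A7C15ULL;
        return (uint32_t)(x >> 56);                // 0..255: one bit within a sub-signature
    }
    void insert(uint64_t addr) {
        for (uint32_t s = 0; s < 4; ++s) {         // set one bit in each sub-signature
            uint32_t b = 256 * s + hash(addr, s);
            bits[b / 64] |= 1ULL << (b % 64);
        }
    }
    // True if some address may be present in both signatures (false positives possible,
    // false negatives not): every sub-signature pair must share at least one set bit.
    bool mayConflict(const Signature &other) const {
        for (uint32_t s = 0; s < 4; ++s) {
            bool overlap = false;
            for (uint32_t w = 4 * s; w < 4 * s + 4; ++w)
                if (bits[w] & other.bits[w]) { overlap = true; break; }
            if (!overlap) return false;
        }
        return true;
    }
};
// A committing transaction would broadcast its write-set signature; a receiver
// intersects it with its own read-set signature and aborts on a (possible) conflict.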

16 Hardware TM for GPUs Challenge: Transaction Rollback
A CPU core has 10s of registers per hardware thread, so checkpointing the register file at transaction entry and restoring it on abort is cheap. A GPU core (SM) has a 32k-register warp register file, roughly 2MB of total on-chip register storage across the chip; checkpointing that in hardware at TX entry to support abort is impractical. Checkpoint?

17 Hardware TM for GPUs Challenge: Access Granularity and Write Buffer
A CPU core runs a transaction on 1-2 threads with a 32kB L1 data cache, so the cache can buffer each transaction's accesses at cache-line granularity. A GPU core (SM) runs many warps that all share one L1 data cache and commit their transactions to global memory. Fermi's L1 data cache (48kB) = 384 x 128B lines. Problem: 384 lines / 1536 threads < 1 line per thread!

18 Hardware TM on GPUs Challenge: SIMT Hardware
On GPUs, scalar threads in a warp/wavefront execute in lockstep. (Slide figure: a warp with 8 scalar threads executes TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit. At TxCommit some of the threads commit while others abort: where should the warp reconverge?)

19 Hardware TM for GPU Architectures
Goal: We take it as a given that most programmers trying lock-based programming on a GPU will give up before they manage to get their application working. Hence, our goal was to find the most efficient approach to implementing TM on a GPU.

20 Hardware TM for GPU Architectures
KILO TM Supports 1000s of concurrent transactions Transaction-aware SIMT stack No cache coherence protocol dependency Word-level conflict detection Captures 59% of FG Lock Performance 128X Faster than Serialized Tx Exec.

21 KILO TM: Design Highlights
Value-Based Conflict Detection Self-Validation + Abort: Simple Communication No Cache Coherence Dependence Speculative Validation Increase Commit Parallelism

22 High Level GPU Architecture + KILO TM Implementation Overview

23 KILO TM: SIMT Core Changes
SW Register Checkpoint. Observation: most overwritten registers are not used again, and compiler analysis can identify what to checkpoint. Transaction abort ~ do-while loop: extend the SIMT stack with special entries to track aborted transactions in each warp. The example transaction (TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit) shows two things. First, the register r2 is overwritten right away, so there is no point in checkpointing and restoring its original value. Second, in KILO TM transactions normally abort at the end of the transaction, so an abort acts like a loop back-edge; this is similar to divergence within loops, so we can set the reconvergence point right after the transaction and use the stack for the committed threads to wait for the aborted ones. We still need specialized entries because a transaction can also abort in the middle of its execution (to handle the opacity corner case); the next slide has more detail.
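A host-side C++ toy (ours; tryCommit is a stand-in for the hardware validation outcome) that mirrors the analogy: registers that are live into the transaction are checkpointed in software at TxBegin and restored when the abort loops back, while a register like r2 above, which is overwritten before being read, needs no checkpoint at all.

#include <cstdio>

static bool tryCommit(int attempt) { return attempt >= 2; }  // pretend validation fails once

int main() {
    int x = 7;                    // value that is read inside the transaction (live-in)
    int checkpoint_x = x;         // SW register checkpoint at TxBegin
    int attempt = 0;
    bool committed;
    do {                          // an abort simply re-executes the body, like a do-while loop
        x = checkpoint_x;         // restore the checkpointed register on (re)entry
        x = x + 2;                // transaction body
        committed = tryCommit(++attempt);   // TxCommit: validation may fail => abort
    } while (!committed);
    printf("committed x=%d after %d attempt(s)\n", x, attempt);
    return 0;
}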

24 Transaction-Aware SIMT Stack
A: t = tid.x; if (…) {
B:   tx_begin;
C:   x[t%10] = y[t] + 1;
D:   if (s[t])
E:     y[t] = 0;
F:   tx_commit;
G:   z = y[t];
   }
H: w = y[t+1];

(Slide figure: snapshots of the transaction-aware SIMT stack. Each entry holds an active mask, a reconvergence PC (RPC), a PC, and a type: N for normal, R for transaction retry, T for transaction. At tx_begin the active mask is copied into the new transaction entries. At tx_commit, threads that failed validation, e.g. threads 6 and 7, have their active mask and PC copied into the retry entry, so the transaction restarts for just those threads, an implicit loop. Branch divergence within the transaction (D/E) is handled with ordinary stack entries. Once all threads have committed, the top-of-stack entry resumes at G.)

25 KILO TM: Value-Based Conflict Detection
Initially, global memory holds A=1 and B=0. TX1 runs atomic{ B = A + 1 } (TxBegin; LD r1,[A]; ADD r1,r1,1; ST r1,[B]; TxCommit), recording A=1 in its read-log and B=2 in its write-log in private memory. TX2 runs atomic{ A = B + 2 } (TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit), recording B=0 in its read-log and A=2 in its write-log. With value-based conflict detection there is no broadcast traffic: the read-log and write-log of each transaction are linear buffers stored in the private memory of the thread, and the non-coherent L1 data cache can cache these logs. Self-validation + abort only detects the existence of a conflict (not the identity of the other transaction), so no transaction-to-transaction messages are needed: simple communication.
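A minimal host-side sketch (ours, not KILO TM's hardware; it serializes validation and commit with one lock, and omits details such as forwarding a transaction's own buffered writes to its later reads) of value-based conflict detection with per-transaction read and write logs:

#include <mutex>
#include <vector>

// Ours, for illustration only: KILO TM keeps these logs in each thread's private memory
// and performs validation and commit in hardware commit units.
struct LogEntry { int *addr; int value; };

struct Tx {
    std::vector<LogEntry> readLog;    // value observed the first time each location was read
    std::vector<LogEntry> writeLog;   // buffered speculative writes
};

static std::mutex commitUnit;         // stand-in for serialized validation + commit

int txRead(Tx &tx, int *addr) {       // Execution phase: log reads
    int v = *addr;
    tx.readLog.push_back({addr, v});
    return v;
}

void txWrite(Tx &tx, int *addr, int v) {  // Execution phase: buffer writes
    tx.writeLog.push_back({addr, v});
}

bool txCommit(Tx &tx) {               // Validation + commit phases
    std::lock_guard<std::mutex> g(commitUnit);
    for (const LogEntry &e : tx.readLog)
        if (*e.addr != e.value)       // value changed since we read it: conflict
            return false;             // self-abort; no message to any other transaction
    for (const LogEntry &e : tx.writeLog)
        *e.addr = e.value;            // publish the write-set to global memory
    return true;
}

// Usage, mirroring the slide's TX1 atomic{ B = A + 1 }:
//   Tx tx;  do { tx = Tx{}; txWrite(tx, &B, txRead(tx, &A) + 1); } while (!txCommit(tx));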

26 Hardware TM for GPU Architectures
Parallel Validation? Data Race!?! Init: A=1, B=0. TX1: atomic{ B = A + 1 }; TX2: atomic{ A = B + 2 }. TX1 then TX2 gives B=2, A=4; TX2 then TX1 gives A=2, B=3. But if both transactions validate against global memory in parallel, both see the initial values, both pass validation, and committing yields A=2, B=2, which matches neither serial order.

27 Hardware TM for GPU Architectures
Serialize validation? TX1 and TX2 send their logs to the commit unit, which performs validation + commit (V+C) for one transaction while the other stalls. Benefit #1: no data race. Benefit #2: no livelock (a generic problem of lazy TM). Drawback: it serializes non-conflicting transactions (“collateral damage”).

28 Identifying Non-conflicting Tx: Step 1: Leverage Parallelism
One obvious way to increase commit parallelism is to divide the memory space into multiple partitions, each with its own commit unit, and allow transactions that touched different partitions to validate and commit in parallel; for example, TX1 and TX2 can commit in parallel here. We do have to add some extra protocol so that transactions touching more than one memory partition still work; for example, both commit units must serialize validation and commit in the same order for TX1 and TX3. However, there are far too many concurrent transactions and too few partitions, so there needs to be a better way to increase commit parallelism within a commit unit.

29 Solution: Speculative Validation
Key idea: split validation into two parts. Part 1: check against recently committed transactions. Part 2: check against concurrently committing transactions. Part 1 is value-based conflict detection; it requires reading global memory, a long-latency operation. Part 2 is done within the commit unit via specialized hardware, so it is much faster. To hide the latency, we start Part 1 speculatively and perform Part 2 while waiting for the data to return from global memory. The next slide illustrates how this works.

30 KILO TM: Speculative Validation
The memory subsystem is deeply pipelined and highly parallel. (Slide figure: transactions TX1 R(C),W(D); TX2 R(A),W(B); TX3 R(D),W(E) send their read-logs and write-logs to the commit unit, which processes them in stages: validation queue, log transfer, speculative validation / hazard detection, validation wait, finalize outcome, and commit, against a global memory partition.)

31 KILO TM: Speculative Validation
The Last Writer History unit pairs a small lookup table (address -> commit ID of the last committing writer, with eviction) with a recency Bloom filter; it is SRAM, fast and low latency, whereas value-based conflict detection goes to global memory (DRAM) and is slower. In the example (TX1 R(C),W(D); TX2 R(A),W(B); TX3 R(D),W(E)), TX1 and TX2 pass through the hazard detection stage without any detected hazard and update the Last Writer History unit once their checks are done. TX3's read of D matches TX1's pending write W(D), so a hazard is detected. After hazard detection, the transactions simply wait for the value-based validation to complete; here the values returned from global memory match the logged values, so TX1 and TX2 proceed and commit, updating memory at the commit stage. TX3 has to stall until TX1 has committed and then reread D from global memory; since TX1 did commit, D now has a different value and TX3's validation fails. In an actual design, this allows 1000s of non-conflicting transactions to validate and commit in parallel.
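A rough software model (ours; the capacity, eviction policy, hash, and field names are assumptions, not the paper's RTL) of the hazard-detection lookup described above: a small last-writer table whose evicted entries fall back to a recency Bloom filter, so a miss in both structures means no concurrently committing writer.

#include <cstdint>
#include <unordered_map>

struct LastWriterHistory {
    std::unordered_map<uint64_t, uint32_t> lastWriter;  // word address -> commit ID (CID)
    uint64_t recency[64] = {};                          // 4096-bit recency Bloom filter
    static constexpr unsigned kCapacity = 1024;         // assumed table capacity

    static uint32_t bitOf(uint64_t addr) {              // stand-in hash: 12-bit filter index
        return (uint32_t)((addr * 0x9E3779B97F4A7C15ULL) >> 52);
    }
    void recordWrite(uint64_t addr, uint32_t cid) {
        if (lastWriter.size() >= kCapacity) {            // evict an entry into the filter
            uint32_t b = bitOf(lastWriter.begin()->first);
            recency[b / 64] |= 1ULL << (b % 64);
            lastWriter.erase(lastWriter.begin());
        }
        lastWriter[addr] = cid;
    }
    // True if a concurrently committing transaction may have written 'addr':
    // an exact hit returns the writer's CID; a filter hit is a conservative "maybe".
    bool mayHazard(uint64_t addr, uint32_t *cid) const {
        auto it = lastWriter.find(addr);
        if (it != lastWriter.end()) { *cid = it->second; return true; }
        uint32_t b = bitOf(addr);
        return (recency[b / 64] >> (b % 64)) & 1;
    }
};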

32 Log Storage Transaction logs are stored in the private memory of each thread: located in DRAM, cached in the L1 and L2 caches, and laid out at consecutive physical addresses. (Slide figure: a wavefront's view of the logs, showing per-thread read-log and write-log pointers and address/value pairs, e.g. thread T0's read-log entry A=3 sits next to T1's B=4, T2's C=9, and T3's D=7, so the same log entry across the threads of a wavefront occupies consecutive addresses.)
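A small device-side sketch (ours; the warp-interleaved indexing is our reading of the "consecutive physical addresses" figure, and all names are invented) of appending to a per-thread log so that one entry per lane lands in consecutive addresses and the stores coalesce:

#define WARP_SIZE 32

struct TxLogEntry { unsigned int addr; unsigned int value; };

// 'base' points to this warp's log region; 'count' is the calling thread's private
// entry counter. Entry k of lane 'lane' goes to index k*WARP_SIZE + lane, so when all
// lanes append their k-th entry together, the 32 stores fall in consecutive addresses.
__device__ void logAppend(TxLogEntry *base, int lane, int *count,
                          unsigned int addr, unsigned int value) {
    int k = (*count)++;
    base[k * WARP_SIZE + lane] = TxLogEntry{addr, value};
}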

33 Log Transfer Entries heading to the same memory partition can be grouped into a larger packet. (Slide figure: the read-log and write-log entries of a transaction, e.g. A=3, B=4, C=9, D=7, are sorted by the memory partition that owns each address and sent as per-partition packets to the corresponding commit units.)
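As a toy illustration (ours; the interleaving granularity, the 256-byte chunk size, and the packet representation are all assumptions), grouping a transaction's log entries by destination partition before shipping them to the commit units:

#include <cstdint>
#include <vector>

struct Entry { uint64_t addr; uint32_t value; };

// Assumed address interleaving: 256-byte chunks round-robin across partitions.
static int partitionOf(uint64_t addr, int numPartitions) {
    return (int)((addr >> 8) % numPartitions);
}

// Group a transaction's log entries into one packet per memory partition / commit unit.
std::vector<std::vector<Entry>> buildPackets(const std::vector<Entry> &log, int numPartitions) {
    std::vector<std::vector<Entry>> packets(numPartitions);
    for (const Entry &e : log)
        packets[partitionOf(e.addr, numPartitions)].push_back(e);
    return packets;
}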

34 Distributed Commit / HW Org.

35 ABA Problem? Classic Example: Linked List Based Stack
Thread 0 – pop(): top A Next B C Null while (true) { t = top; Next = t->Next; // thread 2: pop A, pop B, push A if (atomicCAS(&top, t, next) == t) break; // succeeds! } t A Next B This problem was first discovered on the IBM System 370 (one of the first machines introducing atomicCAS instructions). top A Next C Null top C Next Null top B Next C Null top B Next C Null

36 ABA Problem? atomicCAS protects only a single word
Only part of the data structure Value-based conflict detection protects all relevant parts of the data structure top A Next B C Null while (true) { t = top; Next = t->Next; if (atomicCAS(&top, t, next) == t) break; // succeeds! } The essence of the ABA problem comes from how atomicCAS only checks a single word, so race protection is only offered for part of the data structure. Protection to other parts of the data structure is inferred *by the programmer’s reasoning*. In the case with ABA problem, the programmer’s intuition missed a race condition. The value-based conflict detection used in KILO TM protects all relevant parts of the data structures, so it does not suffer from the ABA problem.

37 Evaluation Methodology
GPGPU-Sim 3.0 (BSD license) Detailed: IPC Correlation of 0.93 vs GT 200 KILO TM (Timing-Driven Memory Accesses) GPU TM Applications Hash Table (HT-H, HT-L) Bank Account (ATM) Cloth Physics (CL) Barnes Hut (BH) CudaCuts (CC) Data Mining (AP)

38 GPGPU-Sim 3.0.x running SASS (decuda)
0.976 correlation on the subset of the CUDA SDK that decuda correctly disassembles. Note: the rest of the data uses PTX instead of SASS (0.93 correlation). (We believe GPGPU-Sim is a reasonable proxy.)

39 Performance (vs. Serializing Tx)

40 Absolute Performance (IPC)
TM on the GPU performs well for applications with low contention; it performs poorly with memory divergence, low parallelism, or a high conflict rate (which can be tackled through algorithm design/tuning). CPU vs. GPU? CC: the FG-lock version is 400X faster than its CPU version. BH: the FG-lock version is 2.5X faster than its CPU version; in general, memory divergence hurts. AP: ported straight from the CPU version, so it suffers from load imbalance. TM simplifies programming => easier performance tuning.

41 Performance (Exec. Time)
Serializing Transaction Execution = Coarse-Grained Locking Performance. Captures 59% of FG Lock Performance; 128X faster than Serialized Tx Exec.

42 Hardware TM for GPU Architectures
KILO TM Scaling

43 Hardware TM for GPU Architectures
Abort-to-Commit Ratio: increasing the number of concurrent transactions increases the probability of conflict. Two possible solutions (future work). Solution 1: application performance tuning (easier with TM than with FG locks). Solution 2: transaction scheduling.

44 Thread Cycle Breakdown
Status of a thread at each cycle. Categories: TC: in a warp stalled by concurrency control; TO: in a warp committing its transactions; TW: has passed commit and is waiting for other threads in the warp to pass; TA: executing an eventually aborted transaction; TU: executing an eventually committed transaction (useful work); AT: acquiring a lock or doing an atomic operation; BA: waiting at a barrier; NL: doing non-transactional (normal) work.

45 Thread Cycle Breakdown
(Slide chart: thread cycle breakdown for each benchmark, HT-H, HT-L, ATM, CL, BH, CC, and AP, comparing the FGL, KL, KL-UC, and IDEAL configurations.)

46 Hardware TM for GPU Architectures
Core Cycle Breakdown: the action performed by a core at each cycle. Categories: EXEC: issuing a warp for execution; STALL: stalled by a downstream warp; SCRB: all warps blocked by the scoreboard due to data hazards, concurrency control, or pending commits (or any combination thereof); IDLE: none of the warps in the instruction buffer are ready.

47 Hardware TM for GPU Architectures
Core Cycle Breakdown (Slide chart: core cycle breakdown per benchmark for the FGL, KL, KL-UC, and IDEAL configurations.)

48 Read-Write Buffer Usage

49 Hardware TM for GPU Architectures
# In-Flight Buffers

50 Implementation Complexity
Logs are stored in private memory and cached in the L1 data cache. Per commit unit: 5kB Last Writer History unit, 19kB transaction status, 32kB read-set and write-set buffer. CACTI at 40nm: 0.40mm^2 x 6 memory partitions, about 0.5% of a 520mm^2 die. As a sanity check, we used a memory compiler to generate these structures at 65nm and scaled the area down to 40nm; the area is within 50% of the CACTI estimate.

51 Hardware TM for GPU Architectures
Summary KILO TM 1000s of Concurrent Transactions Value-Based Conflict Detection Speculative Validation for Commit Parallelism 59% Fine-Grained Locking Performance 0.5% Area Overhead

52 Hardware TM for GPU Architectures
Backup Slides

53 Logical Stage Organization

54 Execution Time Breakdown

