Hardware Transactional Memory for GPU Architectures*
Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
*In Proc. ACM/IEEE Int'l Symp. on Microarchitecture (MICRO-44)

Talk Outline
What we mean by "GPU" in this work
Data synchronization on GPUs
What is Transactional Memory (TM)?
TM is compatible with OpenCL... but is TM compatible with GPU hardware?
KILO TM: a hardware TM for GPUs
Results

What is a GPU (in this work)?
A GPU here is an NVIDIA/AMD-like compute accelerator: SIMD hardware plus an aggressive memory subsystem deliver high compute throughput and efficiency.
Non-graphics APIs: OpenCL, DirectCompute, CUDA.
Programming model: a hierarchy of scalar threads. A kernel launches work groups / thread blocks of scalar threads, which execute as wavefronts / warps; threads in a block share local (shared) memory and barriers, and all threads share global memory.
Today: limited communication & synchronization.
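
To ground those terms, a tiny CUDA example (mine, not the slide's): the kernel launch creates a grid of thread blocks, each block runs as warps of scalar threads, and threads in a block cooperate through shared (local) memory and a barrier.

    __global__ void scale(float *data, float k) {
        __shared__ float tile[256];                      // shared (local) memory, per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this scalar thread's global index
        tile[threadIdx.x] = data[i] * k;
        __syncthreads();                                 // barrier across the thread block
        data[i] = tile[threadIdx.x];                     // warps of 32 threads run in lockstep
    }

    int main() {
        float *d;
        cudaMalloc(&d, 1024 * sizeof(float));
        scale<<<4, 256>>>(d, 2.0f);                      // kernel = grid of 4 blocks x 256 threads
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }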

Baseline GPU Architecture
(Figure.) SIMT cores connect to memory partitions through an interconnection network. Each SIMT core pairs a SIMT front end (fetch, decode, schedule, branch) with a SIMD datapath, and has a memory subsystem with shared memory, a non-coherent L1 data cache, and texture/constant caches. Each memory partition holds a last-level cache bank, an atomic operation unit, and an off-chip DRAM channel.

Stack-Based SIMD Reconvergence ("SIMT") (Levinthal SIGGRAPH'84, Fung MICRO'07)
(Figure.) All threads of a warp share a common PC; a per-warp reconvergence stack handles branch divergence. Each stack entry holds a reconvergence PC, a next PC, and an active mask. Example with a 4-thread warp: all threads execute A and B (mask 1111); the branch at B diverges, so entries for C (mask 1001) and D (mask 0110) are pushed with their immediate post-dominator E as the reconvergence PC. The warp executes each path serially under its partial mask, pops each entry on reaching E, and reconverges to execute E and G with the full mask; over time the warp runs A, B, C, D, E, G.
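
To make the stack mechanics concrete, here is a minimal C-style sketch of a reconvergence stack for one warp. The entry layout (reconvergence PC, next PC, active mask) follows the slide; the struct and function names are illustrative, not the hardware's.

    #include <stdint.h>

    typedef struct {
        uint32_t reconv_pc;   /* PC where divergent paths rejoin (immediate post-dominator) */
        uint32_t next_pc;     /* PC this entry executes next */
        uint32_t active_mask; /* one bit per thread in the warp */
    } SimtEntry;

    /* On a divergent branch: point the current top entry at the reconvergence
       PC, then push one entry per side of the branch. */
    void diverge(SimtEntry *stack, int *top, uint32_t reconv_pc,
                 uint32_t taken_pc, uint32_t taken_mask,
                 uint32_t fallthru_pc, uint32_t fallthru_mask) {
        stack[*top].next_pc = reconv_pc;
        stack[++(*top)] = (SimtEntry){ reconv_pc, fallthru_pc, fallthru_mask };
        stack[++(*top)] = (SimtEntry){ reconv_pc, taken_pc, taken_mask };
    }

    /* The warp always issues from the top entry; when its PC reaches the
       entry's reconvergence PC, the entry is popped and the paths have
       reconverged under the entry below. */
    int maybe_reconverge(SimtEntry *stack, int *top, uint32_t warp_pc) {
        if (*top > 0 && warp_pc == stack[*top].reconv_pc) { (*top)--; return 1; }
        return 0;
    }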

Data Synchronization on GPUs
Motivation: solve a wider range of problems on the GPU; handling data races requires data synchronization.
Current solution: atomic read-modify-write operations (32-bit/64-bit). Is that the best solution?
Why Transactional Memory? E.g., N-Body with 5M bodies (traditional synchronization, not TM):
CUDA SDK: O(n^2), 1640 s (barrier)
Barnes-Hut: O(n log n), 5.2 s (atomics, harder to get right)
TM targets all three goals at once: code that is easy to write/debug, efficient algorithms, and practical efficiency, i.e., the efficiency of the GPU with reasonable (not superhuman) programmer effort and time.

Data Synchronization on GPUs
Writing deadlock-free code with fine-grained locks and 10,000+ hardware-scheduled threads is hard: the space of possible global lock states grows with (# locks) x (# sharing threads), and the programmer must reason about which of those states are deadlocks.
Other general problems with lock-based synchronization:
The relationship between locks and the objects they protect is implicit.
Lock-based code is not composable.
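
As a rough illustration of that state-space blow-up (my numbers, not the slide's): if each of L locks is either free or held by one of T threads, the global lock state has (T+1)^L possibilities; with only 8 locks shared by 1,024 threads that is 1025^8, on the order of 10^24 states in which deadlock freedom must hold.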

Data Synchronization Problems Specific to GPUs
The interaction between locks and SIMT control flow can cause deadlocks. A naive spin lock deadlocks under SIMT: the thread that acquires the lock diverges from its spinning warp-mates, and the SIMT stack can keep executing the spin path forever, so the lock holder never reaches the release.

    A: while (atomicCAS(&lock, 0, 1) == 1);  // spin while the lock is held
    B: // Critical Section ...
    C: lock = 0;

A divergence-safe version keeps all threads on the same loop, so the lock holder can run the critical section and release before the other threads retry:

    A: done = 0;
    B: while (!done) {
    C:   if (atomicCAS(&lock, 0, 1) == 0) {  // acquired: old value was 0
    D:     // Critical Section ...
    E:     lock = 0;
    F:     done = 1;
    G:   }
    H: }
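
For concreteness, a minimal self-contained CUDA kernel using the divergence-safe pattern above; the kernel name, the shared counter, and the release via atomicExch plus a fence (a safer release than the plain store on the slide) are my illustrative choices, not from the slide.

    __global__ void locked_increment(int *lock, int *counter) {
        int done = 0;
        while (!done) {
            if (atomicCAS(lock, 0, 1) == 0) {  // acquired: old value was 0
                (*counter)++;                  // critical section
                __threadfence();               // make the update globally visible
                atomicExch(lock, 0);           // release
                done = 1;
            }
        }
    }
    // launch sketch: locked_increment<<<60, 256>>>(d_lock, d_counter);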

Transactional Memory
The program specifies atomic code blocks called transactions [Herlihy'93].

Lock version (potential deadlock: two threads can acquire the same locks in different orders):

    Lock(X[a]); Lock(X[b]); Lock(X[c]);
    X[c] = X[a] + X[b];
    Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);

TM version:

    atomic { X[c] = X[a] + X[b]; }

Transactional Memory
Non-conflicting transactions may run in parallel: TX1 and TX2 accessing disjoint memory locations (e.g., A, B vs. C, D) both commit.
Conflicting transactions are automatically serialized: if TX1 and TX2 access the same locations, one commits and the other aborts, re-executes, and then commits.
Programmers' view: the outcome is as if TX1 ran before TX2, or TX2 before TX1.

Transactional Memory
Each transaction has 3 phases:
Execution: track all memory accesses (read-set and write-set).
Validation: detect conflicting accesses between transactions; resolve conflicts if needed (abort/stall).
Commit: update global memory.

Transactional Memory on OpenCL
TM is a natural extension of the OpenCL programming model: a program can launch many more threads than the hardware can execute concurrently. A GPU TM must therefore guarantee that threads currently running transactions never have to wait for future, not-yet-scheduled threads.

Are TM and GPUs Incompatible?
The problem with GPUs (from a TM perspective):
1000s of concurrent threads
Inter-thread spatial locality is common
No cache coherence
No private cache per thread (so where to buffer speculative state?)
Transaction aborts cause control flow divergence

Hardware TM for GPUs Challenge: Conflict Detection
CPU HTMs detect conflicts through cache coherence traffic: a bus invalidation is checked against each transaction's private data cache or access signature (e.g., TX2's write of C conflicts with TX1's read of C). GPUs have no coherence protocol, and per-thread structures do not scale: a 1024-bit signature per thread costs 3.8 MB for 30k threads, and each scalar thread would otherwise need its own cache.

Hardware TM for GPUs Challenge: Transaction Rollback
A CPU core has 10s of registers, so a checkpoint register file for rollback is cheap. A GPU core (SM) has a 32k-register file shared by its warps; checkpointing it for rollback would add roughly 2 MB of on-chip storage across the chip.

Hardware TM for GPUs Challenge: Access Granularity and Write Buffer
On a CPU, the L1 data cache (32 kB serving 1-2 threads) can buffer transactional writes. On a GPU SM, 1000s of threads share the L1: Fermi's 48 kB L1 data cache is 384 lines of 128 B, while an SM runs up to 1536 threads, i.e., 384 lines / 1536 threads = 0.25 lines per thread. Per-thread write buffering in the cache is infeasible; buffered writes must be committed to global memory some other way.

Hardware TM on GPUs Challenge: SIMT Hardware
On GPUs, the scalar threads of a warp/wavefront execute in lockstep:

    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit

Within one warp of 8 scalar threads, some transactions may commit while others abort and must re-execute. How do aborted and committed threads reconverge?

Goal
We take it as a given that most programmers who attempt lock-based programming on a GPU will give up before they get their application working. Our goal was therefore to find the most efficient way to implement TM on a GPU.

KILO TM
Supports 1000s of concurrent transactions
Transaction-aware SIMT stack
No dependence on a cache coherence protocol
Word-level conflict detection
Captures 59% of fine-grained (FG) lock performance; 128X faster than serialized transaction execution

KILO TM: Design Highlights
Value-based conflict detection:
Self-validation + abort keeps communication simple
No dependence on cache coherence
Speculative validation:
Increases commit parallelism

High-Level GPU Architecture + KILO TM Implementation Overview (figure)

KILO TM: SIMT Core Changes
Software register checkpoint:
Observation: most registers overwritten inside a transaction are not needed again after an abort.
Compiler analysis can identify the few registers that must be checkpointed.
Transaction abort behaves like a do-while loop: the SIMT stack is extended with special entries that track aborted transactions in each warp, which re-execute from TxBegin.

    TxBegin
    LD  r2,[B]     ; r2 is overwritten inside the transaction
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit       ; on abort, control returns to TxBegin
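
A hedged sketch of how a compiler might lower an atomic block into that do-while form; the tx_begin/tx_commit intrinsics and the explicit checkpoint variable are illustrative stand-ins, not the actual ISA.

    extern void tx_begin(void);       /* hypothetical TM intrinsics */
    extern int  tx_commit(void);      /* returns 1 on successful commit */

    void atomic_update(int *x, const int *y, int i) {
        int r2 = 0;                   /* models a register overwritten in the tx */
        int chk_r2;                   /* SW checkpoint: only live, overwritten registers */
        int committed;
        do {
            chk_r2 = r2;              /* software checkpoint at tx_begin */
            tx_begin();
            r2 = y[i] + 1;            /* transaction body overwrites r2 */
            x[i] = r2;                /* transactional store, buffered until commit */
            committed = tx_commit();  /* validation + commit in the commit units */
            if (!committed)
                r2 = chk_r2;          /* abort: restore the checkpoint and retry */
        } while (!committed);
    }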

Transaction-Aware SIMT Stack

    A: t = tid.x;
       if (...) {
    B:   tx_begin;
    C:   x[t%10] = y[t] + 1;
    D:   if (s[t])
    E:     y[t] = 0;
    F:   tx_commit;
    G:   z = y[t];
       }
    H: w = y[t+1];

tx_begin implies a loop, so the stack's normal (N) entries are joined by two special entry types: a retry (R) entry and a transaction (T) entry. At tx_begin, the active mask is copied into an R entry (recording which threads restart at C on abort) and a T entry (the threads currently executing the transaction). Branch divergence inside the transaction (D/E) pushes ordinary entries above the T entry, exactly as outside a transaction. At tx_commit, threads that fail validation (e.g., threads 6 & 7) have their bits copied into the R entry, and the transaction restarts for just those threads; once every thread's transaction has committed, the R and T entries are popped and the warp continues at G with the full active mask.

KILO TM: Value-Based Conflict Detection
Example: global memory holds A=1, B=0. TX1 runs atomic{B=A+1} (TxBegin; LD r1,[A]; ADD r1,r1,1; ST r1,[B]; TxCommit) while TX2 runs atomic{A=B+2} (TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit). Each transaction keeps a read-log and a write-log in its private memory. At commit, a transaction self-validates: it re-reads its read-set from global memory and compares against the logged values; any mismatch means a conflict, and the transaction aborts itself. Here TX1 commits B=2; TX2's validation then sees B=2 instead of its logged B=0, so TX2 aborts, re-executes, and commits A=4.
Self-validation + abort detects only the existence of a conflict, not the identity of the conflicting transaction, so no transaction-to-transaction messages are needed: communication stays simple.
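
A minimal sketch of value-based self-validation as just described; the log layout and the global_mem_read/write helpers are illustrative, and in KILO TM this logic runs in the commit units rather than in shader code.

    typedef struct { unsigned addr; unsigned value; } LogEntry;

    extern unsigned global_mem_read(unsigned addr);              /* hypothetical helpers */
    extern void     global_mem_write(unsigned addr, unsigned v);

    /* A read is still valid iff global memory still holds the logged value.
       A mismatch means some other transaction committed a write there. */
    int self_validate(const LogEntry *read_log, int nr) {
        for (int i = 0; i < nr; i++)
            if (global_mem_read(read_log[i].addr) != read_log[i].value)
                return 0;                 /* conflict detected: abort self */
        return 1;
    }

    /* On success, publish the write-log; on failure, discard it and retry. */
    int try_commit(const LogEntry *read_log, int nr,
                   const LogEntry *write_log, int nw) {
        if (!self_validate(read_log, nr))
            return 0;                     /* transaction re-executes from tx_begin */
        for (int i = 0; i < nw; i++)
            global_mem_write(write_log[i].addr, write_log[i].value);
        return 1;
    }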

Parallel Validation?
If TX1 (atomic{B=A+1}) and TX2 (atomic{A=B+2}) validate and commit concurrently against A=1, B=0, both read-logs still match global memory, so both pass validation and both commit, leaving A=2, B=2. No serial order produces that outcome (TX1 then TX2 gives B=2, A=4; TX2 then TX1 gives A=2, B=3), so naive parallel validation is a data race.

Serialize Validation?
Let transactions validate and commit one at a time at a commit unit in front of global memory.
Benefit #1: no data race.
Benefit #2: no livelock (a generic problem for lazy TMs).
Drawback: it also serializes non-conflicting transactions ("collateral damage"): TX2 stalls while TX1 validates and commits.

Identifying Non-Conflicting Transactions, Step 1: Leverage Parallelism
Global memory is divided into partitions, each with its own commit unit, so transactions whose accesses map to different partitions (e.g., TX1 and TX2) validate and commit in parallel; a transaction that touches multiple partitions (e.g., TX3) occupies the commit unit at each of them.

Solution: Speculative Validation
Key idea: split validation into two parts:
Part 1: check against recently committed transactions.
Part 2: check against concurrently committing transactions.

KILO TM: Speculative Validation
The memory subsystem is deeply pipelined and highly parallel, so each commit unit processes transactions as a pipeline: log transfer, hazard detection, speculative validation, validation wait, finalize outcome, commit. Example: TX1 (R(C), W(D)), TX2 (R(A), W(B)), and TX3 (R(D), W(E)) sit in a commit unit's validation queue at one memory partition.

KILO TM: Speculative Validation
A last-writer-history unit (an address-to-commit-ID lookup table plus a recency bloom filter, with evicted entries falling into the filter) records, for each recently written address, the ID of the last transaction to write it. TX1 (R(C), W(D)) looks up C, finds no recent writer, validates speculatively against global memory, and records W(D) in the history (likewise TX2 records B, TX3 records E). When TX3 (R(D), W(E)) looks up D, it finds TX1, which is still committing, so TX3 stalls until TX1's outcome is known rather than validating against a possibly stale value of D.
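
A rough sketch of that hazard check; the plain direct-mapped table below stands in for the slide's lookup-table-plus-bloom-filter, and all names are illustrative.

    #define HISTORY_SIZE 4096

    typedef struct { unsigned addr; int last_writer_cid; int valid; } HistEntry;
    static HistEntry history[HISTORY_SIZE];   /* stand-in for LWH table + bloom filter */

    /* For each address in a transaction's read-log: did a recent, possibly
       still-committing transaction write it? If so, this transaction must
       wait for that writer's outcome instead of validating speculatively. */
    int hazard_on_read(unsigned addr, int *writer_cid) {
        HistEntry *e = &history[addr % HISTORY_SIZE];
        if (e->valid && e->addr == addr) {
            *writer_cid = e->last_writer_cid;
            return 1;                          /* stall behind the last writer */
        }
        return 0;                              /* validate speculatively */
    }

    /* Each write in a committing transaction's write-log updates the history. */
    void record_write(unsigned addr, int cid) {
        HistEntry *e = &history[addr % HISTORY_SIZE];
        e->addr = addr; e->last_writer_cid = cid; e->valid = 1;
    }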

Log Storage
Transaction logs are stored in the private memory of each thread, located in DRAM and cached in the L1 and L2 caches. Each thread keeps a read-log pointer and a write-log pointer. Because private memory for the threads of a warp is interleaved, one log entry across the warp (e.g., threads T0-T3 logging reads of addresses A, B, C, D with values 3, 4, 9, 7) occupies consecutive physical addresses. A wavefront LD appends an (address, value) pair to each thread's read-log; a wavefront ST appends to each thread's write-log.
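
A minimal device-side sketch of the logging idea, assuming hypothetical per-thread log arrays in private (local) memory; real KILO TM performs this in hardware as part of transactional LD/ST instructions.

    #define LOG_CAP 64

    // Per-thread logs in private (local) memory; CUDA interleaves local memory
    // across the threads of a warp, matching the layout on the slide.
    struct TxLogs {
        unsigned raddr[LOG_CAP]; unsigned rval[LOG_CAP]; int nr;   // read-log
        unsigned waddr[LOG_CAP]; unsigned wval[LOG_CAP]; int nw;   // write-log
    };

    __device__ unsigned tx_load(TxLogs *t, const unsigned *mem, unsigned addr) {
        unsigned v = mem[addr];          // read global memory
        t->raddr[t->nr] = addr;          // append (address, value) to the read-log
        t->rval[t->nr]  = v;
        t->nr++;
        return v;
    }

    __device__ void tx_store(TxLogs *t, unsigned addr, unsigned v) {
        t->waddr[t->nw] = addr;          // buffer the write in the write-log;
        t->wval[t->nw]  = v;             // global memory is untouched until commit
        t->nw++;
    }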

Log Transfer
At commit, each log entry is sent to the commit unit of the memory partition that owns its address, and entries heading to the same partition are grouped into a larger packet. E.g., for a warp's read-log over addresses A, B, C, D: B goes to partition 0, A and C share one packet to partition 1, and D goes to partition 3.
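
A small sketch of that grouping step; the partition hash (simple 128-byte line interleaving) and the packet format are assumptions for illustration, reusing the LogEntry type from the earlier sketch.

    #define NUM_PARTITIONS 6
    #define LINE_BYTES     128
    #define PKT_CAP        32

    typedef struct { unsigned addr; unsigned value; } LogEntry;  /* as before */

    /* Which memory partition owns this address? Real GPUs use a more
       elaborate interleaving hash; this is illustrative. */
    static int partition_of(unsigned addr) {
        return (int)((addr / LINE_BYTES) % NUM_PARTITIONS);
    }

    /* Group a warp's log entries into one packet per destination partition,
       so each commit unit receives a single larger packet. */
    void pack_logs(const LogEntry *log, int n,
                   LogEntry pkt[NUM_PARTITIONS][PKT_CAP],
                   int pkt_len[NUM_PARTITIONS]) {
        for (int p = 0; p < NUM_PARTITIONS; p++) pkt_len[p] = 0;
        for (int i = 0; i < n; i++) {
            int p = partition_of(log[i].addr);
            pkt[p][pkt_len[p]++] = log[i];
        }
    }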

Distributed Commit / Hardware Organization (figure)

ABA Problem?
Classic example: a linked-list-based stack (top -> A -> B -> C -> Null). Thread 0 executes pop():

    while (true) {
        t = top;
        next = t->Next;
        // thread 2 interleaves here: pop A, pop B, push A
        if (atomicCAS(&top, t, next) == t)
            break;  // succeeds!
    }

The CAS succeeds because top holds A again, but it swings top to B, a node that was already popped: the stack is corrupted.

ABA Problem?
atomicCAS protects only a single word, i.e., only part of the data structure: it cannot see that top went from A to other values and back to A. Value-based conflict detection protects all relevant parts of the data structure: the transaction's read-log also contains t->Next, whose value did change during the interleaving, so validation fails and the transaction retries.
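
A hedged sketch of the same pop written as a transaction, in the illustrative tx_begin/tx_commit style used earlier; with value-based validation, the logged reads of top and t->next cover the whole update, so the ABA interleaving above aborts and retries.

    typedef struct Node { struct Node *next; int value; } Node;

    extern void tx_begin(void);   /* hypothetical TM intrinsics, as before */
    extern int  tx_commit(void);

    Node *top;                    /* shared stack top */

    Node *pop_tx(void) {
        Node *t;
        do {
            tx_begin();
            t = top;              /* logged read of top */
            if (t != NULL)
                top = t->next;    /* logged read of t->next, buffered write of top */
        } while (!tx_commit());   /* if top or t->next changed, abort and retry */
        return t;
    }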

Evaluation Methodology
GPGPU-Sim 3.0 (BSD license); detailed timing model with an IPC correlation of 0.93 vs. GT200 hardware; KILO TM modeled with timing-driven memory accesses.
GPU TM applications:
Hash Table (HT-H, HT-L)
Bank Account (ATM)
Cloth Physics (CL)
Barnes-Hut (BH)
CudaCuts (CC)
Data Mining (AP)

GPGPU-Sim 3.0.x running SASS (via decuda) achieves 0.976 IPC correlation on the subset of the CUDA SDK that decuda disassembles correctly. The rest of the data in this talk uses PTX instead of SASS (0.93 correlation). We believe GPGPU-Sim is a reasonable proxy for real hardware.

Performance (vs. Serializing Tx) (figure)

Absolute Performance (IPC) (figure)
TM on the GPU performs well for applications with low contention; it performs poorly under memory divergence, low parallelism, or high conflict rates (which might be tackled through algorithm design/tuning).
CPU vs. GPU:
CC: the fine-grained-lock GPU version is 400X faster than its CPU version.
BH: the fine-grained-lock GPU version is 2.5X faster than its CPU version.

Performance (Exec. Time) (figure)
KILO TM captures 59% of fine-grained lock performance and is 128X faster than serialized transaction execution.

KILO TM Scaling (figure)

Abort-Commit Ratio (figure)
Increasing the number of concurrent transactions increases the probability of conflict. Two possible solutions (future work):
Solution 1: application performance tuning (easier with TM than with fine-grained locks).
Solution 2: transaction scheduling.

Thread Cycle Breakdown
Status of each thread at each cycle. Categories:
TC: in a warp stalled by concurrency control
TO: in a warp committing its transactions
TW: has passed commit, waiting for other threads in the warp to pass
TA: executing an eventually aborted transaction
TU: executing an eventually committed transaction (useful work)
AT: acquiring a lock or performing an atomic operation
BA: waiting at a barrier
NL: doing non-transactional (normal) work

Thread Cycle Breakdown (figure; benchmarks HT-H, HT-L, ATM, CL, BH, CC, AP)

Core Cycle Breakdown
Action performed by each core at each cycle. Categories:
EXEC: issuing a warp for execution
STALL: stalled by a downstream warp
SCRB: all warps blocked by the scoreboard due to data hazards, concurrency control, pending commits (or any combination thereof)
IDLE: none of the warps are ready in the instruction buffer

Core Cycle Breakdown (figure)

Read-Write Buffer Usage (figure)

# In-Flight Buffers (figure)

Implementation Complexity
Logs reside in the private L1 data cache.
Commit unit storage: 5 kB last-writer-history unit, 19 kB transaction status, 32 kB read-set and write-set buffer.
CACTI (40 nm): 0.40 mm^2 per commit unit x 6 memory partitions = 2.4 mm^2, about 0.5% of a 520 mm^2 chip.

Summary
KILO TM:
1000s of concurrent transactions
Value-based conflict detection
Speculative validation for commit parallelism
59% of fine-grained locking performance
0.5% area overhead

Backup Slides

Logical Stage Organization (figure)

Execution Time Breakdown (figure)