Hardware Transactional Memory for GPU Architectures. Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt. University of British Columbia.

Presentation transcript:

Hardware Transactional Memory for GPU Architectures
Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
In Proc. ACM/IEEE Int'l Symp. Microarchitecture (MICRO-44)

Slide 2: Motivation
Lifetime of GPU application development
[Figure: functionality and performance plotted against development time, for fine-grained locking and for transactional memory]
E.g. N-body with 5M bodies:
- CUDA SDK: O(n^2), 1640 s (barrier)
- Barnes Hut: O(n log n), 5.2 s (locks)

Slide 3: Are TM and GPUs Incompatible?
GPUs are different from multi-core CPUs:
- 1000s of concurrent scalar threads
- Challenges (from the TM perspective)
Our solution: KILO TM, a hardware TM for GPUs

Slide 4: Hardware TM for GPUs, Challenge #1: SIMD Hardware
On GPUs, scalar threads in a warp/wavefront execute in lockstep.
Example transaction: ... TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit ...
[Figure: a warp with 4 scalar threads (T0 to T3); some threads are committed and others aborted. Branch divergence!]

Slide 5: KILO TM, Solution to Challenge #1: SIMD Hardware
Transaction abort:
- Like a loop
- Extend SIMT stack
Example transaction: ... TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit ...
[Figure: the transaction code annotated with "Abort"]
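The "abort is like a loop" idea can be sketched in software. The model below is only an illustration under stated assumptions: the function name `run_warp_transaction`, the `commit_ok` callback, and the use of a Python set as the active mask are all hypothetical, and the slide does not say how KILO TM actually lays out its SIMT stack entries.

```python
def run_warp_transaction(num_threads, commit_ok):
    """Model one warp running a transaction under an active mask.

    commit_ok(thread_id, attempt) is a hypothetical callback that says
    whether that thread's transaction commits on the given attempt.
    Returns the active mask used on each pass over the transaction body.
    """
    active = set(range(num_threads))  # every thread is active at TxBegin
    passes = []
    attempt = 0
    while active:  # an abort acts like a loop back-edge to TxBegin
        passes.append(sorted(active))
        # All active threads run the body in lockstep (omitted here),
        # then each one either commits or aborts at TxCommit.
        aborted = {t for t in active if not commit_ok(t, attempt)}
        active = aborted  # committed threads wait; aborted threads retry
        attempt += 1
    return passes


# Threads 0 and 2 commit on the first pass, threads 1 and 3 on the second.
passes = run_warp_transaction(4, lambda t, attempt: attempt >= t % 2)
print(passes)
```

Only the aborted threads stay in the mask for the next pass, which is the same bookkeeping a SIMT stack already performs for a divergent loop.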

Slide 6: Hardware TM for GPUs, Challenge #2: Transaction Rollback
CPU core: register file with 10s of registers. The register file is checkpointed for the TX and used on TX abort.
GPU core (SM): register file with 32k registers, shared across warps. 2MB total on-chip storage.
Checkpoint?

Slide 7: KILO TM, Solution to Challenge #2: Transaction Rollback
SW register checkpoint:
- Most TX: registers overwritten at first use
- TX in Barnes Hut: checkpoint 2 registers
Example transaction: TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit
[Figure: the transaction code annotated with "Abort" and "Overwritten"]
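One way to see why most transactions need no register checkpoint is a small liveness check. This is a sketch under assumptions, not the compiler pass used for KILO TM: `registers_to_checkpoint` is a hypothetical helper, it handles straight-line code only, and each instruction is written as a (destination registers, source registers) pair.

```python
def registers_to_checkpoint(instructions):
    """Return the registers that must be saved before TxBegin.

    instructions: list of (dest_regs, src_regs) tuples in program order.
    A register needs a checkpoint only if the transaction reads the value
    it held before TxBegin and also overwrites it, because a retry would
    otherwise start from the clobbered value.
    """
    written, live_in, need = set(), set(), set()
    for dests, srcs in instructions:
        for reg in srcs:
            if reg not in written:
                live_in.add(reg)  # value was produced before TxBegin
        for reg in dests:
            if reg in live_in:
                need.add(reg)     # pre-transaction value gets clobbered
            written.add(reg)
    return need


# The slide's transaction: LD r2,[B]; ADD r2,r2,2; ST r2,[A]
slide_tx = [(["r2"], []), (["r2"], ["r2"]), ([], ["r2"])]
print(registers_to_checkpoint(slide_tx))            # r2 is overwritten at first use

# Hypothetical counter-example: ADD r1,r1,1 reads r1 before overwriting it.
print(registers_to_checkpoint([(["r1"], ["r1"])]))
```

In the slide's transaction r2 is written by the load before anything reads it, so a retry recomputes it from memory and nothing has to be saved.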

Slide 8: Hardware TM for GPUs, Challenge #3: Conflict Detection
Existing HTMs use the cache coherence protocol:
- Not available on GPUs
- No private data cache per thread
Signatures?
- 1024-bit / thread
- 3.8MB / 30k threads
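The signature storage figure can be checked with a few lines of arithmetic. The sketch assumes "30k" means 30,000 threads and that MB means 10^6 bytes; neither is stated on the slide.

```python
bits_per_thread = 1024
threads = 30_000  # assumption: "30k" read as 30,000

total_bytes = bits_per_thread * threads // 8
total_mb = total_bytes / 1e6  # assumption: 1 MB = 10**6 bytes

print(total_bytes, round(total_mb, 1))
```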

Slide 9: Hardware TM for GPUs, Challenge #4: Write Buffer
[Figure: a GPU core (SM) with its L1 data cache, warps and threads]
Fermi's L1 data cache (48kB) = 384 x 128B lines
Problem: 384 lines / 1536 threads < 1 line per thread!
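The cache-capacity argument works out as follows, assuming 1 kB = 1024 bytes (the slide does not say which convention it uses):

```python
cache_bytes = 48 * 1024   # Fermi L1 data cache, 48kB
line_bytes = 128
threads_per_sm = 1536

lines = cache_bytes // line_bytes
lines_per_thread = lines / threads_per_sm

print(lines, lines_per_thread)
```

A quarter of a cache line per thread leaves no room to buffer each transaction's writes in the L1.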

Slide 10: KILO TM: Value-Based Conflict Detection
TX1: atomic {B=A+1}
  TxBegin; LD r1,[A]; ADD r1,r1,1; ST r1,[B]; TxCommit
TX2: atomic {A=B+2}
  TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit
Each transaction keeps a read-log and a write-log in private memory.
[Animation: global memory starts at A=1, B=0. TX1 logs a read of A=1 and a write of B=2; TX2 logs a read of B=0 and a write of A=2. Once TX1 commits, global memory holds A=1, B=2, so the B=0 in TX2's read-log no longer matches.]
Self-validation + abort:
- Only detects existence of conflict (not identity)
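A minimal software model of value-based conflict detection, using the slide's two transactions. The `Transaction` class and its method names are hypothetical; the sketch only mirrors the read-log, write-log and self-validation steps named on the slide, with global memory as a Python dict.

```python
class Transaction:
    """Buffers writes and logs read values, then validates by value."""

    def __init__(self, memory):
        self.memory = memory
        self.read_log = {}    # address -> value first observed
        self.write_log = {}   # address -> value to publish at commit

    def load(self, addr):
        if addr in self.write_log:       # read our own buffered write
            return self.write_log[addr]
        value = self.memory[addr]
        self.read_log.setdefault(addr, value)
        return value

    def store(self, addr, value):
        self.write_log[addr] = value     # invisible to others until commit

    def try_commit(self):
        # Self-validation: re-read every logged address and compare values.
        if any(self.memory[a] != v for a, v in self.read_log.items()):
            return False                 # a conflict exists; abort
        self.memory.update(self.write_log)
        return True


memory = {"A": 1, "B": 0}
tx1, tx2 = Transaction(memory), Transaction(memory)
tx1.store("B", tx1.load("A") + 1)        # TX1: atomic {B = A + 1}
tx2.store("A", tx2.load("B") + 2)        # TX2: atomic {A = B + 2}

tx1_ok = tx1.try_commit()                # commits, B becomes 2
tx2_ok = tx2.try_commit()                # fails: logged B=0, memory has B=2

retry = Transaction(memory)              # TX2 re-executes after the abort
retry.store("A", retry.load("B") + 2)
retry_ok = retry.try_commit()
print(tx1_ok, tx2_ok, retry_ok, memory)
```

The failed validation tells TX2 that some conflict occurred but not which transaction caused it, which matches the "existence, not identity" note.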

Slide 11: Parallel Validation?
TX1: atomic {B=A+1}; TX2: atomic {A=B+2}. Each keeps a read-log and a write-log in private memory.
[Animation: global memory starts at A=1, B=0. Both transactions validate in parallel, then write B=2 and A=2, leaving A=2, B=2.]
Serial outcomes:
- Tx1 then Tx2: A=4, B=2
- Tx2 then Tx1: A=2, B=3
Data race!?!

Slide 12: Serialize Validation?
[Figure: a commit unit in front of global memory; timeline in which TX1 performs V + C while TX2 stalls, then TX2 performs V + C]
V = Validation, C = Commit
Benefit #1: No data race
Benefit #2: No livelock
Drawback: Serializes non-conflicting transactions ("collateral damage")
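The difference between validating in parallel and validating one transaction at a time can be reproduced with the two transactions from the preceding slides. This is a sketch: `run_tx`, `validate` and `serial` are hypothetical helpers, and global memory is a plain dict.

```python
def run_tx(memory, read_addr, write_addr, delta):
    """Speculatively run atomic {write = read + delta}; return its logs."""
    value = memory[read_addr]
    return {read_addr: value}, {write_addr: value + delta}


def validate(memory, read_log):
    return all(memory[a] == v for a, v in read_log.items())


def serial(order):
    """Reference result of running the transactions one after another."""
    mem = {"A": 1, "B": 0}
    for read_addr, write_addr, delta in order:
        mem[write_addr] = mem[read_addr] + delta
    return mem


TX1, TX2 = ("A", "B", 1), ("B", "A", 2)
serial_outcomes = [serial([TX1, TX2]), serial([TX2, TX1])]

# Parallel validation: both validate before either one commits.
mem = {"A": 1, "B": 0}
logs = [run_tx(mem, *TX1), run_tx(mem, *TX2)]
verdicts = [validate(mem, read_log) for read_log, _ in logs]
for ok, (_, write_log) in zip(verdicts, logs):
    if ok:
        mem.update(write_log)
parallel_result = mem

# Serialized validation: validate and commit one transaction at a time.
mem = {"A": 1, "B": 0}
logs = [run_tx(mem, *TX1), run_tx(mem, *TX2)]
committed = []
for name, (read_log, write_log) in zip(["TX1", "TX2"], logs):
    if validate(mem, read_log):
        mem.update(write_log)
        committed.append(name)
serialized_result = mem

print(parallel_result, serial_outcomes, committed, serialized_result)
```

With parallel validation both transactions pass and the final state matches neither serial order. With serialized validation TX2 sees TX1's committed write and aborts, so it can re-execute against the new value.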

Slide 13: Solution: Speculative Validation
Key idea: split conflict detection into two parts:
1. Recently committed TX, in parallel
2. Concurrently committing TX, in commit order
- Approximate RS Conflict Rare
- Good commit parallelism
[Figure: a commit unit in front of global memory; timeline of TX1, TX2 and TX3, each performing V+C, with a stall]
V = Validation, C = Commit

Slide 14: KILO TM Implementation
[Figure: GPU architecture with the KILO TM components highlighted: SIMT stacks, commit unit, TX log unit]
Minimal modification to existing GPU arch.

Slide 15: Evaluation Methodology
GPGPU-Sim 3.0 (BSD license):
- Detailed: IPC correlation of 0.93 vs GT 200
- KILO TM (timing-driven memory accesses)
GPU TM applications:
- Hash Table (HT-H, HT-L)
- Bank Account (ATM)
- Cloth Physics (CL)
- Barnes Hut (BH)
- CudaCuts (CC)
- Data Mining (AP)

Slide 16: Performance (vs. Serializing TX)
Serializing TX ≈ coarse-grained locks
[Chart: performance relative to serializing TX; higher is better]

Slide 17: Performance (Exec. Time)
[Chart: normalized exec. time for HT-H, HT-L, ATM, CL, BH, CC and AP, comparing Ideal TM, KILO TM and FG Lock; lower is better]
Captures 59% of FG lock performance

Slide 18: Implementation Complexity
Logs in private L1 data cache
Commit unit:
- 5kB last writer history unit
- 19kB transaction status
- 32kB read-set and write-set buffer
CACTI 40nm:
- 0.40mm^2 x 6 memory partitions
- 0.5% of 520mm^2
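The area overhead follows from the slide's numbers. The per-commit-unit storage total below is derived here rather than quoted from the slide.

```python
area_per_partition_mm2 = 0.40
memory_partitions = 6
die_area_mm2 = 520

total_area_mm2 = area_per_partition_mm2 * memory_partitions
overhead_percent = 100 * total_area_mm2 / die_area_mm2

# Derived, not stated on the slide: storage listed for one commit unit.
commit_unit_storage_kb = 5 + 19 + 32

print(round(total_area_mm2, 2), round(overhead_percent, 2), commit_unit_storage_kb)
```

About 2.4mm^2 out of 520mm^2 is roughly 0.46%, which the slide rounds to 0.5%.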

Slide 19: Summary
KILO TM: hardware TM for GPUs
- 1000s of concurrent scalar TXs
- Handles scalar TX abort
- No cache coherence protocol dependency
- Word-level conflict detection
- Unbounded transaction
- 59% of fine-grained locking performance; 128X faster than serializing TX execution
- 0.5% area overhead
Questions?

Slide 20: Backup Slides

Slide 21: ABA Problem?
Classic example: linked-list-based stack
Thread 0, pop():
  while (true) {
    t = top;
    next = t->Next;
    // thread 2: pop A, pop B, push A
    if (atomicCAS(&top, t, next) == t) break; // succeeds!
  }
[Figure: stack states as thread 2 runs, including top -> A -> B -> C -> Null, top -> B -> C -> Null and top -> A -> C -> Null, while thread 0 still holds t = A and next = B]

Slide 22: ABA Problem?
atomicCAS protects only a single word:
- Only part of the data structure
Value-based conflict detection protects all relevant parts of the data structure.
[Figure: stack top -> A -> B -> C -> Null]
  while (true) {
    t = top;
    next = t->Next;
    if (atomicCAS(&top, t, next) == t) break; // succeeds!
  }
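The ABA scenario from the backup slides can be replayed in a few lines. This is a sketch under assumptions: the stack is stored as named words in a dict, `atomic_cas` is a single-threaded stand-in for CUDA's atomicCAS, and the interleaving of thread 2 is written out by hand.

```python
def atomic_cas(memory, addr, expected, new):
    """Compare-and-swap on one word; returns the old value."""
    old = memory[addr]
    if old == expected:
        memory[addr] = new
    return old


# Stack top -> A -> B -> C -> None, one dict entry per memory word.
memory = {"top": "A", "A.next": "B", "B.next": "C", "C.next": None}

# Thread 0 begins pop() and logs every word it reads.
read_log = {}
t = memory["top"]
read_log["top"] = t
next_node = memory[t + ".next"]
read_log[t + ".next"] = next_node

# Thread 2 interleaves: pop A, pop B, push A.
memory["top"] = memory["A.next"]   # pop A: top -> B
memory["top"] = memory["B.next"]   # pop B: top -> C
memory["A.next"] = memory["top"]   # push A: A now points at C
memory["top"] = "A"

# Value-based validation checks both logged words, not just top.
validation_passed = all(memory[a] == v for a, v in read_log.items())

# The CAS only compares top, which is "A" again, so it succeeds.
cas_memory = dict(memory)
cas_succeeded = atomic_cas(cas_memory, "top", t, next_node) == t

print(cas_succeeded, cas_memory["top"], validation_passed)
```

The CAS leaves top pointing at B, a node that was already popped, while validation fails because the logged value of A's next pointer (B) no longer matches memory (C).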