Published by Keely Mussey. Modified over 10 years ago.
1
Energy Efficient GPU Transactional Memory via Space-Time Optimizations
Wilson W. L. Fung, Tor M. Aamodt
2
Why TM for GPU?
Simple irregular parallelism on GPU
– nBody, 5M bodies: 1640 s (regular) vs. 5.2 s (irregular)
– Other applications?
Energy Efficient GPU TM via Space-Time Opt. Wilson Fung
3
Why TM for GPU?
TM on GPU:
– Predictable development time
– No deadlock!
– Maintainable code
4
TM for GPU: Energy Overhead
TM = speculative execution
Kilo TM: first hardware TM for GPU
– Simple design for scalability
– 1000s of concurrent transactions
– Scalar transaction management
– Value-based conflict detection
5
Space: Warp-Level Transaction Management
Time: Temporal Conflict Detection (last written time of each memory location, 0x00000000 to 0xFFFFFFFF)
Energy usage: 2X to 1.3X
Speedup: 65%
6
Background: Kilo TM
1000s of concurrent transactions
– Value-based conflict detection: Global Metadata
Special HW to boost validation and commit parallelism
Transaction phases (from the figure):
– Execution: the transaction fills a read log and a write log
– Validation: the read log is compared (=) against global memory
– Commit: the write log is written to global memory
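The execution, validation, and commit phases above can be sketched in software. This is a minimal single-threaded model of value-based conflict detection, not Kilo TM's hardware protocol; the helper names (`run_transaction`, `validate`, `commit`) and the `increment_b` transaction are hypothetical.

```python
def run_transaction(memory, body):
    """Execute `body` speculatively, recording a read log and a write log."""
    read_log, write_log = {}, {}

    def load(addr):
        if addr in write_log:              # a transaction sees its own writes
            return write_log[addr]
        value = memory[addr]
        read_log.setdefault(addr, value)   # remember the first value observed
        return value

    def store(addr, value):
        write_log[addr] = value            # buffered; memory is untouched

    body(load, store)
    return read_log, write_log


def validate(memory, read_log):
    """Value-based conflict detection: every logged value must still match."""
    return all(memory[addr] == value for addr, value in read_log.items())


def commit(memory, write_log):
    """Publish the buffered writes to global memory."""
    memory.update(write_log)


def increment_b(load, store):              # hypothetical transaction body
    store("B", load("B") + 1)


memory = {"B": 3}
read_log, write_log = run_transaction(memory, increment_b)
memory["B"] = 10                           # another transaction commits first
first_try_ok = validate(memory, read_log)  # stale read of B is detected
if not first_try_ok:
    read_log, write_log = run_transaction(memory, increment_b)  # re-execute
if validate(memory, read_log):
    commit(memory, write_log)
```

Validation rereads every location in the read log, which is the "rereading every memory location" cost that the later slides set out to reduce.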
7
Kilo TM Implementation
– SIMT core: TM-aware SIMT stack, TX-log unit, L1 data cache
– Memory partition: commit unit, L2 cache, DRAM
– Commit protocol
8
Efficiency Concerns
Scalar transaction management
– Scalar transaction fits SIMT model
– Simple design
– Poor use of SIMD memory subsystem
Rereading every memory location
– Memory access takes energy
Kilo TM today: 128X speedup over CG-locks, 40% of FG-locks performance, 2X energy usage
9
Inefficiency from Scalar Transaction Management
Kilo TM ignores GPU thread hierarchy
– Excessive control message traffic
– Scalar validation and commit: poor L2 bandwidth utilization
Simplifies HW design, but costs energy
Control messages between each SIMT core and commit unit (CU): Send-Log, CU-Pass/Fail, TX-Outcome, Commit Done
Commit unit to last level cache: 4B accesses over a 32B port
10
Warp Level Transaction Management
Key idea:
– Manage transactions within a warp as a whole
Enables optimizations that exploit spatial locality:
– Aggregate control messages
– Validation and commit coalescing
Challenge: intra-warp conflicts
11
Warp Level Transaction Management: Aggregate Control Messages
– Scalar messages: TX1 to TX4 each exchange their own messages with the commit unit, 12 messages in total
– Aggregated messages: 3 messages for the same four transactions
Control messages contribute up to 40% of interconnection traffic
12
Warp Level Transaction Management: Validation and Commit Coalescing
– Without coalescing: the read and write logs of TX1 to TX4 reach global memory (L2 cache/DRAM) as 4B accesses over a 32B port. Max utility = 4/32 = 12.5%
– With coalescing: coalescing logic merges the log accesses into 32/64/128B accesses over the 32B port
Reduces requests to L2 cache by 40%
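The 4/32 = 12.5% figure can be checked with a small counting model. This is only an illustrative sketch: it assumes distinct 4B word addresses, a 32B port, and one request per aligned block when coalescing; the example addresses and function names are hypothetical and do not describe the actual coalescing logic.

```python
PORT_BYTES = 32   # width of the L2 port shown on the slide
WORD_BYTES = 4    # size of one scalar log access


def scalar_requests(addresses):
    """Without coalescing: one request per 4B word."""
    return len(addresses)


def coalesced_requests(addresses, block_bytes=PORT_BYTES):
    """With coalescing: one request per distinct aligned block touched."""
    return len({addr // block_bytes for addr in addresses})


def port_utilization(addresses, requests):
    """Useful bytes moved divided by bytes the port could have moved."""
    return len(addresses) * WORD_BYTES / (requests * PORT_BYTES)


# Hypothetical warp: four transactions touching adjacent words.
warp_addresses = [0x100, 0x104, 0x108, 0x10C]

n_scalar = scalar_requests(warp_addresses)
n_coalesced = coalesced_requests(warp_addresses)
util_scalar = port_utilization(warp_addresses, n_scalar)
util_coalesced = port_utilization(warp_addresses, n_coalesced)
print(n_scalar, util_scalar, n_coalesced, util_coalesced)
```

The benefit depends entirely on how many transactions in a warp touch the same block, so the gain for a real workload follows from its spatial locality rather than from this toy address pattern.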
13
Intra-Warp Conflict
Potential existence of intra-warp conflict introduces complex corner cases:
TX    Read set   Write set
TX1   X=9        Y=9
TX2   Y=8        Z=8
TX3   Z=7        W=7
TX4   W=6        X=6
Global memory at validation: X = 9, Y = 8, Z = 7, W = 6
All committed (wrong): X = 6, Y = 9, Z = 8, W = 7
Correct outcomes: X = 9, Y = 9, Z = 7, W = 7 OR X = 6, Y = 8, Z = 8, W = 6
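The corner case above can be checked by brute force. The sketch below takes the read and write sets from the slide and asks whether some serial order exists in which every transaction's logged reads match memory at its turn; the `serializable` helper is a hypothetical name and is not part of Kilo TM.

```python
from itertools import permutations

INITIAL = {"X": 9, "Y": 8, "Z": 7, "W": 6}

# (read set, write set) per transaction, as given on the slide.
LOGS = {
    "TX1": ({"X": 9}, {"Y": 9}),
    "TX2": ({"Y": 8}, {"Z": 8}),
    "TX3": ({"Z": 7}, {"W": 7}),
    "TX4": ({"W": 6}, {"X": 6}),
}


def serializable(tx_ids):
    """Return the final memory of a valid serial order, or None if none exists."""
    for order in permutations(tx_ids):
        memory = dict(INITIAL)
        valid = True
        for tx in order:
            reads, writes = LOGS[tx]
            if any(memory[addr] != value for addr, value in reads.items()):
                valid = False          # this order contradicts a logged read
                break
            memory.update(writes)
        if valid:
            return memory
    return None


# Committing everything applies all four write sets to the initial memory.
all_committed = dict(INITIAL)
for _, writes in LOGS.values():
    all_committed.update(writes)

print(all_committed)                         # the "wrong" state on the slide
print(serializable(list(LOGS)))              # no serial order explains it
print(serializable(["TX1", "TX3"]))
print(serializable(["TX2", "TX4"]))
```

Each transaction passes validation on its own because its read set matches the initial memory, yet the four form a cycle (each writes what the next one reads), so only a subset can commit together.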
14
Intra-Warp Conflict Resolution
Kilo TM stores read-set and write-set in logs
– Compact, fits in caches
– Inefficient for search
Naive, pair-wise resolution too slow
– T threads/warp, R+W words/thread
– O(T^2 x (R+W)^2), T ≥ 32
– O((R+W)^2) comparisons for each pair of transactions (TX1 to TX4 in the figure)
Stages shown: Execution, Validation, Commit, with Intra-Warp Conflict Resolution added
15
Intra-Warp Conflict Resolution: 2-Phase Parallel Conflict Resolution
Insight: fixed priority for conflict resolution enables parallel resolution in O(R+W)
Two phases:
– Ownership table construction
– Parallel match
16
Intra-Warp Conflict Resolution: 2-Phase Parallel Conflict Resolution
Insight: fixed priority for conflict resolution enables parallel resolution
Phase 1: ownership table construction
– Transactions TX1 to TX4 have a fixed priority, from high to low
– Each transaction hashes the addresses in its write log (WLog) into the ownership table with H(Addr)
– Each entry holds the ID of the highest priority TX that has written to H(Addr)
– The table is stored in shared memory (on-chip per-core scratchpad)
17
Intra-Warp Conflict Resolution: 2-Phase Parallel Conflict Resolution
Insight: fixed priority for conflict resolution enables parallel resolution
Phase 1: ownership table construction from the write logs (WLog), O(W)
Phase 2: parallel match, O(R+W)
– Each TX checks its read log (RLog) and write log (WLog) against the ownership table
– Read log: Owner ID < My ID means abort (e.g. for TX3, Owner ID = 2: abort)
– Write log: Owner ID != My ID means abort (e.g. for TX3, Owner ID = 3: pass)
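The two phases can be sketched as follows. This is a sequential sketch under stated assumptions, not the hardware or shared-memory implementation: a lower transaction ID means higher priority (consistent with the "Owner ID < My ID" rule), the hash `h` and `TABLE_SIZE` are hypothetical, and the addresses reuse the X, Y, Z, W example from the intra-warp conflict slide.

```python
TABLE_SIZE = 64                      # hypothetical ownership-table size
X, Y, Z, W = 0, 4, 8, 12             # hypothetical word addresses


def h(addr):
    """Hypothetical hash from a word address to an ownership-table slot."""
    return (addr // 4) % TABLE_SIZE


def build_ownership_table(write_logs):
    """Phase 1: each slot keeps the highest-priority (lowest) writer ID."""
    table = [None] * TABLE_SIZE
    for tx_id, wlog in write_logs.items():
        for addr in wlog:
            slot = h(addr)
            if table[slot] is None or tx_id < table[slot]:
                table[slot] = tx_id
    return table


def parallel_match(tx_id, rlog, wlog, table):
    """Phase 2: a TX checks only its own logs, so all TXs can run this at once."""
    for addr in rlog:
        owner = table[h(addr)]
        if owner is not None and owner < tx_id:
            return False             # a higher-priority TX wrote what I read
    for addr in wlog:
        if table[h(addr)] != tx_id:
            return False             # a higher-priority TX owns what I write
    return True


read_logs = {1: [X], 2: [Y], 3: [Z], 4: [W]}
write_logs = {1: [Y], 2: [Z], 3: [W], 4: [X]}

table = build_ownership_table(write_logs)
survivors = [tx for tx in sorted(read_logs)
             if parallel_match(tx, read_logs[tx], write_logs[tx], table)]
print(survivors)
```

Under these rules only TX1 survives this example. That is a conservative but consistent choice: TX3 aborts because TX2 owns Z, even though TX2 itself aborts. Hash aliasing in the table can likewise only add aborts, never hide a conflict.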
18
Warp Level Transaction Management Made Practical
Enables optimizations that exploit spatial locality:
– Aggregate control messages
– Validation and commit coalescing
Challenge: intra-warp conflicts
19
Temporal Conflict Detection
Motivation: skip value-based conflict detection for conflict-free read-only transactions
– 40% and 85% of the transactions in two of our workloads
Read-only transactions still need checking:
– TX1: if (C == 0) B = B + 1;   (data-dependent control flow)
– TX2: int K; K = X + Y;   (consistent view of memory)
20
Temporal Conflict Detection
– Load [X]: global memory (L2 cache/DRAM) returns the data plus LastWrittenTime(X)
– Store [X]: global memory (L2 cache/DRAM) updates the last written time of X
– Each transaction records its start time
If LastWrittenTime(X) < StartTime: pass
Otherwise: conflict detected
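The check above can be modelled with a logical clock. This is a minimal sketch, assuming a single global counter as the time source and exact per-address timestamps; the class and method names are hypothetical, and the real design keeps approximate timestamps in hardware.

```python
class TimestampedMemory:
    """Global memory that remembers when each address was last written."""

    def __init__(self):
        self.clock = 0
        self.data = {}
        self.last_written = {}

    def tick(self):
        self.clock += 1
        return self.clock

    def store(self, addr, value):
        self.data[addr] = value
        self.last_written[addr] = self.tick()

    def load(self, addr):
        """Return the data together with LastWrittenTime(addr)."""
        return self.data.get(addr, 0), self.last_written.get(addr, 0)


class ReadOnlyTransaction:
    def __init__(self, memory):
        self.memory = memory
        self.start_time = memory.tick()
        self.conflict = False

    def load(self, addr):
        value, written_at = self.memory.load(addr)
        if not written_at < self.start_time:   # written after we started
            self.conflict = True
        return value


mem = TimestampedMemory()
mem.store("A", 1)
mem.store("B", 2)

tx1 = ReadOnlyTransaction(mem)     # both values predate the start time
tx1.load("A")
mem.store("A", 5)                  # a later store to [A] does not matter
tx1.load("B")

tx2 = ReadOnlyTransaction(mem)
tx2.load("A")
mem.store("B", 7)                  # [B] changes between the two loads
tx2.load("B")

print(tx1.conflict, tx2.conflict)
```

A transaction that finishes with no conflict flag saw only values that were all live at its start time, so it needs no value-based validation pass.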
21
Temporal Conflict Detection
Figure (timeline): TX1 starts, TX1 LD [A], TX1 LD [B], with ST [A] and ST [B] issued by other threads
– Bars mark the lifetime of [A] loaded by TX1 and the lifetime of [B] loaded by TX1
– Effective instantaneous execution time for TX1 w.r.t. other threads: a point in time at which both loaded values are live
22
Temporal Conflict Detection
Figure (timeline): TX1 starts, TX1 LD [A], TX1 LD [B], with ST [A] and ST [B] issued by other threads
– Figure labels: lifetime of [A] loaded by TX1; [B] loaded by TX2
– The value loaded by LD [A] and the value loaded by LD [B] cannot coexist at any point in time: a detected conflict.
23
Temporal Conflict Detection Implementation
Hardware shown: SIMT cores and memory partitions
– Start time table
– Last written time table: a 16kB recency Bloom filter that maps H(Addr) to a time
– Approximate but conservative
– Aliasing two very old stores is OK
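One way to read "approximate but conservative" is a filter that may report a later write time than the true one but never an earlier one. The sketch below is an assumption about how such a recency Bloom filter could behave, not the design from the slides: the number of ways, the table sizes, and the hash function are all hypothetical.

```python
class RecencyBloomFilter:
    """Approximate last-written-time table: lookups never under-report."""

    def __init__(self, entries_per_way=256, ways=4):
        self.entries_per_way = entries_per_way
        self.ways = [[0] * entries_per_way for _ in range(ways)]

    def _index(self, way, addr):
        # Hypothetical hash: a different odd multiplier per way.
        return ((addr >> 2) * (2 * way + 3) + way) % self.entries_per_way

    def record_store(self, addr, time):
        for way, table in enumerate(self.ways):
            i = self._index(way, addr)
            table[i] = max(table[i], time)      # keep the most recent time

    def last_written_time(self, addr):
        # Every way holds a time >= the true one, so the minimum is the
        # tightest bound that is still safe.
        return min(table[self._index(way, addr)]
                   for way, table in enumerate(self.ways))


# A deliberately tiny filter so that many addresses alias.
small = RecencyBloomFilter(entries_per_way=8, ways=2)
exact = {}
for time, addr in enumerate(range(0, 400, 4), start=1):
    small.record_store(addr, time)
    exact[addr] = time

never_under_reports = all(
    small.last_written_time(addr) >= exact[addr] for addr in exact
)
fresh = RecencyBloomFilter()
print(never_under_reports, fresh.last_written_time(0x40))
```

Aliasing can only push a reported time later, which may flag a false conflict. If the aliased stores are both much older than any running transaction's start time, the check still passes, which matches the slide's remark that aliasing two very old stores is OK.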
24
Evaluation
GPGPU-Sim 3.2.1
– Detailed: IPC correlation of 0.90 vs. Fermi GPU model
Energy overhead of Kilo TM
– Extra hardware: CACTI for access energy of major SRAM arrays
– Extra activity via GPUWattch
– Increased execution time (more leakage)
GPU TM applications
– HT-[H/M/L]: hash table construction
– ATM: bank transactions
– BH-[H/L]: Barnes-Hut (N-body)
– CL/CLto: cloth simulation
– CC: maxflow/mincut graph
– AP: data mining
25
Results
– Energy usage: 2X to 1.3X
– Performance: 40% to 66% of FG-lock performance
– Low contention workload: Kilo TM w/ SW optimizations on par with FG lock
26
Summary
Two enhancements for Kilo TM
– Warp Level Transaction Management: exploit spatial locality in thread hierarchy
– Temporal Conflict Detection: silent commit of read-only transactions
Reduce performance and energy overhead of Kilo TM
Low contention workload: Kilo TM w/ optimizations on par with FG lock
Questions?
27
BACKUP SLIDES
28
Normalized Performance
– 40% to 66% of FG-locks performance
– Low contention workload: Kilo TM w/ SW optimizations on par with FG lock
29
Normalized Energy Usage
30
Intra-Warp Conflict Resolution: 2-Phase Parallel Conflict Resolution
31
2PCR vs. SCR
32
Spatial Locality among Transactions
33
ABA Problem?
Classic example: linked list based stack
Thread 0, pop():
    while (true) {
        t = top;
        next = t->Next;
        // thread 2: pop A, pop B, push A
        if (atomicCAS(&top, t, next) == t)
            break; // succeeds!
    }
Figure: the stack starts as top -> A -> B -> C -> Null, and thread 0 reads t = A, next = B. After thread 2 runs, the stack is top -> A -> C -> Null. Thread 0's CAS still succeeds and sets top to B, which has already been popped.
34
ABA Problem?
atomicCAS protects only a single word
– Only part of the data structure
Value-based conflict detection protects all relevant parts of the data structure
    while (true) {
        t = top;
        next = t->Next;
        if (atomicCAS(&top, t, next) == t)
            break; // succeeds!
    }
Figure: stack top -> A -> B -> C -> Null
35
ABA Problem?
If every memory input value is identical, the transaction code should generate the same output.
– No point in re-executing the transaction for an ABA event.
Example: TX1: if (C == 0) B = B + 1;
Timeline:
– TX1 LD [C], TX1 LD [B] (B = 3)
– TX2 (B = B - 2) commits: B = 1
– TX3 (B = B + 2) commits: B = 3
– TX1 validates: pass
– TX1 commits: B = 4
See Tech. Report: http://www.ece.ubc.ca/~aamodt/papers/wwlfung.tr2012.pdf
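The timeline above can be replayed in a few lines. This is a plain sequential sketch of value-based validation with hypothetical variable names; it only checks that the ABA change to B is harmless for TX1.

```python
memory = {"C": 0, "B": 3}

# TX1 executes speculatively: if (C == 0) B = B + 1;
tx1_read_log = {"C": memory["C"], "B": memory["B"]}
tx1_write_log = {"B": tx1_read_log["B"] + 1} if tx1_read_log["C"] == 0 else {}

# Two other transactions commit while TX1 is still in flight.
memory["B"] = memory["B"] - 2      # TX2 commits
b_after_tx2 = memory["B"]
memory["B"] = memory["B"] + 2      # TX3 commits, B is back to its old value

# Value-based validation: compare the logged values with current memory.
tx1_passes = all(memory[addr] == value for addr, value in tx1_read_log.items())
if tx1_passes:
    memory.update(tx1_write_log)   # TX1 commits

print(b_after_tx2, tx1_passes, memory["B"])
```

The final value of B is what the serial order TX2, TX3, TX1 would produce, so accepting TX1 without re-execution is correct here.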