(1) Register File Organization
©Sudhakar Yalamanchili unless otherwise noted
(2) Objective
- Understand the organization of the large register files used in GPUs
- Identify the performance bottlenecks and opportunities for optimization in accessing the register file
(3) Reading
- S. Liu et al., "Operand Collector Architecture," US Patent 7,834,881 (perspective of a lane)
- J. H. Choquette et al., "Methods and Apparatus for Source Operand Caching," US Patent 8,639,882 (perspective of instruction scheduling)
- GPGPU-Sim Manual: http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#Introduction
(4) Register File Access: Recap
[Figure: SIMT pipeline (I-Fetch, Decode, I-Buffer of pending warps, Issue, register file access, scalar pipelines, D-Cache, Writeback) in front of the register file organization: 16 single-ported 1024-bit banks (banks 0-15) behind an arbiter and crossbar, feeding operand collectors (OC) and dispatch units (DU) for the ALUs, L/S, and SFU]
(5) The SM Register File
- NVIDIA Fermi: 128 KB/SM, 2 MB per device
- Throughput-optimized design
- 32 threads/warp; up to 96 operands/warp/cycle (3 source operands per instruction x 32 lanes)
- Organization?
[Figure: warps accessing the Main Register File (MRF) through a crossbar that feeds the SP, LS, and SF units]
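As a sanity check on these numbers, here is a minimal back-of-the-envelope sketch (assuming 4-byte registers and the FMA worst case of 3 source operands per instruction; all other constants come from the slide):

```cpp
// Back-of-the-envelope sketch of Fermi-era register file sizing.
#include <cstdio>

int main() {
    const int regfile_bytes = 128 * 1024;   // 128 KB per SM
    const int reg_bytes     = 4;            // 32-bit registers (assumption)
    const int lanes         = 32;           // threads per warp
    const int src_operands  = 3;            // e.g., FMA: d = a*b + c

    printf("registers/SM        : %d\n", regfile_bytes / reg_bytes);            // 32768
    printf("operands/warp/cycle : %d\n", src_operands * lanes);                 // 96
    printf("operand BW/SM/cycle : %d bytes\n", src_operands * lanes * reg_bytes); // 384
}
```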
(6) Multi-ported Register Files
- In a multi-ported register file organization, area and delay grow with the number of ports
- Instead, use multiple banks, each with a single read port and a single write port, to emulate multiple ports
(7) Multiple Banks
- 1R/1W: a single read port and a single write port per bank
- Each access to a register bank produces the same-named register for every lane
- Multiple banks can be accessed concurrently
[Figure: bank organization; each bank holds registers R0-R63, and each 1024-bit row is one warp-wide register (32 lanes x 32 bits)]
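A minimal C++ sketch of a banked, single-read-port register file; the 16-bank/1024-bit parameters match the figure, but the skewed (warp + register) mapping and 64-row bank depth are illustrative assumptions, not NVIDIA's actual design:

```cpp
// Sketch of a banked register file with one read port per bank.
#include <array>
#include <cstdint>
#include <vector>

constexpr int kBanks = 16;
constexpr int kLanes = 32;                     // one 32-bit value per lane
using WarpRow = std::array<uint32_t, kLanes>;  // a 1024-bit bank row

struct Bank {
    std::vector<WarpRow> rows;                 // e.g., R0..R63 (assumed depth)
    bool read_busy = false;                    // single read port: 1 access/cycle
    Bank() : rows(64) {}
};

struct BankedRF {
    std::array<Bank, kBanks> banks;

    static int bank_of(int warp, int reg) {
        return (warp + reg) % kBanks;          // simple skewed mapping (assumption)
    }
    // Returns false on a port conflict; the caller must retry next cycle.
    bool read(int warp, int reg, WarpRow& out) {
        Bank& b = banks[bank_of(warp, reg)];
        if (b.read_busy) return false;         // bank conflict this cycle
        b.read_busy = true;
        out = b.rows[reg % 64];
        return true;
    }
    void end_cycle() { for (auto& b : banks) b.read_busy = false; }
};
```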
(8) Thread Register Allocation
- The operands (registers) of a thread can be mapped across register banks in different ways: thin, fat, and mixed allocation
- Skewed allocation
- Goal: maximum bandwidth (see the sketch below)
[Figure: fat allocation keeps a thread's registers together in a bank (T1-T4, one thread per bank); thin allocation stripes each thread's registers across banks 0-3]
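The allocation styles differ only in the bank-index function, so a hedged sketch is just three mappings (the function names and the 4-bank setup are mine):

```cpp
// Sketch of thin vs. fat register-to-bank mappings.
// "reg" is a thread's architectural register index, "tid" its thread ID.
constexpr int kBanks = 4;

// Fat: all of a thread's registers live in one bank (chosen by thread ID).
int fat_bank(int tid, int /*reg*/) { return tid % kBanks; }

// Thin: each thread's registers are striped across all banks.
int thin_bank(int /*tid*/, int reg) { return reg % kBanks; }

// Skewed thin: stripe, but offset by thread ID so different threads
// touching the same register index hit different banks.
int skewed_bank(int tid, int reg) { return (tid + reg) % kBanks; }
```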
(9) Register File Organization
- Example: a 128 KB register file per SM with 1536 threads/SM (48 warps/SM) gives about 21 registers/thread
- Why and when do bank conflicts occur?
- Operand access for a warp may take several cycles, so we need a way to collect operands!
[Figure: single-ported register banks behind a 1024-bit crossbar and arbiter]
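The arithmetic behind the example, plus one concrete way a bank conflict arises: two source operands of the same instruction land in the same single-ported bank, so their reads must serialize. A small sketch under the same illustrative skewed mapping as before:

```cpp
#include <cstdio>

int main() {
    const int regfile_regs = (128 * 1024) / 4;   // 32768 registers
    const int threads      = 1536;
    printf("registers/thread: %d\n", regfile_regs / threads);   // ~21

    // Conflict case: two operands of one instruction map to one bank.
    const int kBanks = 16;
    auto bank_of = [&](int warp, int reg) { return (warp + reg) % kBanks; };
    int w = 3, r1 = 2, r2 = 18;                  // r2 - r1 == 16 == kBanks
    if (bank_of(w, r1) == bank_of(w, r2))
        printf("r%d and r%d conflict: reads serialize over 2 cycles\n", r1, r2);
}
```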
(10) Collecting Operands
- Example: from the perspective of a single lane
- Operand re-use across instructions with independent mux control
- No operand re-use across inputs without more wiring; an Xbar gives the most flexible re-use
- Effectively operates as a cache; re-use is determined by the interconnect
[Figure: four operand sets, each holding operands Op0-Op3, feeding a result FIFO]
(11) Operand Caching
- A cache table, indexed by register number, tracks which operands are already latched
- The cache table is queried and set by the dispatch unit
- Register writes invalidate entries in the cache table
- Individual vs. common settings
[Figure: operand sets Op0-Op3 with a register# cache table and a result FIFO]
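A hedged sketch of the cache table the dispatch unit might maintain; the structure and field names are my assumptions in the spirit of the Choquette et al. patent, not its literal design:

```cpp
#include <array>
#include <cstdint>
#include <optional>

struct CacheTableEntry {
    bool     valid = false;
    uint16_t reg   = 0;     // architectural register number
    uint8_t  slot  = 0;     // which operand set / mux input holds the value
};

struct CacheTable {
    std::array<CacheTableEntry, 4> entries{};   // one per operand set (assumed)

    // Dispatch queries: is this register already latched in a collector slot?
    std::optional<uint8_t> lookup(uint16_t reg) const {
        for (const auto& e : entries)
            if (e.valid && e.reg == reg) return e.slot;
        return std::nullopt;                    // miss: must read the MRF
    }
    // Dispatch records a newly fetched operand.
    void fill(uint8_t slot, uint16_t reg) { entries[slot] = {true, reg, slot}; }

    // A register write makes any cached copy stale: invalidate it.
    void invalidate_on_write(uint16_t reg) {
        for (auto& e : entries)
            if (e.valid && e.reg == reg) e.valid = false;
    }
};
```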
(12) Instruction Dispatch
- Decode to get register IDs
- Check the operand cache (cache table)
- Set the crossbar and MRF bank indices
- Update the cache table
- Set the ALU and OC interconnect
[Figure: SIMT pipeline (I-Fetch, Decode, Issue, Writeback) feeding the operand sets]
(13) Instruction Perspective
- OCs request read operands
- The arbiter prioritizes writes over reads
- Read requests are scheduled for maximum bandwidth across the banks
[Figure: banks 0-15, 1024-bit crossbar, arbiter, OC/DU, ALUs, L/S, SFU]
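A minimal sketch of such an arbiter: each cycle it grants at most one request per single-ported bank, and a pending write to a bank beats any read to that bank. The two-pass policy is an assumption for illustration:

```cpp
#include <vector>

struct Request { int bank; bool is_write; int oc_id; };

// Returns the requests granted this cycle: at most one per bank,
// writes first, then as many reads as the free banks allow.
std::vector<Request> arbitrate(const std::vector<Request>& pending, int n_banks) {
    std::vector<Request> granted;
    std::vector<char> bank_taken(n_banks, 0);
    for (int pass = 0; pass < 2; ++pass) {          // pass 0: writes, pass 1: reads
        for (const Request& r : pending) {
            bool want = (pass == 0) ? r.is_write : !r.is_write;
            if (want && !bank_taken[r.bank]) {
                bank_taken[r.bank] = 1;
                granted.push_back(r);
            }
        }
    }
    return granted;
}
```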
(14) The Operand Collector
- Buffers warp-sized operands at the collector
- Enables sharing of operands across instructions
- Operates as an operand cache, "collecting" operands across bank conflicts
- Simplifies scheduling of accesses to the MRF
[Figure: banks 0-15, 1024-bit crossbar, arbiter, OC/DU, ALUs, L/S, SFU]
(15) Register File Access: Coherency
- What happens when a new value is written back to the Main Register File (MRF)? OC values must be invalidated
[Figure: an example OC holding one instruction's operands; each source-operand entry holds a valid bit (V), register ID (Reg), ready bit (RDY), warp ID (WID), and a 128-byte operand]
- See http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#Register_Access_and_the_Operand_Collector
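A sketch of one operand collector mirroring the per-entry fields in the figure (V, Reg, RDY, WID, 128-byte operand); the invalidate-and-re-request behavior on writeback is my reading of the coherency requirement:

```cpp
#include <array>
#include <cstdint>

struct OCEntry {
    bool     valid = false;               // V: slot is in use
    uint16_t reg   = 0;                   // Reg: which register it holds
    bool     ready = false;               // RDY: operand has arrived from MRF
    uint16_t wid   = 0;                   // WID: owning warp
    std::array<uint32_t, 32> operand{};   // 128 bytes: 32 lanes x 4 bytes
};

struct OperandCollector {
    std::array<OCEntry, 3> src{};         // up to 3 source operands (assumed)

    bool all_ready() const {
        for (const auto& e : src)
            if (e.valid && !e.ready) return false;
        return true;                      // instruction may dispatch
    }
    // Coherency: a writeback to (wid, reg) makes a cached copy stale.
    void on_writeback(uint16_t wid, uint16_t reg) {
        for (auto& e : src)
            if (e.valid && e.wid == wid && e.reg == reg)
                e.ready = false;          // stale: must re-request from the MRF
    }
};
```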
(16) Pipeline Stage
- An OC is allocated and initialized after decode
- Source operand requests are queued at the arbiter
- Operands/cycle per OC is limited by the interconnect
[Figure: banks 0-15, 1024-bit crossbar, arbiter, OC entries (V, Reg, RDY, WID, operand), ALUs, L/S, SFU]
(17) Functional Unit Level Collection
- Operand collectors are associated with different functional unit types, so the design naturally supports heterogeneity
- Dedicated vs. shared OC units, with connectivity consequences
- Other sources of operands: constant cache, read-only cache
[Figure: banks 0-15, 1024-bit crossbar, arbiter, OC/DU per functional unit type]
(18) Summary
- Register file management and operand dispatch have multiple interacting components
- Performance/complexity tradeoff: increasing concurrency requires increasing interconnect complexity
- Stalls and conflicts require buffering and bypass to keep the execution units utilized
- Good register file allocation is critical
(19) P. Xiang, Y. Yang, and H. Zhou, "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation," HPCA 2014
©Sudhakar Yalamanchili unless otherwise noted
(20) Objectives
- Understand resource fragmentation in streaming multiprocessors
- Understand the challenges and potential solutions of mitigation techniques
(21) Keeping Track of Resources
- Thread Block Control Registers (TBCR), per SMX: the KDE (kernel distributor entry) index and block ID (BLKID) of each scheduled TB (in execution)
- SMX Scheduler Control Registers (SSCR): the KDE index of the next TB to be scheduled (NextBL)
- What resources do we need to launch a TB? When can we launch a TB?
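A hedged sketch of these control registers as plain structs; the field widths and names are guesses from the slide's labels:

```cpp
#include <cstdint>

// One per thread block in execution on an SMX.
struct TBCR {
    uint16_t kde_index;   // which kernel this TB belongs to
    uint16_t blk_id;      // scheduled TB ID (in execution)
};

// One per SMX scheduler: the next TB to be scheduled.
struct SSCR {
    uint16_t kde_index;   // kernel of the next TB
    uint16_t next_bl;     // next TB to be scheduled
};
```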
(22) The Fragmentation Problem
- Spatial underutilization: available registers that no resident TB can use
- Temporal underutilization: registers still allocated to completed (idle) warps in a TB while the last warp executes
- Goal: how can we improve utilization?
[Figure: TBs, warp contexts, and the last running warp, illustrating temporal and spatial underutilization]
(23) Key Issues
- TB resources are not released until the last warp has completed execution: idle registers, idle shared memory segments, idle warp contexts
- What are the limiting factors? Shared memory size, # registers/thread, and # threads/SM (equivalent to # warps/SM); see the sketch below
- What can be done to increase utilization?
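These limiting factors combine as a minimum: the number of resident TBs is bounded by whichever resource runs out first. A small sketch (the resource numbers in main are illustrative, Fermi-like values):

```cpp
#include <algorithm>
#include <cstdio>

int tbs_per_sm(int smem_per_tb, int regs_per_thread, int threads_per_tb,
               int sm_smem, int sm_regs, int sm_threads, int sm_max_tbs) {
    int by_smem    = smem_per_tb ? sm_smem / smem_per_tb : sm_max_tbs;
    int by_regs    = sm_regs / (regs_per_thread * threads_per_tb);
    int by_threads = sm_threads / threads_per_tb;
    return std::min({by_smem, by_regs, by_threads, sm_max_tbs});
}

int main() {
    // 48 KB smem, 32K registers, 1536 threads, 8 TB slots per SM (assumed).
    printf("TBs/SM = %d\n",
           tbs_per_sm(/*smem*/ 0, /*regs*/ 32, /*threads*/ 256,
                      48 * 1024, 32 * 1024, 1536, 8));   // 4, limited by registers
}
```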
(24) Register Utilization
- Idle register usage is caused by saturation of other resources
- A large percentage of idle registers could be exploited: prefetchers? power gating for a reduction in static power?
- Occupancy is limited by the number of TBs, not by registers
From P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"
(25) Execution Imbalance
- Causes: input-dependent workload imbalance, program-dependent workload imbalance, memory divergence, and the warp scheduling policy
[Figure: completed warps sitting idle while the remaining warps of the TB execute]
From P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"
(26) Partial TB Dispatch
- Dispatch some warps from the next TB before a full TB's worth of resources is free
- Check TB-level SM resources; the other checks are for warp-level resources
- Needs an entry in the per-SMX Thread Block Control Registers (TBCR): KDE index and scheduled TB ID (in execution)
- Needs support for partial dispatch: tracking issued vs. non-issued warps, sufficient registers, and TB-level storage (warp information, tracking of dispatched warps for a partial TB)
(27) Partial TB Dispatch (2)
- A workload buffer records the partially dispatched TB: TBID, Start_warpID, End_warpID, Valid
- Always dispatch warps from the partial TB first
- Only one partial TB at a time (see the sketch below)
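A minimal sketch of that workload buffer; the field names follow the slide, and the drain-the-partial-TB-first policy is encoded in dispatch_next (the logic is my reading of the scheme):

```cpp
#include <cstdint>

struct WorkloadBuffer {
    uint16_t tb_id      = 0;
    uint8_t  start_warp = 0;   // next warp of the partial TB to dispatch
    uint8_t  end_warp   = 0;   // one past the last warp of the TB
    bool     valid      = false;

    // Scheduler policy: drain the partial TB before starting a new one.
    bool has_pending() const { return valid && start_warp < end_warp; }

    uint8_t dispatch_next() {             // call only when has_pending()
        uint8_t w = start_warp++;
        if (start_warp == end_warp) valid = false;   // TB fully dispatched
        return w;
    }
};
```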
(28) Summary
- Software tuning of TB size is possible, but it can adversely affect intra-TB sharing
- Power savings advantages: static power drops due to reduced execution time
- The workload remains roughly constant, with a possible increase in contention at shared resources
- The required hardware support is relatively modest
(29) D. Tarjan and K. Skadron, "On Demand Register Allocation and De-Allocation for a Multithreaded Processor"
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
(30) Register File Fragmentation
- Temporal underutilization: registers remain allocated to completed warps in the TB until the last warp finishes
- Goal: how can we improve utilization?
[Figure: TBs, warp contexts, and the last running warp, as before]
(31) Goal
- Register allocation and de-allocation to increase utilization and/or lower power
- Dynamic, cycle-level allocation and de-allocation via register remapping (renaming)
- Increase performance for a fixed amount of register storage, or decrease register storage for fixed performance
[Figure: banked register file with 1024-bit crossbar, arbiter, ALUs, L/S, SFU]
(32) Approach
- For each thread, allocate a register on write rather than at thread creation, and release it as soon as possible
- A just-in-time (JIT) hardware register allocator for multithreaded processors
- Needs a spilling mechanism to avoid deadlock (remember, this is a JIT allocator!)
- Basic idea: register renaming
(33) Overview
- Pipeline flow: Decode -> Rename -> Issue
- A rename map translates virtual register IDs to physical register IDs; an allocation check against a free list supplies new physical registers (true/false)
- Cycle-level dynamic allocation/de-allocation; recycling of register IDs
- Maintaining performance and correctness
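A hedged sketch of the renaming core: allocate a physical register on first write, recycle it after the last read. Here the last read is signaled explicitly; real hardware would derive it from compiler hints or reader counts, and an empty free list is exactly where dynamic spilling would kick in:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct RenameAllocator {
    std::unordered_map<uint32_t, uint16_t> map;  // (tid, vreg) key -> preg
    std::vector<uint16_t> free_list;

    explicit RenameAllocator(uint16_t n_pregs) {
        for (uint16_t p = 0; p < n_pregs; ++p) free_list.push_back(p);
    }
    static uint32_t key(uint16_t tid, uint16_t vreg) {
        return (uint32_t(tid) << 16) | vreg;
    }
    // On a write: map the virtual register to a physical register.
    // Returns false if none is free (would trigger a spill).
    bool on_write(uint16_t tid, uint16_t vreg, uint16_t& preg) {
        uint32_t k = key(tid, vreg);
        auto it = map.find(k);
        if (it != map.end()) { preg = it->second; return true; }  // rewrite in place
        if (free_list.empty()) return false;     // deadlock risk: must spill
        preg = free_list.back(); free_list.pop_back();
        map[k] = preg;
        return true;
    }
    // On the last read: recycle the physical register.
    void on_last_read(uint16_t tid, uint16_t vreg) {
        auto it = map.find(key(tid, vreg));
        if (it != map.end()) { free_list.push_back(it->second); map.erase(it); }
    }
};
```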
(34) Basic Steps
- Allocation
- De-allocation: in-order issue vs. out-of-order issue
- Register footprints and MRF size: size for the maximum footprint
  - Drowsy and power-gated register cell modes
- Dynamic spilling
(35) Dynamic Spilling
- Strategies for spilling
- Spill to memory: the virtual register ID (VID) is treated as an offset into a register spill area; base address + offset gives the memory address, so a spilled register is serviced by the L1 D-cache ("register or cache?")
- Spill to local storage: spill from the MRF into a secondary register file
[Figure: MRF spilling to the L1 D-cache vs. to a secondary register file]
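A sketch of the two spill targets; for spill-to-memory the VID indexes a per-thread spill area, so a spilled register becomes an ordinary cache/memory access. The base address and layout here are invented for illustration:

```cpp
#include <cstdint>

constexpr uint32_t kSpillBase     = 0x10000000;  // base of spill area (assumed)
constexpr uint32_t kRegsPerThread = 64;          // per-thread slots (assumed)

// Spill to memory: treat (tid, vreg) as an offset into the spill area,
// so the spilled value is reached through the L1 D-cache like any load.
uint32_t spill_address(uint32_t tid, uint32_t vreg) {
    return kSpillBase + (tid * kRegsPerThread + vreg) * 4;
}

// Spill to local storage: move the value into a slower secondary
// register file instead of memory (index mapping is illustrative).
uint32_t secondary_rf_slot(uint32_t tid, uint32_t vreg) {
    return tid * kRegsPerThread + vreg;
}
```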
(36) An Experiment
- Create TBs of 256 threads and 32 registers/thread; each TB requires 8K registers
- GTX 480: 32K registers per SM
- The expected occupancy for each SM is 4 TBs, but the measured occupancy is actually greater! Why? (see the arithmetic below)
From P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"
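The slide's arithmetic, which gives only the static bound; warp-level register recycling is what lets measured occupancy exceed it:

```cpp
#include <cstdio>

int main() {
    const int regs_per_thread = 32, threads_per_tb = 256;
    const int regs_per_tb = regs_per_thread * threads_per_tb;        // 8192
    const int sm_regs     = 32 * 1024;                               // GTX 480
    printf("static occupancy: %d TBs/SM\n", sm_regs / regs_per_tb);  // 4
    // Measured occupancy can exceed this if registers of completed warps
    // are recycled before the whole TB finishes (the paper's observation).
}
```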
(37) Summary
- The net effect is to improve register file utilization
- Registers are recycled at a finer granularity than TB boundaries
- You can run experiments on NVIDIA parts to observe these effects (and speculatively infer that something like this is happening)