(1) Register File Organization
©Sudhakar Yalamanchili unless otherwise noted
(2) Objective
- Understand the organization of the large register files used in GPUs
- Identify the performance bottlenecks and opportunities for optimization in accessing the register file
(3) Reading
- S. Liu et al., "Operand Collector Architecture," US Patent 7,834,881 (perspective of a lane)
- J. H. Choquette et al., "Methods and Apparatus for Source Operand Caching," US Patent 8,639,882 (perspective of instruction scheduling)
- GPGPU-Sim Manual: http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#Introduction
(4) Register File Access: Recap
[Figure: SIMT pipeline (I-Fetch, Decode, I-Buffer of pending warps, Issue, register file access, scalar pipelines, D-Cache, Writeback) in front of the register file organization: 16 single-ported 1024-bit banks (banks 0-15) behind an arbiter and crossbar, feeding operand collectors (OC) and dispatch units (DU) for the ALUs, L/S, and SFU]
(5) The SM Register File
- NVIDIA Fermi: 128 KB/SM, 2 MB per device
- Throughput-optimized design
- 32 threads/warp; up to 96 operands/warp/cycle (3 source operands per instruction x 32 lanes)
- Organization?
[Figure: warps accessing the Main Register File (MRF) through a crossbar that feeds the SP, LS, and SF units]
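As a sanity check on these numbers, here is a minimal back-of-the-envelope sketch (assuming 4-byte registers and the FMA worst case of 3 source operands per instruction; all other constants come from the slide):

```cpp
// Back-of-the-envelope sketch of Fermi-era register file sizing.
#include <cstdio>

int main() {
    const int regfile_bytes = 128 * 1024;   // 128 KB per SM
    const int reg_bytes     = 4;            // 32-bit registers (assumption)
    const int lanes         = 32;           // threads per warp
    const int src_operands  = 3;            // e.g., FMA: d = a*b + c

    printf("registers/SM        : %d\n", regfile_bytes / reg_bytes);            // 32768
    printf("operands/warp/cycle : %d\n", src_operands * lanes);                 // 96
    printf("operand BW/SM/cycle : %d bytes\n", src_operands * lanes * reg_bytes); // 384
}
```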
(6) Multi-ported Register Files
- In a multi-ported register file organization, area and delay grow with the number of ports
- Instead, use multiple banks, each with a single read port and a single write port, to emulate multiple ports
(7) Multiple Banks
- 1R/1W: a single read port and a single write port per bank
- Each access to a register bank produces the same-named register for every lane
- Multiple banks can be accessed concurrently
[Figure: bank organization; each bank holds registers R0-R63, and each 1024-bit row is one warp-wide register (32 lanes x 32 bits)]
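A minimal C++ sketch of a banked, single-read-port register file; the 16-bank/1024-bit parameters match the figure, but the skewed (warp + register) mapping and 64-row bank depth are illustrative assumptions, not NVIDIA's actual design:

```cpp
// Sketch of a banked register file with one read port per bank.
#include <array>
#include <cstdint>
#include <vector>

constexpr int kBanks = 16;
constexpr int kLanes = 32;                     // one 32-bit value per lane
using WarpRow = std::array<uint32_t, kLanes>;  // a 1024-bit bank row

struct Bank {
    std::vector<WarpRow> rows;                 // e.g., R0..R63 (assumed depth)
    bool read_busy = false;                    // single read port: 1 access/cycle
    Bank() : rows(64) {}
};

struct BankedRF {
    std::array<Bank, kBanks> banks;

    static int bank_of(int warp, int reg) {
        return (warp + reg) % kBanks;          // simple skewed mapping (assumption)
    }
    // Returns false on a port conflict; the caller must retry next cycle.
    bool read(int warp, int reg, WarpRow& out) {
        Bank& b = banks[bank_of(warp, reg)];
        if (b.read_busy) return false;         // bank conflict this cycle
        b.read_busy = true;
        out = b.rows[reg % 64];
        return true;
    }
    void end_cycle() { for (auto& b : banks) b.read_busy = false; }
};
```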
(8) Thread Register Allocation
- The operands (registers) of a thread can be mapped across register banks in different ways: thin, fat, and mixed allocation
- Skewed allocation
- Goal: maximum bandwidth (see the sketch below)
[Figure: fat allocation keeps a thread's registers together in a bank (T1-T4, one thread per bank); thin allocation stripes each thread's registers across banks 0-3]
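The allocation styles differ only in the bank-index function, so a hedged sketch is just three mappings (the function names and the 4-bank setup are mine):

```cpp
// Sketch of thin vs. fat register-to-bank mappings.
// "reg" is a thread's architectural register index, "tid" its thread ID.
constexpr int kBanks = 4;

// Fat: all of a thread's registers live in one bank (chosen by thread ID).
int fat_bank(int tid, int /*reg*/) { return tid % kBanks; }

// Thin: each thread's registers are striped across all banks.
int thin_bank(int /*tid*/, int reg) { return reg % kBanks; }

// Skewed thin: stripe, but offset by thread ID so different threads
// touching the same register index hit different banks.
int skewed_bank(int tid, int reg) { return (tid + reg) % kBanks; }
```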
(9) Register File Organization
- Example: a 128 KB register file per SM with 1536 threads/SM (48 warps/SM) gives about 21 registers/thread
- Why and when do bank conflicts occur?
- Operand access for a warp may take several cycles, so we need a way to collect operands!
[Figure: single-ported register banks behind a 1024-bit crossbar and arbiter]
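The arithmetic behind the example, plus one concrete way a bank conflict arises: two source operands of the same instruction land in the same single-ported bank, so their reads must serialize. A small sketch under the same illustrative skewed mapping as before:

```cpp
#include <cstdio>

int main() {
    const int regfile_regs = (128 * 1024) / 4;   // 32768 registers
    const int threads      = 1536;
    printf("registers/thread: %d\n", regfile_regs / threads);   // ~21

    // Conflict case: two operands of one instruction map to one bank.
    const int kBanks = 16;
    auto bank_of = [&](int warp, int reg) { return (warp + reg) % kBanks; };
    int w = 3, r1 = 2, r2 = 18;                  // r2 - r1 == 16 == kBanks
    if (bank_of(w, r1) == bank_of(w, r2))
        printf("r%d and r%d conflict: reads serialize over 2 cycles\n", r1, r2);
}
```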
(10) Collecting Operands
- Example: from the perspective of a single lane
- Operand re-use across instructions with independent mux control
- No operand re-use across inputs without more wiring; an Xbar gives the most flexible re-use
- Effectively operates as a cache; re-use is determined by the interconnect
[Figure: four operand sets, each holding operands Op0-Op3, feeding a result FIFO]
(11) Operand Caching
- A cache table, indexed by register number, tracks which operands are already latched
- The cache table is queried and set by the dispatch unit
- Register writes invalidate entries in the cache table
- Individual vs. common settings
[Figure: operand sets Op0-Op3 with a register# cache table and a result FIFO]
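A hedged sketch of the cache table the dispatch unit might maintain; the structure and field names are my assumptions in the spirit of the Choquette et al. patent, not its literal design:

```cpp
#include <array>
#include <cstdint>
#include <optional>

struct CacheTableEntry {
    bool     valid = false;
    uint16_t reg   = 0;     // architectural register number
    uint8_t  slot  = 0;     // which operand set / mux input holds the value
};

struct CacheTable {
    std::array<CacheTableEntry, 4> entries{};   // one per operand set (assumed)

    // Dispatch queries: is this register already latched in a collector slot?
    std::optional<uint8_t> lookup(uint16_t reg) const {
        for (const auto& e : entries)
            if (e.valid && e.reg == reg) return e.slot;
        return std::nullopt;                    // miss: must read the MRF
    }
    // Dispatch records a newly fetched operand.
    void fill(uint8_t slot, uint16_t reg) { entries[slot] = {true, reg, slot}; }

    // A register write makes any cached copy stale: invalidate it.
    void invalidate_on_write(uint16_t reg) {
        for (auto& e : entries)
            if (e.valid && e.reg == reg) e.valid = false;
    }
};
```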
(12) Instruction Dispatch
- Decode to get register IDs
- Check the operand cache (cache table)
- Set the crossbar and MRF bank indices
- Update the cache table
- Set the ALU and OC interconnect
[Figure: SIMT pipeline (I-Fetch, Decode, Issue, Writeback) feeding the operand sets]
(13) Instruction Perspective
- OCs request read operands
- The arbiter prioritizes writes over reads
- Read requests are scheduled for maximum bandwidth across the banks
[Figure: banks 0-15, 1024-bit crossbar, arbiter, OC/DU, ALUs, L/S, SFU]
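A minimal sketch of such an arbiter: each cycle it grants at most one request per single-ported bank, and a pending write to a bank beats any read to that bank. The two-pass policy is an assumption for illustration:

```cpp
#include <vector>

struct Request { int bank; bool is_write; int oc_id; };

// Returns the requests granted this cycle: at most one per bank,
// writes first, then as many reads as the free banks allow.
std::vector<Request> arbitrate(const std::vector<Request>& pending, int n_banks) {
    std::vector<Request> granted;
    std::vector<char> bank_taken(n_banks, 0);
    for (int pass = 0; pass < 2; ++pass) {          // pass 0: writes, pass 1: reads
        for (const Request& r : pending) {
            bool want = (pass == 0) ? r.is_write : !r.is_write;
            if (want && !bank_taken[r.bank]) {
                bank_taken[r.bank] = 1;
                granted.push_back(r);
            }
        }
    }
    return granted;
}
```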
(14) The Operand Collector
- Buffers warp-sized operands at the collector
- Enables sharing of operands across instructions
- Operates as an operand cache, "collecting" operands across bank conflicts
- Simplifies scheduling of accesses to the MRF
[Figure: banks 0-15, 1024-bit crossbar, arbiter, OC/DU, ALUs, L/S, SFU]
(15) Register File Access: Coherency
- What happens when a new value is written back to the Main Register File (MRF)? OC values must be invalidated
[Figure: an example OC holding one instruction's operands; each source-operand entry holds a valid bit (V), register ID (Reg), ready bit (RDY), warp ID (WID), and a 128-byte operand]
- See http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#Register_Access_and_the_Operand_Collector
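A sketch of one operand collector mirroring the per-entry fields in the figure (V, Reg, RDY, WID, 128-byte operand); the invalidate-and-re-request behavior on writeback is my reading of the coherency requirement:

```cpp
#include <array>
#include <cstdint>

struct OCEntry {
    bool     valid = false;               // V: slot is in use
    uint16_t reg   = 0;                   // Reg: which register it holds
    bool     ready = false;               // RDY: operand has arrived from MRF
    uint16_t wid   = 0;                   // WID: owning warp
    std::array<uint32_t, 32> operand{};   // 128 bytes: 32 lanes x 4 bytes
};

struct OperandCollector {
    std::array<OCEntry, 3> src{};         // up to 3 source operands (assumed)

    bool all_ready() const {
        for (const auto& e : src)
            if (e.valid && !e.ready) return false;
        return true;                      // instruction may dispatch
    }
    // Coherency: a writeback to (wid, reg) makes a cached copy stale.
    void on_writeback(uint16_t wid, uint16_t reg) {
        for (auto& e : src)
            if (e.valid && e.wid == wid && e.reg == reg)
                e.ready = false;          // stale: must re-request from the MRF
    }
};
```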
(16) Pipeline Stage
- An OC is allocated and initialized after decode
- Source operand requests are queued at the arbiter
- Operands/cycle per OC is limited by the interconnect
[Figure: banks 0-15, 1024-bit crossbar, arbiter, OC entries (V, Reg, RDY, WID, operand), ALUs, L/S, SFU]
(17) Functional Unit Level Collection
- Operand collectors are associated with different functional unit types, so the design naturally supports heterogeneity
- Dedicated vs. shared OC units, with connectivity consequences
- Other sources of operands: constant cache, read-only cache
[Figure: banks 0-15, 1024-bit crossbar, arbiter, OC/DU per functional unit type]
(18) Summary
- Register file management and operand dispatch have multiple interacting components
- Performance/complexity tradeoff: increasing concurrency requires increasing interconnect complexity
- Stalls and conflicts require buffering and bypass to keep the execution units utilized
- Good register file allocation is critical
(19) P. Xiang, Y. Yang, and H. Zhou, "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation," HPCA 2014
©Sudhakar Yalamanchili unless otherwise noted
(20) Objectives
- Understand resource fragmentation in streaming multiprocessors
- Understand the challenges and potential solutions of mitigation techniques
(21) Keeping Track of Resources
- Thread Block Control Registers (TBCR), per SMX: the KDE (kernel distributor entry) index and block ID (BLKID) of each scheduled TB (in execution)
- SMX Scheduler Control Registers (SSCR): the KDE index of the next TB to be scheduled (NextBL)
- What resources do we need to launch a TB? When can we launch a TB?
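A hedged sketch of these control registers as plain structs; the field widths and names are guesses from the slide's labels:

```cpp
#include <cstdint>

// One per thread block in execution on an SMX.
struct TBCR {
    uint16_t kde_index;   // which kernel this TB belongs to
    uint16_t blk_id;      // scheduled TB ID (in execution)
};

// One per SMX scheduler: the next TB to be scheduled.
struct SSCR {
    uint16_t kde_index;   // kernel of the next TB
    uint16_t next_bl;     // next TB to be scheduled
};
```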
(22) The Fragmentation Problem
- Spatial underutilization: available registers that no resident TB can use
- Temporal underutilization: registers still allocated to completed (idle) warps in a TB while the last warp executes
- Goal: how can we improve utilization?
[Figure: TBs, warp contexts, and the last running warp, illustrating temporal and spatial underutilization]
(23) Key Issues
- TB resources are not released until the last warp has completed execution: idle registers, idle shared memory segments, idle warp contexts
- What are the limiting factors? Shared memory size, # registers/thread, and # threads/SM (equivalent to # warps/SM); see the sketch below
- What can be done to increase utilization?
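These limiting factors combine as a minimum: the number of resident TBs is bounded by whichever resource runs out first. A small sketch (the resource numbers in main are illustrative, Fermi-like values):

```cpp
#include <algorithm>
#include <cstdio>

int tbs_per_sm(int smem_per_tb, int regs_per_thread, int threads_per_tb,
               int sm_smem, int sm_regs, int sm_threads, int sm_max_tbs) {
    int by_smem    = smem_per_tb ? sm_smem / smem_per_tb : sm_max_tbs;
    int by_regs    = sm_regs / (regs_per_thread * threads_per_tb);
    int by_threads = sm_threads / threads_per_tb;
    return std::min({by_smem, by_regs, by_threads, sm_max_tbs});
}

int main() {
    // 48 KB smem, 32K registers, 1536 threads, 8 TB slots per SM (assumed).
    printf("TBs/SM = %d\n",
           tbs_per_sm(/*smem*/ 0, /*regs*/ 32, /*threads*/ 256,
                      48 * 1024, 32 * 1024, 1536, 8));   // 4, limited by registers
}
```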
(24) Register Utilization
- Idle register usage is caused by saturation of other resources
- A large percentage of idle registers could be exploited: prefetchers? power gating for a reduction in static power?
- Occupancy is limited by the number of TBs, not by registers
From P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"
(25) Execution Imbalance
- Causes: input-dependent workload imbalance, program-dependent workload imbalance, memory divergence, and the warp scheduling policy
[Figure: completed warps sitting idle while the remaining warps of the TB execute]
From P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"
(26) Partial TB Dispatch
- Dispatch some warps from the next TB before a full TB's worth of resources is free
- Check TB-level SM resources; the other checks are for warp-level resources
- Needs an entry in the per-SMX Thread Block Control Registers (TBCR): KDE index and scheduled TB ID (in execution)
- Needs support for partial dispatch: tracking issued vs. non-issued warps, sufficient registers, and TB-level storage (warp information, tracking of dispatched warps for a partial TB)
(27) Partial TB Dispatch (2)
- A workload buffer records the partially dispatched TB: TBID, Start_warpID, End_warpID, Valid
- Always dispatch warps from the partial TB first
- Only one partial TB at a time (see the sketch below)
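A minimal sketch of that workload buffer; the field names follow the slide, and the drain-the-partial-TB-first policy is encoded in dispatch_next (the logic is my reading of the scheme):

```cpp
#include <cstdint>

struct WorkloadBuffer {
    uint16_t tb_id      = 0;
    uint8_t  start_warp = 0;   // next warp of the partial TB to dispatch
    uint8_t  end_warp   = 0;   // one past the last warp of the TB
    bool     valid      = false;

    // Scheduler policy: drain the partial TB before starting a new one.
    bool has_pending() const { return valid && start_warp < end_warp; }

    uint8_t dispatch_next() {             // call only when has_pending()
        uint8_t w = start_warp++;
        if (start_warp == end_warp) valid = false;   // TB fully dispatched
        return w;
    }
};
```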
(28) Summary
- Software tuning of TB size is possible, but it can adversely affect intra-TB sharing
- Power savings advantages: static power drops due to reduced execution time
- The workload remains roughly constant, with a possible increase in contention at shared resources
- The required hardware support is relatively modest
(29) D. Tarjan and K. Skadron, "On Demand Register Allocation and De-Allocation for a Multithreaded Processor"
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
(30) Register File Fragmentation
- Temporal underutilization: registers remain allocated to completed warps in the TB until the last warp finishes
- Goal: how can we improve utilization?
[Figure: TBs, warp contexts, and the last running warp, as before]
(31) Goal
- Register allocation and de-allocation to increase utilization and/or lower power
- Dynamic, cycle-level allocation and de-allocation via register remapping (renaming)
- Increase performance for a fixed amount of register storage, or decrease register storage for fixed performance
[Figure: banked register file with 1024-bit crossbar, arbiter, ALUs, L/S, SFU]
(32) Approach
- For each thread, allocate a register on write rather than at thread creation, and release it as soon as possible
- A just-in-time (JIT) hardware register allocator for multithreaded processors
- Needs a spilling mechanism to avoid deadlock (remember, this is a JIT allocator!)
- Basic idea: register renaming
(33) Overview
- Pipeline flow: Decode -> Rename -> Issue
- A rename map translates virtual register IDs to physical register IDs; an allocation check against a free list supplies new physical registers (true/false)
- Cycle-level dynamic allocation/de-allocation; recycling of register IDs
- Maintaining performance and correctness
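A hedged sketch of the renaming core: allocate a physical register on first write, recycle it after the last read. Here the last read is signaled explicitly; real hardware would derive it from compiler hints or reader counts, and an empty free list is exactly where dynamic spilling would kick in:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct RenameAllocator {
    std::unordered_map<uint32_t, uint16_t> map;  // (tid, vreg) key -> preg
    std::vector<uint16_t> free_list;

    explicit RenameAllocator(uint16_t n_pregs) {
        for (uint16_t p = 0; p < n_pregs; ++p) free_list.push_back(p);
    }
    static uint32_t key(uint16_t tid, uint16_t vreg) {
        return (uint32_t(tid) << 16) | vreg;
    }
    // On a write: map the virtual register to a physical register.
    // Returns false if none is free (would trigger a spill).
    bool on_write(uint16_t tid, uint16_t vreg, uint16_t& preg) {
        uint32_t k = key(tid, vreg);
        auto it = map.find(k);
        if (it != map.end()) { preg = it->second; return true; }  // rewrite in place
        if (free_list.empty()) return false;     // deadlock risk: must spill
        preg = free_list.back(); free_list.pop_back();
        map[k] = preg;
        return true;
    }
    // On the last read: recycle the physical register.
    void on_last_read(uint16_t tid, uint16_t vreg) {
        auto it = map.find(key(tid, vreg));
        if (it != map.end()) { free_list.push_back(it->second); map.erase(it); }
    }
};
```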
(34) Basic Steps
- Allocation
- De-allocation: in-order issue vs. out-of-order issue
- Register footprints and MRF size: size for the maximum footprint
  - Drowsy and power-gated register cell modes
- Dynamic spilling
(35) Dynamic Spilling
- Strategies for spilling
- Spill to memory: the virtual register ID (VID) is treated as an offset into a register spill area; base address + offset gives the memory address, so a spilled register is serviced by the L1 D-cache ("register or cache?")
- Spill to local storage: spill from the MRF into a secondary register file
[Figure: MRF spilling to the L1 D-cache vs. to a secondary register file]
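A sketch of the two spill targets; for spill-to-memory the VID indexes a per-thread spill area, so a spilled register becomes an ordinary cache/memory access. The base address and layout here are invented for illustration:

```cpp
#include <cstdint>

constexpr uint32_t kSpillBase     = 0x10000000;  // base of spill area (assumed)
constexpr uint32_t kRegsPerThread = 64;          // per-thread slots (assumed)

// Spill to memory: treat (tid, vreg) as an offset into the spill area,
// so the spilled value is reached through the L1 D-cache like any load.
uint32_t spill_address(uint32_t tid, uint32_t vreg) {
    return kSpillBase + (tid * kRegsPerThread + vreg) * 4;
}

// Spill to local storage: move the value into a slower secondary
// register file instead of memory (index mapping is illustrative).
uint32_t secondary_rf_slot(uint32_t tid, uint32_t vreg) {
    return tid * kRegsPerThread + vreg;
}
```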
(36) An Experiment
- Create TBs of 256 threads and 32 registers/thread; each TB requires 8K registers
- GTX 480: 32K registers per SM
- The expected occupancy for each SM is 4 TBs, but the measured occupancy is actually greater! Why? (see the arithmetic below)
From P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"
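The slide's arithmetic, which gives only the static bound; warp-level register recycling is what lets measured occupancy exceed it:

```cpp
#include <cstdio>

int main() {
    const int regs_per_thread = 32, threads_per_tb = 256;
    const int regs_per_tb = regs_per_thread * threads_per_tb;        // 8192
    const int sm_regs     = 32 * 1024;                               // GTX 480
    printf("static occupancy: %d TBs/SM\n", sm_regs / regs_per_tb);  // 4
    // Measured occupancy can exceed this if registers of completed warps
    // are recycled before the whole TB finishes (the paper's observation).
}
```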
(37) Summary
- The net effect is to improve register file utilization
- Registers are recycled at a finer granularity than TB boundaries
- You can run experiments on NVIDIA parts to observe these effects (and speculatively infer that something like this is happening)