Presentation transcript:

(1) Register File Organization ©Sudhakar Yalamanchili unless otherwise noted

(2) Objective
- Understand the organization of the large register files used in GPUs
- Identify the performance bottlenecks and opportunities for optimization in accessing the register file

(3) Reading
- S. Liu et al., "Operand Collector Architecture," US Patent 7,834,881 (perspective of a lane)
- J. H. Choquette et al., "Methods and Apparatus for Source Operand Caching," US Patent 8,639,882 (perspective of instruction scheduling)
- GPGPU-Sim manual: http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#Introduction

(4) Register File Access: Recap
[Figure: SM pipeline (I-Fetch, Decode, I-Buffer of pending warps, Issue, scalar pipelines, writeback) in front of the register file: single-ported banks 0-15 with 1024-bit ports, connected through an arbiter and crossbar (Xbar) to Operand Collectors (OC) and Dispatch Units (DU) that feed the ALUs, L/S units, and SFUs.]

(5) The SM Register File
- NVIDIA Fermi: 128 KB per SM, 2 MB per device
- Throughput-optimized design: 32 threads/warp, up to 96 operands/warp/cycle (three 32-bit source operands per thread)
- How should such a register file be organized?
[Figure: Main Register File (MRF) connected through a crossbar to SP, LS, and SF units; accesses are warp-wide.]

(6) Multi-ported Register Files
- In a multi-ported register file organization, area and delay grow with the number of ports
- Alternative: use multiple banks, each with a single read port and a single write port, to emulate multiple ports

(7) Multiple Banks
- 1R/1W: a single read port and a single write port per bank
- Each access to a register bank reads the same-named register for every lane
- Multiple banks can be accessed concurrently
[Figure: bank organization; each bank holds registers R0-R63 and is 1024 bits wide (32 bits per lane).]

(8) Thread Register Allocation
- The operands (registers) of a thread can be mapped across the register banks in different ways: thin, fat, and mixed allocations, possibly skewed across banks (a mapping sketch follows)
- The goal is maximum bandwidth, i.e., minimal bank conflicts
[Figure: fat allocation keeps each warp's registers (T1, T2) within a bank; thin allocation stripes each warp's registers (T1-T4) across Banks 0-3.]
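As a concrete illustration of these mappings, the sketch below computes the bank index for register `reg` of warp `warp` under each policy. The bank count and the function names are assumptions for this sketch, not the mapping of any particular GPU.

```cuda
// Illustrative register-to-bank mappings (hypothetical parameters).
constexpr int NUM_BANKS = 4;

// Thin allocation: consecutive registers of one warp stripe across banks,
// so reading r0..r3 of a single warp touches four different banks.
constexpr int thin_bank(int /*warp*/, int reg)  { return reg % NUM_BANKS; }

// Fat allocation: all registers of a warp live in one bank; different
// warps conflict only when they map to the same bank.
constexpr int fat_bank(int warp, int /*reg*/)   { return warp % NUM_BANKS; }

// Skewed (mixed) allocation: stripe registers but offset the starting bank
// by warp ID, so the same-named register of different warps lands in
// different banks.
constexpr int skewed_bank(int warp, int reg)    { return (warp + reg) % NUM_BANKS; }

// Under skewed allocation, R0 of warps 1 and 2 maps to different banks
// instead of colliding in bank 0 as under thin allocation.
static_assert(skewed_bank(1, 0) != skewed_bank(2, 0), "no conflict on R0");
```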

(9) Register File Organization
- Example: a 128 KB register file per SM shared by 1536 threads/SM (48 warps) leaves about 21 registers/thread
- Why and when do bank conflicts occur?
- Operand access for a warp may take several cycles, so we need a way to collect operands!
[Figure: register file banks RF0-RFn-1 with 1024-bit ports and an arbiter.]
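The 21 registers/thread figure is just integer arithmetic over the stated capacity; a minimal check, assuming 32-bit registers:

```cuda
// Back-of-the-envelope register budget (assuming 32-bit registers).
constexpr int RF_BYTES        = 128 * 1024;                      // 128 KB per SM
constexpr int NUM_REGISTERS   = RF_BYTES / 4;                    // 32768 registers
constexpr int THREADS_PER_SM  = 1536;                            // 48 warps x 32 threads
constexpr int REGS_PER_THREAD = NUM_REGISTERS / THREADS_PER_SM;  // = 21
static_assert(REGS_PER_THREAD == 21, "1536 threads leave ~21 registers each");
```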

(10) Collecting Operands
From the perspective of a single lane, buffer operands in operand sets that effectively operate as a cache:
- Independent mux control allows operand re-use across instructions
- Without re-use across inputs, more wiring is needed
- An Xbar gives the most flexible re-use; the achievable re-use is determined by the interconnect
[Figure: four operand sets (each holding Op0-Op3) feeding the lane's inputs through muxes, with a result FIFO.]

(11) Operand Caching
- A cache table maps register numbers to operand sets and is queried by the dispatch unit
- Register writes invalidate entries in the cache table
- Mux settings are set by the dispatch unit (individual vs. common settings)
[Figure: Operand Sets 0-3 (each holding Op0-Op3), the Register# cache table, and the result FIFO.]
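A minimal behavioral sketch of this cache table, assuming four operand-set slots; the structure and names are illustrative (the cited patents describe the real mechanisms):

```cuda
#include <cstdint>

// Hypothetical per-lane operand cache table: maps a register number to the
// operand-set slot currently holding its value, if any.
struct OperandCacheTable {
    static constexpr int NUM_SLOTS = 4;                // Operand Sets 0-3
    int16_t cached_reg[NUM_SLOTS] = {-1, -1, -1, -1};  // -1 = empty slot

    // Queried by the dispatch unit: is register r already buffered?
    int lookup(int r) const {
        for (int s = 0; s < NUM_SLOTS; ++s)
            if (cached_reg[s] == r) return s;
        return -1;                                     // miss: read the MRF bank
    }

    // A register write invalidates any matching entry (coherence with the MRF).
    void invalidate(int r) {
        for (int s = 0; s < NUM_SLOTS; ++s)
            if (cached_reg[s] == r) cached_reg[s] = -1;
    }

    // Set by the dispatch unit after an MRF read fills a slot.
    void fill(int slot, int r) { cached_reg[slot] = int16_t(r); }
};
```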

(12) Instruction Dispatch
Steps performed at dispatch:
- Decode to get register IDs
- Check the operand cache (cache table)
- Set the crossbar and MRF bank indices
- Update the cache table
- Set the ALU and OC interconnect
[Figure: SM front end (I-Fetch, Decode, I-Buffer of pending warps, Issue) alongside the operand sets (Op0-Op3).]

(13) Instruction Perspective
- OCs request read operands from the MRF banks
- The arbiter prioritizes writes over reads
- Read requests are scheduled for maximum bank bandwidth
[Figure: single-ported banks 0-15 (1024-bit), arbiter, crossbar, and OC/DU stages feeding the ALUs, L/S units, and SFUs.]

(14) The Operand Collector
- Buffers warp-sized operands at the collector
- Enables sharing of operands across instructions
- Operates as an operand cache for "collecting" operands across bank conflicts
- Simplifies scheduling of accesses to the MRF
[Figure: the same banked MRF, arbiter, crossbar, and OC/DU organization as before.]

(15) Register File Access: Coherency
What happens when a new value is written back to the Main Register File (MRF)? OC values must be invalidated.
[Figure: an example OC holding an instruction plus, per source operand, the fields V (valid), Reg (register number), RDY (ready), WID (warp ID), and a 128-byte operand.]
See http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#Register_Access_and_the_Operand_Collector
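The fields shown in the figure suggest a layout like the sketch below; the types and widths are assumptions for illustration, not GPGPU-Sim's or NVIDIA's actual encoding:

```cuda
#include <cstdint>

// Hypothetical layout of one operand-collector source-operand entry.
struct OCEntry {
    bool     valid;        // V: entry allocated for this instruction
    uint8_t  reg;          // Reg: architectural register number
    bool     ready;        // RDY: operand has arrived from the MRF
    uint8_t  wid;          // WID: warp the operand belongs to
    uint32_t operand[32];  // 128 bytes: one 32-bit value per lane
};

// An OC slot buffers one decoded warp instruction plus its (up to three)
// source operands; the instruction can issue once all valid entries are ready.
struct OCSlot {
    uint64_t instruction;  // placeholder encoding of the decoded instruction
    OCEntry  src[3];

    bool ready_to_issue() const {
        for (const OCEntry& e : src)
            if (e.valid && !e.ready) return false;
        return true;
    }
};
```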

(16) Pipeline Stage
- An OC is allocated and initialized after decode
- Source operand requests are queued at the arbiter
- The operands/cycle delivered to the OCs is limited by the interconnect
[Figure: banked MRF with arbiter and crossbar; OC entries with V, Reg, RDY, WID, and 128-byte operand fields.]

(17) Functional Unit Level Collection
- Operand collectors are associated with different functional unit types, which naturally supports heterogeneity
- Dedicated vs. shared OC units have different connectivity consequences
- Other sources of operands: the constant cache and the read-only cache
[Figure: banked MRF, arbiter, crossbar, and OC/DU stages feeding the ALUs, L/S units, and SFUs.]

(18) Summary
- Register file management and operand dispatch have multiple interacting components
- Performance/complexity tradeoff: increasing concurrency requires increasing interconnect complexity, and stalls/conflicts require buffering and bypass to keep the execution units utilized
- Good register file allocation is critical

(19) P. Xiang, Y. Yang, and H. Zhou, "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation," HPCA 2014 ©Sudhakar Yalamanchili unless otherwise noted

(20) Objectives
- Understand resource fragmentation in streaming multiprocessors
- Understand the challenges and potential solutions of mitigation techniques

(21) Keeping Track of Resources
- Thread Block Control Registers (TBCR), per SMX: each entry holds KDEI (KDE index) and BLKID (the ID of a scheduled TB in execution)
- SMX Scheduler Control Registers (SSCR): KDEI (KDE index) and NextBL (the next TB to be scheduled)
- What resources do we need to launch a TB? When can we launch one?

(22) The Fragmentation Problem
- Spatial underutilization: available but unallocated registers
- Temporal underutilization: registers still allocated to completed (idle) warps of a TB while its last warp finishes
- Goal: how can we improve utilization?
[Figure: TBs and warp contexts on an SM, highlighting the last running warp and the completed (idle) warp contexts.]

(23) Key Issues
- TB resources are not released until the last warp has completed execution: idle registers, idle shared memory segments, idle warp contexts
- What are the limiting factors? Shared memory size, #registers/thread, #threads/SM (equivalent to #warps/SM)
- What can be done to increase utilization?

(24) Register Utilization
- Idle register usage is caused by saturation of other resources; occupancy is limited by the number of TBs, not by registers
- A large percentage of idle registers could be exploited, e.g., by prefetchers, or power-gated for a reduction in static power
From P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"

(25) Execution Imbalance
Causes of completed (idle) warps within a TB:
- Input-dependent workload imbalance
- Program-dependent workload imbalance
- Memory divergence
- Warp scheduling policy
From P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"

(26) Partial TB Dispatch
- Idea: dispatch some warps from the next TB before a full TB's worth of resources is free
- Check TB-level SM resources (including a free TBCR entry holding the KDE index and scheduled TB ID); the other checks are for warp-level resources (sufficient registers, free warp contexts)
- Needs support for partial dispatch: TB-level storage for warp information and tracking of issued vs. non-issued warps of the partial TB

(27) Partial TB Dispatch (2)
- Per-SMX Thread Block Control Registers (TBCR): the partial TB still needs an entry (KDE index)
- A Workload Buffer entry tracks the partial TB: TBID, Start_warpID, End_warpID, Valid
- Always dispatch the remaining warps of the partial TB first; only one partial TB is allowed at a time
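The slide's Workload Buffer fields map naturally onto a small record; the sketch below is illustrative (field widths and names beyond those on the slide are assumptions):

```cuda
#include <cstdint>

// Hypothetical Workload Buffer entry for partial TB dispatch. It tracks the
// warps of the single partially dispatched TB that have not yet been issued.
struct WorkloadBufferEntry {
    uint16_t tb_id;          // TBID: which thread block is partially dispatched
    uint8_t  start_warp_id;  // Start_warpID: first not-yet-dispatched warp
    uint8_t  end_warp_id;    // End_warpID: last warp of the TB
    bool     valid;          // Valid: at most one partial TB at a time

    // Scheduler policy from the slide: the partial TB's remaining warps are
    // always dispatched before any warp of a new TB.
    bool has_pending_warps() const {
        return valid && start_warp_id <= end_warp_id;
    }
};
```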

(28) Summary
- Software tuning of TB size is possible, but it can adversely affect intra-TB sharing
- Power savings advantages: static power drops due to the reduced execution time, while the workload remains roughly constant (with a possible increase in contention at shared resources)
- The required hardware support is relatively modest

(29) D. Tarjan and K. Skadron, "On Demand Register Allocation and De-Allocation for a Multithreaded Processor" © Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

(30) Register File Fragmentation
- Temporal underutilization: registers remain allocated to completed warps of a TB until its last warp finishes
- Goal: how can we improve utilization?
[Figure: TBs and warp contexts, highlighting the last running warp and the idle warp contexts.]

(31) Goal
- Register allocation and de-allocation to increase utilization and/or lower power
- Dynamic, cycle-level allocation and de-allocation via register remapping (renaming)
- Increase performance for a fixed amount of register storage, or decrease register storage for a fixed level of performance
[Figure: banked MRF with arbiter and crossbar feeding the ALUs, L/S units, and SFUs.]

(32) Approach
For each thread:
- Allocate a register on its first write rather than at thread creation
- Release it as soon as possible: a just-in-time (JIT) hardware register allocator for multithreaded processors
- A spilling mechanism is needed to avoid deadlock (remember, this is a JIT allocator!)
- The basic idea is register renaming

(33) Overview
- Cycle-level dynamic allocation/de-allocation with recycling of register IDs
- A rename stage between decode and issue maps virtual register IDs to physical register IDs using a rename map, a free list, and an allocation check
- Performance and correctness must be maintained
[Figure: Decode → Rename → Issue pipeline; the rename map translates virtual register IDs to physical register IDs, backed by a free list and an allocation check.]
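A minimal sketch of this allocate-on-write renaming scheme in the spirit of Tarjan and Skadron; the structure, sizing, and method names below are illustrative, not the paper's implementation:

```cuda
#include <cstdint>
#include <vector>

// Hypothetical rename map with a free list: physical registers are
// allocated on the first write to a virtual register and recycled as soon
// as the value is dead.
struct RenameMap {
    std::vector<int16_t> phys;      // virtual reg ID -> physical reg ID (-1 = unmapped)
    std::vector<int16_t> freeList;  // recycled physical register IDs

    RenameMap(int numVirt, int numPhys) : phys(numVirt, -1) {
        for (int p = numPhys - 1; p >= 0; --p)
            freeList.push_back(int16_t(p));
    }

    // Allocation check + allocate-on-write: map v on its first write.
    // Returns -1 when no physical register is free (would trigger spilling).
    int writeReg(int v) {
        if (phys[v] < 0) {
            if (freeList.empty()) return -1;
            phys[v] = freeList.back();
            freeList.pop_back();
        }
        return phys[v];
    }

    // Release ASAP: recycle the physical register once the value is dead.
    void releaseReg(int v) {
        if (phys[v] >= 0) {
            freeList.push_back(phys[v]);
            phys[v] = -1;
        }
    }
};
```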

(34) Basic Steps
- Allocation
- De-allocation: in-order issue vs. out-of-order issue
- Register footprints and MRF size: size for the maximum footprint, with drowsy and power-gated register cell modes
- Dynamic spilling

(35) Dynamic Spilling
Strategies for spilling:
- Spill to memory: the virtual register ID (VID) is treated as an offset into a register spill area; adding it to a base address forms the memory address, and the spilled value then lives in the L1 D-cache ("register or cache?")
- Spill to local storage: spill from the MRF into a secondary register file
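For the spill-to-memory case, the address computation is just base-plus-offset; a minimal sketch, assuming one warp-wide 128-byte slot per virtual register (the slot size is an assumption of this sketch):

```cuda
#include <cstdint>

// Hypothetical spill-address computation: a spilled virtual register is
// kept in a per-SM spill area at base + VID * slot size, where one slot
// holds a warp-wide operand (32 lanes x 4 bytes = 128 bytes).
inline uint64_t spill_address(uint64_t spill_base, uint32_t vid) {
    constexpr uint32_t OPERAND_BYTES = 32 * 4;
    return spill_base + uint64_t(vid) * OPERAND_BYTES;
}
```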

(36) An Experiment
- Create TBs of 256 threads with 32 registers/thread; each TB then requires 8K registers
- On a GTX480 (32K registers per SM), the expected occupancy is 4 TBs per SM, but it is actually greater! Why?
From P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"
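You can run this style of experiment yourself through the CUDA occupancy API; a minimal sketch (the kernel is a placeholder, and its register count in practice is set by the compiler, e.g. with nvcc -maxrregcount=32):

```cuda
#include <cstdio>

// Placeholder kernel; compile with "nvcc -maxrregcount=32 occ.cu" to pin
// register usage near the 32 registers/thread used in the experiment.
__global__ void worker(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * i;
}

int main() {
    int numBlocks = 0;
    // Ask the runtime how many 256-thread TBs can be resident per SM given
    // the kernel's actual register and shared-memory footprint.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, worker, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("Resident thread blocks per SM: %d\n", numBlocks);
    // 256 threads x 32 registers = 8K registers/TB; with 32K registers per
    // SM the static bound is 4 TBs, so compare against the reported value.
    return 0;
}
```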

(37) Summary
- The net effect is to improve register file utilization
- Note that registers are recycled at a finer granularity than TB boundaries
- You can run experiments on NVIDIA parts to observe these effects (and speculate that something like this is happening)