Register File Organization

Register File Organization ©Sudhakar Yalamanchili unless otherwise noted

Objective
- Understand the organization of the large register files used in GPUs
- Identify the performance bottlenecks and opportunities for optimization in accessing the register file

Reading
- S. Liu et al., "Operand Collector Architecture," US Patent 7,834,881 (perspective of a lane)
- J. H. Choquette et al., "Methods and Apparatus for Source Operand Caching," US Patent 8,639,882 (perspective of instruction scheduling)
- B. Coon et al., "Tracking Register Usage During Multithreaded Processing Using a Scoreboard Having Separate Memory Regions and Storing Sequential Register Size Indicators," US Patent 7,434,032 B1, October 2008 (dependency perspective)
- T. Aamodt, W. Fung, and T. Rogers, "General-Purpose Graphics Processor Architectures," draft text, Sections 3.2, 3.3

Reading (cont.)
- S. Mittal, "A Survey of Techniques for Architecting and Managing GPU Register File," IEEE TPDS, January 2017 (Sections 1 & 2)
- GPGPU-Sim Manual: http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#Introduction

CPU vs. GPU Register Files

    CPU/Core                      GPU/SM
    Optimized for latency         Optimized for throughput
    Tens of threads               Tens of thousands of threads
    Small number of ports         Large number of ports (implied)
    Small relative to caches      Larger relative to caches

Register File vs. Cache (images: guru3d.com, extremetech.com)

    IBM Power8                                  NVIDIA Pascal
    12 cores                                    3584 cores/GPU
    8 KB of RF per core (+ other registers)    256 KB of RF per SM (14336 KB/GPU)
    96 KB/core L1, 512 KB/core L2,              4 MB L2
    8 MB/core L3

Register File Access: Recap
[Figure: SIMT pipeline (I-Fetch, Decode, I-Buffer with pending warps, Issue, RF access, D-Cache, Writeback). Sixteen single-ported register file banks (1024 bits wide) feed 32 lanes (ALUs, L/S, SFU) through a crossbar; an arbiter schedules bank accesses, operand collectors (OC) stage the operands for a warp, and dispatch units (DU) issue to the execution units.]

The SM Register File: Bandwidth
- NVIDIA Pascal: 256 KB/SM = 64K 32-bit registers
- Throughput-optimized design: a 32-lane warp instruction can demand up to 96 source operands/warp/cycle (3 operands x 32 lanes)
- Compare to an out-of-order CPU core
- Research question: area-efficient RFs
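
A quick sanity check of these numbers (a Python sketch; the 3-source-operand case, e.g. an FMA, is our assumption, not stated on the slide):

    regs_per_sm = 64 * 1024       # 64K architectural registers per SM
    bytes_per_reg = 4             # 32-bit registers
    print(regs_per_sm * bytes_per_reg // 1024, "KB/SM")   # -> 256 KB

    lanes = 32                    # one warp instruction executes on 32 lanes
    src_operands = 3              # worst case, e.g., FMA: d = a*b + c
    print(lanes * src_operands, "operands/warp/cycle")    # -> 96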

The SM Register File: Power*
- Constructed with fast (leaky) transistors
- Up to 20% of chip dynamic power consumption can be consumed by the RF
- Research question: how do you architect, organize, and manage RFs for energy efficiency?
*S. Mittal, "A Survey of Techniques for Architecting and Managing GPU Register File," IEEE TPDS, January 2017

Multi-ported Register Files
- Area and delay grow with the number of ports
- Instead, use multiple banks, each with a single read port and a single write port, to emulate multiple ports

Multiple Banks
[Figure: bank organization - each bank holds registers R0..R63, one 1024-bit row per register (32 bits x 32 lanes).]
- 1R/1W: a single read port and a single write port per bank
- Each access to a register bank produces the same named register for every lane, e.g., R0
- Multiple banks can be accessed concurrently
S. Mittal, "A Survey of Techniques for Architecting and Managing GPU Register File," IEEE TPDS, January 2017
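
A minimal sketch of this bank organization (class and method names are ours): one row of a bank is one architectural register replicated across all 32 lanes, so a single access to the bank's one read port returns 1024 bits.

    LANES = 32                     # one 32-bit value per lane -> 1024-bit row

    class Bank:
        """1R/1W bank: at most one read and one write per cycle."""
        def __init__(self, rows=64):                 # e.g., R0..R63
            self.rows = [[0] * LANES for _ in range(rows)]

        def read(self, reg):
            # One access yields the same named register for every lane.
            return self.rows[reg]

        def write(self, reg, values):
            self.rows[reg] = list(values)

    banks = [Bank() for _ in range(4)]               # accessed concurrently
    r0_all_lanes = banks[0].read(0)                  # R0 for lanes 0..31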

Thread Register Allocation
[Figure: four banks with fat allocation (each thread's registers striped across all banks) vs. thin allocation (each thread's registers confined to one bank).]
- Interpret "thread" as warp (in the patents)
- The operands (registers) of a thread can be mapped across register banks in different ways: thin, fat, and mixed; skewed allocation
- Goal: maximum bandwidth. How many cycles to access the operands of an instruction? For a three-operand instruction: 3 cycles with thin allocation, 1 cycle with fat (see the sketch below)
- How much interconnect are you willing to invest?
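
A sketch of the cycle-count claim (the mapping functions are illustrative; "thread" here is a warp, as above):

    from collections import Counter

    NUM_BANKS = 4

    def fat_bank(thread, reg):
        # Fat: a thread's registers are striped across all banks.
        return reg % NUM_BANKS

    def thin_bank(thread, reg):
        # Thin: all of a thread's registers live in a single bank.
        return thread % NUM_BANKS

    def cycles(regs, bank_of, thread=0):
        # A 1R bank serves one operand per cycle, so the access takes as
        # many cycles as the worst-case pile-up on a single bank.
        return max(Counter(bank_of(thread, r) for r in regs).values())

    three_operand = [1, 2, 3]
    print(cycles(three_operand, fat_bank))    # 1 cycle  (3 different banks)
    print(cycles(three_operand, thin_bank))   # 3 cycles (all in one bank)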

Allocation Example
[Figure: registers striped across four banks - e.g., w0:r0, w0:r4, w1:r0, w1:r4, and w2:r0 in Bank 0; w0:r5, w1:r1, and w2:r1 in Bank 1 - i.e., register r of every warp maps to bank r mod 4.]

Skewed Allocation: Example
Figure from T. Aamodt, W. Fung, and T. Rogers, "General-Purpose Graphics Processor Architectures," draft text
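
A sketch of the idea (the exact swizzle is an assumption; GPGPU-Sim-style descriptions use bank = (warp + reg) mod #banks): skewing shifts each warp's registers by its warp ID, so the same-named register of different warps lands in different banks.

    NUM_BANKS = 4

    def linear_bank(warp, reg):
        return reg % NUM_BANKS             # every warp's r0 -> bank 0

    def skewed_bank(warp, reg):
        return (warp + reg) % NUM_BANKS    # w0:r0 -> bank 0, w1:r0 -> bank 1

    for w in range(3):                     # r0 of warps 0..2:
        print(f"w{w}:r0  linear -> bank {linear_bank(w, 0)},"
              f"  skewed -> bank {skewed_bank(w, 0)}")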

Register File Access
- Why and when do bank conflicts occur?
- Operand access for a warp may take several cycles → we need a way to collect operands!
Figure from T. Aamodt, W. Fung, and T. Rogers, "General-Purpose Graphics Processor Architectures," draft text
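
A toy model of why collection is needed (a sketch; the request format and greedy policy are ours): each single-ported bank serves one read per cycle, so conflicting reads of one instruction dribble in over several cycles.

    def serve(requests):
        """Greedy schedule: per cycle, each bank grants one request;
        the rest wait for a later cycle."""
        cycles = []
        while requests:
            granted_banks, now, later = set(), [], []
            for warp, reg, bank in requests:
                if bank in granted_banks:
                    later.append((warp, reg, bank))
                else:
                    granted_banks.add(bank)
                    now.append((warp, reg, bank))
            cycles.append(now)
            requests = later
        return cycles

    # Linear mapping: w0's r0 and r4 both live in bank 0 -> conflict.
    reads = [("w0", "r0", 0), ("w0", "r4", 0), ("w0", "r1", 1)]
    for c, grants in enumerate(serve(reads)):
        print("cycle", c, grants)    # r4 arrives a cycle after r0 and r1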

Register File Access
[Figure: SIMT pipeline with the operand-collector stage between register read and execute.]
- Buffering for all operands of an instruction
- An operand collector is allocated at the register-read stage
- Multiple collectors → hide bank conflicts
Figure from T. Aamodt, W. Fung, and T. Rogers, "General-Purpose Graphics Processor Architectures," draft text

Bank Conflicts (1)
- Fat allocation: bank conflict between a read and a write
Figure from T. Aamodt, W. Fung, and T. Rogers, "General-Purpose Graphics Processor Architectures," draft text

Bank Conflicts (2)
- Skewed vs. non-skewed allocation
- With skewed allocation, bank conflicts remain within a warp but are reduced across warps (in the example, the accesses land in banks 0, 2, and 3)
- Improved concurrency
Figure from T. Aamodt, W. Fung, and T. Rogers, "General-Purpose Graphics Processor Architectures," draft text


Potential Hazards
- With multiple instructions in flight from the same warp, WAR hazards are possible when repeated bank conflicts delay an older instruction's reads
- Example: two instructions from the same warp resident in the OC at once
Figure from T. Aamodt, W. Fung, and T. Rogers, "General-Purpose Graphics Processor Architectures," draft text

Structural Hazards
- Structural hazards cause instruction replays: an instruction remains in the instruction buffer until it completes, and the scheduler switches to another warp rather than stall the pipeline (see the sketch below)
- Examples:
  - OC is full (analogous to a full ROB)
  - Load/store miss in the cache: the instruction is replayed once the data has been fetched
  - Register bank conflicts
  - Shared memory bank conflicts
  - Others
- The NVIDIA profiler provides statistics on instruction replays
- GPUs do not use stalls, given the number of instructions in flight
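
A sketch of the replay policy (the scheduler loop and data structures are ours): a warp whose oldest instruction hits a structural hazard is simply skipped this cycle, and its instruction stays in the i-buffer for a later replay.

    def issue_one(warps, oc_full):
        """One scheduler cycle: pick the first warp whose oldest
        instruction can issue; hazards cause a warp switch, not a stall."""
        for w in warps:
            if not w["ibuf"]:
                continue                     # nothing buffered for this warp
            if oc_full or w["bank_conflict"]:
                continue                     # hazard -> replay later, switch warp
            return w["ibuf"].pop(0)          # issue the oldest instruction
        return None                          # bubble: nothing issuable

    warps = [{"id": 12, "ibuf": ["i0"], "bank_conflict": True},
             {"id": 18, "ibuf": ["j0"], "bank_conflict": False}]
    print(issue_one(warps, oc_full=False))   # 'j0': warp 12 replays, 18 issues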

Bank and Register File Access Organization
- How do we organize per-thread RFs within the main register file (MRF)?
  - Maximize concurrency, minimize bank conflicts
- Cost grows multiplicatively with the number of ports → use lower-port-count memories to emulate high-port-count memories
- The access organization should support dynamic warp formation
- Goal: full-bandwidth access to the lanes

A Multi-ported Register File
[Figure: one register file with 32 x 3 read-operand ports feeding 32 scalar pipelines.]
- Expensive in area and energy - is it feasible in practice?

A Baseline Register File
[Figure: a wide register bank addressed by (Warp ID, Reg ID) through a single decoder; each 1024-bit row holds one register (R0, R1, ..., R7 for lane 0, etc.) for all lanes, with per-warp regions (Warp 0 RF, Warp 1 RF).]
- Wide register format; static assignment of threads to lanes
- A single decoder across the bank
- Comprised of 1R/1W banks

Register File: Warp Formation (1)
[Figure: sub-banks with separate decoders; the active threads of Warp 12 and Warp 18 are combined through a crossbar into a new warp feeding the scalar pipelines.]
- Key point: we need a separate decoder for each sub-bank
- Decouples the thread → RF assignment from the RF → lane assignment
- Need to maintain the thread ID → warp mapping (details later)
- Warps can be reconstituted in any form; bank conflicts can reduce performance
- Number of banks vs. number of lanes?

Register File: Warp Formation (2)
[Figure: the active threads of Warp 12 and Warp 18 merged into a new warp without a crossbar.]
- Warps can be re-formed, but must conform to the static lane assignments → we want full-bandwidth access to the lanes
- More on dynamic warp formation later

Register File: Warp Compaction
[Figure: Warp 12 compacted into a new warp through a crossbar.]
- Can enable sub-warp optimizations such as power gating, intra-chip communication optimizations (TBD), etc.
- Lane-aware compaction removes the need for the Xbar

The Operand Collector
- Register values for a warp may arrive at different times due to bank conflicts
- The OC stages the operands of a warp prior to issue (each entry: 32 operands, 128 bytes)
Figure from T. Aamodt, W. Fung, and T. Rogers, "General-Purpose Graphics Processor Architectures," draft text

Caching in the Operand Collector
- Operands for each warp are collected over time (bank conflicts)
- Operands could be shared across collectors → a cache?
Figure from T. Aamodt, W. Fung, and T. Rogers, "General-Purpose Graphics Processor Architectures," draft text

Example

    Op    Dest  Src1  Src2  Src3
    FMA   R1    R13   R11   R14
    Add   R2    R6    R7    R8
    Mul   R3    ...

- The capacity of the OC can reduce register file pressure
- Storing the operands of more instructions → sharing extends across instructions, e.g., to registers R11 and R13
- Coherence? → operands are copied into multiple locations
- Reuse depends on the connectivity with the lanes and the number of ports to the OC
From B. Coon et al., US Patent 7,434,032 B1, October 2008

Collecting Operands
[Figure: from the perspective of a single lane/ALU - four operand sets (0-3) feeding the ALU inputs through muxes.]
- Effectively operates as a cache; reuse is determined by the interconnect
- With independent mux control, operands can be reused across instructions
- No operand reuse across inputs without more wiring; an Xbar gives the most flexible reuse
From B. Coon et al., US Patent 7,434,032 B1, October 2008

Operand Caching
[Figure: a cache table mapping register numbers to operand sets 0-3 (operands Op0-Op3), with a result FIFO.]
- Entries are set by the dispatch unit (individual vs. common settings) and queried by the dispatch unit
- Register writes invalidate entries in the cache table
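
A sketch of such a cache table (the names are ours, loosely following the structure in the Coon et al. patent): the dispatch unit fills and queries slots, and a register write invalidates any slot holding that register.

    class CacheTable:
        """Maps operand-set slots to the registers they currently hold."""
        def __init__(self, num_slots=4):
            self.slot_reg = [None] * num_slots     # register# per slot

        def lookup(self, reg):
            for s, r in enumerate(self.slot_reg):
                if r == reg:
                    return s                       # hit: reuse the cached operand
            return None                            # miss: read from the MRF

        def fill(self, slot, reg):
            self.slot_reg[slot] = reg

        def invalidate(self, reg):
            # Called on register write-back to keep cached copies coherent.
            for s, r in enumerate(self.slot_reg):
                if r == reg:
                    self.slot_reg[s] = None

    ct = CacheTable()
    ct.fill(0, "R13")
    print(ct.lookup("R13"))   # 0 -> hit
    ct.invalidate("R13")      # R13 written: the cached copy is stale
    print(ct.lookup("R13"))   # None -> must re-read from the MRF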

Instruction Dispatch
[Figure: SIMT pipeline; the OC sits at the register-read stage.]
- An OC entry is allocated and initialized after decode: V | Reg | RDY | WID | Operand (128 bytes), plus the instruction
- Source operand requests are queued at the arbiter
- The operands/cycle an OC can receive is limited by the interconnect
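
The entry format above, written out as a structure (a sketch; the field names follow the slide, the rest is assumed):

    from dataclasses import dataclass, field

    @dataclass
    class OperandSlot:
        v: bool = False        # V:   slot valid (instruction needs it)
        reg: int = -1          # Reg: register ID to collect
        rdy: bool = False      # RDY: operand has arrived from its bank
        data: bytes = b""      # 128-byte operand (32 lanes x 4 bytes)

    @dataclass
    class CollectorEntry:
        wid: int               # WID: owning warp
        instr: str             # the decoded instruction
        slots: list = field(default_factory=lambda: [OperandSlot() for _ in range(3)])

        def ready_to_dispatch(self):
            # Dispatch only when every needed source operand is present.
            return all(s.rdy for s in self.slots if s.v)

    oc = CollectorEntry(wid=7, instr="fma r1, r13, r11, r14")
    for slot, r in zip(oc.slots, (13, 11, 14)):
        slot.v, slot.reg = True, r       # allocated and initialized at decode
    print(oc.ready_to_dispatch())        # False until all banks respond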

Pipeline Stage
[Figure: banks 0-15 → arbiter → 1024-bit crossbar → OCs → DUs → ALUs, L/S, SFU.]
- The OC entry is allocated and initialized after decode; source operand requests are queued at the arbiter
- The operands/cycle into an OC is limited by the interconnect

Instruction Dispatch
1. Decode to get the register IDs
2. Check the operand cache (cache table)
3. Set the crossbar and MRF bank indices
4. Update the cache table
5. Set the ALU-OC interconnect

Functional Unit Level Collection
[Figure: banks 0-15 → arbiter → 1024-bit crossbar → per-unit OCs and DUs for the ALUs, L/S, and SFU.]
- Operand collectors are associated with different functional unit types → naturally supports heterogeneity
- Dedicated vs. shared OC units; connectivity consequences
- Other sources of operands: constant cache, read-only cache

Instruction Perspective
- Use the scoreboard to minimize bank conflicts from pending writes
- OCs request read operands; the arbiter prioritizes writes over reads and schedules read requests for maximum bandwidth (a sketch follows)
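
A sketch of the arbiter policy described above (the request format and FIFO order within a class are assumptions):

    def arbitrate(writes, reads):
        """One cycle: each single-ported bank grants at most one request,
        and queued writes are considered before reads."""
        granted, busy_banks = [], set()
        for req in list(writes) + list(reads):      # writes first
            if req["bank"] not in busy_banks:
                busy_banks.add(req["bank"])
                granted.append(req)
        return granted

    writes = [{"op": "W", "bank": 3, "reg": "w0:r7"}]
    reads  = [{"op": "R", "bank": 3, "reg": "w1:r3"},   # loses to the write
              {"op": "R", "bank": 5, "reg": "w1:r5"}]   # granted
    print(arbitrate(writes, reads))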

Register File Access: Coherency
- What happens when a new value is written back to the Main Register File (MRF)? The corresponding OC values must be invalidated
- An example: http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#Register_Access_and_the_Operand_Collector

Scheduling and Allocating OCs
[Figure: decoder indexed by (Warp ID, Reg ID), a pool of OCs, an interconnect, DUs, and EUs - with open design questions at each stage.]
- Scheduling of OCs - what policy?
- Mitigate bank conflicts
- Support for heterogeneous architectures?

The Operand Collector: Summary
- Buffers warp-sized operands at the collector
- Shares operands across instructions; operates as an operand cache for "collecting" operands across bank conflicts
- Simplifies scheduling of accesses to the MRF
[Figure: an example block-diagram flow of events.]

Project Ideas
Goals:
- Reduce MRF bank conflicts
- Allocation policies across warps
- Hierarchical register files - implement from papers
Possible methodologies:
- Get instruction traces from GPGPU-Sim or Harmonica
- Implement a simple register file analysis:
  - What is the degree of actual sharing taking place across registers?
  - What do the register dependency chains look like?
  - What is the headroom for reducing bank conflicts?
  - What is the impact of the relationship between #banks and #lanes?

Summary: Register File
- Register file management and operand dispatch have multiple interacting components
- Performance/complexity tradeoff: increasing concurrency requires increasing interconnect complexity
- Stalls/conflicts require buffering and bypassing to keep the execution units utilized
- Good register file allocation is critical

The Scoreboard
[Figure: SIMT pipeline; the scoreboard acts at the issue and register-read steps.]
- The scoreboard is implemented at issue → read-operand
- It must be adapted to the high register count of SIMT/GPUs

Review: Generic Scoreboard
The classic scoreboard comprises several data structures:
- Instruction status table
- Functional unit status table
- Result (register) status table
What are the challenges in a straightforward adaptation of these CPU concepts?

Scoreboard Operation
[Figure: pipeline IFU → IDU (ID1, ID2) → FUs → MEM → WB; instruction reordering occurs at the FUs.]
- The read-operand stage is separated from the issue logic
- Issue: check for WAW and structural hazards
- Read operands: check for RAW hazards
- Writeback: check for WAR hazards
- The issue logic tracks register dependencies and structural dependencies

Instruction Status Table
Tracks which stage of the execution process each instruction is currently in:
- issue? - has the instruction issued?
- rdopd? - has it completed reading its operands?
- exec? - has it completed execution?
- wrback? - has it completed its writeback?
Impact on warp-level tracking: note that we still track a single instruction stream per warp.

Functional Unit Status Table
One entry per functional unit, with nine fields:
- busy? - is the functional unit busy?
- op - the kind of operation being performed
- dest (Fi) - the destination register
- src1, src2 (Fj, Fk) - the two source registers
- func1, func2 (Qj, Qk) - the functional units producing the two source values
- ready1?, ready2? (Rj, Rk) - are src1 and src2 ready?

    Name     Busy  Op    Fi  Fj  Fk  Qj  Qk  Rj  Rk
    Integer  Yes   Load  F2      R3              No
    Add      ...   Sub   F8  F6  ...

Result Status Table
- Maintains an entry per register indicating the functional unit that will write the pending result into that register, e.g.:

    Register:  F0     F2       F4  F6  F8   F10     F12  ...
    FU:        Mult1  Integer          Add  Divide

- Impact on SIMT/GPUs → a very large number of registers to track
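
The tables above, sketched as data structures together with the issue-time checks they enable (a classic-scoreboard sketch, not GPU-specific; the names follow the tables above):

    from dataclasses import dataclass

    @dataclass
    class FUStatus:                        # functional unit status entry
        name: str
        busy: bool = False
        op: str = ""
        fi: str = ""                       # destination register
        fj: str = ""; fk: str = ""         # source registers
        qj: str = ""; qk: str = ""         # producing FUs ("" = value ready)
        rj: bool = True; rk: bool = True   # sources ready?

    result_status = {}                     # register -> FU that will write it

    def can_issue(fu, dest):
        # Structural hazard: unit busy.  WAW hazard: dest already claimed.
        return not fu.busy and dest not in result_status

    def issue(fu, op, dest, src1, src2):
        fu.busy, fu.op, fu.fi, fu.fj, fu.fk = True, op, dest, src1, src2
        fu.qj = result_status.get(src1, "")       # RAW: wait on producers
        fu.qk = result_status.get(src2, "")
        fu.rj, fu.rk = fu.qj == "", fu.qk == ""
        result_status[dest] = fu.name             # claim the destination

    integer = FUStatus("Integer")
    if can_issue(integer, "F2"):
        issue(integer, "Load", "F2", "R3", "")
    print(result_status)                          # {'F2': 'Integer'}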

Challenges
- Scoreboard size: tracking register dependencies for all registers → a very large number of registers
- Query bandwidth: checking the scoreboard structures requires high query bandwidth because of the number of registers

Adaptation to SIMT Processors
[Figure: Fetch & I-cache → Decode → scoreboard memory and scoreboard processing → instruction buffer → Issue → pipeline; the pipeline returns a completion signal, and pipeline configuration signals go to the register file.]
From B. Coon et al., US Patent 7,434,032 B1, October 2008

Scoreboard Memory Data Structure
[Figure: per-warp (warp 0 .. warp M) lists of register IDs with pending writes.]
- Each warp tracks a fixed number of dependent registers (from issued instructions)
- Limit the number of tracked dependencies rather than provisioning storage for all possible dependencies

Scoreboard Memory Data Structure
[Figure: the source and destination register IDs from the issue logic are compared in parallel against the warp's tracked registers; the 0/1 comparator outputs form a dependency mask sent to the issue-logic buffer.]
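
A sketch of that comparison (the per-warp tracked set and the mask ordering are assumptions):

    def dependency_mask(instr_regs, pending_writes):
        """Compare the instruction's source/destination register IDs
        against the warp's scoreboard entries (registers with writes
        in flight); each comparator contributes one mask bit."""
        return [1 if r in pending_writes else 0 for r in instr_regs]

    pending = {"r3", "r7"}                                # this warp's entries
    print(dependency_mask(["r1", "r3", "r8"], pending))   # [0, 1, 0]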

Issue Logic
[Figure: instruction buffer entries (V, instruction, warp ID, R), each carrying a dependency mask, e.g., 0 0 1 1.]
- The instruction buffer is updated with a dependency mask for each instruction
- Additional issue constraint: issue an instruction only when its dependency mask is all 0s
- This can limit parallelism when there are a large number of dependencies
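
A sketch of the issue check, and of clearing mask bits when a write retires (the i-buffer entry layout is assumed):

    def issuable(entry):
        # Extra issue constraint: every dependency-mask bit must be 0.
        return entry["valid"] and not any(entry["mask"])

    def on_writeback(ibuffer, warp, reg):
        # A retiring write clears the matching mask bit in every
        # buffered instruction of that warp.
        for e in ibuffer:
            if e["wid"] == warp:
                e["mask"] = [0 if r == reg else m
                             for r, m in zip(e["regs"], e["mask"])]

    e = {"valid": True, "wid": 1, "regs": ["r1", "r3"], "mask": [0, 1]}
    print(issuable(e))                   # False: pending write to r3
    on_writeback([e], warp=1, reg="r3")
    print(issuable(e))                   # True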

Operation
1. Decode and get the register operand IDs
2. Are any register values cached in the OC? Load uncached register values into the OC
3. Update the cache table
4. Dispatch/execute
5. At the commit stage, update the OC/issue/cache data structures for registers that have been written
6. Recompute the warp dependency mask and update the scoreboard data structures

Summary
- Buffering at the OC decouples MRF accesses/conflicts from dispatch
- A throughput-optimized design, not a latency-optimized one
- The OC and scoreboard are organized for in-order cores, but with a very large number of registers, threads, and units
- Trades tracked dependencies for concurrency