Register Renaming & Value Prediction
Overview ► Need for Post-RISC ► Register Renaming vs. Allocation Strategies ► How to compile for Post-RISC machines ► Dynamic Register Renaming through Virtual-Physical Registers
Software Outlives Hardware ► How to make old software run faster? Faster CPU clock and memory hierarchy; adapt CPUs to actual software (profiling/tuning); more instructions per cycle ► Today’s software will run on tomorrow’s CPUs: need to keep software interface stable; more functional units and registers
Compile-time vs. Run-time ► Little is known about software at compile-time ► Space/time trade-offs Memory speeds cannot keep up with CPU speeds When to apply optimizations that increase code-size
Solutions ► New scalable architecture (IA-64): decouple physical/virtual registers using register windows; more explicit parallelism allows for more functional units; explicit speculative instructions ► Post-RISC architecture: remove limits in super-scalar implementation of existing architectures; extract even more parallelism out of existing software
True Data Dependencies ► Also called read-after-write (RAW) hazards ► An instruction may use a result produced by the previous instruction: the two instructions cannot execute simultaneously in multiple pipelines, and the second instruction must typically be stalled.
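The three register dependency kinds can be told apart mechanically. The following is a minimal sketch, not taken from the slides: the `Instr` tuple and the `hazards` function are hypothetical names introduced for illustration, and memory operands are ignored.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    dest: Optional[str]        # register written, None for a store
    srcs: Tuple[str, ...]      # registers read

def hazards(first, second):
    """Return the dependency kinds that force `second` to stay behind `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")       # true dependency: second reads first's result
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")       # anti-dependency: second overwrites an input of first
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")       # output dependency: both write the same register
    return found

load = Instr("rA", ())           # rA := Mem1
square = Instr("rA", ("rA",))    # rA := rA * rA
store = Instr(None, ("rA",))     # Mem2 := rA
reload = Instr("rA", ())         # rA := Mem3

print(sorted(hazards(load, square)))
print(sorted(hazards(store, reload)))
```

Only RAW reflects a real flow of data. WAR and WAW come from reusing a register name, which is why renaming can remove them.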
Structural Dependencies ► Stalls result in less-than-optimal performance: we may have single-issue cycles, which process only a single instruction; worse, we may have zero-issue cycles, which initiate no new instructions. ► Data dependencies can also limit performance for a scalar machine: two-cycle memory load/write; intra-instruction dependencies
Scheduling ► Scheduling can remove stalls ► Intra-instruction dependencies cannot be removed by scheduling (CISC)
Need for Post-RISC ► Super-scalar has diminishing returns in IPC (Instructions Per Cycle) 2-Way (85%) 4-Way 2.6 (65%) 8-Way ??? ► More parallelism needed ► Look beyond set of 4 instructions
Post-RISC characteristics ► Out-of-order execution (Existed 20 years ago on IBM and CDC) Innovative for single-chip Branch history bits ► Precise interrupts ► Fetch/Flow Prediction ► More caching Instruction cache becomes CPU scratch space ► Register renaming First in IBM 360/91 FPU
Specint92 Trends ► Specint92 numbers are increasing DEC has historically been the champ ► Specint92/Clock rates DEC low => /95) IBM strong early => /93) HP /95)
The Post-RISC Architecture
Post-RISC CPUs ► Traditional RISC DEC Alpha Sun UltraSPARC-1 ► (partially) Post-RISC PowerPC 604 MIPS R10000 HP PA-8000 Intel Pentium Pro DEC Alpha HAL SPARC64
Automatic Register Renaming ► Every R-write allocates a new R ► The register name A is an alias for the last R allocated by a write to A ► An instruction reading and writing a register allocates a new R too
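The aliasing rule can be sketched in a few lines. This is an illustrative model only: the `rename` function and the `(dest, srcs)` tuple encoding are assumptions made here, and the supply of R registers is treated as unlimited.

```python
import itertools

def rename(program):
    """program: list of (dest, srcs) using logical register names.
    Returns the same program with every write given a fresh R register."""
    fresh = (f"R{i}" for i in itertools.count())
    alias = {}                 # logical name -> last R allocated by a write to it
    renamed = []
    for dest, srcs in program:
        # Look up sources first, so "rA := rA * rA" reads the old R.
        # A register that was never written keeps its logical name.
        new_srcs = tuple(alias.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)     # every write allocates a new R
            alias[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed

program = [
    ("rA", ()),        # rA := Mem1
    ("rA", ("rA",)),   # rA := rA * rA
    (None, ("rA",)),   # Mem2 := rA
    ("rA", ()),        # rA := Mem3
    ("rA", ("rA",)),   # rA := rA + 1
    (None, ("rA",)),   # Mem4 := rA
]
for instr in rename(program):
    print(instr)
```

One named register turns into four R instances, and the second half of the program no longer shares any register with the first half.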
Advantages over More ISA Registers ► Smaller instructions ► Allow same software to run on range of implementations Compare the same program running on Pentium or AMD Ath ► Less state to save Faster function calls Faster context switches Life-times can be optimized
Renaming Implementation ► Rename Storage Locations Reorder Buffer Physical Register File ► Similarities: Allocate at decode Release at commit
Renaming using Reorder buffer ► Results are kept in reorder buffer ► Source operands are read either from the register file, or a reorder buffer entry ► Not-yet-ready results are forwarded to instruction queue ► Used by Intel Pentium III, PowerPC 604, SPARC64
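A rough sketch of reorder-buffer renaming follows, assuming a simplified model in which every instruction writes one register. The class and method names are hypothetical, and the default size of 40 entries is only borrowed from the Pentium III figure quoted in these slides.

```python
from collections import deque

class ReorderBufferRenamer:
    def __init__(self, size=40):
        self.size = size
        self.regfile = {}      # committed (architectural) register values
        self.latest = {}       # logical register -> newest in-flight entry writing it
        self.rob = deque()     # in-flight entries, oldest first
        self.next_tag = 0

    def decode(self, dest):
        """Allocate an entry at decode for an instruction writing `dest`."""
        if len(self.rob) >= self.size:
            raise RuntimeError("reorder buffer full: decode must stall")
        entry = {"tag": self.next_tag, "dest": dest, "value": None, "done": False}
        self.next_tag += 1
        self.rob.append(entry)
        self.latest[dest] = entry
        return entry["tag"]

    def read(self, src):
        """Read a source operand from the register file or a reorder buffer entry."""
        entry = self.latest.get(src)
        if entry is None:
            return ("value", self.regfile.get(src, 0))
        if entry["done"]:
            return ("value", entry["value"])
        return ("wait", entry["tag"])      # result will be forwarded later

    def complete(self, tag, value):
        for entry in self.rob:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(tag)

    def commit(self):
        """Retire the oldest entry in program order and release it."""
        if not self.rob or not self.rob[0]["done"]:
            return False
        entry = self.rob.popleft()
        self.regfile[entry["dest"]] = entry["value"]
        if self.latest.get(entry["dest"]) is entry:
            del self.latest[entry["dest"]]
        return True
```

A consumer that receives `("wait", tag)` would sit in the instruction queue until the producer with that tag completes.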
Renaming on Pentium III ► Most registers can be renamed (general-purpose, floating-point, status) ► Renaming uses a set of 40 reorder buffer entries; FPU control/status cannot be renamed; max 2 renamings per instruction
Register Allocation Example
► Minimal number of named registers
► Scheduling is limited
► Strictly serial execution
    rA := Mem1;
    rA := rA * rA;
    Mem2 := rA;
    rA := Mem3;
    rA := rA + 1;
    Mem4 := rA;

    Mem2 := Mem1 * Mem1;
    Mem4 := Mem3 + 1;
Renaming using Physical Register File ► Register file contains more registers than defined in ISA (logical registers) ► Map logical registers to physical registers during decode ► Operands are always read from the physical register file ► Used by MIPS R10000 and DEC 21264
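The map-table approach can be sketched as below. This is a simplified model under stated assumptions: the names are hypothetical, each instruction writes exactly one register, and the register released at commit is the one that previously held the same logical destination.

```python
class PhysicalRegisterRenamer:
    def __init__(self, logical_regs, num_physical):
        assert num_physical > len(logical_regs)
        # Initially logical register i lives in physical register i.
        self.map = {r: i for i, r in enumerate(logical_regs)}
        self.free = list(range(len(logical_regs), num_physical))
        self.previous = []     # old mapping of each in-flight destination, oldest first

    def decode(self, dest, srcs):
        """Map sources, then allocate a new physical register for `dest`."""
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: decode must stall")
        new = self.free.pop(0)
        self.previous.append(self.map[dest])
        self.map[dest] = new
        return new, phys_srcs

    def commit(self):
        """Oldest instruction commits: the previous mapping can be reused."""
        self.free.append(self.previous.pop(0))

p = PhysicalRegisterRenamer(["rA", "rB"], num_physical=4)
print(p.decode("rA", []))        # rA := Mem1
print(p.decode("rA", ["rA"]))    # rA := rA * rA
```

With only four physical registers the third write would stall until a commit returns a register to the free list, which is the pressure the virtual-physical scheme tries to reduce.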
Virtual-Physical Registers
► Motivation: better utilization of physical registers; important in presence of long-latency instructions
► Conventional scheme “wastes” a register for each: decoded instruction that has not finished execution; committed instruction whose result is dead (can be eliminated by maintaining a reference counter)
Example:
    load f2, 0(r6)
    fdiv f2, f2, f10
    fmul f2, f2, f12
    fadd f2, f2, 1
Virtual-Physical Register Renaming ► General Map Table Indexed by logical register L VP register: last virtual-physical register that L has been mapped to P register: Last physical register that L and VP have been mapped to V-bit: indicates whether P is valid ► Physical Map Table Has entry for each VP Contains last physical register that VP has been mapped to
Functional Description ► For each logical source register S do a GMT lookup If V-bit is set, rename S to P Otherwise, rename S to VP ► Rename the logical destination register to a new VP ► Update GMT: set VP to new mapping and reset V ► Save previous VP in reorder buffer to be able to roll back
Functional Description ► Instruction Queue Fields: Operation code Destination VP Source operands Ready-bits for source operands: when ready Source operand contains a physical register number ► Reorder Buffer Entry Destination logical register Completion bit VP mapping of last instruction with same logical destination
Functional Description
► When source operands are ready, instruction is issued
► When instruction completes:
    A new physical register R is allocated for the result
    PMT is updated to reflect the new mapping
    VP number of destination is broadcast to all entries in the instruction queue, together with the physical register identifier
    GMT is updated: the entry corresponding to the logical destination is checked for a match with the VP and, if so, the physical register number is copied to the P register field and the V flag is set
    As a result, a new instruction using the same logical register will find the corresponding physical register in the GMT
    Lastly, the C flag of the entry in the reorder buffer is set
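The rename and completion steps described on the Functional Description slides can be sketched as follows. This is an illustrative model, not the authors' implementation: the class, field names, and dictionary layout are assumptions, issue and execution are folded into `complete`, and freeing physical registers at commit is left out.

```python
import itertools

class VirtualPhysicalRenamer:
    def __init__(self, logical_regs, num_physical):
        # Assumption: each logical register starts mapped to a valid physical register.
        self.gmt = {r: {"vp": None, "p": i, "v": True}
                    for i, r in enumerate(logical_regs)}
        self.pmt = {}                      # VP -> physical register
        self.free_phys = list(range(len(logical_regs), num_physical))
        self.vp_numbers = itertools.count()
        self.queue = []                    # instruction queue

    def rename(self, op, dest, srcs):
        operands = []
        for s in srcs:                     # GMT lookup per logical source
            e = self.gmt[s]
            if e["v"]:
                operands.append({"reg": e["p"], "ready": True})
            else:
                operands.append({"reg": e["vp"], "ready": False})
        e = self.gmt[dest]
        # Reorder buffer entry keeps the previous VP so a rollback can restore it.
        rob_entry = {"dest": dest, "prev_vp": e["vp"], "c": False}
        vp = next(self.vp_numbers)         # destination gets a new VP, no physical register yet
        e["vp"], e["v"] = vp, False
        instr = {"op": op, "dest_vp": vp, "srcs": operands, "rob": rob_entry}
        self.queue.append(instr)
        return instr

    def complete(self, instr):
        assert all(o["ready"] for o in instr["srcs"]), "operands not ready"
        self.queue.remove(instr)
        vp = instr["dest_vp"]
        p = self.free_phys.pop(0)          # physical register allocated only now
        self.pmt[vp] = p
        for waiting in self.queue:         # broadcast (VP, P) to the instruction queue
            for o in waiting["srcs"]:
                if not o["ready"] and o["reg"] == vp:
                    o["reg"], o["ready"] = p, True
        e = self.gmt[instr["rob"]["dest"]]
        if e["vp"] == vp:                  # still the newest mapping of that logical register
            e["p"], e["v"] = p, True
        instr["rob"]["c"] = True

r = VirtualPhysicalRenamer(["f2", "f10", "r6"], num_physical=5)
load = r.rename("load", "f2", ["r6"])
fdiv = r.rename("fdiv", "f2", ["f2", "f10"])
print(r.free_phys)       # both instructions renamed, no physical register consumed yet
r.complete(load)
print(fdiv["srcs"])
```

In this model the long-latency `fdiv` holds only a VP tag while it waits, so the physical register file is not occupied by results that do not exist yet.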
Register Allocation Example
► Uses more named registers
► Scheduling more effective
► 2-way super-scalar execution
    rA := Mem1;        rB := Mem3;
    rA := rA * rA;     rB := rB + 1;
    Mem2 := rA;        Mem4 := rB;

    Mem2 := Mem1 * Mem1;
    Mem4 := Mem3 + 1;
Effect of Register Renaming
► Schedule uses 4 hardware registers
► 2-way super-scalar execution
    rA1 := Mem1;         rB1 := Mem3;
    rA2 := rA1 * rA1;    rB2 := rB1 + 1;
    Mem2 := rA2;         Mem4 := rB2;
Effect of Register Renaming
► Schedule uses 4 hardware registers
► Can hide memory-write latency
► Still no full use of multiple pipelines
    rA1 := Mem1;
    rA2 := rA1 * rA1;
    Mem2 := rA2;
    rA3 := Mem3;
    rA4 := rA3 + 1;
    Mem4 := rA4;
Renaming and O-O-O execution ► Instructions wait for: Availability of execution unit Input dependencies Older instructions have priority Load instructions have priority ► Instructions do NOT wait for: Program order Branch resolution Output dependencies (use “rename register”)
Renaming and O-O-O execution
► Schedule uses 4 hardware registers
► Can hide memory-write latency
► “Bad” schedule uses both pipelines
► Only one register name used
    rA1 := Mem1;
    rA2 := rA1 * rA1;
    Mem2 := rA2;
    rA3 := Mem3;
    rA4 := rA3 + 1;
    Mem4 := rA4;
Renaming aware scheduling? ► Use Register Renaming in allocator minimal number of named registers maximal number of register instances ► Do not do scheduling that CPU can do over-scheduling can be worse than no scheduling at all