Register Renaming & Value Prediction. Overview ► Need for Post-RISC ► Register Renaming vs. Allocation Strategies ► How to compile for Post-RISC machines.


Register Renaming & Value Prediction

Overview ► Need for Post-RISC ► Register Renaming vs. Allocation Strategies ► How to compile for Post-RISC machines ► Dynamic Register Renaming through Virtual-Physical Registers

Software Outlives Hardware ► How to make old software run faster? Faster CPU clock and memory hierarchy; adapt CPUs to actual software (profiling/tuning); more instructions per cycle ► Today’s software will run on tomorrow’s CPUs: need to keep the software interface stable; more functional units and registers

Compile-time vs. Run-time ► Little is known about software at compile-time ► Space/time trade-offs: memory speeds cannot keep up with CPU speeds; when to apply optimizations that increase code size?

Solutions ► New scalable architecture (IA-64): decouple physical/virtual registers using register windows; more explicit parallelism allows for more functional units; explicit speculative instructions ► Post-RISC architecture: remove limits in super-scalar implementations of existing architectures; extract even more parallelism out of existing software

True Data Dependencies ► Also called read-after-write (RAW) hazards; anti- (write-after-read) and output (write-after-write) dependencies are separate name dependencies ► An instruction may use a result produced by the previous instruction: the two instructions cannot execute simultaneously in multiple pipelines, and the second instruction must typically be stalled.

Structural Dependencies ► Stalls result in less than optimal performance - We may have single-issue cycles, which process only a single instruction. - Worse, we may have zero-issue cycles, which initiate no new instructions. ► Data dependencies can also limit performance for a scalar machine: two-cycle memory load/write; intra-instruction dependencies

Scheduling ► Scheduling can remove stalls ► Intra-instruction dependencies cannot be removed by scheduling (CISC)

Need for Post-RISC ► Super-scalar has diminishing returns in IPC (instructions per cycle): 2-way → (85%); 4-way → 2.6 (65%); 8-way → ??? ► More parallelism needed ► Look beyond a set of 4 instructions

Post-RISC characteristics ► Out-of-order execution (Existed 20 years ago on IBM and CDC) Innovative for single-chip Branch history bits ► Precise interrupts ► Fetch/Flow Prediction ► More caching Instruction cache becomes CPU scratch space ► Register renaming First in IBM 360/91 FPU

Specint92 Trends ► Specint92 numbers are increasing DEC has historically been the champ ► Specint92/Clock rates DEC low => /95) IBM strong early => /93) HP /95)

The Post-RISC Architecture

Post-RISC CPUs ► Traditional RISC: DEC Alpha, Sun UltraSPARC-1 ► (Partially) Post-RISC: PowerPC 604, MIPS R10000, HP PA-8000, Intel Pentium Pro, DEC Alpha, HAL SPARC64

Automatic Register Renaming ► Every write to a register allocates a new R ► The register name A is an alias for the last R allocated by a write to A ► An instruction that both reads and writes a register allocates a new R too
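The three rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not any particular CPU's mechanism: the function name `rename`, the `(dest, sources)` tuple encoding, and the `R1`, `R2`, ... naming are assumptions made for the example.

```python
from itertools import count

def rename(instructions):
    """Rename (dest, sources) pairs so that every write gets a fresh R."""
    fresh = count(1)   # hypothetical supply of new R numbers
    alias = {}         # architectural name -> last R allocated for it
    renamed = []
    for dest, sources in instructions:
        # Reads use whichever R the name currently aliases; names that were
        # never written (here: memory operands) pass through unchanged.
        srcs = tuple(alias.get(s, s) for s in sources)
        # Every write allocates a new R, even when dest is also a source.
        alias[dest] = f"R{next(fresh)}"
        renamed.append((alias[dest], srcs))
    return renamed

# A := Mem1; A := A * A; A := Mem3; A := A + 1 (operators omitted)
program = [("A", ("Mem1",)), ("A", ("A", "A")), ("A", ("Mem3",)), ("A", ("A",))]
renamed_program = rename(program)
```

Each of the four writes to A ends up in its own R, so the second pair of instructions no longer has to wait for the first pair to finish with A.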

Advantages over More ISA Registers ► Smaller instructions ► Allow same software to run on range of implementations Compare the same program running on Pentium or AMD Ath ► Less state to save Faster function calls Faster context switches Life-times can be optimized

Renaming Implementation ► Rename storage locations: reorder buffer or physical register file ► Similarities: allocate at decode; release at commit

Renaming using Reorder buffer ► Results are kept in reorder buffer ► Source operands are read either from the register file, or a reorder buffer entry ► Not-yet-ready results are forwarded to instruction queue ► Used by Intel Pentium III, PowerPC 604, SPARC64
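A minimal Python sketch of reorder-buffer renaming follows. It is a simplified model under stated assumptions, not the design of any of the CPUs listed: the class `ReorderBufferRenamer`, its method names, and the dictionary-based entries are all hypothetical. A source operand is either a value (from the register file or a finished reorder buffer entry) or a tag to wait on.

```python
from collections import deque
from itertools import count

class ReorderBufferRenamer:
    def __init__(self, register_file):
        self.register_file = dict(register_file)  # committed values
        self.rob = deque()                        # entries in program order
        self.latest = {}                          # register -> newest pending entry
        self.tags = count()

    def decode(self, dest, sources):
        operands = []
        for name in sources:
            entry = self.latest.get(name)
            if entry is None:                     # no pending writer: register file
                operands.append(("value", self.register_file[name]))
            elif entry["done"]:                   # result already in the buffer
                operands.append(("value", entry["value"]))
            else:                                 # not ready: wait for this tag
                operands.append(("wait", entry["tag"]))
        entry = {"tag": next(self.tags), "dest": dest, "value": None, "done": False}
        self.rob.append(entry)                    # allocate at decode
        self.latest[dest] = entry
        return entry["tag"], operands

    def complete(self, tag, value):
        for entry in self.rob:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True

    def commit(self):
        # Release at commit, strictly in program order.
        while self.rob and self.rob[0]["done"]:
            entry = self.rob.popleft()
            self.register_file[entry["dest"]] = entry["value"]
            if self.latest.get(entry["dest"]) is entry:
                del self.latest[entry["dest"]]

renamer = ReorderBufferRenamer({"rA": 5, "rB": 7})
tag0, operands0 = renamer.decode("rA", ["rB"])   # reads rB from the register file
tag1, operands1 = renamer.decode("rB", ["rA"])   # must wait for tag0
```

If the second instruction completes first, `commit()` leaves the register file untouched until the older entry is done as well, which is what keeps the architectural state in program order.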

Renaming on Pentium III ► All registers can be renamed (generic, floating-point, status) ► Renaming uses a reorder buffer with 40 entries; FPU control/status cannot be renamed; max 2 renamings per instruction

Register Allocation Example ► Minimal number of named registers ► Scheduling is limited ► Strictly serial execution ► Source statements: Mem2 := Mem1 * Mem1; Mem4 := Mem3 + 1; ► Generated code: rA := Mem1; rA := rA * rA; Mem2 := rA; rA := Mem3; rA := rA + 1; Mem4 := rA;

Renaming using Physical Register File ► Register file contains more registers than defined in the ISA (logical registers) ► Map logical registers to physical registers during decode ► Operands are always read from the physical register file ► Used by MIPS R10000 and DEC 21264
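The decode-time mapping can be sketched with a map table and a free list. This is a hedged illustration, not the R10000 or 21264 design: the class `PhysicalRegisterRenamer`, the identity initial mapping, and the rule of freeing the previous mapping when the overwriting instruction commits are assumptions chosen for the example.

```python
from collections import deque

class PhysicalRegisterRenamer:
    def __init__(self, logical_names, num_physical):
        assert num_physical > len(logical_names)
        # Assumed initial state: logical register i lives in physical register i.
        self.map_table = {name: i for i, name in enumerate(logical_names)}
        self.free_list = deque(range(len(logical_names), num_physical))

    def decode(self, dest, sources):
        # Sources are translated first, so an instruction that reads and
        # writes the same logical register reads the old physical register.
        srcs = [self.map_table[s] for s in sources]
        if not self.free_list:
            raise RuntimeError("decode stalls: no free physical register")
        previous = self.map_table[dest]
        new = self.free_list.popleft()            # allocate at decode
        self.map_table[dest] = new
        return new, srcs, previous

    def commit(self, previous):
        # The old mapping is dead once the overwriting instruction commits.
        self.free_list.append(previous)           # release at commit

regs = PhysicalRegisterRenamer(["r1", "r2"], num_physical=4)
first = regs.decode("r1", ["r1", "r2"])    # r1 := r1 op r2
second = regs.decode("r1", ["r1"])         # r1 := op r1
```

After these two decodes the free list is empty, so a third write would stall at decode until an earlier instruction commits and returns its previous physical register.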

Virtual-Physical Registers ► Motivation: better utilization of physical registers; important in the presence of long-latency instructions ► Conventional scheme “wastes” a register for each: decoded instruction that has not finished execution; committed instruction whose result is dead - Can be eliminated by maintaining a reference counter ► Example: load f2,0(r6); fdiv f2,f2,f10; fmul f2,f2,f12; fadd f2,f2,1

Virtual-Physical Register Renaming ► General Map Table (GMT): indexed by logical register L; VP register: last virtual-physical register that L has been mapped to; P register: last physical register that L and VP have been mapped to; V-bit: indicates whether P is valid ► Physical Map Table (PMT): has an entry for each VP; contains the last physical register that VP has been mapped to

Functional Description ► For each logical source register S do a GMT lookup If V-bit is set, rename S to P Otherwise, rename S to VP ► Rename the logical destination register to a new VP ► Update GMT: set VP to new mapping and reset V ► Save previous VP in reorder buffer to be able to roll back

Functional Description ► Instruction queue fields: operation code; destination VP; source operands; ready-bits for source operands (when ready, the source operand contains a physical register number) ► Reorder buffer entry: destination logical register; completion bit; VP mapping of the last instruction with the same logical destination

Functional Description ► When source operands are ready, the instruction is issued ► When the instruction completes: a new physical register R is allocated for the result; the PMT is updated to reflect the new mapping; the VP number of the destination is broadcast, together with the physical register identifier, to all entries in the instruction queue; the GMT is updated: the entry corresponding to the logical destination is checked for a match with the VP and, if it matches, the physical register number is copied to the P register field and the V flag is set (as a result, a new instruction using the same logical register will find the corresponding physical register in the GMT); lastly, the C flag of the entry in the reorder buffer is set
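The rename and completion steps described on these slides can be sketched as follows. This is a simplified model under stated assumptions: the class `VirtualPhysicalRenamer` and its method names are hypothetical, integer and floating-point registers share one pool, every logical register starts with a valid mapping, and physical registers are never freed. The instruction queue and reorder buffer are left out; `rename` only returns what would be stored in them.

```python
from itertools import count

class VirtualPhysicalRenamer:
    def __init__(self, logical_names, num_physical):
        self.vp_numbers = count()
        self.free_physical = list(range(num_physical))
        self.gmt = {}   # logical register -> {"vp", "p", "v"}
        self.pmt = {}   # VP -> physical register
        for name in logical_names:
            vp, p = next(self.vp_numbers), self.free_physical.pop(0)
            self.gmt[name] = {"vp": vp, "p": p, "v": True}
            self.pmt[vp] = p

    def rename(self, dest, sources):
        operands = []
        for name in sources:
            entry = self.gmt[name]
            # V-bit set: rename to P; otherwise rename to VP.
            operands.append(("P", entry["p"]) if entry["v"] else ("VP", entry["vp"]))
        previous_vp = self.gmt[dest]["vp"]   # kept for roll-back
        new_vp = next(self.vp_numbers)       # no physical register is used yet
        self.gmt[dest] = {"vp": new_vp, "p": None, "v": False}
        return new_vp, operands, previous_vp

    def complete(self, dest, vp):
        p = self.free_physical.pop(0)        # allocated only at completion
        self.pmt[vp] = p
        entry = self.gmt[dest]
        if entry["vp"] == vp:                # still the newest mapping of dest?
            entry["p"], entry["v"] = p, True
        return vp, p                         # pair broadcast to the instruction queue

vpr = VirtualPhysicalRenamer(["r6", "f2", "f10"], num_physical=5)
load = vpr.rename("f2", ["r6"])              # load f2,0(r6)
fdiv = vpr.rename("f2", ["f2", "f10"])       # fdiv f2,f2,f10
free_after_rename = list(vpr.free_physical)
```

Both instructions are renamed without consuming a physical register; the two free registers are taken only when `complete` is called, which is the utilization benefit the scheme aims for.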

Register Allocation Example ► Uses more named registers ► Scheduling more effective ► 2-way super-scalar execution ► Source statements: Mem2 := Mem1 * Mem1; Mem4 := Mem3 + 1; ► Generated code: rA := Mem1; rB := Mem3; rA := rA * rA; rB := rB + 1; Mem2 := rA; Mem4 := rB;

Effect of Register Renaming ► Schedule uses 4 hardware registers ► 2-way super-scalar execution rA1 := Mem1; rB1 := Mem3; rA2 := rA1 * rA1; rB2 := rB1 + 1; Mem2 := rA2; Mem4 := rB2;

Effect of Register Renaming ► Schedule uses 4 hardware registers ► Can hide memory-write latency ► Still no full use of multiple pipelines rA1 := Mem1; rA2 := rA1 * rA1; Mem2 := rA2; rA3 := Mem3; rA4 := rA3 + 1; Mem4 := rA4;
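To make the effect concrete, the sketch below lists the register dependencies of the six-instruction schedule before and after renaming. It is an illustration under assumptions: the function `register_dependencies` and the `(dest, sources)` encoding are hypothetical, and memory operands are not modelled, so only register names are tracked.

```python
def register_dependencies(program):
    """Return (raw, war, waw) lists of (earlier, later) instruction indices."""
    raw, war, waw = [], [], []
    last_writer = {}   # register -> index of its most recent writer
    readers = {}       # register -> readers since that write
    for j, (dest, sources) in enumerate(program):
        for s in dict.fromkeys(sources):          # de-duplicate, keep order
            if s in last_writer:
                raw.append((last_writer[s], j))   # true dependency
            readers.setdefault(s, []).append(j)
        if dest is not None:
            for r in readers.get(dest, []):
                if r != j:
                    war.append((r, j))            # anti-dependency
            if dest in last_writer:
                waw.append((last_writer[dest], j))  # output dependency
            last_writer[dest] = j
            readers[dest] = []
    return raw, war, waw

# rA := Mem1; rA := rA * rA; Mem2 := rA; rA := Mem3; rA := rA + 1; Mem4 := rA
one_name = [("rA", ()), ("rA", ("rA",)), (None, ("rA",)),
            ("rA", ()), ("rA", ("rA",)), (None, ("rA",))]
# The same schedule with instances rA1..rA4
renamed = [("rA1", ()), ("rA2", ("rA1",)), (None, ("rA2",)),
           ("rA3", ()), ("rA4", ("rA3",)), (None, ("rA4",))]

deps_one_name = register_dependencies(one_name)
deps_renamed = register_dependencies(renamed)
```

With a single name, the load of Mem3 is tied to the first chain by an anti-dependency and an output dependency; after renaming only the four true dependencies within each chain remain, so the two chains are independent.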

Renaming and O-O-O execution ► Instructions wait for: availability of execution unit; input dependencies; older instructions have priority; load instructions have priority ► Instructions do NOT wait for: program order; branch resolution; output dependencies (use “rename register”)
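A rough Python sketch of these issue rules: each cycle, up to `width` instructions whose inputs are available are issued, oldest first, regardless of program order. The function `issue_schedule`, the uniform one-cycle latency, and the absence of per-unit limits, load priority, and branches are all simplifying assumptions; the input must already be renamed so that every destination is written exactly once.

```python
def issue_schedule(program, width=2, latency=1):
    """program: list of (dest, sources) in program order, already renamed."""
    producer = {dest: i for i, (dest, _) in enumerate(program) if dest is not None}
    done_at = {}                       # instruction index -> cycle result is ready
    waiting = list(range(len(program)))
    schedule = []
    cycle = 0
    while waiting:
        issued = []
        for i in waiting:              # oldest instructions get priority
            if len(issued) == width:   # no execution unit left this cycle
                break
            ready = all(
                s not in producer
                or done_at.get(producer[s], float("inf")) <= cycle
                for s in program[i][1]
            )
            if ready:                  # input dependencies satisfied
                issued.append(i)
        for i in issued:
            waiting.remove(i)
            done_at[i] = cycle + latency
        schedule.append(issued)
        cycle += 1
    return schedule

# rA1 := Mem1; rA2 := rA1 * rA1; Mem2 := rA2; rA3 := Mem3; rA4 := rA3 + 1; Mem4 := rA4
renamed_code = [("rA1", ()), ("rA2", ("rA1",)), (None, ("rA2",)),
                ("rA3", ()), ("rA4", ("rA3",)), (None, ("rA4",))]
cycles = issue_schedule(renamed_code)
```

Under these assumptions the serially ordered code issues as three pairs, one from each chain per cycle, which matches the 2-way schedule a compiler would otherwise have to produce with two register names.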

Renaming and O-O-O execution ► Schedule uses 4 hardware registers ► Can hide memory-write latency ► “Bad” schedule uses both pipelines ► Only one register name used rA1 := Mem1; rA2 := rA1 * rA1; Mem2 := rA2; rA3 := Mem3; rA4 := rA3 + 1; Mem4 := rA4;

Renaming aware scheduling? ► Use register renaming in the allocator: minimal number of named registers; maximal number of register instances ► Do not do scheduling that the CPU can do: over-scheduling can be worse than no scheduling at all