Instruction Issue Logic for High-Performance Interruptible Pipelined Processors Gurinder S. Sohi Professor UW-Madison Computer Architecture Group University.


Instruction Issue Logic for High-Performance Interruptible Pipelined Processors
Gurinder S. Sohi, Professor, UW-Madison Computer Architecture Group, University of Wisconsin-Madison
Sriram Vajapeyam, Real-Time Collaboration space at Oracle, Bangalore, India

What is this about?
The performance of pipelined processors is severely limited by data dependencies and branch instructions. Another major problem in pipelined computer design is that interrupts can be imprecise. Both problems degrade performance. This paper offers a hardware solution.

Problems and previous solutions
Data dependencies: code scheduling; waiting or reservation stations.
Branch instructions: delayed branching; branch prediction.
Imprecise interrupts: reorder buffer; reorder buffer with bypass logic.

Basic Architecture
Same instruction set as the scalar unit of the CRAY-1.
Several functional units connected to a common result bus.
Instruction fetch unit.
Decode and issue unit.
144 registers.

Tomasulo’s Algorithm
First presented for the floating-point unit of the IBM 360/91. An extension of this algorithm for the scalar unit of the CRAY-1 is presented later.
Algorithm: An instruction whose operands are not available is forwarded to a reservation station (RS). It waits in the RS until its operands are available and is then dispatched to the appropriate functional unit. Each register is assigned a bit that indicates whether the register is busy (that is, whether it is the destination of an instruction that has not yet completed). A busy register is assigned a tag that identifies the result to be stored in the register.
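The busy-bit and tag mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not the IBM 360/91 design: the names `Register`, `ReservationStation`, `issue` and `broadcast` are hypothetical, tags are plain integers chosen by the caller, and every instruction is placed in a reservation station even when its operands are already available.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Register:
    value: int = 0
    busy: bool = False          # True while a pending instruction will write this register
    tag: Optional[int] = None   # identifies the pending result (hypothetical integer tag)


@dataclass
class ReservationStation:
    tag: int                            # tag of the result this instruction produces
    op: str
    src_values: List[Optional[int]]     # operand values, None while still pending
    src_tags: List[Optional[int]]       # tag awaited for each pending operand

    def ready(self) -> bool:
        return all(v is not None for v in self.src_values)

    def snoop(self, tag: int, value: int) -> None:
        # Capture a result from the common result bus if an operand waits on it.
        for i, t in enumerate(self.src_tags):
            if t == tag:
                self.src_values[i] = value
                self.src_tags[i] = None


def issue(regs, stations, tag, op, dest, srcs):
    # Read sources first: a busy source yields its tag instead of a value.
    values, tags = [], []
    for s in srcs:
        r = regs[s]
        values.append(None if r.busy else r.value)
        tags.append(r.tag if r.busy else None)
    # Mark the destination busy and record which result it waits for.
    regs[dest].busy = True
    regs[dest].tag = tag
    rs = ReservationStation(tag, op, values, tags)
    stations.append(rs)
    return rs


def broadcast(regs, stations, tag, value):
    # A functional unit puts (tag, value) on the result bus.
    for rs in stations:
        rs.snoop(tag, value)
    for r in regs.values():
        if r.busy and r.tag == tag:
            r.value, r.busy, r.tag = value, False, None
```

In this sketch an instruction is dispatched once `ready()` returns true, and the associative comparison of every waiting tag against the result bus is the loop inside `broadcast`.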

Tomasulo’s Algorithm (contd.)
[Figure: fields in a reservation station]
Disadvantage: the hardware for register tagging and for the associative tag comparison is costly.

Extension to Tomasulo’s Algorithm
A separate Tag Unit (TU): because only a few sink registers (busy registers) are active at any time, the tags of all active registers are consolidated into a Tag Unit. Each register retains its busy bit.
Algorithm: At instruction issue time, if a source register is busy, the TU is queried for the current tag of that register and the tag is forwarded to the reservation station. If the destination register is not busy, obtaining a tag is straightforward. If it is busy, a new tag is obtained. The latest-copy field keeps the register busy even after the older instruction has executed. If the TU is full, instruction issue stalls.
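One way to picture the Tag Unit is as a small table whose index serves as the tag. The sketch below is an interpretation of the slide, not the paper's hardware: the class `TagUnit`, its method names and the dictionary layout of an entry are all assumptions made for illustration.

```python
class TagUnit:
    """Hypothetical model of a Tag Unit; the entry index is used as the tag."""

    def __init__(self, size):
        # Each entry is None when free, else {"reg": name, "latest": bool}.
        self.entries = [None] * size

    def current_tag(self, reg):
        # Tag a source operand should wait on, or None if the register is not busy.
        for tag, entry in enumerate(self.entries):
            if entry is not None and entry["reg"] == reg and entry["latest"]:
                return tag
        return None

    def allocate(self, reg):
        # New tag for a destination register, or None if the TU is full
        # (in which case instruction issue stalls).
        if None not in self.entries:
            return None
        old = self.current_tag(reg)
        if old is not None:
            self.entries[old]["latest"] = False   # older copy is no longer the latest
        tag = self.entries.index(None)
        self.entries[tag] = {"reg": reg, "latest": True}
        return tag

    def release(self, tag):
        # A result for `tag` has arrived. Return the register to update and mark
        # not busy, or None if a later instruction still owns the register.
        entry = self.entries[tag]
        self.entries[tag] = None
        return entry["reg"] if entry["latest"] else None
```

When `release` returns None the result is still forwarded to waiting reservation stations, but the register stays busy because a later write to it is pending.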

Extension to Tomasulo’s Algorithm (contd.)
[Figure: fields in a reservation station]

Other Extensions
Merging the reservation stations into a single RS pool. (Disadvantage: only one instruction can be issued at a time! NO)
Merging the RS pool with the Tag Unit to form the RS Tag Unit (RSTU).
[Figure: fields in the RSTU]

Implementation of Precise Interrupts
Reorder buffer: It allows instructions to finish execution out of order but updates registers, memory, etc. in the order in which the instructions appear in the program. This ensures that a precise state of the machine is recoverable at any time.
Bypass logic: An instruction does not have to wait for the reorder buffer to update a source register. It can fetch the value from the reorder buffer (if it is available) and issue.
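The reorder buffer and its bypass path can be sketched as a queue that commits only from the head. This is a simplified model under stated assumptions: the class `ReorderBuffer`, its methods, and the use of a dictionary as the register file are hypothetical, and memory updates are left out.

```python
from collections import deque


class ReorderBuffer:
    """Hypothetical reorder buffer: out-of-order completion, in-order commit."""

    def __init__(self):
        self.entries = deque()   # program order, oldest entry on the left

    def issue(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # Functional units may call this in any order.
        entry["value"] = value
        entry["done"] = True

    def commit(self, regs):
        # Update the register file strictly in program order; stop at the
        # first instruction that has not finished executing.
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            regs[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed

    def bypass(self, reg):
        # Value of the youngest pending write to `reg` if it has completed,
        # else None (the reader must wait or use the register file).
        for entry in reversed(self.entries):
            if entry["dest"] == reg:
                return entry["value"] if entry["done"] else None
        return None
```

Because the register file only ever reflects committed instructions, discarding the uncommitted entries at an interrupt leaves a precise state.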

MERGING DEPENDENCY RESOLUTION AND PRECISE INTERRUPTS
The RSTU can be made to behave like a reorder buffer if it is managed as a queue, which forces it to update the state of the machine in the order in which the instructions are encountered. The modified unit is called the Register Update Unit (RUU). The RUU (i) determines which instruction should be issued to the functional units for execution, reserves the result bus and dispatches the instruction to the functional unit, (ii) determines which instruction can commit, i.e., update the state of the machine, (iii) monitors the result bus to resolve dependencies, and (iv) provides tags to, and accepts new instructions from, the decode and issue unit.

Fields in RUU

Merging … (contd.)
Destination field: In the RSTU, the issue logic had to search the TU to obtain the correct tag for a source operand and to update the latest-copy field for the destination. The RUU uses counters instead of tracking multiple copies of a destination. Each register has two n-bit counters: Number of Instances (NI) and Latest Instance (LI). When an instruction that writes into a register is issued to the RUU, both NI and LI for that register are incremented; LI is incremented modulo n. When such an instruction leaves the RUU, the associated NI is decremented. A register tag consists of the register number appended with the LI counter.
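The NI and LI counters can be modelled per register as follows. This is a sketch under explicit assumptions: the class and method names are hypothetical, the counters are treated as n-bit hardware counters so that LI wraps modulo 2**n (the slide says "modulo n"), and issue is assumed to block once NI reaches its maximum value, a detail the slide does not state.

```python
class InstanceCounters:
    """Hypothetical NI/LI counter pair for one register."""

    def __init__(self, n_bits):
        self.limit = 1 << n_bits   # assumption: n-bit counters wrap modulo 2**n
        self.ni = 0                # Number of Instances currently in the RUU
        self.li = 0                # Latest Instance number

    def can_issue(self):
        # Assumption: issue blocks when NI holds its maximum n-bit value.
        return self.ni < self.limit - 1

    def issue(self, reg_number):
        # An instruction writing this register enters the RUU.
        if not self.can_issue():
            raise RuntimeError("too many pending instances; issue must stall")
        self.ni += 1
        self.li = (self.li + 1) % self.limit
        return (reg_number, self.li)   # register tag: register number plus LI

    def commit(self):
        # An instruction writing this register leaves the RUU.
        self.ni -= 1

    def busy(self):
        return self.ni > 0

    def source_tag(self, reg_number):
        # Tag a later reader waits on, or None if the register file is current.
        return (reg_number, self.li) if self.busy() else None
```

The point of the counters is that a source tag is read directly from LI, with no search for the latest copy as in the Tag Unit.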

Merging … (contd.)
Bypass logic in the RUU: The case in which bypass logic might be helpful is when Ij has completed execution but has not committed at the time Ii is issued to the RUU (Ii is issued after Ij and reads its result). To provide bypass logic for this case, the monitoring capabilities of the reservation stations are extended to monitor both the result bus and the bus from the RUU to the registers.

SIMULATION
Simulation results: The benchmark programs were the Lawrence Livermore loops. A large RUU is needed to achieve a performance improvement. An RUU of size 10 has the same hardware requirements as an architecture with a reservation station at each functional unit.

BRANCH PREDICTION AND CONDITIONAL INSTRUCTIONS
To allow conditional execution of instructions, a hardware mechanism is needed that lets the machine recover from an incorrect branch prediction. The RUU provides a way to nullify instructions, just as it does for interrupts.
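Nullifying instructions in a queue-organized unit amounts to discarding the entries younger than the mispredicted branch. The sketch below assumes the queue is a `deque` with the oldest instruction on the left; the function name `nullify_after` and the string entries are hypothetical.

```python
from collections import deque


def nullify_after(queue, branch_index):
    """Drop every entry younger than the branch at position branch_index."""
    while len(queue) > branch_index + 1:
        queue.pop()   # youngest entries sit at the right end of the queue
    return queue


# Entries i2 and i3 were fetched from the wrongly predicted path.
pending = deque(["i0", "br", "i2", "i3"])
nullify_after(pending, 1)
```

Since none of the discarded entries has committed, the machine state needs no further repair, which is the same property that makes interrupts precise.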

Conclusions
The paper combines hardware dependency resolution with the implementation of precise interrupts. It devises a low-cost hardware scheme that resolves dependencies and allows out-of-order execution, and integrates that scheme with precise interrupts. Combining the two made each problem simpler than before. The results of the performance evaluation are quite encouraging. The mechanism can easily be extended to support conditional execution of instructions from a predicted path.