Out of Order Processors


Outline
- Pipeline events
- OoO classes
- IO2I processors
- Dynamic scheduling
- Scoreboard
- Tomasulo's algorithm
- Alpha 21264 OoO implementation

MIPS Pipeline Events
- Instruction Issue: the point at which an instruction moves into the EX stage after completing the ID stage.
  Decode stage = instruction decode + structural hazard detection + operand-ready identification + register read.
- Instruction Commit: the point at which an instruction is guaranteed to complete; the instruction updates the state of the processor.
- Branch Delay: the clock cycles needed to ascertain whether the NPC is to be used or the address produced by the effective address calculation.

Out-of-Order Classes

OoO Motivating Code Sequence
- Compilers for sequential machines have no way of expressing the inherent parallelism in the code.
- VLIW processors, data flow machines

I4: In-order Fetch, Issue, Write Back, Commit
[Figure, built up over six slides: pipeline diagram with stages F, D, I, X0-X3, M0-M3, Y0-Y3, W. Callouts: Integer Function Unit, Memory Access Unit, Multiply Function Unit, Issue stage, Full bypassing.]

IO2I: In-order Fetch, Out-of-order Issue, Out-of-order Write Back, In-order Commit
[Figure: pipeline diagram with stages F, D, I, X0, S0, M0-M1, Y0-Y3, W, C and structures IQ, SB, PRF, ARF, ROB.]

Dynamic Scheduling
- Out-of-order execution
  - Check for structural and data hazards
  - Begin executing as soon as operands are available
- Implies out-of-order completion
  - WAR and WAW hazards
  - Imprecise exceptions

  DIV.D F0, F2, F4
  ADD.D F10, F0, F8
  SUB.D F12, F8, F14
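The hazard classes named on this slide can be checked mechanically. The following is a minimal sketch, not part of the original deck: the `Instr` tuple and the `hazards` helper are hypothetical names introduced for illustration, and the model only looks at register names, not at timing.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: op dest, src1, src2."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the data hazards that `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


# The three-instruction sequence from the slide.
program = [
    Instr("DIV.D", "F0", ("F2", "F4")),
    Instr("ADD.D", "F10", ("F0", "F8")),
    Instr("SUB.D", "F12", ("F8", "F14")),
]

for i, later in enumerate(program):
    for earlier in program[:i]:
        print(earlier.op, "->", later.op, sorted(hazards(earlier, later)))
```

ADD.D has a RAW dependence on the long-running DIV.D through F0, while SUB.D depends on neither earlier instruction, which is why a dynamically scheduled pipeline can start SUB.D before ADD.D.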

Dynamic Scheduling
- Separate the ID stage into 2 stages:
  - Issue: decode and check for structural hazards
  - Read Operands: wait till data hazards clear, read operands when ready
- Multi-cycle execution
- Scoreboard
  - CDC6600 (1965), mainframe computer
  - 16 functional units: 4 FP, 5 memory reference units, 7 INT

Scoreboarding Example

  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F6, F2
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Scoreboarding Example
Just before the second L.D writes its result.

Instruction Status
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    √      √              √                   √
L.D   F2, 45(R3)    √      √              √                   -
MUL.D F0, F2, F4    √      -              -                   -
SUB.D F8, F6, F2    √      -              -                   -
DIV.D F10, F0, F6   √      -              -                   -
ADD.D F6, F8, F2    -      -              -                   -

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
Integer  Yes   Load  F2   R3  -   -        -        No   -
Mult1    Yes   Mult  F0   F2  F4  Integer  -        No   Yes
Mult2    No    -     -    -   -   -        -        -    -
Add      Yes   Sub   F8   F6  F2  -        Integer  Yes  No
Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes

Register Result Status
      F0     F2       F4  F6  F8   F10     F12  ...  F30
FU    Mult1  Integer  -   -   Add  Divide  -         -
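The functional unit status fields in this snapshot follow from the issue-time bookkeeping. Below is a minimal sketch of that bookkeeping under simplifying assumptions: the `FunctionalUnit` class and the `can_issue`, `issue`, and `can_read_operands` helpers are hypothetical names, only FP registers are tracked, and the integer unit's pending load of F2 is seeded directly into the register result status.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FunctionalUnit:
    name: str
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None  # units that will produce Fj / Fk
    qk: Optional[str] = None
    rj: bool = False          # operand ready and not yet read
    rk: bool = False


def can_issue(fu: FunctionalUnit, dest: str, result: Dict[str, str]) -> bool:
    """Issue needs a free unit (structural) and no pending writer of dest (WAW)."""
    return not fu.busy and result.get(dest) is None


def issue(fu, op, dest, src1, src2, result):
    """Fill in the unit's status row and claim the destination register."""
    fu.busy, fu.op, fu.fi, fu.fj, fu.fk = True, op, dest, src1, src2
    fu.qj, fu.qk = result.get(src1), result.get(src2)
    fu.rj, fu.rk = fu.qj is None, fu.qk is None
    result[dest] = fu.name


def can_read_operands(fu: FunctionalUnit) -> bool:
    """Read Operands waits until both sources are ready (RAW resolved)."""
    return fu.rj and fu.rk


# Seeded state: the Integer unit is still loading F2.
result_status = {"F2": "Integer"}
mult1, add, divide = (FunctionalUnit(n) for n in ("Mult1", "Add", "Divide"))
issue(mult1, "Mult", "F0", "F2", "F4", result_status)
issue(add, "Sub", "F8", "F6", "F2", result_status)
issue(divide, "Div", "F10", "F0", "F6", result_status)

for fu in (mult1, add, divide):
    print(fu.name, fu.qj, fu.qk, fu.rj, fu.rk, can_read_operands(fu))
print("ADD.D can issue:", can_issue(add, "F6", result_status))
```

In this model ADD.D stalls at issue purely for a structural reason: the Add unit is still held by SUB.D.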

Scoreboarding Example

Instruction Status
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    √      √              √                   √
L.D   F2, 45(R3)    √      √              √                   √
MUL.D F0, F2, F4    √      -              -                   -
SUB.D F8, F6, F2    √      -              -                   -
DIV.D F10, F0, F6   √      -              -                   -
ADD.D F6, F8, F2    -      -              -                   -

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
Integer  No    -     -    -   -   -      -   -    -
Mult1    Yes   Mult  F0   F2  F4  -      -   Yes  Yes
Mult2    No    -     -    -   -   -      -   -    -
Add      Yes   Sub   F8   F6  F2  -      -   Yes  Yes
Divide   Yes   Div   F10  F0  F6  Mult1  -   No   Yes

Register Result Status
      F0     F2  F4  F6  F8   F10     F12  ...  F30
FU    Mult1  -   -   -   Add  Divide  -         -

Scoreboarding Example

Instruction Status
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    √      √              √                   √
L.D   F2, 45(R3)    √      √              √                   √
MUL.D F0, F2, F4    √      √              -                   -
SUB.D F8, F6, F2    √      √              √                   √
DIV.D F10, F0, F6   √      -              -                   -
ADD.D F6, F8, F2    -      -              -                   -

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj  Rk
Integer  No    -     -    -   -   -      -   -   -
Mult1    Yes   Mult  F0   F2  F4  -      -   No  No
Mult2    No    -     -    -   -   -      -   -   -
Add      Yes   Sub   F8   F6  F2  -      -   No  No
Divide   Yes   Div   F10  F0  F6  Mult1  -   No  Yes

Register Result Status
      F0     F2  F4  F6  F8   F10     F12  ...  F30
FU    Mult1  -   -   -   Add  Divide  -         -

Scoreboarding Example

Instruction Status
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    √      √              √                   √
L.D   F2, 45(R3)    √      √              √                   √
MUL.D F0, F2, F4    √      √              -                   -
SUB.D F8, F6, F2    √      √              √                   √
DIV.D F10, F0, F6   √      -              -                   -
ADD.D F6, F8, F2    -      -              -                   -

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj  Rk
Integer  No    -     -    -   -   -      -   -   -
Mult1    Yes   Mult  F0   F2  F4  -      -   No  No
Mult2    No    -     -    -   -   -      -   -   -
Add      No    -     -    -   -   -      -   -   -
Divide   Yes   Div   F10  F0  F6  Mult1  -   No  Yes

Register Result Status
      F0     F2  F4  F6  F8  F10     F12  ...  F30
FU    Mult1  -   -   -   -   Divide  -         -

Scoreboarding Example

Instruction Status
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    √      √              √                   √
L.D   F2, 45(R3)    √      √              √                   √
MUL.D F0, F2, F4    √      √              √                   -
SUB.D F8, F6, F2    √      √              √                   √
DIV.D F10, F0, F6   √      -              -                   -
ADD.D F6, F8, F2    √      √              √                   -

Functional Unit Status
Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj  Rk
Integer  No    -     -    -   -   -      -   -   -
Mult1    Yes   Mult  F0   F2  F4  -      -   No  No
Mult2    No    -     -    -   -   -      -   -   -
Add      Yes   Add   F6   F8  F2  -      -   No  No
Divide   Yes   Div   F10  F0  F6  Mult1  -   No  Yes

Register Result Status
      F0     F2  F4  F6   F8  F10     F12  ...  F30
FU    Mult1  -   -   Add  -   Divide  -         -
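In this snapshot ADD.D has finished executing but has not written F6, because DIV.D still has to read the old F6. The sketch below illustrates the usual scoreboard write-result check for that WAR case. It is an illustration only: `can_write_result` and the dictionary layout are hypothetical, and the rows are typed in by hand from the functional unit status shown on the slide.

```python
from typing import Dict, List

Unit = Dict[str, object]


def can_write_result(unit: Unit, units: List[Unit]) -> bool:
    """A unit may write Fi only if no other busy unit still has to read Fi.

    Rj/Rk == True means "operand ready but not yet read", so writing the
    register now would destroy a value that an earlier instruction needs (WAR).
    """
    dest = unit["Fi"]
    for other in units:
        if other is unit or not other["Busy"]:
            continue
        if other["Fj"] == dest and other["Rj"]:
            return False
        if other["Fk"] == dest and other["Rk"]:
            return False
    return True


# Busy rows as read off the slide: MUL.D and ADD.D have read their operands,
# DIV.D is still waiting for F0 from Mult1 and has not read F6 yet.
mult1 = {"Busy": True, "Fi": "F0", "Fj": "F2", "Fk": "F4", "Rj": False, "Rk": False}
add = {"Busy": True, "Fi": "F6", "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False}
divide = {"Busy": True, "Fi": "F10", "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True}
units = [mult1, add, divide]

print("ADD.D may write F6:", can_write_result(add, units))
print("MUL.D may write F0:", can_write_result(mult1, units))
```

Once DIV.D has read its operands (its Rk becomes No), the same check lets ADD.D write F6.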

Scoreboarding Example

Instruction Status
Instruction         Issue  Read operands  Execution complete  Write result
L.D   F6, 34(R2)    √      √              √                   √
L.D   F2, 45(R3)    √      √              √                   √
MUL.D F0, F2, F4    √      √              √                   √
SUB.D F8, F6, F2    √      √              √                   √
DIV.D F10, F0, F6   √      √              √                   -
ADD.D F6, F8, F2    √      √              √                   √

Functional Unit Status
Name     Busy  Op   Fi   Fj  Fk  Qj  Qk  Rj  Rk
Integer  No    -    -    -   -   -   -   -   -
Mult1    No    -    -    -   -   -   -   -   -
Mult2    No    -    -    -   -   -   -   -   -
Add      No    -    -    -   -   -   -   -   -
Divide   Yes   Div  F10  F0  F6  -   -   No  No

Register Result Status
      F0  F2  F4  F6  F8  F10     F12  ...  F30
FU    -   -   -   -   -   Divide  -         -

Tomasulo's Algorithm
- Invented by Robert Tomasulo for the IBM 360/91 (3 years after the CDC6600)
- Goal: high performance without special compilers
- Tomasulo algorithm vs. scoreboard
- Influenced the designs of the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
- Tomasulo [1967]. "An efficient algorithm for exploiting multiple arithmetic units," IBM J. Research and Development 11:1 (Jan), 25-33.

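Tomasulo's Algorithm
[Figure: organization of the FP unit. Instructions arrive from the instruction unit into the instruction queue. Load/store operations go to the address unit, with load buffers and store buffers connected to the memory unit (data, address). FP operations go from the FP registers to the reservation stations (numbered 1-3 in front of the FP adder and 1-2 in front of the FP multipliers). Results return on the common data bus.]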

Steps in Tomasulo's Algorithm
- Issue (also called dispatch)
  - Check for structural hazards
  - Queue the instruction in a reservation station (RS)
  - Keep track of the FU generating an operand if it is not available in the RF
  - Eliminates WAR and WAW hazards
- Execute
  - Monitor the CDB for operands (eliminates RAW hazards)
- Write result
  - Write the result on the CDB
  - The RS is marked available
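The three steps above can be sketched in a few lines. This is a simplified model under stated assumptions, not the IBM 360/91 design: the `Station` class and the `issue`, `ready`, and `broadcast` helpers are hypothetical names, register values are made up, and execution latency is ignored. It shows how copying a ready operand at issue removes the WAR hazard between DIV.D F10, F0, F6 and a later ADD.D F6, F8, F2.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Station:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # tags of stations still producing operands
    qk: Optional[str] = None


def issue(rs: Station, op: str, dest: str, src1: str, src2: str,
          regs: Dict[str, float], reg_status: Dict[str, str]) -> None:
    """Issue: copy ready operands, record a producer tag for the others."""
    rs.busy, rs.op = True, op
    rs.qj, rs.qk = reg_status.get(src1), reg_status.get(src2)
    rs.vj = regs[src1] if rs.qj is None else None
    rs.vk = regs[src2] if rs.qk is None else None
    reg_status[dest] = rs.name  # rename: dest now refers to this station


def ready(rs: Station) -> bool:
    """Execute may start once no operand is still waiting on the CDB."""
    return rs.busy and rs.qj is None and rs.qk is None


def broadcast(tag: str, value: float, stations: List[Station],
              regs: Dict[str, float], reg_status: Dict[str, str]) -> None:
    """Write result: put (tag, value) on the CDB and free the producer."""
    for rs in stations:
        if rs.name == tag:
            rs.busy = False
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg in [r for r, producer in reg_status.items() if producer == tag]:
        regs[reg] = value
        del reg_status[reg]


# Made-up register contents for illustration.
regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0, "F10": 0.0}
reg_status: Dict[str, str] = {}
mult1, mult2, add1 = Station("Mult1"), Station("Mult2"), Station("Add1")
stations = [mult1, mult2, add1]

issue(mult1, "MUL", "F0", "F2", "F4", regs, reg_status)
issue(mult2, "DIV", "F10", "F0", "F6", regs, reg_status)  # waits on Mult1
issue(add1, "ADD", "F6", "F8", "F2", regs, reg_status)    # overwrites F6 later

# ADD.D finishes first and writes F6; DIV.D keeps the old F6 it captured.
broadcast("Add1", add1.vj + add1.vk, stations, regs, reg_status)
print(regs["F6"], mult2.vk)

# MUL.D finishes; DIV.D picks F0 off the CDB and becomes ready.
broadcast("Mult1", mult1.vj * mult1.vk, stations, regs, reg_status)
print(ready(mult2), mult2.vj)
```

The design choice to note is that DIV.D never rereads the register file: its F6 operand was captured at issue, so the later write of F6 cannot disturb it.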

Example
(Animated slide. Busy is shown as initial → final; a V entry replaces the corresponding Q tag once the producing station writes its result.)

Instruction Status
Instruction         Issue  Read operands  Write result
L.D   F6, 34(R2)    √      √              √
L.D   F2, 44(R3)    √      √              √
MUL.D F0, F2, F4    √      √              √
SUB.D F8, F2, F6    √      √              √
DIV.D F10, F0, F6   √      -              -
ADD.D F6, F8, F2    √      √              √

Reservation Stations
Name   Busy      Op    Vj                 Vk                 Qj     Qk     A
Load1  yes → no  Load  -                  -                  -      -      34, then 34+Regs[R2]
Load2  yes → no  Load  -                  -                  -      -      44, then 44+Regs[R3]
Add1   yes → no  SUB   Mem[44+Regs[R3]]   Mem[34+Regs[R2]]   Load2  Load1  -
Add2   yes → no  ADD   Add1[F8]           Mem[44+Regs[R3]]   Add1   Load2  -
Add3   no        -     -                  -                  -      -      -
Mult1  yes → no  MUL   Mem[44+Regs[R3]]   Regs[F4]           Load2  -      -
Mult2  yes       DIV   -                  Mem[34+Regs[R2]]   Mult1  Load1  -

Register Status
Field  F0     F2     F4  F6    F8    F10    F12  ...  F30
Qi     Mult1  Load2  -   Add2  Add1  Mult2  -         -

OoO Processor Implementation
[Figure: block diagram with Branch Prediction, Instruction Fetch, Instruction Fetch Queue, Decode and Rename, Issue Queue, three ALUs, Register File (R1 – R32), and a Reorder Buffer (RoB) holding instructions I1-I6 with entries T1-T6.]

Original sequence        After decode and rename
R1 ← R1 + R2             T1 ← R1 + R2
R2 ← R1 + R3             T2 ← T1 + R3
BEQZ R2                  BEQZ T2
R3 ← R1 + R2             T4 ← T1 + T2
R1 ← R3 + R2             T5 ← T4 + T2
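The renamed sequence on this slide can be reproduced with a small rename table. The sketch below assumes, as the T numbering suggests, that the n-th instruction is given reorder buffer entry Tn and that an instruction without a destination (the BEQZ) leaves its entry's name unused. The `rename` function and the tuple encoding of instructions are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Instr = Tuple[Optional[str], Tuple[str, ...]]  # (destination, sources)


def rename(program: List[Instr]) -> Tuple[List[Instr], Dict[str, str]]:
    """Rename destinations to RoB entries T1, T2, ... in program order."""
    rename_map: Dict[str, str] = {}
    renamed: List[Instr] = []
    for n, (dest, srcs) in enumerate(program, start=1):
        # Sources are looked up before the destination is remapped, so
        # "R1 <- R1 + R2" reads the old R1 and writes a new name.
        new_srcs = tuple(rename_map.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"T{n}"
            rename_map[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed, rename_map


program: List[Instr] = [
    ("R1", ("R1", "R2")),  # R1 <- R1 + R2
    ("R2", ("R1", "R3")),  # R2 <- R1 + R3
    (None, ("R2",)),       # BEQZ R2
    ("R3", ("R1", "R2")),  # R3 <- R1 + R2
    ("R1", ("R3", "R2")),  # R1 <- R3 + R2
]

renamed, final_map = rename(program)
for dest, srcs in renamed:
    print(dest, srcs)
```

After renaming, the WAW pair on R1 (first and last instruction) and the WAR pairs on R1 and R2 no longer share a destination name, so only the true RAW dependences constrain the issue queue.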

Alpha 21264 OoO Implementation
[Figure: block diagram with Branch Prediction, Instruction Fetch, Instruction Fetch Queue, Decode and Rename, Issue Queue, three ALUs, Register File (R1 – R32), a Reorder Buffer (RoB) holding instructions I1-I6, and a register rename map (R1 → P1, R2 → P39, ...).]

Original sequence        After decode and rename
R1 ← R1 + R2             T1 ← R1 + R2
R2 ← R1 + R3             T2 ← T1 + R3
BEQZ R2                  BEQZ T2
R3 ← R1 + R2             T4 ← T1 + T2
R1 ← R3 + R2             T5 ← T4 + T2

R. E. Kessler, "The Alpha 21264 Microprocessor," IEEE Micro, 19(2), 1999.

References
- ELE475. David Wentzlaff. Princeton.
- CS6810. Rajeev Balasubramonian, PennState.
- Shen and Lipasti. Modern Processor Design.
- Hennessy and Patterson. Computer Architecture: A Quantitative Approach, 5th ed.