CSE 8383 Superscalar Processor 1 Abdullah A Alasmari & Eid S. Alharbi.

Slides:

Advertisements

Similar presentations

CH14 Instruction Level Parallelism and Superscalar Processors

Advertisements

Topics Left Superscalar machines IA64 / EPIC architecture

Computer Organization and Architecture

CSCI 4717/5717 Computer Architecture

Chapter 14 Instruction Level Parallelism and Superscalar Processors

Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

COMP25212 Advanced Pipelining Out of Order Processors.

1 Advanced Computer Architecture Limits to ILP Lecture 3.

CMSC 611: Advanced Computer Architecture Scoreboard Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.

Instruction-Level Parallelism (ILP)

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.

Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.

1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Cont. Computer Architecture.

Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation – Concepts 吳俊興高雄大學資訊工程學系 October 2004 EEF011 Computer Architecture 計算機結構.

Chapter 14 Superscalar Processors. What is Superscalar? “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.

Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based.

1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.

Review of CS 203A Laxmi Narayan Bhuyan Lecture2.

1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.

COMP381 by M. Hamdi 1 Pipelining (Dynamic Scheduling Through Hardware Schemes)

Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Chapter 14 Instruction Level Parallelism and Superscalar Processors

Computer Organization and Architecture Instruction-Level Parallelism and Superscalar Processors.

1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.

Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.

Super computers Parallel Processing By Lecturer: Aisha Dawood.

1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.

CMPE 421 Parallel Computer Architecture

CSCE 614 Fall Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing.

Pipelining and Parallelism Mark Staveley

Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.

Recap Multicycle Operations –MIPS Floating Point Putting It All Together: the MIPS R4000 Pipeline.

PART 5: (1/2) Processor Internals CHAPTER 14: INSTRUCTION-LEVEL PARALLELISM AND SUPERSCALAR PROCESSORS 1.

Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.

Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.

Advanced Pipelining 7.1 – 7.5. Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.

COMP25212 Advanced Pipelining Out of Order Processors.

CS 352H: Computer Systems Architecture

Dynamic Scheduling Why go out of style?

Advanced Architectures

William Stallings Computer Organization and Architecture 8th Edition

William Stallings Computer Organization and Architecture 8th Edition

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue

Chapter 9 a Instruction Level Parallelism and Superscalar Processors

Out of Order Processors

CS203 – Advanced Computer Architecture

Chapter 14 Instruction Level Parallelism and Superscalar Processors

Instruction Level Parallelism and Superscalar Processors

Out of Order Processors

Superscalar Processors & VLIW Processors

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Instruction Level Parallelism and Superscalar Processors

Adapted from the slides of Prof

How to improve (decrease) CPI

CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue

Computer Architecture

Adapted from the slides of Prof

CSC3050 – Computer Architecture

Created by Vivi Sahfitri

Instruction Level Parallelism

Presentation transcript:

CSE 8383 Superscalar Processor 1 Abdullah A Alasmari & Eid S. Alharbi

CSE 8383 Superscalar Processor 2 Outlines  Pipeline & Hazards  Superscalar  Instruction issue policy  Register renaming  MIPS R10000  Advanced Superscalar  Summary

CSE 8383 Superscalar Processor 3 Pipeline & Hazards  Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls Ideal pipeline CPI: measure of the maximum performance attainable by the implementation Structural hazards: HW cannot support this combination of instructions Data hazards: Instruction depends on result of prior instruction still in the pipeline Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

CSE 8383 Superscalar Processor 4 Hazards  Structural Hazards: Have as many functional units as needed  Data Hazards solutions: Execute instructions in order. Use score-board to eliminate data hazards by stalling instructions Execute instructions out or order, as soon as operands are available, but graduate them in order. Use register renaming to avoid WAR and WAW data hazards  Control Hazards solutions: Use branch prediction: Make sure that the branch is resolved before registers are modified

CSE 8383 Superscalar Processor 5 Branch prediction  What do we need to predict for a jump/branch? jump: the target address, which can be stored in the same instruction or computed from the current PC plus a displacement Return from subroutine ret: the return address, which is obtained from the stack (increasing the SP and reading from memory) conditional branch: the target address, which is usually computed from the current PC plus a displacement Is the branch going to branch or continue with next instruction?

CSE 8383 Superscalar Processor 6 Outlines  Pipeline & Hazards  Superscalar  Instruction issue policy  Register renaming  MIPS R10000  Advanced Superscalar  Summary

CSE 8383 Superscalar Processor 7 Multiple Instruction Issue Multiple instructions issued each cycle  better performance increase instruction throughput decrease in CPI (below 1)  greater hardware complexity.  harder code scheduling job for the compiler Superscalar processors  instructions are scheduled by the hardware  different numbers of instructions may be issued simultaneously VLIW (“very long instruction word”) processors  instructions are scheduled by the compiler  a fixed number of operations are formatted as one big instruction

CSE 8383 Superscalar Processor 8 What is Superscalar?  A machine designed to improve the performance execution of scalar instructions; where one instruction per cycle.  Superscalar architecture allows several instructions to be issued and completed per clock cycle  consists of a number of pipeline that are working in parallel  Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently in different pipelines  Executed in an order different from the program order  Equally applicable to RISC & CISC, In practice usually RISC

CSE 8383 Superscalar Processor 9 Superscalar Execution IMReg ALU DMReg IMReg ALU DMRegIMReg ALU DMRegIMReg ALU DMRegIMReg ALU DMRegIMReg ALU DMReg

CSE 8383 Superscalar Processor 10 Superscalar Execution IMReg ALU DMReg IMReg ALU DMRegIMReg ALU DMRegIMReg ALU DMRegIMReg ALU DMRegIMReg ALU DMReg

CSE 8383 Superscalar Processor 11 How Does it Work? Require:  instruction fetch fetching of multiple instructions at once dynamic branch prediction & fetching beyond conditional branches  instruction issue methods for determining which instructions can be issued next the ability to issue multiple instructions in parallel  instruction commit methods for committing several instructions in fetch order duplicate & more complex hardware

CSE 8383 Superscalar Processor 12  Assumptions Single FP adder takes 2 cycles Single FP multipler takes 5 cycles Can issue add & multiply together Must issue in-order v: addt $f10, $f2, $f4 w: mult $f10, $f10, $f6 x: addt $f12, $f10, $f8 y: addt $f4, $f4, $f6 z: addt $f10, $f4, $f8 (Single adder, data dependence) (In order) v w x y (inorder) z Data Flow ++ * + $f2$f4$f6 $f4 $f10 $f8 y v x z Critical Path = 9 cycles + w z $f12 $f10 Superscalar Execution Example

CSE 8383 Superscalar Processor 13 Adding Advanced Features  Out Of Order Issue Can start y as soon as adder available Must hold back z until $f10 not busy & adder available  With Register Renaming v w x y z v w x y z v: addt $f10, $f2, $f4 w: mult $f10, $f10, $f6 x: addt $f12, $f10, $f8 y: addt $f4, $f4, $f6 z: addt $f10, $f4, $f8 v: addt $f10a, $f2, $f4 w: mult $f10a, $f10a, $f6 x: addt $f12, $f10a, $f8 y: addt $f4, $f4, $f6 z: addt $f10, $f4, $f8 ++ * + $f2$f4$f6 $f4 $f10 $f8 y v x z Critical Path = 9 cycles + w z $f12 $f10

CSE 8383 Superscalar Processor 14 Outlines  Pipeline & Hazards  Superscalar  Instruction issue policy  Register renaming  MIPS R10000  Advanced Superscalar  Summary

CSE 8383 Superscalar Processor 15 The Process of Instruction Issue  K-issue, dynamically scheduled superscalar processor IPreF: Prefetches instructions for superscalar IF: Conceptually, IF examines each instruction in the Issue Packet for hazards in program order IS1: Decides how many instruction from the packet can be issued simultaneously IS2: Examines the selected instructions in IS1 with already issued instructions for hazards IPreF IF EXIS1IS2 Issue Packets: 0≤ I ≤ K

CSE 8383 Superscalar Processor 16 Instruction Issue Policy  Instruction Issue Policy refers to the protocol used to issue instruction  The three types of ordering are Order in which instructions are fetched Order in which instructions are executed Order in which instructions change registers and memory

CSE 8383 Superscalar Processor 17 Instruction Issue Policy  The simplest policy is to execute and complete instruction in their sequential order  To improve parallelism, the processor has to look ahead and try to find independent instructions to execute in parallel  Thus, instructions will be executed in an order different from the strictly sequential one, with the restriction that the result must be correct  Execution policies: i. In-order issue with in-order completion ii. In-order issue with out-order completion iii. Out-of-order issue with out-of-order completion

CSE 8383 Superscalar Processor 18 In-Order Issue with In-Order Completion  Instructions are issued in the exact order that would correspond to sequential execution [In-order Issue] and result are written in the same order [In-order Completion]

CSE 8383 Superscalar Processor 19 In-Order Issue with Out-of-Order Completion  Result are written in different order  An output dependency exists if two instructions are writing into the same location  Output dependency R3 R3:= R3 + R5; (I1) R4:= R3 + 1; (I2) R3 R3:= R5 + 1; (I3) R7:= R3 + R4; (I4) If I3 completes before I1, the result from I1 will be wrong.

CSE 8383 Superscalar Processor 20 Out-of-Order Issue with Out-of-Order Completion  With in-order issue, no new instruction can be issued when processor has detected a conflict and is stalled, until after the conflict has been resolved  As such, the processor is not allowed to look ahead for further instructions, which could be executed in parallel  Out-of-order issue tries to resolve the above problem by taking a set of decoded instructions into an instruction window (buffer)  When a functional unit becomes available, an instruction from the window may be issued to the execute stage  Any instruction may be issued, provided that: i. it needs a particular functional unit that is available ii. no conflict or dependencies blocking this instruction

CSE 8383 Superscalar Processor 21 Outlines  Pipeline & Hazards  Superscalar  Instruction issue policy  Register renaming  MIPS R10000  Advanced Superscalar  Summary

CSE 8383 Superscalar Processor 22 Antidependency  Read-Write dependency DIV.D F0, F1, F2 (I1) ADD.D F3, F0, F4 (I2) SUB.D F4, F5, F6 (I3) MUL.D F3, F5, F4 (I4)  I3 can not complete before I2 starts as I2 needs a value in F4 and I3 changes F4  An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location;  if the first one is still using the location when the second one writes into it, an error occurs:

CSE 8383 Superscalar Processor 23 Register Renaming  Output dependencies and antidependencies can be treated similarly to true data dependencies as normal conflicts, by delaying the execution of a certain instruction until it can be executed  Parallelism could be improved by eliminating output dependencies and antidependencies, which are not real data dependencies  These artificial dependencies can be eliminated by automatically allocating new registers to values, when such dependencies has been detected  This technique is called register renaming

CSE 8383 Superscalar Processor 24 Register renaming DIV.D F0, F1, F2 ADD.D F3, F0, F4 SUB.D F4, F5, F6 SUB.D T, F5, F6 MUL.D F3, F5, F4 MUL.D S, F5, T F0 F1 F2 F3 F4 F5 5.6 F6 F31 NameOp1DetOp2 Div520F0 Add2.6DivF3 Sub F4 MulSub5.6F3 Register FileReservation Station

CSE 8383 Superscalar Processor 25 Execution Example  Assumptions Two-way issue with renaming Rename registers B1,B2, etc. 1 cycle ADD.D latency, 2 cycles MUL.D v: ADD.D $f10, $f2, $f4 w: MUL.D $f10’ $f10, $f6 x: ADD.D $f12, $f10, $f8 y: ADD.D $f4, $f4, $f6 ValueRename 10.0 $f $f $f $f $f $f12 ADD -- Op1Op2Dest -- ResultDest MULT -- Op1Op2Dest -- ResultDest -- B1 -- ValueRenames F Valid -- B2 --F B3 --F B4 --F

CSE 8383 Superscalar Processor 26 Execution Example Cycle 1  Actions Instructions v & w issued v target set to B1 w target set to B2 ValueRename 10.0 $f $f $f $f $f10 B $f12 ADD B1 -- Op1Op2Dest -- ResultDest MULT B140.0B2 -- Op1Op2Dest -- ResultDest v w v: ADD.D $f10, $f2, $f4 w: MUL.D $f10’ $f10, $f6 x: ADD.D $f12, $f10, $f8 y: ADD.D $f4, $f4, $f6 -- B1 $f10 ValueRenames F Valid -- B2 $f10F -- B3 --F B4 --F

CSE 8383 Superscalar Processor 27 Execution Example Cycle 2  Actions Instructions x & y issued x & y targets set to B3 and B4 Instruction v executed ValueRename 10.0 $f $f4 B $f $f $f10 B $f12 B3 ADD B280.0B B4 Op1Op2Dest 30.0B1 ResultDest MULT B2 -- Op1Op2Dest -- ResultDest v wx y v: ADD.D $f10, $f2, $f4 w: MUL.D $f10’ $f10, $f6 x: ADD.D $f12, $f10, $f8 y: ADD.D $f4, $f4, $f B1 $f10 ValueRenames T Valid -- B2 $f10F -- B3 $f12F -- B4 $f4F

CSE 8383 Superscalar Processor 28 Instruction v retired But doesn’t change $f10 Instruction w begins execution Moves through 2 stage pipeline Instruction y executed ValueRename 10.0 $f $f4 B $f $f $f10 B $f12 B3 ADD B280.0B3 -- Op1Op2Dest 60.0B4 ResultDest MULT -- Op1Op2Dest -- ResultDest y x B2 w v: ADD.D $f10, $f2, $f4 w: MUL.D $f10’ $f10, $f6 x: ADD.D $f12, $f10, $f8 y: ADD.D $f4, $f4, $f6 -- B1 -- ValueRenames F Valid -- B2 $f10F -- B3 $f12F 60.0 B4 $f4T Execution Example Cycle 3

CSE 8383 Superscalar Processor 29 Execution Example Cycle 4 Instruction w finishes execution Instruction y cannot be retired yet ValueRename 10.0 $f $f4 B $f $f $f10 B $f12 B3 ADD B3 -- Op1Op2Dest -- ResultDest MULT -- Op1Op2Dest 120.0B2 ResultDest w x v: ADD.D $f10, $f2, $f4 w: MUL.D $f10’ $f10, $f6 x: ADD.D $f12, $f10, $f8 y: ADD.D $f4, $f4, $f6 -- B1 -- ValueRenames F Valid B2 $f10T -- B3 $f12F 60.0 B4 $f4T

CSE 8383 Superscalar Processor 30 Execution Example Cycle 5 Instruction w retired update $f10 Instruction y cannot be retired yet Instruction x executed ValueRename 10.0 $f $f4 B $f $f $f $f12 B3 ADD -- Op1Op2Dest 200.0B3 ResultDest MULT -- Op1Op2Dest -- ResultDest x v: ADD.D $f10, $f2, $f4 w: MUL.D $f10’ $f10, $f6 x: ADD.D $f12, $f10, $f8 y: ADD.D $f4, $f4, $f6 -- B1 -- ValueRenames F Valid -- B2 --F B3 $f12T 60.0 B4 $f4T

CSE 8383 Superscalar Processor 31 Execution Example Cycle 6 Instruction x & y retired Update $f12 and $f4 ValueRename 10.0 $f $f $f $f $f10 %f $f12 ADD -- Op1Op2Dest -- ResultDest MULT -- Op1Op2Dest -- ResultDest v: ADD.D $f10, $f2, $f4 w: MUL.D $f10’ $f10, $f6 x: ADD.D $f12, $f10, $f8 y: ADD.D $f4, $f4, $f6 -- B1 -- ValueRenames F Valid -- B2 --F B3 $f12T 60.0 B4 $f4T

CSE 8383 Superscalar Processor 32 Outlines  Pipeline & Hazards  Superscalar  Instruction issue policy  Register renaming  MIPS R10000  Advanced Superscalar  Summary

CSE 8383 Superscalar Processor 33 Example: MIPS R10000  Can decode 4 instructions per cycle  Has 5 execution pipelines  Uses dynamic scheduling and out-of-order execution  Does speculative branching  Functional Units Integer ALU1 Integer ALU2 Load/Store Unit Float Adder Float Multiply

CSE 8383 Superscalar Processor 34 Example: MIPS R10000 Instructions cache Decode Branch unit Issue RF FAdd-1 FAdd-2FAdd-3Result Issue RF FMpy-1 FMpy-2FMpy-3Result Issue RF ALU1 Result Issue RF ALU2 Result Issue RF Add-Calc Data Cache Result Queues 7 Pipeline Stages Stage 1 Fetch Stage 2 Decode Stage 3 Issue Stage 4 Execute Stage 5 Execute Stage 6 Execute Stage 7 Store 4 instructions Fetch and DecodeFunctional Unit (Execute instructions) Branch Address (one branch can be handled every cycle) 5 Execution Pipelines

CSE 8383 Superscalar Processor 35 Outlines  Pipeline & Hazards  Superscalar  Instruction issue policy  Register renaming  MIPS R10000  Advanced Superscalar  Summary

CSE 8383 Superscalar Processor 36 Advanced Superscalar  Future Architecture  Can issue 16 to 32 instructions  Consist of 24 to 48 functional units  Use advance branch prediction  Advantage Enhancing performance  Disadvantage Attempting to extract more instruction level parallelism has diminishing returns on performance as the issue width increases Increasing Microprocessor complexity

CSE 8383 Superscalar Processor 37 Outlines  Pipeline & Hazards  Superscalar  Instruction issue policy  Register renaming  MIPS R10000  Advanced Superscalar  Summary

CSE 8383 Superscalar Processor 38 Summary  Superscalar is ILP mechanism to enhance the performance by increasing throughput.  It is limited by True data dependency Procedural (Control) dependency Resource conflicts Output dependency Antidependency

CSE 8383 Superscalar Processor 39 Summary  Pros The hardware solves everything: Hardware detects potential parallelism between instructions; Hardware tries to issue as many instructions as possible in parallel. Hardware solves register renaming.  Cons Very complex Much hardware is needed for run-time detection. There is a limit in how far we can go with this technique. Power consumption can be very large! The window of executions limited  this limits the capacity to detect potentially parallel instructions