Machine Organization (CS 570) Lecture 7: Dynamic Scheduling and Branch Prediction* Jeremy R. Johnson Wed. Nov. 8, 2000
*This lecture was derived from material in the text (Chap. 4). All figures from Computer Architecture: A Quantitative Approach, Second Edition, by John Hennessy and David Patterson, are copyrighted material (COPYRIGHT 1996 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED).

Introduction
Objective: To understand how pipeline scheduling, loop unrolling, and branch prediction can be carried out in hardware.
– We contrast the dynamic approaches with the compiler techniques discussed previously. Since many of these techniques become more important with multiple instruction issue, we give a brief overview of techniques for issuing multiple instructions per cycle.
– We will also review static techniques for branch prediction.
Topics
– Review of static branch prediction: tcov profiling tool
– Dynamic scheduling: scoreboard, Tomasulo's algorithm
– Branch prediction
– Multiple issue: superscalar, VLIW

Dynamic Scheduling
Hardware rearranges instructions at runtime to reduce stalls:
– simplifies the compiler
– catches cases where dependences are not known at compile time
– scoreboard (RAW)
– register renaming (WAW, WAR)
A major limitation of the previous pipelining techniques is that they use in-order instruction issue: if an instruction is stalled, no later instruction can proceed.
DIVD F0,F2,F4   ; long-running instruction
ADDD F10,F0,F8  ; must wait for DIVD (RAW on F0)
SUBD F12,F8,F14 ; no dependence on DIVD; its stall is eliminated by no longer requiring in-order execution

Split ID Stage
Issue: decode the instruction, check for structural hazards.
Read operands: wait until there are no data hazards, then read the operands.
This allows multiple instructions to be in execution at the same time. Out-of-order execution becomes possible, so WAR and WAW hazards may occur (a small classifier sketch follows below):
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14  ; WAR on F8 with ADDD; if the destination were F10, it would be a WAW hazard instead
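To make the hazard taxonomy concrete, here is a minimal Python sketch (mine, not from the text); instructions are represented as hypothetical (dest, src1, src2) register tuples.

```python
def hazards(early, late):
    """Classify the data hazards a later instruction has on an earlier one.

    Each instruction is a (dest, src1, src2) tuple of register names,
    given in program order.
    """
    found = set()
    if early[0] in late[1:]:    # later reads what earlier writes
        found.add("RAW")
    if late[0] in early[1:]:    # later writes what earlier reads
        found.add("WAR")
    if late[0] == early[0]:     # both write the same register
        found.add("WAW")
    return found

# The example above:
print(hazards(("F0", "F2", "F4"), ("F10", "F0", "F8")))   # {'RAW'}: ADDD needs DIVD's F0
print(hazards(("F0", "F2", "F4"), ("F12", "F8", "F14")))  # set(): SUBD is independent of DIVD
print(hazards(("F10", "F0", "F8"), ("F8", "F8", "F14")))  # {'WAR'}: SUBD overwrites ADDD's source F8
```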

Scoreboard

Pipeline Steps with Scoreboard
Issue
– if a functional unit is free and no other active instruction has the same destination register, issue the instruction and update the scoreboard
– there may be a queue for instruction fetches (stall when full)
– removes WAW hazards
Read operands
– the scoreboard monitors the availability of source operands
– when they are available, it tells the functional unit to read its operands
– resolves RAW hazards (instructions may be sent to execution out of order)
Execution
– the functional unit begins execution upon receiving operands
– it notifies the scoreboard when the result is ready
Write result
– when the result is available, check for WAR hazards, stall if necessary, then write to the register

Example (Fig )

Example (Fig )

Example (Fig )

Checks and Bookkeeping
Issue
– wait until not Busy[FU] and not Result[D]
– Busy[FU] = yes; Op[FU] = op; Fi[FU] = D;
– Fj[FU] = S1; Fk[FU] = S2; Qj = Result[S1]; Qk = Result[S2];
– Rj = not Qj; Rk = not Qk; Result[D] = FU;
Read operands
– wait until Rj and Rk
– Rj = no; Rk = no; Qj = 0; Qk = 0;
Execution complete
– functional unit done
Write result
– wait until ∀f ((Fj[f] ≠ Fi[FU] or Rj[f] = no) and (Fk[f] ≠ Fi[FU] or Rk[f] = no))
– ∀f (if Qj[f] = FU then Rj[f] = yes); ∀f (if Qk[f] = FU then Rk[f] = yes);
– Result[Fi[FU]] = 0; Busy[FU] = no;
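The same bookkeeping rendered as a runnable Python sketch (my rendering: field names follow the slide, while the class and unit names are illustrative).

```python
class FunctionalUnit:
    def __init__(self):
        self.busy = False
        self.op = self.Fi = self.Fj = self.Fk = None  # op, dest, source regs
        self.Qj = self.Qk = None     # FU producing each source, or None
        self.Rj = self.Rk = False    # source ready and not yet read

class Scoreboard:
    def __init__(self, units):
        self.fu = {name: FunctionalUnit() for name in units}
        self.result = {}             # dest register -> FU that will write it

    def can_issue(self, unit, dest):
        # stall on a structural hazard (unit busy) or WAW hazard (dest pending)
        return not self.fu[unit].busy and dest not in self.result

    def issue(self, unit, op, dest, s1, s2):
        u = self.fu[unit]
        u.busy, u.op, u.Fi, u.Fj, u.Fk = True, op, dest, s1, s2
        u.Qj, u.Qk = self.result.get(s1), self.result.get(s2)
        u.Rj, u.Rk = u.Qj is None, u.Qk is None
        self.result[dest] = unit

    def can_read(self, unit):        # RAW: wait until both sources are ready
        return self.fu[unit].Rj and self.fu[unit].Rk

    def read_operands(self, unit):
        self.fu[unit].Rj = self.fu[unit].Rk = False  # operands consumed

    def can_write(self, unit):
        # WAR: no other unit may still need to read our destination register
        d = self.fu[unit].Fi
        return all(not ((f.Fj == d and f.Rj) or (f.Fk == d and f.Rk))
                   for name, f in self.fu.items() if name != unit)

    def write_result(self, unit):
        for f in self.fu.values():   # wake up consumers waiting on this FU
            if f.Qj == unit: f.Rj, f.Qj = True, None
            if f.Qk == unit: f.Rk, f.Qk = True, None
        del self.result[self.fu[unit].Fi]
        self.fu[unit].busy = False

sb = Scoreboard(["Integer", "Mult1", "Add"])
sb.issue("Mult1", "DIVD", "F0", "F2", "F4")
print(sb.can_issue("Add", "F0"))     # False: WAW hazard on F0
print(sb.can_issue("Add", "F10"))    # True
```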

Register Renaming (Tomasulo)
Uses reservation stations to buffer instructions waiting to issue
– fetches operands as soon as possible
– eliminates the need to get an operand from a register
– pending instructions designate the reservation station that will provide their results
– with successive writes to a register, only the last one actually updates the register
– register specifiers are renamed to reservation stations
– eliminates WAW and WAR hazards
Uses distributed control (common data bus)
Results go directly to functional units from reservation stations rather than through the register file
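A minimal sketch (mine, not the text's code) of the renaming idea: the register status table maps a register to the reservation station that will produce it, so later readers capture the station tag instead of the register; station names like Mult1/Add1 are illustrative.

```python
rename = {}   # register -> reservation station that will produce it

def issue(station, dest, sources):
    # Capture each source first: either it is available in the register
    # file now (tag None), or it is tagged with the producing station.
    operands = [(s, rename.get(s)) for s in sources]
    # Renaming the destination *after* capturing sources means a later
    # write to the same register (WAW) just overwrites the mapping, and
    # an earlier reader already holds its tag/value (no WAR).
    rename[dest] = station
    return operands

issue("Mult1", "F0", ["F2", "F4"])          # DIVD F0,F2,F4
print(issue("Add1", "F10", ["F0", "F8"]))   # [('F0', 'Mult1'), ('F8', None)]
issue("Add2", "F8", ["F8", "F14"])          # SUBD F8,...: no WAR with ADDD
print(rename)   # {'F0': 'Mult1', 'F10': 'Add1', 'F8': 'Add2'}
```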

Hardware for Register Renaming (Fig. 4.8)

Pipeline Steps with Renaming
Issue
– get the instruction from the floating-point queue
– if an FP reservation station is free, send the instruction to it, along with its operands if they are in registers
– for a load/store, issue if there is an available buffer
– renaming is done here
Execution
– if one or more operands is not available, monitor the CDB while waiting for it to be computed
– when the operands are available, execute the operation
– this checks for RAW hazards
Write result
– when the result is available, write it on the CDB and from there into the registers, reservation stations, and waiting store buffers

Example (Fig )

Example (Fig )

Checks and Bookkeeping
Issue
– wait until a station or buffer is empty
– if (Register[S1].Qi ≠ 0) { RS[r].Qj = Register[S1].Qi }
– else { RS[r].Vj = S1; RS[r].Qj = 0; }
– if (Register[S2].Qi ≠ 0) { RS[r].Qk = Register[S2].Qi }
– else { RS[r].Vk = S2; RS[r].Qk = 0; }
– RS[r].Busy = yes; Register[D].Qi = r;
Execution complete
– wait until (RS[r].Qj = 0) and (RS[r].Qk = 0)
– no bookkeeping; the operands are in Vj and Vk
Write result
– wait until execution has completed at r and the CDB is available
– ∀x (if (Register[x].Qi = r) then { Fx = result; Register[x].Qi = 0 })
– ∀x (if (RS[x].Qj = r) then { RS[x].Vj = result; RS[x].Qj = 0 })
– ∀x (if (RS[x].Qk = r) then { RS[x].Vk = result; RS[x].Qk = 0 })
– ∀x (if (Store[x].Qi = r) then { Store[x].V = result; Store[x].Qi = 0 })
– RS[r].Busy = no;
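The issue and CDB-broadcast rules as a runnable Python sketch (my rendering of the slide's bookkeeping; station names are illustrative and the store-buffer case is omitted for brevity).

```python
class RS:                              # one reservation station
    def __init__(self):
        self.busy = False
        self.Vj = self.Vk = None       # operand values (valid when Q is None)
        self.Qj = self.Qk = None       # tags of the stations producing them

stations = {name: RS() for name in ("Add1", "Add2", "Mult1", "Mult2")}
reg_value = {}                         # register file
reg_Qi = {}                            # register -> tag of producing station

def issue(r, dest, s1, s2):
    rs = stations[r]
    for src, v, q in ((s1, "Vj", "Qj"), (s2, "Vk", "Qk")):
        if src in reg_Qi:              # operand still being computed
            setattr(rs, q, reg_Qi[src])
        else:                          # operand available in the registers
            setattr(rs, v, reg_value.get(src))
    rs.busy = True
    reg_Qi[dest] = r                   # rename: dest will come from station r

def ready(r):                          # execution may start (RAW resolved)
    return stations[r].Qj is None and stations[r].Qk is None

def write_result(r, result):           # broadcast on the common data bus
    for reg in [x for x, tag in reg_Qi.items() if tag == r]:
        reg_value[reg] = result
        del reg_Qi[reg]
    for rs in stations.values():
        if rs.Qj == r: rs.Vj, rs.Qj = result, None
        if rs.Qk == r: rs.Vk, rs.Qk = result, None
    stations[r].busy = False

issue("Mult1", "F0", "F2", "F4")       # DIVD F0,F2,F4
issue("Add1", "F10", "F0", "F8")       # ADDD must wait on Mult1 for F0
print(ready("Add1"))                   # False: Qj holds the tag 'Mult1'
write_result("Mult1", 3.5)             # CDB write wakes up Add1
print(stations["Add1"].Vj, ready("Add1"))   # 3.5 True
```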

Dynamic Unrolling (Fig. 4.12)
Loop: LD F0, 0(R1)
      MULTD F4, F0, F2
      SD 0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop
If we predict that the branch is taken, using reservation stations will allow multiple executions of this loop to proceed at once (dynamic unrolling).

Dynamic Unrolling (Fig. 4.12)

Dynamic Branch Prediction
Provide hardware to dynamically predict whether a branch is taken or not.
To be effective when we predict that a branch will be taken, we must be able to compute the target address before we would normally determine whether to take the branch
– a branch target buffer (BTB) provides a cache of branch target addresses
The simplest approach uses 1 bit to remember whether the branch was taken the last time or not.
In a simple loop this leads to two mispredictions per execution: one on loop exit and one on re-entry (compare the sketch below)
– a two-bit scheme improves this situation
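A sketch (mine) comparing the two schemes on a loop branch that is taken eight times and then falls through, with the whole loop executed twice.

```python
def mispredictions(predict, update, state, outcomes):
    misses = 0
    for taken in outcomes:
        misses += predict(state) != taken
        state = update(state, taken)
    return misses

outcomes = ([True] * 8 + [False]) * 2   # two executions of a 9-iteration loop

# 1-bit: the state is simply the last outcome.
one_bit = mispredictions(lambda s: s, lambda s, t: t, True, outcomes)

# 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken.
two_bit = mispredictions(lambda s: s >= 2,
                         lambda s, t: min(s + 1, 3) if t else max(s - 1, 0),
                         3, outcomes)

print(one_bit, two_bit)   # 3 vs 2: in steady state the 1-bit scheme misses
                          # twice per loop (exit and re-entry), the 2-bit once
```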

Two-Bit Prediction Scheme

Branch Target Buffer
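A sketch of a BTB (assumptions mine: direct mapped, dictionary backed, the full PC kept as the tag, and only taken branches entered, as on the previous slide).

```python
class BTB:
    def __init__(self, entries=4096):
        self.entries = entries
        self.table = {}                        # index -> (tag, target)

    def lookup(self, pc):
        tag, target = self.table.get(pc % self.entries, (None, None))
        return target if tag == pc else None   # None: just fetch PC + 4

    def update(self, pc, taken, target):
        idx = pc % self.entries
        if taken:
            self.table[idx] = (pc, target)     # enter or refresh the branch
        elif self.table.get(idx, (None, None))[0] == pc:
            del self.table[idx]                # not taken: remove the entry

btb = BTB()
btb.update(0x400, taken=True, target=0x100)
print(hex(btb.lookup(0x400)))                  # 0x100: target known at fetch
print(btb.lookup(0x500))                       # None: no prediction, fetch PC + 4
```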

Frequency of Mispredictions (Fig. 4.14): 4096-entry 2-bit prediction buffer

Handling Instructions with BTB (Fig. 4.23)

Multiple Issue
The previous techniques eliminate data and control stalls, allowing us to approach the ideal CPI of 1.
To improve performance further, we would like to decrease the CPI to less than 1. This cannot happen if we can issue only one instruction per cycle.
Multiple-issue processors:
– superscalar
– VLIW

Code Example
Loop: LD F0, 0(R1)    ; F0 = array element
      ADDD F4, F0, F2 ; add scalar in F2
      SD 0(R1), F4    ; store result
      SUBI R1, R1, #8 ; decrement pointer (8 bytes per double)
      BNEZ R1, Loop   ; branch if R1 != zero
Latencies (stall cycles between a producing and a using instruction, from the text):
– FP ALU op → another FP ALU op: 3
– FP ALU op → store double: 2
– Load double → FP ALU op: 1
– Load double → store double: 0

Superscalar DLX
Can issue two instructions per cycle:
– one integer (including load/store/branch)
– one FP
To make this worthwhile, we need either multiple FP units or pipelined FP units.
This restriction simplifies the implementation (e.g., the opcode can be used to detect the issue restriction, as the sketch below illustrates).
There is extra difficulty with a simultaneous load/store and FP operation (contention for the register datapath).
Need to fetch and decode 64 bits of instructions per cycle:
– assume they are aligned on 64-bit boundaries
– the integer instruction comes first
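A sketch (mine) of that issue restriction: pair one integer instruction (including load/store/branch) with one FP operation, integer first, detected from the opcodes of the two aligned instructions.

```python
INT_OPS = {"LD", "SD", "SUBI", "BNEZ"}     # integer, load/store, branch
FP_OPS = {"ADDD", "SUBD", "MULTD", "DIVD"}

def can_dual_issue(first_op, second_op):
    """first_op/second_op: opcodes of two instructions fetched together."""
    return first_op in INT_OPS and second_op in FP_OPS

print(can_dual_issue("LD", "ADDD"))   # True: the pairing used in the schedule below
print(can_dual_issue("ADDD", "LD"))   # False: the integer instruction must come first
print(can_dual_issue("LD", "SD"))     # False: only one integer slot per cycle
```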

Superscalar Scheduling
       Integer instruction    FP instruction       Clock cycle
Loop:  LD F0,0(R1)                                  1
       LD F6,-8(R1)                                 2
       LD F10,-16(R1)         ADDD F4,F0,F2         3
       LD F14,-24(R1)         ADDD F8,F6,F2         4
       LD F18,-32(R1)         ADDD F12,F10,F2       5
       SD 0(R1),F4            ADDD F16,F14,F2       6
       SD -8(R1),F8           ADDD F20,F18,F2       7
       SD -16(R1),F12                               8
       SUBI R1,R1,#40                               9
       SD 16(R1),F16                                10
       BNEZ R1,Loop                                 11
       SD 8(R1),F20                                 12

Dynamic Scheduling
Per-instruction Issue/Ex/Mem/WB cycle timing for two iterations of the loop (LD F0,0(R1); ADDD F4,F0,F2; SD 0(R1),F4; SUBI R1,R1,#8; BNEZ R1,Loop) under Tomasulo's algorithm.

VLIW Scheduling
Each instruction contains:
– two memory references
– two FP operations
– one integer or branch operation
In the following example:
– 7 loop iterations complete in 9 cycles
– 23 operations (about 2.5 ops/cycle)
– 60% efficiency
– needs extra registers

VLIW Scheduling
Mem ref 1         Mem ref 2         FP op 1           FP op 2           Int/branch
LD F0,0(R1)       LD F6,-8(R1)
LD F10,-16(R1)    LD F14,-24(R1)
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2     ADDD F8,F6,F2
LD F26,-48(R1)                      ADDD F12,F10,F2   ADDD F16,F14,F2
                                    ADDD F20,F18,F2   ADDD F24,F22,F2
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2
SD -16(R1),F12    SD -24(R1),F16
SD -32(R1),F20    SD -40(R1),F24                                        SUBI R1,R1,#56
SD 8(R1),F28                                                            BNEZ R1,Loop
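The schedule above encoded as data (a sketch), which lets us check the operation counts claimed on the previous slide; each bundle holds the slots (mem1, mem2, fp1, fp2, int/branch), with None marking an empty slot.

```python
bundles = [
    ("LD F0,0(R1)",    "LD F6,-8(R1)",   None,              None,              None),
    ("LD F10,-16(R1)", "LD F14,-24(R1)", None,              None,              None),
    ("LD F18,-32(R1)", "LD F22,-40(R1)", "ADDD F4,F0,F2",   "ADDD F8,F6,F2",   None),
    ("LD F26,-48(R1)", None,             "ADDD F12,F10,F2", "ADDD F16,F14,F2", None),
    (None,             None,             "ADDD F20,F18,F2", "ADDD F24,F22,F2", None),
    ("SD 0(R1),F4",    "SD -8(R1),F8",   "ADDD F28,F26,F2", None,              None),
    ("SD -16(R1),F12", "SD -24(R1),F16", None,              None,              None),
    ("SD -32(R1),F20", "SD -40(R1),F24", None,              None,              "SUBI R1,R1,#56"),
    ("SD 8(R1),F28",   None,             None,              None,              "BNEZ R1,Loop"),
]

ops = sum(op is not None for bundle in bundles for op in bundle)
print(ops, len(bundles), round(ops / len(bundles), 2))  # 23 ops in 9 cycles, ~2.56/cycle
```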