Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91)

1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91)
Please recall:
- Multicycle instructions lead to the requirement of out-of-order execution.
- Control flow scheduling, performed centrally at decode time: ==> scoreboarding technique, implemented in the CDC 6600.
- Dataflow scheduling, performed in a distributed manner by the FUs themselves at execute time: instructions are decoded and issued to reservation stations, where they await their operands. ==> The Tomasulo scheme of the IBM System/360 Model 91 processor is the basis of modern superscalar processors.

2 Scoreboard Summary
- Main advantages:
  - managing multiple FUs
  - out-of-order execution of multi-cycle operations
  - maintaining all data dependences (RAW, WAW, WAR)
- Scoreboard limitations:
  - single-issue scheme (although the scheme is extendable to multiple issue)
  - in-order issue
  - no renaming: antidependences and output dependences may lead to WAR and WAW stalls
  - no forwarding hardware: all results go through the registers
- General limitations (not only valid for scoreboarding):
  - number and types of FUs, since contention for FUs leads to structural hazards
  - the amount of parallelism available in the code (dependences lead to stalls)
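The three dependence types named above can be made concrete with a small sketch. The `Instr` record and the `hazards` helper are illustrative names introduced here, not part of the lecture, and the two instructions are a made-up example.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def hazards(first: Instr, second: Instr) -> set:
    """Return the dependences of `second` on the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")    # true data dependence
    if second.dest in first.srcs:
        found.add("WAR")    # antidependence (name dependence)
    if first.dest == second.dest:
        found.add("WAW")    # output dependence (name dependence)
    return found


# Hypothetical pair: div F0 <- F2, F4 followed by add F2 <- F0, F8
div = Instr(dest="F0", srcs=("F2", "F4"))
add = Instr(dest="F2", srcs=("F0", "F8"))
print(sorted(hazards(div, add)))   # ['RAW', 'WAR']
```

Only the RAW dependence carries data; the WAR dependence exists purely because the name F2 is reused, which is what renaming removes.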

3 The Tomasulo scheme removes some of the scoreboard limitations by forwarding and renaming hardware, but it is still single-issue and in-order issue.

4 Register Renaming
- A name dependence occurs when two instructions Inst 1 and Inst 2 use the same register (or memory location), but no data is transmitted between Inst 1 and Inst 2.
- If the register is renamed so that Inst 1 and Inst 2 do not conflict, the two instructions can execute simultaneously or be reordered.
- The technique that eliminates name dependences in registers to avoid WAR and WAW hazards is called register renaming.
- Register renaming can be done statically (by the compiler) or dynamically (by the hardware).
- Tomasulo's algorithm performs register renaming in hardware!
- Dynamic renaming of memory locations is much harder to perform! Why? Pointer aliasing problems.
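A minimal sketch of the renaming idea, assuming an unlimited supply of fresh names: every write receives a new tag, and every read uses the most recent tag of its source register. The `rename` function, the `T0, T1, ...` tag names and the three-instruction program are hypothetical; in Tomasulo's scheme the reservation station tags play the role of the fresh names.

```python
from itertools import count


def rename(program):
    """program: list of (dest, src1, src2) architectural register names."""
    fresh = count()     # source of new, hypothetical tag numbers
    current = {}        # architectural register -> latest tag
    renamed = []
    for dest, src1, src2 in program:
        # Sources read the most recent name, which preserves RAW dependences.
        s1 = current.get(src1, src1)
        s2 = current.get(src2, src2)
        # Every write gets a fresh name, which removes WAR and WAW conflicts.
        tag = f"T{next(fresh)}"
        current[dest] = tag
        renamed.append((tag, s1, s2))
    return renamed


program = [
    ("F0", "F2", "F4"),   # I1: F0 <- F2 op F4
    ("F2", "F0", "F8"),   # I2: F2 <- F0 op F8  (RAW on F0, WAR on F2)
    ("F0", "F6", "F8"),   # I3: F0 <- F6 op F8  (WAW on F0 with I1)
]
for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, I2 still reads the result of I1 (now called `T0`), but I3 writes `T2` instead of reusing the name F0, so it no longer conflicts with I1 or I2.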

5 Tomasulo Algorithm
- Developed for the IBM 360/91 in 1967 (about 3 years after the CDC 6600).
- Hazard detection and execution control are distributed among the functional units (vs. centralized in the scoreboard).
- Reservation stations at each functional unit control when an instruction can begin execution at that unit.
- The Common Data Bus (CDB) broadcasts results to all reservation stations (of all FUs).
- Loads and stores are treated as FUs as well.
- Each register has additional flags.

Tomasulo Organization

7 Reservation Station Components
- Each FU has one or more reservation stations.
- The reservation station holds:
  - instructions that have been issued and are awaiting execution at a functional unit,
  - the operands for that instruction if they have already been computed (or the source of the operands otherwise),
  - the information needed to control the instruction once it has begun execution.
- The reservation stations buffer the operands of instructions waiting to execute, eliminating the need to get the operands from registers (similar to forwarding).
- The source operand fields store register values (scoreboarding: only pointers to the registers!) or pointers to the reservation stations that will produce the values.
- WAR hazards are avoided because an operand is already stored in the reservation station even when a write to the same register is performed out of order.
- WAW hazards are avoided because pointers to reservation stations, instead of register pointers, are used as tags on the CDB.

8 Reservation Station Entries
- Empty: Indicates whether the reservation station is empty
- InFU: Indicates that the instruction is executing in the FU; remains set until completion
- Op: Operation to perform in the unit (e.g., + or –)
- Dest: Tag of the reservation station
- Src1, Src2: Values of the source operands
- RS1, RS2: Tags of the reservation stations producing the source operands
- Vld1, Vld2: Valid flags indicating whether the values are available
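The entry layout can be written down as a record. This is only a sketch: the field names follow the list above, while the Python types, the default values, the `ready` helper and the tag names `Add1` and `Mul1` are assumptions made for illustration. `dest` is read here as the tag under which the station's result will appear on the CDB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    dest: str                      # Dest: tag identifying this station's result
    empty: bool = True             # Empty: no instruction is held
    in_fu: bool = False            # InFU: instruction is executing in the FU
    op: Optional[str] = None       # Op: operation to perform, e.g. "+" or "-"
    src1: Optional[float] = None   # Src1/Src2: operand values, once known
    src2: Optional[float] = None
    rs1: Optional[str] = None      # RS1/RS2: tags of the producing stations
    rs2: Optional[str] = None
    vld1: bool = False             # Vld1/Vld2: operand value is available
    vld2: bool = False

    def ready(self) -> bool:
        """The instruction can be dispatched once both operands are valid."""
        return not self.empty and not self.in_fu and self.vld1 and self.vld2


# Hypothetical entry: an add whose second operand is still being computed
# by the station tagged "Mul1".
add1 = ReservationStation(dest="Add1", empty=False, op="+",
                          src1=1.5, vld1=True, rs2="Mul1")
print(add1.ready())   # False: still waiting for Mul1

# Once the value arrives, the tag is cleared and the valid flag is set.
add1.src2, add1.rs2, add1.vld2 = 6.0, None, True
print(add1.ready())   # True
```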

Tomasulo Organization

10 CDB and Reservation Stations
- When an instruction from an RS completes, a result token is formed and passed on the common data bus (CDB) to the register file and, by snooping, directly to all RSs (thus eliminating the need to get the operand value from a register).
- The traffic passing on the CDB is continually monitored.
- A result on the CDB is copied into all RSs awaiting it.
- The CDB allows all units that are waiting for an operand to be loaded simultaneously. Hence, the RS fetches and buffers an operand as soon as it becomes available (dataflow principle).
- The load buffers and load/store reservation stations hold data or addresses coming from and going to memory.
- Register result status in the register set: indicates which reservation station will write each register, if one exists. Blank when no pending instruction will write that register.
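The snooping behaviour can be sketched as a single broadcast step. Stations are modelled as plain dictionaries with the field names of the reservation station entries; the `broadcast` function, the tags `Mul1`, `Add1`, `Add2` and all values are hypothetical.

```python
def broadcast(tag, value, stations, reg_status, reg_file):
    """Deliver the result token (tag, value) to every waiting consumer."""
    # Every reservation station compares the tag with its pending operands.
    for rs in stations:
        if rs["rs1"] == tag:
            rs["src1"], rs["rs1"], rs["vld1"] = value, None, True
        if rs["rs2"] == tag:
            rs["src2"], rs["rs2"], rs["vld2"] = value, None, True
    # The register file accepts the value only where this tag is still
    # recorded as the pending writer (register result status).
    for reg, producer in reg_status.items():
        if producer == tag:
            reg_file[reg] = value
            reg_status[reg] = None   # blank: no pending writer any more


stations = [
    {"rs1": "Mul1", "src1": None, "vld1": False,
     "rs2": None, "src2": 3.0, "vld2": True},
    {"rs1": None, "src1": 1.0, "vld1": True,
     "rs2": "Mul1", "src2": None, "vld2": False},
]
reg_status = {"F0": "Mul1", "F2": "Add1"}
reg_file = {"F0": 0.0, "F2": 0.0}

broadcast("Mul1", 8.0, stations, reg_status, reg_file)
print(stations[0]["src1"], stations[1]["src2"], reg_file["F0"])   # 8.0 8.0 8.0
```

Both waiting stations receive the value in the same step, and F2 is left untouched because its pending writer is a different station.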

11 Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the instruction queue. If a reservation station is free, the Tomasulo algorithm issues the instruction and fetches its operands from the registers if possible. ==> In-order issue!
2. Execution: operate on operands (EX). When both operands are ready, dispatch to the FU and execute; if not ready, watch the CDB for the result (check for RAWs). ==> Out-of-order dispatch and execution!
3. Write result: finish execution (WB). Write on the Common Data Bus to all awaiting units; mark the reservation station available.
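The three stages can be sketched as a cycle-by-cycle timing model. It makes several simplifying assumptions that are not from the lecture: one shared pool of reservation stations, no limit on the number of FUs, and the oldest finished instruction wins the single CDB. The latencies are those assumed in the lecture's scheduling example (mul and div 4 EX cycles, add and sub 1 EX cycle). An instruction issued in the cycle in which its operand is written back finds the register already updated, which models snooping the CDB at issue.

```python
# EX latencies as assumed in the scheduling example of this lecture.
LATENCY = {"mul": 4, "div": 4, "add": 1, "sub": 1}


def simulate(program, num_stations=3):
    """Return {instruction index: {"issue", "ex_start", "wb"}} cycle numbers.

    program is a list of (op, dest, src1, src2) tuples.
    """
    reg_status = {}   # register -> index of the instruction that will write it
    stations = {}     # instruction index -> occupied reservation station
    timing = {}
    pc = cycle = 0
    while pc < len(program) or stations:
        cycle += 1
        # Instructions that finished EX in an earlier cycle may write now.
        finished = sorted(i for i, s in stations.items() if s["remaining"] == 0)

        # 2. Execute: every station whose operands are ready makes progress.
        for i, s in stations.items():
            if not s["waiting"] and s["remaining"] > 0:
                timing[i].setdefault("ex_start", cycle)
                s["remaining"] -= 1

        # 3. Write result: the oldest finished instruction gets the CDB.
        if finished:
            tag = finished[0]
            for s in stations.values():
                s["waiting"].discard(tag)      # consumers snoop the CDB
            for reg in [r for r, p in reg_status.items() if p == tag]:
                del reg_status[reg]            # register now holds the value
            timing[tag]["wb"] = cycle
            del stations[tag]                  # free the reservation station

        # 1. Issue: in program order, only if a station is free.
        if pc < len(program) and len(stations) < num_stations:
            op, dest, *srcs = program[pc]
            waiting = {reg_status[r] for r in srcs if r in reg_status}
            stations[pc] = {"waiting": waiting, "remaining": LATENCY[op]}
            reg_status[dest] = pc              # dest is renamed to this tag
            timing[pc] = {"issue": cycle}
            pc += 1
    return timing


program = [
    ("mul", "F0", "F2", "F4"),     # long-running producer of F0
    ("add", "F6", "F0", "F8"),     # RAW on F0: must wait for the mul
    ("sub", "F8", "F10", "F12"),   # independent of the mul
]
for index, cycles in sorted(simulate(program).items()):
    print(program[index][0], cycles)
```

In this run the instructions issue in order, but the independent sub writes its result before the earlier mul does, while the add starts executing only after the mul has broadcast F0.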

12 Tomasulo Scheduling
We assume: mul and div need 4 EX cycles, sub and add need 1 EX cycle.

13 Tomasulo Scheduling

14 Tomasulo Scheduling

15 Tomasulo Scheduling

16 Tomasulo Scheduling
sub writes its result on the CDB and frees its RS; add is issued to RS 2 and gets the result from the CDB in the same cycle.

17 Tomasulo Scheduling

18 Tomasulo Scheduling
- add and mul complete in the same cycle and compete for the CDB; add gets the CDB, mul is deferred.
- Please note the WAR hazard, which is resolved automatically: add updates Reg4 before div starts executing; however, div has already stored the previous value in its reservation station (this only works with in-order issue!).
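The WAR case can be reduced to a few lines. This is a sketch with made-up values; only the roles of div, add and Reg4 come from the slide, and `Reg1` is a hypothetical second operand register.

```python
regs = {"Reg1": 12.0, "Reg4": 3.0}

# Issue of div: the operand values that are already available are copied
# into the reservation station, not merely referenced.
div_rs = {"op": "/", "src1": regs["Reg1"], "src2": regs["Reg4"]}

# The later add completes first and overwrites Reg4 in the register file.
regs["Reg4"] = 100.0

# When div finally executes, it still uses the value captured at issue.
result = div_rs["src1"] / div_rs["src2"]
print(result)   # 4.0, computed from the old Reg4 value 3.0
```

The copy is safe only because div was issued before add: with in-order issue, the value read at issue time is the one div is meant to see.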

19 Tomasulo Scheduling

20 Tomasulo Scheduling

21 Tomasulo Scheduling

22 Tomasulo Scheduling

23 Tomasulo Scheduling

24 Tomasulo Scheduling

25 Comment on the Original Tomasulo Scheme
- In the original Tomasulo scheme, the CDB is reserved at least two cycles in advance:
  - each instruction stays at least two cycles in the EX phase
  - CDB resource conflicts are resolved at CDB reservation time (before execution)
- In contrast, we assume that CDB resource conflicts are resolved in the WB stage (see cycle 6 in the example).
- What happens when an instruction is issued and one of its operands is on the CDB in the same cycle? This is unclear in the original Tomasulo paper! We assume that the instruction already snoops the CDB in the issue phase (see cycle 4 in the example).

26 Tomasulo Summary
- Prevents the register file from becoming a bottleneck (forwarding from the CDB to the reservation stations)
- Avoids WAR and WAW hazards
- Not limited to basic blocks (provided branch prediction is available)
- Lasting contributions:
  - dynamic scheduling
  - register renaming in reservation stations
- However: single-issue, in-order issue scheme!
- Implementation in the IBM 360/91

27 IBM 360/91
- Belongs to the IBM System/360 family, whose models all share the same ISA.
- The IBM System/360 Model 91 was deeply pipelined (overall pipeline length was 20 stages).
- Floating-point execution unit: two separate, fully pipelined floating-point FUs, the adder and the multiplier/divider. The FUs could be used concurrently.
- Addition took two cycles, multiplication three cycles, and division eleven cycles.
- Three reservation stations (RS) were associated with the adder, and two with the multiplier/divider.
- Speculative branch prediction was used: a branch was predicted taken when the branch target instruction was within the last eight instructions.
- Memory had a 10-cycle access time; it was fully buffered and 32-way interleaved. The processor could have up to 32 memory accesses pending to reduce latency.
- But there was no cache.

IBM 360/91
[Figure: block diagram of the floating-point unit. Labeled components: Floating-Point Buffers (FLB), Floating-Point Operating Stack, Floating-Point Registers (FLR), Decoder, Reservation Stations, Add Unit, Multiply/Divide Unit, Common Data Bus (CDB). Labeled connections: From Instruction Unit, From Store Unit, To Store Unit.]

29 IBM 360/91 Implementation Details
- The processor had about gates implemented in ECL technology with a 60 ns basic CPU clock.
- IBM produced about 12 of the IBM System/360 Model 91 and perhaps twice that number of the Model 195 (which was based on the Model 91 but had a faster cycle and incorporated a cache).

30 Lessons Learned from CISC
- Modern processors use ideas from both the RISC and the CISC approaches.
- Out-of-order execution is not a new concept: it already existed twenty-five years ago on CISC machines, on the CDC 6600 as scoreboarding and on the IBM System/360 Model 91 as the Tomasulo scheme.
- Out-of-order scheduling is quite similar to dataflow and is referred to as micro dataflow by microprocessor researchers.
- Next: Chapter 4: Multiple-issue (Superscalar Processors)