Lecture 6: ILP HW Case Study— CDC 6600 Scoreboard & Tomasulo’s Algorithm Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

Slides:

Advertisements

Similar presentations

CMSC 611: Advanced Computer Architecture Tomasulo Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.

Advertisements

Scoreboarding & Tomasulos Approach Bazat pe slide-urile lui Vincent H. Berk.

Instruction-level Parallelism Compiler Perspectives on Code Movement dependencies are a property of code, whether or not it is a HW hazard depends on.

Lec18.1 Step by step for Dynamic Scheduling by reorder buffer Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron)

A scheme to overcome data hazards

FTC.W99 1 Advanced Pipelining and Instruction Level Parallelism (ILP) ILP: Overlap execution of unrelated instructions gcc 17% control transfer –5 instructions.

Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.

Eliminating Stalls Using Compiler Support. Instruction Level Parallelism gcc 17% control transfer –5 instructions + 1 branch –Reordering among 5 instructions.

EEL Advanced Pipelining and Instruction Level Parallelism Lotzi Bölöni.

COMP25212 Advanced Pipelining Out of Order Processors.

CS152 Lec15.1 Advanced Topics in Pipelining Loop Unrolling Super scalar and VLIW Dynamic scheduling.

CMSC 611: Advanced Computer Architecture Scoreboard Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.

Static Scheduling for ILP Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.

1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.

1 Lecture 5 Branch Prediction (2.3) and Scoreboarding (A.7)

Review of CS 203A Laxmi Narayan Bhuyan Lecture2.

1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.

COMP381 by M. Hamdi 1 Pipelining (Dynamic Scheduling Through Hardware Schemes)

1 Recap (Scoreboarding). 2 Dynamic Scheduling Dynamic Scheduling by Hardware – – Allow Out-of-order execution, Out-of-order completion – – Even though.

ENGS 116 Lecture 71 Scoreboarding Vincent H. Berk October 8, 2008 Reading for today: A.5 – A.6, article: Smith&Pleszkun FRIDAY: NO CLASS Reading for Monday:

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 5, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Scoreboarding)

COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism.

CSC 4250 Computer Architectures October 13, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation.

Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)

1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.

Out-of-order execution: Scoreboarding and Tomasulo Week 2

Lecture 5: Pipelining & Instruction Level Parallelism Professor Alvin R. Lebeck Computer Science 220 Fall 2001.

1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.

Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.

2/24; 3/1,3/11 (quiz was 2/22, QuizAns 3/8) CSE502-S11, Lec ILP 1 Tomasulo Organization FP adders Add1 Add2 Add3 FP multipliers Mult1 Mult2 From.

1 Images from Patterson-Hennessy Book Machines that introduced pipelining and instruction-level parallelism. Clockwise from top: IBM Stretch, IBM 360/91,

CSC 4250 Computer Architectures September 29, 2006 Appendix A. Pipelining.

Chapter 3 Instruction Level Parallelism Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2011 Computer Applications Text book slides: Computer Architec ture:

04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;

Review Professor Alvin R. Lebeck Compsci 220 / ECE 252 Fall 2006.

CIS 662 – Computer Architecture – Fall Class 11 – 10/12/04 1 Scoreboarding  The following four steps replace ID, EX and WB steps  ID: Issue –

COMP25212 Advanced Pipelining Out of Order Processors.

Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm 吳俊興高雄大學資訊工程學系 October 2004 EEF011 Computer Architecture 計算機結構.

IBM System 360. Common architecture for a set of machines

/ Computer Architecture and Design

Out of Order Processors

Step by step for Tomasulo Scheme

Tomasulo Loop Example Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1

CS203 – Advanced Computer Architecture

Lecture 6 Score Board And Tomasulo’s Algorithm

Advantages of Dynamic Scheduling

High-level view Out-of-order pipeline

11/14/2018 CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, Electrical and Computer.

CMSC 611: Advanced Computer Architecture

A Dynamic Algorithm: Tomasulo’s

COMP s1 Seminar 3: Dynamic Scheduling

Out of Order Processors

Last Week Talks Any feedback from the talks? What did you like?

John Kubiatowicz (http.cs.berkeley.edu/~kubitron)

CS 704 Advanced Computer Architecture

Checking for issue/dispatch

Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)

CSCE430/830 Computer Architecture

September 20, 2000 Prof. John Kubiatowicz

Tomasulo Organization

Reduction of Data Hazards Stalls with Dynamic Scheduling

Lecture 5 Scoreboarding: Enforce Register Data Dependence

CS152 Computer Architecture and Engineering Lecture 16 Compiler Optimizations (Cont) Dynamic Scheduling with Scoreboards.

Scoreboarding ENGS 116 Lecture 7 Vincent H. Berk October 5, 2005

John Kubiatowicz (http.cs.berkeley.edu/~kubitron)

September 20, 2000 Prof. John Kubiatowicz

CS252 Graduate Computer Architecture Lecture 6 Introduction to Advanced Pipelining: Out-Of-Order Execution John Kubiatowicz Electrical Engineering and.

High-level view Out-of-order pipeline

Lecture 7 Dynamic Scheduling

Conceptual execution on a processor which exploits ILP

Presentation transcript:

Lecture 6: ILP HW Case Study— CDC 6600 Scoreboard & Tomasulo’s Algorithm Professor Alvin R. Lebeck Computer Science 220 Fall 2001

2 © Alvin R. Lebeck 1999 CPS 220 Admin HW #2 Project Selection by October 2 –Your own ideas? Short proposal due October 2 –Content: problem definition, goal of project, metric for success –3 - 5 page document – minute presentation Status report due November 1. –document only Final report due December 6 –8-10 page document –15-20 minute presentation

3 © Alvin R. Lebeck 1999 CPS 220 Review: ILP Instruction Level Parallelism in SW or HW Loop level parallelism is easiest to see Today SW parallelism dependencies defined for program, hazards if HW cannot resolve dependencies SW dependencies/Compiler sophistication determine if compiler can unroll loops –Memory dependencies hardest to determine

4 © Alvin R. Lebeck 1999 CPS 220 Review: FP Loop Showing Stalls Rewrite code to minimize stalls? InstructionInstructionLatency in producing resultusing result clock cycles FP ALU opAnother FP ALU op3 FP ALU opStore double2 Load doubleFP ALU op1 1 Loop:LDF0,0(R1);F0=vector element 2stall 3 ADDDF4,F0,F2;add scalar in F2 4stall 5stall 6 SD0(R1),F4;store result 7 SUBIR1,R1,8;decrement pointer 8B (DW) 8 BNEZR1,Loop;branch R1!=zero 9stall ;delayed branch slot

5 © Alvin R. Lebeck 1999 CPS 220 What assumptions made when moved code? –OK to move store past SUBI even though changes register –OK to move loads before stores: get right data? –When is it safe for compiler to do such changes? 1 Loop:LDF0,0(R1) 2 LDF6,-8(R1) 3 LDF10,-16(R1) 4 LDF14,-24(R1) 5 ADDDF4,F0,F2 6 ADDDF8,F6,F2 7 ADDDF12,F10,F2 8 ADDDF16,F14,F2 9 SD0(R1),F4 10 SD-8(R1),F8 11 SD-16(R1),F12 12 SUBIR1,R1,#32 13 BNEZR1,LOOP 14 SD8(R1),F16; 8-32 = clock cycles, or 3.5 per iteration Review: Unrolled Loop That Minimizes Stalls

6 © Alvin R. Lebeck 1999 Review: Hazard Detection Assume all hazard detection in ID stage 1.Check for structural hazards. 2.Check for RAW data hazard. 3.Check for WAW data hazard. If any occur stall at ID stage This is called an in-order issue/execute machine, if any instruction stalls all later instructions stall. – Note that instructions may complete execution out of order.

7 © Alvin R. Lebeck 1999 Can we do better? Problem: Stall in ID stage if any data hazard. Your task: Teams of two, propose a design to eliminate these stalls. MULD F2, F3, F4Long latency… ADDD F1, F2, F3 ADDD F3, F4, F5 ADDD F1, F4, F5

8 © Alvin R. Lebeck 1999 CPS 220 HW Schemes: Instruction Parallelism Why in HW at run time? –Works when can’t know dependencies –Simpler Compiler –Code for one machine runs well on another machine Key Idea: Allow instructions behind stall to proceed DIVDF0, F2, F4 ADD F10, F0, F8 SUBD F8, F8, F14 –Enables out-of-order execution => out-of-order completion –ID stage check for both structural & data dependencies

9 © Alvin R. Lebeck 1999 CPS 220 HW Schemes: Instruction Parallelism Out-of-order execution divides ID stage: 1. Issue: decode instructions, check for structural hazards 2. Read: operands wait until no data hazards, then read operands Scoreboards allow instruction to execute whenever 1 & 2 hold, not waiting for prior instructions

10 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Implications Out-of-order completion => WAR, WAW hazards? Solutions for WAR –Queue both the operation and copies of its operands –Read registers only during Read Operands stage For WAW, must detect hazard: stall until other completes Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units Scoreboard keeps track of dependencies, state or operations Scoreboard replaces ID, EX, WB with 4 stages

11 © Alvin R. Lebeck 1999 CPS 220 Four Stages of Scoreboard Control 1. Issue: decode instructions & check for structural hazards (ID1) If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared. 2. Read operands: wait until no data hazards, then read operands (ID2) A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

12 © Alvin R. Lebeck 1999 CPS 220 Four Stages of Scoreboard Control 3. Execution: operate on operands The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. 4. Write Result: finish execution (WB) Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction. Example: DIVDF0,F2,F4 ADDDF10,F0,F8 SUBDF8,F8,F14 CDC 6600 scoreboard would stall SUBD until ADDD reads operands

13 © Alvin R. Lebeck 1999 CPS 220 Three Parts of the Scoreboard 1. Instruction status: which of 4 steps the instruction is in 2. Functional unit status: Indicates the state of the functional unit (FU). 9 fields for each functional unit Busy--Indicates whether the unit is busy or not Op--Operation to perform in the unit (e.g., + or -) Fi--Destination register Fj, Fk--Source-register numbers Qj, Qk--Functional units producing source registers Fj, Fk Rj, Rk--Flags indicating when Fj, Fk are ready 3. Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register

14 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 1

15 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 2

16 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 3

17 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 4

18 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 5

19 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 6

20 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 7

21 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 8a

22 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 8b

23 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 9

24 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 11

25 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 12

26 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 13

27 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 14

28 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 15

29 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 16

30 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 17

31 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 18

32 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 19

33 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 20

34 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 21

35 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle cycle Divide

36 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Example Cycle 61

37 © Alvin R. Lebeck 1999 CPS 220 Scoreboard Summary Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) Limitations of 6600 scoreboard –No forwarding –Limited to instructions in basic block (small window) –Number of functional units (structural hazards) –Wait for WAR hazards –Prevent WAW hazards How to design a datapath that eliminates these problems?

38 © Alvin R. Lebeck 1999 CPS 220 Tomasulo’s Algorithm: Another Dynamic Scheme For IBM 360/91 about 3 years after CDC 6600 Goal: High Performance without special compilers Differences between IBM 360 & CDC 6600 ISA –IBM has only 2 register specifiers/instr vs. 3 in CDC 6600 –IBM has 4 FP registers vs. 8 in CDC 6600 Differences between Tomasulo Algorithm & Scoreboard –Control & buffers distributed with Function Units vs. centralized in scoreboard; called “reservation stations” –Register specifiers in instructions replaced by pointers to reservation station buffer (Everything can be solved with level of indirection!) –HW renaming of registers to avoid WAR, WAW hazards –Common Data Bus broadcasts results to all FUs –Load and Stores treated as FUs as well

39 © Alvin R. Lebeck 1999 Tomasulo Organization FP adders FP multipliers To Memory From Memory Load Buffers Store Buffers FP Registers From Instruction Unit FP op queue Operand Bus Common Data Bus (CDB)

40 © Alvin R. Lebeck 1999 CPS 220 Op—Operation to perform in the unit (e.g., + or –) Qj, Qk—Reservation stations producing source registers Vj, Vk—Value of Source operands Rj, Rk—Flags indicating when Vj, Vk are ready Busy—Indicates reservation station and FU is busy Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register. Reservation Station Components

41 © Alvin R. Lebeck 1999 CPS Issue—get instruction from FP Op Queue If reservation station free, the scoreboard issues instr & sends operands (renames registers). 2.Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch CDB for result 3.Write result—finish execution (WB) Write on Common Data Bus to all awaiting units; mark reservation station available. Three Stages of Tomasulo Algorithm

42 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 0

43 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 1 Yes

44 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 2

45 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 3

46 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 4

47 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 5

48 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 6

49 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 7

50 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 8

51 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 9

52 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 10 6

53 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 11

54 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 12

55 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 13

56 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 14

57 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 15

58 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 16

59 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 17

60 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 18

61 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 57

62 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 58

63 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Example Cycle 59

64 © Alvin R. Lebeck 1999 Tomasulo vs. Scoreboard Is tomasulo better? Finish in 59 cycles vs. 61 for scoreboard, why? We do reach the divide 3 cycles earlier… Simultaneous read of operand for SUBD and MULT

65 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Loop Example Loop:LDF00R1 MULTDF4F0F2 SDF40R1 SUBIR1R1#8 BNEZR1Loop Multiply takes 4 clocks Loads may have cache misses

66 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 0

67 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 1

68 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 2

69 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 3

70 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 4

71 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 5

72 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 6

73 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 7

74 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 8

75 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 9

76 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 10

77 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 11

78 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 12

79 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 13

80 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 14

81 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 15

82 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 16

83 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 17

84 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 18

85 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 19

86 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 20

87 © Alvin R. Lebeck 1999 CPS 220 Loop Example Cycle 21

88 © Alvin R. Lebeck 1999 CPS 220 Tomasulo Summary Prevents Register as bottleneck Avoids WAR, WAW hazards of Scoreboard Allows loop unrolling in HW Not limited to basic blocks (provided branch prediction) Lasting Contributions –Dynamic scheduling –Register renaming –Load/store disambiguation

89 © Alvin R. Lebeck 1999 Next Time Dynamic Branch Prediction