Instruction-level Parallelism

Compiler Perspectives on Code Movement Dependences are a property of the code; whether a dependence becomes a hardware hazard depends on the given pipeline. The compiler must respect (true) data dependences (RAW): instruction j is data dependent on instruction i if instruction i produces a result used by instruction j, or if instruction j is data dependent on instruction k and instruction k is data dependent on instruction i (the dependence is transitive).

Compiler Perspectives on Code Movement Other kinds of dependence, also called name (false) dependences, arise when two instructions use the same name but do not exchange data:
- Antidependence (WAR): instruction j writes a register or memory location that instruction i reads, and instruction i executes first.
- Output dependence (WAW): instructions i and j write the same register or memory location; the ordering between them must be preserved.
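
A minimal C sketch (hypothetical variables, not from the slides) showing all three dependence kinds in straight-line code:

void dependence_example(void) {
    int x = 1, y = 2, a, b;
    a = x + y;   /* I1: writes a                                        */
    b = a * 2;   /* I2: RAW  - reads a written by I1; must follow I1    */
    x = 0;       /* I3: WAR  - writes x, which I1 reads; I1 goes first  */
    a = 7;       /* I4: WAW  - writes a, also written by I1; keep order */
    (void)b;     /* silence unused-variable warnings */
}

Only I2 actually moves data; I3 and I4 merely reuse names and disappear once the compiler or hardware renames.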

Control Dependence Example:
    if (c1) I1;
    if (c2) I2;
I1 is control dependent on c1, and I2 is control dependent on c2 but not on c1.

A Sample Loop
Loop:  LD    F0, 0(R1)   ; F0 = array element, R1 points into X[]
       MULD  F4, F0, F2  ; multiply by the scalar in F2
       SD    0(R1), F4   ; store result
       ADDI  R1, R1, 8   ; increment pointer by 8 bytes (doubleword)
       SNE   R3, R1, R2  ; R2 = &X[1001]; R3 = (R1 != R2)
       BNEZ  R3, Loop    ; branch while R3 != 0
       NOP               ; delayed branch slot

Operation | Latency (stalls)
FP Mult   | 6 (5)
LD        | 2 (1)
Int ALU   | 1 (0)

Where are the dependences and stalls?
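
In C terms this is a scale-by-a-constant loop; a sketch follows, where the array name X, the scalar s, and the element count are read off the slide's comments and the exact indexing is an assumption:

/* Hypothetical C rendering of the loop above. */
void scale(double X[], int n, double s) {
    for (int i = 0; i < n; i++)
        X[i] = X[i] * s;   /* LD, MULD, SD per iteration */
}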

Instruction Scheduling
Loop:  LD    F0, 0(R1)
       MULD  F4, F0, F2
       SD    0(R1), F4
       ADDI  R1, R1, 8
       SNE   R3, R1, R2
       BNEZ  R3, Loop
       NOP

Number of cycles per iteration?

Instruction Scheduling
Before:
Loop:  LD    F0, 0(R1)
       MULD  F4, F0, F2
       SD    0(R1), F4
       ADDI  R1, R1, 8
       SNE   R3, R1, R2
       BNEZ  R3, Loop
       NOP

After scheduling:
Loop:  LD    F0, 0(R1)
       ADDI  R1, R1, 8
       MULD  F4, F0, F2
       SNE   R3, R1, R2
       BNEZ  R3, Loop
       SD    -8(R1), F4   ; fills the delay slot; offset adjusted for the earlier ADDI

Cycles per iteration?

Loop Unrolling
Loop:  LD    F0, 0(R1)
       ADDI  R1, R1, 8
       MULD  F4, F0, F2
       SNE   R3, R1, R2
       BNEZ  R3, Loop
       SD    -8(R1), F4

We can extract more parallelism by unrolling the loop.

Loop Unrolling
Scheduled loop:
Loop:  LD    F0, 0(R1)
       ADDI  R1, R1, 8
       MULD  F4, F0, F2
       SNE   R3, R1, R2
       BNEZ  R3, Loop
       SD    -8(R1), F4

Unrolled twice, naively:
Loop:  LD    F0, 0(R1)
       ADDI  R1, R1, 8
       MULD  F4, F0, F2
       SNE   R3, R1, R2
       BNEZ  R3, Loop
       SD    -8(R1), F4
       LD    F0, 0(R1)
       ADDI  R1, R1, 8
       MULD  F4, F0, F2
       SNE   R3, R1, R2
       BNEZ  R3, Loop
       SD    -8(R1), F4

What is the problem here?

Loop Unrolling (Same naively unrolled code as above.) The problem: unnecessary and redundant instructions — the first copy's SNE/BNEZ test is not needed, and the two pointer increments can be merged.

Loop Unrolling
After removing the redundant instructions and merging the increments:
Loop:  LD    F0, 0(R1)
       MULD  F4, F0, F2
       SD    0(R1), F4
       LD    F0, 8(R1)
       ADDI  R1, R1, 16
       MULD  F4, F0, F2
       SNE   R3, R1, R2
       BNEZ  R3, Loop
       SD    -8(R1), F4

Still problems with scheduling? Hint: both iterations reuse F0 and F4.

Register Renaming
Renaming the second iteration's registers (F0 to F10, F4 to F14) removes the name dependences:
Loop:  LD    F0, 0(R1)
       MULD  F4, F0, F2
       SD    0(R1), F4
       LD    F10, 8(R1)
       ADDI  R1, R1, 16
       MULD  F14, F10, F2
       SNE   R3, R1, R2
       BNEZ  R3, Loop
       SD    -8(R1), F14

Let's schedule now.

Register Renaming
Scheduled, renamed loop:
Loop:  LD    F0, 0(R1)
       LD    F10, 8(R1)
       MULD  F4, F0, F2
       MULD  F14, F10, F2
       ADDI  R1, R1, 16
       SNE   R3, R1, R2
       SD    -16(R1), F4   ; offset adjusted: the store now follows the ADDI
       BNEZ  R3, Loop
       SD    -8(R1), F14

Cycles per iteration?

How easy is it to determine dependences? Easy for registers (fixed names). Hard for memory:
- Does 100(R4) equal 20(R6)?
- From different loop iterations, does 20(R6) equal 20(R6)?
Another example — does the store conflict with the load?
    ST  R5, R6
    LD  R4, R3

Memory Disambiguation Problem: in many cases it is likely, but not certain, that two memory instructions reference different addresses. Disambiguation is much harder in languages with pointers. Example:

    void annoy_compiler1(char *foo, char *bar) {
        foo[2] = bar[2];
        bar[3] = foo[3];
    }

The memory references are independent unless foo and bar overlap (e.g., foo = bar).
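
A hypothetical call showing why the compiler cannot reorder the two statements when the pointers may overlap:

#include <stdio.h>

void annoy_compiler1(char *foo, char *bar) {
    foo[2] = bar[2];
    bar[3] = foo[3];
}

int main(void) {
    char buf[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    annoy_compiler1(buf + 1, buf);  /* foo and bar overlap */
    /* foo[2] and bar[3] are both buf[3]: in program order buf[3]
       ends up 4; executing the two statements in the other order
       would leave it 2. */
    printf("%d\n", buf[3]);         /* prints 4 */
    return 0;
}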

Disambiguation 2 Making things worse, some programs have independent memory references only some of the time. Example:

    void annoy_compiler2(int *a, int *b) {
        int i;
        for (i = 0; i < 256; i++) {
            a[i] = b[f(i)];
        }
    }

A conventional compiler must assume that any references that could be to the same location are to the same location, and serialize them.
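
One way out (a C99 aside, not on the slide): the programmer can assert non-aliasing with restrict, which entitles the compiler to reorder or overlap the references:

/* Sketch: restrict promises the compiler that a and b never alias,
   so reads of b[] may be hoisted past writes to a[]. */
void no_alias_copy(int * restrict a, const int * restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * 2;
}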

HW Schemes: Instruction Parallelism Why do this in hardware at run time?
- It works when dependences can't be known until run time: variable latencies, control-dependent data dependences.
- The hardware can schedule differently every time through the code.
- The compiler is simpler, and code compiled for one machine runs well on another.
Hardware technique to find/extract ILP: Tomasulo's algorithm for out-of-order execution.

Tomasulo's Algorithm Developed for the architecture of the IBM 360/91 (1967). The 360/91's goal was to significantly improve performance (especially floating point) without requiring people to change their code — sound familiar? The machine ran at roughly 16 MHz with 2 MB of memory, about 50X faster than the state of the art.

Tomasulo Organization
[Figure: the FP op queue and FP registers feeding reservation stations — Add1–Add3 on the FP adders, Mult1–Mult2 on the FP multipliers — with results returned on the Common Data Bus.]

Tomasulo Algorithm
- The Common Data Bus (CDB) broadcasts results to all functional units.
- Reservation stations (one set per FU), registers, etc. are each responsible for collecting their own data off the CDB.
- Load and store queues are treated as FUs as well.

Reservation Station Components
Op — operation to perform in the unit (e.g., + or –)
Qj, Qk — reservation stations producing the source operands
Vj, Vk — values of the source operands
Rj, Rk — flags indicating when Vj, Vk are ready
Busy — indicates the reservation station is busy
Register result status — indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register.
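
In code, one reservation-station entry might look like the following C sketch (all names are illustrative, not from the slides):

#include <stdbool.h>

/* One reservation station; a Q tag of 0 means "value already in V". */
typedef struct {
    bool   busy;     /* Busy: the station holds an instruction         */
    char   op;       /* Op: operation to perform, e.g. '+' or '-'      */
    int    Qj, Qk;   /* tags of the stations producing each source     */
    double Vj, Vk;   /* operand values, valid once the matching Q is 0 */
    bool   Rj, Rk;   /* flags: Vj / Vk ready                           */
} ReservationStation;

/* Register result status: for each FP register, the tag of the station
   that will write it; 0 means no pending writer. */
int reg_status[32];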

Three Stages of the Tomasulo Algorithm
1. Issue — get an instruction from the FP op queue. If a reservation station is free, the issue logic issues the instruction and sends the operands (renaming registers).
2. Execute — operate on the operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the missing result.
3. Write result — finish execution (WB). Broadcast the result on the Common Data Bus to all waiting units; mark the reservation station available.
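
The three stages as a minimal, self-contained C simulation sketch (names, tag scheme, and single-FU structure are simplifying assumptions, not the 360/91's design):

#include <stdbool.h>

typedef struct {
    bool   busy;
    int    Qj, Qk;     /* producer tags; 0 = operand value is in Vj/Vk */
    double Vj, Vk;
    int    tag;        /* tag this station broadcasts on the CDB       */
} RS;

double regs[32];
int    reg_status[32]; /* 0 = value in regs[], else tag of pending writer */

/* Stage 1, Issue: rename sources through reg_status, claim the dest. */
void issue(RS *rs, int tag, int src_j, int src_k, int dest) {
    rs->busy = true;
    rs->tag  = tag;
    rs->Qj = reg_status[src_j];
    if (rs->Qj == 0) rs->Vj = regs[src_j];
    rs->Qk = reg_status[src_k];
    if (rs->Qk == 0) rs->Vk = regs[src_k];
    reg_status[dest] = tag;          /* register renaming happens here */
}

/* Stage 2, Execute: a station may start once both tags have cleared. */
bool ready(const RS *rs) {
    return rs->busy && rs->Qj == 0 && rs->Qk == 0;
}

/* Stage 3, Write result: stations and registers snoop the CDB. */
void cdb_broadcast(RS stations[], int n, int tag, double value) {
    for (int i = 0; i < n; i++) {
        if (stations[i].Qj == tag) { stations[i].Vj = value; stations[i].Qj = 0; }
        if (stations[i].Qk == tag) { stations[i].Vk = value; stations[i].Qk = 0; }
    }
    for (int r = 0; r < 32; r++)
        if (reg_status[r] == tag) { regs[r] = value; reg_status[r] = 0; }
}

Note how WAR hazards disappear because issue copies ready values into Vj/Vk, and WAW hazards disappear because reg_status always points at the newest writer.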

Tomasulo Example
    ADDD  F4, F2, F0
    MULD  F8, F4, F2
    ADDD  F6, F8, F6
    SUBD  F8, F2, F0
    ADDD  F2, F8, F0

Multiply takes 10 clocks; add/sub take 4.

Tomasulo Example, Cycle by Cycle
[Slide sequence: snapshots at cycles 0, 1, 2, 3, 4, 5, 6, 8, 9, 12, 15, 16, and 19 of the instruction queue, the register result status for F0–F8, and the reservation stations (add1–add3 on the FP adders, mult1–mult2 on the FP multipliers), each entry showing Op, Qj, Qk, Vj, Vk, and Busy. The surviving annotations trace the flow: ADDD F4 issues to add1, MULD F8 to mult1, ADDD F6 to add2, SUBD F8 to add3; the CDB carries the add1 result at cycle 5, the add3 result at cycle 8, the result of ADDD F2 (reusing the freed add1) at cycle 12, the mult1 result at cycle 15, and the add2 result at cycle 19.]

Tomasulo Summary Prevents the register file from becoming a bottleneck; avoids WAR and WAW hazards.
Lasting contributions:
- Dynamic scheduling
- Register renaming (in what way does the register name change?)
- Load/store disambiguation

Limitations
- Exceptions/interrupts: with out-of-order completion you can't identify a single point in the program at which an interrupt/exception occurred, so how do you know where to resume after the handler completes?
- Interaction with pipelined ALUs: a reservation station cannot be released until its instruction completes, so many reservation stations would be needed.