EENG 449bG/CPSC 439bG Computer Systems
Lecture 10: Instruction Level Parallelism I
February 12, 2004 – Prof. Andreas Savvides – Spring 2004

Announcements
–Homeworks returned today; solutions available from the TA
–Midterm next Thursday: Chapters 1, 2, Appendix A, and 2 papers
»Paper on choosing a DSP processor
»Paper & lecture on Dynamic Voltage Scaling
–Lab office hours tomorrow 12:00 – 1:30: stop by AKW000 if you have problems starting your projects on the motes or OKI boards

Instruction Level Parallelism
Reading for this lecture: Chapter 3, pages 172 – 196
Chapter 3 covers ILP in hardware
Recall that ILP tries to minimize the stall terms of the pipeline CPI equation through the overlapped execution of instructions
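For reference, the equation behind those "terms" (the standard decomposition from Hennessy & Patterson, which these slides follow) is:

  Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

ILP techniques attack the three stall terms by overlapping the execution of independent instructions.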

Where is the maximal gain in ILP?
Basic block – a straight-line code sequence with no branches in except at the entry and no branches out except at the exit
Limited amount of parallelism within a basic block
–Instructions depend on each other, so they cannot be reordered
–In typical MIPS programs the dynamic branch frequency is between 15 – 25%, i.e. 4 – 7 instructions between a pair of branches
Need to exploit parallelism across multiple basic blocks
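As a quick sanity check on those numbers (simple arithmetic, not from the slides): if a fraction f of dynamically executed instructions are branches, straight-line runs average about 1/f instructions, so f = 0.25 gives 1/0.25 = 4 and f = 0.15 gives 1/0.15 ≈ 6.7 – consistent with the quoted 4 – 7 instructions between a pair of branches.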

Loops: an example of parallelism
for (i=1; i <= 1000; i=i+1)
  x[i] = x[i] + y[i];
Loop iterations can overlap – loop-level parallelism
Main technique – loop unrolling (sketched below)
–Can be done either in hardware or software
So what kind of dependences do we need to worry about?
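To make this concrete, here is a minimal C sketch of software loop unrolling (my example, not from the slides; the function names are illustrative and n is assumed to be a multiple of the unroll factor):

#include <stddef.h>

/* Original loop: one add and one loop branch per element. */
void add_arrays(double *x, const double *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + y[i];
}

/* Unrolled by 4: the four adds in the body are mutually independent,
 * so they can be scheduled to overlap, and the loop branch now
 * executes once per four elements instead of once per element. */
void add_arrays_unrolled(double *x, const double *y, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }
}

A compiler performs the same transformation at the assembly level, where it additionally renames registers so the unrolled copies of the body do not reuse the same temporaries.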

Data Dependences
An instruction j is data dependent on instruction i if:
–Instruction i produces a result that may be used by instruction j, or
–Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (the dependence is transitive)

Data Dependences
Data dependences are properties of programs
Whether a dependence results in a detected hazard and a stall is a property of the pipeline organization
A dependence can be overcome by:
–Maintaining the dependence and avoiding the hazard
–Transforming the code to eliminate the dependence

Detecting Data Dependences
Data values can flow through registers or memory
Data dependences that flow through registers are easy to detect
–Register names are fixed in the instructions, so checking is easy
–More complicated when branches intervene
Data dependences that flow through memory are harder to detect
–100(R4) and 20(R6) may point to the same memory location!!
This is a crucial aspect to consider in compiler techniques
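A small C illustration of why memory dependences are hard to detect (my example, not from the slides): whether the two statements in update() are dependent cannot be decided from the code alone – it depends on whether the pointers alias.

#include <stdio.h>

/* If p and q point to the same location, the second statement is data
 * dependent on the first and the two must not be reordered; if they
 * point to different locations, the statements are independent.
 * A compiler that cannot prove p != q must conservatively assume the
 * dependence exists – exactly the 100(R4) vs. 20(R6) situation. */
void update(int *p, int *q) {
    *p = *p + 1;   /* write through p */
    *q = *q * 2;   /* read and write through q */
}

int main(void) {
    int a = 3;
    update(&a, &a);          /* aliased: order matters, (3+1)*2 = 8 */
    printf("%d\n", a);
    int b = 3, c = 5;
    update(&b, &c);          /* not aliased: statements independent */
    printf("%d %d\n", b, c);
    return 0;
}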

Name Dependences
Name dependence: two instructions use the same register or memory location, without any flow of data actually associated with that register or memory location
Types of name dependences (for instruction i preceding instruction j):
–Antidependence – instruction j writes a register or memory location that instruction i reads
–Output dependence – instruction i and instruction j write the same register or memory location
Name dependences are not true dependences
–Just change the names – register renaming – which can be done by the hardware or the compiler

Data Hazards (Revisited)
A hazard occurs whenever the ordering of accesses to an operand is changed
Read After Write (RAW) – j tries to read a source before i writes it – program order must be preserved
Write After Write (WAW) – j tries to write an operand before it is written by i – output dependence. Can only happen in pipelines that write in more than one stage or let an instruction proceed while another instruction is stalled
Write After Read (WAR) – j tries to write an operand before it is read by i – antidependence – mostly occurs when instructions write results early in the pipeline, or when instructions are reordered

Control Dependences
Control dependences determine the ordering of instructions with respect to branch instructions
–Instructions should execute in correct program order
–E.g., instructions from the then clause of an if statement should not execute if they are not needed
Control dependence constraints (see the C example below):
–An instruction that is control dependent on a branch cannot be moved before the branch
»E.g. an instruction from the then part of an if statement cannot be moved before the if
–An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch
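Here is the promised C example (mine, not from the slides), showing both constraints at once:

/* s2's computation is NOT control dependent on the branch: moving it
 * after/inside the `if` would wrongly make its execution depend on p.
 * The division IS control dependent on the branch: hoisting it above
 * the `if` would execute it even when p == 0 (and here would fault). */
int f(int p, int x) {
    int s2 = x + 1;    /* independent of the branch outcome */
    if (p != 0)
        x = x / p;     /* guarded by the branch */
    return x + s2;
}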

Control Dependence
Control dependence is not the critical property to preserve
–We may be willing to execute extra instructions if that does not compromise program correctness
What we really need to preserve:
–Exception behavior – the way exceptions are raised in a program should not be altered
–Data flow – the flow of data among instructions that produce results and those that consume them
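A small C example of the data-flow point (mine, in the spirit of the textbook's discussion): moving the second assignment above the branch, or discarding it as "dead" code, would hand the consumer the wrong value on one of the two paths, even though no single register dependence is violated on either path alone.

int g(int a, int b, int c) {
    int r1 = a + b;     /* producer on the fall-through path */
    if (r1 != 0)
        r1 = b - c;     /* producer on the taken path */
    return r1;          /* consumer: must see whichever producer the
                           branch actually selected */
}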

Dynamic Scheduling
Statically scheduled pipelines
–When a data dependence cannot be hidden with bypassing or forwarding, the processor stalls until the hazard is cleared
Dynamic scheduling
–Hardware reorders instructions to reduce the stalls while maintaining data flow and exception behavior
Advantages
–Handles dependences not known at compile time
»Simplifies compiler design
–Allows code compiled for one pipeline to run efficiently on another
Disadvantage – hardware complexity

Dynamically Scheduled Pipelines (Lecture 5)
Simple pipelines result in hazards that require stalling
Static scheduling – compilers rearrange instructions to avoid stalls
Dynamic scheduling – the processor executes instructions out of order to minimize stalls
Dynamic scheduling requires splitting the ID stage into two stages:
–Issue – decode instructions, check for structural hazards
–Read operands – wait until there are no data hazards, then read operands
–Also need to know when each instruction begins and ends execution
Requires a lot more bookkeeping! More when we discuss Tomasulo's algorithm in Chapter 3…

Scoreboarding
Scoreboarding – a technique that allows out-of-order execution when resources are available and there are no data dependences – originated in the CDC 6600 in the mid-1960s
The scoreboard is fully responsible for instruction execution and hazard detection
–Its cost and effectiveness depend on the number of functional units and the latency of operations
–Needs to keep track of the status of all instructions in execution (a sketch of the bookkeeping follows)
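That bookkeeping can be pictured as a few tables; below is a simplified C rendering of the classic scoreboard state (type and size choices are mine, not from the text):

#include <stdbool.h>

/* Functional unit status: one entry per functional unit. */
struct fu_status {
    bool busy;        /* the unit is executing an instruction       */
    int  op;          /* operation being performed                  */
    int  fi, fj, fk;  /* destination and source register numbers    */
    int  qj, qk;      /* units producing the sources (-1 = ready)   */
    bool rj, rk;      /* source operands ready and not yet read     */
};

/* Register result status: which unit, if any, will write each register. */
struct scoreboard {
    struct fu_status fu[4];  /* e.g. integer, add, multiply, divide */
    int result[32];          /* result[r] = producing unit, or -1   */
};

Issue stalls on a busy functional unit (structural hazard) or a nonempty result[] entry for the destination (WAW); read-operands waits until rj and rk are set (RAW); write-result waits until no earlier instruction still needs to read the old destination value (WAR).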

Scoreboarding II (figure-only slide; the diagram was not captured in the transcript)

Tomasulo's Algorithm
Hardware-based technique for ILP
–Tracks when operands are available to avoid RAW hazards
–Introduces register renaming to avoid WAW and WAR hazards
»What does this mean?
More sophisticated approach than the scoreboard from Appendix A
Initially designed for the IBM 360/91
–Designed in the late 60s
–Scoreboarding + register renaming
–4 FP registers, long memory access delays, long FP latencies – compiler-level optimizations were limited

Register Renaming
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
Where is the antidependence (WAR)?
–This is a name dependence

Register Renaming
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
Where is the output dependence (WAW)?
–This is a name dependence

Register Renaming
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
Where are the true data dependences (RAW)?

Getting Rid of Name Dependences
Assume we have 2 temporary registers S and T; the code sequence can be rewritten as:
DIV.D F0, F2, F4    →    DIV.D F0, F2, F4
ADD.D F6, F0, F8    →    ADD.D S, F0, F8
S.D F6, 0(R1)       →    S.D S, 0(R1)
SUB.D F8, F10, F14  →    SUB.D T, F10, F14
MUL.D F6, F10, F8   →    MUL.D F6, F10, T
Any subsequent uses of F8 should be replaced with register T
–Requires sophisticated compiler analysis, since intervening branches may change the meaning of F8
–Tomasulo's algorithm can handle renaming across branches

Tomasulo's Scheme for Avoiding Name Dependences
Use reservation stations
–Buffer the operands of instructions waiting to issue
–An operand is buffered as soon as it is available, eliminating the need to fetch it from a register
–Operands are renamed to the names of the reservation stations, avoiding register name conflicts
–There are more reservation stations than registers
»So the hardware can eliminate more hazards than the compiler

MIPS FPU with Tomasulo
Issue: instructions issue in order to preserve correct data flow
If there is an empty reservation station, issue the instruction with its operands
Else stall – structural hazard

MIPS FPU with Tomasulo
If the operands are not available, keep track of the functional units that will produce them – register renaming

An instruction goes through 3 basic steps
1.Issue – described in the previous slide
2.Execute – operands are placed in the reservation stations as they become available
When all operands are available the instruction is executed – delaying execution in this way eliminates RAW hazards
Loads and stores have 2 execution steps:
1. Compute the effective address and place it in the load or store buffer
2. Execute as soon as the memory unit is available
No instruction is allowed to initiate execution until all preceding branches have been resolved, to preserve exception behavior

Step 3: Write result
–Results are written on the common data bus (CDB)
»From there they end up in the corresponding registers and waiting reservation stations
–Writes of data to memory (stores) also happen at this step

Things to note about Tomasulo's Scheme
The data structures used to detect and eliminate hazards are attached to:
–The reservation stations
–The register file
–The load and store buffers
Reservation stations act as a set of virtual registers
–There are more of them than FP registers, so register renaming is possible

Reservation Station Fields
To track the state of the algorithm:
Op – operation to perform on the source operands
Qj, Qk – the reservation stations that will produce the corresponding source operand
Vj, Vk – the values of the source operands
A – holds information for the memory address calculation (initially the immediate; after address calculation, the effective address)
Busy – this reservation station and its accompanying functional unit are busy
The register file also contains a field:
Qi – the number of the reservation station whose result should be stored into this register
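These fields map almost one-for-one onto a small data structure; here is a minimal C sketch (types, sizes, and the tag encoding are illustrative assumptions, not from the text):

#include <stdbool.h>

/* One reservation station entry, with the fields listed above. */
struct rs_entry {
    bool   busy;    /* station and its functional unit are in use     */
    int    op;      /* operation to perform on the source operands    */
    int    qj, qk;  /* tags of the stations producing Vj/Vk; 0 = the  */
                    /* value is already present in vj/vk              */
    double vj, vk;  /* source operand values, once available          */
    long   a;       /* immediate, then effective address (ld/st only) */
};

/* Per-register renaming tag attached to the register file. */
struct reg_status {
    int qi;         /* tag of the station that will write this        */
                    /* register; 0 = register value is current        */
};

When a station finishes, it broadcasts its tag and result on the CDB (step 3 above): every entry whose qj or qk matches latches the value and clears the tag, and every register whose qi matches is written and marked current.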

Scoreboarding vs. Tomasulo
In Tomasulo's scheme:
–No checking is needed for WAR or WAW hazards, as registers are renamed
–Hazard detection logic is distributed among the reservation stations
–Loads and stores are treated as basic functional units
–The reservation stations effectively provide a larger register set
–Exploits ILP well, but requires more complex hardware

Tomasulo's Algorithm Details
Refer to Figure 3.5 in the text for a detailed register-level description of Tomasulo's algorithm

Next time
Hardware branch prediction