Presentation transcript:

In-Order Execution In-order execution does not always give the best performance on superscalar machines. The following example uses in-order execution and in-order completion. Multiplication takes one more cycle to complete than addition or subtraction. A scoreboard keeps track of register usage. The user-visible registers are R0 to R8. Multiple instructions may read a register at the same time, but only one may write it.

In-Order Execution

The scoreboard has a small counter for each register telling how many currently-executing instructions are using that register as a source. If at most, say, 15 instructions may be executing at once, a 4-bit counter suffices. The scoreboard also keeps track of registers being used as destinations. Since only one write at a time is allowed to each register, a 1-bit flag per register suffices here. In a real machine, the scoreboard also keeps track of functional-unit usage.
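The counters and flags described above can be sketched in software. This is a minimal illustration, not a model of any real machine's issue logic; the class name and the exact issue rules are assumptions made for the example.

```python
# A minimal scoreboard sketch: each register has a small read counter and a
# 1-bit "pending write" flag, as on the slide. Many readers are allowed at
# once, but only one writer per register at a time.

class Scoreboard:
    def __init__(self, num_regs=9):          # user-visible R0..R8
        self.readers = [0] * num_regs        # 4 bits suffice for <= 15 in flight
        self.writing = [False] * num_regs    # 1-bit destination-in-use flag

    def can_issue(self, srcs, dst):
        # RAW: a source is still being written.
        if any(self.writing[s] for s in srcs):
            return False
        # WAW / WAR: destination is being written, or still being read.
        if self.writing[dst] or self.readers[dst] > 0:
            return False
        return True

    def issue(self, srcs, dst):
        for s in srcs:
            self.readers[s] += 1
        self.writing[dst] = True

    def retire(self, srcs, dst):
        for s in srcs:
            self.readers[s] -= 1
        self.writing[dst] = False
```

For instance, after issuing MUL R0 = R1 * R2, an ADD that reads R0 cannot issue until the MUL retires (a RAW stall), and no instruction may write R1 or R2 while they are still being read (a WAR stall).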

In-Order Execution We can identify three kinds of dependences that can cause problems (instruction stalls):
RAW (Read After Write) dependence
WAR (Write After Read) dependence
WAW (Write After Write) dependence
In a RAW dependence, an instruction needs a value that a previous instruction has not yet finished writing. In a WAR dependence, one instruction is trying to overwrite a register that a previous instruction may not yet have finished reading. A WAW dependence is similar, except that the register the instruction is trying to overwrite has not yet been written by the previous instruction.

In-Order Execution In-order completion is also important in order to have the property of precise interrupts. Out-of-order completion leads to imprecise interrupts (we don't know exactly which instructions have completed at the time of an interrupt, which is not good). To avoid stalls, let us now permit out-of-order execution and out-of-order retirement.

Out-of-Order Execution

The previous example also introduces a new technique called register renaming. The decode unit has changed the use of R1 in I6 and I7 to a secret register, S1, not visible to the programmer. Now I6 can be issued concurrently with I5. Modern CPUs often have dozens of secret registers for use with register renaming. This can often eliminate WAR and WAW dependencies.
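Register renaming can be sketched as a simple table transformation: every write to an architectural register is redirected to a fresh secret register, and later reads use the most recent mapping. The S1, S2, ... naming follows the example above; the instruction format is invented for illustration.

```python
# A sketch of register renaming. Each write gets a fresh secret register
# (S1, S2, ...); reads are redirected to the latest name for that register.
# This removes WAR and WAW dependences; true RAW dependences remain.

def rename(instrs, num_secret=100):
    mapping = {}                                      # architectural -> physical
    fresh = iter(f"S{i}" for i in range(1, num_secret + 1))
    renamed = []
    for dst, srcs in instrs:
        new_srcs = [mapping.get(s, s) for s in srcs]  # read the latest names
        new_dst = next(fresh)                         # fresh register per write
        mapping[dst] = new_dst
        renamed.append((new_dst, new_srcs))
    return renamed
```

Two back-to-back writes to R1 thus land in different secret registers (no more WAW), while an instruction that genuinely reads the first result still reads the right secret register (RAW is preserved).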

Speculative Execution Computer programs can be broken up into basic blocks, with each basic block consisting of a linear sequence of code with one entry point and one exit. A basic block does not contain any control structures. Therefore its machine language translation does not contain any branches. Basic blocks are connected by control statements. Programs in this form can be represented by directed graphs.
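The split into basic blocks can be done with the classic "leader" rule: the first instruction, every branch target, and every instruction following a branch each start a new block. The tiny instruction format below is invented for illustration.

```python
# Split a list of instructions into basic blocks using the leader rule.
# Each instruction is (op, target); target is an index for branches, else None.

def basic_blocks(instrs):
    leaders = {0}                               # the first instruction leads
    for i, (op, target) in enumerate(instrs):
        if op in ("jmp", "br"):
            if target is not None:
                leaders.add(target)             # a branch target leads
            if i + 1 < len(instrs):
                leaders.add(i + 1)              # so does the fall-through
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]
```

Each resulting block is a straight-line run with one entry and one exit; the branches between blocks are exactly the edges of the directed graph described above.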

Basic Blocks

Speculative Execution Within each basic block, the reordering techniques seen work well. Unfortunately, most basic blocks are short and there is insufficient parallelism to exploit. The next step is to allow reordering to cross block boundaries. The biggest gains come when a potentially slow operation can be moved upward in the graph to get it going earlier. Moving code upward over a branch is called hoisting.

Speculative Execution Imagine that all of the variables of the previous example except evensum and oddsum are kept in registers. It might make sense to move their LOAD instructions to the top of the loop, before computing k, to get them started early on, so the values will be available when they are needed. Of course only one of them will be needed on each iteration, so the other LOAD will be wasted.
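As a sketch of the transformation itself, the hoist can be expressed as moving the LOADs of both branch arms to the top of the loop body, ahead of the branch. The instruction tuples and arm structure here are invented for illustration; a real compiler would also have to ensure the hoisted loads have no harmful side effects.

```python
# Hoist the LOADs of both branch arms above the branch (speculative hoisting).
# pre is the code before the branch, arm_a/arm_b the two branch arms.

def hoist_loads(pre, branch, arm_a, arm_b):
    loads = [ins for ins in arm_a + arm_b if ins[0] == "load"]
    rest_a = [ins for ins in arm_a if ins[0] != "load"]
    rest_b = [ins for ins in arm_b if ins[0] != "load"]
    return loads + pre + [branch], rest_a, rest_b   # loads start first
```

Both loads now begin before k is even computed; on any given iteration one of them turns out to be wasted work, which is exactly the trade-off described above.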

Speculative Execution Speculative execution introduces some interesting problems. It is essential that none of the speculative instructions have irrevocable results because it may turn out later that they should not have been executed. One way to do this is to rename all the destination registers to be used by speculative code. In this way, only scratch registers are modified.

Speculative Execution Another problem arises if a speculatively executed instruction causes an exception. A LOAD instruction may cause a cache miss on a machine with a large cache line and a memory far slower than the CPU and cache. One solution is to have a special SPECULATIVE-LOAD instruction that tries to fetch the word from the cache, but if it is not there, just gives up.
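The SPECULATIVE-LOAD behaviour can be sketched as follows; a plain dict stands in for the cache, and the function names are invented for the example.

```python
# A sketch of SPECULATIVE-LOAD vs. a normal LOAD. The speculative version
# tries the cache and simply gives up on a miss; the normal version fills
# the cache from (slow) memory and always returns the word.

def speculative_load(cache, addr):
    return cache.get(addr)           # hit -> value; miss -> give up with None

def load(cache, memory, addr):
    if addr not in cache:
        cache[addr] = memory[addr]   # slow fill on a miss (would stall the CPU)
    return cache[addr]
```

The point of the design is that a speculative instruction never commits the machine to a long memory stall that may turn out to have been unnecessary.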

Speculative Execution A worse situation happens with the following statement: if (x > 0) z = y/x; Suppose that the variables are all fetched into registers in advance and that the (slow) floating-point division is hoisted above the if test. If x is 0, the resulting divide-by-zero trap terminates the program even though the programmer has put in explicit code to prevent this situation. One solution is to have special versions of instructions that might cause exceptions.
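One such special-version scheme (a sketch of the general idea, not of any particular ISA) is to have the speculative divide return a poison marker instead of trapping, with the trap deferred until the poisoned result is actually consumed. The POISON object and helper names below are invented for illustration.

```python
# A speculative divide that defers the trap: on divide-by-zero it returns a
# poison marker instead of raising. The trap fires only if the poisoned
# result is actually used, which the guarding "if" prevents.

POISON = object()

def spec_div(y, x):
    return POISON if x == 0 else y / x    # defer the trap instead of raising

def consume(value):
    if value is POISON:
        raise ZeroDivisionError("deferred trap: poisoned value was used")
    return value

def safe(y, x):
    z = spec_div(y, x)        # hoisted: executed before the test, as above
    if x > 0:
        return consume(z)     # only now could a deferred trap fire
    return None
```

With x = 0 the poisoned value is never consumed, so the programmer's explicit guard works exactly as written, yet the slow division still starts early in the common case.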

Core i7's Sandy Bridge Microarchitecture The block diagram of the Core i7's Sandy Bridge microarchitecture.

Core i7's Sandy Bridge Pipeline (1) A simplified view of the Core i7 data path.

Core i7's Sandy Bridge Pipeline (2) Scheduler queues send micro-ops into the six functional units:
ALU 1 and the floating-point multiply unit
ALU 2 and the floating-point add/subtract unit
ALU 3, branch processing, and the floating-point compare unit
Store instructions
Load instructions 1
Load instructions 2

OMAP4430's Cortex A9 Microarchitecture The block diagram of the OMAP4430's Cortex A9 microarchitecture.

OMAP4430's Cortex A9 Pipeline (1) A simplified representation of the OMAP4430's Cortex A9 pipeline.

OMAP4430's Cortex A9 Pipeline (2) A simplified representation of the OMAP4430's Cortex A9 pipeline.

Microarchitecture of the ATmega168 Microcontroller The microarchitecture of the ATmega168.