CSC 4250 Computer Architectures November 7, 2006 Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation

Branch Misprediction The machine tries to recover as early as possible after a branch is mispredicted. A faster recovery is achieved by clearing the ROB entries that appear after the mispredicted branch, allowing the entries before the branch to continue, and restarting fetch at the correct branch successor.
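As a concrete (if simplified) illustration, here is a minimal Python sketch of this recovery step, assuming a circular-buffer ROB with head and tail indices; the class layout and field names are hypothetical, not the text's:

    # Minimal sketch: flushing a circular-buffer ROB on a branch mispredict.
    # The layout (head/tail indices, 'entries' list) is an illustrative
    # assumption, not the textbook's exact data structure.

    class ReorderBuffer:
        def __init__(self, size):
            self.size = size
            self.entries = [None] * size   # in-flight instruction records
            self.head = 0                  # oldest entry (next to commit)
            self.tail = 0                  # next free slot

        def flush_after(self, branch_index):
            """Clear every entry younger than the mispredicted branch;
            entries at or before the branch continue toward commit."""
            i = (branch_index + 1) % self.size
            while i != self.tail:
                self.entries[i] = None     # squash the speculative entry
                i = (i + 1) % self.size
            self.tail = (branch_index + 1) % self.size
            # Fetch would now restart at the correct branch successor.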

Exceptions Exceptions are handled by not recognizing an exception until the instruction that raised it is ready to commit. If a speculated instruction raises an exception, the exception is recorded in its ROB entry. If a branch misprediction means the instruction should not have been executed, the exception is flushed along with the instruction when the ROB is cleared. If the instruction reaches the head of the ROB, it is no longer speculative and the exception is taken.

Precise Exceptions A processor with a ROB can dynamically execute code while maintaining precise exceptions. Consider the example on page 229, shown below. If MUL.D causes an exception, we wait until it reaches the head of the ROB to take the exception, flushing all other pending instructions. Since instruction commit happens in order, this approach yields a precise exception.

    L.D    F6,34(R2)
    L.D    F2,45(R3)
    MUL.D  F0,F2,F4
    SUB.D  F8,F6,F2
    DIV.D  F10,F0,F6
    ADD.D  F6,F8,F2
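A short Python sketch of the commit logic that makes this work; the entry fields (done, exception, dest, value) are illustrative assumptions, not the text's:

    # Sketch: in-order commit with deferred exceptions. An exception is
    # taken only when the faulting instruction reaches the ROB head.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ROBEntry:
        dest: str                       # architectural destination register
        value: float = 0.0
        done: bool = False              # has finished executing
        exception: Optional[Exception] = None

    def commit_step(rob, regfile):
        """rob: list of ROBEntry, oldest first. Commits at most one entry."""
        if not rob or not rob[0].done:
            return                      # head not finished: nothing commits
        head = rob.pop(0)
        if head.exception is not None:
            rob.clear()                 # flush all younger instructions
            raise head.exception        # taken only at the head: precise
        regfile[head.dest] = head.value # in-order architectural update

If MUL.D's entry records an exception, the two L.D instructions ahead of it still commit normally, and everything after it is flushed, which is exactly the behavior a precise exception requires.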

Data Hazards through Memory (1) WAW and WAR hazards: WAW and WAR hazards through memory are eliminated with speculation because the actual updating of memory occurs in order, when a store is at the head of the ROB; hence, no earlier loads or stores can still be pending.

Data Hazards through Memory (2) RAW hazards through memory are maintained by two restrictions (see the sketch below):
1. A load may not initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load.
2. The program order for the computation of a load's effective address is maintained with respect to all earlier stores.
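A minimal sketch of the first restriction in Python; the entry fields mirror the Destination and A fields named above, while the surrounding structure is an illustrative assumption:

    # Sketch: memory disambiguation check before a load's second step
    # (reading memory). The load may not proceed while any older, active
    # store in the ROB targets the same address.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ROBEntry:
        is_store: bool
        dest_addr: Optional[int] = None  # the Destination field (stores)

    def load_may_proceed(older_entries, load_addr):
        """older_entries: active ROB entries older than the load.
        load_addr: the load's A field (computed effective address)."""
        for e in older_entries:
            if e.is_store and e.dest_addr == load_addr:
                return False             # possible RAW through memory: wait
        return True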

Dual Issue With and Without Speculation

    Loop: LD     R2,0(R1)
          DADDIU R2,R2,#1
          SD     R2,0(R1)
          DADDIU R1,R1,#4
          BNE    R2,R3,Loop

Assume that there are separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation. Assume that up to two instructions of any type can commit per clock.

Figure. Dual-Issue Pipeline Without Speculation (clock cycle of each event)

Iter  Instruction        Issue  Execute  Mem access  Write CDB  Comment
1     LD R2,0(R1)        1      2        3           4          First issue
1     DADDIU R2,R2,#1    1      5                    6          Wait for LD
1     SD R2,0(R1)        2      3        7                      Wait for DADDIU
1     DADDIU R1,R1,#4    2      3                    4          Execute directly
1     BNE R2,R3,Loop     3      7                               Wait for DADDIU
2     LD R2,0(R1)        4      8        9           10         Wait for BNE
2     DADDIU R2,R2,#1    4      11                   12         Wait for LD
2     SD R2,0(R1)        5      9        13                     Wait for DADDIU
2     DADDIU R1,R1,#4    5      8                    9          Wait for BNE
2     BNE R2,R3,Loop     6      13                              Wait for DADDIU
3     LD R2,0(R1)        7      14       15          16         Wait for BNE
3     DADDIU R2,R2,#1    7      17                   18         Wait for LD
3     SD R2,0(R1)        8      15       19                     Wait for DADDIU
3     DADDIU R1,R1,#4    8      14                   15         Wait for BNE
3     BNE R2,R3,Loop     9      19                              Wait for DADDIU

Figure. Dual-Issue Pipeline With Speculation (clock cycle of each event)

Iter  Instruction        Issue  Execute  Read access  Write CDB  Commit  Comment
1     LD R2,0(R1)        1      2        3            4          5       First issue
1     DADDIU R2,R2,#1    1      5                     6          7       Wait for LD
1     SD R2,0(R1)        2      3                                7       Wait for DADDIU
1     DADDIU R1,R1,#4    2      3                     4          8       Commit in order
1     BNE R2,R3,Loop     3      7                                8       Wait for DADDIU
2     LD R2,0(R1)        4      5        6            7          9       No execute delay
2     DADDIU R2,R2,#1    4      8                     9          10      Wait for LD
2     SD R2,0(R1)        5      6                                10      Wait for DADDIU
2     DADDIU R1,R1,#4    5      6                     7          11      Commit in order
2     BNE R2,R3,Loop     6      10                               11      Wait for DADDIU
3     LD R2,0(R1)        7      8        9            10         12      Earliest possible
3     DADDIU R2,R2,#1    7      11                    12         13      Wait for LD
3     SD R2,0(R1)        8      9                                13      Wait for DADDIU
3     DADDIU R1,R1,#4    8      9                     10         14      Executes earlier
3     BNE R2,R3,Loop     9      13                               14      Wait for DADDIU
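Comparing the two tables gives a worked measure of the gain. Without speculation, the BNE of successive iterations executes at cycles 7, 13, and 19, so one iteration completes every 6 cycles. With speculation, successive iterations commit at cycles 8, 11, and 14, one every 3 cycles, roughly doubling throughput: speculation lets each iteration's LD issue and execute before the previous BNE has resolved.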

P6 Microarchitecture (Section 3.10) The P6 is a dynamically scheduled processor that translates each IA-32 instruction into a series of micro-operations (μops) that are executed by the pipeline; the μops are similar to typical RISC instructions. Up to three IA-32 instructions are fetched, decoded, and translated into μops every clock cycle, and at most six μops may be generated per clock cycle. The μops are executed by an out-of-order speculative pipeline using register renaming and a ROB. Up to three μops per clock can be renamed and dispatched to the reservation stations, and commit can likewise retire up to three μops per clock.

Fourteen-Stage P6 Pipeline Eight stages are used for in-order instruction fetch, decode, and dispatch; the next instruction is selected during fetch using a 512-entry, two-level branch predictor, and the decode and issue stages include register renaming and dispatch to one of 20 reservation stations and one of 40 ROB entries. Three stages are used for out-of-order execution in one of five separate functional units (integer unit, FP unit, branch unit, memory address unit, and memory access unit); execution takes from 1 cycle (simple integer ALU operations) to 32 cycles (FP divide). Three stages are used for instruction commit (8 + 3 + 3 = 14).

Figure. P6 Microarchitecture (the P6 processor family) Pentium Pro: multichip module. Pentium II: MMX instruction extension. Pentium III: on-chip 256 KB L2 cache or off-chip 512 KB cache. Xeon: server applications; off-chip L2 and multiprocessing.

Processor          First ship date   Clock rate (MHz)   L1 cache (KB)   L2 cache (KB)
Pentium Pro        1995              150 – 200          8 I + 8 D       256 – 1024
Pentium II         1997              233 – 450          16 I + 16 D     256 – 512
Pentium II Xeon    1998              400 – 450          16 I + 16 D     512 – 2048
Celeron            1998                                 16 I + 16 D     128
Pentium III        1999                                 16 I + 16 D     256 – 512
Pentium III Xeon   1999                                 16 I + 16 D     1024 – 2048

Figure. P6 Processor Pipeline The figure shows the P6 processor pipeline with the throughput of each stage and the total buffering between stages. During renaming, an instruction reserves a ROB entry; stalls can occur when the ROB is full. The instruction fetch unit can fill the entire prefetch buffer in one cycle; if the buffer is partially full, fewer bytes are fetched.

Blockage of Instructions in the Pentium Pro
1. Fewer than three IA-32 instructions are fetched, due to instruction cache misses.
2. Fewer than three instructions are issued because one of the three IA-32 instructions generates more than the allocated number of μops (four for the first instruction and one for each of the other two; see the sketch below).
3. Not all μops generated in a clock cycle can be issued, due to a shortage of reservation stations or reorder buffer entries.
4. A data dependence leads to a stall because every reservation station or the reorder buffer is filled with dependent instructions.
5. A data cache miss leads to a stall because every reservation station or the reorder buffer is filled with instructions waiting for a cache miss.
6. Branch mispredicts cause stalls directly, since the pipeline must be flushed and refilled. A mispredict may also cause a stall arising from interference between speculated instructions that will be cancelled and instructions that will complete.
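A small Python sketch of restriction 2, the decode template implied by the μop allocation above (up to four μops from the first decoder, one from each of the other two); the function and its input form are hypothetical, for illustration only:

    # Sketch: how many of the next three IA-32 instructions can decode in
    # one cycle. 'uop_counts' lists the uop count of each instruction,
    # oldest first (an assumed input format).

    def instructions_decoded(uop_counts):
        limits = (4, 1, 1)              # allocated uops per decoder slot
        decoded = 0
        for count, limit in zip(uop_counts, limits):
            if count > limit:
                break                   # must wait its turn for slot 0
            decoded += 1
        return decoded

    assert instructions_decoded([1, 1, 1]) == 3   # common case
    assert instructions_decoded([1, 2, 1]) == 1   # a 2-uop op blocks slot 1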

Stalls in Decode Cycle (1) Although the processor attempts to fetch three instructions every cycle, it cannot maintain this rate if the instruction cache generates a miss, if one of the instructions requires more than the number of μops allocated to it, or if the six-entry μop issue buffer is full. Figure 3.50 shows the fraction of time in which 0, 1, 2, or 3 IA-32 instructions are decoded. On average, for these SPEC CPU95 benchmarks, 0.87 instructions are issued per cycle.

Stalls in Decode Cycle (2) Figure 3.51 breaks down the stalls at decode time according to whether they are due to instruction cache misses, which leave fewer than three instructions available to decode, or to resource capacity limitations, meaning that a lack of reservation stations or reorder buffer entries prevents a μop from issuing. Failure to issue a μop eventually fills the μop buffer (recall that it has six entries), which then blocks instruction decode.

Stalls in Decode Cycle (3) Figure 3.52 shows that most IA-32 instructions map to a single μop and that, on average, there are 1.37 μops per IA-32 instruction. Thus, the μop buffer fills primarily because of delays in the execution unit.

Data Cache Behavior Figure 3.53 shows the number of first-level (L1) and second-level (L2) cache misses per thousand instructions. The L2 misses, although far fewer, cost more than five times as much as L1 misses and thus dominate in some applications. Instruction cache misses have a minor effect in most programs.
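A worked example with assumed numbers (purely illustrative, not measured from the figure): suppose 30 L1 misses and 5 L2 misses per thousand instructions, a 4-cycle L1 miss penalty, and an L2 penalty five times larger (20 cycles). Then

$$\text{memory stall CPI} = \frac{30 \times 4 + 5 \times 20}{1000} = \frac{220}{1000} = 0.22,$$

so the far rarer L2 misses contribute almost as many stall cycles as the L1 misses, which is how L2 misses come to dominate in some applications.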

Branch Performance and Speculation Costs Branch-target addresses are predicted with a 512-entry BTB. If the BTB misses, a static prediction is used: backward branches are predicted taken (with a one-cycle penalty if correctly predicted) and forward branches are predicted not taken (with no penalty if correctly predicted). Branch mispredicts have both a direct performance penalty, which is between 10 and 15 cycles, and an indirect penalty due to the overhead of incorrectly speculated instructions, which is impossible to measure. Figure 3.54 shows the fraction of branches mispredicted either because of BTB misses or because of incorrect predictions.
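The static fallback rule is simple enough to state as code; a one-line Python sketch (the function name and argument form are illustrative):

    # Sketch: P6 static prediction on a BTB miss. Backward branches
    # (target address below the branch PC) are predicted taken; forward
    # branches are predicted not taken.

    def static_predict_taken(branch_pc, target_pc):
        return target_pc < branch_pc

The heuristic works because backward branches are overwhelmingly loop-closing branches, which are taken on all but the final iteration.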

Figure. Branch Miss Frequencies The BTB miss frequency dominates the mispredict frequency, arguing for a larger predictor.

Overall Performance of P6 Pipeline Overall performance depends on the rate at which instructions actually complete and commit. Figure 3.56 shows the fraction of the time that 0, 1, 2, or 3 μops commit. On average, 1 μop commits per cycle, but, as shown in the figure, 3 μops commit in a cycle 23% of the time. The distribution demonstrates the ability of a dynamically scheduled pipeline to fall behind (in 55% of the cycles, no μops commit) and later catch up (31% of the cycles have either 2 or 3 μops committing).

Fig. Fraction of μops Committing in a Cycle The number of μop completions per cycle is distributed as follows: 0 completions, 55% of the cycles; 1 completion, 13%; 2 completions, 8%; and 3 completions, 23%.
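These fractions are consistent with the roughly one μop per cycle quoted above:

$$0 \times 0.55 + 1 \times 0.13 + 2 \times 0.08 + 3 \times 0.23 = 0.98 \approx 1 \text{ μop per cycle.}$$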

Pentium 4 versus the Pentium III (P6) The microarchitecture of the Pentium 4 (called NetBurst) is similar to that of the Pentium III (called P6): both fetch up to three IA-32 instructions per cycle, decode them into μops, and send the μops to an out-of-order execution engine that can graduate up to three μops per cycle. There are many differences (shown on the next three slides) designed to allow NetBurst to operate at a significantly higher clock rate and to help sustain higher execution throughput.

Differences between NetBurst and P6 (1) NetBurst has a much deeper pipeline: P6 requires 10 cycles from the time a simple add instruction is fetched until the availability of its result, while NetBurst takes about 20 cycles, including 2 cycles reserved simply to drive results across the chip. NetBurst uses register renaming rather than the ROB used in P6 to hold speculative results; renaming allows many more outstanding results (up to 128) in NetBurst versus the 40 permitted in P6. There are 7 integer execution units in NetBurst versus 5 in P6; the additions are an extra integer ALU and an extra address computation unit.

Differences between NetBurst and P6 (2) An aggressive ALU (operating at twice the clock rate) and an aggressive data cache lead to lower latencies for the basic ALU operations (effectively ½ clock cycle in NetBurst versus 1 in P6) and for data loads (effectively 2 cycles in NetBurst versus 3 in P6). NetBurst uses a sophisticated trace cache to improve instruction fetch performance, while P6 uses a conventional prefetch buffer and instruction cache. NetBurst has a BTB that is 8 times larger and uses an improved prediction algorithm.

Differences between NetBurst and P6 (3) NetBurst has an 8 KB L1 data cache, compared to P6's 16 KB L1 data cache (why smaller?). NetBurst's larger L2 cache (256 KB) with higher bandwidth helps offset the disadvantage. NetBurst implements new FP instructions that allow two FP operations per instruction; these operations are structured as a 128-bit SIMD or short-vector structure.

Figure. Performance Comparison of NetBurst and P6 We compare NetBurst at 1.7 GHz and P6 at 1 GHz on four benchmarks that appear in both SPEC95 and SPEC2000. The performance of NetBurst exceeds that of P6 by a factor of between 1.2 and 2.9. The factor exceeds the pure clock-rate advantage of 1.7 for the FP benchmarks (why?) and is less than the clock-rate advantage for the integer programs.

Limitations on Power Dynamic power is proportional to the product of the number of switching transistors and the switching rate. A microprocessor trying to achieve both a low CPI and a high clock rate fights both factors of that product: achieving an improved CPI means more instructions in flight and more transistors switching every clock cycle, while a higher clock rate raises the switching rate itself.
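In equation form, this is the standard CMOS dynamic power relation (the notation here is ours, not the slide's):

$$P_{\text{dynamic}} \approx \alpha \, C \, V_{dd}^{2} \, f,$$

where α is the activity factor (the fraction of transistors switching each cycle), C the switched capacitance, V_dd the supply voltage, and f the clock rate. Improving CPI raises α, and raising the clock rate raises f, so an aggressive design pushes both factors of the product at once.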

Pentium III and Pentium 4 The Pentium 4 has a much deeper pipeline and exploits more ILP; peak issue rates are about the same. The operating voltage of the Pentium 4 at 1.7 GHz is 1.75 V, while that of the Pentium III at 1 GHz is 1.70 V. The Pentium 4 consumes ?? W, while the Pentium III consumes ?? W (p. 279). The Pentium 4 is faster, but its higher clock rate, deeper pipeline, and higher sustained execution rate make it significantly less power efficient.

Figure. The relative performance per watt of the Pentium 4 is 15% to 40% less than that of the Pentium III on these benchmarks.