1 Chapter 3: ILP and Its Dynamic Exploitation
–Review simple static pipeline
–Dynamic scheduling, out-of-order execution
–Dynamic branch prediction
–Instruction issue unit
–Multiple issue (superscalar)
–Hardware-based speculation
–ILP limitations
–Intel Core i7 and ARM Cortex-A8

2 Multiple Issue
Goal: enable multiple instructions to be issued in a single clock cycle (can get CPI < 1).
Two basic “flavors” of multiple issue:
–Superscalar: maintains the ordinary serial instruction-stream format. Instructions per clock (IPC) vary widely. Instruction issue can be dynamic or static (in-order).
–VLIW (Very Long Instruction Word), a.k.a. EPIC (Explicitly Parallel Instruction Computing): new format in which parallel instructions are grouped into blocks. Instructions per block are fixed (by block size). Mostly statically scheduled by the compiler.
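One way to make the contrast concrete is to look at how the instruction stream is laid out in each case. Below is a minimal C sketch, purely illustrative: the type names, field names, and the 3-slot bundle size are assumptions for this example, not taken from any real ISA.

#include <stddef.h>

/* An instruction as the hardware sees it (fields invented for the sketch). */
typedef struct { unsigned opcode, dest, src1, src2; } Inst;

/* Superscalar: an ordinary serial stream; each cycle the hardware decides
   how many of the next instructions can issue together, so IPC varies. */
typedef struct { Inst *stream; size_t next; } SuperscalarFrontEnd;

/* VLIW/EPIC: the compiler packs a fixed number of slots per bundle and
   fills unused slots with NOPs, so the grouping is fixed at compile time. */
#define SLOTS_PER_BUNDLE 3
typedef struct { Inst slot[SLOTS_PER_BUNDLE]; } VliwBundle;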

3 Superscalar Pipeline
Typical superscalar: 1–8 instructions issued per cycle
–Actual IPC depends on dependences, hazards
Simple example: 2 instructions/cycle, static scheduling
–Instructions statically pre-paired to ease decoding:
  1st: one load/store/branch/integer-ALU op
  2nd: one floating-point op
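A minimal sketch of the static pairing rule just described, assuming a made-up instruction-class enum (the names are not from the slides):

/* Dual-issue pairing check: the pair may issue together only if the first
   instruction is a load/store/branch/integer-ALU op and the second is a
   floating-point op; otherwise only the first instruction issues. */
typedef enum { INT_ALU, LOAD_STORE, BRANCH, FP } InstClass;

int can_dual_issue(InstClass first, InstClass second)
{
    int first_ok  = (first == INT_ALU || first == LOAD_STORE || first == BRANCH);
    int second_ok = (second == FP);
    return first_ok && second_ok;
}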

4 Code Example to be Used
C code fragment:
double *p;
do { *(p--) += c; } while (p);
MIPS code fragment:
Loop: LD   F0,0(R1)    ; F0 = *p
      ADDD F4,F0,F2    ; F4 = F0 + c
      SD   0(R1),F4    ; *p = F4
      ADDI R1,R1,#-8   ; p--
      BNEZ R1,Loop     ; until p = 0

5 Multiple Issue + Dynamic Scheduling
Why? Usual advantages of dynamic scheduling…
–Compiler-independent, data-dependent scheduling
Multiple-issue Tomasulo:
–Issue 1 integer + 1 FP instruction to the reservation stations each cycle
–Problem (again): issuing multiple instructions simultaneously. If the instructions are dependent, hazard detection is complex.
–Two solutions to this problem:
  Enter instructions into the tables in only half a clock cycle
  Build hardware to issue two instructions in parallel; must be careful to detect the dependences between them (see the sketch below)
–Memory dependences: load/store dependences are handled through the load/store queue
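As a rough sketch of why same-cycle issue complicates hazard detection, the check below looks for a RAW dependence inside the issue pair; the structure and field names are invented for illustration and do not come from the slides.

/* If the second instruction of an issue pair reads a register written by
   the first, the issue logic must hand the second instruction the first
   one's reservation-station tag instead of a (stale) register value. */
typedef struct { int dest_reg, src1_reg, src2_reg; } DecodedInst;

int pair_has_raw_dependence(const DecodedInst *first, const DecodedInst *second)
{
    return second->src1_reg == first->dest_reg ||
           second->src2_reg == first->dest_reg;
}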

6 Example of Dual-Issue Tomasulo The clock cycles of issue, execute, and write-back for a dual-issue Tomasulo pipeline (no speculation)

7 Example of Dual-Issue Tomasulo Resource usage table for the last figure

8 Example of Dual-Issue Tomasulo The clock cycles of issue, execute, and write-back for a dual-issue Tomasulo pipeline with an additional ALU and CDB

9 Example of Dual-Issue Tomasulo Resource usage table for the last figure

10 Hardware-Based Speculation
Dynamic scheduling + speculative execution:
–Dynamic branch prediction chooses which instructions will be pre-executed.
–Speculation executes instructions conditionally early (before branch conditions are resolved).
–Dynamic scheduling handles the scheduling of the different dynamic sequences of basic blocks that are encountered.
Dataflow execution: execute instructions as soon as their operands are available. Results may be canceled if the prediction is incorrect!

11 Advantages of HW-Based Speculation
–Allows more overlap of instruction execution.
–Dynamic speculation can disambiguate memory references, so a load can be moved ahead of a store (if the locations addressed are different).
–Speculation works better when more accurate dynamic branch prediction is used.
–Maintains precise exception handling even for speculated instructions.
–No extra bookkeeping code (speculation bits, register-renaming code) in the program.
–Program code is independent of the implementation.

12 Implementing HW-Based Speculation
Separate the execution of speculative instructions (including the dataflow between them) from the committing of results permanently to registers/memory (when the speculation turns out to be correct).
A new structure called the reorder buffer holds the results of instructions that have executed speculatively (or non-speculatively) but cannot yet be committed (instructions commit in order).
–The reorder buffer is non-programmer-visible temporary storage, like the reservation stations in Tomasulo’s algorithm.
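As a concrete (but invented) picture of that temporary storage, here is a minimal sketch of what one reorder-buffer entry might hold; the slides do not give a field layout, so every name below is an assumption.

/* One reorder-buffer entry (illustrative layout only).  Entries are
   allocated at issue and released at commit, in program order. */
typedef struct {
    int  busy;                  /* entry is in use                              */
    int  ready;                 /* result has been produced (write result done) */
    int  is_store;              /* destination is memory rather than a register */
    int  dest;                  /* destination register number or store address */
    int  mispredicted_branch;   /* set if this is a wrongly predicted branch    */
    long value;                 /* the speculatively produced result            */
} RobEntry;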

13 Steps of Execution in HWBS
Issue (or dispatch):
–Get the next fetched instruction (in order).
–Issue if a reservation station and the reorder buffer are not full.
Execute:
–Monitor the CDB for operands until they are ready, then execute.
Write result:
–Write to the CDB, the reorder buffer, and the reservation stations.
Commit:
–When an instruction is at the head of the reorder buffer (and was not mispredicted), commit its value to the register/memory.
–Committing a mispredicted branch flushes the reorder buffer.
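A rough sketch of the commit step in C, reusing the illustrative RobEntry layout from the previous sketch; the function, its arguments, and the flush policy shown are assumptions for illustration, not the text's definition.

/* Same invented RobEntry layout as in the earlier sketch. */
typedef struct {
    int busy, ready, is_store, dest, mispredicted_branch;
    long value;
} RobEntry;

/* Commit from the head of the ROB, in program order. */
void commit_step(RobEntry rob[], int *head, int rob_size,
                 long regs[], long memory[])
{
    RobEntry *e = &rob[*head];
    if (!e->busy || !e->ready)
        return;                          /* head not finished yet: commit stalls */

    if (e->mispredicted_branch) {
        for (int i = 0; i < rob_size; i++)
            rob[i].busy = 0;             /* flush all speculative work           */
        return;                          /* fetch restarts at the correct target */
    }

    if (e->is_store)
        memory[e->dest] = e->value;      /* memory is updated only at commit     */
    else
        regs[e->dest] = e->value;        /* value becomes architectural state    */

    e->busy = 0;
    *head = (*head + 1) % rob_size;      /* advance to the next entry in order   */
}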

14 HWBS Implementation Sketch

15 A Simple Example (Fig. 3.12) Ready to commit; cannot commit yet due to MUL

16 Loop Example with Reorder Buffer Completed but not able to commit

17

18 Comparison with/without Speculation

19 Comparison with/without Speculation

20 ILP Limitations
An ideal processor:
–Infinite registers for renaming
–Perfect branch and jump predictions
–Perfect memory disambiguation

21 Increasing the Window Size and Maximum Issue Count
How close can a real dynamically scheduled, speculative processor come to the ideal one? It would have to:
–Look arbitrarily far ahead, predicting all branches
–Rename all register uses to avoid WAR/WAW hazards
–Determine data dependences
–Determine memory dependences
–Have enough parallel functional units
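To make the window-size idea concrete, the sketch below counts how many instructions at the front of a window could issue together before the first intra-window RAW dependence. It reuses the invented DecodedInst layout from the earlier sketch and ignores branches, memory dependences, and functional-unit limits, so it is only a toy model of what the limit studies measure.

typedef struct { int dest_reg, src1_reg, src2_reg; } DecodedInst;  /* as before */

/* Assumes register numbers are in 0..63; dest_reg < 0 means
   "writes no register". */
int issuable_this_cycle(const DecodedInst window[], int window_size)
{
    int written[64] = {0};     /* registers written by earlier window entries */
    int count = 0;
    for (int i = 0; i < window_size; i++) {
        const DecodedInst *in = &window[i];
        if (written[in->src1_reg] || written[in->src2_reg])
            break;             /* RAW dependence inside the window */
        if (in->dest_reg >= 0)
            written[in->dest_reg] = 1;
        count++;
    }
    return count;
}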

22 Limitation on Window Size

23 Effect of Branch Prediction

24 Effect of Finite Registers

25 Effect of Memory Disambiguation

ARM Cortex-A8 Pipeline 26 Figure 3.36 The basic structure of the A8 pipeline is 13 stages. Three cycles are used for instruction fetch and four for instruction decode, in addition to a five-cycle integer pipeline. This yields a 13-cycle branch misprediction penalty. The instruction fetch unit tries to keep the 12-entry instruction queue filled.

Decode Stage 27 Figure 3.37 The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline.

Execution Stage 28

CPI 29 Figure 3.39 The estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI. eon deserves some special mention, as it does integer-based graphics calculations (ray tracing) and has very few cache misses. It is computationally intensive with heavy use of multiplies, and the single multiply pipeline becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1 and L2 generated stalls per instruction. These are subtracted from the CPI measured by a detailed simulator to obtain the pipeline stalls. Pipeline stalls include all three hazards plus minor effects such as way misprediction.
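The stall decomposition described in this caption can be written out directly. Below is a small C sketch under assumed names; none of the variable names, nor any numbers you might plug in, come from the figure.

/* Memory-generated stalls per instruction from a miss rate and penalty. */
double mem_stalls_per_inst(double misses_per_inst, double miss_penalty_cycles)
{
    return misses_per_inst * miss_penalty_cycles;
}

/* Pipeline stalls per instruction are what is left of the measured CPI
   after the base CPI and the L1/L2 memory stalls are subtracted. */
double pipeline_stalls_per_inst(double measured_cpi, double base_cpi,
                                double l1_stalls, double l2_stalls)
{
    return measured_cpi - base_cpi - l1_stalls - l2_stalls;
}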

Intel Core i7 30 Figure 3.41 The Intel Core i7 pipeline structure shown with the memory system components. The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle.

Wasted Work in Core i7 31 Figure 3.42 The amount of “wasted work” is plotted by taking the ratio of dispatched micro-ops that do not graduate to all dispatched micro-ops. For example, the ratio is 25% for sjeng, meaning that 25% of the dispatched and executed micro-ops are thrown away. The data in this section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana State University.
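Written out as a formula, with assumed variable names, the wasted-work ratio in this caption is simply:

/* Fraction of dispatched micro-ops that never graduate (names assumed). */
double wasted_work(double dispatched_uops, double graduated_uops)
{
    return (dispatched_uops - graduated_uops) / dispatched_uops;
}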

CPI of Intel Core i7 32 Figure 3.43 The CPI for the 19 SPECCPU2006 benchmarks shows an average CPI of 0.83 for both the FP and integer benchmarks, although the behavior is quite different. In the integer case, the CPI values range from 0.44 to 2.66 with a standard deviation of 0.77, while the variation in the FP case is from 0.62 to 1.38 with a standard deviation of The data in this section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana State University.

Relative Performance and Energy Efficiency 33 Figure 3.45 The relative performance and energy efficiency for a set of single-threaded benchmarks shows the i7 920 is 4 to over 10 times faster than the Atom 230 but that it is about 2 times less power efficient on average! Performance is shown in the columns as i7 relative to Atom, which is execution time (Atom)/execution time (i7). Energy is shown with the line as Energy (Atom)/Energy (i7). The i7 never beats the Atom in energy efficiency, although it is essentially as good on four benchmarks, three of which are floating point. The data shown here were collected by Esmaeilzadeh et al. [2011]. The SPEC benchmarks were compiled with optimization on using the standard Intel compiler, while the Java benchmarks use the Sun (Oracle) Hotspot Java VM. Only one core is active on the i7, and the rest are in deep power-saving mode. Turbo Boost is used on the i7, which increases its performance advantage but slightly decreases its relative energy efficiency.