1 Chapter 3: ILP and Its Dynamic Exploitation
Review of the simple static pipeline
Dynamic scheduling, out-of-order execution
Dynamic branch prediction, instruction issue unit
Multiple issue (superscalar)
Hardware-based speculation
ILP limitations
Intel Core i7 and ARM Cortex-A8

2 Multiple Issue
Goal: enable multiple instructions to be issued in a single clock cycle (can get CPI < 1).
Two basic “flavors” of multiple issue:
–Superscalar: maintains the ordinary serial instruction-stream format. Instructions per clock (IPC) varies widely. Instruction issue can be dynamic or static (in-order).
–VLIW (Very Long Instruction Word), a.k.a. EPIC (Explicitly Parallel Instruction Computing): a new format in which parallel instructions are grouped into blocks. Instructions per block are fixed (by block size). Mostly statically scheduled by the compiler.

3 Superscalar Pipeline
Typical superscalar: 1–8 instructions issued per cycle
–Actual IPC depends on dependences and hazards
Simple example: 2 instructions/cycle, static scheduling
–Instructions statically pre-paired to ease decoding:
1st: one load/store/branch/integer-ALU op
2nd: one floating-point op
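The static pairing rule above can be sketched as a simple check. This is an illustrative sketch only; the instruction-class names below are invented here, not part of the original slides:

```python
# Sketch of the static dual-issue pairing rule: the first slot takes a
# load/store/branch/integer-ALU op, the second a floating-point op.
# Class names are illustrative.

SLOT1_CLASSES = {"load", "store", "branch", "int_alu"}
SLOT2_CLASSES = {"fp"}

def can_dual_issue(op1_class, op2_class):
    """True if the two ops fit the fixed slot pairing."""
    return op1_class in SLOT1_CLASSES and op2_class in SLOT2_CLASSES
```

Under this rule a load pairs with an FP add, while two FP ops (or two integer ops) must issue in separate cycles.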

4 Code Example to be Used
C code fragment:
double *p;
do { *(p--) += c; } while (p);
MIPS code fragment:
Loop: LD   F0,0(R1)   ; F0 = *p
      ADDD F4,F0,F2   ; F4 = F0 + c
      SD   0(R1),F4   ; *p = F4
      ADDI R1,R1,#-8  ; p--
      BNEZ R1,Loop    ; repeat until p == 0

5 Multiple Issue + Dynamic Sched.
Why? The usual advantages of dynamic scheduling:
–Compiler-independent, data-dependent scheduling
Multiple-issue Tomasulo:
–Issue 1 integer + 1 FP instruction to reservation stations each cycle
–Problem (again): issuing multiple instructions simultaneously. If the instructions are dependent, hazard detection is complex.
–Two solutions to this problem:
Enter instructions into the tables in only half a clock cycle
Build hardware to issue two instructions in parallel; must be careful to detect the dependences properly
–Memory dependences: load/store dependences are handled through the load/store queue
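The intra-pair hazard check that the second solution requires can be sketched as follows (the tuple encoding of an instruction is invented here for illustration):

```python
# Sketch: detecting an intra-pair RAW hazard when two instructions
# issue in the same cycle. If the second instruction reads a register
# the first writes, the issue hardware must link them through the
# reservation-station tag rather than through the register file.

def intra_pair_raw(inst1, inst2):
    """Each inst is (dest_reg, src_regs). True if inst2 reads inst1's result."""
    dest1, _ = inst1
    _, srcs2 = inst2
    return dest1 in srcs2
```

For the loop above, LD F0,0(R1) followed by ADDD F4,F0,F2 triggers this check, since the ADDD reads F0 in the same cycle the LD is issued.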

6 Example of Dual-Issue Tomasulo
The clock cycles of issue, execute, and write-back for a dual-issue Tomasulo pipeline (no speculation)

7 Example of Dual-Issue Tomasulo
Resource usage table for the previous figure

8 Example of Dual-Issue Tomasulo
The clock cycles of issue, execute, and write-back for a dual-issue Tomasulo pipeline with an additional ALU and CDB

9 Example of Dual-Issue Tomasulo
Resource usage table for the previous figure

10 Hardware-Based Speculation
Dynamic scheduling + speculative execution:
–Dynamic branch prediction chooses which instructions will be pre-executed.
–Speculation executes instructions conditionally early (before branch conditions are resolved).
–Dynamic scheduling handles the scheduling of the different dynamic sequences of basic blocks encountered.
Dataflow execution: execute instructions as soon as their operands are available. Results may be canceled if a prediction turns out to be incorrect!
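The dataflow idea can be illustrated with a toy scheduler, a minimal sketch in which the instruction representation is invented here: an instruction fires as soon as all of its operands exist, regardless of program order.

```python
# Toy dataflow scheduler (illustrative only): repeatedly pick any
# instruction whose source operands are all available, "execute" it by
# making its destination available, and record the firing order.

def dataflow_order(insts, ready):
    """insts: list of (name, srcs, dest); ready: set of available values.
    Returns the order in which instructions become executable."""
    order, pending = [], list(insts)
    while pending:
        for inst in pending:
            name, srcs, dest = inst
            if all(s in ready for s in srcs):
                order.append(name)
                ready.add(dest)
                pending.remove(inst)
                break
        else:
            break  # remaining instructions are blocked on missing operands
    return order
```

Note that in real speculative hardware such eagerly executed results are held back (e.g. in a reorder buffer) so they can be canceled on a misprediction.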

11 Advantages of HW-based Spec.
Allows more overlap of instruction execution.
Dynamic speculation can disambiguate memory references, so a load can be moved before a store (if the locations addressed are different).
Speculation works better when more accurate dynamic branch predictors are used.
Precise exception handling is needed for speculated instructions.
No extra bookkeeping code (speculation bits, register-renaming code) in the program.
Program code is independent of the implementation.
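The disambiguation condition above reduces to an address-overlap check, sketched below (addresses are assumed already computed; the representation is invented here):

```python
# Sketch of dynamic memory disambiguation: a load may bypass the
# queued older stores only if its address matches none of them.

def load_may_bypass(load_addr, older_store_addrs):
    """True if the load can execute before the queued older stores."""
    return all(load_addr != s for s in older_store_addrs)
```

So a load from 0x1000 can move ahead of pending stores to 0x2000 and 0x3000, but not ahead of a pending store to 0x1000.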

12 Implementing HW-based Spec.
Separate the execution of speculative instructions (including dataflow between them) from the committing of results permanently to registers/memory (once the speculation is known to be correct).
A new structure called the reorder buffer holds the results of instructions that have executed speculatively (or non-speculatively) but cannot yet be committed (commit is in order).
–The reorder buffer is non-programmer-visible temporary storage, like the reservation stations in Tomasulo’s algorithm.

13 Steps of Execution in HWBS
Issue (or dispatch):
–Get the next fetched instruction (in order).
–Issue if a reservation station and a reorder buffer entry are free.
Execute:
–Monitor the CDB for operands until ready, then execute.
Write result:
–Write to the CDB, reorder buffer, and reservation stations.
Commit:
–When the instruction is first in the reorder buffer (and wasn’t mispredicted), commit its value to the register/memory.
–Committing a mispredicted branch flushes the reorder buffer.
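The commit step above can be sketched as follows. This is an illustrative model only (the entry fields and function are invented here): only the head of the reorder buffer may commit, and committing a mispredicted branch squashes everything behind it.

```python
# Sketch of in-order commit with a reorder buffer. Each entry is a
# dict with keys 'done', 'mispredicted', 'dest', 'value'.

from collections import deque

def commit_step(rob):
    """Commit the head entry if it has finished executing.
    Returns ('commit', entry), ('flush', entry), or None."""
    if not rob or not rob[0]["done"]:
        return None                 # head not finished: nothing can commit
    head = rob.popleft()            # commit strictly in program order
    if head["mispredicted"]:
        rob.clear()                 # squash all younger (speculative) entries
        return ("flush", head)
    return ("commit", head)         # architecturally write dest <- value
```

Because entries leave only from the head, an instruction that finished early (out of order) still waits until every older instruction has committed, which is what makes precise exceptions possible.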

14 HWBS Implementation Sketch

15 A Simple Example (Fig. 3.12)
Ready to commit; not able to commit due to the pending MUL

16 Loop Example with Reorder Buffer
Completed but not able to commit

17

18 Comparison with/without Speculation

19 Comparison with/without Speculation

20 ILP Limitations
An ideal processor has: infinite registers for renaming; perfect branch and jump prediction; and perfect memory disambiguation.

21 Increasing the Window Size and Maximum Issue Count
How close can a real dynamically scheduled, speculative processor come to the ideal one? It would have to:
–Look arbitrarily far ahead, predicting all branches
–Rename all register uses to avoid WAR/WAW hazards
–Determine data dependences
–Determine memory dependences
–Have enough parallel functional units

22 Limitation on Window Size

23 Effect of Branch Prediction

24 Effect of Finite Registers

25 Effect of Memory Disambiguation

26 ARM Cortex-A8 Pipeline
Figure 3.36 The basic structure of the A8 pipeline is 13 stages. Three cycles are used for instruction fetch and four for instruction decode, in addition to a five-cycle integer pipeline. This yields a 13-cycle branch misprediction penalty. The instruction fetch unit tries to keep the 12-entry instruction queue filled.

27 Decode Stage
Figure 3.37 The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue stage, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline.

28 Execution Stage

29 CPI
Figure 3.39 The estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI. eon deserves some special mention, as it does integer-based graphics calculations (ray tracing) and has very few cache misses. It is computationally intensive with heavy use of multiplies, and the single multiply pipeline becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1- and L2-generated stalls per instruction. These are subtracted from the CPI measured by a detailed simulator to obtain the pipeline stalls. Pipeline stalls include all three hazards plus minor effects such as way misprediction.

30 Intel Core i7
Figure 3.41 The Intel Core i7 pipeline structure shown with the memory system components. The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle.

31 Wasted Work in Core i7
Figure 3.42 The amount of “wasted work” is plotted by taking the ratio of dispatched micro-ops that do not graduate to all dispatched micro-ops. For example, the ratio is 25% for sjeng, meaning that 25% of the dispatched and executed micro-ops are thrown away. The data in this section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana State University.

32 CPI of Intel Core i7
Figure 3.43 The CPI for the 19 SPECCPU2006 benchmarks shows an average CPI of 0.83 for both the FP and integer benchmarks, although the behavior is quite different. In the integer case, the CPI values range from 0.44 to 2.66 with a standard deviation of 0.77, while the variation in the FP case is from 0.62 to 1.38 with a standard deviation of 0.25. The data in this section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana State University.

33 Relative Performance and Energy Efficiency
Figure 3.45 The relative performance and energy efficiency for a set of single-threaded benchmarks shows the i7 920 is 4 to over 10 times faster than the Atom 230 but that it is about 2 times less power efficient on average! Performance is shown in the columns as i7 relative to Atom, which is execution time (i7)/execution time (Atom). Energy is shown with the line as Energy (Atom)/Energy (i7). The i7 never beats the Atom in energy efficiency, although it is essentially as good on four benchmarks, three of which are floating point. The data shown here were collected by Esmaeilzadeh et al. [2011]. The SPEC benchmarks were compiled with optimization on using the standard Intel compiler, while the Java benchmarks use the Sun (Oracle) Hotspot Java VM. Only one core is active on the i7, and the rest are in deep power saving mode. Turbo Boost is used on the i7, which increases its performance advantage but slightly decreases its relative energy efficiency.

