1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14)


1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14)

2 ILP Limits
The perfect processor:
- Infinite registers (no WAW or WAR hazards)
- Perfect branch direction and target prediction
- Perfect memory disambiguation
- Perfect instruction and data caches
- Single-cycle latencies for all ALUs
- Infinite ROB size (window of in-flight instructions)
- No limit on number of instructions in each pipeline stage
The last instruction may be scheduled in the first cycle.
What is the only constraint in this processor?

3 ILP Limits
The perfect processor:
- Infinite registers (no WAW or WAR hazards)
- Perfect branch direction and target prediction
- Perfect memory disambiguation
- Perfect instruction and data caches
- Single-cycle latencies for all ALUs
- Infinite ROB size (window of in-flight instructions)
- No limit on number of instructions in each pipeline stage
The last instruction may be scheduled in the first cycle.
The only constraint is a true dependence (register or memory RAW hazards).
(With value prediction, how would the perfect processor behave?)
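To make the true-dependence constraint concrete, here is a minimal C illustration (ours, not from the slides): the first loop is a serial RAW chain through `acc`, so even the perfect processor needs about one cycle per add, while the second loop's iterations are independent and could in principle all issue in the first cycle.

```c
/* RAW chains vs. independent work: the only limit for the "perfect"
   processor is a true dependence. */
#include <stdio.h>

#define N 1000

int main(void) {
    int a[N], acc = 0;
    for (int i = 0; i < N; i++) a[i] = i;

    /* Serial RAW chain: each add reads the previous acc, so even
       infinite resources need ~N cycles. */
    for (int i = 0; i < N; i++)
        acc += a[i];

    /* Independent updates: no cross-iteration RAW hazard, so all N
       multiplies could be scheduled in the first cycle. */
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 2;

    printf("%d %d\n", acc, a[N - 1]);
    return 0;
}
```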

4 Infinite Window Size and Issue Rate

5 Effect of Window Size
Window size is affected by register file/ROB size, branch mispredict rate, fetch bandwidth, etc.
We will use a window size of 2K instructions and a max issue rate of 64 for subsequent experiments.
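A toy simulator shows the window-size effect directly. The sketch below is an assumption-laden simplification (single-cycle latencies, at most one producer per instruction, in-order retirement, an invented dependence pattern), not the model behind the experiments.

```c
/* Count cycles to drain N instructions under a finite window and
   issue width; dep[i] is the lone producer of i's input, or -1. */
#include <stdio.h>

#define N 16

int main(void) {
    int dep[N], done[N] = {0}, finish[N];
    for (int i = 0; i < N; i++)          /* four chains of length 4 */
        dep[i] = (i % 4 == 0) ? -1 : i - 1;

    int window = 4, width = 2;           /* try varying these */
    int head = 0, cycle = 0, completed = 0;

    while (completed < N) {
        cycle++;
        int issued = 0;
        /* issue ready instructions from the oldest `window` slots */
        for (int i = head; i < head + window && i < N && issued < width; i++)
            if (!done[i] &&
                (dep[i] < 0 || (done[dep[i]] && finish[dep[i]] < cycle))) {
                done[i] = 1; finish[i] = cycle; issued++; completed++;
            }
        while (head < N && done[head]) head++;   /* retire in order */
    }
    printf("cycles=%d  IPC=%.2f\n", cycle, (double)N / cycle);
    return 0;
}
```

Growing `window` (and `width`) lets the scheduler pull independent chains together, which is exactly the trend the window-size experiments measure.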

6 Imperfect Branch Prediction
Note: no branch mispredict penalty is modeled; a branch mispredict simply restricts the window size.
Assume a large tournament predictor for subsequent experiments.
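The experiments assume a large tournament predictor; as a far simpler stand-in, here is a classic 2-bit saturating-counter table (the 512-entry size echoes the P6 case study below; everything else is illustrative), trained on a loop branch that is taken 9 times out of 10.

```c
/* 2-bit saturating counters: predict taken when counter >= 2. */
#include <stdio.h>

#define TABLE_BITS 9                        /* 512 entries */
#define TABLE_SIZE (1 << TABLE_BITS)

static unsigned char counters[TABLE_SIZE];  /* 0..3, zero-initialized */

static int predict(unsigned pc) {
    return counters[(pc >> 2) & (TABLE_SIZE - 1)] >= 2;
}

static void train(unsigned pc, int taken) {
    unsigned char *c = &counters[(pc >> 2) & (TABLE_SIZE - 1)];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    int hits = 0, total = 0;
    for (int rep = 0; rep < 100; rep++)
        for (int i = 0; i < 10; i++) {      /* taken 9x, then not taken */
            int taken = (i != 9);
            hits += (predict(0x400100) == taken);
            train(0x400100, taken);
            total++;
        }
    printf("accuracy = %.1f%%\n", 100.0 * hits / total);
    return 0;
}
```

After warm-up the counter mispredicts only the loop exit, about one branch in ten; real tournament predictors layer global and local history on the same saturating-counter idea.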

7 Effect of Name Dependences
More registers → fewer WAR and WAW constraints (register file size usually goes hand in hand with in-flight window size).
256 int and 256 fp registers for subsequent experiments.
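Hardware removes WAR/WAW constraints by renaming: a map table points each architectural register at the physical register holding its latest value, and every new write grabs a fresh physical register from a free list. The sketch below uses invented sizes and skips reclamation of old physical registers.

```c
/* Minimal register renaming: map table + free list. */
#include <stdio.h>

#define NARCH 8
#define NPHYS 32

int map[NARCH];                 /* arch reg -> current phys reg */
int free_list[NPHYS], n_free;   /* stack of free phys regs      */

void rename_init(void) {
    for (int a = 0; a < NARCH; a++) map[a] = a;
    n_free = 0;
    for (int p = NPHYS - 1; p >= NARCH; p--) free_list[n_free++] = p;
}

/* rename "rd <- rs1 op rs2" */
void rename_instr(int rd, int rs1, int rs2) {
    int ps1 = map[rs1], ps2 = map[rs2];  /* RAW links are preserved */
    int pd  = free_list[--n_free];       /* fresh name: no WAW/WAR  */
    map[rd] = pd;
    printf("p%d <- p%d op p%d\n", pd, ps1, ps2);
}

int main(void) {
    rename_init();
    rename_instr(1, 2, 3);  /* r1 <- r2 op r3                         */
    rename_instr(1, 1, 4);  /* second write to r1 gets a new phys reg */
    rename_instr(2, 1, 1);  /* reads the latest r1 mapping            */
    return 0;
}
```

Once the free list empties, renaming stalls, which is why register file size and in-flight window size go hand in hand.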

8 Memory Dependences

9 Limits of ILP – Summary
Int programs are more limited by branches, memory disambiguation, etc., while FP programs are limited most by window size.
We have not yet examined the effect of branch mispredict penalty and imperfect caching.
All of the studied factors have relatively comparable influence on CPI: window/register size, branch prediction, memory disambiguation.
Can we do better? Yes: better compilers, value prediction, memory dependence prediction, multi-path execution.

10 Pentium III (P6 Microarchitecture) Case Study
- 14-stage pipeline: 8 for fetch/decode/dispatch, 3+ for o-o-o, 3 for commit → a multi-cycle branch mispredict penalty, since the deep front end must refill
- Out-of-order execution with a 40-entry ROB (40 temporary or virtual registers) and 20 reservation stations
- Each x86 instruction gets converted into RISC-like micro-ops – on average, one CISC instr → 1.37 micro-ops
- Three instructions in each pipeline stage → 3 micro-ops can simultaneously leave the pipeline → ideal micro-op CPI = 0.33 → ideal x86-instruction CPI = 0.45
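The ideal-CPI figures follow directly from the commit width and the micro-op expansion factor:

```latex
\mathrm{CPI}_{\mu\mathrm{op}} = \frac{1}{3} \approx 0.33,
\qquad
\mathrm{CPI}_{x86} = \frac{1.37\ \mu\mathrm{ops/instr}}{3\ \mu\mathrm{ops/cycle}} \approx 0.45
```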

11 Branch Prediction
- 512-entry global two-level branch predictor and 512-entry BTB → 20% combined mispredict rate
- For every instruction committed, 0.2 instructions on the mispredicted path are also executed (wasted power!)
- The mispredict penalty spans many cycles, since the deep front end must refill
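Put differently, of every 1.2 instructions executed, 0.2 are wrong-path work:

```latex
\frac{0.2}{1 + 0.2} \approx 17\%\ \text{of all executed instructions are wasted}
```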

12 CPI Performance
Owing to stalls, the processor can fall behind (no instructions are committed in 55% of all cycles), but it then recovers with multi-instruction commits (31% of all cycles) → average CPI = 1.15 (Int) and 2.0 (FP).
Different stalls overlap → the CPI is not simply the sum of the individual stall contributions.
IPC is also an attractive metric.

13 Alternative Designs
Tomasulo's algorithm:
- When an instruction is decoded and "dispatched", it is assigned to a "reservation station"
- The reservation station has an ALU and space for storing operands – ready operands are copied from the register file into the reservation station
- If an operand is not ready, the reservation station keeps track of which reservation station will produce it – this is a form of register renaming
- Instructions are "dispatched" in order (dispatch stalls if a reservation station is not available); instructions begin execution out of order
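The bullets above map almost directly onto data structures. Here is a heavily simplified C sketch of the mechanism (adds only, no timing, invented sizes): operands are copied at dispatch, a pending operand holds the tag of its producing reservation station, and the common-data-bus broadcast fills it in.

```c
/* Tomasulo in miniature: reservation stations + CDB broadcast. */
#include <stdio.h>

#define NRS   4
#define NREGS 8

struct rs {
    int busy;
    int v1, v2;     /* operand values, valid when the q tag is -1 */
    int q1, q2;     /* tag of the producing RS, or -1 if ready    */
} rs[NRS];

int reg_val[NREGS];  /* architectural register file               */
int reg_src[NREGS];  /* tag of the RS that will write the reg, -1 */

/* dispatch "rd <- r1 + r2" into reservation station t (in order) */
void dispatch(int t, int rd, int r1, int r2) {
    rs[t].busy = 1;
    rs[t].q1 = reg_src[r1];
    if (rs[t].q1 < 0) rs[t].v1 = reg_val[r1];  /* copy ready operand */
    rs[t].q2 = reg_src[r2];
    if (rs[t].q2 < 0) rs[t].v2 = reg_val[r2];
    reg_src[rd] = t;       /* later readers wait on tag t: renaming  */
}

/* RS t finishes; broadcast (tag, result) on the common data bus    */
void broadcast(int t, int result) {
    for (int i = 0; i < NRS; i++) {
        if (rs[i].busy && rs[i].q1 == t) { rs[i].v1 = result; rs[i].q1 = -1; }
        if (rs[i].busy && rs[i].q2 == t) { rs[i].v2 = result; rs[i].q2 = -1; }
    }
    for (int r = 0; r < NREGS; r++)
        if (reg_src[r] == t) { reg_val[r] = result; reg_src[r] = -1; }
    rs[t].busy = 0;
}

int try_execute(int t) {   /* out-of-order: runs whenever ready */
    if (rs[t].busy && rs[t].q1 < 0 && rs[t].q2 < 0) {
        broadcast(t, rs[t].v1 + rs[t].v2);
        return 1;
    }
    return 0;
}

int main(void) {
    for (int r = 0; r < NREGS; r++) { reg_val[r] = r; reg_src[r] = -1; }
    dispatch(0, 1, 2, 3);   /* RS0: r1 <- r2 + r3                   */
    dispatch(1, 4, 1, 1);   /* RS1: r4 <- r1 + r1, waits on tag RS0 */
    try_execute(1);         /* not ready yet                        */
    try_execute(0);         /* runs; CDB wakes up RS1               */
    try_execute(1);         /* now ready                            */
    printf("r1=%d r4=%d\n", reg_val[1], reg_val[4]);  /* r1=5 r4=10 */
    return 0;
}
```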

14 Tomasulo's Algorithm
[Figure: instruction fetch → decode & issue (stall if no functional unit available) → reservation stations, each holding two input operands (in1, in2), in front of the ALUs → results returned over the common data bus to the register file and waiting reservation stations. An example sequence R1 ← …, … ← R1, R1 ← … is shown renamed to RS3 ← …, … ← RS3, RS4 ← …]
Instructions read register operands immediately (avoiding WAR hazards), wait for results produced by other ALUs (renaming provides more names), and then execute; the earlier write to R1 never happens (avoiding WAW hazards).

15 Improving Performance
What is the best strategy for improving performance?
- Take a single design and keep pipelining it more deeply
- Increase capacity: use a larger branch predictor, larger cache, larger ROB/register file
- Increase bandwidth: more instructions fetched/decoded/issued/committed per cycle

16 Deep Pipelining
If instructions are independent, deep pipelining allows an instruction to leave the pipeline sooner.
If instructions are dependent, the gap between them (in nanoseconds) widens – there is an optimal pipeline depth.
Some structures are hard to pipeline – register files, for example.
Pipelining can increase bypassing complexity.
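A common first-order model (our framing, not the slide's) makes the optimal-depth argument explicit. If an operation with logic delay t_op is split across d_op pipeline stages, each paying a latch overhead t_latch, the clock period shrinks, but the absolute time a dependent instruction must wait grows with depth:

```latex
t_{\mathrm{clk}} = \frac{t_{\mathrm{op}}}{d_{\mathrm{op}}} + t_{\mathrm{latch}},
\qquad
T_{\mathrm{dep}} = d_{\mathrm{op}}\, t_{\mathrm{clk}}
                 = t_{\mathrm{op}} + d_{\mathrm{op}}\, t_{\mathrm{latch}}
```

Independent work gains throughput as 1/t_clk rises, while the d_op · t_latch term keeps growing, so dependent chains eventually stop benefiting: hence an optimal depth.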

17 Capacity/Bandwidth
Even if we design a 6-wide processor, most units go under-utilized (average IPC of 2.0!) – hence, increased bandwidth is not going to buy much.
Higher capacity (being able to examine 500 speculative instructions) can increase the chances of finding work and boost IPC – what is the big bottleneck?
Higher capacity (and higher bandwidth) increases the complexity of each structure and its access time – for example: a 32KB cache in 1 cycle, a 128KB cache in 2 cycles.

18 Future Microprocessors
By increasing branch predictor size, window size, number of ALUs, and pipeline stages, IPC and clock speed can improve → however, this is a case of diminishing returns!
For example, with a window size of 400 and with 10 ALUs, we are likely to find fewer than four instructions to issue every cycle → under-utilization, wasted work, low throughput per watt consumed.
Hence, a more cost-effective solution: build four simple processors in the same area – each processor executes a different thread → high thread throughput, but probably poorer single-application performance.

19 Thread-Level Parallelism
Motivation:
- a single thread leaves a processor under-utilized most of the time
- doubling the processor area barely improves single-thread performance
Strategies for thread-level parallelism:
- Simultaneous Multi-Threading (SMT): multiple threads share the same large processor → reduces under-utilization, efficient resource allocation
- Chip Multi-Processing (CMP): each thread executes on its own mini processor → simple design, low interference between threads
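For contrast with the single-thread ILP techniques above, here is a minimal POSIX-threads sketch (a generic example, not from the slides): two independent threads supply the parallelism that keeps two SMT contexts or two CMP cores busy.

```c
/* Two independent threads: thread-level parallelism in miniature. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    long id = (long)arg, sum = 0;
    for (long i = 0; i < 100000000L; i++)   /* independent work */
        sum += i ^ id;
    printf("thread %ld: sum=%ld\n", id, sum);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Compile with `gcc -pthread`; on an SMT or multicore machine the two workers run concurrently, roughly halving the wall-clock time of the same work done serially.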
