ECE/CS 552: Pipelining to Superscalar © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith
Pipelining to Superscalar

Forecast:
- Real pipelines
- IBM RISC experience
- The case for superscalar
- Instruction-level parallel machines
- Superscalar pipeline organization
- Superscalar pipeline design
MIPS R2000/R3000 Pipeline

Each stage is divided into two clock phases (φ1/φ2); a separate adder generates branch targets in RD.

Stage | Phase | Function performed
------+-------+------------------------------------------------------
IF    | φ1    | Translate virtual instr. addr. using TLB
      | φ2    | Access I-cache
RD    | φ1    | Return instruction from I-cache, check tags & parity
      | φ2    | Read RF; if branch, generate target
ALU   | φ1    | Start ALU op; if branch, check condition
      | φ2    | Finish ALU op; if ld/st, translate addr
MEM   | φ1    | Access D-cache
      | φ2    | Return data from D-cache, check tags & parity
WB    | φ1    | Write RF
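A toy sketch (illustrative only; the two-phase detail above is omitted) of how instructions march through the five stages, one stage per cycle:

    stages = ["IF", "RD", "ALU", "MEM", "WB"]
    insts = ["i0", "i1", "i2", "i3"]
    # Each instruction enters IF one cycle after its predecessor;
    # once the pipeline fills, one instruction completes per cycle.
    for cycle in range(len(insts) + len(stages) - 1):
        row = []
        for k, inst in enumerate(insts):
            s = cycle - k
            row.append(f"{inst}:{stages[s]}".ljust(8) if 0 <= s < len(stages) else " " * 8)
        print(f"cycle {cycle}: " + "".join(row))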
Intel i486 5-stage Pipeline

Prefetch queue holds 2 x 16B ??? instructions.

Stage | Function performed
------+------------------------------------------------------------------
IF    | Fetch instruction from 32B prefetch buffer (separate fetch unit fills and flushes prefetch buffer)
ID-1  | Translate instr. into control signals or microcode address; initiate address generation and memory access
ID-2  | Access microcode memory; send microinstruction(s) to execute unit
EX    | Execute ALU and memory operations
WB    | Write back to RF
IBM RISC Experience [Agerwala and Cocke 1987]

Internal IBM study: what are the limits of a scalar pipeline?

Memory bandwidth:
- Fetch 1 instr/cycle from I-cache
- 40% of instructions are load/store (D-cache)

Code characteristics (dynamic):
- Loads – 25%
- Stores – 15%
- ALU/RR – 40%
- Branches & jumps – 20%
  - 1/3 unconditional (always taken), 1/3 conditional taken, 1/3 conditional not taken
IBM Experience

Cache performance:
- Assume 100% hit ratio (upper bound)
- Cache latency: I = D = 1 cycle default

Load and branch scheduling:
- Loads: 25% cannot be scheduled (delay slot empty), 65% can be moved back 1 or 2 instructions, 10% can be moved back 1 instruction
- Branches & jumps: unconditional – 100% schedulable (fill one delay slot); conditional – 50% schedulable (fill one delay slot)
CPI Optimizations

Goal and impediments:
- CPI = 1, prevented by pipeline stalls

No cache bypass of RF, no load/branch scheduling:
- Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI
- Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI
- Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI

Bypass, no load/branch scheduling:
- Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI
- Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
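A quick numeric check of these totals (a minimal sketch; the mix is the one given two slides back):

    # Dynamic mix: loads 25%, branches & jumps 20% (2/3 taken).
    f_load, f_branch, f_taken = 0.25, 0.20, 2 / 3

    # No bypassing, no scheduling: 2-cycle load and branch penalties.
    print(1 + f_load * 2 + f_branch * f_taken * 2)   # ~1.77 CPI

    # Bypassing cuts the load penalty to 1 cycle.
    print(1 + f_load * 1 + f_branch * f_taken * 2)   # ~1.52 CPI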
More CPI Optimizations

Bypass, scheduling of loads/branches:
- Load penalty: 65% + 10% = 75% moved back, no penalty; remaining 25% => 1-cycle penalty: 0.25 x 0.25 x 1 = 0.0625 CPI
- Branch penalty:
  - 1/3 unconditional, 100% schedulable => 1 cycle
  - 1/3 conditional not-taken => no penalty (predict not-taken)
  - 1/3 conditional taken, 50% schedulable => 1 cycle
  - 1/3 conditional taken, 50% unschedulable => 2 cycles
  - 0.20 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 CPI
- Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI
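The same arithmetic as a sketch (percentages as stated above):

    f_load, f_branch = 0.25, 0.20

    # 75% of loads are scheduled away; the remaining 25% stall 1 cycle.
    load_cpi = f_load * 0.25 * 1                       # 0.0625

    # Per-branch penalty: uncond. 1 cycle (slot filled), cond. not-taken 0,
    # cond. taken 1 cycle if schedulable (50%) else 2 cycles.
    branch_cpi = f_branch * (1/3 * 1 + 1/3 * 0 + 1/3 * (0.5 * 1 + 0.5 * 2))
    print(1 + load_cpi + branch_cpi)                   # ~1.23 CPI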
Simplify Branches

Assume 90% can be PC-relative:
- No register indirect, no register access
- Separate adder (like MIPS R3000)
- Branch penalty reduced

PC-relative? | Schedulable? | Penalty
Yes (90%)    | Yes (50%)    | 0 cycles
Yes (90%)    | No (50%)     | 1 cycle
No (10%)     | –            | 2 cycles

Total CPI: 1 + 0.063 + 0.085 = 1.15 CPI = 0.87 IPC
15% overhead from program dependences
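A hedged check of the reduced branch penalty (an assumed reading: the table applies only to taken branches, i.e., 2/3 of the 20%, since not-taken branches are predicted not-taken for free):

    taken_frac = 0.20 * 2 / 3          # unconditional + conditional taken
    # 90% PC-relative: half schedulable (0 cycles), half not (1 cycle);
    # 10% not PC-relative: 2 cycles.
    penalty = taken_frac * (0.9 * 0.5 * 0 + 0.9 * 0.5 * 1 + 0.1 * 2)
    print(penalty)                     # ~0.087, in line with the slide's 0.085
    print(1 + 0.063 + penalty)         # ~1.15 CPI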
Processor Performance

    Processor performance = Time / Program
                          = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
                          = (code size) x (CPI) x (cycle time)

In the 1980s (decade of pipelining): CPI: 5.0 => 1.15
In the 1990s (decade of superscalar): CPI: 1.15 => 0.5 (best case)
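The same identity in code (a sketch; the instruction count and clock are made-up illustrative values):

    def exec_time(insts, cpi, cycle_time_ns):
        # Iron law: time = instructions x CPI x cycle time.
        return insts * cpi * cycle_time_ns

    # Same program and clock, start-of-1980s vs. end-of-1980s CPI:
    print(exec_time(1e9, 5.0, 10))    # 5.0e10 ns
    print(exec_time(1e9, 1.15, 10))   # 1.15e10 ns (~4.3x faster)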
Revisit Amdahl’s Law

[Figure: execution time split between a serial fraction run on one processor and a vectorizable fraction f run on N processors]

- h = fraction of time in serial code
- f = fraction that is vectorizable
- v = speedup for f

Overall speedup:

    Speedup = 1 / (1 - f + f/v)
Revisit Amdahl’s Law

Sequential bottleneck: even if v is infinite, performance is limited by the nonvectorizable portion (1 - f):

    Speedup -> 1 / (1 - f) as v -> infinity
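A minimal sketch of the formula above, showing the sequential bottleneck numerically:

    def amdahl(f, v):
        # Overall speedup when a fraction f is sped up by a factor v.
        return 1.0 / ((1 - f) + f / v)

    # Even as v grows without bound, (1 - f) caps the speedup:
    for v in (2, 10, 100, 1e9):
        print(v, amdahl(0.8, v))
    # 1.67, 3.57, 4.81, ... -> 1 / (1 - 0.8) = 5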
Pipelined Performance Model

[Figure: execution profile of a depth-N pipeline; for a fraction g of the time the pipeline is filled to full depth, for 1 - g it holds a single instruction]

- g = fraction of time pipeline is filled
- 1 - g = fraction of time pipeline is not filled (stalled)
Tyranny of Amdahl’s Law [Bob Colwell]:
- When g is even slightly below 100%, a big performance hit results
- Stalled cycles are the key adversary and must be minimized as much as possible
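By analogy with Amdahl’s Law, speedup = 1 / (1 - g + g/N) for a depth-N pipeline. A minimal sketch (depth 10 is an assumed example) shows how quickly it decays:

    def pipelined_speedup(g, depth):
        # Speedup of a depth-N pipeline that is filled a fraction g of the time.
        return 1.0 / ((1 - g) + g / depth)

    for g in (1.0, 0.99, 0.95, 0.90):
        print(g, pipelined_speedup(g, 10))
    # 10.0, 9.17, 6.90, 5.26: a few percent of stall cycles
    # costs a large fraction of the peak speedup.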
Motivation for Superscalar [Agerwala and Cocke]

[Figure: speedup vs. vectorizable fraction f for several organizations; the typical range of f is highlighted]

Speedup jumps from 3 to 4.3 for N = 6, f = 0.8 when the sequential portion runs at s = 2 instead of s = 1 (scalar).
Superscalar Proposal

Moderate the tyranny of Amdahl’s Law:
- Ease the sequential bottleneck
- More generally applicable
- Robust (less sensitive to f)

Revised Amdahl’s Law:

    Speedup = 1 / ((1 - f)/s + f/v)
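A minimal sketch of the revised formula, reproducing the 3-to-4.3 jump quoted on the previous slide (N = 6 plays the role of v):

    def revised_amdahl(f, v, s):
        # Fraction f sped up by v; the remaining (1 - f) sped up by s.
        return 1.0 / ((1 - f) / s + f / v)

    print(revised_amdahl(0.8, 6, 1))   # 3.0  (scalar sequential portion)
    print(revised_amdahl(0.8, 6, 2))   # ~4.3 (superscalar, s = 2)

    # Robustness: with s = 2 the speedup degrades more gracefully as f falls.
    for f in (0.9, 0.8, 0.6, 0.4):
        print(f, round(revised_amdahl(f, 6, 1), 2), round(revised_amdahl(f, 6, 2), 2))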
Limits on Instruction-Level Parallelism (ILP)

Study                       | IPC limit
----------------------------+---------------------------------
Weiss and Smith [1984]      | 1.58
Sohi and Vajapeyam [1987]   | 1.81
Tjaden and Flynn [1970]     | 1.86 (Flynn’s bottleneck)
Tjaden and Flynn [1973]     | 1.96
Uht [1986]                  | 2.00
Smith et al. [1989]         | 2.00
Jouppi and Wall [1988]      | 2.40
Johnson [1991]              | 2.50
Acosta et al. [1986]        | 2.79
Wedig [1982]                | 3.00
Butler et al. [1991]        | 5.8
Melvin and Patt [1991]      | 6
Wall [1991]                 | 7 (Jouppi disagreed)
Kuck et al. [1972]          | 8
Riseman and Foster [1972]   | 51 (no control dependences)
Nicolau and Fisher [1984]   | 90 (Fisher’s optimism)
Superscalar Proposal

- Go beyond the single-instruction pipeline, achieve IPC > 1
- Dispatch multiple instructions per cycle
- Provide a more generally applicable form of concurrency (not just vectors)
- Geared toward sequential code that is hard to parallelize otherwise
- Exploit fine-grained or instruction-level parallelism (ILP)
Classifying ILP Machines [Jouppi, DECWRL 1991]

Baseline scalar RISC:
- Issue parallelism = IP = 1
- Operation latency = OP = 1
- Peak IPC = 1
Classifying ILP Machines [Jouppi, DECWRL 1991]

Superpipelined: cycle time = 1/m of baseline
- Issue parallelism = IP = 1 inst / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = m instr / major cycle (m x speedup?)
Classifying ILP Machines [Jouppi, DECWRL 1991]

Superscalar:
- Issue parallelism = IP = n inst / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = n instr / cycle (n x speedup?)
Classifying ILP Machines [Jouppi, DECWRL 1991]

VLIW: Very Long Instruction Word
- Issue parallelism = IP = n inst / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = n instr / cycle = 1 VLIW / cycle
Classifying ILP Machines [Jouppi, DECWRL 1991]

Superpipelined-Superscalar:
- Issue parallelism = IP = n inst / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = n x m instr / major cycle
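A compact recap of peak throughput per base (major) cycle (a sketch; n = 3 and m = 2 are assumed example values):

    n, m = 3, 2   # example issue width n and superpipelining degree m
    print("baseline scalar:", 1)                 # IP = 1, OP = 1
    print("superpipelined:", m)                  # 1 inst/minor cycle => m per base cycle
    print("superscalar:", n)                     # n inst/cycle
    print("VLIW:", n)                            # n ops in 1 VLIW per cycle
    print("superpipelined-superscalar:", n * m)  # n inst/minor cycle => n*m per base cycle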
Superscalar vs. Superpipelined

- Roughly equivalent performance: if n = m, both have about the same peak IPC
- Parallelism exposed in space (superscalar: multiple pipelines side by side) vs. in time (superpipelined: more, shorter stages)
Superscalar Challenges