Pipelining to Superscalar ECE/CS 752 Fall 2017

Presentation transcript:

Pipelining to Superscalar ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin-Madison

Pipelining to Superscalar: Forecast
- Limits of pipelining
- The case for superscalar
- Instruction-level parallel machines
- Superscalar pipeline organization
- Superscalar pipeline design

Generic Instruction Processing Pipeline

IBM RISC Experience [Agerwala and Cocke 1987]
Internal IBM study: limits of a scalar pipeline?
Memory bandwidth: fetch 1 instruction/cycle from I-cache; 40% of instructions are load/store (D-cache)
Code characteristics (dynamic):
- Loads – 25%
- Stores – 15%
- ALU/RR – 40%
- Branches & jumps – 20%: 1/3 unconditional (always taken), 1/3 conditional taken, 1/3 conditional not taken

IBM Experience
Cache performance:
- Assume 100% hit ratio (upper bound)
- Cache latency: I = D = 1 cycle default
Load and branch scheduling:
- Loads: 25% cannot be scheduled (delay slot empty), 65% can be moved back 1 or 2 instructions, 10% can be moved back 1 instruction
- Branches & jumps: unconditional – 100% schedulable (fill one delay slot); conditional – 50% schedulable (fill one delay slot)

CPI Optimizations
Goal and impediments: CPI = 1, prevented by pipeline stalls
No cache bypass of RF, no load/branch scheduling:
- Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI
- Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI
- Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
Bypass, no load/branch scheduling:
- Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI
- Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
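To make the arithmetic concrete, here is a minimal sketch in Python that recomputes the stall-penalty contributions from the instruction mix. The mix fractions and penalties come from the slides above; the function and variable names are illustrative, not part of the original study.

```python
# CPI estimate from stall penalties (mix fractions from the Agerwala/Cocke slides)
load_frac, branch_frac = 0.25, 0.20
taken_frac = 2.0 / 3.0            # 1/3 unconditional + 1/3 conditional taken

def cpi(load_penalty, branch_penalty):
    """Base CPI of 1 plus per-instruction stall contributions."""
    return 1.0 + load_frac * load_penalty + branch_frac * taken_frac * branch_penalty

print(cpi(load_penalty=2, branch_penalty=2))  # no bypass, no scheduling -> ~1.77
print(cpi(load_penalty=1, branch_penalty=2))  # bypass into RF           -> ~1.52
```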

More CPI Optimizations
Bypass, scheduling of loads/branches:
- Load penalty: 65% + 10% = 75% moved back, no penalty; 25% => 1 cycle penalty: 0.25 x 0.25 x 1 = 0.0625 CPI
- Branch penalty:
  1/3 unconditional, 100% schedulable => 1 cycle
  1/3 conditional not-taken => no penalty (predict not-taken)
  1/3 conditional taken, 50% schedulable => 1 cycle
  1/3 conditional taken, 50% unschedulable => 2 cycles
  0.20 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 CPI
- Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI
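Continuing the same illustrative sketch, the scheduled case can be recomputed directly from the slide's percentages:

```python
# With bypassing plus load/branch scheduling (fractions from the slide)
load_cpi = 0.25 * 0.25 * 1              # only 25% of loads keep a 1-cycle penalty
branch_cpi = 0.20 * ((1/3) * 1          # unconditional: delay slot filled, 1 cycle left
                     + (1/3) * 0.5 * 1  # cond. taken, schedulable: 1 cycle
                     + (1/3) * 0.5 * 2) # cond. taken, unschedulable: 2 cycles
print(1.0 + load_cpi + branch_cpi)      # ~1.23 CPI
```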

Simplify Branches
Assume 90% can be PC-relative: no register indirect, no register access; separate adder (like MIPS R3000) => branch penalty reduced:

  PC-relative   Schedulable   Penalty
  Yes (90%)     Yes (50%)     0 cycles
  Yes (90%)     No (50%)      1 cycle
  No (10%)      --            2 cycles

Total CPI: 1 + 0.063 + 0.085 = 1.15 CPI = 0.87 IPC
=> ~15% overhead from program dependences

Limits of Pipelining
IBM RISC experience: control and data dependences add ~15%; best case CPI of 1.15, IPC of 0.87
Deeper pipelines (higher frequency) magnify dependence penalties
This analysis assumes 100% cache hit rates; hit rates approach 100% for some programs, but many important programs have much worse hit rates (covered later)

CPU, circa 1986
MIPS R2000, ~"most elegant pipeline ever devised" – J. Larus
Enablers: RISC ISA, pipelining, on-chip cache memory
(Separate adder for branch targets. Source: https://imgtec.com)

  Stage   Phase   Function performed
  IF      φ1      Translate virtual instr. addr. using TLB
          φ2      Access I-cache
  RD      φ1      Return instruction from I-cache, check tags & parity
          φ2      Read RF; if branch, generate target
  ALU     φ1      Start ALU op; if branch, check condition
          φ2      Finish ALU op; if ld/st, translate addr
  MEM     φ1      Access D-cache
          φ2      Return data from D-cache, check tags & parity
  WB      φ1      Write RF

Processor Performance
Processor performance = Time / Program
  = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
  = (code size) x (CPI) x (cycle time)
In the 1980s (decade of pipelining): CPI: 5.0 => 1.15
In the 1990s (decade of superscalar): CPI: 1.15 => 0.5 (best case)
In the 2000s (decade of multicore): focus on thread-level parallelism, CPI => 0.33 (best case)
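As a quick illustration of the "iron law" above, here is a small sketch that multiplies the three factors; the instruction count and cycle time are made-up placeholders, not from the slide.

```python
# Iron law of performance: time = instruction count x CPI x cycle time
def exec_time_ns(instruction_count, cpi, cycle_time_ns):
    """Execution time in nanoseconds."""
    return instruction_count * cpi * cycle_time_ns

# Hypothetical program: 1e9 dynamic instructions on a 1 ns (1 GHz) clock
print(exec_time_ns(1e9, cpi=1.77, cycle_time_ns=1.0))  # ~1.77e9 ns
print(exec_time_ns(1e9, cpi=1.15, cycle_time_ns=1.0))  # scheduling + bypassing help
```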

Amdahl's Law
h = fraction of time in serial code
f = fraction of the work that is vectorizable
v = speedup for the vectorizable fraction f
Overall speedup: Speedup = 1 / ((1 - f) + f / v)
[Figure: execution timeline split into a serial phase (h, on 1 processor) and a parallel phase (1 - h, on N processors)]
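A minimal sketch of the formula above (the function name is mine), which also previews the sequential bottleneck discussed on the next slide: even with an unbounded vector speedup v, the overall speedup saturates at 1 / (1 - f).

```python
def amdahl_speedup(f, v):
    """Amdahl's Law: f = vectorizable fraction, v = speedup on that fraction."""
    return 1.0 / ((1.0 - f) + f / v)

print(amdahl_speedup(f=0.8, v=6))    # ~3.0
print(amdahl_speedup(f=0.8, v=1e9))  # -> 1 / (1 - f) = 5.0, no matter how large v gets
```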

Revisit Amdahl's Law
Sequential bottleneck: even if v is infinite, performance is limited by the nonvectorizable portion (1 - f); as v -> infinity, speedup -> 1 / (1 - f)
[Figure: same execution-time diagram as on the previous slide]

Pipelined Performance Model g = fraction of time pipeline is filled 1-g = fraction of time pipeline is not filled (stalled)

Pipelined Performance Model
[Figure: fraction 1 - g of the time the pipeline runs at depth 1 (stalled), fraction g at full depth]
g = fraction of time pipeline is filled
1 - g = fraction of time pipeline is not filled (stalled)

Pipelined Performance Model
Tyranny of Amdahl's Law [Bob Colwell]: when g is even slightly below 100%, a big performance hit will result. Stalled cycles are the key adversary and must be minimized as much as possible.
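To see the "tyranny" numerically, here is a sketch assuming the pipelined analogue of Amdahl's Law, Speedup = 1 / ((1 - g) + g/N) for an N-deep pipeline; the closed form is my assumption, since the slide only gives the g / (1 - g) model.

```python
def pipeline_speedup(g, depth):
    """Fraction g of cycles run at full depth, fraction 1-g stall at depth 1."""
    return 1.0 / ((1.0 - g) + g / depth)

for g in (1.00, 0.99, 0.95, 0.90):
    print(g, pipeline_speedup(g, depth=10))
# 1.00 -> 10.0, 0.99 -> ~9.2, 0.95 -> ~6.9, 0.90 -> ~5.3:
# even small drops in g cost a large fraction of the ideal speedup
```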

Motivation for Superscalar [Agerwala and Cocke]
Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, with s = 2 instead of s = 1 (scalar)
[Figure: speedup curves, with the typical range of f highlighted]

Superscalar Proposal
Moderate the tyranny of Amdahl's Law:
- Ease the sequential bottleneck
- More generally applicable
- Robust (less sensitive to f)
Revised Amdahl's Law: Speedup = 1 / ((1 - f) / s + f / v)
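A quick check of the numbers quoted two slides back, using the revised formula (the function name is mine): speeding up the sequential fraction from s = 1 to s = 2 lifts the overall speedup from 3 to about 4.3.

```python
def revised_amdahl(f, v, s=1.0):
    """Revised Amdahl's Law: the sequential fraction runs at speedup s instead of 1."""
    return 1.0 / ((1.0 - f) / s + f / v)

print(revised_amdahl(f=0.8, v=6, s=1))  # ~3.0  (classic Amdahl, scalar sequential code)
print(revised_amdahl(f=0.8, v=6, s=2))  # ~4.3  (2-wide execution of the sequential code)
```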

Limits on Instruction Level Parallelism (ILP)

  Weiss and Smith [1984]       1.58
  Sohi and Vajapeyam [1987]    1.81
  Tjaden and Flynn [1970]      1.86  (Flynn's bottleneck)
  Tjaden and Flynn [1973]      1.96
  Uht [1986]                   2.00
  Smith et al. [1989]
  Jouppi and Wall [1988]       2.40
  Johnson [1991]               2.50
  Acosta et al. [1986]         2.79
  Wedig [1982]                 3.00
  Butler et al. [1991]         5.8
  Melvin and Patt [1991]       6
  Wall [1991]                  7     (Jouppi disagreed)
  Kuck et al. [1972]           8
  Riseman and Foster [1972]    51    (no control dependences)
  Nicolau and Fisher [1984]    90    (Fisher's optimism)

Notes:
- Flynn's bottleneck: IPC < 2 because of branches and control flow
- Fisher's optimism: benchmarks were very loop-intensive scientific workloads; led to the Multiflow startup and the Trace VLIW machine
- Variance due to: benchmarks, machine models, cache latency & hit rate, compilers, religious bias, general-purpose vs. special-purpose/scientific codes, C vs. Fortran
- Results are not monotonic with time
- Jouppi disagreed with Wall's estimate of 7

Superscalar Proposal
Go beyond the single-instruction pipeline and achieve IPC > 1:
- Dispatch multiple instructions per cycle
- Provide a more generally applicable form of concurrency (not just vectors)
- Geared for sequential code that is hard to parallelize otherwise
- Exploit fine-grained, instruction-level parallelism (ILP): not degree-100 or degree-1000 parallelism, but 2-3-4
- Fine-grained vs. medium-grained (loop iterations) vs. coarse-grained (threads)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Baseline scalar RISC:
- Issue parallelism = IP = 1 instruction/cycle
- Operation latency = OP = 1 cycle
- Peak IPC = 1
(IP = max instructions issued per cycle; OP = number of cycles until a result is available)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined: cycle time = 1/m of baseline
- Issue parallelism = IP = 1 instruction / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = m instructions / major cycle (m x speedup?)
- Issuing m times faster

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superscalar:
- Issue parallelism = IP = n instructions / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = n instructions / cycle (n x speedup?)
- Full replication of execution units => OK; partial replication (specialized execution units) => must manage structural hazards

Classifying ILP Machines [Jouppi, DECWRL 1991]
VLIW: Very Long Instruction Word
- Issue parallelism = IP = n instructions / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = n instructions / cycle = 1 VLIW / cycle
History:
- Early: Multiflow Trace (8-wide), later Cydrome (both folded)
- Later: Philips Trimedia for media/image processing, GPUs, etc.; folded just last year
- Current: Intel IA-64
Characteristics: parallelism packaged by the compiler, hazards managed by the compiler, no (or few) hardware interlocks, low code density (NOPs), clean/regular hardware

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined-superscalar ("superduper"):
- Issue parallelism = IP = n instructions / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = n x m instructions / major cycle
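A tiny sketch tabulating the peak-IPC arithmetic across Jouppi's classes; the helper function and the specific values of n and m are purely illustrative.

```python
# Peak IPC per major cycle in Jouppi's classification:
# n = instructions issued per issue slot, m = minor cycles per major cycle
def peak_ipc(n, m):
    return n * m

print(peak_ipc(n=1, m=1))  # baseline scalar RISC       -> 1
print(peak_ipc(n=1, m=3))  # superpipelined (m = 3)     -> 3
print(peak_ipc(n=4, m=1))  # superscalar (n = 4)        -> 4
print(peak_ipc(n=4, m=3))  # superpipelined-superscalar -> 12
```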

Superscalar vs. Superpipelined
Roughly equivalent performance: if n = m, both have about the same peak IPC
Parallelism exposed in space (superscalar) vs. time (superpipelined): a given degree of concurrency can be achieved in either dimension

Superpipelining: Result Latency

Superscalar Challenges