1
CSCE 432/832 High Performance Processor Architectures Scalar to Superscalar
Adapted from lecture notes based in part on slides created by Mikko H. Lipasti, John Shen, Mark Hill, David Wood, Guri Sohi, and Jim Smith
2
Forecast
Real pipelines
IBM RISC experience
The case for superscalar
Instruction-level parallel machines
Superscalar pipeline organization
Superscalar pipeline design
3
Processor Performance
Processor Performance = Time / Program
  = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
  = (code size) × (CPI) × (cycle time)
In the 1980s (decade of pipelining), CPI went from 5.0 to 1.15.
In the 1990s (decade of superscalar), CPI went from 1.15 to 0.5 (best case).
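The iron law above lends itself to a quick numeric check. Below is a minimal Python sketch (the function name, instruction count, and cycle time are illustrative assumptions, not from the slides) that isolates the speedup implied by the quoted CPI improvements alone, holding code size and cycle time fixed:

```python
def exec_time(instructions, cpi, cycle_time):
    """Iron law: execution time = instructions x CPI x cycle time."""
    return instructions * cpi * cycle_time

# Same hypothetical program (1e9 instructions) at a fixed 10 ns cycle time.
insts, cycle = 1e9, 10e-9
t_start_80s = exec_time(insts, 5.0, cycle)   # CPI = 5.0
t_end_80s   = exec_time(insts, 1.15, cycle)  # CPI = 1.15 after pipelining
t_end_90s   = exec_time(insts, 0.5, cycle)   # CPI = 0.5, best-case superscalar

print(f"1980s (pipelining) speedup from CPI alone:  {t_start_80s / t_end_80s:.2f}x")  # ~4.35x
print(f"1990s (superscalar) speedup from CPI alone: {t_end_80s / t_end_90s:.2f}x")    # 2.30x
```

In reality cycle time also shrank across those decades; this sketch only isolates the CPI term of the iron law.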
4
Amdahl’s Law
[Figure: execution profile on N processors over time; the serial fraction h runs on one processor, the remaining 1 - h runs on all N]
h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f
Overall speedup: Speedup = 1 / ((1 - f) + f / v)
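A one-function Python sketch of the formula above (the function name is mine), handy for plugging in values of f and v:

```python
def amdahl_speedup(f, v):
    """Overall speedup when a fraction f of execution is sped up by a
    factor v and the remaining (1 - f) runs at the original speed."""
    return 1.0 / ((1.0 - f) + f / v)

# Example: 80% of the time is vectorizable and the vector unit is 10x faster.
print(amdahl_speedup(0.8, 10))   # ~3.57, well short of 10x
```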
5
Revisit Amdahl’s Law
Sequential bottleneck: even if v is infinite, performance is limited by the nonvectorizable portion (1 - f), so the overall speedup can never exceed 1 / (1 - f).
[Figure: same execution profile as the previous slide: serial fraction h on one processor, parallel fraction 1 - h on N processors]
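Stated symbolically (this restates the formula from the previous slide; it is not transcribed from the slide image):

```latex
\lim_{v \to \infty} \frac{1}{(1 - f) + \frac{f}{v}} = \frac{1}{1 - f}
```

So with f = 0.8, no amount of vector speedup can push the overall speedup past 5.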
6
Pipelined Performance Model
g = fraction of time the pipeline is filled
1 - g = fraction of time the pipeline is not filled (stalled)
7
Pipelined Performance Model
[Figure: Amdahl-style model with pipeline depth N: the machine runs at full depth N a fraction g of the time and at depth 1 the remaining 1 - g]
g = fraction of time the pipeline is filled
1 - g = fraction of time the pipeline is not filled (stalled)
By analogy with Amdahl’s Law, speedup over an unpipelined machine = 1 / ((1 - g) + g / N)
8
Pipelined Performance Model
[Figure: same pipelined performance model as the previous slide]
Tyranny of Amdahl’s Law [Bob Colwell]:
When g is even slightly below 100%, a big performance hit will result.
Stalled cycles are the key adversary and must be minimized as much as possible.
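To make Colwell's point concrete, here is a small sketch using the Amdahl-style pipeline model above (the 10-stage depth is an arbitrary example, not from the slides):

```python
def pipeline_speedup(g, depth):
    """Speedup over an unpipelined machine when a pipeline of the given
    depth is filled a fraction g of the time and stalled otherwise."""
    return 1.0 / ((1.0 - g) + g / depth)

# A 10-stage pipeline loses speedup quickly as g drops below 100%:
for g in (1.00, 0.99, 0.95, 0.90, 0.80):
    print(f"g = {g:.2f}: speedup = {pipeline_speedup(g, 10):.2f}")
# -> 10.00, 9.17, 6.90, 5.26, 3.57
```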
9
Motivation for Superscalar [Agerwala and Cocke]
Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, with s = 2 instead of s = 1 (scalar).
[Figure: speedup plot with the typical range marked]
10
Superscalar Proposal
Moderate the tyranny of Amdahl’s Law:
Ease the sequential bottleneck
More generally applicable
Robust (less sensitive to f)
Revised Amdahl’s Law, with the sequential portion sped up by s: Speedup = 1 / ((1 - f) / s + f / v)
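As a sanity check, the revised formula reproduces the numbers on the Agerwala and Cocke motivation slide (reading that slide's N = 6 as the vectorized speedup v); the function name below is mine:

```python
def revised_amdahl(f, v, s):
    """Speedup when the vectorizable fraction f gets speedup v and the
    remaining (1 - f) runs on a base engine with speedup s (s = 1 is scalar)."""
    return 1.0 / ((1.0 - f) / s + f / v)

print(f"{revised_amdahl(0.8, 6, 1):.2f}")  # 3.00 with a scalar base engine
print(f"{revised_amdahl(0.8, 6, 2):.2f}")  # 4.29 with a modest 2x base engine
```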
11
Limits on Instruction Level Parallelism (ILP)
Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86 (Flynn’s bottleneck)
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7 (Jouppi disagreed)
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51 (no control dependences)
Nicolau and Fisher [1984]: 90 (Fisher’s optimism)
Flynn’s bottleneck: IPC < 2 because of branches and control flow.
Fisher’s optimism: the benchmarks were very loop-intensive scientific workloads; this led to the Multiflow startup and the Trace VLIW machine.
Variance is due to: benchmarks, machine models, cache latency and hit rate, compilers, religious bias, general-purpose vs. special-purpose/scientific workloads, and C vs. Fortran.
The estimates are not monotonic with time.
Jouppi disagreed with Wall’s estimate of 7.
12
Superscalar Proposal
Go beyond the single-instruction pipeline and achieve IPC > 1:
Dispatch multiple instructions per cycle
Provide a more generally applicable form of concurrency (not just vectors)
Geared for sequential code that is hard to parallelize otherwise
Exploit fine-grained, instruction-level parallelism (ILP):
Not parallelism of degree 100 or 1000, but 2, 3, or 4
Fine-grained vs. medium-grained (loop iterations) vs. coarse-grained (threads)
13
Classifying ILP Machines
[Jouppi, DECWRL 1991]
Baseline scalar RISC:
Issue parallelism = IP = 1 instruction / cycle
Operation latency = OP = 1 cycle
Peak IPC = 1
IP = maximum instructions issued per cycle
Operation latency = number of cycles until a result is available
Issue latency = number of cycles between successive instruction issues
14
Classifying ILP Machines
[Jouppi, DECWRL 1991]
Superpipelined: cycle time = 1/m of the baseline
Issue parallelism = IP = 1 instruction / minor cycle
Operation latency = OP = m minor cycles
Peak IPC = m instructions / major cycle (m× speedup?)
Issues m times faster than the baseline
15
Classifying ILP Machines
[Jouppi, DECWRL 1991]
Superscalar:
Issue parallelism = IP = n instructions / cycle
Operation latency = OP = 1 cycle
Peak IPC = n instructions / cycle (n× speedup?)
Full replication of execution units => OK
Partial replication (specialized execution units) => must manage structural hazards
16
Classifying ILP Machines
[Jouppi, DECWRL 1991]
VLIW: Very Long Instruction Word
Issue parallelism = IP = n instructions / cycle
Operation latency = OP = 1 cycle
Peak IPC = n instructions / cycle = 1 VLIW / cycle
Early: Multiflow Trace (8 wide), later Cydrome (both folded)
Later: Philips Trimedia, for media/image processing, GPUs, etc.; just folded last year
Current: Intel IA-64
Characteristics: parallelism packaged by the compiler, hazards managed by the compiler, no (or few) hardware interlocks, low code density (NOPs), clean/regular hardware
17
Classifying ILP Machines
[Jouppi, DECWRL 1991]
Superpipelined-superscalar ("superduper"):
Issue parallelism = IP = n instructions / minor cycle
Operation latency = OP = m minor cycles
Peak IPC = n × m instructions / major cycle
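A tiny sketch (names are mine) summarizing peak IPC per major cycle across the four classes above; it also shows the equivalence claimed on the next slide, that n-wide superscalar and degree-m superpipelining peak at the same IPC when n = m:

```python
def peak_ipc(issue_width=1, superpipeline_degree=1):
    """Peak instructions per major (baseline) cycle: issue_width instructions
    can issue each minor cycle, and there are superpipeline_degree minor
    cycles per major cycle."""
    return issue_width * superpipeline_degree

n, m = 3, 3
print("baseline scalar RISC:      ", peak_ipc())       # 1
print("superpipelined (m):        ", peak_ipc(1, m))   # 3
print("superscalar (n):           ", peak_ipc(n, 1))   # 3
print("superpipelined-superscalar:", peak_ipc(n, m))   # 9
```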
18
Superscalar vs. Superpipelined
Roughly equivalent performance: if n = m, then both have about the same peak IPC.
Parallelism is exposed in space (superscalar width) vs. time (superpipelined stages); a given degree of concurrency can be achieved in either dimension.
19
Superpipelining
“Superpipelining is a new and special term meaning pipelining. The prefix is attached to increase the probability of funding for research proposals. There is no theoretical basis distinguishing superpipelining from pipelining. Etymology of the term is probably similar to the derivation of the now-common terms, methodology and functionality as pompous substitutes for method and function. The novelty of the term superpipelining lies in its reliance on a prefix rather than a suffix for the pompous extension of the root word.”
- Nick Tredennick, 1991
20
Superpipelining: Hype vs. Reality
21
Superscalar Challenges