CSCE 432/832 High Performance Processor Architectures: Scalar to Superscalar
Adapted from lecture notes based in part on slides created by Mikko H. Lipasti, John Shen, Mark Hill, David Wood, Guri Sohi, and Jim Smith

Forecast
- Real pipelines
- IBM RISC experience
- The case for superscalar
- Instruction-level parallel machines
- Superscalar pipeline organization
- Superscalar pipeline design

Processor Performance

    Processor Performance = Time / Program
                          = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
                          = (code size) x (CPI) x (cycle time)

- In the 1980s (decade of pipelining): CPI fell from 5.0 to 1.15
- In the 1990s (decade of superscalar): CPI fell from 1.15 to 0.5 (best case)
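A quick sanity check of the iron law above, holding instruction count and cycle time fixed. This is a minimal sketch: only the CPI values come from the slide; the instruction count and clock period are arbitrary placeholders.

```python
def exec_time(insts: float, cpi: float, cycle_ns: float) -> float:
    """Time/Program = Instructions x CPI x cycle time (here in ns)."""
    return insts * cpi * cycle_ns

scalar = exec_time(1e9, cpi=5.0, cycle_ns=10.0)   # pre-pipelining baseline
piped  = exec_time(1e9, cpi=1.15, cycle_ns=10.0)  # 1980s: pipelined
wide   = exec_time(1e9, cpi=0.5, cycle_ns=10.0)   # 1990s: superscalar, best case

print(scalar / piped)  # ~4.35x speedup from pipelining alone
print(piped / wide)    # ~2.3x more from superscalar issue
```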

Amdahl's Law
[Figure: execution timeline; a serial fraction runs on 1 processor, the parallel fraction runs on N processors.]
- h = fraction of time in serial code
- f = fraction that is vectorizable
- v = speedup for f
- Overall speedup: Speedup = 1 / ((1 - f) + f/v)
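A direct transcription of the formula (a minimal sketch; the helper name amdahl is ours, not the slide's):

```python
def amdahl(f: float, v: float) -> float:
    """Overall speedup when a fraction f of the work is sped up by factor v."""
    return 1.0 / ((1.0 - f) + f / v)

print(amdahl(f=0.8, v=6))  # ~3.0: 6-way parallelism on 80% of the work only triples performance
```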

Revisit Amdahl's Law
- Sequential bottleneck: even if v is infinite, performance is limited by the nonvectorizable portion (1 - f)
- Speedup -> 1 / (1 - f) as v -> infinity
[Figure: same execution timeline as on the previous slide.]
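The saturation is easy to see numerically, reusing the same model as the previous sketch:

```python
amdahl = lambda f, v: 1.0 / ((1.0 - f) + f / v)  # same model as above

for v in (2, 10, 100, 1_000_000):
    print(f"v={v:>9}: speedup = {amdahl(0.8, v):.3f}")
# Speedup creeps toward 1/(1 - 0.8) = 5 and can never exceed it.
```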


Pipelined Performance Model
[Figure: execution timeline; a pipeline of depth N is filled for fraction g of the time and stalled for the rest.]
- g = fraction of time pipeline is filled
- 1 - g = fraction of time pipeline is not filled (stalled)
- By analogy with Amdahl's Law: Speedup = 1 / ((1 - g) + g/N)
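A sketch of that model in code. Note the formula is the Amdahl-style analogy added above, with pipeline depth N playing the role of v; treat it as an assumption rather than something spelled out on the original slide.

```python
def pipeline_speedup(g: float, depth: int) -> float:
    """Speedup of a depth-stage pipeline that is filled for fraction g of the time."""
    return 1.0 / ((1.0 - g) + g / depth)

print(pipeline_speedup(g=1.0, depth=10))  # 10.0: ideal, pipeline always full
print(pipeline_speedup(g=0.9, depth=10))  # ~5.26: 10% stall cycles nearly halve the gain
```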

Pipelined Performance Model
- Tyranny of Amdahl's Law [Bob Colwell]: when g is even slightly below 100%, a big performance hit results
- Stalled cycles are the key adversary and must be minimized as much as possible
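Sweeping g near 1.0 makes Colwell's point concrete (under the same assumed model as the previous sketch):

```python
speedup = lambda g, depth=10: 1.0 / ((1.0 - g) + g / depth)

for g in (1.00, 0.99, 0.95, 0.90, 0.80):
    print(f"g = {g:.2f}: speedup = {speedup(g):.2f}")
# 10.00, 9.17, 6.90, 5.26, 3.57 -- a 1% stall rate already costs ~8% of peak.
```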

Motivation for Superscalar [Agerwala and Cocke]
- Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, but s = 2 instead of s = 1 (scalar)
[Figure: speedup curves, with the typical range of f marked.]

Superscalar Proposal
- Moderate the tyranny of Amdahl's Law
- Ease the sequential bottleneck
- More generally applicable
- Robust (less sensitive to f)
- Revised Amdahl's Law: Speedup = 1 / ((1 - f)/s + f/v), where s = speedup of the sequential (nonvectorizable) portion
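The revised formula reproduces the jump quoted two slides back (a minimal check; the formula itself is reconstructed from the s = 1 vs. s = 2 numbers, so treat it as an inference consistent with the slide rather than a quotation of it):

```python
def revised_amdahl(f: float, v: float, s: float) -> float:
    """Speedup when the vectorizable fraction f gains v and the rest gains s."""
    return 1.0 / ((1.0 - f) / s + f / v)

print(revised_amdahl(f=0.8, v=6, s=1))  # 3.0   (scalar baseline, s = 1)
print(revised_amdahl(f=0.8, v=6, s=2))  # ~4.29 (the quoted jump to 4.3)
```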

Limits on Instruction-Level Parallelism (ILP)
- Weiss and Smith [1984]: 1.58
- Sohi and Vajapeyam [1987]: 1.81
- Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck)
- Tjaden and Flynn [1973]: 1.96
- Uht [1986]: 2.00
- Smith et al. [1989]: 2.00
- Jouppi and Wall [1988]: 2.40
- Johnson [1991]: 2.50
- Acosta et al. [1986]: 2.79
- Wedig [1982]: 3.00
- Butler et al. [1991]: 5.8
- Melvin and Patt [1991]: 6
- Wall [1991]: 7 (Jouppi disagreed)
- Kuck et al. [1972]: 8
- Riseman and Foster [1972]: 51 (no control dependences)
- Nicolau and Fisher [1984]: 90 (Fisher's optimism)
Notes:
- Flynn's bottleneck: IPC < 2 because of branches and control flow
- Fisher's optimism: benchmarks were very loop-intensive scientific workloads; this optimism led to the Multiflow startup and its Trace VLIW machine
- Variance is due to: benchmarks, machine models, cache latency and hit rate, compilers, religious bias, general-purpose vs. special-purpose/scientific codes, C vs. Fortran
- The estimates are not monotonic with time

Superscalar Proposal
- Go beyond the single-instruction pipeline, achieve IPC > 1
- Dispatch multiple instructions per cycle
- Provide a more generally applicable form of concurrency (not just vectors)
- Geared for sequential code that is hard to parallelize otherwise
- Exploit fine-grained or instruction-level parallelism (ILP)
  - Not parallelism of degree 100 or 1000, but 2-4
  - Fine-grained vs. medium-grained (loop iterations) vs. coarse-grained (threads)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Baseline scalar RISC:
- Issue parallelism = IP = 1 instruction / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = 1
(IP = maximum instructions issued per cycle; OP = number of cycles until a result is available.)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined: cycle time = 1/m of baseline
- Issue parallelism = IP = 1 instruction / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = m instructions / major cycle (m x speedup?)
- Issues m times faster than the baseline

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superscalar:
- Issue parallelism = IP = n instructions / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = n instructions / cycle (n x speedup?)
- Full replication of execution units => OK; partial replication (specialized execution units) => must manage structural hazards

Classifying ILP Machines [Jouppi, DECWRL 1991]
VLIW: Very Long Instruction Word
- Issue parallelism = IP = n instructions / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = n instructions / cycle = 1 VLIW / cycle
History:
- Early: Multiflow Trace (8-wide), later Cydrome (both folded)
- Later: Philips Trimedia, for media/image processing, GPUs, etc. (just folded last year)
- Current: Intel IA-64
Characteristics: parallelism packaged by the compiler, hazards managed by the compiler, no (or few) hardware interlocks, low code density (NOPs), clean/regular hardware

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined-superscalar ("superduper"):
- Issue parallelism = IP = n instructions / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = n x m instructions / major cycle

Superscalar vs. Superpipelined
- Roughly equivalent performance: if n = m, both have about the same peak IPC
- Parallelism exposed in space (superscalar) vs. time (superpipelined): a given degree of concurrency can be achieved in either dimension, as the sketch below illustrates
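A toy illustration of the space-vs.-time equivalence. This is a minimal sketch under an idealized assumption: peak conditions, no stalls or dependences, full replication.

```python
def peak_ipc(per_issue: int, issues_per_major_cycle: int) -> int:
    """Instructions completed per baseline (major) cycle at peak."""
    return per_issue * issues_per_major_cycle

n = m = 4
superscalar    = peak_ipc(per_issue=n, issues_per_major_cycle=1)  # n at once: space
superpipelined = peak_ipc(per_issue=1, issues_per_major_cycle=m)  # 1 per minor cycle: time
print(superscalar, superpipelined)  # 4 4: equal peak IPC when n = m
```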

Superpipelining
"Superpipelining is a new and special term meaning pipelining. The prefix is attached to increase the probability of funding for research proposals. There is no theoretical basis distinguishing superpipelining from pipelining. Etymology of the term is probably similar to the derivation of the now-common terms, methodology and functionality as pompous substitutes for method and function. The novelty of the term superpipelining lies in its reliance on a prefix rather than a suffix for the pompous extension of the root word." - Nick Tredennick, 1991

Superpipelining: Hype vs. Reality

Superscalar Challenges