Pipelining to Superscalar
Prof. Mikko H. Lipasti, University of Wisconsin-Madison
Lecture notes based on notes by John P. Shen; updated by Mikko Lipasti

Pipelining to Superscalar
Forecast:
– Limits of pipelining
– The case for superscalar
– Instruction-level parallel machines
– Superscalar pipeline organization
– Superscalar pipeline design

Limits of Pipelining
IBM RISC experience:
– Control and data dependences add 15% to the CPI
– Best-case CPI of 1.15, i.e. IPC of 0.87
– Deeper pipelines (higher frequency) magnify dependence penalties
This analysis assumes 100% cache hit rates:
– Hit rates approach 100% for some programs
– Many important programs have much worse hit rates
– More on caches later!
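Since IPC is just the reciprocal of CPI, the 0.87 figure follows directly; a one-line sanity check:

```python
cpi = 1.15                # best-case CPI from the IBM RISC study
print(round(1 / cpi, 2))  # 0.87 instructions per cycle (IPC)
```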

Processor Performance
In the 1980s (decade of pipelining):
– CPI: 5.0 => 1.15
In the 1990s (decade of superscalar):
– CPI: 1.15 => 0.5 (best case)
In the 2000s (decade of multicore):
– Marginal CPI improvement
Processor performance (the iron law):
Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
             = (code size) x (CPI) x (cycle time)
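A numeric sketch of the iron law; only the CPI values come from the slide, while the instruction count and clock rates below are assumptions for illustration:

```python
# Iron law: execution time = instructions x CPI x cycle time.
def exec_time(instructions, cpi, cycle_time_ns):
    return instructions * cpi * cycle_time_ns * 1e-9  # seconds

n = 100_000_000                       # hypothetical 100M-instruction program
t_1980s = exec_time(n, 5.0, 100.0)    # CPI 5.0 at an assumed 10 MHz clock
t_1990s = exec_time(n, 0.5, 5.0)      # CPI 0.5 at an assumed 200 MHz clock
print(t_1980s, t_1990s)               # 50.0 s vs 0.25 s: both factors improved
```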

Amdahl’s Law
h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f
Overall speedup: Speedup = 1 / (1 - f + f/v)
[Figure: execution time (normalized to 1) vs. number of processors N, divided into serial fraction h and vectorizable fraction f]
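A minimal sketch of this formula in Python (the values are chosen for illustration):

```python
# Amdahl's law for vectorization: fraction f speeds up by v, the rest is serial.
def amdahl_speedup(f, v):
    return 1.0 / ((1.0 - f) + f / v)

print(amdahl_speedup(0.8, 4))  # 2.5: a 4x vector unit yields only 2.5x overall
```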

Revisit Amdahl’s Law
Sequential bottleneck: even if v is infinite,
– Performance is limited by the nonvectorizable portion (1 - f):
lim (v -> inf) 1 / (1 - f + f/v) = 1 / (1 - f)
[Figure: same time-vs-processors plot, highlighting the serial fraction]
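Illustrating the bottleneck numerically, with f = 0.8 assumed:

```python
def amdahl_speedup(f, v):
    return 1.0 / ((1.0 - f) + f / v)

for v in (2, 10, 100, 1_000_000):
    print(v, round(amdahl_speedup(0.8, v), 2))
# 2 -> 1.67, 10 -> 3.57, 100 -> 4.81, 1000000 -> ~5.0
# Speedup is capped at 1/(1 - f) = 5 no matter how large v grows.
```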

Pipelined Performance Model
g = fraction of time pipeline is filled
1 - g = fraction of time pipeline is not filled (stalled)
For an N-stage pipeline, by analogy with Amdahl’s law: Speedup = 1 / (1 - g + g/N)
[Figure: time vs. pipeline depth N, divided into filled fraction g and stalled fraction 1 - g]
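A minimal sketch of the model, assuming the Amdahl-style speedup form given above:

```python
# Pipelined performance model: N-stage pipeline, filled a fraction g of the time.
def pipeline_speedup(g, n_stages):
    return 1.0 / ((1.0 - g) + g / n_stages)

print(round(pipeline_speedup(0.9, 10), 2))  # 5.26x, well short of the ideal 10x
```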

Tyranny of Amdahl’s Law [Bob Colwell]
– When g is even slightly below 100%, a big performance hit results
– Stalled cycles are the key adversary and must be minimized as much as possible
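A small illustration of the tyranny: holding pipeline depth at 20 (an assumed value), even a few percent of stalled cycles erodes much of the ideal speedup:

```python
def pipeline_speedup(g, n_stages):
    return 1.0 / ((1.0 - g) + g / n_stages)

for g in (1.0, 0.99, 0.95, 0.90):
    print(g, round(pipeline_speedup(g, 20), 1))
# 1.0 -> 20.0, 0.99 -> 16.8, 0.95 -> 10.3, 0.9 -> 6.9:
# a 1% stall fraction already forfeits 16% of the ideal 20x speedup.
```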

Motivation for Superscalar [Agerwala and Cocke]
Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, but s = 2 instead of s = 1 (scalar), where s is the speedup of the nonvectorizable code (e.g., from issuing multiple instructions per cycle)

Superscalar Proposal
Moderate the tyranny of Amdahl’s Law:
– Ease the sequential bottleneck
– More generally applicable
– Robust (less sensitive to f)
– Revised Amdahl’s Law: Speedup = 1 / ((1 - f)/s + f/N)
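A quick numeric check of the revised law against the figures quoted on the previous slide (the function name is just for illustration):

```python
# Revised Amdahl's law: the nonvectorizable fraction (1 - f) also runs
# s times faster (e.g., via multiple issue), not just the vector part.
def revised_speedup(f, n, s):
    return 1.0 / ((1.0 - f) / s + f / n)

print(round(revised_speedup(0.8, 6, 1), 1))  # 3.0 (scalar, s = 1)
print(round(revised_speedup(0.8, 6, 2), 1))  # 4.3 (s = 2)
```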

Limits on Instruction Level Parallelism (ILP)
Weiss and Smith [1984]       1.58
Sohi and Vajapeyam [1987]    1.81
Tjaden and Flynn [1970]      1.86 (Flynn’s bottleneck)
Tjaden and Flynn [1973]      1.96
Uht [1986]                   2.00
Smith et al. [1989]          2.00
Jouppi and Wall [1988]       2.40
Johnson [1991]               2.50
Acosta et al. [1986]         2.79
Wedig [1982]                 3.00
Butler et al. [1991]         5.8
Melvin and Patt [1991]       6
Wall [1991]                  7 (Jouppi disagreed)
Kuck et al. [1972]           8
Riseman and Foster [1972]    51 (no control dependences)
Nicolau and Fisher [1984]    90 (Fisher’s optimism)

Superscalar Proposal
– Go beyond the single-instruction pipeline, achieve IPC > 1
– Dispatch multiple instructions per cycle
– Provide a more generally applicable form of concurrency (not just vectors)
– Geared for sequential code that is hard to parallelize otherwise
– Exploit fine-grained or instruction-level parallelism (ILP)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Baseline scalar RISC:
– Issue parallelism = IP = 1
– Operation latency = OP = 1
– Peak IPC = 1

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined: cycle time = 1/m of baseline
– Issue parallelism = IP = 1 instruction / minor cycle
– Operation latency = OP = m minor cycles
– Peak IPC = m instructions / major cycle (m x speedup?)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superscalar:
– Issue parallelism = IP = n instructions / cycle
– Operation latency = OP = 1 cycle
– Peak IPC = n instructions / cycle (n x speedup?)

Classifying ILP Machines [Jouppi, DECWRL 1991]
VLIW: Very Long Instruction Word
– Issue parallelism = IP = n instructions / cycle
– Operation latency = OP = 1 cycle
– Peak IPC = n instructions / cycle = 1 VLIW / cycle

Classifying ILP Machines [Jouppi, DECWRL 1991]
Superpipelined-Superscalar:
– Issue parallelism = IP = n instructions / minor cycle
– Operation latency = OP = m minor cycles
– Peak IPC = n x m instructions / major cycle
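To tie the taxonomy together, a short sketch computing peak IPC per major (baseline) cycle from issue width n and superpipelining degree m; the function and the sample values are illustrative, not from Jouppi’s paper:

```python
# Peak IPC per major (baseline) cycle for Jouppi's ILP machine classes.
# A superpipelined machine issues 1 instruction per minor cycle, with m
# minor cycles per major cycle; a superscalar issues n per cycle.
def peak_ipc(n_issue, m_minor_cycles):
    return n_issue * m_minor_cycles

print(peak_ipc(1, 1))  # baseline scalar RISC: 1
print(peak_ipc(1, 3))  # superpipelined, m = 3: 3
print(peak_ipc(3, 1))  # superscalar (or VLIW), n = 3: 3
print(peak_ipc(3, 3))  # superpipelined-superscalar: 9
```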

Superscalar vs. Superpipelined
Roughly equivalent performance:
– If n = m, both have about the same peak IPC
– Parallelism is exposed in space vs. time

Superpipelining: Result Latency

Superscalar Challenges