15-740/ Computer Architecture Lecture 7: Pipelining

Slides:

Advertisements

Similar presentations

Computer Organization and Architecture

Advertisements

Fall 2012SYSC 5704: Elements of Computer Systems 1 MicroArchitecture Murdocca, Chapter 5 (selected parts) How to read Chapter 5.

EEM 486 EEM 486: Computer Architecture Lecture 4 Designing a Multicycle Processor.

Instructor: Senior Lecturer SOE Dan Garcia CS 61C: Great Ideas in Computer Architecture Pipelining Hazards 1.

S. Barua – CPSC 440 CHAPTER 6 ENHANCING PERFORMANCE WITH PIPELINING This chapter presents pipelining.

CS-447– Computer Architecture Lecture 12 Multiple Cycle Datapath

Computer ArchitectureFall 2007 © October 24nd, 2007 Majd F. Sakr CS-447– Computer Architecture.

Computer ArchitectureFall 2007 © October 3rd, 2007 Majd F. Sakr CS-447– Computer Architecture.

Computer ArchitectureFall 2007 © October 22nd, 2007 Majd F. Sakr CS-447– Computer Architecture.

DLX Instruction Format

Computer ArchitectureFall 2007 © October 31, CS-447– Computer Architecture M,W 10-11:20am Lecture 17 Review.

Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.

Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

TDC 311 The Microarchitecture. Introduction As mentioned earlier in the class, one Java statement generates multiple machine code statements Then one.

1 Pipelining Reconsider the data path we just did Each instruction takes from 3 to 5 clock cycles However, there are parts of hardware that are idle many.

Chapter 2 Summary Classification of architectures Features that are relatively independent of instruction sets “Different” Processors –DSP and media processors.

CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.

ECE 252 / CPS 220 Pipelining Professor Alvin R. Lebeck Compsci 220 / ECE 252 Fall 2008.

CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.

Computer Architecture CPSC 350

1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.

ECE3056A Architecture, Concurrency, and Energy Lecture: Pipelined Microarchitectures Prof. Moinuddin Qureshi Slides adapted from: Prof. Mutlu (CMU)

1 Pipelining Part I CS What is Pipelining? Like an Automobile Assembly Line for Instructions –Each step does a little job of processing the instruction.

Computer Architecture: Out-of-Order Execution

Microarchitecture. Outline Architecture vs. Microarchitecture Components MIPS Datapath 1.

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and

Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.

Instructor: Senior Lecturer SOE Dan Garcia CS 61C: Great Ideas in Computer Architecture Pipelining Hazards 1.

Computer Organization and Design Pipelining Montek Singh Dec 2, 2015 Lecture 16 (SELF STUDY – not covered on the final exam)

11 Pipelining Kosarev Nikolay MIPT Oct, Pipelining Implementation technique whereby multiple instructions are overlapped in execution Each pipeline.

Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.

Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.

Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 10 Computer Hardware Design (Pipeline Datapath and Control Design) Prof. Dr.

Computer Architecture Lecture 7: Microprogrammed Microarchitectures Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/30/2013.

Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 1/29/2014.

PipeliningPipelining Computer Architecture (Fall 2006)

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

15-740/ Computer Architecture Lecture 7: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University.

11/4/2017 CIS-550 Advanced Computer Architecture Lecture 6: Multi-Cycle and Microprogrammed Microarchitectures Dr. Muhammad Abid DCIS, PIEAS Spring 2017.

Design of Digital Circuits Lecture 14: Microprogramming

Design of Digital Circuits Lecture 13: Multi-Cycle Microarch.

Computer Organization

15-740/ Computer Architecture Lecture 4: Pipelining

15-740/ Computer Architecture Lecture 3: Performance

CIS-550 Advanced Computer Architecture Lecture 7: Pipelining

ECE 4100/6100 Advanced Computer Architecture Lecture 1 Performance

Precise Exceptions and Out-of-Order Execution

Design of Digital Circuits Lecture 18: Out-of-Order Execution

Morgan Kaufmann Publishers

Performance of Single-cycle Design

CIS-550 Advanced Computer Architecture Lecture 10: Precise Exceptions

Instant replay The semester was split into roughly four parts.

Chapter 14 Instruction Level Parallelism and Superscalar Processors

Computer Architecture CSCE 350

Chapter 4 The Processor Part 2

Instruction Level Parallelism and Superscalar Processors

Lecture 5: Pipelining Basics

Serial versus Pipelined Execution

15-740/ Computer Architecture Lecture 5: Precise Exceptions

The Processor Lecture 3.4: Pipelining Datapath and Control

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011

Designing a Pipelined CPU

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution

Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture

Morgan Kaufmann Publishers The Processor

Design of Digital Circuits Lecture 19a: VLIW

Guest Lecturer: Justin Hsia

Presentation transcript:

15-740/18-740 Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011

Review of Last Lecture More ISA Tradeoffs Von Neumann Model Programmer vs. microarchitect Transactional memory Dataflow Von Neumann Model Performance metrics Performance equation Ways of improving performance IPS, FLOPS, Perf/Frequency

Today More performance metrics Issues in pipelining Precise exceptions

Papers to Read and Review Review Set 5 – due October 3 Monday Smith and Plezskun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Transactions on Computers 1988 (earlier version: ISCA 1985). Sprangle and Carmean, “Increasing Processor Performance by Implementing Deeper Pipelines,” ISCA 2002. Review Set 6 – due October 7 Friday Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.

Review: The Performance Equation time program Execution time = # instructions program # cycles instruction time cycle = X X Algorithm Program ISA Compiler Microarchitecture Logic design Circuit implementation Technology ISA Microarchitecture

Review: Other Performance Metrics: Perf/Frequency SPEC/MHz Remember Performance/Frequency What is wrong with comparing only “cycle count”? Unfairly penalizes machines with high frequency For machines of equal frequency, fairly reflects performance assuming equal amount of “work” is done Fair if used to compare two different same-ISA processors on the same binaries time program 1 Performance Execution time = = time cycle = # instructions program # cycles instruction time cycle X X # cycles program = 1 / { }

Review: An Example Use of Perf/Frequency Metric Ronen et al, IEEE Proceedings 2001

Other Potential Performance Metrics Memory bandwidth Sustained vs. peak Compute/bandwidth balance Arithmetic intensity Energy consumption metrics

Amdahl’s Law: Bottleneck Analysis Speedup= timewithout enhancement / timewith enhancement Suppose an enhancement speeds up a fraction f of a task by a factor of S timeenhanced = timeoriginal·(1-f) + timeoriginal·(f/S) Speedupoverall = 1 / ( (1-f) + f/S ) f (1 - f) timeoriginal (1 - f) timeenhanced f/S Focus on bottlenecks with large f (and large S)

Microarchitecture Design Principles Bread and butter design Spend time and resources on where it matters (i.e. improving what the machine is designed to do) Common case vs. uncommon case Balanced design Balance instruction/data flow through uarch components Design to eliminate bottlenecks Critical path design Find the maximum speed path and decrease it Break a path into multiple cycles?

Cycle Time (Frequency) vs. CPI (IPC) Usually at odds with each other Why? Memory access latency: Increased frequency increases the number of cycles it takes to access main memory Pipelining: A deeper pipeline increases frequency, but also increases the “stall” cycles: Data dependency stalls Control dependency stalls Resource contention stalls Average CPI (IPC) affected by exploitation of instruction-level parallelism. Many techniques devised for this.

Intro to Pipelining (I) Single-cycle machines Each instruction executed in one cycle The slowest instruction determines cycle time Multi-cycle machines Instruction execution divided into multiple cycles Fetch, decode, eval addr, fetch operands, execute, store result Advantage: the slowest “stage” determines cycle time Microcoded machines Microinstruction: Control signals for the current cycle Microcode: Set of all microinstructions needed to implement instructions  Translates each instruction into a set of microinstructions Wilkes, “The Best Way to Design an Automatic Calculating Machine,” Manchester Univ. Computer Inaugural Conf., 1951.

Microcoded Execution of an ADD ADD DR  SR1, SR2 Fetch: MAR  IP MDR  MEM[MAR] IR  MDR Decode: Control Signals  DecodeLogic(IR) Execute: TEMP  SR1 + SR2 Store result (Writeback): DR  TEMP IP  IP + 4 MEMORY Mem Addr Reg What if this is SLOW? Mem Data Reg DATAPATH ALU GP Registers Control Signals CONTROL UNIT Inst Pointer Inst Register

Intro to Pipelining (II) In the microcoded machine, some resources are idle in different stages of instruction processing Fetch logic is idle when ADD is being decoded or executed Pipelined machines Use idle resources to process other instructions Each stage processes a different instruction When decoding the ADD, fetch the next instruction Think “assembly line” Pipelined vs. multi-cycle machines Advantage: Improves instruction throughput (reduces CPI) Disadvantage: Requires more logic, higher power consumption

A Simple Pipeline

Execution of Four Independent ADDs Multi-cycle: 4 cycles per instruction Pipelined: 4 cycles per 4 instructions (steady state) F D E W F D E W F D E W F D E W Time F D E W F D E W F D E W F D E W Time

Issues in Pipelining: Increased CPI Data dependency stall: what if the next ADD is dependent Solution: data forwarding. Can this always work? How about memory operations? Cache misses? If data is not available by the time it is needed: STALL What if the pipeline was like this? R3 cannot be forwarded until read from memory Is there a way to make ADD not stall? F D E W ADD R3  R1, R2 ADD R4  R3, R7 W LD R3  R2(0) ADD R4  R3, R7 F D E M W F D E E M W

Implementing Stalling Hardware based interlocking Common way: scoreboard i.e. valid bit associated with each register in the register file Valid bits also associated with each forwarding/bypass path Func Unit Register File Instruction Cache Func Unit Func Unit

Data Dependency Types Types of data-related dependencies Flow dependency (true data dependency – read after write) Output dependency (write after write) Anti dependency (write after read) Which ones cause stalls in a pipelined machine? Answer: It depends on the pipeline design In our simple strictly-4-stage pipeline, only flow dependencies cause stalls What if instructions completed out of program order?