15-740/18-740 Computer Architecture Lecture 4: Pipelining


15-740/18-740 Computer Architecture Lecture 4: Pipelining Prof. Onur Mutlu Carnegie Mellon University

Last Time …
- Addressing modes
- Other ISA-level tradeoffs
- Programmer vs. microarchitect
- Virtual memory
- Unaligned access
- Transactional memory
- Control flow vs. data flow
- The Von Neumann Model
- The Performance Equation

Review: Other ISA-level Tradeoffs
- Load/store vs. memory/memory
- Condition codes vs. condition registers vs. compare&test
- Hardware interlocks vs. software-guaranteed interlocking
- VLIW vs. single instruction
- 0, 1, 2, 3 address machines
- Precise vs. imprecise exceptions
- Virtual memory vs. not
- Aligned vs. unaligned access
- Supported data types
- Software vs. hardware managed page fault handling
- Granularity of atomicity
- Cache coherence (hardware vs. software)
- …

Review: The Von Neumann Model
(Diagram: MEMORY, with Mem Addr Reg and Mem Data Reg; PROCESSING UNIT, with ALU and TEMP; CONTROL UNIT, with IP and Inst Register; INPUT; OUTPUT)

Review: The Von Neumann Model
- Stored program computer (instructions in memory)
- One instruction at a time
- Sequential execution
- Unified memory: the interpretation of a stored value depends on the control signals
All major ISAs today use this model.
Underneath (at the uarch level), the execution model is very different:
- Multiple instructions at a time
- Out-of-order execution
- Separate instruction and data caches

Review: Fundamentals of Uarch Performance Tradeoffs
The ideal case for each component:
- Instruction Supply: zero-cycle latency (no cache miss); no branch mispredicts; no fetch breaks
- Data Path (Functional Units): perfect data flow (reg/memory dependencies); zero-cycle interconnect (operand communication); enough functional units; zero-latency compute?
- Data Supply: zero-cycle latency; infinite capacity; zero cost
We will examine all of these throughout the course (especially data supply).

Review: How to Evaluate Performance Tradeoffs
Execution time = time/program = (# instructions/program) × (# cycles/instruction) × (time/cycle)
- # instructions/program is determined by: Algorithm, Program, ISA, Compiler
- # cycles/instruction is determined by: ISA, Microarchitecture
- time/cycle is determined by: Microarchitecture, Logic design, Circuit implementation, Technology
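
The equation above can be evaluated directly. The following sketch uses illustrative numbers (1 billion instructions, CPI of 2, a 1 GHz clock), not figures from the lecture:

```python
def execution_time(instructions, cpi, cycle_time_s):
    """time/program = (# instructions) x (cycles/instruction) x (time/cycle)"""
    return instructions * cpi * cycle_time_s

# Illustrative: 1 billion instructions, CPI of 2, 1 GHz clock (1 ns cycle)
print(execution_time(1e9, 2.0, 1e-9))  # 2.0 seconds
```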

Improving Performance (Reducing Exec Time)
- Reducing instructions/program: more efficient algorithms and programs; better ISA?
- Reducing cycles/instruction (CPI): better microarchitecture design; execute multiple instructions at the same time; reduce latency of instructions (1-cycle vs. 100-cycle memory access)
- Reducing time/cycle (clock period): technology scaling; pipelining

Other Performance Metrics: IPS
Machine A: 10 billion instructions per second. Machine B: 1 billion instructions per second. Which machine has higher performance?
Instructions Per Second (IPS, MIPS, BIPS):
IPS = (# of instructions / cycle) × (1 / cycle time)
How does this relate to execution time?
When is this a good metric for comparing two machines? Same instruction set, same binary (i.e., same compiler), same operating system.
Meaningless if "instruction count" does not correspond to "work" (e.g., some optimizations add instructions but do not change "work").
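
As a sketch of why IPS ranks machines only when they run the same binary, the numbers below are hypothetical: both machines are assumed to execute the same 1-billion-instruction program:

```python
def ips(ipc, frequency_hz):
    """Instructions per second = (instructions/cycle) x (cycles/second)."""
    return ipc * frequency_hz

def execution_time(instruction_count, ipc, frequency_hz):
    """Execution time follows from IPS only for a fixed instruction count."""
    return instruction_count / ips(ipc, frequency_hz)

n = 1e9  # hypothetical binary: 1 billion instructions on both machines
t_a = execution_time(n, ipc=2.0, frequency_hz=5e9)  # Machine A: 10 BIPS
t_b = execution_time(n, ipc=0.5, frequency_hz=2e9)  # Machine B: 1 BIPS
print(t_a, t_b)  # A finishes 10x sooner, but only because the work is identical
```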

Other Performance Metrics: FLOPS
Machine A: 10 billion FP instructions per second. Machine B: 1 billion FP instructions per second. Which machine has higher performance?
Floating Point Operations per Second (FLOPS, MFLOPS, GFLOPS):
- Popular in scientific computing
- FP operations used to be very slow (think Amdahl's law)
Why is this not a good metric?
- Ignores all other instructions (what if your program has 0 FP instructions?)
- Not all FP ops are the same

Other Performance Metrics: Perf/Frequency
SPEC/MHz
Remember: Performance = 1 / Execution time, where
Execution time = time/program = (# instructions/program) × (# cycles/instruction) × (time/cycle) = (# cycles/program) × (time/cycle)
What is wrong with comparing only "cycle count"? It unfairly penalizes machines with high frequency.
For machines of equal frequency, it fairly reflects performance, assuming an equal amount of "work" is done.
Fair if used to compare two different same-ISA processors on the same binaries.
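
A numeric sketch of the "cycle count alone is unfair" point, with made-up numbers:

```python
# Machine A: more cycles, but a much faster clock (illustrative numbers)
cycles_a, cycle_time_a = 2.0e9, 0.25e-9   # 4 GHz
cycles_b, cycle_time_b = 1.2e9, 1.00e-9   # 1 GHz

time_a = cycles_a * cycle_time_a  # 0.5 s
time_b = cycles_b * cycle_time_b  # 1.2 s

# By cycle count alone, B looks better; by execution time, A wins
print(cycles_a > cycles_b, time_a < time_b)  # True True
```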

An Example: Ronen et al., Proceedings of the IEEE, 2001

Amdahl’s Law: Bottleneck Analysis
Speedup = time_without_enhancement / time_with_enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S:
time_enhanced = time_original · (1 − f) + time_original · (f / S)
Speedup_overall = 1 / ((1 − f) + f/S)
(Diagram: time_original split into (1 − f) and f; time_enhanced split into (1 − f) and f/S)
Focus on bottlenecks with large f (and large S).
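
Amdahl's Law drops straight into code; the example fractions below are illustrative:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the task is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

print(amdahl_speedup(0.5, 10))            # ~1.82: far less than 10x
print(amdahl_speedup(0.5, float("inf")))  # 2.0: bounded by the untouched fraction
```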

Microarchitecture Design Principles
- Bread-and-butter design: spend time and resources where it matters (i.e., improving what the machine is designed to do); common case vs. uncommon case
- Balanced design: balance instruction/data flow through uarch components; design to eliminate bottlenecks
- Critical path design: find the maximum-delay path and decrease its delay; break a path into multiple cycles?

Cycle Time (Frequency) vs. CPI (IPC)
Usually at odds with each other. Why?
- Memory access latency: increased frequency increases the number of cycles it takes to access main memory
- Pipelining: a deeper pipeline increases frequency, but also increases the "stall" cycles: data dependency stalls, control dependency stalls, resource contention stalls
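
A sketch of this tension with assumed numbers: doubling pipeline depth here doubles frequency but raises CPI from 1.0 to 1.4 due to stalls, so the net speedup is well under 2x:

```python
instructions = 1e9  # illustrative program size

# Shallow pipeline: CPI 1.0 at 1 GHz (1 ns cycle)
shallow_time = instructions * 1.0 * 1.0e-9
# Deeper pipeline: frequency doubles (0.5 ns cycle), but stalls raise CPI to 1.4
deep_time = instructions * 1.4 * 0.5e-9

print(shallow_time / deep_time)  # ~1.43x: less than the 2x frequency alone suggests
```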

Intro to Pipelining (I)
- Single-cycle machines: each instruction executed in one cycle; the slowest instruction determines cycle time
- Multi-cycle machines: instruction execution divided into multiple cycles (fetch, decode, eval addr, fetch operands, execute, store result); advantage: the slowest "stage" determines cycle time
- Microcoded machines: a microinstruction is the set of control signals for the current cycle; microcode is the set of all microinstructions needed to implement instructions; translates each instruction into a set of microinstructions
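
The single-cycle vs. multi-cycle cycle-time tradeoff can be sketched with hypothetical stage latencies (the stage names and numbers below are assumptions, not lecture figures):

```python
# Hypothetical stage latencies in ns
stages = {"fetch": 200, "decode": 100, "execute": 200, "memory": 350, "writeback": 100}

# Single-cycle: every instruction pays for the full (slowest) path
single_cycle_time_ns = sum(stages.values())
# Multi-cycle: each cycle pays only for the slowest stage
multi_cycle_time_ns = max(stages.values())

print(single_cycle_time_ns, multi_cycle_time_ns)  # 950 350
```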

Microcoded Execution of an ADD
ADD DR ← SR1, SR2
Fetch: MAR ← IP; MDR ← MEM[MAR]; IR ← MDR
Decode: Control Signals ← DecodeLogic(IR)
Execute: TEMP ← SR1 + SR2
Store result (Writeback): DR ← TEMP; IP ← IP + 4
(Diagram: MEMORY, with Mem Addr Reg and Mem Data Reg; DATAPATH, with ALU and GP Registers; CONTROL UNIT, with Inst Pointer and Inst Register. "What if this is SLOW?" points at the memory access.)
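
The microcode steps above can be sketched as straight-line Python; the register and memory structures are simplified assumptions, not the lecture's datapath:

```python
def microcoded_add(mem, regs, ip, dr, sr1, sr2):
    # Fetch: bring the instruction at IP into the instruction register
    mar = ip
    mdr = mem[mar]
    ir = mdr  # real decode would derive control signals from IR; elided here
    # Execute: compute into a temporary
    temp = regs[sr1] + regs[sr2]
    # Store result (writeback) and advance IP
    regs[dr] = temp
    return ip + 4

regs = {"R1": 3, "R2": 4, "R3": 0}
mem = {0: "ADD R3 <- R1, R2"}  # placeholder encoding, not real machine code
new_ip = microcoded_add(mem, regs, ip=0, dr="R3", sr1="R1", sr2="R2")
print(regs["R3"], new_ip)  # 7 4
```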

Intro to Pipelining (II)
In the microcoded machine, some resources are idle in different stages of instruction processing (e.g., fetch logic is idle when the ADD is being decoded or executed).
Pipelined machines: use idle resources to process other instructions; each stage processes a different instruction (when decoding the ADD, fetch the next instruction). Think "assembly line."
Pipelined vs. multi-cycle machines: advantage: improves instruction throughput (reduces CPI); disadvantage: requires more logic, higher power consumption.

A Simple Pipeline

Execution of Four Independent ADDs
Multi-cycle: 4 cycles per instruction
  F D E W
          F D E W
                  F D E W
                          F D E W
  ----------------------- Time ----------------------->
Pipelined: 4 cycles per 4 instructions (steady state)
  F D E W
    F D E W
      F D E W
        F D E W
  --------- Time --------->
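
The cycle counts in the diagrams follow a simple formula; a sketch for an n-instruction, k-stage pipeline:

```python
def multicycle_cycles(n_instructions, n_stages=4):
    """Each instruction occupies the machine for all of its stages."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=4):
    """The first instruction fills the pipe; the rest finish one per cycle."""
    return n_stages + (n_instructions - 1)

print(multicycle_cycles(4))  # 16 cycles for four ADDs
print(pipelined_cycles(4))   # 7 cycles for four ADDs (steady state: ~1 CPI)
```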

Issues in Pipelining: Increased CPI
Data dependency stall: what if the next ADD is dependent?
  ADD R3 ← R1, R2
  ADD R4 ← R3, R7
Solution: data forwarding. Can this always work? How about memory operations? Cache misses?
If data is not available by the time it is needed: STALL
What if the pipeline was like this (F D E M W)?
  LD  R3 ← R2(0)
  ADD R4 ← R3, R7
R3 cannot be forwarded until it is read from memory.
Is there a way to make the ADD not stall?
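
A minimal sketch of the load-use point: with full forwarding, an ALU producer causes no stall, but a load producer still costs one stall cycle in an F-D-E-M-W pipeline (the instruction encoding below is a simplification):

```python
def stall_cycles(producer, consumer_sources):
    """Stalls the consumer needs, assuming full forwarding of ALU results.

    producer is (dest_register, is_load); a load's value arrives only after
    the M stage, one cycle too late for the next instruction's E stage.
    """
    dest, is_load = producer
    if dest in consumer_sources:
        return 1 if is_load else 0
    return 0

add_r3 = ("R3", False)         # ADD R3 <- R1, R2
ld_r3 = ("R3", True)           # LD  R3 <- R2(0)
add_r4_sources = ("R3", "R7")  # ADD R4 <- R3, R7

print(stall_cycles(add_r3, add_r4_sources))  # 0: result forwarded E -> E
print(stall_cycles(ld_r3, add_r4_sources))   # 1: load-use stall
```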

Implementing Stalling
Hardware-based interlocking
Common way: a scoreboard, i.e., a valid bit associated with each register in the register file
Valid bits are also associated with each forwarding/bypass path.
(Diagram: Instruction Cache feeding a Register File connected to multiple Func Units)
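
A minimal valid-bit scoreboard sketch, assuming a simplified register file (not a full scoreboard as in the CDC 6600):

```python
class Scoreboard:
    """Valid-bit scoreboard: a register is invalid while an in-flight
    instruction has yet to write it back."""

    def __init__(self, registers):
        self.valid = {r: True for r in registers}

    def can_issue(self, sources):
        return all(self.valid[r] for r in sources)

    def issue(self, dest):
        self.valid[dest] = False  # result not yet available

    def writeback(self, dest):
        self.valid[dest] = True

sb = Scoreboard(["R1", "R2", "R3", "R4", "R7"])
sb.issue("R3")                     # ADD R3 <- R1, R2 enters the pipeline
print(sb.can_issue(["R3", "R7"]))  # False: dependent ADD R4 <- R3, R7 stalls
sb.writeback("R3")
print(sb.can_issue(["R3", "R7"]))  # True: safe to issue now
```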

Data Dependency Types
Types of data-related dependencies:
- Flow dependency (true data dependency; read after write)
- Output dependency (write after write)
- Anti dependency (write after read)
Which ones cause stalls in a pipelined machine? It depends on the pipeline design.
In our simple strictly-4-stage pipeline, only flow dependencies cause stalls.
What if instructions completed out of program order?
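
The three dependency types can be detected mechanically; a sketch where each instruction is a (destination, sources) pair, given in program order:

```python
def dependencies(first, second):
    """Classify dependencies from instruction `first` to later `second`."""
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = []
    if dest1 in srcs2:
        found.append("flow (RAW)")    # second reads what first writes
    if dest1 == dest2:
        found.append("output (WAW)")  # both write the same register
    if dest2 in srcs1:
        found.append("anti (WAR)")    # second overwrites what first reads
    return found

i1 = ("R3", {"R1", "R2"})  # ADD R3 <- R1, R2
i2 = ("R1", {"R3", "R7"})  # ADD R1 <- R3, R7
print(dependencies(i1, i2))  # ['flow (RAW)', 'anti (WAR)']
```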

Issues in Pipelining: Increased CPI
Control dependency stall: what to fetch next?
  BEQ R1, R2, TARGET
Solution: predict which instruction comes next. What if the prediction is wrong?
Another solution: hardware-based fine-grained multithreading; can tolerate both data and control dependencies.
Read: James Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
Read: Burton Smith, “A Pipelined, Shared Resource MIMD Computer,” ICPP 1978.
(Diagram: BEQ goes F D E W; the next instruction’s fetch repeats until the branch resolves, then it proceeds D E W.)
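
A back-of-the-envelope sketch of the branch penalty: if the branch resolves in stage E and fetch is stage F, the wasted fetch cycles per misprediction equal the distance between those two stages (the stage indices below are assumptions for the 4-stage pipeline on the slide):

```python
def branch_penalty(resolve_stage, fetch_stage=0):
    """Fetched-then-squashed cycles per mispredicted branch (assumed model)."""
    return resolve_stage - fetch_stage

# Stage indices F=0, D=1, E=2, W=3; assume the branch resolves in E
print(branch_penalty(2))  # 2 wasted fetch cycles per misprediction
```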