Computer Organization and Design Pipelining

Presentation transcript:

Computer Organization and Design: Pipelining
Montek Singh
Wed, Dec 5, 2012
Lecture 18

Pipelining (Read Chapter 4.5-4.8)
"Between 411 problem sets, I haven't had a minute to do laundry." "Now that's what I call dirty laundry!"

Laundry Example
INPUT: dirty laundry. OUTPUT: 4 more weeks (of clean laundry).
Device: Washer. Function: Fill, Agitate, Spin. WasherPD = 30 mins.
Device: Dryer. Function: Heat, Spin. DryerPD = 60 mins.

Laundry: One Load at a Time
Everyone knows that the real reason UNC students put off doing laundry so long is not that they procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart.
Step 1: wash. Step 2: dry.
Total = WasherPD + DryerPD = 30 + 60 = 90 mins

Laundry: Doing N Loads!
Here's how they do laundry at Duke, the "unpipelined" way: wash load 1, dry load 1, then wash load 2, dry load 2, and so on, one load start-to-finish at a time.
Total = N*(WasherPD + DryerPD) = N*90 mins

Laundry: Doing N Loads!
UNC students "pipeline" the laundry process: while load i dries, load i+1 is already in the washer. That's why we wait!
Total = N * Max(WasherPD, DryerPD) = N*60 mins
Actually, it's more like N*60 + 30 if we account for the startup time (i.e., filling up the pipeline) correctly. When doing pipeline analysis, we're mostly interested in the "steady state", where we assume we have an infinite supply of inputs.

Recall Our Performance Measures
Latency: delay from input to corresponding output.
  Duke Laundry = 90 mins
  UNC Laundry = 120 mins (assuming the wash is started as soon as possible and waits, wet, in the washer until the dryer is available)
Throughput: rate at which inputs or outputs are processed.
  Duke Laundry = 1/90 outputs/min
  UNC Laundry = 1/60 outputs/min
Even though pipelining increases latency, it takes less time per load.
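To make the arithmetic above concrete, here is a minimal Python sketch (my own illustration, not from the slides; the helper name laundry_times is made up) that computes the totals, latencies, and throughputs for the serial (Duke) and pipelined (UNC) schemes:

    # Minimal sketch, assuming WasherPD = 30 mins and DryerPD = 60 mins as above.
    def laundry_times(n_loads, washer=30, dryer=60):
        return {
            "serial_total": n_loads * (washer + dryer),          # N*90 mins
            "pipelined_total": (washer + dryer) + (n_loads - 1) * max(washer, dryer),  # N*60 + 30 mins
            "serial_latency": washer + dryer,                    # 90 mins per load
            "pipelined_latency": 2 * max(washer, dryer),         # 120 mins in steady state (wet load waits in the washer)
            "serial_throughput": 1 / (washer + dryer),           # 1/90 loads per minute
            "pipelined_throughput": 1 / max(washer, dryer),      # 1/60 loads per minute
        }

    print(laundry_times(4))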

Okay, Back to Circuits…
[Figure: a combinational circuit in which blocks F and G both take input X, and block H combines F(X) and G(X) to produce P(X).]
For combinational logic: latency = tPD, throughput = 1/tPD. We can't get the answer faster, but are we making effective use of our hardware at all times? F and G are "idle", just holding their outputs stable while H performs its computation.

Pipelined Circuits: use registers to hold H's input stable!
[Figure: the same circuit with registers inserted after F, G, and H.]
Now F and G can be working on input Xi+1 while H is performing its computation on Xi. We've created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.
Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):

               unpipelined    2-stage pipeline
  latency      45 ns          50 ns (worse)
  throughput   1/45 per ns    1/25 per ns (better)

Pipelining uses registers to improve the throughput of combinational circuits.
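As a sanity check on those numbers, here is a small Python sketch (mine, not part of the slides) that recomputes them from the stage delays:

    # F and G run in parallel in stage 1; H consumes their outputs in stage 2.
    t_F, t_G, t_H = 15, 20, 25   # propagation delays in ns

    # Unpipelined: one long combinational path from X to P(X).
    unpipelined_latency = max(t_F, t_G) + t_H        # 45 ns
    unpipelined_throughput = 1 / unpipelined_latency # 1/45 results per ns

    # 2-stage pipeline with ideal (zero-delay) registers.
    clock_period = max(max(t_F, t_G), t_H)           # 25 ns: the slowest stage sets the clock
    pipelined_latency = 2 * clock_period             # 50 ns (worse)
    pipelined_throughput = 1 / clock_period          # 1/25 results per ns (better)

    print(unpipelined_latency, pipelined_latency)    # 45 50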

Pipeline Diagrams

  Clock cycle   i     i+1      i+2        i+3        …
  Input         Xi    Xi+1     Xi+2       Xi+3       …
  F Reg               F(Xi)    F(Xi+1)    F(Xi+2)    …
  G Reg               G(Xi)    G(Xi+1)    G(Xi+2)    …
  H Reg                        H(Xi)      H(Xi+1)    H(Xi+2)

This is an example of parallelism: at any instant we are computing 2 results. The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.
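To experiment with diagrams like this, here is a small Python sketch (my illustration; the helper pipeline_diagram and its stage labels are made up) that prints which input each stage register of the 2-stage F/G/H pipeline holds in each cycle:

    # Hypothetical helper: print which input each pipeline stage register holds per cycle.
    def pipeline_diagram(n_inputs, stages=("F/G Reg", "H Reg")):
        rows = {name: [] for name in stages}
        for cycle in range(n_inputs + len(stages)):
            for depth, name in enumerate(stages):
                i = cycle - depth - 1          # index of the input held by this stage register
                rows[name].append(f"X{i}" if 0 <= i < n_inputs else "-")
        print("cycle     " + "  ".join(f"{c:>4}" for c in range(n_inputs + len(stages))))
        for name, cells in rows.items():
            print(f"{name:<9} " + "  ".join(f"{c:>4}" for c in cells))

    pipeline_diagram(4)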

Pipelining Summary
Advantages:
  Higher throughput than a combinational system
  Different parts of the logic work on different parts of the problem
Disadvantages:
  Generally increases latency
  Only as good as the *weakest* link (often called the pipeline's BOTTLENECK)

Review of CPU Performance
MIPS = Millions of Instructions per Second
MIPS = Freq / CPI, where Freq = clock frequency (MHz) and CPI = clocks per instruction.
To increase MIPS:
1. DECREASE CPI.
   - RISC simplicity reduces CPI to 1.0.
   - CPI below 1.0? State-of-the-art: multiple instruction issue.
2. INCREASE Freq.
   - Freq is limited by the delay along the longest combinational path; hence
   - PIPELINING is the key to improving performance.
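As a quick worked example (the clock rates and CPIs below are my own made-up numbers), the formula in Python:

    # MIPS = Freq / CPI, with Freq in MHz (millions of cycles per second).
    def mips(freq_mhz, cpi):
        return freq_mhz / cpi

    print(mips(freq_mhz=500, cpi=1.0))   # 500 MIPS: ideal pipelined RISC, CPI = 1
    print(mips(freq_mhz=500, cpi=4.0))   # 125 MIPS: same clock, multicycle unpipelined design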

Where Are the Bottlenecks?
Pipelining goal: break LONG combinational paths by putting the memories and the ALU in separate stages.
[Figure: the single-cycle miniMIPS datapath (PC, instruction memory, register file, ALU, data memory, and control logic), whose longest combinational path limits the clock.]

Goal: 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include the slowest components (memories, register file, ALU).
APPROACH: structure the processor as a 5-stage pipeline: IF, ID/RF, ALU, MEM, WB.
  IF     Instruction Fetch stage: maintains the PC, fetches one instruction per cycle and passes it to ...
  ID/RF  Instruction Decode/Register File stage: decodes control lines and selects source operands, passing them to ...
  ALU    ALU stage: performs the specified operation and passes the result to ...
  MEM    Memory stage: if it's a lw, uses the ALU result as an address and passes memory data (or the ALU result if not a lw) to ...
  WB     Write-Back stage: writes the result back into the register file.

5-Stage miniMIPS
[Figure: the pipelined miniMIPS datapath, with pipeline registers (PC/IR and friends) separating the Instruction Fetch, Register File, ALU, Memory, and Write Back stages; some details are omitted.]
The memory address is available right after the instruction enters the Memory stage; the data to be written is needed just before the rising clock edge at the end of the Write Back stage.

Pipelining
Improve performance by increasing instruction throughput. The ideal speedup is the number of stages in the pipeline. Do we achieve this?
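To see how close we get to that ideal in the best case, here is a small Python sketch (my illustration, not from the slides) that ignores hazards and assumes every stage takes exactly one clock:

    # Assumes n_stages equal-length pipeline stages and no stalls (the ideal case).
    def pipeline_speedup(n_instructions, n_stages):
        unpipelined_cycles = n_instructions * n_stages        # each instruction runs start-to-finish
        pipelined_cycles = n_stages + (n_instructions - 1)    # fill the pipe, then finish 1 per cycle
        return unpipelined_cycles / pipelined_cycles

    print(pipeline_speedup(4, 5))      # 2.5x for only 4 instructions
    print(pipeline_speedup(1000, 5))   # ~4.98x, approaching the ideal 5x for large N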

Pipelining
What makes it easy:
  all instructions are the same length
  just a few instruction formats
  memory operands appear only in loads and stores
What makes it hard:
  structural hazards: suppose we had only one memory
  control hazards: need to worry about branch instructions
  data hazards: an instruction depends on a previous instruction
Net effect:
  Individual instructions still take the same number of cycles
  But throughput is improved by increasing the number of simultaneously executing instructions

Data Hazards
The problem with starting the next instruction before the first is finished: dependencies that "go backward in time" are data hazards.
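For illustration only (this little checker and its window parameter are my own sketch, not the slides' hardware), here is one way to spot those backward-in-time dependences in the five-instruction example used on the next slide, treating a dependence on either of the next two instructions as a hazard (the usual 5-stage, no-forwarding picture):

    # Sketch: find data dependences in a short program of (dest, src1, src2) tuples.
    def data_hazards(instrs, window=2):
        hazards = []
        for i, (dest, _, _) in enumerate(instrs):
            for j in range(i + 1, min(i + 1 + window, len(instrs))):
                _, s1, s2 = instrs[j]
                if dest is not None and dest in (s1, s2):
                    hazards.append((i, j, dest))   # instruction j reads what i produces
        return hazards

    prog = [("$2",  "$1",  "$3"),   # sub $2, $1, $3   (produces $2)
            ("$12", "$2",  "$5"),   # and $12, $2, $5  (consumes $2 -> hazard)
            ("$13", "$6",  "$2"),   # or  $13, $6, $2  (consumes $2 -> hazard)
            ("$14", "$2",  "$2"),   # add $14, $2, $2  (far enough behind by now)
            (None,  "$15", "$2")]   # sw  $15, 100($2)
    print(data_hazards(prog))       # [(0, 1, '$2'), (0, 2, '$2')]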

Software Solution
Have the compiler guarantee no hazards. Where do we insert the "nops"? Between "producing" and "consuming" instructions! In the sequence below, the sub produces $2 and the and/or consume it; with no forwarding (and a register file that writes in the first half of a cycle and reads in the second), two nops are enough:

  sub $2, $1, $3
  nop
  nop
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Problem: this really slows us down!

Forwarding
Bypass/forward results as soon as they are produced/needed. Don't wait for them to be written back into registers!
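As a sketch of the idea (these are the standard textbook-style forwarding conditions, written by me in Python rather than taken from the slides' datapath), the forwarding control for one ALU operand boils down to a couple of register-number comparisons against the pipeline registers:

    # Sketch of forwarding control for the ALU's first operand (rs).
    # Registers are identified by number; $0 ($zero) is never forwarded.
    def forward_a(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, id_ex_rs):
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
            return "EX/MEM"   # take the ALU result computed last cycle
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
            return "MEM/WB"   # take the value about to be written back
        return "REGFILE"      # no hazard: use the value read from the register file

    # sub $2,$1,$3 followed immediately by and $12,$2,$5:
    print(forward_a(True, 2, False, 0, 2))   # -> "EX/MEM"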

Can't Always Forward
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register. STALL!

Stalling
When needed, stall the pipeline by keeping an instruction in the same stage for an extra clock cycle.
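Putting the last two slides together, here is a minimal Python sketch (mine, following the usual load-use hazard check rather than anything specific to these slides) of when forwarding cannot help and a one-cycle stall is required:

    # Load-use hazard: the instruction in ID needs a register that the lw ahead of it
    # will only have at the end of its MEM stage, so the pipeline must stall one cycle.
    def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
        return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

    # lw $2, 20($1) followed immediately by and $4, $2, $5:
    if must_stall(True, 2, 2, 5):
        # Keep the IF/ID instruction where it is for one extra cycle and send a
        # bubble (nop) down the pipe behind the lw; then forwarding can finish the job.
        print("stall one cycle, then forward from MEM/WB")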

Branch Hazards
When branching, other instructions are already in the pipeline! We need to add hardware for flushing those instructions if we guessed wrong.
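One way to picture it (a loose Python sketch of a predict-not-taken policy; the function and its arguments are made up for illustration):

    # Sketch: with predict-not-taken, instructions fetched after a branch are speculative.
    # If the branch turns out to be taken, they are flushed (turned into bubbles)
    # and fetch restarts at the branch target.
    def after_branch_resolves(taken, speculative_instrs, target_pc, fallthrough_pc):
        if taken:
            bubbles = ["nop"] * len(speculative_instrs)   # squash the wrong-path work
            return target_pc, bubbles
        return fallthrough_pc, speculative_instrs         # prediction was right: keep them

    print(after_branch_resolves(True, ["and $12,$2,$5", "or $13,$6,$2"], 0x4000, 0x100C))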

Pipeline Summary
A very common technique to improve the throughput of any circuit; used in all modern processors!
Fallacies:
  "Pipelining is easy." No, smart people get it wrong all of the time!
  "Pipelining is independent of ISA." No, many ISA decisions affect how easy/costly it is to implement pipelining (e.g., branch semantics, addressing modes).
  "Increasing pipeline stages improves performance." No, returns diminish because of increasing complexity.

What else can we do … to improve parallelism?

Multicore/Multiprocessor
Use more than one processor = multiprocessor; called multicore when they are all on the same chip. Read all about it in Chapter 7 of the textbook.
[Figure 7.2: Classic organization of a shared memory multiprocessor. © 2009 Elsevier, Inc.]

That's it, folks! So, what did we learn this semester?

What we learnt this semester
You now have a pretty good idea about how computers are designed and how they work:
  How data and instructions are represented
  How arithmetic and logic operations are performed
  How ALU and control circuits are implemented
  How registers and the memory hierarchy are implemented
  How performance is measured (self study)
  How performance is increased via pipelining
Lots of low-level programming experience: C and MIPS.
  This is how programs are actually executed!
  This is how OS/networking code is actually written!
  Java and other higher-level languages are convenient high-level abstractions. You probably have a new appreciation for them!

Grades? We are frantically trying to wrap up all grading! Your final grades will be on ConnectCarolina by the end of this week. Also, don’t forget to submit your course evaluation!