Now that’s what I call dirty laundry


Pipelining

Between Spring break and 411 problem sets, I haven't had a minute to do laundry. Now that's what I call dirty laundry! Read Chapter 6.1.

Forget 411… Let's Solve a "Relevant Problem"

INPUT: dirty laundry
Device: Washer. Function: Fill, Agitate, Spin. WasherPD = 30 mins
Device: Dryer. Function: Heat, Spin. DryerPD = 60 mins
OUTPUT: 4 more weeks

One Load at a Time

Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart. (Sorry Mom, but you were wrong about this one!)

Step 1: Wash (30 mins)
Step 2: Dry (60 mins)

Total = WasherPD + DryerPD = 90 mins

Doing N Loads of Laundry

Here's how they do laundry at Duke, the "combinational" way. (Actually, this is just an urban legend. No one at Duke actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry, and return it all pressed and starched by dinner.)

Step 1: Wash load 1. Step 2: Dry load 1. Step 3: Wash load 2. Step 4: Dry load 2. …

Total = N*(WasherPD + DryerPD) = N*90 mins

Doing N Loads… the UNC Way

UNC students "pipeline" the laundry process. That's why we wait!

Step 1: Wash load 1. Step 2: Dry load 1 while washing load 2. Step 3: Dry load 2 while washing load 3. …

Total = N * Max(WasherPD, DryerPD) = N*60 mins

Actually, it's more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we're mostly interested in the "steady state", where we assume we have an infinite supply of inputs.
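The laundry arithmetic above can be checked with a short sketch (stage times in minutes are the ones from the slides):

```python
def sequential_time(n, washer=30, dryer=60):
    # One load at a time: each load occupies both machines in turn.
    return n * (washer + dryer)

def pipelined_time(n, washer=30, dryer=60):
    # Overlapped loads: after a startup transient of one washer cycle,
    # the slower machine finishes a load every max(washer, dryer) minutes.
    return washer + n * max(washer, dryer)

print(sequential_time(4))   # 360 (the Duke way: N*90)
print(pipelined_time(4))    # 270 (the UNC way: N*60 + 30)
```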

Recall Our Performance Measures

Latency: the delay from when an input is established until the output associated with that input becomes valid.
(Duke Laundry = 90 mins; UNC Laundry = 120 mins, assuming the wash is started as soon as possible and waits, wet, in the washer until the dryer is available.)

Throughput: the rate at which inputs or outputs are processed.
(Duke Laundry = 1/90 outputs/min; UNC Laundry = 1/60 outputs/min.)

Even though we increase latency, it takes less time per load.

Okay, Back to Circuits…

[Figure: a combinational circuit in which F(X) and G(X) feed H, which produces P(X).]

For combinational logic: latency = tPD, throughput = 1/tPD. We can't get the answer faster, but are we making effective use of our hardware at all times? F & G are "idle", just holding their outputs stable while H performs its computation.

Pipelined Circuits

Pipelining uses registers to improve the throughput of combinational circuits: use registers to hold H's input stable! Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We've created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.

Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):

                   latency          throughput
unpipelined        45 ns            1/45 per ns
2-stage pipeline   50 ns (worse)    1/25 per ns (better)
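Those latency/throughput numbers follow directly from the stage delays; a minimal sketch (delays in ns from the slide, ideal registers assumed):

```python
F, G, H = 15, 20, 25  # propagation delays in ns

# Unpipelined: the combinational path is max(F, G) followed by H.
latency_unpiped = max(F, G) + H            # 45 ns
throughput_unpiped = 1 / latency_unpiped   # 1/45 results per ns

# 2-stage pipeline: clock period = slowest stage.
period = max(max(F, G), H)                 # 25 ns
latency_piped = 2 * period                 # 50 ns (worse)
throughput_piped = 1 / period              # 1/25 results per ns (better)

print(latency_unpiped, latency_piped)      # 45 50
```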

Pipeline Diagrams

The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle. This is an example of parallelism: at any instant we are computing 2 results.

Clock cycle:   i             i+1              i+2              i+3            ...
Input:         Xi            Xi+1             Xi+2             Xi+3
F Reg, G Reg:                F(Xi), G(Xi)     F(Xi+1), G(Xi+1) F(Xi+2), G(Xi+2)
H Reg:                                        H(Xi)            H(Xi+1)
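A pipeline diagram like this can be generated mechanically, since stage k holds the result for input X(i-k) during cycle i. A small sketch (the stage names are illustrative, not part of the hardware):

```python
def pipeline_diagram(stages, n_inputs, n_cycles):
    """For each stage, list which input occupies it in each clock cycle,
    assuming an ideal pipeline with no stalls."""
    rows = []
    for k, name in enumerate(stages):
        row = []
        for cycle in range(n_cycles):
            i = cycle - k  # index of the input currently in this stage
            row.append(f"{name}(X{i})" if 0 <= i < n_inputs else "-")
        rows.append((name, row))
    return rows

for name, row in pipeline_diagram(["F/G", "H"], 4, 5):
    print(f"{name:4}", " ".join(row))
```

Reading a column of the output gives the state of the whole pipeline in one cycle; reading a diagonal follows one input through every stage.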

Pipelining Summary

Advantages:
Higher throughput than a combinational system.
Different parts of the logic work on different parts of the problem.

Disadvantages:
Generally increases latency.
Only as good as the *weakest* link (often called the pipeline's BOTTLENECK).

This bottleneck is the only problem. Isn't there a way around this "weak link"?

How do UNC students REALLY do Laundry?

They work around the bottleneck. First, they find a place with twice as many dryers as washers.

Throughput = 1/30 loads/min
Latency = 90 mins/load
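Interleaving two 60-minute dryers behind one 30-minute washer makes the effective dryer rate match the washer; a sketch of that check (machine times from the earlier slides):

```python
washer, dryer = 30, 60   # minutes
n_dryers = 2

# Steady state: the washer finishes a load every 30 min, and two
# 60-min dryers alternating also accept a new load every 30 min.
interval = max(washer, dryer / n_dryers)   # 30 min between finished loads
latency = washer + dryer                   # 90 min for any single load
throughput = 1 / interval                  # 1/30 loads per min

print(interval, latency)   # 30 90
```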

Better Yet… Parallelism

We can combine interleaving and pipelining with parallelism.

Throughput = 2/30 = 1/15 loads/min
Latency = 90 mins

"Classroom Computer"

There are lots of problem sets to grade, each with six problems. Students in Row 1 grade Problem 1 and then hand it back to Row 2 for grading Problem 2, and so on. Assuming we want to pipeline the grading, how do we time the passing of papers between rows?

[Figure: psets flow from Row 1 through Row 6, one row per problem.]

Controls for the "Classroom Computer"

Globally Timed, Synchronous: Teacher picks a time interval long enough for the worst-case student to grade the toughest problem. Everyone passes psets at the end of the interval.

Globally Timed, Asynchronous: Teacher picks a variable time interval long enough for the current students to grade the current set of problems. Everyone passes psets at the end of the interval.

Locally Timed, Synchronous: Students raise hands when they finish grading the current problem. The teacher checks every 10 secs; when all hands are raised, everyone passes psets to the row behind. (Variant: students can pass when all students in a "column" have hands raised.)

Locally Timed, Asynchronous: Students grade the current problem, wait for the student in the next row to be free, and then pass the pset back.

Control Structure Taxonomy

Globally Timed, Synchronous: A centralized clocked FSM generates all control signals. Easy to design, but a fixed-size interval can be wasteful (no data-dependencies in timing).

Globally Timed, Asynchronous: A central control unit tailors the current time slice to the current tasks. Large systems lead to very complicated timing generators… just say no!

Locally Timed, Synchronous: Start and Finish signals are generated by each major subsystem, synchronously with a global clock. The best way to build large systems that have independent components.

Locally Timed, Asynchronous: Each subsystem takes an asynchronous Start and generates an asynchronous Finish (perhaps using a local clock). The "next big idea" for the last several decades: a lot of design work to do in general, but the extra work is worth it in special cases.

Review of CPU Performance

MIPS = Millions of Instructions per Second
MIPS = Freq / CPI
Freq = clock frequency, in MHz
CPI = clocks per instruction

To increase MIPS:
1. DECREASE CPI.
- RISC simplicity reduces CPI to 1.0.
- CPI below 1.0? State-of-the-art multiple instruction issue.
2. INCREASE Freq.
- Freq is limited by the delay along the longest combinational path; hence
- PIPELINING is the key to improving performance.
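A quick worked example of the MIPS formula (the numbers are illustrative, not from the slides):

```python
def mips(freq_mhz, cpi):
    # MIPS = (clock frequency in MHz) / (average clocks per instruction)
    return freq_mhz / cpi

print(mips(100, 1.0))   # 100.0: a 100 MHz RISC at CPI 1.0 gives 100 MIPS
print(mips(100, 2.5))   # 40.0: same clock, CPI 2.5 gives only 40 MIPS
```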

miniMIPS Timing CLK New PC The diagram on the left illustrates the Data Flow of miniMIPS Wanted: longest path Complications: some apparent paths aren’t “possible” functional units have variable execution times (eg, ALU) time axis is not to scale (eg, tPD,MEM is very big!) PC+4 Fetch Inst. Control Logic Read Regs Sign Extend +OFFSET ASEL mux BSEL mux ALU Fetch data PCSEL mux WASEL mux WDSEL mux PC setup RF setup Mem setup CLK

Where Are the Bottlenecks?

[Figure: the single-cycle miniMIPS datapath: PC and +4, instruction memory, register file, sign extension, ASEL/BSEL muxes, ALU, data memory, and the PCSEL/WASEL/WDSEL write-back muxes, with their control signals.]

Pipelining goal: break LONG combinational paths, putting the memories and the ALU in separate stages.

Ultimate Goal: a 5-Stage Pipeline

GOAL: maintain (nearly) 1.0 CPI, but increase clock speed to barely include the slowest components (memories, register file, ALU).

APPROACH: structure the processor as a 5-stage pipeline:
IF, Instruction Fetch stage: maintains the PC, fetches one instruction per cycle and passes it on.
ID/RF, Instruction Decode/Register File stage: decodes control lines and selects source operands.
ALU, ALU stage: performs the specified operation, passes the result on.
MEM, Memory stage: if it's a lw, uses the ALU result as an address; passes memory data (or the ALU result if not a lw) on.
WB, Write-Back stage: writes the result back into the register file.
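The stage occupancy of an ideal 5-stage pipeline can be sketched with a tiny scheduler (instruction names are illustrative; no hazards or stalls are modeled):

```python
STAGES = ["IF", "ID/RF", "ALU", "MEM", "WB"]

def schedule(instrs):
    """Map each (instruction, stage) pair to the clock cycle it occupies,
    assuming an ideal pipeline: instruction i enters stage k in cycle i+k."""
    return {(ins, st): i + k
            for i, ins in enumerate(instrs)
            for k, st in enumerate(STAGES)}

s = schedule(["lw", "add", "sw"])
print(s[("lw", "WB")])   # 4: the first instruction writes back in cycle 4
print(s[("sw", "IF")])   # 2: the third instruction is fetched in cycle 2
```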

miniMIPS Timing

Different instructions use various parts of the data path.

[Figure: per-instruction timing for add $4,$5,$6; beq $1,$2,40; lw $3,30($0); jal 20000; sw $2,20($4), built from component delays of 6 nS, 2 nS, 5 nS, 4 nS, and 1 nS for instruction fetch, instruction decode, register propagation, ALU operation, branch target, data access, and register setup. The instructions complete in 14 nS, 14 nS, 20 nS, 9 nS, and 19 nS respectively.]

This is an example of an "Asynchronous, Globally-Timed" control strategy (see Lecture 18). Such a system would vary the clock period based on the instruction being executed. This leads to complicated timing generation and, in the end, slower systems, since it is not very compatible with pipelining!

Uniform miniMIPS Timing

With a fixed clock period, we have to allow for the worst case: 1 instr EVERY 20 nS.

By accounting for the "worst case" path (i.e., allowing time for each possible combination of operations) we can implement a "Synchronous, Globally-Timed" control strategy. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining!

Isn't the net effect just a slower CPU?
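The fixed clock is just the maximum over every instruction's path delay. A sketch of that computation; the per-component delays and path compositions below are hypothetical stand-ins chosen to reproduce the slide's 20 nS worst case, not the exact figures from the datapath diagram:

```python
# Hypothetical component delays in ns (assumption for illustration).
delay = {"IF": 6, "ID": 2, "REG": 1, "ALU": 5, "MEM": 4, "WB": 2}

# Hypothetical component chains for two instruction classes.
paths = {
    "add": ["IF", "ID", "REG", "ALU", "WB"],          # 16 ns
    "lw":  ["IF", "ID", "REG", "ALU", "MEM", "WB"],   # 20 ns
}

# A synchronous, globally-timed clock must cover the slowest path.
clock = max(sum(delay[c] for c in path) for path in paths.values())
print(clock)   # 20
```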

Step 1: A 2-Stage Pipeline

[Figure: the miniMIPS datapath split into an IF stage (PC, +4, instruction memory) and an EXE stage (register file, sign extension, ASEL/BSEL muxes, ALU, data memory, WASEL/WDSEL write-back muxes), separated by the pipeline registers PC^EXE and IR^EXE.]

IR stands for "Instruction Register". The superscript "EXE" denotes the pipeline stage in which the PC and IR are used.

2-Stage Pipe Timing

Pipelining improves performance by increasing instruction throughput; the ideal speedup is the number of stages in the pipeline. By partitioning each instruction cycle into a "fetch" stage and an "execute" stage, we get a simple pipeline. During this, and all subsequent clocks, two instructions are in various stages of execution.

Why not include the Instruction-Decode/Register-Access time with the Instruction Fetch? You could. But this partitioning allows for a useful variant with 2-cycle loads and stores.

Latency? 2 clock periods = 2*14 nS
Throughput? 1 instr per 14 nS

2-Stage w/ 2-Cycle Loads & Stores

This further improves performance, with a slight increase in control complexity. Some 1st-generation (pre-cache) RISC processors used this approach. This design is very similar to the multicycle CPU described in Section 5.5 of the text, but with pipelining.

Clock: 8 nS! The clock rate of this variant is over twice that of our original design. Does that mean it is that much faster? Not likely. In practice, as many as 30% of instructions access memory and take the extra cycle, so the effective speed-up is smaller than the clock-rate ratio suggests.
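A back-of-the-envelope version of that speed-up argument (the 30% memory-access figure is from the slide; using the 20 nS single-cycle clock and the 8 nS pipelined clock from the slides above as the comparison points is my assumption):

```python
single_cycle = 20.0   # ns per instruction in the single-cycle design
clock = 8.0           # ns clock, 2-stage pipe with 2-cycle loads/stores
mem_frac = 0.30       # fraction of instructions that access memory

# Memory instructions take 2 cycles, everything else takes 1.
avg_time = (1 - mem_frac) * clock + mem_frac * 2 * clock   # 10.4 ns
speedup = single_cycle / avg_time

print(round(speedup, 2))   # 1.92: well short of the 2.5x clock-rate ratio
```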

2-Stage Pipelined Operation

Consider a sequence of instructions:

addi $t2,$t1,1
xor $t2,$t1,$t2
sltiu $t3,$t2,1
srl $t2,$t2,1

Recall "Pipeline Diagrams" from an earlier slide. Executed on our 2-stage pipeline:

TIME (cycles):  i      i+1    i+2     i+3    i+4
IF:             addi   xor    sltiu   srl    ...
EXE:                   addi   xor     sltiu  srl

It can't be this easy!?

Step 2: A 4-Stage miniMIPS

[Figure: the datapath split into four stages: Instruction Fetch (PC, instruction memory), Register File (register read, sign extension, branch logic), ALU (ASEL/BSEL muxes, ALU), and Write Back (data memory, WASEL/WDSEL muxes, register write). Pipeline registers PC/IR/WD separate the stages; the register file drawn in the Write Back stage is the SAME RF as in the Register File stage.]

This treats the register file as two separate devices: a combinational READ, and a clocked WRITE at the end of the pipe. What other information do we have to pass down the pipeline? The PC and the instruction fields. What sort of improvement should we expect in cycle time?

4-Stage miniMIPS Operation

Consider a sequence of instructions:

addi $t0,$t0,1
sll $t1,$t1,2
andi $t2,$t2,15
sub $t3,$0,$t3

Executed on our 4-stage pipeline:

TIME (cycles):  i      i+1    i+2    i+3    i+4    i+5    i+6
IF:             addi   sll    andi   sub    ...
RF:                    addi   sll    andi   sub
ALU:                          addi   sll    andi   sub
WB:                                  addi   sll    andi   sub