A relevant question Assuming you’ve got: One washer (takes 30 minutes)

Slides:

Advertisements

Similar presentations

PipelineCSCE430/830 Pipeline: Introduction CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Prof. Yifeng Zhu, U of Maine Fall,

Advertisements

Pipeline Example: cycle 1 lw R10,9(R1) sub R11,R2, R3 and R12,R4, R5 or R13,R6, R7.

CMPT 334 Computer Organization

Pipelining I (1) Fall 2005 Lecture 18: Pipelining I.

Computer Organization

The Processor: Datapath & Control

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania Computer Organization Pipelined Processor Design 1.

©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan

Computer ArchitectureFall 2007 © October 22nd, 2007 Majd F. Sakr CS-447– Computer Architecture.

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 9, 2002 Topic: Pipelining Basics.

1 Atanasoff–Berry Computer, built by Professor John Vincent Atanasoff and grad student Clifford Berry in the basement of the physics building at Iowa State.

Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.

Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 17 - Pipelined.

Morgan Kaufmann Publishers

1  What is the most boring household activity?. 2 A relevant question  Assuming you’ve got: —One washer (takes 30 minutes) —One drier (takes 40 minutes)

CMPE 421 Advanced Computer Architecture Supplementary material for Pipelining PART1.

Automobile Manufacturing 1. Build frame. 60 min. 2. Add engine. 50 min. 3. Build body. 80 min. 4. Paint. 40 min. 5. Finish.45 min. 275 min. Latency: Time.

B 0000 Pipelining ENGR xD52 Eric VanWyk Fall

1 A pipeline diagram  A pipeline diagram shows the execution of a series of instructions. —The instruction sequence is shown vertically, from top to bottom.

CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.

Computer Organization CS224 Chapter 4 Part b The Processor Spring 2010 With thanks to M.J. Irwin, T. Fountain, D. Patterson, and J. Hennessy for some lecture.

1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.

CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.

Electrical and Computer Engineering University of Cyprus LAB3: IMPROVING MIPS PERFORMANCE WITH PIPELINING.

1 A single-cycle MIPS processor  An instruction set architecture is an interface that defines the hardware operations which are available to software.

December 26, 2015©2003 Craig Zilles (derived from slides by Howard Huang) 1 A single-cycle MIPS processor  As previously discussed, an instruction set.

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and

Performance of Single-cycle Design

February 22, 2016©2003 Craig Zilles (derived from slides by Howard Huang) 1 A single-cycle MIPS processor  As previously discussed, an instruction set.

Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.

Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.

Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 10 Computer Hardware Design (Pipeline Datapath and Control Design) Prof. Dr.

Computer Architecture Lecture 6.  Our implementation of the MIPS is simplified memory-reference instructions: lw, sw arithmetic-logical instructions:

Lecture 18: Pipelining I.

Computer Organization

Pipelines An overview of pipelining

Stalling delays the entire pipeline

Note how everything goes left to right, except …

CDA 3101 Spring 2016 Introduction to Computer Organization

Computer Architecture

IT 251 Computer Organization and Architecture

Performance of Single-cycle Design

CMSC 611: Advanced Computer Architecture

Morgan Kaufmann Publishers The Processor

ECE232: Hardware Organization and Design

ECS 154B Computer Architecture II Spring 2009

\course\cpeg323-08F\Topic6b-323

CS/COE0447 Computer Organization & Assembly Language

Chapter 4 The Processor Part 2

CS/COE0447 Computer Organization & Assembly Language

Single-cycle datapath, slightly rearranged

A pipeline diagram Clock cycle lw $t0, 4($sp) IF ID

CS/COE0447 Computer Organization & Assembly Language

Lecturer: Alan Christopher

CS/COE0447 Computer Organization & Assembly Language

Serial versus Pipelined Execution

Systems Architecture II

\course\cpeg323-05F\Topic6b-323

The Processor Lecture 3.2: Building a Datapath with Control

An Introduction to pipelining

Pipelining Appendix A and Chapter 3.

MIPS processor continued

Introduction to Computer Organization and Architecture

The Processor: Datapath & Control.

©2003 Craig Zilles (derived from slides by Howard Huang)

Pipelined datapath and control

CS/COE0447 Computer Organization & Assembly Language

Presentation transcript:

A relevant question Assuming you’ve got: One washer (takes 30 minutes) One drier (takes 40 minutes) One “folder” (takes 20 minutes) It takes 90 minutes to wash, dry, and fold 1 load of laundry. How long does 4 loads take? It takes 90 * 4 = 360 minutes, or just 210 minutes? July 6, 2019 ©2003 Craig Zilles (derived from slides by Howard Huang and David Patterson)

The slow way If each load is done sequentially it takes 6 hours 6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20 30 40 20 30 40 20 If each load is done sequentially it takes 6 hours July 6, 2019 Pipelining

Laundry Pipelining Start each load as soon as possible Overlap loads Pipelined laundry takes 3.5 hours (210 minutes) 210 6 PM 7 8 9 10 11 Midnight Time 30 40 20 July 6, 2019 Pipelining

Pipelining Lessons 70 mins 80 mins 60 mins Pipelining doesn’t help latency of single load, it helps throughput of entire workload Pipeline rate limited by slowest pipeline stage Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline (70 mins) and time to “drain” (60 mins) are reduce speedup 6 PM 7 8 9 Time 30 40 20 July 6, 2019 Pipelining

Pipelining is not just Multiprocessing Pipelining does involve parallel processing, but in a specific way. Both multiprocessing and pipelining relate to the processing of multiple “things” using multiple “functional units” Multiprocessing implies each thing is processed entirely by a single functional unit e.g., multiple lanes at the supermarket In pipelining, each thing is broken into a sequence of pieces, where each piece is handled by a different (specialized) functional unit. Supermarket analogy? Pipelining and multiprocessing are not mutually exclusive Modern processors do both, with multiple pipelines (e.g., superscalar) March 17, 2003 Pipelining

Pipelining Pipelining is a general-purpose efficiency technique It is not specific to processors Pipelining is used in: Assembly lines Bucket brigades Fast food restaurants Pipelining is used in other CS disciplines: Networking Server software architecture Useful to increase throughput in the presence of long latency More on that later… July 6, 2019 Pipelining

Instruction execution review Executing a MIPS instruction can take up to five steps. However, as we saw, not all instructions need all five steps. Step Name Description Instruction Fetch IF Read an instruction from memory. Instruction Decode ID Read source registers and generate control signals. Execute EX Compute an R-type result or a branch outcome. Memory MEM Read or write the data memory. Writeback WB Store a result in the destination register. Instruction Steps required beq IF ID EX R-type WB sw MEM lw July 6, 2019 Pipelining

Single-cycle datapath diagram 4 Shift left 2 PC Add M u x 1 PCSrc Read address Write data Data memory MemWrite MemRead MemToReg Instruction [31-0] I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite Sign extend ALUSrc Result Zero ALU ALUOp 2ns 2ns 2ns How long does it take to execute each instruction? July 6, 2019 Pipelining

Example: Instruction Fetch (IF) Let’s quickly review how lw is executed in the single-cycle datapath. We’ll ignore PC incrementing and branching for now. In the Instruction Fetch (IF) step, we read the instruction memory. Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite July 6, 2019 Pipelining

Instruction Decode (ID) The Instruction Decode (ID) step reads the source register from the register file. Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite July 6, 2019 Pipelining

Execute (EX) The third step, Execute (EX), computes the effective memory address from the source register and the instruction’s constant field. Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite July 6, 2019 Pipelining

Memory (MEM) The Memory (MEM) step involves reading the data memory, from the address computed by the ALU. Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite July 6, 2019 Pipelining

Writeback (WB) Finally, in the Writeback (WB) step, the memory value is stored into the destination register. RegWrite MemWrite MemToReg Read address Instruction [31-0] I [25 - 21] Read register 1 Read data 1 ALU Read address Read data 1 M u x I [20 - 16] Read register 2 Zero Instruction memory Read data 2 M u x 1 M u x 1 Result Write address Write register Data memory Write data Registers I [15 - 11] Write data ALUOp MemRead ALUSrc RegDst I [15 - 0] Sign extend July 6, 2019 Pipelining

A bunch of lazy functional units Notice that each execution step uses a different functional unit. In other words, the main units are idle for most of the 8ns cycle! The instruction RAM is used for just 2ns at the start of the cycle. Registers are read once in ID (1ns), and written once in WB (1ns). The ALU is used for 2ns near the middle of the cycle. Reading the data memory only takes 2ns as well. That’s a lot of hardware sitting around doing nothing. July 6, 2019 Pipelining

Putting those slackers to work We shouldn’t have to wait for the entire instruction to complete before we can re-use the functional units. For example, the instruction memory is free in the Instruction Decode step as shown below, so... Idle Instruction Decode (ID) Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite July 6, 2019 Pipelining

Decoding and fetching together Why don’t we go ahead and fetch the next instruction while we’re decoding the first one? Fetch 2nd Decode 1st instruction Instruction memory [31-0] Read address Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite July 6, 2019 Pipelining

Executing, decoding and fetching Similarly, once the first instruction enters its Execute stage, we can go ahead and decode the second instruction. But now the instruction memory is free again, so we can fetch the third instruction! Fetch 3rd Decode 2nd Execute 1st Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite July 6, 2019 Pipelining

Making Pipelining Work We’ll make our pipeline 5 stages long, to handle each of the five steps in a load instructions (the longest instruction for this machine) Stages are: IF, ID, EX, MEM, and WB We want to support executing 5 instructions simultaneously: one in each stage. July 6, 2019 Pipelining

Break datapath into 5 stages Insert pipeline registers Each stage has its own functional units. Each stage can execute in 2ns IF ID EXE MEM WB RegWrite MemWrite MemToReg Read address Instruction [31-0] I [25 - 21] Read register 1 Read data 1 ALU Read address Read data 1 M u x I [20 - 16] Read register 2 Zero Instruction memory Read data 2 M u x 1 M u x 1 Result Write address Write register Data memory Write data Registers I [15 - 11] Write data ALUOp MemRead ALUSrc RegDst I [15 - 0] Sign extend 2ns 1ns 2ns 2ns July 6, 2019 Pipelining

Pipelining Loads Clock cycle 1 2 3 4 5 6 7 8 9 IF ID EX MEM WB 6 PM 7 lw $t0, 4($sp) IF ID EX MEM WB lw $t1, 8($sp) lw $t2, 12($sp) lw $t3, 16($sp) lw $t4, 20($sp) 6 PM 7 8 9 Time 30 40 20 July 6, 2019 Pipelining

Pipelining Performance Clock cycle 1 2 3 4 5 6 7 8 9 lw $t0, 4($sp) IF ID EX MEM WB lw $t1, 8($sp) lw $t2, 12($sp) lw $t3, 16($sp) lw $t4, 20($sp) filling Execution time on ideal pipeline: time to fill the pipeline + one cycle per instruction How long for N instructions? Compare with other implementations: Single Cycle: (8ns clock period) How much faster is pipelining for N=1000 ? July 6, 2019 Pipelining

Pipeline Datapath: Resource Requirements Clock cycle 1 2 3 4 5 6 7 8 9 lw $t0, 4($sp) IF ID EX MEM WB lw $t1, 8($sp) lw $t2, 12($sp) lw $t3, 16($sp) lw $t4, 20($sp) We need to perform several operations in the same cycle. Increment the PC and add registers at the same time. Fetch one instruction while another one reads or writes data. What does that mean for our hardware? July 6, 2019 Pipelining

Pipelining other instruction types R-type instructions only require 4 stages: IF, ID, EX, and WB We don’t need the MEM stage What happens if we try to pipeline loads with R-type instructions? Clock cycle 1 2 3 4 5 6 7 8 9 add $sp, $sp, -4 IF ID EX WB sub $v0, $a0, $a1 lw $t0, 4($sp) MEM or $s0, $s1, $s2 lw $t1, 8($sp) July 6, 2019 Pipelining

Important Observation Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all instructions: Load uses Register File’s Write Port during its 5th stage R-type uses Register File’s Write Port during its 4th stage Clock cycle 1 2 3 4 5 6 7 8 9 add $sp, $sp, -4 IF ID EX WB sub $v0, $a0, $a1 lw $t0, 4($sp) MEM or $s0, $s1, $s2 lw $t1, 8($sp) July 6, 2019 Pipelining

A solution: Insert NOP stages Enforce uniformity Make all instructions take 5 cycles. Make them have the same stages, in the same order Some stages will do nothing for some instructions Stores and Branches have NOP stages, too… R-type IF ID EX NOP WB Clock cycle 1 2 3 4 5 6 7 8 9 add $sp, $sp, -4 IF ID EX NOP WB sub $v0, $a0, $a1 lw $t0, 4($sp) MEM or $s0, $s1, $s2 lw $t1, 8($sp) store IF ID EX MEM NOP branch IF ID EX NOP July 6, 2019 Pipelining

Summary Pipelining attempts to maximize instruction throughput by overlapping the execution of multiple instructions. Pipelining offers amazing speedup. In the best case, one instruction finishes on every cycle, and the speedup is equal to the pipeline depth. The pipeline datapath is much like the single-cycle one, but with added pipeline registers Each stage needs is own functional units Next time we’ll see the datapath and control, and walk through an example execution. July 6, 2019 Pipelining