Performance of Single-cycle Design

Slides:



Advertisements
Similar presentations
Pipeline Example: cycle 1 lw R10,9(R1) sub R11,R2, R3 and R12,R4, R5 or R13,R6, R7.
Advertisements

CMPT 334 Computer Organization
The Processor: Datapath & Control
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan
Chapter Six Enhancing Performance with Pipelining
Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.
Morgan Kaufmann Publishers
1  What is the most boring household activity?. 2 A relevant question  Assuming you’ve got: —One washer (takes 30 minutes) —One drier (takes 40 minutes)
COSC 3430 L08 Basic MIPS Architecture.1 COSC 3430 Computer Architecture Lecture 08 Processors Single cycle Datapath PH 3: Sections
1 A pipeline diagram  A pipeline diagram shows the execution of a series of instructions. —The instruction sequence is shown vertically, from top to bottom.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
Computer Organization CS224 Chapter 4 Part b The Processor Spring 2010 With thanks to M.J. Irwin, T. Fountain, D. Patterson, and J. Hennessy for some lecture.
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
Electrical and Computer Engineering University of Cyprus LAB3: IMPROVING MIPS PERFORMANCE WITH PIPELINING.
1 A single-cycle MIPS processor  An instruction set architecture is an interface that defines the hardware operations which are available to software.
CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
Performance of Single-cycle Design
IT 251 Computer Organization and Architecture Multi Cycle CPU Datapath Chia-Chi Teng.
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.
COM181 Computer Hardware Lecture 6: The MIPs CPU.
1 CS/COE0447 Computer Organization & Assembly Language Chapter 5 Part 3.
1 The final datapath. 2 Control  The control unit is responsible for setting all the control signals so that each instruction is executed properly. —The.
Computer Architecture Lecture 6.  Our implementation of the MIPS is simplified memory-reference instructions: lw, sw arithmetic-logical instructions:
CS161 – Design and Architecture of Computer Systems
Computer Organization
Stalling delays the entire pipeline
Single-Cycle Datapath and Control
Computer Architecture
IT 251 Computer Organization and Architecture
Morgan Kaufmann Publishers The Processor
ECS 154B Computer Architecture II Spring 2009
\course\cpeg323-08F\Topic6b-323
CS/COE0447 Computer Organization & Assembly Language
Design of the Control Unit for Single-Cycle Instruction Execution
MIPS processor continued
Chapter 4 The Processor Part 2
CS/COE0447 Computer Organization & Assembly Language
Single-cycle datapath, slightly rearranged
Single-Cycle CPU DataPath.
CS/COE0447 Computer Organization & Assembly Language
Design of the Control Unit for One-cycle Instruction Execution
A pipeline diagram Clock cycle lw $t0, 4($sp) IF ID
CS/COE0447 Computer Organization & Assembly Language
Lecturer: Alan Christopher
CS/COE0447 Computer Organization & Assembly Language
Serial versus Pipelined Execution
Systems Architecture II
Composing the Elements
Composing the Elements
The Processor Lecture 3.2: Building a Datapath with Control
Guest Lecturer TA: Shreyas Chand
COSC 2021: Computer Organization Instructor: Dr. Amir Asif
COSC 2021: Computer Organization Instructor: Dr. Amir Asif
Processor: Multi-Cycle Datapath & Control
Chapter Four The Processor: Datapath and Control
Pipelining Appendix A and Chapter 3.
MIPS processor continued
Introduction to Computer Organization and Architecture
A single-cycle MIPS processor
A relevant question Assuming you’ve got: One washer (takes 30 minutes)
The Processor: Datapath & Control.
Pipelining.
Processor: Datapath and Control
Pipelined datapath and control
CS161 – Design and Architecture of Computer Systems
CS/COE0447 Computer Organization & Assembly Language
Presentation transcript:

Performance of Single-cycle Design At the start of the cycle, PC is updated (PC + 4, or PC + 4 + offset × 4) New instruction loaded from memory, control unit sets the datapath signals appropriately so that registers are read, ALU output is generated, data memory is accessed, branch target addresses are computed, and register file is updated In a single-cycle datapath everything must complete within one clock cycle, before the next clock cycle How long is that clock cycle? CPU time = Instructions executed  CPI  Clock cycle time CPI = 1 for a single-cycle design

Components of the data-path Each component of the datapath has an associated delay (latency) The cycle time has to be large enough to accommodate the slowest instruction 8ns reading the instruction memory 2ns reading the register file 1ns ALU computation 2ns accessing data memory 2ns writing to the register file 1ns M u x 1 Read address Instruction memory [31-0] Write data Data Sign extend Result Zero ALU I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] register 1 register 2 register data 2 data 1 Registers 2 ns 1 ns 0 ns

(48% x 6ns) + (22% x 8ns) + (11% x 7ns) + (19% x 5ns) = 6.36ns How bad is this? With these same component delays, a sw instruction would need 7ns, and beq would need just 5ns Let’s consider the gcc instruction mix: With a single-cycle datapath, each instruction would require 8ns But if we could execute instructions as fast as possible, the average time per instruction for gcc would be: (48% x 6ns) + (22% x 8ns) + (11% x 7ns) + (19% x 5ns) = 6.36ns The single-cycle datapath is about 1.26 times slower! Instruction Frequency Arithmetic 48% Loads 22% Stores 11% Branches 19%

Improving performance Two ideas for improving performance: Spilt each instruction into multiple steps, each taking 1 cycle steps: IF (instruction fetch), ID (instruction decode), EX (execute ALU operation), MEM (memory access), WB (register write-back) slow instructions take more cycles than fast instructions known as a multi-cycle implementation Crucial observation: each instruction uses only a portion of the datapath in each step can overlap instructions; each uses one portion of the datapath known as a pipelined implementation Examples of pipelining: any assembly process (cars, etc.), multiple loads of laundry (washer + dryer), …

Pipelining: Example Assembling a sandwich: Order, Toast (optional), Add extras, Pay ORD (8 seconds) TOS (0 or 10 seconds) ADD (0 to 10 seconds) PAY (5 seconds) We can assemble sandwiches every 10 seconds with pipelining: A single sandwich takes between 13 and 33 seconds PAY ADD TOS ORD 0 10 20 30 40 50 60

Pipelining lessons Pipelining can increase throughput (#sandwiches per hour), but… Every sandwich must use all stages prevents clashes in the pipeline Every stage must take the same amount of time limited by the slowest stage (in this example, 10 seconds) These two factors decrease the latency (time per sandwich)! For an optimal k-stage pipeline: every stage does useful work stage lengths are balanced Under these conditions, we nearly achieve the optimal speedup: k “nearly” because there is still the fill and drain time

Pipelining not just Multiprocessing Pipelining does involve parallel processing, but in a specific way Both multiprocessing and pipelining relate to the processing of multiple “things” using multiple “functional units” In multiprocessing, each thing is processed entirely by a single functional unit e.g. multiple lanes at the supermarket In pipelining, each thing is broken into a sequence of pieces, where each piece is handled by a different (specialized) functional unit e.g. checker vs. bagger Pipelining and multiprocessing are not mutually exclusive Modern processors do both, with multiple pipelines (e.g. superscalar) Pipelining is a general-purpose efficiency technique; used elsewhere in CS: Networking, I/O devices, server software architecture

Pipelining MIPS Executing a MIPS instruction can take up to five stages Not all instructions need all five stages and stages have different lengths Clock cycle time determined by length of slowest stage (2ns here) Step Name Description Instruction Fetch IF Read an instruction from memory Instruction Decode ID Read source registers and generate control signals Execute EX Compute an R-type result or a branch outcome Memory MEM Read or write the data memory Writeback WB Store a result in the destination register Instruction Steps required beq IF ID EX R-type WB sw MEM lw

Instruction Fetch (IF) While IF is executing, the rest of the datapath is sitting idle… Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite

Instruction Decode (ID) Then while ID is executing, the IF-related portion becomes idle… Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite

Execute (EX) ..and so on for the EX portion… Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite

Memory (MEM) …the MEM portion… Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite

Writeback (WB) …and the WB portion What about the “clash” with the IF stage over the register file? Answer: Register file is read at the end of the clock cycle, but written at the start of the clock cycle. Hence, there is no clash RegWrite MemWrite MemToReg Read address Instruction [31-0] I [25 - 21] Read register 1 Read data 1 ALU Read address Read data 1 M u x I [20 - 16] Read register 2 Zero Instruction memory Read data 2 M u x 1 M u x 1 Result Write address Write register Data memory Write data Registers I [15 - 11] Write data ALUOp MemRead ALUSrc RegDst I [15 - 0] Sign extend

Decoding and fetching together Why don’t we go ahead and fetch the next instruction while we’re decoding the first one? Fetch 2nd Decode 1st instruction Instruction memory [31-0] Read address Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite

Executing, decoding and fetching Similarly, once the first instruction enters its Execute stage, we can go ahead and decode the second instruction But now the instruction memory is free again, so we can fetch the third instruction! Fetch 3rd Decode 2nd Execute 1st Read address Instruction memory [31-0] Write data Data MemWrite MemRead 1 M u x MemToReg Sign extend ALUSrc Result Zero ALU ALUOp I [15 - 0] I [25 - 21] I [20 - 16] I [15 - 11] RegDst register 1 register 2 register data 2 data 1 Registers RegWrite

Break datapath into 5 stages Each stage has its own functional units Full pipeline  the datapath is simultaneously working on 5 instructions! IF ID EXE MEM WB RegWrite MemWrite MemToReg Read address Instruction [31-0] I [25 - 21] Read register 1 Read data 1 ALU Read address Read data 1 M u x I [20 - 16] Read register 2 Zero Instruction memory Read data 2 M u x 1 M u x 1 Result Write address Write register Data memory Write data Registers I [15 - 11] Write data ALUOp MemRead ALUSrc RegDst I [15 - 0] Sign extend newest oldest

A pipeline diagram Clock cycle 1 2 3 4 5 6 7 8 9 lw $t0, 4($sp) IF ID EX MEM WB sub $v0, $a0, $a1 and $t1, $t2, $t3 or $s0, $s1, $s2 addi $sp, $sp, -4 A pipeline diagram shows the execution of a series of instructions The instruction sequence is shown vertically, from top to bottom Clock cycles are shown horizontally, from left to right Each instruction is divided into its component stages Example: In cycle 3, there are three active instructions: The “lw” instruction is in its Execute stage Simultaneously, the “sub” is in its Instruction Decode stage Also, the “and” instruction is just being fetched

Pipeline terminology filling full emptying Clock cycle 1 2 3 4 5 6 7 8 9 lw $t0, 4($sp) IF ID EX MEM WB sub $v0, $a0, $a1 and $t1, $t2, $t3 or $s0, $s1, $s2 add $sp, $sp, -4 filling full emptying The pipeline depth is the number of stages—in this case, five The pipeline is filling in the first four cycles (unused functional units) In cycle 5, the pipeline is full. Five instructions are being executed simultaneously, so all hardware units are in use In cycles 6-9, the pipeline is emptying/draining

Pipelining Performance Clock cycle 1 2 3 4 5 6 7 8 9 lw $t0, 4($sp) IF ID EX MEM WB lw $t1, 8($sp) lw $t2, 12($sp) lw $t3, 16($sp) lw $t4, 20($sp) filling How many cycles to execute N instructions on a k stage pipeline? Solution 1: k  1 cycles to fill the pipeline + one cycle per instruction = k  1 + N cycles Solution 2: k cycles for the first instruction + one cycle for each of the remaining N  1 instructions When N = 1000, how much faster is a 5-stage pipeline (2ns clock cycle) vs. a single cycle implementation (8ns clock cycle)?

Pipeline Datapath: Resource Requirements Clock cycle 1 2 3 4 5 6 7 8 9 lw $t0, 4($sp) IF ID EX MEM WB lw $t1, 8($sp) lw $t2, 12($sp) lw $t3, 16($sp) lw $t4, 20($sp) We need to perform several operations in the same cycle Increment the PC and add registers at the same time Fetch one instruction while another one reads or writes data What does that mean for our hardware? Separate ADDER and ALU Two memories (instruction memory and data memory)