Published byebrew Μανωλάς Modified over 6 years ago
1
School of Computing and Informatics Arizona State University
CSE 420/598 Computer Architecture
Lec 18 – Appendix A – Pipelining (Basics)
Sandeep K. S. Gupta
School of Computing and Informatics, Arizona State University
Based on slides by David Patterson and M. Younis
CS252 S05
2
A "Typical" RISC ISA
32-bit fixed-format instructions (3 formats)
32 32-bit GPRs (R0 contains zero; double-precision values take a register pair)
3-address, reg-reg arithmetic instructions
Single address mode for load/store: base + displacement, no indirection
Simple branch conditions
Delayed branch
See: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
9/20/2018 CSE420/598
3
Basics of a RISC Instruction Set
RISC architectures are characterized by the following features, which dramatically simplify the implementation:
All ALU operations apply only to data in registers
Memory is affected only by load and store operations
Instructions follow very few formats and are typically all the same size

All MIPS instructions are 32 bits, following one of three formats:

R-type:  op (6 bits, 31..26) | rs (5 bits, 25..21) | rt (5 bits, 20..16) | rd (5 bits, 15..11) | shamt (5 bits, 10..6) | funct (6 bits, 5..0)
I-type:  op (6 bits, 31..26) | rs (5 bits, 25..21) | rt (5 bits, 20..16) | immediate (16 bits, 15..0)
J-type:  op (6 bits, 31..26) | target address (26 bits, 25..0)

* Slide is courtesy of Dave Patterson
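The three formats above can be illustrated by slicing a 32-bit word into its fields. This is a minimal sketch, not part of the original slides: the function name `decode` is hypothetical, and the sample word assumes the standard MIPS register numbers ($t1 = 9, $t2 = 10) and the standard `add` function code (0x20).

```python
def decode(word):
    """Return every field of a 32-bit MIPS instruction word.

    All three formats share the op field; which of the remaining
    fields are meaningful depends on the format selected by op.
    """
    return {
        "op": (word >> 26) & 0x3F,        # bits 31..26, every format
        "rs": (word >> 21) & 0x1F,        # bits 25..21, R-type and I-type
        "rt": (word >> 16) & 0x1F,        # bits 20..16, R-type and I-type
        "rd": (word >> 11) & 0x1F,        # bits 15..11, R-type only
        "shamt": (word >> 6) & 0x1F,      # bits 10..6,  R-type only
        "funct": word & 0x3F,             # bits 5..0,   R-type only
        "immediate": word & 0xFFFF,       # bits 15..0,  I-type only
        "target": word & 0x03FFFFFF,      # bits 25..0,  J-type only
    }


# Assumed encoding of: add $t2, $t1, $t1
fields = decode(0x01295020)
print(fields["op"], fields["rs"], fields["rt"], fields["rd"], fields["funct"])
```

Because every field sits at a fixed position, the hardware can read the register numbers before it knows which instruction it is decoding, which is one reason the fixed format simplifies the implementation.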
4
MIPS Instruction format
Register-format instructions:
op: Basic operation of the instruction, traditionally called the opcode
rs: The first register source operand
rt: The second register source operand
rd: The register destination operand; it gets the result of the operation
shamt: Shift amount
funct: Selects the specific variant of the operation in the op field

MIPS assembly language includes two conditional branch instructions that use PC-relative addressing:
beq register1, register2, L1   # go to L1 if (register1) = (register2)
bne register1, register2, L1   # go to L1 if (register1) ≠ (register2)

Examples:
add $t2, $t1, $t1     # Temp reg $t2 = 2 × $t1
sub $t1, $s3, $s4     # Temp reg $t1 = $s3 - $s4
and $t1, $t2, $t3     # Temp reg $t1 = $t2 AND $t3
bne $s3, $s4, Else    # if $s3 ≠ $s4 jump to Else
5
MIPS Instruction format
Immediate-type instructions:
The 16-bit address means a load word instruction can load a word within a region of ±2^15 bytes of the address in the base register
Examples: lw $t0, 32($s3), sw $t1, 128($s3)
MIPS handles 16-bit constants efficiently by including the constant value in the address field of an I-type (immediate-type) instruction
addi $sp, $sp, 4    # $sp = $sp + 4
For large constants that need more than 16 bits, a load upper immediate (lui) instruction sets the upper 16 bits of a register; the lower 16 bits are then supplied by a second instruction
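How a constant wider than 16 bits is assembled from two halves can be sketched as follows. The slide names only `lui`; pairing it with `ori` for the lower half is the conventional MIPS idiom and is an assumption here, as are the function names.

```python
def split_constant(value):
    """Split a 32-bit constant into the two 16-bit immediates."""
    upper = (value >> 16) & 0xFFFF   # immediate for lui
    lower = value & 0xFFFF           # immediate for the follow-up instruction
    return upper, lower


def lui_then_ori(upper, lower):
    """Model the register contents after lui followed by ori (assumed pairing)."""
    reg = (upper << 16) & 0xFFFFFFFF  # lui: upper half set, lower 16 bits zero
    reg |= lower & 0xFFFF             # ori: fill in the lower half
    return reg


upper, lower = split_constant(4_000_000)
print(hex(upper), hex(lower), lui_then_ori(upper, lower))
```

The bitwise OR is safe in this sketch because `lui` leaves the lower 16 bits zero, so no carries or sign extension are involved.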
6
Addressing in Branches & Jumps
I-type instructions leave only 16 bits for the address reference, limiting the size of the jump
MIPS branch instructions use the address as an increment to the PC, allowing the program to be as large as 2^32 (called PC-relative addressing)
Since the program counter is incremented prior to instruction execution, the branch address is actually relative to (PC + 4)
MIPS also supports a J-type instruction format for large jumps
The 26-bit address in a J-type instruction is shifted left by 2 bits and concatenated with the upper 4 bits of the PC
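The PC-relative calculation can be sketched in a few lines. This assumes the standard MIPS convention that the 16-bit field is a signed count of words, so it is shifted left by 2 before being added to PC + 4; the slide itself states only that the address is an increment relative to PC + 4. The function names are hypothetical.

```python
def sign_extend16(imm):
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """Branch target relative to PC + 4 (word offset assumed)."""
    offset_bytes = sign_extend16(imm16) << 2
    return (pc + 4 + offset_bytes) & 0xFFFFFFFF


print(hex(branch_target(0x1000, 3)))       # forward branch
print(hex(branch_target(0x1000, 0xFFFF)))  # offset of -1 word
```

With an offset of -1 word the target is the branch instruction itself, which shows directly that the base of the calculation is PC + 4 rather than PC.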
7
5 Steps of MIPS Datapath
[Figure: MIPS datapath divided into five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Components shown: Next PC mux, adder (+4), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), ALU with Zero? test, data memory, LMD, and the WB Data mux.]
IR <= mem[PC]; PC <= PC + 4
Reg[IRrd] <= Reg[IRrs] op_IRop Reg[IRrt]
8
5 Steps of MIPS Datapath
[Figure: MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. The destination register number RD is carried through the pipeline registers.]
IR <= mem[PC]; PC <= PC + 4
A <= Reg[IRrs]; B <= Reg[IRrt]
rslt <= A op_IRop B
WB <= rslt
Reg[IRrd] <= WB
9
Inst. Set Processor Controller
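[Figure: controller state diagram. States and their register transfers are listed below; the JSR, JR, and ST states appear in the diagram without recoverable transfers.]
Ifetch: IR <= mem[PC]; PC <= PC + 4
opFetch-DCD: A <= Reg[IRrs]; B <= Reg[IRrt]
jmp: PC <= IRjaddr
br: if bop(A,b) PC <= PC + IRim
LD: r <= A + IRim; WB <= Mem[r]; Reg[IRrd] <= WB
RI: r <= A op_IRop IRim; WB <= r; Reg[IRrd] <= WB
RR: r <= A op_IRop B; WB <= r; Reg[IRrd] <= WB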
10
A Simple Implementation of MIPS
11
Single-cycle Instruction Execution
12
Multi-Cycle Implementation of MIPS
Instruction fetch cycle (IF)
  IR ← Mem[PC]; NPC ← PC + 4
Instruction decode/register fetch cycle (ID)
  A ← Regs[IR6..10]; B ← Regs[IR11..15]; Imm ← ((IR16)^16 ## IR16..31)
Execution/effective address cycle (EX)
  Memory ref: ALUOutput ← A + Imm
  Reg-Reg ALU: ALUOutput ← A func B
  Reg-Imm ALU: ALUOutput ← A op Imm
  Branch: ALUOutput ← NPC + Imm; Cond ← (A op 0)
Memory access/branch completion cycle (MEM)
  Memory ref: LMD ← Mem[ALUOutput] or Mem[ALUOutput] ← B
  Branch: if (cond) PC ← ALUOutput
Write-back cycle (WB)
  Reg-Reg ALU: Regs[IR16..20] ← ALUOutput
  Reg-Imm ALU: Regs[IR11..15] ← ALUOutput
  Load: Regs[IR11..15] ← LMD
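The five cycles above can be traced in software for a few instruction classes. This is a rough sketch under stated assumptions, not the slides' implementation: instructions are hypothetical tuples `(kind, rs, rt, rd, imm)` rather than encoded words, `imm` is taken to be already sign-extended, memory is a dictionary keyed by byte address, and only `lw`, `sw`, `add`, and a `beqz`-style branch are modelled.

```python
def execute(instr, regs, mem, pc):
    """Run one instruction through IF, ID, EX, MEM, WB; return the next PC."""
    kind, rs, rt, rd, imm = instr

    # IF: the instruction is passed in already fetched; compute NPC.
    npc = pc + 4

    # ID: read both register operands (Imm is assumed already sign-extended).
    a, b = regs[rs], regs[rt]

    # EX: ALU operation, effective address, or branch target and condition.
    cond = False
    if kind in ("lw", "sw"):
        alu_output = a + imm
    elif kind == "add":
        alu_output = a + b
    elif kind == "beqz":
        alu_output = npc + imm
        cond = (a == 0)
    else:
        raise ValueError(f"unsupported instruction kind: {kind}")

    # MEM: load, store, or branch completion.
    lmd = None
    if kind == "lw":
        lmd = mem[alu_output]
    elif kind == "sw":
        mem[alu_output] = b
    elif kind == "beqz" and cond:
        npc = alu_output

    # WB: only ALU results and loads write the register file.
    if kind == "add":
        regs[rd] = alu_output
    elif kind == "lw":
        regs[rt] = lmd

    return npc


regs = [0] * 8
regs[1] = 100
mem = {104: 7}
pc = execute(("lw", 1, 2, 0, 4), regs, mem, 0)    # regs[2] <- Mem[100 + 4]
pc = execute(("add", 2, 2, 3, 0), regs, mem, pc)  # regs[3] <- regs[2] + regs[2]
print(regs[2], regs[3], pc)
```

Note that stores and branches do nothing in the WB step, which is why a multi-cycle machine can finish them in four cycles.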
13
Multi-cycle Instruction Execution
14
Stages of Instruction Execution
        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load:   Ifetch    Reg/Dec   Exec      Mem       WB

The load instruction is the longest
All instructions follow at most the following five steps:
Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory and update the PC
Reg/Dec: Register Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
WB: Write the data back to the register file

As shown here, each of these five steps takes one clock cycle to complete. In pipeline terminology, each step is referred to as one stage of the pipeline.
* Slide is courtesy of Dave Patterson
15
Instruction Pipelining
Start handling the next instruction while the current instruction is in progress
Pipelining is feasible when different devices are used at different stages of instruction execution
[Figure: successive instructions in program flow, each passing through IFetch, Dec, Exec, Mem, WB, with each instruction offset by one stage in time]
Pipelining improves performance by increasing instruction throughput
16
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timing diagrams over Cycles 1 to 10 for the sequence Load, Store, R-type]
Single Cycle Implementation: Load, then Store (the last part of the Store cycle is wasted)
Multiple Cycle Implementation: Load (Ifetch Reg Exec Mem Wr), then Store (Ifetch Reg Exec Mem), then R-type (Ifetch ...)
Pipeline Implementation:
  Load:    Ifetch Reg    Exec   Mem    Wr
  Store:          Ifetch Reg    Exec   Mem    Wr
  R-type:                Ifetch Reg    Exec   Mem    Wr

Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles.

In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9.

In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than that of the multiple cycle implementation. But maybe more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction, so the last part of the cycle here is wasted.
* Slide is courtesy of Dave Patterson
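The cycle counts quoted for the Load, Store, R-type sequence can be reproduced with a short calculation. This is a sketch under assumptions: the step counts in `STEPS` follow the timing diagram for Load and Store, the value for R-type (4) is assumed because the diagram is cut off, and the pipelined count assumes no stalls.

```python
# Steps each instruction class needs on the multi-cycle machine.
# The r-type entry is an assumption; the diagram shows only its first step.
STEPS = {"load": 5, "store": 4, "r-type": 4}


def multi_cycle_starts(program):
    """First cycle of each instruction when they run strictly one after another."""
    starts, cycle = [], 1
    for instr in program:
        starts.append(cycle)
        cycle += STEPS[instr]
    return starts


def multi_cycle_total(program):
    """Total cycles on the multi-cycle machine."""
    return sum(STEPS[instr] for instr in program)


def pipelined_total(program, depth=5):
    """Total cycles on an ideal pipeline: fill the pipe, then one finish per cycle."""
    return depth + len(program) - 1


program = ["load", "store", "r-type"]
print(multi_cycle_starts(program), multi_cycle_total(program), pipelined_total(program))
```

The start cycles agree with the notes above: the store begins in Cycle 6 and the R-type instruction in Cycle 10, while the pipeline completes all three in seven cycles.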
17
Example of Instruction Pipelining
Non-pipelined: time between the first and fourth instructions is 3 × 8 = 24 ns
Pipelined: time between the first and fourth instructions is 3 × 2 = 6 ns
The ideal speedup, which is also the upper bound, is the number of stages in the pipeline
18
Pipeline Performance
Pipelining increases instruction throughput but does not reduce the execution time of an individual instruction
The execution time of an individual instruction in a pipeline can be longer due to:
  Additional pipeline control compared to non-pipelined execution
  Imbalance among the different pipeline stages
Suppose we execute 100 instructions:
  Single-cycle machine: 45 ns/cycle × 1 CPI × 100 inst = 4500 ns
  Multi-cycle machine: 10 ns/cycle × 4.2 CPI (due to inst mix) × 100 inst = 4200 ns
  Ideal 5-stage pipelined machine: 10 ns/cycle × (1 CPI × 100 inst + 4 cycle drain) = 1040 ns
Due to the fill and drain effects of a pipeline, ideal performance can be achieved only for long (>> 2 × pipeline_depth) instruction streams
Example: a sequence of 1000 load instructions would take 5000 cycles on a multi-cycle machine but 1004 cycles on a pipelined machine, so speedup = 5000/1004 ≈ 5
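The arithmetic on this slide can be checked with a small script. The cycle times, the CPI of 4.2, and the pipeline depth come from the slide; the function names are hypothetical, and the pipelined model assumes no stalls.

```python
def single_cycle_ns(n_instr, cycle_ns=45):
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * 1 * n_instr


def multi_cycle_ns(n_instr, cycle_ns=10, cpi=4.2):
    """Short cycles, several per instruction depending on the mix."""
    return cycle_ns * cpi * n_instr


def pipelined_ns(n_instr, cycle_ns=10, depth=5):
    """One instruction completes per cycle, plus depth - 1 extra cycles."""
    return cycle_ns * (n_instr + depth - 1)


def load_speedup(n_loads, depth=5):
    """Multi-cycle cycles over pipelined cycles for a stream of loads."""
    return (n_loads * depth) / (n_loads + depth - 1)


print(single_cycle_ns(100), multi_cycle_ns(100), pipelined_ns(100))
print(round(load_speedup(1000), 2))
print(round(load_speedup(10), 2))
```

The last line shows the fill and drain effect: for a stream of only 10 loads the speedup falls well short of the pipeline depth, whereas for 1000 loads it is close to 5.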
19
5 Steps of MIPS Datapath: Data stationary control
[Figure: pipelined MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back.]
Data stationary control: local decode for each instruction phase / pipeline stage
20
Pipelining is not quite that easy!
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away)
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
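A data hazard of the kind described above can be spotted by comparing each instruction's destination register with the source registers of the instructions that follow it closely. The sketch below is illustrative only: instructions are hypothetical `(dest, sources)` pairs, and the default `window` of 2 assumes a five-stage pipeline without forwarding in which the register file is written in the first half of a cycle and read in the second half.

```python
def raw_hazards(program, window=2):
    """Find read-after-write dependences that fall inside the hazard window.

    program: list of (dest, sources) pairs; dest may be None for
    instructions that write no register.
    Returns (producer_index, consumer_index, register) tuples.
    """
    hazards = []
    for i, (dest, _) in enumerate(program):
        if dest is None:
            continue
        last = min(i + 1 + window, len(program))
        for j in range(i + 1, last):
            if dest in program[j][1]:
                hazards.append((i, j, dest))
    return hazards


program = [
    ("r1", ("r2", "r3")),   # produces r1
    ("r4", ("r1", "r5")),   # reads r1 one instruction later
    ("r6", ("r1", "r7")),   # reads r1 two instructions later
    ("r8", ("r1", "r9")),   # reads r1 three instructions later
]
print(raw_hazards(program))
```

Under the stated assumption, the third consumer is far enough behind the producer that the value is already in the register file when it is read, so it is not reported.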
21
One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1 to 7 with instructions in program order: Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after its predecessor.]
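With a single memory port, a conflict arises whenever one instruction reads data memory in the same cycle in which another is being fetched. The following sketch locates those cycles for the sequence in the diagram. It assumes one instruction is issued per cycle starting at cycle 1, that DMem is the fourth stage, and that only instructions named "Load" access data memory; the function name is hypothetical.

```python
def memory_port_conflicts(program):
    """Return (cycle, load_index, fetch_index) for each shared-port conflict.

    program: instruction names in program order; instruction k (0-based)
    is fetched in cycle k + 1 and reaches DMem three cycles later.
    """
    conflicts = []
    for i, name in enumerate(program):
        if name != "Load":
            continue                      # only loads use the port in DMem here
        dmem_cycle = (i + 1) + 3          # fetch cycle plus three more stages
        fetch_index = dmem_cycle - 1      # instruction being fetched that cycle
        if fetch_index < len(program):
            conflicts.append((dmem_cycle, i, fetch_index))
    return conflicts


program = ["Load", "Instr 1", "Instr 2", "Instr 3", "Instr 4"]
for cycle, load, fetch in memory_port_conflicts(program):
    print(f"cycle {cycle}: {program[load]} (DMem) vs {program[fetch]} (Ifetch)")
```

For this sequence the only conflict is in cycle 4, where the Load's data access collides with the fetch of Instr 3.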