Pipelining sections A.1-A.8


1 Pipelining sections A.1-A.8
CMPE382 Fall 2002 Pipelining sections A.1-A.8 Based on Hennessy & Patterson CA:AQA3e Adapted from Patterson’s CS252 lecture slides at Berkeley with additions from Culler and Amaral

2 Midterm Friday MEC 2-1 2 aid sheets
CMPE382 / EE710a2 Computer Architecture First Midterm Exam Professor: Duncan Elliott Date Friday, October 4, 2002, 12:00 MEC 2-1 Department of Electrical and Computer Engineering University of Alberta Last office hours today (extended until 4pm)

3 Midterm Instructions:
Attempt all questions. Write your answers directly on the question sheets. (If necessary, use the back of a page or use an answer booklet and refer to the booklet on the question paper.) Students are responsible for writing their name and student ID number in ink on the cover of this exam, writing their student ID number on all pages, and turning in all pages of their examination in the correct order and suitably fastened together. A calculator and 2 aid sheets (both sides) are the only permissible aids. The aid sheets must be in the student’s own handwriting (no photocopies or printouts), may be no larger than 21.59x27.94 cm paper, and may contain any information. If you feel that information is missing from a question, clearly state any assumptions you had to make to answer the question. The use of unauthorized personal listening, communication, recording, photographic and/or computational devices is strictly prohibited. Such devices must be turned off and stowed.

4 What is Pipelining? Pipelining is a key implementation technique used
to build fast processors. It allows the execution of multiple instructions to overlap in time. A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment. The throughput of an instruction pipeline is the measure of how often an instruction exits the pipeline.

5 [Figure: multicycle MIPS datapath with PC, memory, instruction register, register file, sign extension, shift-left-2, ALU, and multiplexers; the individual wire labels are not recoverable as text.]

6 5 Steps of MIPS Datapath
[Figure: single-cycle MIPS datapath laid out as five phases: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back, with Next PC logic, adder, register file, ALU, sign extend, data memory, and muxes.]

7 5 Steps of MIPS Datapath: Data stationary control
[Figure: the same five-phase datapath with pipeline latches IF/ID, ID/EX, EX/MEM, and MEM/WB between stages.]
Data stationary control: local decode for each instruction phase / pipeline stage.

8 Steps to Execute Each Instruction Type

9 Pipeline Stages We can divide the execution of an instruction
into the following 5 “classic” stages: IF: Instruction Fetch ID: Instruction Decode, register fetch EX: Execution MEM: Memory Access WB: Register write Back

10 Pipeline Throughput and Latency
IF ID EX MEM WB 5 ns 4 ns 10 ns Consider the pipeline above with the indicated delays. We want to determine the pipeline throughput and the pipeline latency. Pipeline throughput: instructions completed per second. Pipeline latency: how long it takes to execute a single instruction in the pipeline.

11 Pipeline Throughput and Latency
IF ID EX MEM WB 5 ns 4 ns 10 ns Pipeline throughput: how often an instruction is completed. Pipeline latency: how long does it take to execute an instruction in the pipeline. Is this right?

12 Pipeline Throughput and Latency
IF ID EX MEM WB 5 ns 4 ns 10 ns Simply adding the latencies to compute the pipeline latency would only work for an isolated instruction: L(I1) = 28 ns, L(I2) = 33 ns, L(I3) = 38 ns, L(I4) = 43 ns. We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
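
The growing latencies above can be reproduced with a small sketch. The model is an assumption, not something the slide states: each new instruction enters 5 ns (the IF delay) behind the previous one, each takes the isolated 28 ns, and latency is measured from I1's start.

```python
# Hypothetical model of the unbalanced pipeline on this slide.
# Assumed stage delays (ns): IF=5, ID=4, EX=5, MEM=10, WB=4, summing to 28.
STAGE_DELAYS = [5, 4, 5, 10, 4]

def completion_time(k):
    """Time from t=0 at which instruction k (1-based) leaves the pipeline,
    assuming one instruction enters every 5 ns (the IF delay)."""
    return STAGE_DELAYS[0] * (k - 1) + sum(STAGE_DELAYS)

print([completion_time(k) for k in range(1, 5)])  # -> [28, 33, 38, 43]
```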

13 Synchronous Pipeline Throughput and Latency
IF ID EX MEM WB 5 ns 4 ns 10 ns The slowest pipeline stage also limits the latency!! With every stage stretched to the 10 ns of MEM, each of I1-I4 spends 50 ns in the pipeline: L(I1) = L(I2) = L(I3) = L(I4) = 50 ns.

14 Pipeline Throughput and Latency
IF ID EX MEM WB 5 ns 4 ns 10 ns How long does it take to execute (issue) instructions in this pipeline? (disregard latency, bubbles caused by branches, cache misses, hazards) How long would it take using the same modules without pipelining?

15 Pipeline Throughput and Latency
IF ID EX MEM WB 5 ns 4 ns 10 ns Thus the speedup that we got from the pipeline is: How can we improve this pipeline design? We need to reduce the imbalance to increase the clock speed.
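
The elided speedup expression can be sketched as a ratio of time per instruction. The per-stage delays below are assumptions chosen to match the 28 ns total and 10 ns clock used on the neighboring slides:

```python
delays = [5, 4, 5, 10, 4]        # assumed IF, ID, EX, MEM, WB delays in ns
time_unpipelined = sum(delays)   # 28 ns per instruction without pipelining
clock = max(delays)              # 10 ns: the slowest stage sets the clock
speedup = time_unpipelined / clock
print(speedup)  # -> 2.8
```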

16 Pipeline Throughput and Latency
IF ID EX MEM1 MEM2 WB 5 ns 4 ns 5 ns 5 ns 5 ns 4 ns Now we have one more pipeline stage, but the delay of the slowest stage has been cut in half. The new latency for a single instruction is:

17 Pipeline Throughput and Latency
IF ID EX MEM1 MEM2 WB 5 ns 4 ns 5 ns 5 ns 5 ns 4 ns
[Pipeline diagram: instructions I1-I7 each flow through IF, ID, EX, MEM1, MEM2, WB, entering one 5 ns clock apart.]

18 Pipeline Throughput and Latency
IF ID EX MEM1 MEM2 WB 5 ns 4 ns 5 ns 5 ns 5 ns 4 ns How long does it take to execute instructions in this pipeline? (disregard bubbles caused by branches, cache misses, etc, for now) Thus the speedup that we get from the pipeline is:

19 Pipeline Throughput and Latency
IF ID EX MEM1 MEM2 WB 5 ns 4 ns 5 ns 5 ns 5 ns 4 ns What have we learned from this example? 1. It is important to balance the delays in the stages of the pipeline. 2. The throughput of a pipeline is 1/max(delay). 3. The latency is N × max(delay), where N is the number of stages in the pipeline.
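
The lessons above can be captured in a minimal sketch; the helper names are mine, and the stage delays for the 5- and 6-stage pipelines are the ones used on the preceding slides:

```python
def throughput(delays):
    """Steady-state instructions per ns of a synchronous pipeline."""
    return 1 / max(delays)

def latency(delays):
    """ns for one instruction: N stages times the slowest stage."""
    return len(delays) * max(delays)

five_stage = [5, 4, 5, 10, 4]    # IF ID EX MEM WB
six_stage = [5, 4, 5, 5, 5, 4]   # MEM split into MEM1 and MEM2
print(latency(five_stage), latency(six_stage))         # -> 50 30
print(throughput(six_stage) / throughput(five_stage))  # -> 2.0
```

Splitting the 10 ns MEM stage doubles the throughput while also cutting the per-instruction latency from 50 ns to 30 ns.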

20 Pipelines with Stalls H&P p.A-12

21 Pipelining Lessons (laundry analogy: tasks A-D on a 6 PM-9 PM timeline, stages of 30, 40, and 20 minutes)
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. The pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously. Potential speedup = number of pipe stages. Unbalanced lengths of pipe stages reduce the speedup. Time to "fill" and "drain" the pipeline reduces the speedup.

22 Computer Pipelines Execute billions of instructions, so throughput is what matters Desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores

23 Visualizing Pipelining
[Figure: five instructions in program order, each passing through Ifetch, Reg, ALU, DMem, Reg; across clock cycles 1-7 a new instruction starts every cycle.]

24 It's Not That Easy for Computers
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: Hardware cannot support this combination of instructions (single person to fold and put clothes away) Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

25 One Memory Port/Structural Hazards
[Figure: Load followed by Instr 1-4, one cycle apart. In cycle 4 the Load's DMem access and Instr 3's Ifetch both need the single memory port: a structural hazard.]

26 One Memory Port/Structural Hazards
[Figure: the same sequence with the hazard resolved by stalling: a bubble delays Instr 3's Ifetch by one cycle.]

27 Data Hazard on R1 add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9
Time (clock cycles): IF ID/RF EX MEM WB
[Figure: add r1,r2,r3 writes r1 in WB, but the following sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, and xor r10,r1,r11 read r1 in ID before that write completes: RAW hazards on r1.]

28 Three Generic Data Hazards
Read After Write (RAW) Error if InstrJ tries to read operand before InstrI writes it Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. I: add r1,r2,r3 J: sub r4,r1,r3

29 Three Generic Data Hazards
Write After Read (WAR) Error if InstrJ tries to write operand before InstrI reads it Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”. I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7

30 Three Generic Data Hazards
Write After Write (WAW) Error if InstrJ tries to write operand before InstrI writes it. Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7

31 Three Generic Data Hazards
WAW Can’t happen in MIPS 5-stage integer pipeline because: All instructions take 5 stages, and Writes are always in stage 5 What about floating point?
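The three hazard classes on the preceding slides can be checked mechanically. This sketch is mine (the tuple encoding is an assumption), using the register pairs from the examples above:

```python
def hazards(instr_i, instr_j):
    """instr = (dest, set_of_sources); instr_i precedes instr_j."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    found = set()
    if dest_i in srcs_j:
        found.add("RAW")   # J reads what I writes (true dependence)
    if dest_j in srcs_i:
        found.add("WAR")   # J writes what I reads (anti-dependence)
    if dest_j == dest_i:
        found.add("WAW")   # J writes what I writes (output dependence)
    return found

print(hazards(("r1", {"r2", "r3"}), ("r4", {"r1", "r3"})))  # add/sub -> {'RAW'}
print(hazards(("r4", {"r1", "r3"}), ("r1", {"r2", "r3"})))  # sub/add -> {'WAR'}
print(hazards(("r1", {"r4", "r3"}), ("r1", {"r2", "r3"})))  # sub/add -> {'WAW'}
```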

32 Midterm 1 Good work Denominator 44 with bonus, 38, recorded as 33
Average 23.5/33 = 71%. Check that all questions are marked and the total is correct. Friday's homework: H&P A1.a, A3.

33 Cumulative Distribution

34 Mid Q1 Solutions 1a. 1.58^t = 4, so t = log_1.58(4) = ln(4)/ln(1.58) ≈ 3.0 years. 1b. Instruction latency = 1 => no pipelining; therefore, this machine can be expected to be slow.

35 Q2 2a. on board – never again
2c. So long as reads have no side-effects, they can safely be reordered, so there is no need to force a stall to place reads in the correct order.

36 Q 2b
Original sequence:
LD R1 ← 42(R0)
DADD R3 ← R1,R2
XOR R3 ← R1,R4
SD 42(R0) ← R3
DSUB R3 ← R4,R5
AND R1 ← R7,R3
Delete the DADD (its result is overwritten before use). To avoid register conflicts, rename the first R1 → R11 and the first R3 → R13, then alternate the two "streams":
LD R11 ← 42(R0)
DSUB R3 ← R4,R5
XOR R13 ← R11,R4
AND R1 ← R7,R3
SD 42(R0) ← R13

37 Q 3 Amdahl’s Law Fraction enhanced = 45%
CPI=1 for all instructions => execution time fraction = dynamic fraction 3a. =1.203 3b. =2.053 3c. =1.818
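
Amdahl's Law behind these answers can be sketched as follows. The enhancement speedups for 3a and 3b are not on the slide; the 1.6 below is an inference that reproduces 3a's 1.203, and 3c's 1.818 matches the limit of an unbounded enhancement.

```python
def amdahl(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of execution time is sped up."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(amdahl(0.45, 1.6), 3))    # -> 1.203  (matches 3a if speedup = 1.6)
print(round(amdahl(0.45, 1e12), 3))   # -> 1.818  (3c: the limit 1/(1 - 0.45))
```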

38 Q4 Bonus – 2 people received marks
# 2-instruction loop
max: .dword
start: LD R1 ← max(R0)
loop: BNE R1,R0,loop
      DADDI R1 ← R1,-1
# optimized empty loop is no loop
# j=
# if you like: use the delay slot, 64-bit arithmetic, 64-bit constants

39 Forwarding to Avoid RAW Data Hazards
[Figure: the add/sub/and/or/xor sequence again, now with forwarding paths from the EX/MEM and MEM/WB latches back to the ALU inputs, so the dependent instructions no longer stall.]

40 HW Change for Forwarding

41 Data Hazard Even with Forwarding
[Figure: lw r1,0(r2) followed by sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9. The loaded value is not available until the end of MEM, one cycle too late to forward to the sub's EX stage.] MIPS actually didn't interlock on this case: Microprocessor without Interlocked Pipeline Stages.

42 Data Hazard Even with Forwarding
[Figure: the same sequence with the required load-delay bubble: the sub's EX is stalled one cycle so the forwarded load result arrives in time.]

43 Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d, e, and f are in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
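
The benefit of the rescheduling can be counted: with forwarding, only a load immediately followed by its consumer stalls (one bubble). The instruction encoding below is mine, a sketch rather than a real scheduler:

```python
def load_use_stalls(code):
    """code: list of (opcode, dest, sources). Counts one-cycle load-use
    stalls, assuming full forwarding as in the preceding slides."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls

slow = [("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
        ("SW", "a", ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
        ("SUB", "Rd", ("Re", "Rf")), ("SW", "d", ("Rd",))]
fast = [("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
        ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
        ("SW", "a", ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", "d", ("Rd",))]
print(load_use_stalls(slow), load_use_stalls(fast))  # -> 2 0
```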

44 Why is the fast code faster?

45 Mental Exercise Do forwarding under software control
Assign special register names to previous, previous2 ALU result Reduces register file size by 2, unfortunately Saves hardware Software (compiler) is cheap Good or bad idea? What are some of the costs?

46 Control Hazard on Branches Three Stage Stall
[Figure: 10: beq r1,r3,36 followed by 14: and r2,r3,r5; 18: or r6,r1,r7; 22: add r8,r1,r9. The three instructions fetched after the branch are squashed when the branch resolves, and fetch resumes at 36: xor r10,r1,r11, a three-cycle stall.]

47 Example: Branch Stall Impact
If 30% of instructions are branches, a stall of 3 cycles is significant. Two-part solution: determine whether the branch is taken or not sooner, AND compute the taken-branch address earlier. MIPS branches test whether two registers are equal or not equal. MIPS R1000 solution: move the zero test to the ID/RF stage, and add an adder to calculate the new PC in the ID/RF stage. Result: a 1-clock-cycle penalty for branches instead of 3.
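
The payoff can be checked with the slide's own numbers; the base CPI of 1 is an assumption:

```python
branch_freq = 0.30   # 30% of instructions are branches
base_cpi = 1.0       # assumed ideal CPI with no branch stalls

cpi_stall3 = base_cpi + branch_freq * 3   # zero test in EX: 3-cycle penalty
cpi_stall1 = base_cpi + branch_freq * 1   # zero test moved to ID/RF: 1 cycle
print(cpi_stall3, cpi_stall1)  # -> 1.9 1.3
```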

48 Pipelined MIPS R1000 Datapath
[Figure: pipelined MIPS datapath with the zero test and a second adder for the branch target moved into the ID/RF stage; pipeline latches IF/ID, ID/EX, EX/MEM, MEM/WB; data stationary control with local decode for each instruction phase / pipeline stage.]

49 Branch Hazard Alternatives
#1: Stall until branch direction is clear #2: Static Predict Branch Not Taken Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction #3: Static Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome

50 Branch Hazard Alternatives
#4: Delayed Branch. Define the branch to take place AFTER a following instruction:
branch instruction
sequential successor_1
…
sequential successor_n
branch target if taken
This is a branch delay of length n. A 1-slot delay allows a proper decision and the branch-target address in a 5-stage pipeline. MIPS R1000 uses this; others carry the convention forward.

51 Delayed Branch Where to get instructions to fill branch delay slot?
Before branch instruction From the target address: only valuable when branch taken From fall through: only valuable when branch not taken Canceling (squashing) branches allow more slots to be filled A canceling branch only executes instructions in the delay slot if the direction predicted is correct.

52 Fig. 3.28

53 Delayed Branch Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots. About 80% of instructions executed in branch delay slots are useful in computation. So about 50% (60% × 80%) of slots are usefully filled. Delayed-branch downside: with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), a single delay slot no longer hides the branch latency.

54 Static (compiler) branch prediction
Use static prediction, squash speculatively executed instructions if necessary Branch-likely, Branch-not-likely instructions BEQL, BNEL “obsolete” in MIPS R4000 BEQ, BNE already predicted “not-taken” Compiler can predict some branch outcomes Profiling with representative data can predict more outcomes But, data may change over time ...

55 Dynamic Branch Prediction
e.g. 256 instructions on the fly (8-issue x 32 deep) Huge cost to mispredict Need to do better than 56% Predict based on past history Speculatively execute most-likely code (branch taken / not-taken) squash write-back or commit stage if branch target guessed wrong
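
One common "predict based on past history" mechanism is a 2-bit saturating counter per branch. The slide does not specify the scheme, so this is a generic sketch, not the course's design:

```python
class TwoBitPredictor:
    """2-bit saturating counter for a single branch.
    States 0-1 predict not-taken; states 2-3 predict taken."""
    def __init__(self):
        self.state = 0

    def predict(self):
        return self.state >= 2   # True means "predict taken"

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
correct = 0
for taken in [True, True, True, False, True, True]:  # loop-like history
    correct += (p.predict() == taken)
    p.update(taken)
print(correct)  # -> 3: two warm-up misses plus one for the single not-taken
```

The hysteresis is the point: after warming up, one stray not-taken outcome costs a single misprediction instead of flipping the prediction for the next iteration too.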

56 Mismatched Pipeline Lengths
FP pipelines are often a different length from the integer pipeline. Different instructions have different throughputs (initiation intervals), e.g. FP add throughput = 1, FP divide throughput = 1/25 cycles. Look out for WAW, WAR, and RAW hazards: instructions issued in order can now complete out of order. Hardware (or rarely, the compiler) must monitor and avoid hazards. Multiple instructions may want to perform WB in a single cycle: a new structural hazard. What if an earlier instruction never completes?

57 Interrupts and Exceptions
Need to resume (carry on) after an exception At what program counter value? Between instructions IO device request Invoke OS, tracing, breakpoints Stall until all previously issued instructions have completed Resume with instruction at next PC Within an instruction ALU overflow, NaN, etc. Memory page fault or violation, unsupported misaligned access Power failure interrupt Similar: Mispredicted speculative execution (branches) Recovery gets tense if out of order execution is permitted

58 Recovery from Exceptions Within Instructions
68000 – page fault not supported in processor Buy a second processor and run it ~1 instruction behind the first. If the primary processor encounters an exception, interrupt the second processor and recover clean state from it. 68012, 68020, 68030 take a snapshot of processor internal state, including pipeline (~40 words in addition to documented user state), write state out to stack Harder to recover because instructions have multiple destinations (e.g. autoincrement address mode) Restart instructions from the middle

59 Recovery from Exceptions Within Instructions
Keep an in-order register-value history “history file” restore register values from history pretend out-of-order completion of instructions never happened restart instructions from first one that did not complete “Future file” Instruction results saved in future file and not committed to “real” registers until previous instructions complete Force in-order completion easy in 5-stage pipeline, omit memory-write and write-back (single destination per instruction) performance hit in complex pipelines

60 Imprecise Exceptions No simple return-from-interrupt instruction
Keep a hardware list of instructions that have completed. Simulate completion of the missing instructions in OS software. Return to a PC where all remaining instructions have yet to be executed. Makes debugging a nightmare. A synchronize instruction stalls until all issued instructions can no longer generate an exception.

61 MIPS R4400 “superpipelined”, 8 clock cycles instead of 5
2+ cycles for instruction fetch 3 cycles for data memory load/store Result of load speculatively available pending cache tag check FP latency cycles

62 Scoreboard for out-of-order execution
In-order instruction issue Bypass instructions that must stall before operand fetch, allowing another instruction to get started Execute out of order Scoreboard keeps record of data dependencies between instructions controls safe issue controls when result can safely be written No ability to eliminate dependencies through register renaming

