Automobile Manufacturing 1. Build frame. 60 min. 2. Add engine. 50 min. 3. Build body. 80 min. 4. Paint. 40 min. 5. Finish.45 min. 275 min. Latency: Time.

Automobile Manufacturing (6.1)
1. Build frame: 60 min.
2. Add engine: 50 min.
3. Build body: 80 min.
4. Paint: 40 min.
5. Finish: 45 min.
Total: 275 min.
Latency: Time from start to finish for one car (smaller is better). 275 minutes per car.
Throughput: Number of finished cars per time unit (larger is better). 1 car/275 min = 0.218 cars/hour.
Issue: How can we make the process better by adding more workers?
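The two metrics above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one car is built at a time with no overlap between cars; the names `STAGE_MINUTES`, `latency_min`, and `throughput_per_hour` are illustrative and do not come from the slides.

```python
# Stage times in minutes, taken from the slide.
STAGE_MINUTES = {"frame": 60, "engine": 50, "body": 80, "paint": 40, "finish": 45}

def latency_min(stages):
    """Time from start to finish for one car when the stages run back to back."""
    return sum(stages.values())

def throughput_per_hour(stages):
    """Cars finished per hour when only one car is in the shop at a time."""
    return 60 / latency_min(stages)

print(latency_min(STAGE_MINUTES))                    # 275
print(round(throughput_per_hour(STAGE_MINUTES), 3))  # 0.218
```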

An Assembly Line (6.1)
[Figure: cars 1-4 moving through the five stations (60, 50, 80, 40, 45 min) along a time axis, with every station paced at 80 min.]
First two stages can’t produce faster than one car/80 min, or a backlog will occur at the third stage.
Last two stages only receive one car/80 min to work on.
Latency: 400 min/car
Throughput: 4 cars/640 min (1 car/160 min). Will approach 1 car/80 min as time goes on.
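The figures on this slide follow from pacing every station at the slowest stage. The sketch below reproduces them under that assumption; `line_stats` is a hypothetical helper name, not something from the slides.

```python
STAGE_MINUTES = [60, 50, 80, 40, 45]

def line_stats(stage_minutes, n_cars):
    """Return (latency, total time, average minutes per car) for an assembly line.

    Assumes every station advances in lock step at the pace of the slowest one.
    """
    step = max(stage_minutes)               # 80 min: the slowest station sets the pace
    latency = step * len(stage_minutes)     # each car spends one step in every station
    total = latency + (n_cars - 1) * step   # after the first car, one car finishes per step
    return latency, total, total / n_cars

print(line_stats(STAGE_MINUTES, 4))      # (400, 640, 160.0)
print(line_stats(STAGE_MINUTES, 1000))   # (400, 80320, 80.32)
```

With more cars the average time per car gets closer to 80 min, which is the "will approach 1 car/80 min" remark on the slide.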

Applying Assembly Lines to CPUs (6.1)
The single-cycle design did everything “at once”.
Can we break the single-cycle design up into stages?
Issues: Car assembly works well. Will it be as easy to apply the same technique to a CPU?

Breaking up the Single-Cycle Datapath (6.2)
[Figure: the single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory) divided into five stages.]
Stages from the multi-cycle design:
1. Instr. Fetch, PC=PC+4
2. Instr. Decode, Register Fetch
3. Execute, Address Calc.
4. Memory
5. Reg. Write-back

The Key - Pipeline Registers (6.2)
[Figure: the same five-stage datapath with clocked pipeline registers inserted between the stages. Labels: clock, PC+4.]

Example: R-type Instruction (6.2)
[Figure: an R-type instruction traced through the pipelined datapath. Label: PC+4.]
Writes the correct data to the wrong register: by the time the result reaches write-back, the write register number on the register file input belongs to a later instruction that is being decoded.
In general, arrows that go backwards across pipeline stages may be bad news...

Correcting the Write Register Problem (6.2)
[Figure: the pipelined datapath with the Rt [20-16] / Rd [15-11] write-register selection moved so that the register number passes through the pipeline registers along with the instruction. Label: PC+4.]

Assembly-line Control Signals (6.1)
In an assembly line, the manufacturing instructions can be attached to the car. The instructions then move along with the car.
[Figure: cars at stations 1-5, each carrying its own work order. Work orders shown: "F: Standard, E: 135 HP, B: 2-door, P: Green"; "F: Leather, E: 190 HP, B: 4-door, P: Blue"; "F: Cotton, B: 2-door, P: Lavender"; "F: Leather, P: Green"; "F: Vinyl"; "F: Leather".]
By separating the control signals by stages, only the signals needed for the current stage must be decoded. All signals for later stages must be passed along.

The Pipelined Control Logic (6.3)
[Figure: the pipelined datapath with a Control unit decoding Op [31-26]. Control signals are grouped by stage and carried in the pipeline registers: E, M, W after decode; M, W after execute; W after memory.]
Control signals shown: ALUOp (to ALU control), ALUSrc, RegDest, Branch, PCSrc, MemRead, MemWrite, MemToReg, RegWrite.

How’d we do? Compared to Single-cycle
5 stages --> Potentially 5x speedup
    Not likely:
    Stages won’t all be equally long
    Pipeline registers will cause some delays
Latency --> Greater than in single-cycle design
More complexity, but nicely divided up

Example 1
Consider executing the following code
    add $3, $4, $5
    and $6, $7, $8
    sub $9, $10, $11
on
i) A single-cycle machine with a cycle time of 200 ns
ii) A 5-stage pipeline machine with a cycle time of 50 ns
Which one runs faster?
What if there were 100 instructions instead of 3?
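One way to work Example 1 is sketched below. It assumes the pipeline never stalls, which holds for the three instructions shown because no instruction reads a register written by an earlier one; for the 100-instruction case that is an added assumption. The function names are illustrative.

```python
def single_cycle_ns(n_instr, cycle_ns=200):
    """Every instruction takes one long cycle."""
    return n_instr * cycle_ns

def pipelined_ns(n_instr, stages=5, cycle_ns=50):
    """The first instruction takes `stages` cycles; each later one finishes a cycle after."""
    return (stages + n_instr - 1) * cycle_ns

for n in (3, 100):
    print(n, single_cycle_ns(n), pipelined_ns(n))
# 3 600 350
# 100 20000 5200
```

Under these assumptions the pipelined machine is faster in both cases, and its advantage grows toward the 4x ratio of the cycle times as the instruction count increases.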

Analyzing Pipelines (6.4)
Cycle              1   2   3   4   5   6   7   8   9
ADD $10, $14, $0   IF  RF  EX  M   WB
SUB $12, $13, $2       IF  RF  EX  M   WB
AND $1, $6, $11            IF  RF  EX  M   WB
SW  $3, 200($9)                IF  RF  EX  M   WB
OR  $9, $13, $7                    IF  RF  EX  M   WB

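Data Hazards (6.4)
Cycle              1   2   3   4   5   6   7   8   9
ADD $13, $14, $0   IF  RF  EX  M   WB                   Writes register $13
SUB $12, $13, $2       IF  RF  EX  M   WB               Reads wrong $13
AND $1, $6, $13            IF  RF  EX  M   WB           Reads wrong $13
SW  $3, 200($13)               IF  RF  EX  M   WB       Reads ? $13
OR  $9, $13, $7                    IF  RF  EX  M   WB   Reads correct $13
SUB and AND read $13 in RF before the ADD reaches WB. SW reads $13 in the same cycle in which the ADD writes it, so what it sees depends on the register file.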

Preventing Data Hazards (6.4)
    ADD $13, $14, $0
    NOP
    NOP
    SUB $12, $13, $2
    AND $1, $6, $13
    SW  $3, 200($13)
    OR  $9, $13, $7
Insert NOPs into the instruction stream to allow WB to happen before RF.
Assume we can’t write a register and read the new value in the same cycle.
[Pipeline diagram: ADD followed by the NOPs, with the RF stage of SUB delayed until after the WB stage of ADD.]

Detecting Hazards (6.5)
    ADD $13, $14, $0      Write: $13
    SUB $12, $13, $2      Read A: $13
    AND $1, $6, $13       Read B: $13
    SW  $3, 200($13)      Read A: $13
    OR  $9, $13, $7
Check each instruction as it is being decoded (RF-ID stage). If it reads a register that will be written by any instruction ahead of it (in the EX, M, or WB stages), there is a hazard.
Compare write reg # in EX with read reg # in RF
Compare write reg # in M with read reg # in RF
Compare write reg # in WB with read reg # in RF
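The three comparisons can be sketched as a single check. This is an illustration only: the function name, the argument names, and the use of `None` for "this stage holds an instruction that writes no register" are assumptions, not the slides' hardware.

```python
def detect_hazard(read_regs, write_reg_ex, write_reg_m, write_reg_wb):
    """Return True if the instruction in RF reads a register still waiting to be written.

    read_regs: register numbers read by the instruction being decoded.
    write_reg_*: destination register of the instruction in that stage, or None.
    """
    pending = {r for r in (write_reg_ex, write_reg_m, write_reg_wb) if r is not None}
    return any(r in pending for r in read_regs)

# SUB $12, $13, $2 is in RF while ADD $13, $14, $0 is in EX.
print(detect_hazard([13, 2], 13, None, None))   # True

# OR $9, $13, $7 is in RF; SW (no write) is in EX, AND ($1) in M, SUB ($12) in WB.
print(detect_hazard([13, 7], None, 1, 12))      # False
```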

Stalling with Bubbles (6.5)
    ADD $13, $14, $0
    SUB $12, $13, $2
    AND $1, $6, $13
    SW  $3, 200($13)
    OR  $9, $13, $7
Stalling: Kill the current execution by “neutralizing” all the control signals so that it won’t write any registers. Don’t write PC+4 into PC --> Stay at the current instruction and try again.
[Pipeline diagram: ADD proceeds normally; SUB repeats IF three times while the register-number comparisons (=) detect the hazard, then SUB, AND, SW, and OR continue through the pipeline.]

Register Forwarding (6.6)
    ADD $13, $14, $0
    SUB $12, $13, $2
    AND $1, $6, $13
    SW  $3, 200($13)
    OR  $9, $13, $2
Register $13’s value is computed in the EX stage of the ADD even though it isn’t written into the register file until the WB stage.
--> The pipeline register following the EX stage holds the value of $13 that’s needed in the SUB instruction’s EX stage.
[Pipeline diagram: ADD, SUB, AND, SW, and OR run without stalls, with the value of $13 forwarded from the ADD.]
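A forwarding decision for one ALU operand might look like the sketch below. The slide only mentions the pipeline register after EX; the second path from the register after M, the priority order, and all names here are assumptions added for illustration.

```python
def forward_source(read_reg, ex_m_write_reg, m_wb_write_reg):
    """Pick where an EX-stage operand should come from.

    ex_m_write_reg / m_wb_write_reg: destination register held in the pipeline
    register after EX / after M, or None if that instruction writes nothing.
    """
    if ex_m_write_reg is not None and read_reg == ex_m_write_reg:
        return "EX/M"      # newest value: result of the instruction one ahead
    if m_wb_write_reg is not None and read_reg == m_wb_write_reg:
        return "M/WB"      # result of the instruction two ahead
    return "REGFILE"       # no pending write: the value read in RF is current

# SUB $12, $13, $2 in EX, ADD $13 one stage ahead.
print(forward_source(13, 13, None))   # EX/M
# AND $1, $6, $13 in EX, SUB ($12) one ahead, ADD ($13) two ahead.
print(forward_source(13, 12, 13))     # M/WB
print(forward_source(6, 12, 13))      # REGFILE
```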

Unforwardable Loads (6.6)
    LW  $2, 30($2)
    AND $1, $2, $13
    SW  $3, 200($2)
    OR  $9, $2, $1
Loads don’t have the value to write back until the Memory stage. This is one stage too late for the next instruction.
--> We can’t prevent stalls if the instruction following a Load uses the result of the Load.
[Pipeline diagram: LW followed by AND, SW, and OR, with the AND delayed until the loaded value is available.]

Example 2
Consider executing the following code on a 5-stage pipeline datapath
    add $3, $4, $5
    lw  $7, 100($3)
    sub $8, $7, $9
1. Identify any potential data dependencies
2. How many cycles will it take to execute this code assuming no register forwarding?
3. How many cycles will it take to execute this code assuming register forwarding is available?
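One way to count the cycles in Example 2 is sketched below. It assumes the five stages IF, RF, EX, M, WB, that a register cannot be written and read in the same cycle (as stated on the "Preventing Data Hazards" slide), and that with forwarding an ALU result is usable by the next instruction while a loaded value costs one stall. The tuple encoding of instructions and the function name are illustrative.

```python
def total_cycles(program, forwarding):
    """program: list of (dest_reg, source_regs, is_load). Returns cycles to finish."""
    last_writer = {}   # register -> (IF cycle of the instruction writing it, is_load)
    prev_if = 0
    for dest, sources, is_load in program:
        start = prev_if + 1                      # in-order: fetch after the previous one
        for reg in sources:
            if reg in last_writer:
                w_if, w_is_load = last_writer[reg]
                if not forwarding:
                    gap = 4                      # RF must come after the writer's WB
                elif w_is_load:
                    gap = 2                      # EX must come after the load's M
                else:
                    gap = 1                      # EX right after the writer's EX
                start = max(start, w_if + gap)
        if dest is not None:
            last_writer[dest] = (start, is_load)
        prev_if = start
    return prev_if + 4                           # RF, EX, M, WB follow the last IF

example2 = [
    (3, [4, 5], False),   # add $3, $4, $5
    (7, [3], True),       # lw  $7, 100($3)
    (8, [7, 9], False),   # sub $8, $7, $9
]
print(total_cycles(example2, forwarding=False))  # 13
print(total_cycles(example2, forwarding=True))   # 8
```

The dependencies are add to lw on $3 and lw to sub on $7. If the register file could be written and read in the same cycle, the no-forwarding count would drop (a gap of 3 instead of 4).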

Branch Hazards (6.7)
      BEQ $2, $1, SKIP
      AND $1, $2, $13
      SW  $3, 200($2)
      OR  $9, $2, $4
      ADD $3, $2, $5
SKIP: LW  $2, 32($4)
Don’t know result of branch until the end of the M stage.
If the branch is taken, we’ve blown it by executing the intervening instructions.
[Pipeline diagram: BEQ followed immediately by AND, SW, and OR, which enter the pipeline before the branch result is known; LW is fetched afterwards.]

Solution 1: Stall (6.5)
      BEQ $2, $1, SKIP
      AND $1, $2, $13
      SW  $3, 200($2)
      OR  $9, $2, $4
      ADD $3, $2, $5
SKIP: LW  $2, 32($4)
[Pipeline diagram, branch not taken: AND repeats IF three times until BEQ has finished its M stage, then AND, SW, OR, and ADD proceed.]
Stalling always solves the problem.
If we didn’t have so many branches in programs, it would not be a problem.

Solution 2: Assume not Taken (6.7)
      BEQ $2, $1, SKIP
      AND $1, $2, $13
      SW  $3, 200($2)
      OR  $9, $2, $4
      ADD $3, $2, $5
SKIP: LW  $2, 32($4)
If we guess right, we win --> No stall at all
If we guessed wrong,
1. We have to undo all that we did (fortunately, no writebacks have occurred yet).
2. We still take all the time of a stall
[Pipeline diagram, branch is taken: AND, SW, and OR enter the pipeline behind BEQ and must be undone if the branch is taken; LW is then fetched from SKIP.]

Solution 3: Better Prediction (6.7)
Predict that the branch goes the same way as the last time
    Works great for loops
    Works great for “special-case” code
Need to keep track of the information for each branch, though...
    One or two bits will do
    Keep a small table of recently used branches and which way they went
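The table-based scheme can be sketched in software as follows. With `bits=1` it predicts exactly "same way as last time". Treating "two bits" as a saturating counter is an assumption about what the slide intends, and the class name, the dictionary used as the table, and the initial "not taken" state are all illustrative; real hardware would use a small fixed-size table.

```python
class BranchPredictor:
    """Per-branch saturating counters, keyed by the branch's address."""

    def __init__(self, bits=1):
        self.max_count = (1 << bits) - 1
        self.table = {}                       # branch address -> counter

    def predict(self, pc):
        """Predict taken when the counter is in its upper half."""
        return self.table.get(pc, 0) > self.max_count // 2

    def update(self, pc, taken):
        """Move the counter toward the actual outcome, saturating at both ends."""
        count = self.table.get(pc, 0)
        self.table[pc] = min(count + 1, self.max_count) if taken else max(count - 1, 0)


def count_mispredictions(predictor, pc, outcomes):
    """Run a sequence of actual outcomes for one branch and count wrong guesses."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A loop branch at a made-up address: taken 9 times, then falls through.
loop = [True] * 9 + [False]
print(count_mispredictions(BranchPredictor(bits=1), 0x40, loop))  # 2
```

The 1-bit predictor misses only on the first and last iterations of the loop, which is why the slide says the approach works well for loops.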

Solution 4: Delayed Branches (6.7)
Original code:
      XOR $1, $3, $3
      ADD $2, $3, $4
      SUB $4, $3, $1
      OR  $3, $2, $0
      BEQ $10, $11, SKIP
      LW  $4, 60($2)
SKIP: AND $1, $2, $3
If we had some warning, we could compute the branch ahead of time...
With a delayed branch:
      XOR $1, $3, $3
      Branch-After-Three-EQ $10, $11, SKIP
      ADD $2, $3, $4
      SUB $4, $3, $1
      OR  $3, $2, $0
      LW  $4, 60($2)
SKIP: AND $1, $2, $3
ADD, SUB, and OR fill the 3 delay slots. These instructions are always executed. Branch can’t depend on them...

3-slot Delayed Branch (6.7)
      Branch-After-Three-EQ $10, $11, SKIP
      ADD $2, $3, $4
      SUB $4, $3, $1
      OR  $3, $2, $0
      LW  $4, 60($2)
SKIP: AND $1, $2, $3
Cycle        1   2   3   4   5   6   7   8   9
B3E          IF  RF  EX  M   WB
ADD              IF  RF  EX  M   WB
SUB                  IF  RF  EX  M   WB
OR                       IF  RF  EX  M   WB
LW or AND                    IF  RF  EX  M   WB

Branch summary
Two decent solutions:
Branch prediction
    Requires more hardware
    Used in modern microprocessors
Delayed branch
    Requires special software manipulation
    Often doesn’t deliver its promise
    Used often in CPUs 4-10 years ago

Example 3
Consider executing the following code
LOOP: add $3, $4, $5
      and $6, $7, $8
      bne $12, $8, LOOP
on
i) A single-cycle machine with a cycle time of 200 ns
ii) A 5-stage pipeline machine with a cycle time of 50 ns
A. Assume the loop executes 10 times
B. Assume the loop executes 100 times
C. Assume the loop executes 1000 times
Which one runs faster?
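A possible way to work Example 3 is sketched below. The example does not say how the pipeline handles the branch, so the sketch assumes Solution 1 (stall): each `bne` delays the following instruction by 3 cycles because the outcome is known only at the end of the M stage. The stall after the final `bne` is not counted, since nothing in the measured code follows it. The function names and the `stall_per_branch` parameter are illustrative.

```python
def single_cycle_ns(iterations, body_instrs=3, cycle_ns=200):
    """Every instruction of every iteration takes one long cycle."""
    return iterations * body_instrs * cycle_ns

def pipelined_ns(iterations, body_instrs=3, stall_per_branch=3, stages=5, cycle_ns=50):
    """Fill time + one cycle per instruction + stalls after all but the last branch."""
    instrs = iterations * body_instrs
    cycles = (stages - 1) + instrs + (iterations - 1) * stall_per_branch
    return cycles * cycle_ns

for n in (10, 100, 1000):
    print(n, single_cycle_ns(n), pipelined_ns(n))
# 10 6000 3050
# 100 60000 30050
# 1000 600000 300050
```

Under these assumptions the pipeline is about twice as fast in all three cases; setting `stall_per_branch=0` shows the bound that perfect branch prediction would approach.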

Example 4
Consider executing the following code on a 5-stage pipeline datapath
           addi $3, $0, 10
LOOPSTART: lw   $5, ARRAY($3)
           addi $5, $5, 1
           sw   $5, ARRAY
           addi $3, $3, -1
           bne  $3, $0, LOOPSTART
           add  $3, $5, $6
           sub  $7, $8, $9
           addi $4, $6, 3
1. Identify potential data dependencies
2. How many cycles will it take to execute this code?
    A. With nops/stalls
    B. With branch prediction assuming branch not taken
    C. With branch prediction based on one previous result
