Abstraction Question
General-purpose processors have an abstraction layer fixed at the ISA, and their designers have little control over the compilers or code run on the machine. Embedded processors tend to have complete control over the code run, the compiler (if any), the ISA, and the hardware, so breaking the abstraction layer makes sense. The bottom line becomes who controls which elements of the design.

Intel side note
We’ve been doing MIPS. How did Intel do this for a CISC x86 ISA? Micro-ops, from the Pentium Pro on.

Branch Hazards, or “Which way did he go?”
Dr. Leo Porter
Read 5.1, 5.2
Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Control Dependence
Just as an instruction will be dependent on other instructions to provide its operands (data dependence), it will also be dependent on other instructions to determine whether it gets executed or not (control dependence or branch dependence). Control dependences are particularly critical with conditional branches.
    add $5, $3, $2
    sub $6, $5, $2
    beq $6, $7, somewhere
    and $9, $6, $1
    ...
somewhere:
    or $10, $5, $2
    add $12, $11, $9
    ...

Dealing With Branch Hazards
Hardware
– stall until you know which direction
– reduce the hazard through earlier computation of the branch direction
– guess which direction
   – assume not taken (easiest)
   – make a more educated guess based on history (requires that you know it is a branch before it is even decoded!)
Hardware/Software
– nops, or instructions that get executed either way (delayed branch)

Stalling the pipeline
Given our current pipeline, let’s assume we stall until we know the branch outcome. How many cycles will you lose per branch?
Selection   Cycles
A           0
B           1
C           2
D           3
E           4

Stalling for Branch Hazards
    beq $4, $0, there
    and $12, $2, $5
    or ...
    add ...
    sw ...
[Pipeline diagram over CC1 to CC8: bubbles follow the beq.]

Stalling for Branch Hazards
Seems wasteful, particularly when the branch isn’t taken. Makes all branches cost 4 cycles. What if we just assume branches aren’t taken? Why?

Assume Branch Not Taken
    beq $4, $0, there
    and $12, $2, $5
    or ...
    add ...
    sw ...
[Pipeline diagram over CC1 to CC8.]
Works pretty well when you’re right.

Assume Branch Not Taken
    beq $4, $0, there
    and $12, $2, $5
    or ...
    add ...
there:
    sub $12, $4, $2
[Pipeline diagram over CC1 to CC8: the wrong-path instructions after the beq are flushed.]
Same performance as stalling when you’re wrong.

Stalling the pipeline
Let’s improve the pipeline so we move branch resolution to Decode. How many cycles would we lose then on a taken branch?
Selection   Cycles
A           0
B           1
C           2
D           3
E           4
[Note: add drawing of resolving in Decode.]

The Pipeline with flushing for taken branches
Notice the IF/ID flush line added.

Branch Hazards – Assume Not Taken
Great if most of your branches aren’t taken. What about loops, which are taken 95% of the time?
– we would like the option of assuming not taken for some branches, and taken for others, depending on ???

Branch Hazards – Predicting Taken
    beq $2, $1, here
here:
    lw
[Pipeline diagram over CC1 to CC8 (stages IM, Reg, ALU, DM, Reg).]
Reading quiz: what information is required to predict taken?
1. That an instruction is a branch, before decode
2. The target of the branch
3. The outcome of the branch
Selection   Required knowledge
A           2, 3
B           1, 2, 3
C           1, 2
D           2
E           None of the above

Branch Target Buffer
Keeps track of the PCs of recently seen branches and their targets. Consult during Fetch (in parallel with the Instruction Memory read) to determine:
– Is this a branch?
– If so, what is the target?
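The lookup described above can be sketched in a few lines of Python. This is only an illustration: the direct-mapped organization, the entry count, the index function, and the names `BranchTargetBuffer`, `lookup`, and `update` are assumptions made for the example, not details from the lecture.

```python
class BranchTargetBuffer:
    """Hypothetical direct-mapped BTB: each entry holds (branch PC, target PC)."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, pc):
        # Assumption: instructions are 4 bytes, so drop the low 2 bits of the PC.
        return (pc >> 2) % self.num_entries

    def lookup(self, pc):
        """Consulted during Fetch: return the target if pc is a known branch, else None."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, target):
        """Record a branch and its target once the branch has been resolved."""
        self.entries[self._index(pc)] = (pc, target)


btb = BranchTargetBuffer(num_entries=4)
first_lookup = btb.lookup(0x400)    # not seen yet: treated as "not a branch"
btb.update(0x400, 0x480)
second_lookup = btb.lookup(0x400)   # now answers both questions: a branch, and its target
```

A miss answers "is this a branch?" with "not that we know of", so Fetch simply continues with PC + 4. Two branches that map to the same index evict each other in this sketch.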

Branch Hazards – Predict Taken
Static policy:
– Forward branches (if statements): predict not taken
– Backward branches (loops): predict taken
Dynamic prediction (coming soon)
First: Branch Delay Slots

Eliminating the Branch Stall
There’s no rule that says we have to see the effect of the branch immediately. Why not wait an extra instruction before branching? The original SPARC and MIPS processors each used a single branch delay slot to eliminate single-cycle stalls after branches. The instruction after a conditional branch is always executed in those machines, regardless of whether the branch is taken or not!

Branch Delay Slot
    beq $4, $0, there
    and $12, $2, $5
there:
    or ...
    add ...
    sw ...
[Pipeline diagram over CC1 to CC8.]
The branch delay slot instruction (the next instruction after a branch) is executed even if the branch is taken.

Filling the branch delay slot
1   add $5, $3, $7
2   add $9, $1, $3
3   sub $6, $1, $4
4   and $7, $8, $2
5   beq $6, $7, there
    nop   /* branch delay slot */
6   add $9, $1, $2
7   sub $2, $9, $5
    ...
there:
8   mult $2, $10, $11
    …

Instruction   Can it fill the delay slot?
1             No: R7 (WAR)
2             Safe, $1 and $3 are fine
3             No: R6
4             No: R7
6             Not safe ($9 on nt path)
7             Not safe (needs $9 not yet produced)
8             Not safe ($2 is used before overwritten)

Selection   Safe instructions
A           1, 2
B           2, 6
C           6, 8
D           1, 2, 7, 8
E           None of the above
E is the correct answer.
* It is not safe to assume anything about the … code

Filling the branch delay slot
The branch delay slot is only useful if you can find something to put there. If you can’t find anything, you must put a nop there to ensure correctness.

Branch Delay Slots
This works great for this implementation of the architecture, but becomes a permanent part of the ISA. What about the MIPS R10000, which has a 5-cycle branch penalty and executes 4 instructions per cycle? Bottom line: the delay slot exposed a detail of the hardware implementation to the ISA.

Dynamic Branch Prediction
Can we guess the outcome of branches? What should we base that guess on?

Branch Prediction
[Diagram: the program counter indexes a table of 1-bit entries (1, 0, 1).]
    for (i=0;i<10;i++) {
        ...
    }
    ...
    add $i, $i, #1
    beq $i, #10, loop
Accuracy? Too easily swayed.

Two-bit predictors give better loop prediction
    for (i=0;i<10;i++) {
        ...
    }
    ...
    add $i, $i, #1
    beq $i, #10, loop
[Diagram: the branch address indexes a PHT of 2-bit counters (example entry: 00).]
State   Meaning
11      Strongly Taken
10      Weakly Taken
01      Weakly Not Taken
00      Strongly Not Taken
Increment when taken; decrement when not taken.
Better (less sway), but slower learning.
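To see the "less sway" claim in numbers, here is a minimal simulation sketch. It makes several assumptions that are not in the slides: the loop branch is modeled as taken 9 times and then not taken once, the whole loop is run 100 times, the branch has a private counter, and the counters start at 1 (1-bit) and 10 (2-bit). The function name `simulate` is made up for the example.

```python
def simulate(outcomes, bits, initial):
    """Return the prediction accuracy of one saturating counter.

    outcomes: list of booleans, True means the branch was taken.
    bits: counter width; the prediction is "taken" in the upper half of the range.
    """
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)
    state = initial
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= threshold
        if predicted_taken == taken:
            correct += 1
        # Increment when taken, decrement when not taken, saturating at both ends.
        if taken:
            state = min(state + 1, max_state)
        else:
            state = max(state - 1, 0)
    return correct / len(outcomes)


# Assumed behavior of the loop branch: 9 taken, then 1 not taken, repeated 100 times.
loop_branch = ([True] * 9 + [False]) * 100
one_bit_accuracy = simulate(loop_branch, bits=1, initial=1)
two_bit_accuracy = simulate(loop_branch, bits=2, initial=0b10)
print(one_bit_accuracy, two_bit_accuracy)
```

Under these assumptions the 1-bit counter mispredicts twice per run of the loop after the first (once at loop exit and once on re-entry), while the 2-bit counter mispredicts only at loop exit.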

Suppose we have the following branch patterns for 3 branches (A, B, C). What is the accuracy of a 1-bit and a 2-bit Branch History Table? Assume initial values of 1 (1-bit) and 10 (2-bit).
State   Meaning
11      Strongly Taken
10      Weakly Taken
01      Weakly Not Taken
00      Strongly Not Taken
Increment when taken; decrement when not taken.
Branch   Pattern (evaluate once with 1-bit entries, once with 2-bit entries)
A        T T T T N
B        T T N T N
C        N T N T N
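One way to check an answer to this exercise is to run the three patterns through a small script. The sketch below assumes that each branch has its own table entry (no aliasing between A, B, and C) and that every entry starts at the stated initial value; the names `PATTERNS`, `count_correct`, and `table_accuracy` are invented for the example. The numbers it produces are the output of this sketch under those assumptions, not an answer key from the lecture.

```python
PATTERNS = {
    "A": "TTTTN",
    "B": "TTNTN",
    "C": "NTNTN",
}


def count_correct(pattern, bits, initial):
    """Count correct predictions for one branch with a private saturating counter."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)
    state = initial
    correct = 0
    for outcome in pattern:
        taken = outcome == "T"
        if (state >= threshold) == taken:
            correct += 1
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return correct


def table_accuracy(patterns, bits, initial):
    """Return (correct predictions per branch, overall accuracy)."""
    per_branch = {name: count_correct(p, bits, initial) for name, p in patterns.items()}
    total = sum(len(p) for p in patterns.values())
    return per_branch, sum(per_branch.values()) / total


one_bit = table_accuracy(PATTERNS, bits=1, initial=1)
two_bit = table_accuracy(PATTERNS, bits=2, initial=0b10)
print(one_bit)
print(two_bit)
```

With these assumptions the sketch gets 6 of 15 predictions right with 1-bit entries and 7 of 15 with 2-bit entries. The difference comes from branch B, where the 2-bit counter survives a single not-taken outcome without flipping its prediction.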

Modern Branch Prediction - Pentium 4
Performance is dependent on accurate branch prediction.
20-stage pipeline, 3-way issue
– 60 instructions in flight (12 branches)
– 17th stage is branch resolution
– ~17*3 = 51 instructions lost on a mispredict
[Diagram: pipeline stages 1 through 20.]

Branch Prediction
The latest branch predictors are significantly more sophisticated, using more advanced correlating techniques, larger structures, and even AI techniques. They use patterns of branches (local history) and recent other branch history (global history) to make predictions.

Putting it all together
For a given program on our 5-stage MIPS pipeline processor:
– 20% of instructions are loads; 50% of the instructions following a load are arithmetic instructions that depend on the load
– 20% of instructions are branches
– Using dynamic branch prediction, we achieve 80% prediction accuracy
What is the CPI of your program?
Selection   CPI
A           0.76
B           0.9
C           1.0
D           1.1
E           1.14
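The arithmetic behind this question can be laid out as a short calculation. The sketch assumes a base CPI of 1, a 1-cycle stall for each load followed by a dependent instruction, and a 1-cycle penalty per mispredicted branch (branch resolved in Decode). Those penalty values are assumptions about "our pipeline", and `pipeline_cpi` is a made-up helper name.

```python
def pipeline_cpi(load_frac, load_use_frac, branch_frac, accuracy,
                 load_use_stall=1, mispredict_penalty=1, base_cpi=1.0):
    """CPI = base + load-use stall cycles + branch mispredict cycles, per instruction."""
    load_stalls = load_frac * load_use_frac * load_use_stall
    branch_stalls = branch_frac * (1.0 - accuracy) * mispredict_penalty
    return base_cpi + load_stalls + branch_stalls


# Numbers from the question; the 1-cycle penalties are the assumptions stated above.
cpi = pipeline_cpi(load_frac=0.20, load_use_frac=0.50, branch_frac=0.20, accuracy=0.80)
print(round(cpi, 2))
```

Under these assumptions the loads contribute 0.2 * 0.5 = 0.10 stall cycles per instruction and the branches contribute 0.2 * 0.2 = 0.04, for a CPI of about 1.14. A longer mispredict penalty would raise the second term proportionally.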

Control Hazards -- Key Points
Control (or branch) hazards arise because we must fetch the next instruction before we know if we are branching or where we are branching. Control hazards are detected in hardware. We can reduce the impact of control hazards through:
– early detection of branch address and condition
– branch prediction
– branch delay slots

Given our 5-stage MIPS pipeline, what is the steady-state CPI for the following code? Assume the branch is taken thousands of times. Recall: a processor is in steady state when all stages are active.
Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
Selection   CPI
A           1
B           1.25
C           1.5
D           1.75
E           None of the above
Steady-State CPI = (#insts + #stalls + #flushed_insts) / #insts
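The steady-state formula is easy to evaluate once the stalls and flushes per loop iteration have been counted. The sketch below only encodes the formula. The example counts (1 stall and 1 flushed instruction per 4-instruction iteration) are an assumption that corresponds to a pipeline with forwarding, predict-not-taken, and branches resolved in ID; they are not presented here as the lecture's answer. `steady_state_cpi` is a hypothetical helper name.

```python
def steady_state_cpi(num_insts, num_stalls, num_flushed):
    """Steady-State CPI = (#insts + #stalls + #flushed_insts) / #insts."""
    return (num_insts + num_stalls + num_flushed) / num_insts


# Assumed per-iteration counts for the 4-instruction loop:
#   1 stall   - beq needs r5 in ID while sub is still in EX
#   1 flushed - the instruction fetched after the taken beq
example_cpi = steady_state_cpi(num_insts=4, num_stalls=1, num_flushed=1)
print(example_cpi)
```

Different pipeline assumptions change the stall and flush counts, and with them the result, so it is worth diagramming the loop before plugging numbers in.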

Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor:
IF = 200ps   ID = 100ps   EX = 100ps   M = 200ps   WB = 100ps
Consider splitting IF and M into 2 stages each (so IF1, IF2 and M1, M2). The most important code run by the company is (assume the branch is taken most of the time):
Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.
Selection (CPI, CT): A. Increase  B. Decrease  C. Increase  D. Decrease  E. Increase, No Change
Isomorphic

Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor:
IF = 200ps   ID = 100ps   EX = 200ps   M = 200ps   WB = 100ps
Consider splitting IF and M into 2 stages each (so IF1, IF2 and M1, M2). The most important code run by the company is (assume the branch is taken most of the time):
Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.
Selection (CPI, CT): A. Increase  B. Decrease  C. Increase  D. Decrease  E. Increase, No Change
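The cycle-time half of these two questions comes down to finding the slowest stage. The sketch below assumes that a split stage divides its latency evenly between its two halves and that pipeline registers add no overhead; neither assumption is stated in the slides. `split_stages` and `cycle_time` are made-up helper names.

```python
def split_stages(stages, to_split):
    """Return a new stage table with each stage in to_split cut into two equal halves."""
    result = {}
    for name, latency_ps in stages.items():
        if name in to_split:
            # Assumption: the split is perfectly even, e.g. IF (200ps) -> IF1, IF2 (100ps each).
            result[name + "1"] = latency_ps / 2
            result[name + "2"] = latency_ps / 2
        else:
            result[name] = latency_ps
    return result


def cycle_time(stages):
    """The clock must accommodate the slowest stage."""
    return max(stages.values())


variant_ex_100 = {"IF": 200, "ID": 100, "EX": 100, "M": 200, "WB": 100}
variant_ex_200 = {"IF": 200, "ID": 100, "EX": 200, "M": 200, "WB": 100}

for stages in (variant_ex_100, variant_ex_200):
    deeper = split_stages(stages, {"IF", "M"})
    print(len(stages), cycle_time(stages), "->", len(deeper), cycle_time(deeper))
```

Under these assumptions the cycle time drops from 200ps to 100ps when EX takes 100ps, but stays at 200ps when EX takes 200ps, because EX is then the slowest stage and was not split. The CPI side of the question still has to be worked out by diagramming the loop on the 7-stage pipeline.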

7-stage Pipeline
Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
For diagramming the code in the last slide. Drive home: more stages = more hazards.

Pipelining -- Key Points
– Pipelining focuses on improving instruction throughput, not individual instruction latency.
– Data hazards can be handled by hardware or software, but most modern processors have hardware support for stalling and forwarding.
– Control hazards can be handled by hardware or software, but most modern processors use Branch Target Buffers and advanced dynamic branch prediction to reduce the hazard.
ET = IC * CPI * CT
