Download presentation
Presentation is loading. Please wait.
1
EECE476: Computer Architecture Lecture 22: Zero-cycle Branches (no text) Superpipelining (no text) vs. Superscalar (text 6.8) The University of British ColumbiaEECE 476© 2005 Guy Lemieux
2
2 Jumps and Unconditional Branches BTB: Branch-target buffer –Tells us where a branch will jump to BPB: Branch-prediction buffer –Tells us if branch will be taken Consider J and (unconditional) BR –Always takes the branch (prediction unnecessary) ADD$t3, $t1,$t2 JLABEL … LABEL:SUBI$t3, $t3, 1 –This ADD always knows a “J is coming after” (the J is always at PC+4) –Target of JMP/BR is known (not from a register, JR) –Recipe for zero-cycle branches!
3
3 Zero-cycle Branches? ADD$t3, $t1,$t2 JLABEL … LABEL:SUBI$t3, $t3, 1 Use BTB for non-branch instruction (eg, ADD) –Any instruction immediately before a J or BR BTB reserves entry for ADD, Target: “LABEL” of J, Tag: PC of ADD –ADD now “looks like” a branch/jump When executing ADD –BTB says “always fetch from target LABEL” –Requires small change to datapath (BTB can also select next PC, not just BPB) Do not have to fetch JMP/BR itself! –Branch executed in zero cycles!
4
4 Zero-cycle Branch: Limits What if target comes from a register? –BTB holds useless value, usually wrong What if branch is conditional (eg, BEQ)? –Two paths: taken and untaken –Do not know which path is correct until after executing BEQ Actually need to fetch & execute BEQ! To determine Rs and Rt, and do comparison Cannot do in zero cycles? Can conditional branches ever take zero cycles? –YES! –But I’ll let you figure out how…
5
Superpipelining
6
6 Pipeline Trends Slowest stages in classic 5-stage pipeline –Instruction and data memory accesses –CPUs get faster much more quickly than memory –Memory accesses continue to be the bottleneck in computer architecture for last 10-15 years –Instruction and Data Memory replaced with faster caches A cache is a small, fast on-chip memory –Keeps a local copy of data from main memory French: cache means HIDE Idea: cache memory is hidden from your program (transparent) –Discuss details later..
7
7 Superpipelining 5-stage pipeline is “classical” –MIPS –Intel 486 has 5-stage pipeline First Intel CPU with on-chip cache Superpipelining –More pipeline stages Basic Idea: faster clock speeds –Do less work per clock cycle –Still complete 1 instruction per cycle
8
8 MIPS R4000 Superpipeline 5 stages: I, D, X, M, W I stage: read memory M stage: read memory –Fast caches are still too slow 8 stages: IF, IS, D, X, DF, DS, DC, W –Approx 2x clock speed of 5-stage pipeline Split “I” stage in two –IF “I First” –IS “I Second” Split “M” stage in three –DF “D First” –DS “D Second” –DC “D CheckTag”
9
9 MIPS R4000 CheckTag Stage CheckTag Stage –Cache is similar to BTB Contains a TAG specifying the memory address for the data it is holding –Access data cache Must check TAG to verify we got the correct data –CheckTag takes 1 extra clock cycle! –If CheckTag fails pipeline must stall get data from actual data memory (10-100+ clock cycles) MIPS R4000 is very aggressive –Forwarding Units take data out of “DS” stage (can’t take from DF) –If CheckTag fails, it BACKS UP the pipeline 1 cycle (hard to do!)
10
10 Superpipelining Limits Data Hazards –More forwarding Eg, X forwarding from DF, DS, DC, and WB stages –More pipeline stalls CheckTag failure causes stall Load-Use Penalty: 2 cycles –Load instruction: 2 clock cycles (DF, DS) –Use instruction: must wait for load to finish Insert 2 instructions between Load and Use Can use NOP If no instructions, pipeline will stall
11
11 Importance of Branch Prediction Branch-Delay Penalty –Branch in “D” stage Two more instructions are being fetched (IF, IS) Two branch delay slots! –Next version of superpipeline… May have 3 branch delay slots? Not a good idea! Need BRANCH PREDICTION –MIPS R4000 Total branch delay: 3 cycles 1 delay slot (historical), followed by 2 cycles static branch prediction (predict-untaken)
12
Superscalar
13
13 Superscalar Basic Idea –Why execute only 1 instruction in a clock cycle? –How about 2 instructions per cycle? Tempting to begin calling it IPC (instructions per cycle) –IPC = 1 / CPI –Compare “IPC” to “MIPS” … both are rates Stick to CPI for this course: ExecutionTime = InstructionCount * CPI * ClockPeriod Ideal CPI = 0.5 in this case
14
14 Static Superscalar Find 2 instructions every clock cycle! –Pair them up when writing assembly code –Called Static Superscalar Compiler does the work ahead of time –Given two instructions, CPU just executes them Instructions must be independent If hard to find independent instruction, use NOP –Compiler looks for “eligible” pairs Automagically avoid dependences between instruction pairs Not much brains in CPU…
15
15 Static Superscalar: Need to Double All Resources? Need to double everything? –Need 2 Instruction Memories? –Need 2 Register Files (4 read ports, 2 write ports) ? –Need 2 ALUs ? –Need 2 Data Memories? Too much overhead, not usually done –Just 1 Instruction Memory with 2 x 32bit outputs (8 bytes) –Just 1 ALU –Just 1 Data Memory (need partial ALU to compute address) –Need bigger register file (4 read ports, 2 write ports) Practical limits imposed to use fewer resources –Only combine 1 ALU instruction + 1 Memory instruction Cannot combine 2 ALU instructions or 2 Memory instructions –Align all instructions in pairs in the instruction memory PC%8==0 for ALU instructions, PC%8==4 for memory instructions
16
16 Static Superscalar
17
17 Pipeline Diagram for Superscalar Two instructions per cycle 1aALU or BRIDXMW 1bLD or STIDXMW 2aALU or BRIDXMW 2bLD or STIDXMW 3aALU or BRIDXMW 3bLD or STIDXMW 4aALU or BRIDXMW 4bLD or STIDXMW 5aALU or BRIDXMW 5bLD or STIDXMW
18
18 Code Scheduling for Superscalar Example Loop:lw$t0, 0($s1) addi$s1,$s1,-4 addu$t0,$t0,$s2 sw$t0, 4($s1) bne$s1,$zero, Loop Regular pipeline: 5 cycles per iteration (assuming no delay slots) int *p; for( ; p != 0; p-- ) { *p = *p + CONST; }
19
19 Code Scheduling for Superscalar Loop:lw$t0, 0($s1) addi$s1,$s1,-4 addu$t0,$t0,$s2 sw$t0, 4($s1) bne$s1,$zero,Loop LABELALU/BR INSTRLD/ST INSTRCycle Loop:LW $t0,0($s1)1 ADDI $s1,$s1,-42 ADDU $t0,$t0,$s23 BNE $s1,$zero,LoopSW $t0,4($s1)4 Blank table entries are NOPS. Load-use delay prevents ADDU being earlier. Effective CPI is 0.8, not 0.5!
20
20 Code Scheduling for Superscalar The compiler can further improve CPI Loop unrolling –Example: unroll previous code 4 times (# iterations multiple of 4) –Execute new body ¼ number of iterations LABELALU/BR INSTRLD/ST INSTRCycle Loop:LW $t0, 0($s1)1 LW $t1, -4($s1)2 ADDU $t0,$t0,$s2LW $t2, -8($s1)3 ADDU $t1,$t1,$s2LW $t3,-12($s1)4 ADDU $t2,$t2,$s2SW $t0, 0($s1)5 ADDU $t3,$t3,$s2 SW $t1, -4($s1)6 ADDI $s1,$s1,-16SW $t2, -8($s1)7 BNE $s1,$zero,LoopSW $t3, 4($s1)8
21
21 Code Scheduling for Superscalar Unroll loop 4 times –More registers used –Some BNE/ADDI instructions are gone CPI Improved –Before Unrolling: 0.8 –After Unrolling 8/14 = 0.57 InstrCount Improved –Before Unrolling: 20/4 = 5 per iteration –After Unrolling: 14/4 = 3.5 per iteration We don’t get this with superpipelining! Overall Performance –Pipelined: 5 cycles / iteration –Superscalar before unrolling: 4 cycles / iteration –Superscalar after unrolling: 2 cycles / iteration Superpipelined: 2.0x faster than pipelined! Superscalar unrolled: 2.5x faster than pipelined! LABELALU/BR INSTRLD/ST INSTRCycle Loop:LW $t0, 0($s1)1 LW $t1, -4($s1)2 ADDU $t0,$t0,$s2LW $t2, -8($s1)3 ADDU $t1,$t1,$s2LW $t3,-12($s1)4 ADDU $t2,$t2,$s2SW $t0, 0($s1)5 ADDU $t3,$t3,$s2 SW $t1, -4($s1)6 ADDI $s1,$s1,-16SW $t2, -8($s1)7 BNE $s1,$zero,LoopSW $t3, 4($s1)8
22
22 Importance of Branch Prediction Now fetching two instructions every cycle –Given a branch: Which two instructions to fetch: Taken or Not-Taken path? Misprediction? –Many lost opportunities to execute instructions –Significant performance loss! Branch prediction CRUCIAL!
23
23 Superpipelining vs. Superscalar Which is better? –Debate lasted a few years in mid-1990s Result: both won! –Can combine superpipelining and superscalar Branch prediction is now crucial! –6 instructions enter pipeline after a branch x3 from superpipelining x2 from superscalar Superscalar can be enhanced further –Rely less upon compiler –Hardware finds instructions to pair together More hazard detection, etc –Dynamic superscalar (next class!)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.