EECE476: Computer Architecture Lecture 22: Zero-cycle Branches (no text) Superpipelining (no text) vs. Superscalar (text 6.8) The University of British.

Presentation transcript:

EECE476: Computer Architecture
Lecture 22: Zero-cycle Branches (no text), Superpipelining (no text) vs. Superscalar (text 6.8)
The University of British Columbia, EECE 476, © 2005 Guy Lemieux

2 Jumps and Unconditional Branches
BTB: Branch-target buffer
  – Tells us where a branch will jump to
BPB: Branch-prediction buffer
  – Tells us if a branch will be taken
Consider J and (unconditional) BR
  – Always takes the branch (prediction unnecessary)

            ADD   $t3, $t1, $t2
            J     LABEL
            …
    LABEL:  SUBI  $t3, $t3, 1

  – This ADD always knows a “J is coming after” (the J is always at PC+4)
  – Target of J/BR is known (it does not come from a register, as with JR)
  – Recipe for zero-cycle branches!

3 Zero-cycle Branches?

            ADD   $t3, $t1, $t2
            J     LABEL
            …
    LABEL:  SUBI  $t3, $t3, 1

Use BTB for a non-branch instruction (e.g., ADD)
  – Any instruction immediately before a J or BR
BTB reserves an entry for ADD
  – Target: “LABEL” of J
  – Tag: PC of ADD
  – ADD now “looks like” a branch/jump
When executing ADD
  – BTB says “always fetch from target LABEL”
  – Requires small change to datapath (BTB can also select next PC, not just BPB)
Do not have to fetch the J/BR itself!
  – Branch executed in zero cycles!
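The BTB trick above can be sketched in a few lines of Python. This is a minimal illustration and not the lecture's datapath: the addresses, the dict-based BTB, and the names `next_pc` and `fetch_trace` are all assumptions made for the example.

```python
# Hypothetical model of next-PC selection with a branch-target buffer (BTB).
# The addresses and the dict-based BTB are illustrative assumptions.

ADD_PC = 0x100       # ADD  $t3, $t1, $t2
J_PC = ADD_PC + 4    # J    LABEL (always directly after the ADD)
LABEL = 0x200        # SUBI $t3, $t3, 1

# The entry is tagged with the PC of the ADD, not the PC of the J.
btb = {ADD_PC: LABEL}


def next_pc(pc, btb):
    """Return the address to fetch after the instruction at `pc`."""
    # A BTB hit overrides the default sequential fetch (PC + 4).
    return btb.get(pc, pc + 4)


def fetch_trace(start_pc, btb, count):
    """List the first `count` fetch addresses starting at `start_pc`."""
    trace = []
    pc = start_pc
    for _ in range(count):
        trace.append(pc)
        pc = next_pc(pc, btb)
    return trace
```

With the entry in place, the fetch stream goes from the ADD straight to LABEL and the J is never fetched. With an empty BTB, the J at PC+4 is fetched as usual.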

4 Zero-cycle Branch: Limits
What if target comes from a register?
  – BTB holds useless value, usually wrong
What if branch is conditional (e.g., BEQ)?
  – Two paths: taken and untaken
  – Do not know which path is correct until after executing BEQ
      – Actually need to fetch & execute BEQ!
      – To determine Rs and Rt, and do comparison
      – Cannot do in zero cycles?
Can conditional branches ever take zero cycles?
  – YES!
  – But I’ll let you figure out how…

Superpipelining

6 Pipeline Trends
Slowest stages in classic 5-stage pipeline
  – Instruction and data memory accesses
  – CPUs get faster much more quickly than memory
  – Memory accesses have remained the bottleneck in computer architecture for years
  – Instruction and Data Memory replaced with faster caches
A cache is a small, fast on-chip memory
  – Keeps a local copy of data from main memory
      – French: cache means HIDE
      – Idea: cache memory is hidden from your program (transparent)
  – Discuss details later…

7 Superpipelining
5-stage pipeline is “classical”
  – MIPS
  – Intel 486 has 5-stage pipeline
      – First Intel CPU with on-chip cache
Superpipelining
  – More pipeline stages
Basic Idea: faster clock speeds
  – Do less work per clock cycle
  – Still complete 1 instruction per cycle

8 MIPS R4000 Superpipeline
5 stages: I, D, X, M, W
  – I stage: read memory
  – M stage: read memory
  – Fast caches are still too slow
8 stages: IF, IS, D, X, DF, DS, DC, W
  – Approx 2x clock speed of 5-stage pipeline
Split “I” stage in two
  – IF “I First”
  – IS “I Second”
Split “M” stage in three
  – DF “D First”
  – DS “D Second”
  – DC “D CheckTag”

9 MIPS R4000 CheckTag Stage
CheckTag Stage
  – Cache is similar to BTB
      – Contains a TAG specifying the memory address for the data it is holding
  – Access data cache
      – Must check TAG to verify we got the correct data
  – CheckTag takes 1 extra clock cycle!
  – If CheckTag fails
      – Pipeline must stall
      – Get data from actual data memory ( clock cycles)
MIPS R4000 is very aggressive
  – Forwarding Units take data out of “DS” stage (can’t take from DF)
  – If CheckTag fails, it BACKS UP the pipeline 1 cycle (hard to do!)
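The tag check described above can be illustrated with a toy direct-mapped cache in Python. This is a sketch under stated assumptions, not a model of the actual R4000 cache: the sizes `NUM_LINES` and `LINE_BYTES` and the function names are invented for the example.

```python
# Toy direct-mapped cache, used only to illustrate the tag check.
# NUM_LINES and LINE_BYTES are assumed values, not R4000 parameters.

NUM_LINES = 4
LINE_BYTES = 4


def split_address(addr):
    """Split a byte address into (index, tag)."""
    index = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // (LINE_BYTES * NUM_LINES)
    return index, tag


def cache_read(cache, addr):
    """Return (data, hit) for a read of `addr`.

    The data is read out first (like the DF/DS stages); the tag
    comparison (like the DC stage) then says whether it was correct.
    """
    index, tag = split_address(addr)
    stored_tag, data = cache[index]
    hit = stored_tag == tag
    return (data if hit else None), hit


def cache_fill(cache, addr, data):
    """Install `data` for `addr`, replacing whatever the line held."""
    index, tag = split_address(addr)
    cache[index] = (tag, data)
```

Two addresses that share an index but differ in tag compete for the same line, which is why data read from the cache cannot be trusted until the tag comparison has finished.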

10 Superpipelining Limits
Data Hazards
  – More forwarding
      – E.g., X forwarding from DF, DS, DC, and WB stages
  – More pipeline stalls
      – CheckTag failure causes stall
Load-Use Penalty: 2 cycles
  – Load instruction: 2 clock cycles (DF, DS)
  – Use instruction: must wait for load to finish
      – Insert 2 instructions between Load and Use
      – Can use NOP
      – If no instructions, pipeline will stall
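As a rough worked example of the load-use penalty above, the following Python sketch counts how many stall cycles (or NOPs) remain once some independent instructions sit between the load and its use. The 2-cycle penalty comes from the slide; the function name and signature are hypothetical.

```python
# Sketch: stall cycles for a load followed by a dependent "use" instruction.
# LOAD_USE_PENALTY = 2 matches the slide; the function name is hypothetical.

LOAD_USE_PENALTY = 2


def stall_cycles(instructions_between, penalty=LOAD_USE_PENALTY):
    """Stalls (or NOPs) still needed when `instructions_between`
    independent instructions separate the load from its use."""
    if instructions_between < 0:
        raise ValueError("instructions_between must be non-negative")
    return max(0, penalty - instructions_between)
```

For instance, `stall_cycles(0)` is 2 (the use directly follows the load) and `stall_cycles(2)` is 0, which matches the advice to place two instructions between the load and the use.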

11 Importance of Branch Prediction
Branch-Delay Penalty
  – Branch in “D” stage
      – Two more instructions are being fetched (IF, IS)
      – Two branch delay slots!
  – Next version of superpipeline…
      – May have 3 branch delay slots?
      – Not a good idea!
Need BRANCH PREDICTION
  – MIPS R4000
      – Total branch delay: 3 cycles
      – 1 delay slot (historical), followed by 2 cycles of static branch prediction (predict-untaken)

Superscalar

13 Superscalar
Basic Idea
  – Why execute only 1 instruction in a clock cycle?
  – How about 2 instructions per cycle?
Tempting to begin calling it IPC (instructions per cycle)
  – IPC = 1 / CPI
  – Compare “IPC” to “MIPS” … both are rates
Stick to CPI for this course:
    ExecutionTime = InstructionCount * CPI * ClockPeriod
    Ideal CPI = 0.5 in this case
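The execution-time formula above is easy to check numerically. The Python sketch below is only an illustration; the function names and the sample numbers (1000 instructions, a 2 ns clock period) are assumptions chosen for the example.

```python
# Sketch of ExecutionTime = InstructionCount * CPI * ClockPeriod.
# Function names and sample values are assumptions for illustration.


def execution_time(instruction_count, cpi, clock_period):
    """Execution time, in the same time unit as `clock_period`."""
    return instruction_count * cpi * clock_period


def ipc(cpi):
    """Instructions per cycle is the reciprocal of CPI."""
    return 1 / cpi


# 1000 instructions with a 2 ns clock period:
scalar_ns = execution_time(1000, 1.0, 2)       # ideal single-issue pipeline
superscalar_ns = execution_time(1000, 0.5, 2)  # ideal 2-way superscalar
```

With the same instruction count and clock period, dropping the ideal CPI from 1.0 to 0.5 halves the execution time, and `ipc(0.5)` gives the equivalent rate of 2 instructions per cycle.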

14 Static Superscalar
Find 2 instructions every clock cycle!
  – Pair them up when writing assembly code
  – Called Static Superscalar
Compiler does the work ahead of time
  – Given two instructions, CPU just executes them
      – Instructions must be independent
      – If hard to find independent instruction, use NOP
  – Compiler looks for “eligible” pairs
      – Automagically avoid dependences between instruction pairs
      – Not much brains in CPU…

15 Static Superscalar: Need to Double All Resources?
Need to double everything?
  – Need 2 Instruction Memories?
  – Need 2 Register Files (4 read ports, 2 write ports)?
  – Need 2 ALUs?
  – Need 2 Data Memories?
Too much overhead, not usually done
  – Just 1 Instruction Memory with 2 x 32-bit outputs (8 bytes)
  – Just 1 ALU
  – Just 1 Data Memory (need partial ALU to compute address)
  – Need bigger register file (4 read ports, 2 write ports)
Practical limits imposed to use fewer resources
  – Only combine 1 ALU instruction + 1 Memory instruction
      – Cannot combine 2 ALU instructions or 2 Memory instructions
  – Align all instructions in pairs in the instruction memory
      – PC%8==0 for ALU instructions, PC%8==4 for memory instructions
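The alignment rule above amounts to a simple check a compiler or assembler could run. The Python sketch below is hypothetical: the opcode sets are a small illustrative subset, and treating NOP as legal in either slot is an assumption based on the slides' use of NOPs as filler.

```python
# Sketch of the slot rule: PC % 8 == 0 holds an ALU/BR instruction,
# PC % 8 == 4 holds a LD/ST instruction. Opcode sets are illustrative.

ALU_BR_OPS = {"add", "addu", "addi", "sub", "and", "or", "slt", "beq", "bne"}
LD_ST_OPS = {"lw", "sw"}


def slot_of(pc):
    """Return which issue slot the instruction at `pc` occupies."""
    if pc % 4 != 0:
        raise ValueError("instructions are 4-byte aligned")
    return "ALU/BR" if pc % 8 == 0 else "LD/ST"


def fits_slot(opcode, pc):
    """True if `opcode` may be placed at address `pc`."""
    if opcode == "nop":
        return True  # assumption: a NOP can fill either slot
    if slot_of(pc) == "ALU/BR":
        return opcode in ALU_BR_OPS
    return opcode in LD_ST_OPS
```

Under this rule an `lw` at an address that is a multiple of 8 is illegal, so the compiler would have to place a NOP there and move the `lw` to the following slot.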

16 Static Superscalar

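17 Pipeline Diagram for Superscalar
Two instructions per cycle

    Instr  Type        Stages (each pair starts one cycle after the previous pair)
    1a     ALU or BR   I  D  X  M  W
    1b     LD or ST    I  D  X  M  W
    2a     ALU or BR      I  D  X  M  W
    2b     LD or ST       I  D  X  M  W
    3a     ALU or BR         I  D  X  M  W
    3b     LD or ST          I  D  X  M  W
    4a     ALU or BR            I  D  X  M  W
    4b     LD or ST             I  D  X  M  W
    5a     ALU or BR               I  D  X  M  W
    5b     LD or ST                I  D  X  M  W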

18 Code Scheduling for Superscalar
Example

    Loop:  lw    $t0, 0($s1)
           addi  $s1, $s1, -4
           addu  $t0, $t0, $s2
           sw    $t0, 4($s1)
           bne   $s1, $zero, Loop

Regular pipeline: 5 cycles per iteration (assuming no delay slots)

    int *p;
    for( ; p != 0; p-- ) {
        *p = *p + CONST;
    }

19 Code Scheduling for Superscalar

    Loop:  lw    $t0, 0($s1)
           addi  $s1, $s1, -4
           addu  $t0, $t0, $s2
           sw    $t0, 4($s1)
           bne   $s1, $zero, Loop

    LABEL  ALU/BR INSTR           LD/ST INSTR       Cycle
    Loop:                         LW  $t0,0($s1)    1
           ADDI $s1,$s1,-4                          2
           ADDU $t0,$t0,$s2                         3
           BNE  $s1,$zero,Loop    SW  $t0,4($s1)    4

Blank table entries are NOPs.
Load-use delay prevents ADDU being earlier.
Effective CPI is 0.8, not 0.5!
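The effective CPI quoted above can be recomputed directly from the schedule. The Python sketch below encodes each cycle as an (ALU/BR, LD/ST) pair with `None` standing for a NOP; this representation and the name `effective_cpi` are assumptions made for illustration.

```python
# Each tuple is one cycle: (ALU/BR slot, LD/ST slot); None marks a NOP.
schedule = [
    (None,                   "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4",    None),
    ("addu $t0, $t0, $s2",   None),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]


def effective_cpi(schedule):
    """Cycles divided by useful (non-NOP) instructions."""
    cycles = len(schedule)
    useful = sum(1 for pair in schedule for instr in pair if instr is not None)
    return cycles / useful
```

Here `effective_cpi(schedule)` is 4 cycles over 5 useful instructions, which is the 0.8 quoted on the slide. Three of the eight issue slots are filled with NOPs, which is why the ideal 0.5 is not reached.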

20 Code Scheduling for Superscalar
The compiler can further improve CPI
Loop unrolling
  – Example: unroll previous code 4 times (# iterations multiple of 4)
  – Execute new body ¼ number of iterations

    LABEL  ALU/BR INSTR           LD/ST INSTR        Cycle
    Loop:                         LW  $t0,  0($s1)   1
                                  LW  $t1, -4($s1)   2
           ADDU $t0,$t0,$s2       LW  $t2, -8($s1)   3
           ADDU $t1,$t1,$s2       LW  $t3,-12($s1)   4
           ADDU $t2,$t2,$s2       SW  $t0,  0($s1)   5
           ADDU $t3,$t3,$s2       SW  $t1, -4($s1)   6
           ADDI $s1,$s1,-16       SW  $t2, -8($s1)   7
           BNE  $s1,$zero,Loop    SW  $t3,  4($s1)   8
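At the source level, unrolling by 4 looks roughly like the Python sketch below. It is only an analogy for the assembly above: it walks a list upward by index where the slide's loop walks memory downward through a pointer, and the function names are hypothetical.

```python
# Source-level analogy for unrolling the "add a constant" loop 4 times.
# Assumes the element count is a multiple of 4, as on the slide.


def add_const(values, const):
    """Original loop: one element and one loop test per iteration."""
    for i in range(len(values)):
        values[i] += const


def add_const_unrolled4(values, const):
    """Unrolled loop: four elements per iteration, 1/4 as many loop tests."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    for i in range(0, len(values), 4):
        values[i] += const
        values[i + 1] += const
        values[i + 2] += const
        values[i + 3] += const
```

Both functions produce the same result. The unrolled version runs the loop-control work (the counterpart of ADDI and BNE) once per four elements and exposes four independent updates that a scheduler can interleave.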

21 Code Scheduling for Superscalar
Unroll loop 4 times
  – More registers used
  – Some BNE/ADDI instructions are gone
CPI Improved
  – Before unrolling: 0.8
  – After unrolling: 8/14 = 0.57
InstrCount Improved
  – Before unrolling: 20/4 = 5 per iteration
  – After unrolling: 14/4 = 3.5 per iteration
  – We don’t get this with superpipelining!
Overall Performance
  – Pipelined: 5 cycles / iteration
  – Superscalar before unrolling: 4 cycles / iteration
  – Superscalar after unrolling: 2 cycles / iteration
  – Superpipelined: 2.0x faster than pipelined!
  – Superscalar unrolled: 2.5x faster than pipelined!
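The figures above follow from a few divisions, reproduced here as a small Python sketch. The cycle and instruction counts are taken from the slides; the function name and dictionary keys are assumptions for the example.

```python
# Recomputing the slide's numbers: 14 instructions in 8 cycles cover
# 4 original iterations after unrolling.


def summarize():
    """Return the CPI, per-iteration costs, and speedups from the slides."""
    pipelined_cycles_per_iter = 5 / 1    # 5 instructions, CPI 1
    superscalar_cycles_per_iter = 4 / 1  # 4-cycle schedule, 1 iteration
    unrolled_cycles_per_iter = 8 / 4     # 8-cycle schedule, 4 iterations
    return {
        "cpi_unrolled": 8 / 14,
        "instr_per_iter_unrolled": 14 / 4,
        "cycles_per_iter_unrolled": unrolled_cycles_per_iter,
        "speedup_superscalar": pipelined_cycles_per_iter / superscalar_cycles_per_iter,
        "speedup_unrolled": pipelined_cycles_per_iter / unrolled_cycles_per_iter,
    }
```

The unrolled speedup of 2.5x comes from two effects multiplied together: fewer instructions per iteration (3.5 instead of 5) and a lower CPI (about 0.57 instead of 1).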

22 Importance of Branch Prediction
Now fetching two instructions every cycle
  – Given a branch:
      – Which two instructions to fetch: Taken or Not-Taken path?
Misprediction?
  – Many lost opportunities to execute instructions
  – Significant performance loss!
Branch prediction CRUCIAL!

23 Superpipelining vs. Superscalar
Which is better?
  – Debate lasted a few years in mid-1990s
Result: both won!
  – Can combine superpipelining and superscalar
Branch prediction is now crucial!
  – 6 instructions enter pipeline after a branch
      – x3 from superpipelining
      – x2 from superscalar
Superscalar can be enhanced further
  – Rely less upon compiler
  – Hardware finds instructions to pair together
      – More hazard detection, etc.
  – Dynamic superscalar (next class!)