Presentation transcript:

1 EE524 / CptS561 Computer Architecture - HW Support for More ILP
Speculation: allow an instruction that depends on a branch predicted taken to issue, with no consequences (including exceptions) if the branch is not actually taken ("HW undo"); called "boosting"
Combine branch prediction with dynamic scheduling to execute instructions before branches are resolved
Separate speculative bypassing of results from real bypassing of results:
– When an instruction is no longer speculative, write its boosted results (instruction commit) or discard them
– Execute out of order but commit in order, so no irrevocable action (state update or exception) happens until the instruction commits

2 EE524 / CptS561 Computer Architecture - HW Support for More ILP
Need a HW buffer for the results of uncommitted instructions: the reorder buffer
– 3 fields: instruction, destination, value
– The reorder buffer can be an operand source => more registers, like reservation stations
– Use the reorder-buffer number instead of the reservation-station number when execution completes
– Supplies operands between execution complete and commit
– Once the instruction commits, its result is put into the register
– Instructions commit in order
– As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
(figure: FP Op Queue, Reservation Stations, FP Adder, FP Mult, Reorder Buffer, FP Regs)
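A minimal sketch in C of the bookkeeping this slide describes, with invented names (ROBEntry, RegStatus, read_operand are not from the lecture): each reorder-buffer entry carries the three fields above plus a ready flag, and the register status table points at a ROB entry, which can supply an operand before the value has committed.

    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_SIZE 8
    #define NUM_REGS 32

    /* One ROB entry: the three fields from the slide (instr, destination,
       value) plus a ready flag so the entry can serve as an operand source
       between "execution complete" and "commit". */
    typedef struct {
        int     instr;       /* opcode / id of the instruction held here */
        int     dest;        /* architectural destination register       */
        int64_t value;       /* result, valid only once ready is set     */
        bool    ready;       /* execution complete, not yet committed    */
        bool    busy;        /* entry allocated                          */
    } ROBEntry;

    /* Register status: like the reservation-station tags in plain Tomasulo,
       but pointing at a reorder-buffer entry number instead. */
    typedef struct {
        bool reorder;        /* value will come from the ROB, not the regfile */
        int  rob_index;
    } RegStatus;

    typedef struct {
        ROBEntry  rob[ROB_SIZE];
        int       head, tail;          /* commit at head, allocate at tail */
        int64_t   regfile[NUM_REGS];
        RegStatus stat[NUM_REGS];
    } Machine;

    /* Fetch a source operand: either the committed register value or the
       speculative value sitting in the ROB (if it has been produced yet). */
    bool read_operand(const Machine *m, int reg, int64_t *val, int *wait_rob) {
        if (!m->stat[reg].reorder) {           /* no pending writer */
            *val = m->regfile[reg];
            return true;
        }
        const ROBEntry *e = &m->rob[m->stat[reg].rob_index];
        if (e->ready) {                        /* produced but not committed */
            *val = e->value;
            return true;
        }
        *wait_rob = m->stat[reg].rob_index;    /* must wait on this ROB tag */
        return false;
    }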

3 EE524 / CptS561 Computer Architecture Load unit

4 EE524 / CptS561 Computer Architecture - Four Steps of the Speculative Tomasulo Algorithm
1. Issue: get the instruction from the FP Op Queue. If a reservation station and a reorder-buffer slot are free, issue the instruction and send the operands and the reorder-buffer number for the destination (this stage is sometimes called "dispatch").
2. Execution: operate on the operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute. This checks RAW hazards (sometimes called "issue").
3. Write result: finish execution (WB). Write on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available.
4. Commit: update the register with the reorder-buffer result. When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (this stage is sometimes called "graduation").
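A rough rendering of step 4 (commit) and of the misprediction flush, with hypothetical names and a deliberately simplified machine; steps 1-3 are only summarized in the comments. This is one plausible reading of the slide's rules, not the textbook's code.

    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_SIZE 8
    #define NUM_REGS 32

    typedef enum { OP_ALU, OP_STORE, OP_BRANCH } OpKind;

    typedef struct {
        OpKind  kind;
        int     dest;          /* register (OP_ALU) or word address (OP_STORE)    */
        int64_t value;         /* result / store data / branch outcome            */
        bool    ready;         /* write-result (step 3) has happened              */
        bool    busy;          /* allocated at issue (step 1)                     */
        bool    mispredicted;  /* set at write-result if the branch guessed wrong */
    } ROBEntry;

    typedef struct {
        ROBEntry rob[ROB_SIZE];
        int      head, tail;
        int64_t  regfile[NUM_REGS];
        int64_t  memory[1024];
    } Machine;

    /* Step 4 (commit): only the instruction at the head of the ROB may update
       architectural state, so updates happen in program order even though
       execution (steps 2-3) finished out of order. */
    void commit(Machine *m) {
        ROBEntry *e = &m->rob[m->head];
        if (!e->busy || !e->ready)
            return;                                   /* head still executing */

        if (e->kind == OP_BRANCH && e->mispredicted) {
            /* Wrong speculation: discard the branch and everything younger. */
            for (int i = 0; i < ROB_SIZE; i++)
                m->rob[i].busy = false;
            m->tail = m->head;                        /* restart fetch from here */
            return;
        }

        if (e->kind == OP_ALU)
            m->regfile[e->dest] = e->value;           /* register updated only at commit */
        else if (e->kind == OP_STORE)
            m->memory[e->dest] = e->value;            /* memory written only at commit   */

        e->busy = false;
        m->head = (m->head + 1) % ROB_SIZE;
    }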

5 EE524 / CptS561 Computer Architecture - Tomasulo With Reorder Buffer
(figure: FP Op Queue, reservation stations, FP adders and multipliers, reorder buffer ROB1-ROB7, registers, paths to and from memory)
Snapshot after issuing LD F0,10(R2): ROB1 (both oldest and newest entry) holds the load, destination F0, not done; load buffer 1 is computing the address 10+R2.

6 EE524 / CptS561 Computer Architecture - Tomasulo With Reorder Buffer
After issuing ADDD F10,F4,F0: ROB2 holds the add (destination F10, not done); reservation station 2 holds ADDD with operand R(F4) and the tag ROB1 for the pending load. ROB1 still holds LD F0,10(R2), not done.

7 EE524 / CptS561 Computer Architecture - Tomasulo With Reorder Buffer
After issuing DIVD F2,F10,F6: ROB3 holds the divide (destination F2, not done); reservation station 3 holds DIVD with the tag ROB2 and operand R(F6). ROB2 and ROB1 are unchanged, still not done.

8 EE524 / CptS561 Computer Architecture - Tomasulo With Reorder Buffer
After issuing BNE F2,…, LD F4,0(R3), and ADDD F0,F4,F6: ROB4 holds the branch (no destination, not done), ROB5 holds the load (destination F4, not done; load buffer 5 is computing 0+R3), and ROB6 holds the add (destination F0, not done; reservation station 6 holds ADDD with the tag ROB5 and operand R(F6)). Everything after the branch is speculative.

9 EE524 / CptS561 Computer Architecture - Tomasulo With Reorder Buffer
After issuing ST 0(R3),F4: ROB7 (newest) holds the store, whose data will come from ROB5, not done. ROB1-ROB6 are unchanged; nothing has completed, so nothing can commit.

10 EE524 / CptS561 Computer Architecture - Tomasulo With Reorder Buffer
The speculative load completes: LD F4,0(R3) in ROB5 is done with value M[10], the store in ROB7 picks up M[10] as its data and is marked done, and reservation station 6 now holds ADDD M[10],R(F6). ROB1-ROB4 are still not done, so the head of the ROB cannot commit.

11 EE524 / CptS561 Computer Architecture - Tomasulo With Reorder Buffer
Same state, one step later: the speculative ADDD F0,F4,F6 (ROB6) is now executing (Ex) in an FP adder. ROB1-ROB4 are still not done.

12 EE524 / CptS561 Computer Architecture - Tomasulo With Reorder Buffer
Same snapshot, posing the question: what about memory hazards??? The speculative store ST 0(R3),F4 (ROB7) is already marked done while the earlier load LD F0,10(R2) (ROB1) is not, so loads and stores need their own ordering checks.

13 EE524 / CptS561 Computer Architecture - Renaming Registers
– Common variation of the speculative design
– Reorder buffer keeps instruction information but not the result
– Extend the register file with extra renaming registers to hold speculative results
– A rename register is allocated at issue; the result goes into the rename register when execution completes; the rename register is copied into the real register at commit
– Operands are read either from the register file (real or speculative) or via the Common Data Bus
– Advantage: operands always come from a single source (the extended register file)
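A sketch of this rename-register variant, assuming invented names (RegFile, allocate_rename, and so on): a map table sends each architectural register to a rename register allocated at issue, and commit copies the rename register into the real one.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_ARCH_REGS   32
    #define NUM_RENAME_REGS 16

    typedef struct {
        int64_t real[NUM_ARCH_REGS];       /* committed ("real") registers          */
        int64_t rename[NUM_RENAME_REGS];   /* speculative results live here         */
        bool    rename_free[NUM_RENAME_REGS];
        int     map[NUM_ARCH_REGS];        /* arch reg -> rename reg, or -1 if none */
    } RegFile;

    void regfile_init(RegFile *rf) {
        for (int r = 0; r < NUM_ARCH_REGS; r++)   rf->map[r] = -1;
        for (int p = 0; p < NUM_RENAME_REGS; p++) rf->rename_free[p] = true;
    }

    /* At issue: grab a free rename register for the destination. */
    int allocate_rename(RegFile *rf, int arch_dest) {
        for (int p = 0; p < NUM_RENAME_REGS; p++) {
            if (rf->rename_free[p]) {
                rf->rename_free[p] = false;
                rf->map[arch_dest] = p;
                return p;
            }
        }
        return -1;   /* no rename register free: stall issue */
    }

    /* Operand read: always from the (extended) register file - the rename
       register if one is mapped, otherwise the committed value. */
    int64_t read_reg(const RegFile *rf, int arch_src) {
        int p = rf->map[arch_src];
        return (p >= 0) ? rf->rename[p] : rf->real[arch_src];
    }

    /* At commit: the speculative value becomes the real register. */
    void commit_reg(RegFile *rf, int arch_dest, int p) {
        rf->real[arch_dest] = rf->rename[p];
        rf->rename_free[p] = true;
        if (rf->map[arch_dest] == p)
            rf->map[arch_dest] = -1;
    }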

14 EE524 / CptS561 Computer Architecture
(figure: the same datapath as before - FP Op Queue, reservation stations, FP adder, FP mult, reorder buffer - with the register file extended by renaming registers)

15 EE524 / CptS561 Computer Architecture - Dynamic Scheduling in PowerPC 604 and Pentium Pro
Both processors have:
– In-order issue
– Out-of-order execution
– In-order commit

16 EE524 / CptS561 Computer Architecture - Exceptions and Interrupts
Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit
– If we speculate and are wrong, we need to back up and restart execution at the point where we predicted incorrectly
– This is exactly what we need to do for precise exceptions
Exceptions are handled by not recognizing the exception until the instruction that caused it is ready to commit in the ROB
– If a speculated instruction raises an exception, the exception is recorded in the ROB
– This is why reorder buffers appear in all new processors
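A small sketch of that mechanism, with hypothetical field names: the fault is only recorded in the ROB entry, and only acted on when that entry reaches the head, so a fault raised down a mispredicted path simply vanishes when the entry is flushed.

    #include <stdbool.h>

    /* Each ROB entry carries an exception flag (invented field names). A
       faulting instruction just records the fault at write-result time... */
    typedef struct {
        bool ready;
        bool busy;
        bool exception;        /* e.g. page fault or overflow detected in EX */
        int  exception_code;
    } ROBEntry;

    /* ...and the fault is recognized only when the instruction reaches the
       head of the ROB, i.e. when it is known to be on the committed path.
       Older instructions have already committed, so the handler sees clean,
       in-order architectural state: a precise exception. */
    bool commit_head(ROBEntry *head) {
        if (!head->busy || !head->ready)
            return false;
        if (head->exception) {
            /* raise_exception(head->exception_code);  -- hypothetical handler */
            return false;
        }
        head->busy = false;
        return true;
    }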

17 EE524 / CptS561 Computer Architecture - Dynamic Scheduling in PowerPC 604 and Pentium Pro
Parameter                            PPC 604        Pentium Pro
Max. instructions issued/clock       4              3
Max. instr. completing exec./clock   6              5
Max. instr. committed/clock          6              3
Window (instrs in reorder buffer)    16             40
Number of reservation stations       12             20
Number of rename registers           8 int / 12 FP  40
No. integer functional units (FUs)   2              2
No. floating-point FUs               1              1
No. branch FUs                       1              1
No. complex integer FUs              1              0
No. memory FUs                       1              1 load + 1 store

18 EE524 / CptS561 Computer Architecture - Dynamic Scheduling in Pentium Pro
– PPro doesn't pipeline 80x86 instructions
– The PPro decode unit translates the Intel instructions into 72-bit micro-operations (~ DLX)
– Sends micro-operations to the reorder buffer and reservation stations
– Takes 1 clock cycle to determine the length of an 80x86 instruction, plus 2 more to create the micro-operations; the full pipeline takes several more clocks (~ 3 state machines)
– Many instructions translate to 1 to 4 micro-operations
– Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations

19 EE524 / CptS561 Computer Architecture - Getting CPI < 1: Issuing Multiple Instructions/Cycle
Two variations:
– Superscalar: varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
  » IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000
– (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; ops are put into wide templates
  » Joint HP/Intel agreement in 1999/2000
  » Intel Architecture-64 (IA-64), 64-bit address
  » Style: "Explicitly Parallel Instruction Computer (EPIC)"
Anticipated success led to the use of Instructions Per Clock cycle (IPC) vs. CPI

20 EE524 / CptS561 Computer Architecture - Getting CPI < 1: Issuing Multiple Instructions/Cycle
Superscalar DLX: 2 instructions, 1 FP and 1 anything else
– Fetch 64 bits/clock cycle; Int on left, FP on right
– Can only issue the 2nd instruction if the 1st instruction issues (see the sketch after this slide)
– More ports on the FP registers to do an FP load and an FP op as a pair

Type              Pipe stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB

A 1-cycle load delay expands to 3 instructions in SS
– the instruction in the right half can't use it, nor can the instructions in the next slot
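The issue restriction in the second bullet, rendered as a tiny C check (names invented; a real issue stage would also test structural and data hazards in more detail):

    #include <stdbool.h>

    typedef enum { CLASS_INT, CLASS_FP } InstrClass;

    typedef struct {
        InstrClass cls;
        bool       ready;    /* no structural/data hazard of its own */
    } Instr;

    /* The 64-bit fetch packet holds an integer instruction on the left and
       an FP instruction on the right; the right-hand (second) instruction
       may issue only if the left-hand (first) one does.
       Returns how many instructions issue this cycle. */
    int issue_pair(const Instr *first, const Instr *second) {
        if (first->cls != CLASS_INT || !first->ready)
            return 0;                       /* nothing issues this cycle     */
        if (second->cls == CLASS_FP && second->ready)
            return 2;                       /* int + FP issue together       */
        return 1;                           /* only the integer instruction  */
    }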

21 EE524 / CptS561 Computer Architecture - Superscalar
(pipeline diagram: successive instructions overlapped, one stage apart)
IF  ID  EX  M   WB
    IF  ID  EX  M   WB
        IF  ID  EX  M   WB
            IF  ID  EX  M   WB
                IF  ID  EX  M   WB

22 EE524 / CptS561 Computer Architecture - Review: Unrolled Loop that Minimizes Stalls for Scalar
 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16    ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Latencies assumed: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
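For reference, the source loop this code unrolls is presumably the running example from the text: adding a scalar to every element of an array, something along these lines in C.

    /* The scalar loop the unrolled code above implements (assumed form):
       x has elements at indices 1..n, s is a scalar constant. */
    void add_scalar(double *x, double s, int n) {
        for (int i = n; i > 0; i = i - 1)   /* one element per iteration */
            x[i] = x[i] + s;
    }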

23 EE524 / CptS561 Computer Architecture - Loop Unrolling in Superscalar
      Integer instruction   FP instruction      Clock cycle
Loop: LD F0,0(R1)                               1
      LD F6,-8(R1)                              2
      LD F10,-16(R1)        ADDD F4,F0,F2       3
      LD F14,-24(R1)        ADDD F8,F6,F2       4
      LD F18,-32(R1)        ADDD F12,F10,F2     5
      SD 0(R1),F4           ADDD F16,F14,F2     6
      SD -8(R1),F8          ADDD F20,F18,F2     7
      SD -16(R1),F12                            8
      SD -24(R1),F16                            9
      SUBI R1,R1,#40                            10
      BNEZ R1,LOOP                              11
      SD -32(R1),F20                            12
Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks, or 2.4 clocks per iteration (1.5X)

24 EE524 / CptS561 Computer Architecture - Multiple Issue Challenges
The integer/FP split is simple for HW, but gets a CPI of 0.5 only for programs with:
– Exactly 50% FP operations
– No hazards
If more instructions issue at the same time, decode and issue become harder
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, and decide if 1 or 2 instructions can issue
VLIW: trade instruction space for simple decoding
– The very long instruction word (VLIW) has room for many operations
– By definition, all the operations the compiler puts in the long instruction word are independent => they can be executed in parallel
– E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch (see the sketch after this slide)
  » 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need a compiling technique that schedules across several branches
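A sketch of what one such long instruction word might look like as a data structure (field names and widths are illustrative, following the 7-field example above, not any real VLIW encoding):

    #include <stdint.h>

    typedef uint32_t Op;          /* 16-24 bit operation, padded for the sketch */
    #define VLIW_NOP ((Op)0)      /* explicit no-op when a slot can't be filled */

    /* One VLIW instruction word: every slot is issued every cycle, and the
       compiler is responsible for putting only independent ops in one word. */
    typedef struct {
        Op mem_ref[2];            /* two memory references  */
        Op int_op[2];             /* two integer operations */
        Op fp_op[2];              /* two FP operations      */
        Op branch;                /* one branch             */
    } VLIWWord;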

25 EE524 / CptS561 Computer Architecture - VLIW
One VLIW instruction word: Mem. ref. 1 | Mem. ref. 2 | FP op. 1 | FP op. 2 | Int. op/branch

26 EE524 / CptS561 Computer Architecture - Loop Unrolling in VLIW
Memory ref 1      Memory ref 2      FP op 1           FP op 2           Int. op/branch   Clock
LD F0,0(R1)       LD F6,-8(R1)                                                           1
LD F10,-16(R1)    LD F14,-24(R1)                                                         2
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2     ADDD F8,F6,F2                      3
LD F26,-48(R1)                      ADDD F12,F10,F2   ADDD F16,F14,F2                    4
                                    ADDD F20,F18,F2   ADDD F24,F22,F2                    5
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                      6
SD -16(R1),F12    SD -24(R1),F16                                                         7
SD -32(R1),F20    SD -40(R1),F24                                        SUBI R1,R1,#48   8
SD -0(R1),F28                                                            BNEZ R1,LOOP     9
Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: VLIW needs more registers (15 vs. 6 in SS)

27 EE524 / CptS561 Computer Architecture - Trace Scheduling
Parallelism across IF branches vs. LOOP branches
Two steps:
– Trace selection
  » Find a likely sequence of basic blocks (a trace) forming a long, statically predicted or profile-predicted sequence of straight-line code
– Trace compaction
  » Squeeze the trace into few VLIW instructions
  » Need bookkeeping code in case the prediction is wrong
The compiler undoes a bad guess (discards values in registers)
Subtle compiler bugs mean a wrong answer rather than just poor performance; there are no hardware interlocks

28 EE524 / CptS561 Computer Architecture - Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
– HW determines address conflicts
– HW has better branch prediction
– HW maintains a precise exception model
– HW does not execute bookkeeping instructions
– Works across multiple implementations
– SW speculation is much easier for HW design

29 EE524 / CptS561 Computer Architecture - Superscalar vs. VLIW
Superscalar advantages:
– Smaller code size
– Binary compatibility across generations of hardware
VLIW advantages:
– Simplified hardware for decoding and issuing instructions
– No interlock hardware (compiler checks?)
– More registers, but simplified hardware for register ports (multiple independent register files?)

30 EE524 / CptS561 Computer Architecture - Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
IA-64: instruction set architecture
– 128 64-bit integer registers + 128 82-bit floating-point registers
– Not separate register files per functional unit as in old VLIW
Hardware checks dependencies (interlocks => binary compatibility over time)
Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions? (see the C sketch after this slide)
Itanium™ was the first implementation (2001)
– Highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
Itanium 2™ is the name of the 2nd implementation (2005)
– 6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
– Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3
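The predication idea in C terms (this is if-conversion as a compiler would see it, not IA-64 syntax): the guarded operation always executes and a 1-bit predicate decides whether its result is kept, so there is no branch left to mispredict.

    /* Branchy version: the compare feeds a conditional branch that may be
       mispredicted. */
    int abs_branchy(int a) {
        if (a < 0)
            a = -a;
        return a;
    }

    /* If-converted version: both the compare result (the "predicate") and
       the guarded operation execute unconditionally; the predicate only
       selects which result is kept. */
    int abs_predicated(int a) {
        int p   = (a < 0);          /* 1-bit predicate                */
        int neg = -a;               /* executed regardless of p       */
        return p ? neg : a;         /* select, not branch             */
    }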

31 EE524 / CptS561 Computer Architecture Execution units

32 EE524 / CptS561 Computer Architecture

33 EE524 / CptS561 Computer Architecture

34 EE524 / CptS561 Computer Architecture
Bundle template   Slot 0            Slot 1            Slot 2            Execute cycle
9:  M M I         LD F0,0(R1)       LD F6,-8(R1)                        1
14: M M F         LD F10,-16(R1)    LD F14,-24(R1)    ADD F4,F0,F2      3
15: M M F         LD F18,-32(R1)    LD F22,-40(R1)    ADD F8,F6,F2      4
15: M M F         LD F26,-48(R1)    SD F4,0(R1)       ADD F12,F10,F2    6
15: M M F         SD F8,-8(R1)      SD F12,-16(R1)    ADD F16,F14,F2    9
15: M M F         SD F16,-24(R1)                      ADD F20,F18,F2    12
15: M M F         SD F20,-32(R1)                      ADD F24,F22,F2    15
15: M M F         SD F24,-40(R1)                      ADD F28,F26,F2    18
17: M I B         SD F28,-48(R1)    ADD R1,R1,#-56    BNE R1,R2,Loop    21

35 EE524 / CptS561 Computer Architecture
Bundle template   Slot 0            Slot 1            Slot 2            Execute cycle
8:  M M I         LD F0,0(R1)       LD F6,-8(R1)                        1
9:  M M I         LD F10,-16(R1)    LD F14,-24(R1)                      2
14: M M F         LD F18,-32(R1)    LD F22,-40(R1)    ADD F4,F0,F2      3
14: M M F         LD F26,-48(R1)                      ADD F8,F6,F2      4
15: M M F                                             ADD F12,F10,F2    5
14: M M F         SD F4,0(R1)                         ADD F16,F14,F2    6
14: M M F         SD F8,-8(R1)                        ADD F20,F18,F2    7
15: M M F         SD F12,-16(R1)                      ADD F24,F22,F2    8
14: M M F         SD F16,-24(R1)                      ADD F28,F26,F2    9
9:  M M I         SD F20,-32(R1)    SD F24,-40(R1)                      11
17: M I B         SD F28,-48(R1)    ADD R1,R1,#-56    BNE R1,R2,Loop    12

36 EE524 / CptS561 Computer Architecture - Increasing Instruction Fetch Bandwidth
Branch Target Buffer (BTB):
– Predicts the next instruction address and sends it out before the instruction is decoded
– The PC of the branch is sent to the BTB
– When a match is found, the Predicted PC is returned
– If the branch is predicted taken, instruction fetch continues at the Predicted PC
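A minimal branch-target-buffer sketch in C, with an assumed direct-mapped organization and invented names; the point is only that the lookup uses the fetch PC and returns a predicted next PC before the instruction is decoded.

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 256        /* sketch size, not from the slide */

    typedef struct {
        bool     valid;
        uint32_t branch_pc;        /* tag: PC of the branch        */
        uint32_t predicted_pc;     /* where to fetch next if taken */
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* Looked up with the fetch PC, before the instruction is even decoded:
       on a hit, fetch continues at predicted_pc next cycle. */
    bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
        BTBEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];   /* simple index */
        if (e->valid && e->branch_pc == pc) {
            *next_pc = e->predicted_pc;                /* predicted taken */
            return true;
        }
        *next_pc = pc + 4;                             /* fall through    */
        return false;
    }

    /* On a taken branch that missed (or was mispredicted), update the entry. */
    void btb_update(uint32_t pc, uint32_t target) {
        BTBEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        e->valid        = true;
        e->branch_pc    = pc;
        e->predicted_pc = target;
    }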

37 EE524 / CptS561 Computer Architecture - Inst. Fetch Bandwidth: Return Address Predictor
– Small buffer of return addresses acts as a stack
– Caches the most recent return addresses
– Call => push a return address onto the stack
– Return => pop an address off the stack and predict it as the new PC
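A sketch of that return-address stack, with an assumed depth and invented names: calls push, returns pop, and the popped value is used as the predicted PC.

    #include <stdint.h>

    #define RAS_DEPTH 16           /* small buffer, as the slide says */

    static uint32_t ras[RAS_DEPTH];
    static int      ras_top;       /* wraps around, overwriting the oldest entries */

    /* Call: push the return address (the instruction after the call). */
    void ras_push(uint32_t return_pc) {
        ras_top = (ras_top + 1) % RAS_DEPTH;
        ras[ras_top] = return_pc;
    }

    /* Return: pop an address and use it as the predicted next PC. */
    uint32_t ras_pop(void) {
        uint32_t predicted = ras[ras_top];
        ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
        return predicted;
    }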

38 EE524 / CptS561 Computer Architecture - More Instruction Fetch Bandwidth
Integrated branch prediction: the branch predictor is part of the instruction fetch unit and is constantly predicting branches
Instruction prefetch: instruction fetch units prefetch to deliver multiple instructions per clock, integrating it with branch prediction
Instruction memory access and buffering: fetching multiple instructions per cycle
– May require accessing multiple cache blocks (prefetch to hide the cost of crossing cache blocks)
– Provides buffering, acting as an on-demand unit that supplies instructions to the issue stage as needed and in the quantity needed

39 EE524 / CptS561 Computer Architecture - (Mis)Speculation on Pentium 4
(figure: misspeculation fraction, i.e. % of micro-ops not used, per benchmark; integer: gzip, vpr, gcc, mcf, crafty; floating point: wupwise, swim, mgrid, applu, mesa)

40 EE524 / CptS561 Computer Architecture - Dynamic Scheduling in Superscalar
Dependencies stop instruction issue (without a Tomasulo scheme)
Code compiled for an old version will run poorly on the newest version
– May want the code to vary depending on how superscalar the machine is

41 EE524 / CptS561 Computer Architecture - Dynamic Scheduling in Superscalar
How to issue two instructions and keep in-order instruction issue for Tomasulo?
– Assume 1 integer + 1 floating point
– 1 Tomasulo control for integer, 1 for floating point
Issue at 2X the clock rate, so that issue remains in order
Only FP loads might cause a dependency between integer and FP issue:
– Replace the load reservation station with a load queue; operands must be read in the order they are fetched
– A load checks addresses in the Store Queue to avoid RAW violations (sketched after this slide)
– A store checks addresses in the Load Queue to avoid WAR and WAW violations
– Called a "decoupled architecture"
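A sketch of the load-side RAW check from those bullets, using an invented store-queue layout: before a load reads memory it compares its address against the pending older stores and, on a match, forwards the store's data (waiting when the data is not yet ready is omitted here).

    #include <stdbool.h>
    #include <stdint.h>

    #define SQ_SIZE 8

    typedef struct {
        bool     valid;
        uint32_t addr;
        int64_t  data;
    } StoreQEntry;

    /* Queue of older, not-yet-retired stores (hypothetical layout). */
    typedef struct {
        StoreQEntry entry[SQ_SIZE];
    } StoreQueue;

    /* RAW check: if a pending older store targets the same address as this
       load, the load must take (or wait for) that store's data instead of
       reading stale memory. */
    bool load_conflicts(const StoreQueue *sq, uint32_t load_addr, int64_t *fwd) {
        for (int i = 0; i < SQ_SIZE; i++) {
            if (sq->entry[i].valid && sq->entry[i].addr == load_addr) {
                *fwd = sq->entry[i].data;   /* forward the store's data */
                return true;
            }
        }
        return false;                       /* safe to read memory */
    }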

42 EE524 / CptS561 Computer Architecture - Performance of Dynamic SS
Iteration  Instruction      Issues  Executes  Writes result   (clock-cycle numbers)
1          LD F0,0(R1)      1       2         4
1          ADDD F4,F0,F2    1       5         8
1          SD 0(R1),F4      2       9
1          SUBI R1,R1,#8    3       4         5
1          BNEZ R1,LOOP     4       5
2          LD F0,0(R1)      5       6         8
2          ADDD F4,F0,F2
2          SD 0(R1),F4      6       13
2          SUBI R1,R1,#8    7       8         9
2          BNEZ R1,LOOP     8       9
~ 4 clocks per iteration; only 1 FP instruction per iteration
Branches and decrements (SUBI) still take an issue slot every iteration
How to get more performance?