Flow Path Model of Superscalars


1 Flow Path Model of Superscalars
[Diagram: three flow paths. Instruction flow: I-cache, branch predictor, FETCH, instruction buffer, DECODE. Register data flow: integer, floating-point, media, and memory units (EXECUTE), reorder buffer (ROB), COMMIT. Memory data flow: store queue, D-cache.]

2 Instruction Fetch Buffer
[Diagram: fetch unit, instruction fetch buffer, out-of-order core]
- The fetch buffer smooths out the rate mismatch between fetch and execution: neither the fetch bandwidth nor the execution bandwidth is constant.
- Fetch bandwidth should be higher than execution bandwidth: we prefer to have a stockpile of instructions in the buffer to hide cache miss latencies.
- This requires both raw cache bandwidth and control flow speculation.
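The smoothing effect can be illustrated with a toy cycle-by-cycle model. This is only a sketch under stated assumptions: the function name `simulate`, the per-cycle fetch and execute rates, and the rule that instructions not fitting in the buffer are simply not kept are all hypothetical simplifications, not taken from the slides.

```python
def simulate(fetch_per_cycle, execute_per_cycle, capacity):
    """Return how many instructions the core executes in each cycle.

    fetch_per_cycle[i]   : instructions the fetch unit delivers in cycle i
    execute_per_cycle[i] : instructions the core could consume in cycle i
    capacity             : fetch buffer size (0 models no buffering)
    """
    buffered = 0
    executed = []
    for fetched, wanted in zip(fetch_per_cycle, execute_per_cycle):
        available = buffered + fetched      # buffer contents plus new arrivals
        done = min(wanted, available)       # the core takes what it can use
        executed.append(done)
        # Assumption: leftovers beyond the buffer capacity are not kept.
        buffered = min(capacity, available - done)
    return executed


# Hypothetical trace: an I-cache miss starves fetch in cycles 2 and 3.
fetch = [4, 4, 0, 0, 4, 4]
demand = [2, 2, 2, 2, 2, 2]
no_buffer = simulate(fetch, demand, capacity=0)
with_buffer = simulate(fetch, demand, capacity=8)
```

In this trace the unbuffered core idles during the miss, while the buffered core keeps executing from the stockpile built up while fetch ran ahead of execution.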

3 Instruction Flow Bandwidth

4 Instruction Cache Basic
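[Diagram: I-cache array with a row decoder (rows 000 to 111) and a column multiplexer (columns 00 to 11); PC = ..xxRRRCC00, where RRR selects the row and CC selects the instruction within the row.]
Example: 4 instructions per cache line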
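The PC layout ..xxRRRCC00 from the diagram can be decoded with a few shifts and masks. The sketch below assumes 4-byte instructions (the two low zero bits), 4 instructions per line (CC), and 8 rows (RRR); the function name `split_pc` is hypothetical.

```python
def split_pc(pc):
    """Split a PC of the form ..xxRRRCC00 into (row, column, byte_offset)."""
    byte_offset = pc & 0b11        # low 2 bits: always 00 for aligned instructions
    column = (pc >> 2) & 0b11      # CC: which of the 4 instructions in the line
    row = (pc >> 4) & 0b111        # RRR: which of the 8 rows the decoder selects
    return row, column, byte_offset


row, column, byte_offset = split_pc(0b1011000)   # RRR=101, CC=10, offset=00
```

Here the row decoder would select row 5 and the multiplexer would pick column 2.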

5 Spatial Locality and Fetch Bandwidth
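[Diagram: the same I-cache array (row decoder, rows 000 to 111, columns 00 to 11), PC = ..xxRRRCC00; the sequential instructions of the fetch group, through Inst2 and Inst3, sit in a single row and are read together.]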

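6 Fetch Group Misalignment
[Diagram: the same I-cache array (row decoder, rows 000 to 111, columns 00 to 11); the fetch group does not start at a row boundary, so the instructions through Inst2 are read in cycle i, while Inst3 lies in the next row and has to wait for cycle i+1.]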
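The cost of misalignment can be put in numbers. The following sketch assumes 4-byte instructions, 4 instructions per cache line, and a fetch unit that reads at most one row per cycle; the function names are hypothetical.

```python
def instructions_in_first_cycle(pc, insts_per_line=4, inst_bytes=4):
    """How many instructions of the fetch group come out of the first row read."""
    column = (pc // inst_bytes) % insts_per_line
    return insts_per_line - column


def cycles_for_group(pc, group_size=4, insts_per_line=4, inst_bytes=4):
    """Cycles needed to fetch group_size sequential instructions starting at pc."""
    first = instructions_in_first_cycle(pc, insts_per_line, inst_bytes)
    if first >= group_size:
        return 1
    remaining = group_size - first
    # Each further cycle reads one full row (ceiling division).
    return 1 + -(-remaining // insts_per_line)


aligned = cycles_for_group(0x40)      # starts at column 0
misaligned = cycles_for_group(0x44)   # starts at column 1
```

Under these assumptions an aligned group arrives in one cycle, while a group starting at any other column needs a second cycle for its tail.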

7 IBM RS/6000 Auto-alignment
[Diagram: the IFAR drives the T logic and mux stages in front of the instruction SRAM modules (rows up to 255). The modules interleave the instructions of sets A and B: A1 A5 A9 A13 B1 B5 B9 B13; A2 A6 A10 A14 B2 B6 B10 B14; A3 A7 A11 A15 B3 B7 B11 B15. Also shown: odd and even directory sets A & B, TLB hit and buffer control, and the instruction buffer network feeding interlock, dispatch, branch, and execution.]
- 2-way set associative I-cache, inst SRAM modules
- 16 instructions per cache line (**What is a cache line?)

8 Instruction Decoding Issues
Primary tasks:
- Identify individual instructions
- Determine instruction types
- Detect inter-instruction dependences
Two important factors:
- Instruction set architecture
- Width of parallel pipeline
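The dependence-detection task can be sketched for a group of instructions decoded in the same cycle. The representation below (each instruction as a destination register plus a list of source registers) and the function name `raw_dependences` are assumptions made for illustration, and only read-after-write dependences inside the group are checked.

```python
def raw_dependences(group):
    """Find read-after-write dependences inside one decode group.

    group: list of (dest, [sources]) tuples in program order.
    Returns (producer_index, consumer_index, register) triples.
    """
    deps = []
    for j, (_, sources) in enumerate(group):
        for src in sources:
            # Search backwards for the nearest earlier writer of src.
            for i in range(j - 1, -1, -1):
                if group[i][0] == src:
                    deps.append((i, j, src))
                    break
    return deps


# Hypothetical 4-wide decode group.
group = [
    ("r1", ["r2", "r3"]),
    ("r4", ["r1", "r5"]),
    ("r1", ["r4", "r2"]),
    ("r6", ["r1"]),
]
deps = raw_dependences(group)
```

Every instruction compares its sources against all earlier destinations in the group, so the number of comparisons grows roughly quadratically with the group size, which is one way the width of the parallel pipeline makes decoding harder.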

9 Intel Pentium Pro Fetch/Decode Unit
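[Diagram: x86 macro-instruction bytes from the IFU enter a 16-byte instruction buffer, which feeds three decoders (0, 1, and 2) and the uROM. The decoders emit 4 uops, 1 uop, and 1 uop into the uop queue (6 entries). Branch address calculation feeds the next address calculation. Up to 3 uops are issued to dispatch.]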

10 Predecoding in the AMD K5
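[Diagram: 8 instruction bytes (64 bits) arrive from memory. The predecode logic attaches 5 predecode bits to each byte (Byte1 through Byte8), so 8 instruction bytes plus predecode bits (64 + 40 bits) are written into the I-cache. 16 instruction bytes plus predecode bits (128 + 80 bits) are read out to decode, translate, and dispatch, which produces up to 4 ROPs (ROP1 to ROP4).]
- Predecoding is also useful for RISC ISAs!
- Cost: cache size, refill time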
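The cache-size cost follows directly from the bit counts in the diagram. The helper below is a small sketch of that arithmetic; the function name `predecode_storage` is hypothetical.

```python
def predecode_storage(instr_bytes, predecode_bits_per_byte=5):
    """Return (instruction bits, predecode bits, overhead ratio)."""
    instr_bits = instr_bytes * 8
    predecode_bits = instr_bytes * predecode_bits_per_byte
    return instr_bits, predecode_bits, predecode_bits / instr_bits


refill = predecode_storage(8)    # what is written into the I-cache per refill
fetch = predecode_storage(16)    # what is read out per fetch
```

With 5 predecode bits per instruction byte, the I-cache has to store 62.5% more bits than the raw instruction bytes alone.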

11 Control Dependence

12 IBM’s Experience on Pipelined Processors [Agerwala and Cocke 1987]
Code characteristics (dynamic):
- loads: 25%
- stores: 15%
- ALU/RR: 40%
- branches: 20%
  - 1/3 unconditional (always taken)
  - 1/3 conditional taken
  - 1/3 conditional not taken
- unconditional: 100% schedulable
- conditional: 50% schedulable
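Combining these percentages gives the share of all dynamic instructions that are taken branches and the share that are branches which cannot be scheduled. This is just arithmetic on the figures above; the variable names are hypothetical.

```python
from fractions import Fraction

branches = Fraction(20, 100)                 # branches: 20% of instructions
unconditional = Fraction(1, 3)               # share of branches
conditional_taken = Fraction(1, 3)
conditional_not_taken = Fraction(1, 3)

# Taken branches: unconditional plus conditional taken.
taken = branches * (unconditional + conditional_taken)

# Unschedulable branches: 0% of unconditional, 50% of conditional.
conditional = conditional_taken + conditional_not_taken
unschedulable = branches * conditional * Fraction(1, 2)

print(f"taken branches:         {float(taken):.1%} of instructions")
print(f"unschedulable branches: {float(unschedulable):.1%} of instructions")
```

So about 2 in 15 dynamic instructions redirect the fetch stream, and about 1 in 15 is a branch that could not be scheduled.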

13 Control Flow Graph
Shows possible paths of control flow through basic blocks.
Control dependence: node X is control dependent on node Y if the computation in Y determines whether X executes.

14 Mapping CFG to Linear Instruction Sequence
[Diagram: alternative linear orderings of the control flow graph's basic blocks B, C, and D in memory.]
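The effect of the chosen ordering can be sketched by classifying each control flow edge as a fall-through or a taken branch. The diamond-shaped graph with blocks A, B, C, and D below is a hypothetical example, as is the function name `classify_edges`.

```python
def classify_edges(cfg, layout):
    """Split CFG edges into fall-through edges and edges needing a taken branch.

    cfg    : dict mapping each basic block to its list of successors
    layout : list of basic blocks in their linear memory order
    """
    position = {block: i for i, block in enumerate(layout)}
    fallthrough, taken = [], []
    for src, successors in cfg.items():
        for dst in successors:
            if position[dst] == position[src] + 1:
                fallthrough.append((src, dst))   # dst directly follows src
            else:
                taken.append((src, dst))         # control must jump to dst
    return fallthrough, taken


# Hypothetical diamond: A branches to B or C, both rejoin at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
layout_1 = classify_edges(cfg, ["A", "B", "C", "D"])
layout_2 = classify_edges(cfg, ["A", "C", "B", "D"])
```

Each ordering makes a different path through the graph sequential, so the better layout is the one that places the most frequently executed path on the fall-through edges.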

15 Branch Types and Implementation
Types of branches:
- Conditional or unconditional?
- Subroutine call (aka link): needs to save PC?
- How is the branch target computed?
  - Static target, e.g. immediate, PC-relative
  - Dynamic target, e.g. register indirect
Conditional branch architectures:
- Condition code 'N-Z-C-V', e.g. PowerPC
- General-purpose register, e.g. Alpha, MIPS
- Special-purpose register, e.g. Power's loop count
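The difference between static and dynamic targets can be sketched in a few lines. The functions below are hypothetical and make simplifying assumptions: instructions are 4 bytes, and a PC-relative offset is taken relative to the branch's own address (real ISAs differ on the base address).

```python
def pc_relative_target(pc, offset):
    """Static target: known from the instruction bits and the PC alone."""
    return pc + offset


def register_indirect_target(regs, reg):
    """Dynamic target: known only once the register value is available."""
    return regs[reg]


def next_pc(pc, taken, target, inst_bytes=4):
    """Fetch address after a branch at pc, given its direction and target."""
    return target if taken else pc + inst_bytes


loop_back = pc_relative_target(0x100, -16)                 # backward loop branch
ret_addr = register_indirect_target({"lr": 0x400}, "lr")   # return via a link register
```

A static target can be computed as soon as the branch is fetched and decoded, whereas a dynamic target has to wait for the register read, which is one reason the two kinds are treated differently.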

16 Condition Resolution

17 Target Address Generation

18 What’s So Bad About Branches?
Performance penalties:
- Use up execution resources
- Fragmentation of I-cache lines
- Disruption of sequential control flow
  - Need to determine branch direction (conditional branches)
  - Need to determine branch target
Robs instruction fetch bandwidth and ILP

19 Riseman and Foster’s Study
- 7 benchmark programs on CDC-3600
- Assume an infinite machine: infinite memory and instruction stack, register file, and functional units
- Consider only true dependences, at the data-flow limit
- If bounded to a single basic block, i.e. no bypassing of branches, then the maximum speedup is 1.72
- Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known, such that branches do not impede instruction execution)
[Table: Br. bypassed vs. max speedup]

