ECE 4100/6100 Advanced Computer Architecture
Lecture 6: Instruction Fetch
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering, Georgia Institute of Technology
Instruction Supply Issues
  [Figure: Instruction Fetch Unit -> Instruction buffer -> Execution Core]
  - Fetch throughput defines the maximum performance that can be achieved in later stages
  - Superscalar processors need to supply more than 1 instruction per cycle
  - Instruction supply is limited by
      - Misalignment of multiple instructions in a fetch group
      - Change of flow (interrupting instruction supply)
      - Memory latency and bandwidth
Aligned Instruction Fetching (4 instructions)
  - PC = ..xx000000
  - One 64B I-cache line holds A0..A15 in four rows (..00: A0-A3, ..01: A4-A7, ..10: A8-A11, ..11: A12-A15)
  - The row decoder can pull out one row at a time
  - Assume one fetch group = 16B; inst 1 to inst 4 are delivered in cycle n
Misaligned Fetch
  - PC = ..xx001000
  - Same 64B I-cache line (rows ..00: A0-A3, ..01: A4-A7, ..10: A8-A11, ..11: A12-A15); the fetch group starts in the middle of a row
  - A rotating network aligns the fetched instructions into inst 1 to inst 4 in cycle n
  - Example: IBM RS/6000
Split Cache Line Access
  - PC = ..xx111000
  - The fetch group starts at the end of cache line A (A0..A15) and continues into cache line B (B0..B7 shown)
  - The access is broken down into 2 physical accesses: inst 1, inst 2 in cycle n; inst 3, inst 4 in cycle n+1
Split Cache Line Access Miss
  - PC = ..xx111000
  - Cache line A hits, but cache line B misses (the figure shows cache line C in its place)
  - inst 1, inst 2 arrive in cycle n; inst 3, inst 4 arrive in cycle n+X
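The three fetch cases above (aligned, misaligned, split) can be told apart from the low PC bits alone. The sketch below is only an illustration under the slides' parameters (64B line, 16B fetch group); the 4-byte instruction width and the name `classify_fetch` are assumptions, not part of the lecture.

```python
LINE_BYTES = 64    # one I-cache line, as on the slides
GROUP_BYTES = 16   # one fetch group
INST_BYTES = 4     # assumed fixed-width instructions (16B / 4 instructions)


def classify_fetch(pc):
    """Classify the 16B fetch group that starts at byte address pc."""
    offset = pc % LINE_BYTES                  # byte offset inside the line
    first = offset // INST_BYTES              # index of the first instruction
    slots = [first + i for i in range(GROUP_BYTES // INST_BYTES)]
    insts_per_line = LINE_BYTES // INST_BYTES
    in_this_line = [s for s in slots if s < insts_per_line]
    in_next_line = [s - insts_per_line for s in slots if s >= insts_per_line]
    if in_next_line:
        kind = "split"        # needs two physical accesses
    elif offset % GROUP_BYTES == 0:
        kind = "aligned"      # exactly one row
    else:
        kind = "misaligned"   # two rows of the same line, needs rotation
    return kind, in_this_line, in_next_line
```

With the PCs used on the slides, `classify_fetch(0b001000)` reports a misaligned group made of A2..A5, and `classify_fetch(0b111000)` reports a split group made of A14, A15 followed by the first two instructions of the next line.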
High Bandwidth Instruction Fetching
  - Wider issue requires more instruction feed
  - Major challenge: fetching more than one non-contiguous basic block per cycle
  - Enabling techniques?
      - Predication
      - Branch alignment based on profiling
      - Other hardware solutions (branch prediction is a given)
  [Figure: control-flow graph of basic blocks BB1 to BB7]
Predication Example
  Source code:
      if (a[i+1] > a[i])
          a[i+1] = 0
      else
          a[i] = 0
  Typical assembly:
          lw   r2, [r1+4]
          lw   r3, [r1]
          blt  r3, r2, L1
          sw   r0, [r1]
          j    L2
      L1: sw   r0, [r1+4]
      L2:
  Assembly w/ predication:
          lw   r2, [r1+4]
          lw   r3, [r1]
          sgt  p4, r2, r3
    (p4)  sw   r0, [r1+4]
    (!p4) sw   r0, [r1]
  - Converts control dependency into data dependency
  - Enlarges basic block size
  - More room for scheduling
  - No fetch disruption
Collapsing Buffer [Conte et al. 95]
  - Goal: fetch multiple (often non-contiguous) instructions
  - Use an interleaved BTB to enable multiple branch predictions
  - Align instructions in the predicted sequential order
  - Use a banked I-cache for multiple line access
Collapsing Buffer
  [Figure: the Fetch PC drives the Interleaved BTB and Cache Bank 1 / Cache Bank 2; the bank outputs pass through the Interchange Switch and then the Collapsing Circuit]
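Collapsing Buffer Mechanism
  - The two banks supply the lines (A B C D) and (E F G H)
  - The Interleaved BTB supplies the valid instruction bits and the bank routing
  - The Interchange Switch puts the two lines in fetch order: A B C D E F G H
  - The Collapsing Circuit removes the invalid instructions: A B C E G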
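The collapsing step can be pictured as a filter over the ordered bank outputs. A minimal sketch follows; the function name `collapse` and the specific valid-bit patterns are assumptions chosen so that the result matches the A B C E G output shown on the slide.

```python
def collapse(lines, valid_bits):
    """Merge the valid instructions of several cache lines, in fetch order."""
    out = []
    for line, valid in zip(lines, valid_bits):
        # Keep only the instructions whose valid bit is set.
        out.extend(inst for inst, v in zip(line, valid) if v)
    return out


# Assumed valid bits: D is skipped in the first line, F and H in the second.
fetched = collapse(["ABCD", "EFGH"], [[1, 1, 1, 0], [1, 0, 1, 0]])
```

In hardware this is a shifting or crossbar network rather than a loop; the sketch only shows the input/output behavior.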
High Bandwidth Instruction Fetching
  - To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
  - This requires multiple branch predictions per cycle
  [Figure: control-flow graph of basic blocks BB1 to BB7]
Multiple Branch Predictor [YehMarrPatt'93]
  - Pattern History Table (PHT) design to support MBP
  - Based on global history only
  - The Branch History Register (BHR, bits bk ... b1) indexes the PHT to give the primary prediction p1
  - The secondary prediction p2 is selected using p1; the tertiary prediction is selected using p1 and p2
  - The BHR is updated with the predictions
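One way to read the scheme above is that each prediction is speculatively shifted into the global history before the next PHT lookup. The sketch below is a functional model under that reading, not the parallel hardware organization; `predict_three`, the 2-bit counter encoding, and the table contents are assumptions for illustration.

```python
def predict_three(pht, bhr, k):
    """Return (p1, p2, p3) from one PHT of 2-bit counters and a k-bit BHR."""
    mask = (1 << k) - 1
    preds = []
    history = bhr & mask
    for _ in range(3):
        taken = pht[history] >= 2          # counter values 2 and 3 predict taken
        preds.append(taken)
        # Speculatively shift the prediction in; the oldest history bit drops out.
        history = ((history << 1) | int(taken)) & mask
    return tuple(preds)


# Small example: 3-bit history, all counters strongly not-taken except two.
pht = [0] * 8
pht[0b101] = 3
pht[0b011] = 3
p1, p2, p3 = predict_three(pht, 0b101, 3)
```

A hardware PHT would read the candidate entries for p2 and p3 in the same cycle and use p1 and p2 as multiplexer selects, which gives the same result as the sequential loop here.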
Multiple Branch Prediction
  - The fetch address (br0, primary prediction) could be retrieved from the BTB
  - Predicted path: BB1 -> BB2 -> BB5
  - How to fetch BB2 and BB5? With the BTB? Can't: the branch PCs of br1 and br2 are not available when the MBP is made
  - Use a BAC design
  [Figure: the BTB entry points to BB1; br1 (2nd prediction) selects between BB2 and BB3; the 3rd prediction selects among BB4, BB5, BB6, BB7]
Branch Address Cache
  - Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions
  - Entry layout, indexed by the fetch address (from the BTB):
      - Tag (23 bits) with V (1 bit) and br (2 bits)
      - Taken Target Address and Not-Taken Target Address (30 bits each), each with V and br
      - T-T, T-N, N-T and N-N Addresses (30 bits each)
  - br: 2 bits for branch type (cond, uncond, return)
  - V: single valid bit (indicates whether a branch is hit in the sequence)
  - 212 bits per fetch address entry = 23 + 3*3 + 30*6
  - To make one more level of prediction, 8 more addresses must be cached (i.e. total = 14 addresses)
  - 464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8
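The two entry sizes above follow one pattern: every extra prediction level doubles the number of cached addresses, and V/br bits are kept for every block except those in the last level. A small sketch of that arithmetic, using the field widths from the slide (the function name `bac_entry_bits` is only illustrative):

```python
TAG_BITS = 23
ADDR_BITS = 30
VBR_BITS = 1 + 2   # valid bit + 2-bit branch type


def bac_entry_bits(extra_levels):
    """Bits per BAC entry when caching addresses for extra_levels more predictions."""
    addrs_per_level = [2 ** i for i in range(1, extra_levels + 1)]
    # V/br info is kept for the fetched block and for every cached address
    # except those in the last level.
    vbr_fields = 1 + sum(addrs_per_level[:-1])
    return TAG_BITS + VBR_BITS * vbr_fields + ADDR_BITS * sum(addrs_per_level)
```

`bac_entry_bits(2)` gives 212 and `bac_entry_bits(3)` gives 464, matching the slide; the entry roughly doubles with each added level, which is the main cost of the BAC approach.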
Caching Non-Consecutive Basic Blocks
  - Goal: high fetch bandwidth + low latency
  [Figure: BB3 BB5 BB1 BB2 BB4 as fetched from a conventional instruction cache, versus BB1 BB2 BB3 BB4 BB5 fetched from linear memory locations]
Trace Cache
  - Cache dynamic non-contiguous instructions (traces)
  - Traces cross multiple basic blocks
  - Need to predict multiple branches (MBP)
  [Figure: fetching the sequence A B C D E F G H I J takes 5 cycles from the I$, 3 cycles with a collapsing buffer, and 1 cycle from the trace cache (T$)]
Trace Cache [Rotenberg Bennett Smith '96]
  - Entry fields: Tag, Br flag, Br mask, Fall-thru Address, Taken Address
  - Br mask example "11, 1": 11 = 3 branches; 1 = the trace ends w/ a branch
  - Br flag example "10": 1st Br taken, 2nd Br not taken
  - On a T.C. hit, N instructions are supplied; on a T.C. miss, the line fill buffer is used (BB1, BB2, BB3 with Branch 1, Branch 2, Branch 3)
  - A trace caches at most (in the original paper)
      - M branches (M = 3 in all follow-up TC studies due to MBP), OR
      - N instructions (N = 16 in all follow-up TC studies)
  - The Fall-thru address is used if the last branch is predicted not taken
  - The MBP supplies the predictions for the M branches
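Trace Hit Logic
  - Fetch address A is compared with the Tag (match 1st block)
  - The Multi-BPred predictions are compared with the BF (branch flags) under the Mask (match remaining block(s))
  - Trace hit = the two match conditions ANDed
  - The next fetch address is selected between Fall-thru and Target
  [Figure example: Fetch: A; entry with Tag = A, BF = 10, Mask = 11,1, Fall-thru = X, Target = Y]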
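The hit check can be sketched as a tag compare plus a masked compare of predictions against the stored branch flags. This is a simplified model under stated assumptions: the class `TraceLine`, the function `trace_lookup`, the numeric addresses, and the rule that a trace not ending in a branch continues at the fall-thru address are all illustrative choices, not details confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass
class TraceLine:
    tag: int
    flags: tuple          # outcome of each embedded branch; True = taken
    num_branches: int     # first part of the branch mask
    ends_in_branch: bool  # second part of the branch mask
    fall_thru: int
    target: int


def trace_lookup(line, fetch_addr, preds):
    """Return (hit, next_fetch_addr) for one trace cache line."""
    # The last branch needs no flag when it ends the trace: nothing follows it.
    n_check = line.num_branches - 1 if line.ends_in_branch else line.num_branches
    same_path = tuple(preds[:n_check]) == tuple(line.flags[:n_check])
    if line.tag != fetch_addr or not same_path:
        return False, None
    if line.ends_in_branch and preds[line.num_branches - 1]:
        return True, line.target       # last branch predicted taken
    return True, line.fall_thru        # last branch predicted not taken


# Entry from the slide: Tag = A, BF = 10, Mask = 11,1 (A, X, Y are made-up numbers).
A, X, Y = 0x100, 0x200, 0x300
line = TraceLine(tag=A, flags=(True, False), num_branches=3,
                 ends_in_branch=True, fall_thru=X, target=Y)
```

With predictions (taken, not taken, taken) the lookup hits and continues at Y; changing the first prediction to not taken makes the same line miss even though the tag matches.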
Trace Cache Example
  - BB traversal path: ABDABDACDABDACDABDAC
  - Basic block sizes (as seen in the trace contents): A = 5 insts, B = 6 insts, C = 12 insts, D = 4 insts
  - A trace ends on one of three conditions
      - Cond 1: 3 branches
      - Cond 2: a trace cache line is filled (16 instructions)
      - Cond 3: Exit
  - Trace Cache (5 lines):
      1. A1-A5 B1-B6 D1-D4
      2. A1-A5 C1-C11
      3. C12 D1-D4 A1-A5
      4. B1-B6 D1-D4 A1-A5
      5. C1-C12 D1-D4
Trace Cache Example (continued)
  - BB traversal path: ABDABDACDABDACDABDAC
  - Trace Cache is Full (5 lines): (A1-A5 B1-B6 D1-D4), (A1-A5 C1-C11), (C12 D1-D4 A1-A5), (B1-B6 D1-D4 A1-A5), (C1-C12 D1-D4)
Trace Cache Example (continued)
  - BB traversal path: ABDABDACDABDACDABDAC
  - Trace cache contents: (A1-A5 B1-B6 D1-D4), (A1-A5 C1-C11), (C12 D1-D4 A1-A5), (B1-B6 D1-D4 A1-A5), (C1-C12 D1-D4)
  - How many hits?
  - What is the utilization?
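The trace boundaries in this example can be reproduced by walking the dynamic instruction stream and cutting it whenever one of the three conditions fires. The sketch below assumes the block sizes read off the listed traces (A = 5, B = 6, C = 12, D = 4) and that every basic block ends in a branch; `build_traces` is a hypothetical helper, and replacement policy and hit accounting are deliberately left out.

```python
BLOCK_SIZES = {"A": 5, "B": 6, "C": 12, "D": 4}   # inferred from the listed traces
MAX_INSTS = 16      # Cond 2: one trace cache line
MAX_BRANCHES = 3    # Cond 1


def build_traces(path):
    """Cut the dynamic instruction stream of path into traces."""
    traces, current, branches = [], [], 0
    for block in path:
        size = BLOCK_SIZES[block]
        for i in range(1, size + 1):
            current.append(f"{block}{i}")
            if i == size:
                branches += 1      # assumption: every block ends in a branch
            if len(current) == MAX_INSTS or branches == MAX_BRANCHES:
                traces.append(tuple(current))
                current, branches = [], 0
    if current:                    # Cond 3: the path exits
        traces.append(tuple(current))
    return traces


traces = build_traces("ABDABDACDABDACDABDAC")
distinct = list(dict.fromkeys(traces))    # first-seen order, duplicates dropped
```

Under these assumptions the first five distinct traces are the five lines shown on the slide. The walk also leaves a one-instruction trace (C12) at the exit, which the slide does not show. Counting hits and utilization on top of this depends on how lookups and replacements are modeled, so it is left as the exercise the slide poses.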
Redundancy
  - Duplication
      - Note that instructions appear only once in the I-cache
      - The same instruction can appear many times in the TC
  - Fragmentation
      - Occurs if 3 BBs hold fewer than 16 instructions
      - If a multiple-target branch (e.g. return, indirect jump or trap) is encountered, trace construction stops
      - Empty slots are wasted resources
  - Example: basic blocks A, B, C, D with 6, 4, 6, 3 instructions
      - They are spread across the traces (ABC) = 16 inst, (BCD) = 13 inst, (CDA) = 15 inst, (DAB) = 13 inst
      - Each instruction is duplicated 3 times in the trace cache
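The numbers in the example can be checked with a few lines of arithmetic. The block sizes below (A = 6, B = 4, C = 6, D = 3) are the ones that make the four trace lengths on the slide come out; the variable names are illustrative.

```python
SIZES = {"A": 6, "B": 4, "C": 6, "D": 3}
LINE_INSTS = 16
TRACES = ["ABC", "BCD", "CDA", "DAB"]

# Length of each trace = sum of the sizes of its basic blocks.
trace_lens = {t: sum(SIZES[b] for b in t) for t in TRACES}
stored = sum(trace_lens.values())          # instructions held in the TC
unique = sum(SIZES.values())               # instructions held once in the I-cache
empty_slots = LINE_INSTS * len(TRACES) - stored
```

The trace cache stores 57 instructions for 19 unique ones (duplication factor 3), and 7 of the 64 slots stay empty (fragmentation).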
Indexability
  - The TC saved traces (EAC) and (BCD)
  - Path: (EAC) to (D)
  - Cannot index the interior block (D)
  - Can cause duplication
  - Need partial matching: (BCD) is cached, but only (BC) is needed
  [Figure: control-flow graph over blocks A, B, C, D, E, G and the trace cache contents]
Pentium 4 (NetBurst) Trace Cache
  - Front end: Front-end BTB, iTLB and Prefetcher, fed from the L2 Cache
  - No I$ !!
  - The Decoder fills the Trace $ with decoded instructions
  - Trace $ BTB: trace-based prediction (predict next-trace, not next-PC)
  - The Trace $ feeds rename, execute, etc.