1 ECE 4100/6100 Advanced Computer Architecture
Lecture 6: Instruction Fetch
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering, Georgia Institute of Technology
2 Instruction Supply Issues
Fetch throughput defines the maximum performance that can be achieved in later stages.
Superscalar processors need to supply more than 1 instruction per cycle.
Instruction supply is limited by:
– Misalignment of multiple instructions in a fetch group
– Change of flow (interrupting instruction supply)
– Memory latency and bandwidth
[Figure: Instruction Fetch Unit feeding an instruction buffer, which feeds the Execution Core]
3 Aligned Instruction Fetching (4 instructions)
Assume one fetch group = 16B. One 64B I-cache line holds sixteen 4B instructions (A0-A15) as four 16B rows; the row decoder selects one row per cycle.
With PC = ..xx000000, cycle n can pull out one row at a time (inst 1-4 = A0-A3).
[Figure: row decoder selecting one 16B row of a 64B I-cache line]
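The aligned case on the slide can be sketched as follows. Sizes follow the slide (64B line, 4B instructions, 16B fetch group); the function and names are illustrative, not from the lecture.

```python
# Sketch of aligned fetch-group selection: a 64B I-cache line holds
# 16 four-byte instructions as four 16B rows; the row decoder uses
# PC bits [5:4] to pull out one row (= one fetch group) per cycle.
LINE_SIZE, GROUP_SIZE, INST_SIZE = 64, 16, 4

def fetch_group(line, pc):
    """Return the row of 4 instructions selected by an aligned PC."""
    offset = pc % LINE_SIZE               # byte offset within the line
    row = offset // GROUP_SIZE            # PC bits [5:4]
    start = row * (GROUP_SIZE // INST_SIZE)
    return line[start:start + 4]

line = [f"A{i}" for i in range(16)]       # one cache line, A0..A15
print(fetch_group(line, 0b000000))        # -> ['A0', 'A1', 'A2', 'A3']
print(fetch_group(line, 0b110000))        # -> ['A12', 'A13', 'A14', 'A15']
```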
4 Misaligned Fetch
With PC = ..xx001000 the fetch group straddles two rows of the 64B I-cache line, so a rotating network realigns inst 1-4 into program order in cycle n (as in the IBM RS/6000).
[Figure: row decoder plus rotating network extracting the fetch group across two rows]
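A minimal sketch of this banked-access-plus-rotation idea, in the spirit of the slide's RS/6000 example: each of four banks picks its own row so a misaligned group can be read in one cycle, then a rotation restores program order. Bank count and sizes follow the slide; the indexing details are assumptions.

```python
# Sketch of misaligned fetch with a rotating (alignment) network:
# a 64B line is held as 4 rows x 4 banks of 4B instructions; each
# bank independently selects the row holding its next needed word,
# and the rotator puts the four results back in program order.
def misaligned_fetch(line, pc):
    """Fetch 4 sequential instructions even when the PC is not
    16B-aligned (assumes the group does not leave the line)."""
    word = (pc % 64) // 4                 # starting word index in line
    raw = []
    for bank in range(4):
        # Bank b holds the words whose index % 4 == b; banks "behind"
        # the start position read from the next row.
        row = (word // 4) + (0 if bank >= word % 4 else 1)
        raw.append(line[row * 4 + bank])
    r = word % 4
    return raw[r:] + raw[:r]              # rotate into program order

line = [f"A{i}" for i in range(16)]
print(misaligned_fetch(line, 8))          # -> ['A2', 'A3', 'A4', 'A5']
```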
5 Split Cache Line Access
With PC = ..xx111000 the fetch group spans cache lines A and B, so it must be broken down into 2 physical accesses: inst 1-2 from line A in cycle n, inst 3-4 from line B in cycle n+1.
[Figure: fetch group crossing from line A (A0-A15) into line B (B0-B7)]
6 Split Cache Line Access Miss
Same split access (PC = ..xx111000), but cache line B misses: inst 1-2 still arrive from line A in cycle n, while inst 3-4 are delayed until cycle n+X, after the miss is serviced.
[Figure: line A accessed in cycle n; the second line misses, completing in cycle n+X]
7 High Bandwidth Instruction Fetching
Wider issue requires more instruction feed.
Major challenge: fetch more than one non-contiguous basic block per cycle.
Enabling techniques (branch prediction is a given):
– Predication
– Branch alignment based on profiling
– Other hardware solutions
[Figure: control-flow graph of basic blocks BB1-BB7]
8 Predication
Convert control dependency into data dependency.
Enlarge basic block size:
– More room for scheduling
– No fetch disruption

Source code:
if (a[i+1] > a[i]) a[i+1] = 0; else a[i] = 0;

Typical assembly:
    lw r2, [r1+4]
    lw r3, [r1]
    blt r3, r2, L1
    sw r0, [r1]
    j L2
L1: sw r0, [r1+4]
L2:

Assembly w/ predication:
      lw r2, [r1+4]
      lw r3, [r1]
      sgt pr4, r2, r3
(p4)  sw r0, [r1+4]
(!p4) sw r0, [r1]
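The if-conversion on the slide can be modeled directly: one predicate guards both stores, so both paths sit in a single basic block and fetch is never redirected. The Python `if`s here stand in for predicated stores, as the comments note.

```python
# Model of the slide's predication example: the branch is replaced
# by a predicate (pr4) that guards the two stores, so control
# dependency becomes data dependency.
def predicated(a, i):
    """Equivalent of: if a[i+1] > a[i]: a[i+1] = 0 else: a[i] = 0"""
    p4 = a[i + 1] > a[i]        # sgt pr4, r2, r3
    if p4:                      # (p4)  sw r0, [r1+4]  (predicated store)
        a[i + 1] = 0
    if not p4:                  # (!p4) sw r0, [r1]    (predicated store)
        a[i] = 0
    return a

print(predicated([3, 5], 0))    # -> [3, 0]
print(predicated([5, 3], 0))    # -> [0, 3]
```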
9 Collapsing Buffer [ISCA '95]
Goal: fetch multiple (often non-contiguous) instructions per cycle.
– Use an interleaved BTB to enable multiple branch predictions
– Align instructions in the predicted sequential order
– Use a banked I-cache for multiple line accesses
10 Collapsing Buffer
[Figure: fetch PC feeds an interleaved BTB and two I-cache banks; an interchange switch and a collapsing circuit merge the fetched lines]
11 Collapsing Buffer Mechanism
The interleaved BTB routes fetch addresses A and E to the two cache banks, which return EFGH and ABCD. The interchange switch reorders them into predicted sequential order (ABCDEFGH); the collapsing circuit then uses the valid instruction bits to squeeze out D, F, and H, yielding ABCEG.
[Figure: bank routing, interchange switch, and collapsing circuit]
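The final collapsing step is easy to sketch: given the reordered instructions and the valid bits from the branch predictions, keep only the on-path instructions. The valid-bit pattern below reproduces the slide's ABCEG example; the list-comprehension form is an assumption about how to express the circuit in software.

```python
# Sketch of the collapsing circuit: after the interchange switch has
# ordered the two fetched lines as A..H, the valid instruction bits
# (from the interleaved BTB's predictions) squeeze out off-path
# instructions D, F, and H.
insts = ["A", "B", "C", "D", "E", "F", "G", "H"]
valid = [1, 1, 1, 0, 1, 0, 1, 0]          # valid instruction bits
collapsed = [i for i, v in zip(insts, valid) if v]
print(collapsed)                           # -> ['A', 'B', 'C', 'E', 'G']
```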
12 High Bandwidth Instruction Fetching
To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines) per cycle, which requires multiple branch predictions.
[Figure: control-flow graph of basic blocks BB1-BB7]
13 Multiple Branch Predictor [Yeh, Marr & Patt, ICS '93]
A Pattern History Table (PHT) design that supports multiple branch prediction (MBP), based on global history only.
The Branch History Register (BHR, bits b_k..b_1) indexes the PHT for the primary prediction p1; p1 then selects the secondary prediction p2, and p1p2 selects the tertiary. The PHT is updated with resolved outcomes.
[Figure: BHR indexing the PHT; p1 and p1p2 steering the secondary and tertiary predictions]
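One way to read the slide's diagram is the sketch below: each predicted outcome is shifted into a speculative copy of the global history to index the next prediction. The BHR length, 2-bit counters, and this speculative-history indexing are assumptions for illustration, not details from the lecture.

```python
# Sketch of a global-history multiple-branch predictor: the BHR
# indexes a PHT of 2-bit counters for the primary prediction p1;
# appending p1 (then p2) to a speculative history yields the
# secondary and tertiary predictions.
K = 8                                   # BHR length (assumption)

class MBP:
    def __init__(self):
        self.bhr = 0
        self.pht = [2] * (1 << K)       # 2-bit counters, weakly taken

    def predict(self):
        p1 = self.pht[self.bhr] >= 2
        h2 = ((self.bhr << 1) | p1) & ((1 << K) - 1)
        p2 = self.pht[h2] >= 2
        h3 = ((h2 << 1) | p2) & ((1 << K) - 1)
        p3 = self.pht[h3] >= 2
        return p1, p2, p3               # primary, secondary, tertiary

    def update(self, taken):
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | taken) & ((1 << K) - 1)

m = MBP()
print(m.predict())                      # -> (True, True, True) initially
```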
14 Multiple Branch Prediction
The fetch address can be retrieved from the BTB (br0's primary prediction).
Predicted path: BB1 -> BB2 -> BB5 (br1 taken is the 2nd prediction, br2 not-taken the 3rd).
How to fetch BB2 and BB5 -- from the BTB?
– Can't: the branch PCs of br1 and br2 are not available when the MBP is made.
– Use a BAC design instead.
[Figure: control-flow tree BB1 (br1) -> BB2 (br2)/BB3 -> BB4-BB7, with a BTB entry supplying the fetch address]
15 Branch Address Cache
Use a Branch Address Cache (BAC): keep 6 possible fetch addresses to cover 2 more predictions beyond the primary.
– br: 2 bits for branch type (cond, uncond, return)
– V: a single valid bit (indicates whether the sequence hits a branch)
Entry layout: 23-bit tag (+ V, br), taken and not-taken target addresses, and the four second-level addresses (T-T, T-N, N-T, N-N), 30 bits each — 212 bits per fetch-address entry. The fetch address comes from the BTB.
To make one more level of prediction:
– Need to cache another 8 addresses (total = 14 addresses)
– 464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8
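The slide's per-entry size arithmetic checks out; a quick verification of the 464-bit figure, with the field widths taken directly from the slide:

```python
# Verifying the slide's BAC entry size: a 23-bit tag plus 3 bits of
# V/br, six 30-bit fetch addresses each with a valid bit and 2-bit
# branch type, and 8 bare 30-bit addresses for a third prediction
# level (6 + 8 = 14 addresses in total).
tag_bits, addr_bits, v_bits, br_bits = 23, 30, 1, 2

entry = (tag_bits + v_bits + br_bits) * 1 \
      + (addr_bits + v_bits + br_bits) * (2 + 4) \
      + addr_bits * 8
print(entry)                            # -> 464, matching the slide
```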
16 Caching Non-Consecutive Basic Blocks
Goal: high fetch bandwidth + low latency.
In a conventional instruction cache, BB1-BB5 sit in non-contiguous locations and need multiple fetches; caching them in linear memory order (BB1 BB2 BB3 BB4 BB5) lets one access supply the whole predicted sequence.
[Figure: the same basic blocks scattered across the I-cache vs. laid out linearly]
17 Trace Cache
Cache dynamic, non-contiguous instruction sequences (traces) that cross multiple basic blocks; requires multiple branch prediction (MBP).
Example with blocks ABCD, EFG, and HIJK scattered across I-cache lines:
– I-cache fetch: 5 cycles
– Collapsing-buffer fetch: 3 cycles
– Trace cache fetch (ABCDEFGHIJ in one T$ line): 1 cycle
18 Trace Cache [Rotenberg, Bennett & Smith, MICRO '96]
A trace caches at most (in the original paper):
– M branches (M = 3 in all follow-up TC studies, due to MBP)
– N instructions (N = 16 in all follow-up TC studies)
Entry format: tag, branch flags, branch mask, fall-through address (used if the last branch is predicted not taken), taken address.
Example: branch flags 10 = 1st branch taken, 2nd not taken; branch mask "11,1" = 3 branches, and the trace ends with a branch.
A line-fill buffer builds traces on a T.C. miss; on a T.C. hit, up to N instructions spanning M branches are supplied.
[Figure: trace cache entry format and the fill path from BB1-BB3]
19 Trace Hit Logic
Example entry: Tag = A, BF = 10, Mask = 11,1, Fall-thru = X, Target = Y.
Fetch address A is compared against the tag (matching the 1st block); the multiple-branch predictor's T/N directions are ANDed against the mask/flags to match the remaining blocks. On a trace hit, the next fetch address is the taken target if the ending branch is predicted taken, else the fall-through address.
[Figure: tag compare and branch-match logic gating the next-fetch-address mux]
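The hit condition can be sketched as a small function: tag match, plus agreement between the stored branch flags and the predictor's directions for the interior branches, with the last prediction choosing the next fetch address. The dictionary field names are assumptions based on the slide's entry format.

```python
# Sketch of trace-cache hit logic: hit iff (1) fetch PC matches the
# trace tag and (2) the MBP's directions match the stored branch
# flags for every interior branch; the trace-ending branch's
# prediction then picks taken-target vs. fall-through.
def trace_hit(entry, fetch_pc, predictions):
    if entry["tag"] != fetch_pc:
        return False, None
    n = entry["num_branches"]              # decoded from the branch mask
    for flag, pred in zip(entry["br_flags"], predictions[:n - 1]):
        if flag != pred:                   # interior branch mismatch
            return False, None
    nxt = entry["target"] if predictions[n - 1] else entry["fall_thru"]
    return True, nxt

entry = {"tag": 0xA0, "num_branches": 3, "br_flags": [True, False],
         "target": 0x200, "fall_thru": 0x180}
print(trace_hit(entry, 0xA0, [True, False, True]))   # -> (True, 512)
```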
20 Trace Cache Example
Loop: A (5 insts) branches to B (6 insts) or C (12 insts); both rejoin at D (4 insts), which loops back to A or exits.
BB traversal path: ABDABDACDABDACDABDAC
A trace line (16 instruction slots) is completed when:
– Cond 1: it contains 3 branches
– Cond 2: the trace cache line fills
– Cond 3: Exit is reached
Dynamic stream: A1..A5 B1..B6 D1..D4 A1..A5 C1..C12 D1..D4 ...
[Figure: the stream being packed into a 5-line trace cache]
21 Trace Cache Example (cont.)
The dynamic stream continues to be packed into the 5-line trace cache under the same conditions (3 branches, line full, Exit).
[Figure: successive trace lines carved out of the instruction stream]
22 Trace Cache Example (cont.)
The five lines fill with the traces (A B D), (A C1-C11), (C12 D A), (B D A), and (C1-C12 D1-D4): the trace cache is full.
[Figure: all 5 trace cache lines occupied]
23 Trace Cache Example (cont.)
With the trace cache full, the rest of the path ABDABDACDABDACDABDAC is fetched against the cached traces.
How many hits? What is the utilization?
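The slide's exercise can be worked through with a small simulation. The three line-ending conditions come from the slides; the replacement policy (no replacement once the 5 lines are full) and the mid-block split when a line fills are assumptions, so the printed counts are illustrative rather than the lecture's official answer.

```python
# Simulating the slide's example: blocks A=5, B=6, C=12, D=4 insts,
# path ABDABDACDABDACDABDAC, a 5-line trace cache, 16-slot lines
# that close on 3 branches (block ends), a full line, or Exit.
SIZES = {"A": 5, "B": 6, "C": 12, "D": 4}
PATH = "ABDABDACDABDACDABDAC"

# Flatten the path into labeled instructions (A1..A5, B1..B6, ...).
stream = [f"{b}{i + 1}" for b in PATH for i in range(SIZES[b])]
block_end, pos = set(), 0                  # positions carrying a branch
for b in PATH:
    pos += SIZES[b]
    block_end.add(pos - 1)

cache, hits, fetches, i = [], 0, 0, 0
while i < len(stream):
    trace, branches, j = [], 0, i          # form the trace starting at i
    while j < len(stream) and len(trace) < 16 and branches < 3:
        trace.append(stream[j])
        if j in block_end:
            branches += 1
        j += 1
    trace = tuple(trace)
    fetches += 1
    if trace in cache:
        hits += 1
    elif len(cache) < 5:                   # no replacement once full
        cache.append(trace)
    i += len(trace)

util = sum(len(t) for t in cache) / (16 * 5)
print(fetches, hits, util)                 # -> 9 3 0.9 (under these assumptions)
```

Under these assumptions, 3 of the 9 trace fetches hit, and the 5 lines hold 72 of 80 instruction slots (90% utilization).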
24 Redundancy
Duplication:
– Instructions appear only once in the I-cache, but the same instruction appears many times in the TC.
Fragmentation:
– If 3 BBs total fewer than 16 instructions, slots are left empty.
– If a multiple-target branch (e.g. return, indirect jump, or trap) is encountered, trace construction stops.
– Empty slots waste resources.
Example: a loop A(6) B(4) C(6) D(3) is broken into the traces (ABC) = 16 insts, (BCD) = 13 insts, (CDA) = 15 insts, (DAB) = 13 insts, duplicating each instruction 3 times.
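A quick check of the duplication example, with the block sizes inferred from the trace lengths on the slide (A=6, B=4, C=6, D=3):

```python
# Checking the slide's duplication example: the loop traces (ABC),
# (BCD), (CDA), (DAB) store every instruction 3 times, versus one
# copy of each in a conventional I-cache.
sizes = {"A": 6, "B": 4, "C": 6, "D": 3}
traces = [("A", "B", "C"), ("B", "C", "D"), ("C", "D", "A"), ("D", "A", "B")]
lengths = [sum(sizes[b] for b in t) for t in traces]
print(lengths)                          # -> [16, 13, 15, 13]
unique = sum(sizes.values())            # 19 instructions in the I-cache
cached = sum(lengths)                   # 57 trace-cache slots
print(cached // unique)                 # -> 3: each instruction stored 3x
```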
25 Indexability
Suppose the TC saved the traces (EAC) and (BCD), and the path (EAC) then (D) is followed:
– The interior block (D) of (BCD) cannot be indexed, which can cause duplication.
– Partial matching is needed: (BCD) is cached, but only (BC) may be needed.
[Figure: control-flow graph E/A/B/C/D/G and the cached traces]
26 Pentium 4 (NetBurst) Trace Cache
No I$ !!
The front-end BTB, iTLB, and prefetcher feed the decoder from the L2 cache; the trace cache stores decoded instructions and has its own BTB, feeding rename/execute/etc.
Trace-based prediction: predict the next trace, not the next PC.