ECE 4100/6100 Advanced Computer Architecture
Lecture 6: Instruction Fetch
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering, Georgia Institute of Technology

2 Instruction Supply Issues
Fetch throughput defines the maximum performance that can be achieved in later stages. Superscalar processors need to supply more than 1 instruction per cycle. Instruction supply is limited by:
–Misalignment of multiple instructions in a fetch group
–Change of flow (interrupting instruction supply)
–Memory latency and bandwidth
[Figure: instruction fetch unit feeding the execution core through an instruction buffer]

3 Aligned Instruction Fetching (4 instructions)
Assume one fetch group = 16B. When the group is aligned, all four instructions fetched in cycle n sit in the same row of the I-cache, so the fetch unit can pull out one row at a time.
[Figure: row decoder indexing one 64B I-cache line holding instructions A0 through A15; the PC selects the row supplying inst 1 through inst 4]

4 Misaligned Fetch
If the fetch group is not aligned, the four instructions of cycle n span two rows of the same 64B I-cache line. The IBM RS/6000 handles this with independently indexed banks and a rotating network that puts the instructions back into program order.
[Figure: row decoder, 64B I-cache line A0 through A15, and rotating network delivering a misaligned group inst 1 through inst 4 (IBM RS/6000)]
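A minimal Python sketch of the realignment idea, assuming 4-byte instructions, a 4-wide fetch group, and a group that stays inside one 64B line; the per-bank organization and names here are illustrative, not the exact RS/6000 arrays:

# Sketch: bank b holds the instructions whose word index is congruent to b (mod 4).
# For a misaligned group, the banks holding the earlier word indices read one row
# further; placing each bank's output at its program-order position performs the
# "rotation".
def fetch_misaligned(line, pc):
    """line: 16 instruction labels of one 64B cache line; pc: byte address.
    Returns the 4 consecutive instructions starting at pc (group must not
    cross the line boundary)."""
    first = (pc % 64) // 4                      # word index of the first instruction
    group = [None] * 4
    for b in range(4):                          # one read per bank
        idx = first + ((b - first) % 4)         # smallest index >= first held by bank b
        group[idx - first] = line[idx]
    return group

line = [f"A{i}" for i in range(16)]
print(fetch_misaligned(line, 0x18))             # ['A6', 'A7', 'A8', 'A9']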

5 Split Cache Line Access
If the fetch group straddles a cache line boundary, the fetch must be broken down into 2 physical accesses: inst 1 and inst 2 come from cache line A in cycle n, and inst 3 and inst 4 come from cache line B in cycle n+1.
[Figure: fetch group spanning the end of cache line A (A0 through A15) and the start of cache line B (B0 through B7)]
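A small sketch, assuming the sizes used on these slides (4-byte instructions, 16B fetch groups, 64B lines), that classifies a fetch group by its starting PC:

LINE_BYTES, GROUP_BYTES = 64, 16

def classify_fetch(pc):
    start = pc % LINE_BYTES                  # offset of the group within its line
    end = start + GROUP_BYTES - 1            # offset of the group's last byte
    if end >= LINE_BYTES:
        return "split: crosses into the next cache line (2 physical accesses)"
    if start // GROUP_BYTES != end // GROUP_BYTES:
        return "misaligned: spans two rows of the same line"
    return "aligned: one row, one access"

for pc in (0x40, 0x48, 0x78):
    print(hex(pc), "->", classify_fetch(pc))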

6 Split Cache Line Access Miss
The second half of a split access can also miss in the I-cache: inst 1 and inst 2 are read from cache line A in cycle n, but the line holding inst 3 and inst 4 (cache line B) misses, so the rest of the fetch group does not arrive until cycle n+X.
[Figure: PC = ..xx111000; split access in which cache line B misses while another line (C) is resident]

7 High Bandwidth Instruction Fetching
Wider issue means more instructions must be fed per cycle. The major challenge is to fetch more than one non-contiguous basic block per cycle. Enabling techniques:
–Predication
–Branch alignment based on profiling
–Other hardware solutions (branch prediction is a given)
[Figure: control-flow graph of basic blocks BB1 through BB7]

8 Predication Example
Convert control dependency into data dependency. Enlarge basic block size:
–More room for scheduling
–No fetch disruption

Source code:
  if (a[i+1] > a[i]) a[i+1] = 0; else a[i] = 0;

Typical assembly:
  lw r2, [r1+4]
  lw r3, [r1]
  blt r3, r2, L1
  sw r0, [r1]
  j L2
L1: sw r0, [r1+4]
L2:

Assembly with predication:
  lw r2, [r1+4]
  lw r3, [r1]
  sgt pr4, r2, r3
  (p4) sw r0, [r1+4]
  (!p4) sw r0, [r1]

9 Collapsing Buffer [ISCA '95]
To fetch multiple (often non-contiguous) instructions per cycle:
–Use an interleaved BTB to enable multiple branch predictions
–Align the instructions in the predicted sequential order
–Use a banked I-cache for multiple line accesses

10 Collapsing Buffer
[Figure: fetch PC driving an interleaved BTB and two I-cache banks; the bank outputs pass through an interchange switch and then the collapsing circuit]

11 Collapsing Buffer Mechanism
The interleaved BTB is probed for both blocks (A and E); bank routing steers each access to its cache bank, and the banks read out the groups ABCD and EFGH. The interchange switch restores the predicted order (ABCDEFGH), and the collapsing circuit uses the valid instruction bits to squeeze out the off-path instructions D, F and H, delivering ABCEG.
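A minimal sketch of the collapsing step itself, using the ABCDEFGH example on this slide (the valid bits mark which slots lie on the predicted path):

def collapse(slots, valid):
    # Keep only the instructions whose valid bit is set, in order.
    return [inst for inst, v in zip(slots, valid) if v]

slots = list("ABCDEFGH")                  # output of the interchange switch
valid = [1, 1, 1, 0, 1, 0, 1, 0]          # D, F and H are off the predicted path
print(collapse(slots, valid))             # ['A', 'B', 'C', 'E', 'G']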

12 High Bandwidth Instruction Fetching
To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines) per cycle, which requires multiple branch predictions.
[Figure: control-flow graph of basic blocks BB1 through BB7]

13 Multiple Branch Predictor [Yeh, Marr, Patt ICS '93]
A Pattern History Table (PHT) design that supports multiple branch prediction (MBP), based on global history only.
[Figure: Branch History Register (BHR) bits b_k ... b_1 index the PHT, which supplies the primary prediction p1, the secondary prediction (selected using p1), and the tertiary prediction (selected using p1 and p2); resolved outcomes update the BHR]
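A minimal sketch of producing three predictions from one global history in a single step; this illustrates the idea rather than the exact PHT banking of the ICS '93 design (history length and counter initialization are arbitrary):

HIST_BITS = 8
MASK = (1 << HIST_BITS) - 1

class MultiBranchPredictor:
    def __init__(self):
        self.bhr = 0                           # global Branch History Register
        self.pht = [2] * (1 << HIST_BITS)      # 2-bit counters, weakly taken

    def predict3(self):
        """Primary, secondary and tertiary predictions: each later prediction
        indexes the PHT with the history speculatively extended by the
        earlier predictions."""
        hist, preds = self.bhr, []
        for _ in range(3):
            taken = self.pht[hist] >= 2
            preds.append(taken)
            hist = ((hist << 1) | int(taken)) & MASK
        return preds

    def update(self, outcome):
        # Train the counter that produced the primary prediction, then shift
        # the resolved outcome into the global history.
        ctr = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, ctr + 1) if outcome else max(0, ctr - 1)
        self.bhr = ((self.bhr << 1) | int(outcome)) & MASK

mbp = MultiBranchPredictor()
print(mbp.predict3())                          # [True, True, True] from the initial counters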

14 Multiple Branch Prediction
The fetch address can be retrieved from the BTB (it is the BTB entry for br0, the primary prediction). Suppose the predicted path is BB1 → BB2 → BB5. How do we fetch BB2 and BB5? From the BTB?
–Can't: the branch PCs of br1 and br2 are not available when the multiple branch prediction is made
–Use a BAC (Branch Address Cache) design instead
[Figure: control-flow graph where BB1 ends in br1 (taken to BB2, the 2nd prediction; not taken to BB3) and BB2 ends in br2 (taken to BB4; not taken to BB5, the 3rd prediction); BB6 and BB7 follow BB3]

15 Branch Address Cache
Use a Branch Address Cache (BAC): keep 6 possible fetch addresses per entry to support 2 more levels of prediction.
–br: 2 bits for branch type (cond, uncond, return)
–V: single valid bit (indicates whether the address hits a branch in the sequence)
Entry layout: a 23-bit tag matched against the fetch address from the BTB, the Taken and Not-Taken target addresses (level 1), and the T-T, T-N, N-T and N-N addresses (level 2), each 30 bits, for 212 bits per fetch-address entry.
To make one more level of prediction:
–Need to cache another 8 addresses (i.e. 14 addresses in total)
–464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8
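A quick check of the storage arithmetic on this slide; the grouping below, with the V and br bits attached only to the addresses that lead to further predictions, is one reading that is consistent with the slide's totals:

bits_2level = (23 + 3) + (30 + 3) * 2 + 30 * 4             # tag + level-1 + level-2 addresses
bits_3level = (23 + 3) * 1 + (30 + 3) * (2 + 4) + 30 * 8   # add 8 level-3 addresses
print(bits_2level, bits_3level)                             # 212 464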

16 Caching Non-Consecutive Basic Blocks
In a conventional instruction cache, the predicted path BB1, BB2, BB3, BB4, BB5 is fetched from non-consecutive locations; caching the same blocks in linear (consecutive) locations gives high fetch bandwidth at low latency.
[Figure: the same basic blocks fetched from a conventional instruction cache vs. from linear memory locations]

17 Trace Cache
Cache dynamic, non-contiguous instruction sequences (traces) that cross multiple basic blocks. Requires predicting multiple branches per cycle (MBP).
[Figure: fetching the dynamic sequence A through J: from the I-cache it takes 5 cycles, with a collapsing buffer 3 cycles, and from the trace cache (T$), which holds ABCDEFGHIJ as one trace, 1 cycle]

18 Trace Cache [Rotenberg, Bennett, Smith MICRO '96]
A trace line caches at most (in the original paper):
–M branches (M = 3 in all follow-up TC studies, due to MBP)
–N instructions (N = 16 in all follow-up TC studies)
Each line holds a tag (the fetch address), branch flags (e.g. 1st branch taken, 2nd branch not taken), a branch mask ("11, 1": 3 branches, and the trace ends with a branch), a fall-through address, and a taken address; the fall-through address is used if the last branch is predicted not taken. On a T.C. hit, up to N instructions are delivered in one cycle; on a T.C. miss, the line-fill buffer constructs a new trace from the fetched basic blocks (BB1, BB2, BB3) under the directions supplied by the MBP.
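A minimal sketch of such a trace line and its fill-termination rule, with illustrative field names (not the exact bit layout of the MICRO '96 paper):

from dataclasses import dataclass, field
from typing import List

N_INSTS, M_BRANCHES = 16, 3

@dataclass
class TraceLine:
    tag: int                                    # fetch address that starts the trace
    insts: List[str] = field(default_factory=list)
    br_flags: List[bool] = field(default_factory=list)  # taken/not-taken of embedded branches
    n_branches: int = 0                         # encoded by the branch mask
    ends_in_branch: bool = False                # also part of the branch mask
    fall_thru: int = 0                          # next fetch address if the last branch falls through
    target: int = 0                             # next fetch address if the last branch is taken

    def must_terminate(self):
        # The line-fill buffer closes the trace at N = 16 instructions or M = 3 branches.
        return len(self.insts) >= N_INSTS or self.n_branches >= M_BRANCHES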

19 Trace Hit Logic
[Figure: a trace line with Tag = A, branch flags (BF) = 10, Mask = 11,1, Fall-thru = X, Target = Y. The fetch address is compared against the tag ("match 1st block"); the multi-branch predictor's T/N outcomes are compared against the branch flags ("match remaining block(s)"); the AND of the two comparisons signals a trace hit, and a mux picks the next fetch address, Target Y or Fall-thru X, based on the prediction for the last branch]
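A minimal sketch of the hit check matching the figure's structure (tag compare AND branch-flag compare, then a mux between target and fall-through); the trace-line fields are the illustrative ones used above, recreated here so the snippet stands alone:

from types import SimpleNamespace

def trace_hit(line, fetch_addr, predictions):
    """predictions: directions from the multi-branch predictor, oldest first.
    line.br_flags records the directions of the trace's branches except the
    last one; the prediction for the last branch only picks the next PC."""
    if line.tag != fetch_addr:                          # match 1st block
        return False, None
    k = len(line.br_flags)
    if list(predictions[:k]) != list(line.br_flags):    # match remaining block(s)
        return False, None
    last_taken = predictions[k] if len(predictions) > k else False
    return True, (line.target if last_taken else line.fall_thru)

line = SimpleNamespace(tag=0xA00, br_flags=[True, False],
                       fall_thru=0xA40, target=0xB80)
hit, next_pc = trace_hit(line, 0xA00, [True, False, True])
print(hit, hex(next_pc))        # True 0xb80: trace hit, last branch predicted taken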

20 Trace Cache Example
Loop body: block A (5 insts) branches to B (6 insts) or C (12 insts); both lead to D (4 insts), which loops back to A or exits.
BB traversal path: ABDABDACDABDACDABDAC
The trace cache has 5 lines, and a trace is terminated by one of three conditions: Cond 1: 3 branches; Cond 2: a full trace cache line (16 instructions); Cond 3: loop exit.
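An illustrative sketch of the line-fill segmentation under these termination conditions, using the block sizes from this example (A = 5, B = 6, C = 12, D = 4, each block ending in a branch); it shows how a trace can begin in the middle of a block, as with C12 here:

SIZES = {"A": 5, "B": 6, "C": 12, "D": 4}

def build_traces(path, max_insts=16, max_branches=3):
    traces, cur, n_br = [], [], 0
    for blk in path:
        for i in range(1, SIZES[blk] + 1):
            cur.append(f"{blk}{i}")
            n_br += (i == SIZES[blk])            # the last instruction of a block is its branch
            if len(cur) == max_insts or n_br == max_branches:
                traces.append(cur)               # Cond 1 or Cond 2: close the trace
                cur, n_br = [], 0
    if cur:
        traces.append(cur)                       # Cond 3: trace ended by the exit
    return traces

for t in build_traces(list("ABDACD")):
    print(len(t), t[0], "..", t[-1])
# 15 A1 .. D4   (A B D: 3 branches)
# 16 A1 .. C11  (A plus the first 11 instructions of C: full line)
# 5  C12 .. D4  (remainder, ended by the exit of this short path)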

21 Trace Cache Example (cont.)
The dynamic instruction stream is carved into traces: the first line fills with A1..A5 B1..B6 D1..D4 (15 instructions, 3 branches), the second with A1..A5 C1..C11 (a full 16-instruction line), and the next trace begins in the middle of block C, at C12.

22 Trace Cache Example (cont.)
As the path is traversed, the 5-line trace cache fills with:
–A1..A5 B1..B6 D1..D4
–A1..A5 C1..C11
–C12 D1..D4 A1..A5
–B1..B6 D1..D4 A1..A5
–C1..C12 D1..D4
The trace cache is now full.

23 Trace Cache Example (cont.)
With the trace cache full, replay the BB traversal path ABDABDACDABDACDABDAC against its contents: how many fetches hit in the trace cache, and what is the utilization of the cached trace lines?

24 Redundancy
Duplication:
–Note that instructions appear only once in the I-cache
–The same instruction appears many times in the TC
Fragmentation:
–If 3 BBs contain fewer than 16 instructions, part of the line stays empty
–If a multiple-target branch (e.g. return, indirect jump or trap) is encountered, trace construction stops
–Empty slots are wasted resources
Example:
–A single sequence of basic blocks A, B, C, D gets cached as the traces (ABC), (BCD), (CDA), (DAB), duplicating each instruction 3 times
–(ABC) = 16 inst, (BCD) = 13 inst, (CDA) = 15 inst, (DAB) = 13 inst
[Figure: trace cache holding the four overlapping traces]

25 Indexability
The TC has saved the traces (EAC) and (BCD); the actual path is (EAC) followed by (D).
–The fetch cannot index block D inside the cached trace (BCD), because the TC is indexed only by a trace's starting address; this can cause duplication
Partial matching is also needed:
–(BCD) is cached, but only (BC) may be needed
[Figure: control-flow graph over the blocks and the trace cache holding (EAC) and (BCD)]
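A minimal sketch of partial matching with an illustrative trace representation (a list of blocks plus the recorded direction of the branch ending each block but the last): if the predicted directions diverge partway through a cached trace, the matching prefix can still be supplied:

def partial_match(trace_blocks, trace_dirs, predicted_dirs):
    """trace_dirs[i] is the recorded direction of the branch ending block i."""
    n = 0
    while (n < len(trace_dirs) and n < len(predicted_dirs)
           and trace_dirs[n] == predicted_dirs[n]):
        n += 1
    return trace_blocks[:n + 1]       # blocks up to the first divergence

# Cached trace (B C D): B's branch and C's branch were both taken when recorded.
print(partial_match(["B", "C", "D"], [True, True], [True, False]))  # ['B', 'C'] only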

26 Pentium 4 (NetBurst) Trace Cache
There is no I-cache at all: the front-end BTB, iTLB and prefetcher bring instructions from the L2 cache to the decoder, and the decoded instructions are stored in the trace cache (which has its own trace BTB) before rename, execute, etc. Prediction is trace-based: predict the next trace, not the next PC.