Prof. Hsien-Hsin Sean Lee

Slides:



Advertisements
Similar presentations
Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
Advertisements

Hardware-based Devirtualization (VPC Prediction) Hyesoon Kim, Jose A. Joao, Onur Mutlu ++, Chang Joo Lee, Yale N. Patt, Robert Cohn* ++ *
ECE 4100/6100 Advanced Computer Architecture Lecture 6 Instruction Fetch Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia.
COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.
Pipeline Hazards Pipeline hazards These are situations that inhibit that the next instruction can be processed in the next stage of the pipeline. This.
Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.
CPE 731 Advanced Computer Architecture ILP: Part II – Branch Prediction Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
CS 7810 Lecture 7 Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching E. Rotenberg, S. Bennett, J.E. Smith Proceedings of MICRO-29.
EECC722 - Shaaban #1 Lec # 5 Fall Decoupled Fetch/Execute Superscalar Processor Engines Superscalar processor micro-architecture is divided.
EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.
EECC722 - Shaaban #1 Lec # 5 Fall High Bandwidth Instruction Fetching Techniques Instruction Bandwidth Issues –The Basic Block Fetch Limitation/Cache.
EECC722 - Shaaban #1 Lec # 10 Fall Conventional & Block-based Trace Caches In high performance superscalar processors the instruction fetch.
Goal: Reduce the Penalty of Control Hazards
Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.
Branch Target Buffers BPB: Tag + Prediction
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
EENG449b/Savvides Lec /25/05 March 24, 2005 Prof. Andreas Savvides Spring g449b EENG 449bG/CPSC 439bG.
CIS 429/529 Winter 2007 Branch Prediction.1 Branch Prediction, Multiple Issue.
EECC722 - Shaaban #1 Lec # 9 Fall Conventional & Block-based Trace Caches In high performance superscalar processors the instruction fetch.
Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
ACSAC’04 Choice Predictor for Free Mongkol Ekpanyapong Pinar Korkmaz Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Institute.
Microprocessor Microarchitecture Instruction Fetch Lynn Choi Dept. Of Computer and Electronics Engineering.
Page 1 Trace Caches Michele Co CS 451. Page 2 Motivation  High performance superscalar processors  High instruction throughput  Exploit ILP –Wider.
1 Dynamic Branch Prediction. 2 Why do we want to predict branches? MIPS based pipeline – 1 instruction issued per cycle, branch hazard of 1 cycle. –Delayed.
Spring 2003CSE P5481 Advanced Caching Techniques Approaches to improving memory system performance eliminate memory operations decrease the number of misses.
ECE 4100/6100 Advanced Computer Architecture Lecture 2 Instruction-Level Parallelism (ILP) Prof. Hsien-Hsin Sean Lee School of Electrical and Computer.
Computer System Design
Superscalar - summary Superscalar machines have multiple functional units (FUs) eg 2 x integer ALU, 1 x FPU, 1 x branch, 1 x load/store Requires complex.
Fetch Directed Prefetching - a Study
Branch Prediction Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.
CS 6290 Branch Prediction. Control Dependencies Branches are very frequent –Approx. 20% of all instructions Can not wait until we know where it goes –Long.
Effective ahead pipelining of instruction block address generation André Seznec and Antony Fraboulet IRISA/ INRIA.
High Bandwidth Instruction Fetching Techniques
Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 13: Branch prediction (Chapter 4/6)
COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.
CMPE750 - Shaaban #1 Lec # 5 Spring High Bandwidth Instruction Fetching Techniques Instruction Bandwidth Issues –The Basic Block Fetch Limitation/Cache.
CSL718 : Pipelined Processors
So Far, Focus on ILP for Pipelines
Computer Architecture: Branch Prediction (II) and Predicated Execution
Lecture 9. Branch Target Prediction and Trace Cache
CS203 – Advanced Computer Architecture
15-740/ Computer Architecture Lecture 21: Superscalar Processing
CS5100 Advanced Computer Architecture Advanced Branch Prediction
PowerPC 604 Superscalar Microprocessor
CS252 Graduate Computer Architecture Spring 2014 Lecture 8: Advanced Out-of-Order Superscalar Designs Part-II Krste Asanovic
Prof. Onur Mutlu Carnegie Mellon University
5.2 Eleven Advanced Optimizations of Cache Performance
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
Flow Path Model of Superscalars
Hyperthreading Technology
High Bandwidth Instruction Fetching Techniques
Pipelining: Advanced ILP
The processor: Pipelining and Branching
Module 3: Branch Prediction
Lecture 17: Case Studies Topics: case studies for virtual memory and cache hierarchies (Sections )
Lecture 11: Memory Data Flow Techniques
15-740/ Computer Architecture Lecture 24: Control Flow
Ka-Ming Keung Swamy D Ponpandi
Lecture 10: Branch Prediction and Instruction Delivery
Advanced Microarchitecture
So far we have dealt with control hazards in instruction pipelines by:
Pipelining: dynamic branch prediction Prof. Eric Rotenberg
Adapted from the slides of Prof
So far we have dealt with control hazards in instruction pipelines by:
So far we have dealt with control hazards in instruction pipelines by:
Ka-Ming Keung Swamy D Ponpandi
Spring 2019 Prof. Eric Rotenberg
ECE 721, Spring 2019 Prof. Eric Rotenberg.
Spring 2019 Prof. Eric Rotenberg
Presentation transcript:

ECE 4100/6100 Advanced Computer Architecture Lecture 6 Instruction Fetch Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

Instruction Supply Issues Execution Core Instruction Fetch Unit Instruction buffer Fetch throughput defines max performance that can be achieved in later stages Superscalar processors need to supply more than 1 instruction per cycle Instruction Supply limited by Misalignment of multiple instructions in a fetch group Change of Flow (interrupting instruction supply) Memory latency and bandwidth

Aligned Instruction Fetching (4 instructions) PC=..xx000000 00 01 10 11 ..00 A0 A1 A2 A3 ..01 One 64B I-cache line A4 A5 A6 A7 ..10 A8 A9 A10 A11 Row Decoder ..11 A12 A13 A14 A15 Can pull out one row at a time inst 1 inst 2 inst 3 inst 4 Cycle n Assume one fetch group = 16B

Misaligned Fetch PC=..xx001000 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 ..00 A0 A1 A2 A3 ..01 One 64B I-cache line A4 A5 A6 A7 ..10 A8 A9 A10 A11 Row Decoder ..11 A12 A13 A14 A15 Rotating network inst 1 inst 2 inst 3 inst 4 Cycle n IBM RS/6000

Split Cache Line Access PC=..xx111000 00 01 10 11 ..00 A0 A1 A2 A3 ..01 cache line A A4 A5 A6 A7 ..10 A8 A9 A10 A11 Row Decoder ..11 A12 A13 A14 A15 B0 B1 B2 B3 cache line B B4 B5 B6 B7 inst 1 inst 2 Cycle n inst 3 inst 4 Cycle n+1 Be broken down to 2 physical accesses

Split Cache Line Access Miss PC=..xx111000 00 01 10 11 ..00 A0 A1 A2 A3 ..01 cache line A A4 A5 A6 A7 ..10 A8 A9 A10 A11 Row Decoder ..11 A12 A13 A14 A15 C0 C1 C2 C3 cache line C C4 C5 C6 C7 Cache line B misses inst 1 inst 2 Cycle n inst 3 inst 4 Cycle n+X

High Bandwidth Instruction Fetching Wider issue  More instruction feed Major challenge: to fetch more than one non-contiguous basic block per cycle Enabling technique? Predication Branch alignment based on profiling Other hardware solutions (branch prediction is a given) BB1 BB4 BB2 BB3 BB5 BB7 BB6

Assembly w/ predication Predication Example Source code lw r2, [r1+4] lw r3, [r1] blt r3, r2, L1 sw r0, [r1] j L2 L1: sw r0, [r1+4] L2: Typical assembly lw r2, [r1+4] lw r3, [r1] sgt pr4, r2, r3 (p4) sw r0, [r1+4] (!p4) sw r0, [r1] Assembly w/ predication if (a[i+1]>a[i]) a[i+1] = 0 else a[i] = 0 Convert control dependency into data dependency Enlarge basic block size More room for scheduling No fetch disruption

Collapse Buffer [Conte et al. 95] To fetch multiple (often non-contiguous) instructions Use interleaved BTB to enable multiple branch predictions Align instructions in the predicted sequential order Use banked I-cache for multiple line access

Collapsing Buffer Interleaved BTB Fetch PC Cache Bank 1 Cache Bank 2 Interchange Switch Collapsing Circuit

Collapsing Buffer Mechanism Interleaved BTB Valid Instruction Bits E F G H A B C D A E Interchange Switch A B C D E F G H Bank Routing E A D F H Collapsing Circuit A B C E G E F G H A B C D

High Bandwidth Instruction Fetching To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines) Multiple branches predictions BB1 BB4 BB2 BB3 BB5 BB7 BB6

Multiple Branch Predictor [YehMarrPatt’93] Pattern History Table (PHT) design to support MBP Based on global history only Pattern History Table (PHT) Branch History Register (BHR) Tertiary prediction bk b1 …… p2 p1 p2 update Secondary prediction p1 Primary prediction

Multiple Branch Predictin Fetch address (br0 Primary prediction) Fetch address could be retrieved from BTB Predicted path: BB1  BB2  BB5 How to fetch BB2 and BB5? BTB? Can’t. Branch PCs of br1 and br2 not available when MBP made Use a BAC design BTB entry BB1 br1 T (2nd) F BB2 br2 BB3 F (3rd) T F T BB4 BB5 BB6 BB7

Branch Address Cache V br V br V br Tag Taken Target Address Not-Taken Target Address T-T Address T-N Address N-T Address N-N Address 23 bits 1 2 30 bits 30 bits 212 bits per fetch address entry Fetch Addr (from BTB) Use a Branch Address Cache (BAC): Keep 6 possible fetch addresses for 2 more predictions br: 2 bits for branch type (cond, uncond, return) V: single valid bit (to indicate if hits a branch in the sequence) To make one more level prediction Need to cache another 8 more addresses (i.e. total=14 addresses) 464 bits per entry = (23+3)*1 + (30+3) * (2+4) + 30*8

Caching Non-Consecutive Basic Blocks High Fetch Bandwidth + Low Latency BB3 BB5 BB1 BB2 BB4 Fetch in Conventional Instruction Cache BB1 BB2 BB3 BB4 BB5 Fetch in Linear Memory Location

Trace Cache Cache dynamic non-contiguous instructions (traces) Cross multiple basic blocks Need to predict multiple branches (MBP) E F G A B C D E F G H I J I$ Fetch (5 cycles) A B C D E F G H I J Collapsing Buffer Fetch (3 cycles) A B C D E F G H I J Trace Cache H I J K A B C D E F G H I J T$ Fetch (1 cycle) A B C D I$

Trace Cache [Rotenberg Bennett Smith ‘96] 11, 1 11: 3 branches. 1: the trace ends w/ a branch 10 1st Br taken 2nd Br Not taken For T.C. miss Br flag Br mask Line fill buffer Tag Fall-thru Address Taken Address M branches BB2 BB1 BB3 T.C. hits, N instructions Branch 1 Branch 2 Branch 3 Fetch Addr Cache at most (in original paper) M branches OR (M = 3 in all follow-up TC studies due to MBP) N instructions (N = 16 in all follow-up TC studies) Fall-thru address if last branch is predicted not taken MBP

Trace Hit Logic Fetch: A Multi-BPred A 10 11,1 X Y N T N = Cond. AND Tag BF Mask Fall-thru Target Multi-BPred A 10 11,1 X Y N T N = 0 1 Cond. AND Match 1st Block Next Fetch Address Match Remaining Block(s) Trace hit

BB Traversal Path: ABDABDACDABDACDABDAC Trace Cache Example BB Traversal Path: ABDABDACDABDACDABDAC A B C D Exit 5 insts 12 insts 4 insts 6 insts Cond 1: 3 branches Cond 2: Fill a trace cache line Cond 3: Exit A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4 A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 16 instructions Trace Cache (5 lines)

BB Traversal Path: ABDABDACDABDACDABDAC Trace Cache Example BB Traversal Path: ABDABDACDABDACDABDAC A B C D Exit 5 insts 12 insts 4 insts 6 insts Cond 1: 3 branches Cond 2: Fill a trace cache line Cond 3: Exit A1 A2 A3 A4 A5 A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 D1 D2 D3 D4 A1 A2 A3 A4 A5 A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C12 D1 D2 D3 D4 D1 D2 D3 D4 A1 A2 A3 A4 A5 A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 D1 D2 D3 D4 A1 A2 A3 A4 A5 A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4 D1 D2 D3 D4 Trace Cache (5 lines)

BB Traversal Path: ABDABDACDABDACDABDAC Trace Cache Example BB Traversal Path: ABDABDACDABDACDABDAC A B C D Exit 5 insts 12 insts 4 insts 6 insts Trace Cache is Full A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C12 D1 D2 D3 D4 A1 A2 A3 A4 A5 C12 D1 D2 D3 D4 A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4 Trace Cache (5 lines)

Trace Cache Example BB Traversal Path: ABDABDACDABDACDABDAC A B C D Exit 5 insts 12 insts 4 insts 6 insts How many hits? What is the utilization? A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C12 D1 D2 D3 D4 A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4

Redundancy Duplication Fragmentation Example B C D A Note that instructions only appear once in I-Cache Same instruction appears many times in TC Fragmentation If 3 BBs < 16 instructions If multiple-target branch (e.g. return, indirect jump or trap) is encountered, stop “trace construction”. Empty slots  wasted resources Example A single BB is broken up to (ABC), (BCD), (CDA), (DAB) Duplicating each instruction 3 times 6 B C D A 4 (ABC) =16 inst (BCD) =13 inst (CDA) =15 inst (DAB) =13 inst 6 3 A B C D Trace Cache

Indexability TC saved traces (EAC) and (BCD) Path: (EAC) to (D) Cannot index interior block (D) Can cause duplication Need partial matching (BCD) is cached, if (BC) is needed A B C G D E C B D Trace Cache A

Pentium 4 (NetBurst) Trace Cache Front-end BTB iTLB and Prefetcher L2 Cache No I$ !! Decoder Trace $ BTB Trace $ Rename, execute, etc. Trace-based prediction (predict next-trace, not next-PC) Decoded Instructions