XBC - eXtended Block Cache
Lihu Rappoport, Stephan Jourdan, Yoav Almog, Mattan Erez, Adi Yoaz, Ronny Ronen (Intel Corporation)


Slide 2: The Frontend
The frontend's goal is to supply instructions to execution:
– Predict which instructions to fetch
– Fetch the instructions from the cache / memory
– Decode the instructions
– Deliver the decoded instructions to execution
[Figure: processor block diagram; the Frontend delivers instructions to Execution, and both connect to Memory for instructions and data]

Slide 3: Requirements from the Frontend
– High bandwidth
– Low latency

Slide 4: The Traditional Solution: Instruction Cache
Basic unit: the cache line
– A sequence of consecutive instructions in memory
Deficiencies:
– Low bandwidth: a jump into the middle of a line, or a jump out of it, wastes part of the fetched line
– High latency: the fetched instructions still need decoding

Slide 5: Trace Cache
TC goals: high bandwidth & low latency
Basic unit: the trace
– A sequence of dynamically executed instructions
– Instructions are decoded into uops (fixed-length, RISC-like instructions)
– Traces have a single entry and multiple exits
– Trace end conditions [figure shows a trace spanning several jmp instructions]
– The trace tag/index is derived from the starting IP

Slide 6: Redundancy in the TC
Code: if (cond) then A; B
Possible traces: (i) AB (ii) B
Block B is stored twice, once inside trace AB and once on its own
Space inefficiency ⇒ low hit rate

Slide 7: XBC Goals
– High bandwidth
– Low latency
– High hit rate

Slide 8: XBC - eXtended Block Cache
Basic unit: the XB (eXtended Block)
XB features:
– Multiple entries, single exit
– Tag / index derived from the ending instruction's IP
– Instructions are decoded (stored as uops)
XB end conditions:
– Conditional or indirect branch
– Call / return
– Quota (16 uops)
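The end conditions above can be sketched as a predicate over a decoded uop stream. This is a minimal behavioral model, not Intel's implementation; the opcode names and the `QUOTA` constant's placement are illustrative (the 16-uop quota itself is from the slide):

```python
QUOTA = 16  # per the slide: an XB holds at most 16 uops

def ends_xb(opcode, uops_so_far):
    """Return True if the current uop terminates the eXtended Block."""
    if opcode in ("jcc", "jmp_indirect", "call", "ret"):
        return True              # conditional/indirect branch, call, or return
    return uops_so_far >= QUOTA  # quota reached

def split_into_xbs(uop_stream):
    """Split a dynamic stream of uop opcodes into XBs."""
    xbs, current = [], []
    for op in uop_stream:
        current.append(op)
        if ends_xb(op, len(current)):
            xbs.append(current)
            current = []
    if current:
        xbs.append(current)
    return xbs
```

Note how, unlike a trace, each XB ends at its terminating branch, so a given static block lands in exactly one XB.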

Slide 9: XBC Fetch Bandwidth
Fetch multiple XBs per cycle:
– A conditional branch ends an XB
– Need to predict only 1 branch per XB
– Predicting 2 branches/cycle ⇒ fetching 2 XBs/cycle
Promote ≥99%-biased conditional branches* ⇒ build longer XBs ⇒ maximize XBC bandwidth for a given number of predictions per cycle
[Figure: a promoted ≥99%-biased jcc absorbed into a longer XB that ends in a jmp]
*[Patel 98]
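Branch promotion (treating a ≥99%-biased conditional branch as unconditional so it no longer ends an XB) can be sketched with per-branch bias counters. The counter scheme and `min_samples` warm-up below are illustrative assumptions, loosely following the idea the slide attributes to [Patel 98]:

```python
class PromotionTable:
    """Tracks per-branch outcome bias; promotes highly biased branches."""

    def __init__(self, threshold=0.99, min_samples=100):
        self.threshold = threshold      # bias needed for promotion
        self.min_samples = min_samples  # warm-up before promoting
        self.stats = {}                 # branch IP -> [taken, total]

    def record(self, ip, taken):
        t = self.stats.setdefault(ip, [0, 0])
        t[0] += 1 if taken else 0
        t[1] += 1

    def is_promoted(self, ip):
        taken, total = self.stats.get(ip, (0, 0))
        if total < self.min_samples:
            return False
        bias = max(taken, total - taken) / total  # bias toward either direction
        return bias >= self.threshold
```

A promoted branch costs no prediction slot, which is exactly why promotion lengthens XBs without raising the predictions-per-cycle requirement.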

Slide 10: XB Length
Block type: average length (uops)
– BB (basic block): 7.7
– XB (don't break on unconditional branches): 8.0
– XBp (XB + promotion): 10.0
– DBL (group of 2 XBps): 12.7
[Chart: average block length for BB, XB, XBp, and DBL]

Slide 11: XBC Structure
A banked structure which supports:
– Variable-length XBs (to minimize fragmentation)
– Fetching multiple XBs per cycle
[Figure: 4 banks (bank 0 to bank 3), 4 uops wide each, feeding a Reorder & Align stage]

Slide 12: Supporting Variable-Length XBs
An XB may spread over several banks within the same set
[Figure: a single XB occupying two banks of one set, ahead of the Reorder & Align stage]

Slide 13: Supporting Fetching 2 XBs/Cycle
Data may be received from all banks in the same cycle
[Figure: two XBs read from the four banks in parallel]

Slide 14: Supporting Fetching 2 XBs/Cycle
The actual bandwidth may sometimes be less than 4 banks per cycle
[Figure: two XBs whose lines fall in the same banks, leaving some banks unread that cycle]

Slide 15: Reordering and Aligning Uops
[Figure: Mux 1 reorders the four bank outputs into program order; Mux 2 aligns the uops, squeezing out the empty uop slots]
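The two-mux datapath on this slide can be modeled behaviorally: Mux 1 selects banks in XB program order, Mux 2 compacts away the empty slots to deliver a contiguous uop stream. The list-based data layout here is a hypothetical stand-in for the hardware wiring:

```python
def reorder_and_align(banks, program_order):
    """banks: list of per-bank uop lists, with None marking empty slots.
    program_order: bank indices in XB program order (Mux 1's selection).
    Mux 1 reorders the banks; Mux 2 squeezes out the empty uop slots."""
    reordered = [banks[i] for i in program_order]                     # Mux 1
    return [u for line in reordered for u in line if u is not None]   # Mux 2
```

For example, if the second fetched XB physically sits in bank 1 and the first in bank 0, `program_order` would name bank 1 first.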

Slide 16: XBC Structure
The average XB length is >8 uops ⇒ a 16-uop line holds fewer than 2 XBs, i.e., less than 2-XB set associativity
[Figure: banks 0 to 3 with a 16-uop line, feeding the Reorder & Align stage]

Slide 17: XBC Structure
The average XB length is >8 uops ⇒ make each bank set-associative
[Figure: each of the four banks made set-associative, feeding the Reorder & Align stage]

Slide 18: The XBTB
The XBTB provides the next XB for each XB:
– XBs are indexed by their ending IP ⇒ the next IP cannot be looked up directly in the XBC ⇒ the XBC can only be accessed through the XBTB
The XBTB provides the info needed to access the next XB:
– The IP of the next XB, which defines the set in which the XB resides
– A masking vector, indicating the banks in which the XB resides
– The number of uops, counted backward from the end of the XB, which defines where to enter the XB
The XBTB provides the next 2 XBs
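The per-XB information the XBTB supplies can be captured in a small record. The field names, the modulo set indexing, and the dict-backed table are illustrative assumptions; the three fields themselves come from the slide:

```python
from dataclasses import dataclass

@dataclass
class XBTBEntry:
    next_ip: int        # ending IP of the next XB; defines its set
    bank_mask: int      # bit i set -> the next XB occupies bank i of that set
    uops_from_end: int  # entry point, counted backward from the XB's end

def xbc_set_index(ending_ip, num_sets):
    """The ending IP selects the set (simple modulo indexing assumed)."""
    return ending_ip % num_sets

def lookup_next_xb(xbtb, current_ending_ip):
    """The XBC is reachable only through the XBTB: index by the current
    XB's ending IP; a miss (None) switches the frontend to build mode."""
    return xbtb.get(current_ending_ip)
```

Because entry points are counted backward from the end, they stay valid when an XB is later extended at its beginning (slide 24).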

Slide 19: XBC Structure: The Whole Picture
[Figure: in build mode, instructions flow from Memory / Cache through the Decoder and the Fill Unit into the XBC; in delivery mode, the XBTB (alongside the BTB) drives XBC lookups, whose output passes through a Priority Encode stage into the XBQ]

Slide 20: XB Build Algorithm
XBTB lookup fails ⇒ build a new XB into the fill buffer
End-of-XB condition reached ⇒ look the new XB up in the XBC:
– No match ⇒ store the new XB in the XBC, and update the XBTB
– Match ⇒ there are three cases:
– XB_new ⊆ XB_exist ⇒ update the XBTB
– XB_new ⊇ XB_exist ⇒ extend XB_exist, update the XBTB
– XB_new and XB_exist share only their suffix ⇒ complex XB, update the XBTB
The XBC has NO redundancy
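Since both XBs are indexed by the same ending IP, the three cases above amount to comparing the two uop sequences back-to-front. A sketch under that assumption (an illustrative model, not Intel's implementation):

```python
def classify_build(xb_new, xb_exist):
    """Both XBs end at the same IP, so compare them from the end backward.
    Returns the build action:
      'update_xbtb' - XB_new is contained in XB_exist (XB_new subset)
      'extend'      - XB_exist is a suffix of XB_new (XB_new superset)
      'complex'     - same ending IP but diverging prefixes -> complex XB
    """
    n = min(len(xb_new), len(xb_exist))
    if xb_new[-n:] != xb_exist[-n:]:
        return "complex"        # shared ending IP, different prefixes
    if len(xb_new) <= len(xb_exist):
        return "update_xbtb"    # new XB already covered by the existing one
    return "extend"             # new XB extends the existing one at its start
```

All three outcomes leave each static uop stored once, which is the no-redundancy property the slide claims.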

Slide 21: Complex XBs
XB_new and XB_exist have the same suffix but a different prefix:
– A possible solution complying with no-redundancy: store the new prefix as a separate XB that falls through into XB_exist
– Drawback: we get 2 short XBs instead of a single long XB (the "wrong way")

Slide 22: Complex XBs
XB_new and XB_exist have the same suffix but a different prefix:
– Second solution: a single "complex XB", in which both prefixes (Prefix_cur and Prefix_new) share the common suffix (the "right way")
Complex XBs: no redundancy, but still high bandwidth

Slide 23: Extending an Existing XB
An XB can only be extended at its beginning
If we store the XB in the usual (forward) order, extending it means moving all of its uops; and since the existing uops move, the pointers in the XBTB become stale
[Figure: an XB shifting across banks 0 to 3 when extended]

Slide 24: Storing Uops in Reverse Order
The solution is to store the uops of an XB in reverse order
The XB IP is the IP of the ending instruction:
⇒ extending the XB does not change the XB IP
⇒ when an XB is extended, there is no need to move uops
[Figure: an XB stored back-to-front across the banks]
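Reverse-order storage turns "extend at the beginning" into an append: the uop at backward offset k always stays at storage offset k, so existing uops and the XBTB's backward-counted entry points never move. A sketch with a flat list standing in for the banked storage:

```python
class ReversedXB:
    """Stores an XB's uops in reverse order (ending instruction first)."""

    def __init__(self, uops):
        # XB IP is the ending instruction's IP, so it never changes on extend
        self.storage = list(reversed(uops))

    def uop_from_end(self, k):
        """Uop at offset k, counted backward from the XB's end."""
        return self.storage[k]

    def extend_front(self, prefix_uops):
        """Prepend uops to the XB; existing storage offsets are unchanged."""
        self.storage.extend(reversed(prefix_uops))
```

After an extension, every previously handed-out `uop_from_end` offset still names the same uop, which is exactly what keeps the XBTB pointers valid.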

Slide 25: Set Search
An XB may be replaced and then placed again:
– Not in the same set ⇒ it is effectively a different XB
– Same set, same banks ⇒ no problem
– Same set but not the same banks ⇒ XBTB entries that point to the old location of the XB are erroneous
Solution: set search
– On an XBTB hit & XBC miss, try to locate the XB in the other banks of the same set
– Calculate a new mask according to the offset
– Only a small penalty: a cycle is lost, but there is no switch to build mode
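Set search can be sketched as a scan of the set's banks that rebuilds the bank mask. The one-tag-per-bank layout below is a simplification of the real structure:

```python
def set_search(set_banks, xb_tag):
    """set_banks: list mapping bank index -> tag of the line stored there
    (None for empty). On an XBTB hit whose mask misses in the XBC, scan
    the banks of the same set for the XB and rebuild its bank mask.
    Returns the repaired mask, or None for a true miss (switch to build mode)."""
    mask = 0
    for bank, tag in enumerate(set_banks):
        if tag == xb_tag:
            mask |= 1 << bank   # XB line found in this bank
    return mask or None
```

A repaired mask costs only the extra lookup cycle; only a `None` result forces the expensive fall-back to build mode.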

Slide 26: XB Replacement
Use LRU among all the lines in a given set
LRU also makes sure that we do not evict a non-head line (a line other than the first line of an XB) before its head line:
– There is no point in retaining the head line while evicting another line: if we enter the XB at the head line, we will take a miss when we reach the evicted line
– If a head line is evicted but we enter the XB in its middle, we may still avoid a miss
A non-head line is always accessed after its head line ⇒ its LRU position is higher ⇒ it will not be evicted before the head line

Slide 27: XB Placement
Build-mode placement algorithm:
– A new XB is placed in banks such that it has no bank conflict with the previous XB (if possible)
– LRU ordering is maintained by switching the LRU line with the non-conflicting line before the new XB is placed
– Set search repairs the XBTB
Delivery-mode placement algorithm:
– When repeating bandwidth losses due to bank conflicts are found, the conflicting lines are moved to non-conflicting banks
– Each XB is augmented with a counter, incremented when the XB has a bank conflict
– When the counter reaches a threshold, the conflicting lines are switched with other lines in non-conflicting banks
– A line can be switched with another line only if its LRU position is higher, or if both gain from the switch
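The delivery-mode trigger can be sketched as a per-XB conflict counter that fires a relocation at a threshold; the threshold value and the reset-on-fire behavior are illustrative assumptions:

```python
class ConflictCounter:
    """Counts bank conflicts for one XB; signals relocation at a threshold."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.count = 0

    def record_conflict(self):
        """Called when this XB loses bandwidth to a bank conflict.
        Returns True when the conflicting lines should be moved."""
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0   # relocation performed; start counting afresh
            return True
        return False
```

The counter filters out one-off conflicts, so only XB pairs that repeatedly collide pay the cost of a line move.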

Slide 28: XBC vs. TC Delivery Bandwidth
[Chart: average uops per cycle for TC and XBC across the Games, SpecINT, Sysmark, and NT benchmark suites]

Slide 29: Miss Rate as a Function of Size
[Chart: uop miss rate (0% to 10%) vs. cache size (16K, 32K, 64K uops) for TC and XBC; annotations mark gaps of 29% and >50%]

Slide 30: Miss Rate as a Function of Associativity
[Chart: uop miss rate (0% to 9%) vs. associativity (1, 2, 4) for TC and XBC]

Slide 31: XBC Features Summary
Basic unit: the XB
– Ends with a conditional branch
– Multiple entries, single exit
– Indexed according to the ending IP
– Branch promotion ⇒ longer XBs
The XBC uses a banked structure:
– Supports fetching multiple XBs per cycle
– Supports variable-length XBs
– Uops within an XB are stored in reverse order

Slide 32: Conclusions
The instruction cache has a high hit rate, but low bandwidth and high latency
The TC has high bandwidth and low latency, but a low hit rate
The XBC combines the best of both worlds: high bandwidth, low latency, and a high hit rate
