Page 1 Trace Caches Michele Co CS 451

Page 2 Motivation
• High performance superscalar processors
• High instruction throughput
• Exploit ILP
  – Wider dispatch and issue paths
• Execution units designed for high parallelism
  – Many functional units
  – Large issue buffers
  – Many physical registers
• Fetch bandwidth becomes performance bottleneck

Page 3 Fetch Performance Limiters
• Cache hit rate
• Branch prediction accuracy
• Branch throughput
  • Need to predict more than one branch per cycle
• Non-contiguous instruction alignment
• Fetch unit latency

Page 4 Problems with Traditional Instruction Cache
• Contains instructions in compiled (static) order
• Works well only for sequential code with little branching, or code with large basic blocks
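To make the limitation concrete, here is a minimal sketch of a contiguous-fetch front end. It assumes a simplified model that is not in the slides: PCs count instructions rather than bytes, one cache line is read per cycle, and the function name and parameters are hypothetical.

```python
def fetch_conventional(pc, taken_branches, width=16, line_size=16):
    """Return the PCs delivered in one cycle by a contiguous-fetch front end.

    taken_branches maps the PC of a predicted-taken branch to its target.
    Assumption for illustration: PCs are instruction indices, not byte addresses.
    """
    fetched = []
    line_end = (pc // line_size + 1) * line_size  # fetch cannot cross the cache line
    while len(fetched) < width and pc < line_end:
        fetched.append(pc)
        if pc in taken_branches:  # a predicted-taken branch ends the fetch group
            break
        pc += 1
    return fetched
```

With a taken branch at instruction 4, only five of the sixteen fetch slots are filled in that cycle, which is the case a trace cache is meant to improve.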

Page 5 Suggested Solutions
• Multiple branch target address prediction
• Branch address cache (1993, Yeh, Marr, Patt)
  – Provides quick access to multiple target addresses
  – Disadvantages: complex alignment network, additional latency

Page 6 Suggested Solutions (cont’d)
• Collapsing buffer
• Multiple accesses to BTB (1995, Conte, Mills, Menezes, Patel)
  – Allows fetching non-adjacent cache lines
  – Disadvantages: bank conflicts; poor scalability for interblock branches; significant logic added before and after instruction cache
• Fill unit
• Caches RISC-like instructions derived from CISC instruction stream
• (1988, Melvin, Shebanow, Patt)

Page 7 Problems with Prior Approaches
• Need to generate pointers for all noncontiguous instruction blocks BEFORE fetching can begin
  • Extra stages, additional latency
• Complex alignment network necessary
• Multiple simultaneous accesses to instruction cache
  • Multiporting is expensive
• Sequencing
  • Additional stages, additional latency

Page 8 Potential Solution – Trace Cache
• Rotenberg, Bennett, Smith (1996)
• Advantages
  • Caches dynamic instruction sequences
    – Fetches past multiple branches
  • No additional fetch unit latency
• Disadvantages
  • Redundant instruction storage
    – Between trace cache and instruction cache
    – Within trace cache

Page 9 Trace Cache Details
• Trace
  • Sequence of instructions potentially containing branches and their targets
  • Terminate on branches with indeterminate number of targets
    – Returns, indirect jumps, traps
• Trace identifier
  • Start address + branch outcomes
• Trace cache line
  • Valid bit
  • Tag
  • Branch flags
  • Branch mask
  • Trace fall-through address
  • Trace target address
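The line fields above can be sketched as a small data structure with a hit check. This is an illustration under stated assumptions, not the published design: the cache is modeled as a dictionary keyed by start address, the branch mask is reduced to a single `ends_in_branch` boolean, and the names `TraceLine` and `lookup` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class TraceLine:
    valid: bool
    tag: int                         # start address of the trace
    branch_flags: Tuple[bool, ...]   # recorded outcome of each internal branch (True = taken)
    ends_in_branch: bool             # assumed simplification of the branch mask
    fall_through: int                # next fetch address if the final branch is not taken
    target: int                      # next fetch address if the final branch is taken
    instructions: List[int]


def lookup(cache: Dict[int, TraceLine], fetch_addr: int,
           predictions: Sequence[bool]) -> Optional[Tuple[List[int], int]]:
    """Return (instructions, next fetch address) on a hit, or None on a miss.

    A hit needs a valid line whose tag matches the fetch address and whose
    recorded branch outcomes match the current predictions (the trace
    identifier is start address + branch outcomes).
    """
    line = cache.get(fetch_addr)
    if line is None or not line.valid or line.tag != fetch_addr:
        return None
    n = len(line.branch_flags)
    if tuple(predictions[:n]) != line.branch_flags:
        return None
    # Assumption: a missing prediction for the final branch is treated as not taken.
    final_taken = line.ends_in_branch and len(predictions) > n and predictions[n]
    next_pc = line.target if final_taken else line.fall_through
    return line.instructions, next_pc
```

For example, a line starting at address 100 with recorded outcomes (taken, not taken) hits only when the first two predictions are taken and not taken; the third prediction then selects between the target and fall-through addresses.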

Page 10

Page 11 Next Trace Prediction (NTP)
• History register
• Correlating table
  • Complex history indexing
• Secondary table
  • Indexed by most recently committed trace ID
• Index generating function
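One way an index generating function might combine the history register into a correlating-table index is sketched below. The bit widths, the choice to take more bits from the most recent trace ID than from older ones, and the XOR fold are all assumptions for illustration; the slides do not give the actual function.

```python
def ntp_index(history, index_bits=10, recent_bits=6, older_bits=3):
    """Hash a history of trace IDs into a correlating-table index.

    history: list of integer trace IDs, most recent last.
    Hypothetical scheme: keep recent_bits low bits of the newest ID and
    older_bits low bits of each older ID, concatenate them, then XOR-fold
    the result down to index_bits.
    """
    value = 0
    width = 0
    for age, trace_id in enumerate(reversed(history)):
        nbits = recent_bits if age == 0 else older_bits
        value |= (trace_id & ((1 << nbits) - 1)) << width
        width += nbits
    index = 0
    mask = (1 << index_bits) - 1
    while value:  # fold the concatenated bits into the table index width
        index ^= value & mask
        value >>= index_bits
    return index
```

The secondary table in this picture would be indexed by the most recently committed trace ID alone, so it needs no hashing of older history.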

Page 12 NTP Index Generation

Page 13 Return History Stack

Page 14 Trace Cache vs. Existing Techniques

Page 15 Trace Cache Optimizations
• Performance
  • Partial matching [Friendly, Patel, Patt (1997)]
  • Inactive issue [Friendly, Patel, Patt (1997)]
  • Trace preconstruction [Jacobson, Smith (2000)]
• Power
  • Sequential access trace cache [Hu, et al. (2002)]
  • Dynamic direction prediction based trace cache [Hu, et al. (2003)]
  • Micro-operation cache [Solomon, et al. (2003)]
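As an illustration of the partial matching idea, the sketch below supplies the leading part of a cached trace up to the first branch whose prediction disagrees with the recorded outcome, instead of treating the whole access as a miss. The block representation and the function name are assumptions made for this example.

```python
def partial_match(blocks, flags, predictions):
    """Return the instructions of a cached trace that agree with the predictions.

    blocks: list of basic blocks (lists of instructions) in trace order.
    flags[i]: recorded outcome of the branch ending blocks[i]; the last
              block needs no flag because nothing follows it in the trace.
    predictions[i]: current prediction for that same branch.
    """
    supplied = []
    for i, block in enumerate(blocks):
        supplied.extend(block)  # this block is on the predicted path
        if i < len(flags) and (i >= len(predictions) or predictions[i] != flags[i]):
            break  # later blocks lie on a path the predictor did not choose
    return supplied
```

With recorded outcomes (taken, not taken), predictions of (taken, taken) still deliver the first two blocks rather than nothing.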

Page 16 Trace Processors
• Trace Processor Architecture
  • Processing elements (PE)
    – Trace-sized instruction buffer
    – Multiple dedicated functional units
    – Local register file
    – Copy of global register file
  • Use hierarchy to distribute execution resources
• Addresses superscalar processor issues
  • Complexity
    – Simplified multiple branch prediction (next trace prediction)
    – Elimination of local dependence checking (local register file)
    – Decentralized instruction issue and result bypass logic
  • Architectural limitations
    – Reduced bandwidth pressure on global register file (local register files)

Page 17 Trace Processor

Page 18 Trace Cache Variations
• Block-based trace cache (BBTC)
  • Black, Rychlik, Shen (1999)
  • Less storage capacity needed
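The storage saving can be illustrated with a small sketch in which each basic block is stored once in a block cache and a trace is only a short list of block IDs. The dictionary layout, the integer keys, and the function name are hypothetical; the real structures (trace table, block cache, rename table) are the subject of the following slides.

```python
def assemble_trace(block_cache, trace_table, key):
    """Build a trace from block IDs; return None if any piece is missing.

    block_cache: block ID -> list of instructions (each block stored once).
    trace_table: lookup key -> tuple of block IDs forming the predicted trace.
    """
    block_ids = trace_table.get(key)
    if block_ids is None:
        return None
    if any(block_id not in block_cache for block_id in block_ids):
        return None  # a block was evicted, so the trace cannot be supplied
    trace = []
    for block_id in block_ids:
        trace.extend(block_cache[block_id])
    return trace


# Two traces share blocks 1 and 2, but the instructions are stored only once.
block_cache = {0: [100, 101], 1: [110, 111, 112], 2: [200]}
trace_table = {100: (0, 1, 2), 300: (1, 2)}
stored = sum(len(block) for block in block_cache.values())
duplicated = sum(len(assemble_trace(block_cache, trace_table, k)) for k in trace_table)
```

Here `stored` counts the instructions held by the block cache, while `duplicated` counts what a conventional trace cache would hold if it kept a full copy of both traces.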

Page 19 Trace Table: BBTC Trace Prediction

Page 20 Block Cache

Page 21 Rename Table

Page 22 BBTC Optimization
• Completion time multiple branch prediction (Rakvic, et al., 2000)
  • Improvement over trace table predictions

Page 23 Tree-based Multiple Branch Prediction

Page 24 Tree-PHT

Page 25 Tree-PHT Update

Page 26 Trace Cache Variations (cont’d)
• Software trace cache
  • Ramirez, Larriba-Pey, Navarro, Torrellas (1999)
  • Profile-directed code reordering to maximize sequentiality
    – Convert taken branches to not-taken
    – Move unused basic blocks out of execution path
    – Inline frequent basic blocks
    – Map most popular traces to reserved area of i-cache
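A minimal sketch of profile-directed reordering follows. It uses a simple greedy chaining heuristic chosen for illustration (follow the most frequent successor from an entry block, then place the remaining blocks afterwards); this is not claimed to be the algorithm of the cited paper, and the function names and profile format are assumptions.

```python
def layout_hot_path(edge_counts, entry):
    """Greedily chain basic blocks along the most frequently taken edges.

    edge_counts: (src, dst) -> how often the profile saw control go src -> dst.
    Placing the most frequent successor directly after each block turns the
    common-case branch into a fall-through (not-taken) branch.
    """
    order = [entry]
    placed = {entry}
    current = entry
    while True:
        successors = {dst: count for (src, dst), count in edge_counts.items()
                      if src == current and dst not in placed}
        if not successors:
            break
        current = max(successors, key=successors.get)
        order.append(current)
        placed.add(current)
    return order


def full_layout(edge_counts, entry):
    """Hot path first, then the remaining (colder) blocks out of the way."""
    hot = layout_hot_path(edge_counts, entry)
    all_blocks = {block for edge in edge_counts for block in edge}
    cold = sorted(all_blocks - set(hot))
    return hot + cold
```

For a profile where A goes to B 90 times and to C 10 times, and both reach D, the hot path A, B, D is laid out sequentially and C is moved after it.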