Ka-Ming Keung, Swamy D Ponpandi

Presentation transcript:

Trace Cache
Ka-Ming Keung, Swamy D Ponpandi
CPRE 585, Fall 2004

Lecture Topics
- Trace cache
- Block-based trace cache
- Optimizations

Motivation
- Superscalar designs keep growing more complex to extract higher ILP, with diminishing returns
- A key limiter: fetch bandwidth

Producer-Consumer View
[diagram: Instruction Fetch & Decode feeds Instruction Buffers, which feed Instruction Execution; branch outcomes and jump addresses flow back from execution to fetch]
In this producer-consumer view of a superscalar processor, the machine divides into a fetch unit (FU) and an execution unit (EU) connected by instruction buffers. To maintain instruction throughput, the FU has to keep the EU busy: if the EU accepts instructions at a rate r, the FU must supply instructions at the same rate. If the fetch rate is below r, the EU is under-utilized and IPC drops; if it is above r, the FU stalls on full buffers. Note also that the EU directs the FU where to fetch next (branch outcomes, jump addresses), so the fetch rate actually depends on the EU's results, unless the FU speculates.
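A minimal simulation sketch of this rate-matching argument (the rates, buffer size, and cycle count are illustrative assumptions, not values from the lecture): once the buffer reaches steady state, IPC is bounded by the slower of the two units.

    #include <stdio.h>

    int main(void) {
        const int fetch_rate = 4;   /* instructions the FU supplies per cycle */
        const int exec_rate  = 6;   /* instructions the EU accepts per cycle */
        const int buf_cap    = 32;  /* instruction buffer capacity */
        const int cycles     = 1000;
        int buf = 0;
        long executed = 0;

        for (int c = 0; c < cycles; c++) {
            /* the FU stalls when the buffer is full */
            int fetched = (buf + fetch_rate <= buf_cap) ? fetch_rate : buf_cap - buf;
            buf += fetched;
            /* the EU is under-utilized when the buffer runs dry */
            int issued = (buf < exec_rate) ? buf : exec_rate;
            buf -= issued;
            executed += issued;
        }
        printf("IPC = %.2f (min of fetch %d, exec %d)\n",
               (double)executed / cycles, fetch_rate, exec_rate);
        return 0;
    }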

Fetch Bandwidth
Fetch bandwidth depends on several factors:
- Branch outcome prediction: well studied (BHT, BTB target-address prediction, etc.)
- Cache hit rate: depends on the memory hierarchy, program locality, etc.
- Branch throughput: as issue width grows, multiple branches must be predicted in the same cycle for the FU to supply enough instructions
- Instruction alignment: non-contiguous blocks must be fetched, rotated, and shifted to re-create the dynamic instruction sequence
We will concentrate on branch throughput and alignment. Predicting multiple branches per cycle means fetching from multiple non-contiguous cache locations, which requires multiple cache banks (risking conflicts) or multiple ports (costly); the added fetch-unit latency in turn hurts the execution rate (lower clock, pipeline stalls). Ideally a branch and its predicted-taken basic block are fetched in the same cycle (how is that possible?). For example, with a 16-wide fetch but a taken branch every five instructions on average, a conventional fetch unit that stops at the first taken branch delivers only about five instructions per cycle.

The Trace Cache
- Captures dynamic instruction sequences: a snapshot of the instruction stream as it actually executed
- A cache line holds up to N instructions spanning up to M basic blocks
- A trace is fully specified by its starting address and the outcomes of its first M-1 branches (why are M-1 outcomes enough?)
[diagram: the first time trace {A: taken, taken} executes, it is filled into the trace cache from the instruction cache one basic block at a time (1st block, 2nd block, 3rd block still filling); later, the existing trace is accessed using starting address A and predictions (t,t) and sent whole to the decoder]
The trace cache is proposed to alleviate the fetch problems on the previous slide: a dynamic instruction sequence can start at any address and continue through up to M basic blocks, and the whole sequence is delivered in one access.
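The line format this implies, as a C sketch (the field widths and the constants N and M are illustrative assumptions; the fields match the Trace Cache Components slide below):

    #include <stdbool.h>
    #include <stdint.h>

    #define N 16   /* max instructions per trace line (assumed) */
    #define M 3    /* max basic blocks per trace line (assumed) */

    typedef struct {
        bool     valid;
        uint32_t tag;            /* starting fetch address of the trace */
        uint8_t  branch_flags;   /* outcomes of the trace's interior branches */
        uint8_t  branch_mask;    /* number of recorded outcomes (kept as a count
                                    here, rather than a bit mask, for brevity) */
        uint32_t fall_through;   /* next fetch address if the last branch falls through */
        uint32_t target;         /* next fetch address if the last branch is taken */
        uint32_t insts[N];       /* the trace itself, in dynamic order */
        uint8_t  n_insts;
    } trace_line_t;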

Pros and Cons
- Delivers multiple basic blocks of instructions in the same cycle, without compiler support and without modifying the instruction set
- Not on the critical path: the complexity is moved out of the fetch-issue pipeline, where additional latency would directly hurt performance

Pros and Cons (continued)
- No need to rotate and shift basic blocks to create the dynamic instruction sequence
- Cost: redundant information, since the same instructions can be stored both in the instruction cache and in several traces

Trace Cache: Dissection
[diagram: the fetch address feeds the trace cache and the core fetch unit in parallel; each can deliver n instructions; a 2:1 mux selects one path into the instruction latch that feeds the decoder; misses trigger a line fill into the trace cache]

Trace Cache: More Details
[diagram: the next fetch address (from the predictor, or from the decoder on a redirect) indexes three structures in parallel: the trace cache (tag, branch mask, branch flags, fall-through and target addresses), the core fetch unit (instruction cache plus BTB logic with mask/shift/interchange), and the multiple-branch predictor backed by a return address stack. Hit logic compares the fetch address against the tag and the predictions against the branch flags; on a hit the trace's n instructions go to the instruction latch for decoding, while on a miss the core fetch unit supplies n instructions and the fill control merges the fetched blocks into the line-fill buffer]

Trace Cache Components
- Valid bit
- Tag
- Branch flags
- Branch mask
- Trace fall-through address
- Trace target address

Line-Fill Buffer
- Services trace cache misses
- Control logic merges each incoming block of instructions with the preceding instructions already in the line-fill buffer
- Filling is complete when (1) n instructions have been traced, or (2) m branches have been detected
- After filling, the contents are moved into the trace cache
- Branch flags and the branch mask are generated during the line fill; the trace fall-through and target addresses are computed at the end of the fill
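A sketch of that fill policy, reusing trace_line_t, N, and M from the earlier sketch (the fetch_block_t descriptor is invented for illustration):

    typedef struct {
        uint32_t insts[8];
        uint8_t  n_insts;
        bool     ends_in_branch;   /* block terminated by a conditional branch */
        bool     branch_taken;     /* that branch's resolved outcome */
    } fetch_block_t;

    /* Merge one fetched basic block into the line-fill buffer; returns true
     * when the trace line is complete and can move into the trace cache. */
    bool fill_merge(trace_line_t *fill, const fetch_block_t *blk) {
        for (uint8_t i = 0; i < blk->n_insts && fill->n_insts < N; i++)
            fill->insts[fill->n_insts++] = blk->insts[i];

        if (blk->ends_in_branch) {
            if (blk->branch_taken)
                fill->branch_flags |= (uint8_t)(1u << fill->branch_mask);
            fill->branch_mask++;   /* one more recorded branch outcome */
        }
        /* complete when n instructions traced or m branches detected */
        return fill->n_insts == N || fill->branch_mask == M;
    }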

Trace Cache Hit Conditions
1. The fetch address matches the tag
2. The branch predictions match the branch flags
On a hit, an entire trace of instructions is fed into the instruction latch.
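The two conditions reduce to a tag compare plus a masked compare of the predictor's outcome bits against the stored branch flags (a sketch continuing the structures above):

    /* predictions: bit i is the predicted direction of the trace's (i+1)-th branch */
    bool trace_hit(const trace_line_t *line, uint32_t fetch_addr, uint8_t predictions) {
        if (!line->valid || line->tag != fetch_addr)
            return false;                               /* condition 1: tag match */
        uint8_t mask = (uint8_t)((1u << line->branch_mask) - 1u);
        return (predictions & mask) == (line->branch_flags & mask);  /* condition 2 */
    }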

Trace Cache Miss
On a miss, fetching proceeds normally from the instruction cache, and the line-fill buffer begins capturing the new trace.

Design Issues
- Associativity
- Multiple paths
- Partial matches
- Indexing methods
- Cache line size
- Filling
- Trace selection
- Victim trace cache

Results
[figures: simulation results omitted from the transcript]

Trace Cache Effectiveness
[figure omitted from the transcript]

Pentium 4 Trace Cache
- Stores decoded uops, 6 uops per line (what is the benefit of storing uops? x86 decode is done once, at fill time, and bypassed on every trace cache hit)
- Capacity: up to 12K uops (about 21 KB)
- Indexed with virtual addresses, so no address translation is needed

Block-Based Trace Cache
- Fetch-address renaming: a unique renamed pointer (block id) is assigned to each starting address
- Advantage: fewer bits to store and compare; no full tag comparison, so instruction fetch is faster
- The rename table is maintained at completion time
- Net effect: complexity and latency move from instruction-fetch time to completion time

Blocks
- Instructions are stored in blocks; each block contains a trace of instructions
- A block is updated at the end of instruction execution (at completion)
- Each block has a block id

How Is the Trace Identified?
- Original trace cache: by the fetch address
- Block-based trace cache: by a trace id
- The trace id is determined by the branch history and past block-id predictions
- The trace table stores the block-id predictions
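A sketch of the trace table lookup: branch history combined with a recent block id selects an entry holding the predicted block ids of the next trace (the table size, trace length, and hash are invented for illustration):

    #include <stdint.h>

    #define TT_ENTRIES       1024
    #define BLOCKS_PER_TRACE 4    /* assumed trace length in blocks */

    typedef struct {
        uint8_t block_ids[TT_ENTRIES][BLOCKS_PER_TRACE];
    } trace_table_t;

    /* Predict the block ids that make up the next trace. */
    const uint8_t *predict_trace(const trace_table_t *tt,
                                 uint16_t branch_history, uint8_t last_block_id) {
        uint16_t idx = (uint16_t)((branch_history ^ (uint16_t)(last_block_id << 4))
                                  % TT_ENTRIES);
        return tt->block_ids[idx];
    }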

Rename Table
- Checks whether the predicted block already exists
- Assigns a block id to a newly seen block
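A sketch of the rename table, updated at completion time as described two slides back (the table size, the allocation policy, and the assumption that address 0 is never a valid fetch address are all illustrative):

    #include <stdint.h>

    #define N_BLOCK_IDS 64        /* assumed: 6-bit block ids */

    typedef struct {
        uint32_t addr[N_BLOCK_IDS];  /* full fetch address behind each block id */
        uint8_t  next_free;
    } rename_table_t;

    /* Called at completion: return the block id for this address,
     * allocating a new id the first time a block is seen. */
    uint8_t rename_block(rename_table_t *rt, uint32_t fetch_addr) {
        for (uint8_t id = 0; id < N_BLOCK_IDS; id++)
            if (rt->addr[id] == fetch_addr)
                return id;                 /* the predicted block already exists */
        uint8_t id = rt->next_free;        /* naive round-robin replacement */
        rt->next_free = (uint8_t)((rt->next_free + 1u) % N_BLOCK_IDS);
        rt->addr[id] = fetch_addr;
        return id;
    }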

Dynamic Optimizations: Register Moves
- A register move requires no computation
- It can already be performed in the ROB and the renaming unit
- The trace cache fill unit can do it in a better way

Register Moves: Advantages
- With the detection logic in the fill unit, the decode and rename logic execute move instructions without paying the latency of detecting them
- Moves then require no execution resources, so they incur no delays due to execution artifacts
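A sketch of move elimination through the rename map: the fill unit marks mov uops when the trace is built, and rename simply copies the source's physical mapping to the destination instead of dispatching the move to an ALU (names and structures are invented; a real design would also reference-count physical registers):

    #include <stdbool.h>
    #include <stdint.h>

    #define N_ARCH_REGS 32

    typedef struct { uint8_t phys[N_ARCH_REGS]; } rename_map_t;

    typedef struct {
        uint8_t dst, src;
        bool    is_move;   /* pre-decoded by the fill unit during trace construction */
    } uop_t;

    /* Returns true if the uop was eliminated at rename (no FU needed). */
    bool rename_uop(rename_map_t *map, const uop_t *u, uint8_t fresh_phys_reg) {
        if (u->is_move) {
            map->phys[u->dst] = map->phys[u->src];  /* copy the mapping; done */
            return true;
        }
        map->phys[u->dst] = fresh_phys_reg;         /* normal allocation path */
        return false;
    }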

Re-association
Shortens dependence chains by folding immediates:
    addi Rx, Ry, 4
    addi Rz, Rx, 4    =>    addi Rz, Ry, 8
The second add no longer waits on the first (see the combined sketch after the next slide).

Scaled Adds
Also shortens dependence chains, by fusing a shift with a dependent add:
    shifti Rw, Rx << 1
    add Ry, Rw, Rz    =>    scaled_add Ry, (Rx << 1), Rz
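Both transforms are peepholes the fill unit can apply while a trace is being constructed; here is a sketch over a toy instruction encoding (the encoding, opcode names, and guards are assumptions for this example; the first instruction of each pair is kept in the trace in case its result is still live):

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { OP_ADDI, OP_SHIFTI, OP_ADD, OP_SCALED_ADD } opcode_t;

    typedef struct { opcode_t op; uint8_t rd, rs1, rs2; int16_t imm; } insn_t;

    /* Try to rewrite the dependent pair (a, b) in place; returns true if b
     * was rewritten so that it no longer depends on a. */
    bool fill_unit_peephole(const insn_t *a, insn_t *b) {
        /* re-association: addi rx,ry,c1 ; addi rz,rx,c2 -> addi rz,ry,c1+c2
         * (guard rd != rs1 so the rewritten b still reads the pre-a value) */
        if (a->op == OP_ADDI && b->op == OP_ADDI &&
            b->rs1 == a->rd && a->rd != a->rs1) {
            b->rs1 = a->rs1;
            b->imm = (int16_t)(a->imm + b->imm);
            return true;
        }
        /* scaled add: shifti rw,rx<<s ; add ry,rw,rz -> scaled_add ry,(rx<<s),rz */
        if (a->op == OP_SHIFTI && b->op == OP_ADD && a->rd != a->rs1 &&
            (b->rs1 == a->rd || b->rs2 == a->rd)) {
            uint8_t other = (b->rs1 == a->rd) ? b->rs2 : b->rs1;
            b->op  = OP_SCALED_ADD;
            b->rs1 = a->rs1;   /* the shifted source register */
            b->rs2 = other;
            b->imm = a->imm;   /* the shift amount */
            return true;
        }
        return false;
    }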

Instruction Placement
- Many instructions cannot execute in the cycle after their source operands are produced, because the value must first be forwarded from the producing functional unit to the waiting instruction's functional unit
- The fill unit can reorder instructions within the trace so that consumers are placed where the forwarded value reaches them in time

Conclusion
- The trace cache increases IPC by raising fetch bandwidth
- The fill unit opens a new area for dynamic code optimization