1
Trace Cache
Ka-Ming Keung, Swamy D Ponpandi
CPRE 585, Fall 2004
2
Lecture Topics
- Trace cache
- Block-based trace cache
- Optimizations
3
Motivation
Complex superscalar processor designs extract higher ILP with diminishing returns. A key issue is fetch bandwidth.
4
Producer-Consumer View
Branch outcomes and jump addresses flow from Instruction Execution back to Instruction Fetch & Decode, with instruction buffers in between. In this producer-consumer view of a superscalar processor, the whole machine divides into a fetch unit and an execution unit. To maintain instruction execution throughput, the fetch unit has to keep the execution unit busy: if the execution unit can accept instructions at a rate r, the fetch unit must provide instructions at the same rate. If the fetch rate is below r, the execution unit is underutilized and IPC is low; if the fetch rate exceeds r, the fetch unit stalls. Note also that the execution unit actually directs the fetch unit where to fetch, so the fetch rate depends on the execution unit's outcomes; the fetch unit can, of course, speculate.
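As a back-of-the-envelope illustration of this balance, here is a toy Python model (the rates, buffer size, and cycle count are made-up parameters, not from the slides):

```python
# Toy model of a fetch unit feeding an execution unit through a buffer.
# Sustained IPC is bounded by min(fetch_rate, exec_rate).

def sustained_ipc(fetch_rate, exec_rate, buffer_size=16, cycles=1000):
    buffered = 0      # instructions waiting in the instruction buffers
    executed = 0
    for _ in range(cycles):
        # Fetch unit stalls once the buffers are full.
        buffered = min(buffered + fetch_rate, buffer_size)
        # Execution unit is underutilized if the buffers run dry.
        done = min(exec_rate, buffered)
        buffered -= done
        executed += done
    return executed / cycles

print(sustained_ipc(fetch_rate=2, exec_rate=4))  # fetch-limited: ~2.0
print(sustained_ipc(fetch_rate=4, exec_rate=4))  # balanced:      ~4.0
```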
5
Fetch Bandwidth
- Branch outcome
- Cache hit rate
- Branch throughput
- Instruction alignment
Fetch bandwidth depends on many factors. Branch outcome prediction has seen a lot of research; we have already looked at the BHT, target address prediction (BTB), and so on. Cache hit rate depends on the memory hierarchy, program locality, etc. We will concentrate on branch throughput and instruction alignment, because as issue width increases, multiple branches must be predicted in the same cycle for the fetch unit to supply more instructions. Multiple branch prediction introduces another problem: fetching from multiple non-contiguous locations in the cache, which requires multiple cache banks (which may conflict) or multiple ports (costly). This increases fetch unit latency, which in turn hurts the instruction execution rate (lower clock, pipeline stalls). We also need to fetch a branch and, if it is predicted taken, its target basic block in the same cycle (how is that possible?). Alignment means fetching the non-contiguous blocks, then rotating and shifting them to create the dynamic sequence of instructions.
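To make the alignment step concrete, here is a toy sketch of splicing non-contiguous basic blocks into one in-order fetch packet (the addresses, instruction encoding, and packet width are illustrative assumptions, not from the slides):

```python
# Splice non-contiguous basic blocks into a single fetch packet.
# Each basic block is a (start_address, instruction_list) pair; the
# predictor supplies the dynamic order of blocks for this cycle.

FETCH_WIDTH = 8  # hypothetical fetch-packet width in instructions

def align_fetch_packet(blocks):
    packet = []
    for start_addr, insts in blocks:
        for i, inst in enumerate(insts):
            if len(packet) == FETCH_WIDTH:
                return packet           # packet full; stop splicing
            packet.append((start_addr + 4 * i, inst))
    return packet

# Dynamic sequence A -> B crosses two non-contiguous cache blocks.
block_a = (0x1000, ["add", "ld", "beq"])   # ends in a taken branch
block_b = (0x2040, ["sub", "st", "jmp"])   # branch target block
print(align_fetch_packet([block_a, block_b]))
```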
6
The Trace Cache
- Captures dynamic instruction sequences: a snapshot
- Cache line holds up to N instructions and M basic blocks
[Diagram: the dynamic instruction stream yields trace{A: taken, taken}; the new trace is first filled from the instruction cache, and later the existing trace is accessed using start address A and the predictions (t, t). The trace cache line shown contains the 1st, 2nd, and 3rd basic blocks (the 3rd still filling) and is sent to the decoder.]
To alleviate the problems mentioned on the previous slide, the trace cache is proposed to capture the dynamic sequence of instructions. A dynamic instruction sequence can start at any address and continue through up to M basic blocks (why not just taken branches?). Each cache line is fully specified by its starting address and the outcomes of up to M-1 branches.
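A minimal sketch of this indexing idea, with a trace identified by its start address plus branch outcomes (the dictionary-based organization and names are illustrative, not the paper's actual hardware structure):

```python
# Toy trace cache: a trace is identified by its starting address plus
# the predicted outcomes of the branches inside it (True = taken).

class TraceCache:
    def __init__(self):
        self.lines = {}  # (start_addr, branch_outcomes) -> instruction list

    def fill(self, start_addr, branch_outcomes, instructions):
        # Called by the fill logic once a trace has been assembled
        # from the instruction cache.
        self.lines[(start_addr, tuple(branch_outcomes))] = instructions

    def lookup(self, fetch_addr, predictions):
        # Hit only if both the address and the predictions match.
        return self.lines.get((fetch_addr, tuple(predictions)))

tc = TraceCache()
tc.fill(0xA, (True, True), ["i1", "i2", "i3", "i4"])
print(tc.lookup(0xA, (True, True)))   # hit: the stored trace
print(tc.lookup(0xA, (True, False)))  # miss: different path -> None
```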
7
Pros and Cons
- Delivers multiple basic blocks of instructions in the same cycle, without compiler support or instruction-set changes
- Not on the critical path: the complexity is moved out of the fetch-issue pipeline, where additional latency impacts performance
8
Pros and Cons
- No need to rotate and shift basic blocks to create the dynamic instruction sequence
- Stores redundant information (the same instructions may appear in several traces and in the instruction cache)
9
Trace Cache – Dissection
[Diagram: the fetch address drives the trace cache and the core fetch unit in parallel; each can supply n instructions per cycle. A 2:1 mux selects between them into the instruction latch feeding the decoder; a line-fill path writes new traces into the trace cache.]
10
Trace Cache – More details
[Diagram: each trace cache line holds a tag, branch flags, a branch mask, and the trace target and fall-through addresses (the example line A has flags 11, mask 11,1, and addresses X and Y). The core fetch unit contains the instruction cache, branch target buffer, return address stack, and multiple-branch predictor. Fill control and merge logic build traces in the line-fill buffer. Hit logic compares the fetch address and predictions against the tag and branch flags; on a miss, BTB logic plus mask/shift/interchange assemble the 1st, 2nd, and 3rd branches and the fall-through path from the instruction cache. Either path delivers n instructions to the instruction latch for decoding, and the next fetch address returns to the trace cache.]
11
Trace Cache Components
- Valid bit
- Tag
- Branch flags
- Branch mask
- Trace fall-through address
- Trace target address
- Line-fill buffer
Line-Fill Buffer
- Services trace cache misses
- The control logic merges each incoming block of instructions with the preceding instructions in the line-fill buffer
- Filling is complete when (1) n instructions have been traced, or (2) m branches have been detected
- After filling, the contents are moved into the trace cache
- Branch flags and the branch mask are generated during the line-fill process
- The trace target and fall-through addresses are computed at the end of the line fill
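A toy sketch of the fill policy just described (n, m, and the block representation are illustrative; real fill logic also builds the branch mask and computes the target and fall-through addresses):

```python
# Toy line-fill buffer: merge fetched blocks until the trace holds
# n instructions or m branches, then commit the line.

N_INSTS, M_BRANCHES = 16, 3  # illustrative capacity limits

class LineFillBuffer:
    def __init__(self):
        self.insts, self.branch_flags = [], []

    def merge(self, block_insts, branch_taken):
        """Merge one basic block (ending in a branch); return a
        finished line once the limits are reached, else None."""
        self.insts.extend(block_insts)
        self.branch_flags.append(branch_taken)
        if len(self.insts) >= N_INSTS or len(self.branch_flags) >= M_BRANCHES:
            line = (self.insts[:N_INSTS], self.branch_flags)
            self.insts, self.branch_flags = [], []
            return line  # ready to move into the trace cache
        return None

lfb = LineFillBuffer()
print(lfb.merge(["i1", "i2"], True))   # None: still filling
print(lfb.merge(["i3"], False))        # None
print(lfb.merge(["i4", "i5"], True))   # line complete: m branches reached
```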
12
Trace Cache Hit Conditions
1. The fetch address matches the tag
2. The branch predictions match the branch flags
On a hit, an entire trace of instructions is fed into the instruction latch.
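The two hit conditions as a small predicate (a sketch; the field names follow the component list on the earlier slide):

```python
from dataclasses import dataclass

@dataclass
class TraceLine:
    valid: bool
    tag: int
    branch_mask: int      # number of branches recorded in the line
    branch_flags: tuple   # taken/not-taken outcome per recorded branch

def trace_cache_hit(line, fetch_addr, predictions):
    """Hit only if the address matches the tag AND the predictions
    match the branch flags; otherwise fetch from the instruction cache."""
    return (line.valid
            and fetch_addr == line.tag
            and tuple(predictions[:line.branch_mask]) == line.branch_flags)

line = TraceLine(valid=True, tag=0xA, branch_mask=2, branch_flags=(True, True))
print(trace_cache_hit(line, 0xA, (True, True, False)))  # True
print(trace_cache_hit(line, 0xA, (True, False)))        # False
```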
13
Trace Cache Miss
On a miss, fetching proceeds normally from the instruction cache.
14
Design Issues
- Associativity
- Multiple paths
- Partial matches
- Indexing methods
- Cache line size
- Filling
- Trace selection
- Victim trace cache
15
Results
[figure]
16
[figure]
17
Trace cache effectiveness
[figure]
18
Pentium 4 Trace Cache
- 6 uops per line (what is the benefit of storing uops rather than raw x86 instructions?)
- Holds up to 12K uops (about 21 KB)
- Uses virtual addresses, so no address translation is needed
19
Block-Based Trace Cache
Fetch Address Renaming
- A unique renamed pointer is assigned to each fetch address
- Advantage: fewer bits, and no tag comparison, so instruction fetch is faster
- The rename table is maintained at completion time
- This moves the complexity and latency from instruction fetch time to completion time
20
Block
- Instructions are stored in blocks
- Each block contains a trace of instructions
- A block is updated at the end of instruction execution (completion)
- Each block has a block ID
21
How is the trace found?
- Originally: by fetch address
- Now: by trace ID
- The trace ID is determined by the branch history and the past block ID predictions
- The trace table stores the block ID predictions
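A toy sketch of a trace table lookup (the hash, table size, and update policy are illustrative assumptions, not from the slides):

```python
# Toy trace table: maps (current block ID, branch history) to the
# predicted IDs of the next blocks to fetch.

TABLE_SIZE = 1024  # illustrative

class TraceTable:
    def __init__(self):
        self.table = [None] * TABLE_SIZE

    def _index(self, block_id, history):
        # Illustrative hash of block ID and branch-history bits.
        return (block_id ^ history) % TABLE_SIZE

    def predict(self, block_id, history):
        """Return the predicted next block IDs, or None."""
        return self.table[self._index(block_id, history)]

    def update(self, block_id, history, next_block_ids):
        # Trained at completion time with the observed block sequence.
        self.table[self._index(block_id, history)] = tuple(next_block_ids)

tt = TraceTable()
tt.update(block_id=5, history=0b1101, next_block_ids=[7, 2])
print(tt.predict(5, 0b1101))  # (7, 2)
```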
22
Rename Table
- Checks whether the predicted block already exists
- Assigns a block ID to a new trace
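And a toy sketch of the rename table's two jobs (the table shape and ID allocation are illustrative):

```python
# Toy rename table: maps a fetch address to a small block ID, so the
# fetch path compares short IDs instead of full address tags.
# Updated at completion time, off the fetch critical path.

class RenameTable:
    def __init__(self, num_ids=256):
        self.addr_to_id = {}
        self.free_ids = list(range(num_ids))

    def rename(self, fetch_addr):
        """Return the existing block ID, or assign one to a new block."""
        if fetch_addr in self.addr_to_id:      # predicted block exists
            return self.addr_to_id[fetch_addr]
        block_id = self.free_ids.pop(0)        # new trace: grab a free ID
        self.addr_to_id[fetch_addr] = block_id
        return block_id

rt = RenameTable()
print(rt.rename(0x4000))  # 0: newly assigned
print(rt.rename(0x4000))  # 0: same block, no tag comparison needed
```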
23
Dynamic Optimizations
Register Moves
- A register move requires no calculation
- It can be performed in the ROB and the renaming unit
- The trace cache can do this in a better way
24
Register Moves: Advantages
- By placing the detection logic in the fill unit, the decode and rename logic can execute move instructions without paying the latency of detecting them
- Moves require no execution resources, and thus incur no related delays
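A sketch of the idea: the fill unit tags moves while building the trace, and rename satisfies them by copying a map entry (the instruction and rename-map representations are illustrative):

```python
# Toy fill-unit pass: flag register moves in a trace so that rename
# can satisfy them by copying a map entry instead of issuing an op.

def mark_moves(trace):
    """Tag 'mov rd, rs' instructions found while the trace is filled."""
    for inst in trace:
        inst["is_move"] = inst["op"] == "mov"
    return trace

def rename_move(rename_map, inst):
    # The destination simply inherits the source's physical register;
    # no functional unit is ever occupied.
    rename_map[inst["rd"]] = rename_map[inst["rs"]]

trace = mark_moves([{"op": "mov", "rd": "r2", "rs": "r1"}])
rename_map = {"r1": "p7", "r2": "p3"}
for inst in trace:
    if inst["is_move"]:
        rename_move(rename_map, inst)
print(rename_map)  # {'r1': 'p7', 'r2': 'p7'}: r2 now aliases r1's value
```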
25
Re-association
Decreases the dependency chain:
addi Rx, Ry, 4
addi Rz, Rx, 4
becomes
addi Rz, Ry, 8
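This transformation as a toy fill-unit peephole (the instruction encoding is illustrative; note that the first add is kept, as this sketch does, in case Rx is still live):

```python
# Toy peephole: collapse back-to-back immediate adds whose second
# input is the first's result, shortening the dependency chain.

def reassociate(trace):
    out = []
    for inst in trace:
        prev = out[-1] if out else None
        if (prev and inst["op"] == "addi" and prev["op"] == "addi"
                and inst["rs"] == prev["rd"]):
            # addi Rx,Ry,a ; addi Rz,Rx,b  ->  addi Rz,Ry,a+b
            out.append({"op": "addi", "rd": inst["rd"],
                        "rs": prev["rs"], "imm": prev["imm"] + inst["imm"]})
        else:
            out.append(inst)
    return out

trace = [{"op": "addi", "rd": "Rx", "rs": "Ry", "imm": 4},
         {"op": "addi", "rd": "Rz", "rs": "Rx", "imm": 4}]
print(reassociate(trace))  # second add now reads Ry directly, imm 8
```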
26
Scaled Adds
Decreases the dependency chain:
shifti Rw, Rx << 1
add Ry, Rw, Rz
becomes
scaled_add Ry, (Rx << 1), Rz
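The matching shift-add fusion as a toy peephole (same illustrative encoding; this sketch assumes Rw is not live after the add and that the machine provides a fused scaled-add operation):

```python
# Toy peephole: fuse a shift-immediate feeding an add into one
# scaled add, removing a link from the dependency chain.

def fuse_scaled_adds(trace):
    out = []
    for inst in trace:
        prev = out[-1] if out else None
        if (prev and inst["op"] == "add" and prev["op"] == "shifti"
                and prev["rd"] in (inst["rs1"], inst["rs2"])):
            other = inst["rs2"] if inst["rs1"] == prev["rd"] else inst["rs1"]
            # shifti Rw,Rx,<<s ; add Ry,Rw,Rz -> scaled_add Ry,(Rx<<s),Rz
            out[-1] = {"op": "scaled_add", "rd": inst["rd"],
                       "rs": prev["rs"], "shift": prev["sh"], "rs2": other}
        else:
            out.append(inst)
    return out

trace = [{"op": "shifti", "rd": "Rw", "rs": "Rx", "sh": 1},
         {"op": "add", "rd": "Ry", "rs1": "Rw", "rs2": "Rz"}]
print(fuse_scaled_adds(trace))  # one scaled_add replaces both ops
```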
27
Instruction Placement
Many instructions are unable to execute in the cycle after their source operands are produced, because they must wait for the value to be forwarded from the producing functional unit to the awaiting instruction's functional unit. The solution: reorder the instructions within the trace cache line.
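A toy placement pass along these lines (it assumes forwarding between functional-unit clusters costs an extra cycle, so dependents are co-located with a producer; the trace format and cluster count are illustrative):

```python
# Toy placement pass: put each instruction on the same functional-unit
# cluster as its producer so the result need not cross the bypass network.

def place(trace, num_clusters=2):
    cluster_of = {}    # register -> cluster where its producer sits
    placement = []
    next_free = 0      # naive round-robin fallback
    for inst in trace:
        producers = [cluster_of[r] for r in inst["srcs"] if r in cluster_of]
        if producers:
            cluster = producers[0]   # next to a producer: no forwarding hop
        else:
            cluster = next_free % num_clusters
            next_free += 1
        cluster_of[inst["dst"]] = cluster
        placement.append((inst["name"], cluster))
    return placement

trace = [{"name": "i1", "dst": "Rx", "srcs": []},
         {"name": "i2", "dst": "Ry", "srcs": ["Rx"]},  # placed with i1
         {"name": "i3", "dst": "Rz", "srcs": []}]
print(place(trace))  # [('i1', 0), ('i2', 0), ('i3', 1)]
```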
28
Conclusion
- Increases IPC
- Opens a new area for code optimization