
EECC722 - Shaaban, Lec # 9, Fall 2001 (10-10-2001)

Conventional & Block-based Trace Caches

In high-performance superscalar processors the instruction fetch bandwidth requirements may exceed what conventional instruction cache fetch mechanisms can provide. The fetch mechanism is expected to supply a large number of instructions per cycle, but this is hindered because:
– Long dynamic instruction sequences are not always in contiguous cache locations.
– It is also difficult to fetch a taken branch and its target in a single cycle.
These factors limit the fetch bandwidth of a conventional instruction cache to one basic block per cycle.
Previous studies identify three challenges in high-bandwidth instruction fetching: multiple-branch prediction, fetching multiple (non-contiguous) fetch groups, and instruction alignment and collapsing.
All methods to increase instruction fetch bandwidth perform multiple-branch prediction the cycle before instruction fetch, and fall into two general categories:
– Enhanced Instruction Caches
– Trace Caches
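
To make the basic-block limitation concrete, here is a minimal sketch (a toy Python model; the 16-wide fetch and the instruction encoding are illustrative assumptions, not the lecture's parameters) of a conventional fetch stage that must stop at the first predicted-taken branch:

    FETCH_WIDTH = 16

    def fetch_group(icache_line, predict_taken):
        """icache_line: list of (pc, is_branch); predict_taken: pc -> bool."""
        group = []
        for pc, is_branch in icache_line[:FETCH_WIDTH]:
            group.append(pc)
            if is_branch and predict_taken(pc):
                break           # fetch ends at a predicted-taken branch
        return group

    # A taken branch 5 instructions into the line limits the fetch group
    # to 5 instructions, far below the 16-wide fetch the core could use.
    line = [(pc, pc == 0x110) for pc in range(0x100, 0x140, 4)]
    print(fetch_group(line, lambda pc: True))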

Approaches To High-Bandwidth Instruction Fetching: Enhanced Instruction Caches

Support fetching of non-contiguous blocks with a multi-ported, multi-banked, or replicated instruction cache. This leads to multiple fetch groups that must be aligned and collapsed at fetch time, which can increase the fetch latency. Examples:
– Branch Address Cache.
– Collapsing Buffer (CB).

Collapsing Buffer (CB)

This method assumes the following elements in the fetch mechanism:
– An interleaved I-cache and branch target buffer (BTB),
– A multiple-branch predictor,
– A collapsing buffer.
The goal of this method is to fetch multiple cache lines from the I-cache and collapse them together in one fetch iteration. It requires that the BTB be accessed more than once per cycle, to predict each successive branch after the first and to supply each new cache line. The successive cache lines must also reside in different cache banks. Therefore, this method not only increases hardware complexity and latency, but is also not very scalable.
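
A hedged sketch of the collapsing-buffer idea follows: two cache lines are read in one iteration (they must map to different banks) and their selected instructions are merged into one fetch group. The bank count, line size, and valid-mask interface are assumptions for illustration only.

    N_BANKS = 2

    def bank_of(line_addr):
        return line_addr % N_BANKS      # simple interleaving assumption

    def collapse(line_a, valid_a, line_b, valid_b):
        """Merge the selected instructions of two lines into one group."""
        return [i for i, v in zip(line_a, valid_a) if v] + \
               [i for i, v in zip(line_b, valid_b) if v]

    def fetch_noncontiguous(icache, addr_a, addr_b, valid_a, valid_b):
        if bank_of(addr_a) == bank_of(addr_b):
            return None     # bank conflict: fall back to single-line fetch
        return collapse(icache[addr_a], valid_a, icache[addr_b], valid_b)

    # Line 0x10 up to its taken branch (3 instructions), then the target
    # line 0x21 from its entry point onward, collapsed into one group:
    icache = {0x10: list("ABCDEFGH"), 0x21: list("ijklmnop")}
    print(fetch_noncontiguous(icache, 0x10, 0x21,
                              [1, 1, 1, 0, 0, 0, 0, 0],
                              [0, 0, 1, 1, 1, 1, 1, 1]))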

[Figure: Collapsing Buffer (CB)]

Branch Address Cache

This method has four major components:
– The branch address cache (BAC),
– A multiple-branch predictor,
– An interleaved instruction cache,
– An interchange and alignment network.
The basic operation of the BAC is that of a branch history tree mechanism, with the depth of the tree determined by the number of branches to be predicted per cycle. The tree determines the path taken through the code and therefore the blocks that will be fetched from the I-cache. Again, a structure is needed to collapse the code into one stream, and the design must either access multiple cache banks at once or pipeline the cache reads. Whichever of these is implemented, the BAC method adds many cycles to the instruction pipeline and also adds considerable complexity.
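
The tree walk can be sketched as follows; the dict-based BAC and its (fall-through, target) entry format are illustrative assumptions, not the actual hardware organization:

    def bac_fetch_addresses(bac, fetch_addr, predictions):
        """bac maps a block address -> (fall_through_addr, taken_target).
        predictions holds one taken/not-taken bit per predicted branch;
        the tree depth equals the number of branches predicted per cycle."""
        addrs = [fetch_addr]
        for taken in predictions:
            fall_through, target = bac[addrs[-1]]
            addrs.append(target if taken else fall_through)
        return addrs    # these blocks still need alignment and collapsing

    bac = {0x100: (0x120, 0x200), 0x200: (0x220, 0x300)}
    print(bac_fetch_addresses(bac, 0x100, [True, False]))
    # -> [0x100, 0x200, 0x220]: the blocks to read from the I-cache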

[Figure: Branch Address Cache (BAC)]

Multiple Branch Prediction Using a Global Pattern History Table (MGAg)

Algorithm to make two branch predictions per cycle from a single global branch history register of k bits:
– The primary branch is predicted by indexing the pattern history table with all k history bits.
– To predict the secondary branch, the right-most k-1 branch history bits are used; these k-1 bits address two adjacent entries in the pattern history table.
– The primary branch prediction is used to select one of the two entries to make the secondary branch prediction.
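
A minimal sketch of this two-prediction lookup, assuming a pattern history table of 2-bit saturating counters and a shift-register history where the most recent outcome occupies the least-significant bit (the table size and counter details are assumptions):

    K = 8                       # global history length
    PHT = [1] * (1 << K)        # 2^k two-bit counters, weakly not-taken

    def predict_two(ghr):
        # Primary prediction: all k history bits index the PHT.
        primary = PHT[ghr & ((1 << K) - 1)] >= 2
        # Secondary prediction: the right-most k-1 history bits select a
        # pair of adjacent entries; the primary prediction supplies the
        # missing (most recent) history bit and picks one of the pair.
        base = (ghr & ((1 << (K - 1)) - 1)) << 1
        secondary = PHT[base | (1 if primary else 0)] >= 2
        return primary, secondary

    print(predict_two(0b10110101))   # two predictions from one history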

Approaches To High-Bandwidth Instruction Fetching: Conventional Trace Caches

The trace cache is an instruction cache that captures dynamic instruction sequences and makes them appear contiguous. Each line of this cache stores a trace of the dynamic instruction stream. If the trace cache line size is n and the maximum number of branch predictions that can be generated per cycle is m, then a trace can contain at most n instructions and up to m basic blocks. A trace is defined by its starting address and a sequence of m-1 branch predictions; these define the path followed by that trace across m-1 branches. The first time a control-flow path is executed, instructions are fetched as normal through the instruction cache. This dynamic sequence of instructions is allocated in the trace cache after assembly in the fill unit. Later, if there is a match for the trace (same starting address and same branch predictions), the trace is read from the trace cache into the fetch buffer; otherwise, the instructions are fetched from the instruction cache.
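
A toy sketch of the trace hit test, assuming a structure indexed by starting address, with n = 16 and m = 3 chosen purely for illustration:

    N_INSTR = 16    # n: max instructions per trace line
    M_BLOCKS = 3    # m: max basic blocks, so m-1 = 2 embedded predictions

    class TraceLine:
        def __init__(self, start_pc, branch_flags, instrs):
            self.start_pc = start_pc
            self.branch_flags = tuple(branch_flags)  # m-1 taken/not-taken bits
            self.instrs = instrs[:N_INSTR]

    trace_cache = {}    # indexed by start_pc in this toy model

    def trace_fetch(start_pc, predicted_flags):
        line = trace_cache.get(start_pc)
        if line and line.branch_flags == tuple(predicted_flags[:M_BLOCKS - 1]):
            return line.instrs      # hit: the whole path in one fetch
        return None                 # miss: fall back to the I-cache path

    # Fill path (simplified): the fill unit assembles and installs a trace,
    # after which the same start address and predictions hit in one cycle.
    trace_cache[0x400] = TraceLine(0x400, (True, False), ["i0", "i1", "i2"])
    print(trace_fetch(0x400, [True, False]))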

[Figure: High-Level View of Trace Cache Operation]

[Figure: Conventional Trace Cache]

[Figure: Trace Cache: Core Fetch Unit]

[Figure: The Complete Trace Cache Fetch Mechanism]


Block-Based Trace Cache

The block-based trace cache improves on the conventional trace cache: instead of explicitly storing the instructions of a trace, it stores pointers to the blocks constituting the trace in a much smaller trace table. Fetch addresses are renamed at the basic-block level, and aligned blocks are stored in a block cache. Traces are constructed by accessing the replicated block cache using the block pointers from the trace table. There are four major components:
– the trace table,
– the block cache,
– the rename table,
– the fill unit.

[Figure: Block-Based Trace Cache]

Block-Based Trace Cache: Trace Table

The Trace Table is the mechanism that stores the renamed pointers (block ids) to the basic blocks for trace construction. Each entry holds a shorthand version of a trace, consisting of a valid bit, a tag, and the block ids of the trace. These traces are used in the fetch cycle to indicate which blocks are to be fetched and how the blocks are to be collapsed by the final collapse MUX. The next trace is also predicted using the Trace Table. This is done with a hashing function based either on the last block id and the global branch history (gshare prediction) or on a combination of the branch history and previous block ids. The Trace Table is filled in the completion stage: the block ids and block steering bits are created there, based on the blocks that were just executed and completed.
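
A sketch of the trace-table lookup for next-trace prediction, using the gshare-style hash of the last block id with the global branch history mentioned above; the table size, tag choice, and entry layout are assumptions:

    TT_ENTRIES = 1024

    class TTEntry:
        def __init__(self, tag, block_ids, steer_bits):
            self.valid, self.tag = True, tag
            self.block_ids = block_ids    # pointers into the block cache
            self.steer_bits = steer_bits  # drive the final collapse MUX

    trace_table = [None] * TT_ENTRIES

    def predict_next_trace(last_block_id, ghist):
        index = (last_block_id ^ ghist) % TT_ENTRIES  # gshare-style hash
        entry = trace_table[index]
        if entry and entry.valid and entry.tag == last_block_id:
            return entry.block_ids, entry.steer_bits
        return None, None   # no trace prediction: fetch blocks one by one

    trace_table[(5 ^ 0b1011) % TT_ENTRIES] = TTEntry(5, [5, 9, 12], [0, 1, 2])
    print(predict_next_trace(5, 0b1011))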

[Figure: Trace Table]

Block-Based Trace Cache: Block Cache

The Block Cache is the storage mechanism for the instruction blocks to be executed. It consists of replicated storage to allow simultaneous accesses to the cache in the fetch stage; the number of copies of the Block Cache therefore governs the number of blocks allowed per trace. At fetch time, the Trace Table provides the block ids to fetch and the steering bits. The needed blocks are then collapsed into the predicted trace using the final collapse MUX. From here, the instructions in the trace can be executed as normal on the superscalar core.
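
The fetch-stage access can be sketched like this: each predicted block id reads a different replicated copy of the block cache, and the steering bits order the blocks through the final collapse MUX. The copy count and the list-based "MUX" are illustrative simplifications:

    N_COPIES = 4    # copies of the block cache = max blocks per trace

    def fetch_trace(block_cache_copies, block_ids, steer_bits):
        # One copy per block id, so all reads can happen simultaneously.
        blocks = [block_cache_copies[i][bid]
                  for i, bid in enumerate(block_ids[:N_COPIES])]
        trace = []
        for order in steer_bits:        # final collapse: steer blocks into
            trace.extend(blocks[order]) # their predicted program order
        return trace

    storage = {7: ["a0", "a1"], 3: ["b0"]}
    copies = [storage] * N_COPIES   # replicated copies hold identical data
    print(fetch_trace(copies, [7, 3], [0, 1]))   # -> ['a0', 'a1', 'b0']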

[Figure: Block Cache With Final Collapse MUX]

[Figure: Example Implementation of the Rename Table (8 entries, 2-way set associative)]

Block-Based Trace Cache: The Fill Unit

The Fill Unit is an integral part of the block-based trace cache. It updates the Trace Table, Block Cache, and Rename Table at completion time. The Fill Unit constructs a trace of the executed blocks after their completion. From this trace, it updates the Trace Table with the trace prediction, the Block Cache with physical blocks built from the executed instructions, and the Rename Table with the fetch addresses of the first instruction of each executed block. It also controls the overwriting of Block Cache and Rename Table entries that already exist: when an entry already exists, the Fill Unit does not rewrite the data, so that update bandwidth is not wasted.
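
A sketch of this completion-time update, reusing the toy structures from the earlier sketches (the tuple-based block format and the allocation policy are assumptions):

    def fill_unit_update(completed_blocks, trace_table, block_cache,
                         rename_table, index, tag):
        """completed_blocks: list of (start_pc, instrs) in completion order."""
        block_ids = []
        for start_pc, instrs in completed_blocks:
            bid = rename_table.get(start_pc)   # fetch address -> block id
            if bid is None:                    # allocate only when new;
                bid = len(rename_table)        # existing entries are left
                rename_table[start_pc] = bid   # untouched to save bandwidth
                block_cache[bid] = instrs
            block_ids.append(bid)
        trace_table[index] = (tag, block_ids)  # record the trace prediction

    tt, bc, rt = {}, {}, {}
    fill_unit_update([(0x100, ["i0", "i1"]), (0x200, ["j0"])],
                     tt, bc, rt, index=5, tag=0x100)
    print(tt, bc, rt)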

[Figure: Performance Comparison: Block-based vs. Conventional Trace Cache]