Trace Caches J. Nelson Amaral

Difficulties in Instruction Fetching
Where to fetch the next instruction from?
– Use branch prediction. Sometimes there is a misprediction.
Likely can only fetch from one I-cache line
– the m instructions to be fetched may spread over two lines.
I-cache misses are even worse.
Taken branches: the target address may be in the middle of a cache line
– instructions before the target must be discarded
– the remainder of the m instructions fetched must be discarded.
Baer p. 159

Getting more from the I-cache
How can we increase the probability that more of the m instructions needed are in a cache line?
– Increase the cache line size: increasing it too much increases cache misses.
– Fetch the “next” line. But what is “next” in a set-associative cache? After replacements:
– the next line may not contain the right instructions
– address checking and repair are needed even with no branches
Baer p. 159

Line-and-Way Predictor
Instead of predicting a branch target address, predict the next line and set in the I-cache.
– Called a Next Cache Line and Set Predictor by Calder and Grunwald (ISCA ’95).
NLS-cache: associate the predictor bits with a cache line.
NLS-table: store the predictor bits in a separate direct-mapped, tagless buffer.
Effective for programs containing many branches.
Baer p. 160

NLS-Cache
Brad Calder and Dirk Grunwald, “Next Cache Line and Set Prediction,” International Symposium on Computer Architecture (ISCA), 1995.
NLS: a tagless table of pointers into the instruction cache, giving the next instruction to be executed.
NLS also predicts indirect branches and provides the branch type: three predicted addresses.
Needs an early distinction between branch and non-branch instructions.
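As a rough illustration of the idea (the table size, indexing, and field names below are invented for this sketch, not taken from the paper), an NLS-table can be modeled as a tagless, direct-mapped array of (line, way) pointers into the I-cache:

```python
# Hypothetical NLS-table model: a tagless, direct-mapped array of
# (line, way) pointers into the I-cache, indexed by low-order bits
# of the fetch address. All sizes here are illustrative.

NLS_ENTRIES = 256

class NLSTable:
    def __init__(self):
        self.table = [(0, 0)] * NLS_ENTRIES  # default prediction

    def _index(self, pc):
        return (pc >> 2) % NLS_ENTRIES       # word-aligned PCs

    def predict(self, pc):
        # Tagless: a prediction is always produced; it may be wrong,
        # so the fetched line must still be checked against the real
        # target and repaired on a mismatch.
        return self.table[self._index(pc)]

    def update(self, pc, line, way):
        self.table[self._index(pc)] = (line, way)

nls = NLSTable()
nls.update(0x1000, line=42, way=1)
print(nls.predict(0x1000))  # (42, 1)
```

Because the table is tagless, two fetch addresses that share the same low-order bits alias to the same entry, which is exactly why the verify-and-repair step mentioned above is required.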

Trace Caches
A cache that records instructions in the sequence in which they were fetched (or committed).
– The PC indexes into the trace cache.
– If the predictions are correct:
the whole trace is fetched in one cycle
all instructions in the trace are executed
Baer p. 161

Trace Cache Design Issues
How to design the trace cache? How to build a trace? When to build a trace? How to fetch a trace? When to fetch a trace? When to replace a trace?
Baer p. 161

Instruction Fetch with I-cache Baer p. 161

Fetch with I-cache and Trace Cache
Baer p. 161

Trace Selection Criteria
Number of conditional branches in a trace
– the number of consecutive correct predictions is limited.
Merging the next block may exceed the trace line
– no partial blocks in a trace.
An indirect jump or a call/return terminates a trace.
Baer p. 161
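A sketch of how a fill unit might apply these selection criteria (the limits of 16 instructions and 3 conditional branches, and the opcode names, are illustrative values for this sketch):

```python
# Illustrative fill-unit loop applying the trace selection criteria:
# cap the instruction count, cap the conditional branches, keep only
# whole basic blocks, and terminate on indirect jumps / call-returns.

MAX_TRACE_INSNS = 16    # example trace-line capacity
MAX_COND_BRANCHES = 3   # example branch-prediction limit

def build_trace(basic_blocks):
    """basic_blocks: list of basic blocks, each a list of instruction
    dicts such as {'op': 'add'} or {'op': 'beq'}."""
    trace, branches = [], 0
    for block in basic_blocks:
        # No partial blocks: stop if the whole block does not fit.
        if len(trace) + len(block) > MAX_TRACE_INSNS:
            break
        trace.extend(block)
        last = block[-1]['op']
        if last in ('beq', 'bne'):            # conditional branch
            branches += 1
            if branches == MAX_COND_BRANCHES:
                break
        elif last in ('jr', 'call', 'ret'):   # indirect jump / call-return
            break                             # terminates the trace
    return trace
```

For example, a 10-instruction block followed by a 7-instruction block yields a 10-instruction trace, because merging the second block would exceed the 16-instruction line.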

Trace Tags
What should be the tag of a trace? Is it sufficient to use the address of the first instruction as the tag?
Baer p. 161

Tags for Trace Cache Entries
Assume a trace may contain up to 16 instructions. There are two possible traces:
T1: B1-B2-B4
T2: B1-B3-B4
T1 and T2 start at the same address, so the start address alone cannot distinguish them.
Possible solution: add the predicted branch outcomes to the tag.
Baer p. 162
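One way to sketch such a tag (the packing scheme and the particular outcome encodings for T1 and T2 are assumptions for illustration): combine the start PC with a bit vector of the predicted conditional-branch outcomes, so that T1 and T2 map to distinct tags even though they share a start address.

```python
# Sketch: tag a trace by its start address plus the sequence of
# predicted conditional-branch outcomes inside it.

def trace_tag(start_pc, branch_outcomes):
    """branch_outcomes: list of booleans, True = predicted taken.
    Packs them into a small bit vector alongside the start PC."""
    bits = 0
    for i, taken in enumerate(branch_outcomes):
        if taken:
            bits |= 1 << i
    return (start_pc, len(branch_outcomes), bits)

# Assumed encodings: suppose B1-B2-B4 corresponds to both branches
# falling through, and B1-B3-B4 to the first branch being taken.
t1 = trace_tag(0x4000, [False, False])  # T1: B1-B2-B4
t2 = trace_tag(0x4000, [True, False])   # T2: B1-B3-B4
assert t1 != t2                          # same start PC, distinct tags
```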

Fetch with I-cache and Trace Cache
Bypassing the decode stage: instructions in a trace may not need to be decoded, i.e., a trace of μops.
Big advantage on CISC ISAs (Intel IA-32), where decoding is expensive.
Baer p. 162

Where to build a trace from?
Option 1: the fill unit builds traces from the decoder output.
– Traces from mispredicted paths are added to the trace cache.
Baer p. 163

Where to build a trace from?
Option 2: the fill unit builds traces from the reorder buffer.
– Long delay to build a trace.
– Not much performance difference between building from the decoder and from the ROB.
Baer p. 163

Next Trace Predictor
To predict the next trace, we need to predict the outcomes of several branches at the same time.
– An expanded BTB can be used.
– The prediction can be based on a path history of past traces: use bits from the tags of previous traces to index a trace predictor.
Baer p. 163
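A toy version of such a path-history predictor (the history depth, table size, and folding hash are all invented for this sketch): fold bits from the tags of the last few traces into an index, and store the predicted next-trace tag at that index.

```python
# Toy path-history next-trace predictor: the index is a fold of the
# last HIST_DEPTH trace tags; the table stores the next-trace tag
# that followed this path last time.

HIST_DEPTH = 3
PRED_ENTRIES = 1024

class NextTracePredictor:
    def __init__(self):
        self.history = [0] * HIST_DEPTH     # tags of recent traces
        self.table = [None] * PRED_ENTRIES

    def _index(self):
        idx = 0
        for tag in self.history:
            idx = ((idx << 4) ^ tag) % PRED_ENTRIES
        return idx

    def predict(self):
        return self.table[self._index()]

    def update(self, actual_next_tag):
        self.table[self._index()] = actual_next_tag
        self.history = self.history[1:] + [actual_next_tag]

p = NextTracePredictor()
predictions = []
for tag in [1, 2, 3] * 3:       # a repeating sequence of trace tags
    predictions.append(p.predict())
    p.update(tag)
print(predictions[-3:])         # [1, 2, 3]: correct once the path repeats
```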

Intel Pentium 4 Trace Cache
The trace cache contains up to 6 μops per line.
– It can store 12K μops (2K lines).
– Intel claims it delivers a hit rate equal to an 8 to 16 KB I-cache.
There is no I-cache.
– On a trace cache miss, the processor fetches from L2.
The trace cache has its own BTB (512 entries).
– Another independent 4K-entry BTB serves L2 fetches.
Advantages over an I-cache: increased fetch bandwidth and bypassing of the decoder.
Baer p. 163

A 2-to-4 Decoder

An 8-to-256 Decoder
Inputs A0 through A7 select one of 256 output lines.
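The behavior of an n-to-2**n decoder can be sketched functionally (this models the truth table only, not the gate-level structure or fan-out):

```python
# Functional model of an n-to-2**n decoder: the n input bits select
# exactly one of the 2**n output lines (one-hot output).

def decode(bits):
    """bits: list of 0/1 inputs, most significant bit first.
    Returns the one-hot output as a list of 2**n lines."""
    n = len(bits)
    value = 0
    for b in bits:
        value = (value << 1) | b   # assemble the binary input value
    out = [0] * (2 ** n)
    out[value] = 1                 # assert exactly one output line
    return out

print(decode([1, 0]))          # 2-to-4 case: [0, 0, 1, 0], line 2 asserted
print(len(decode([0] * 8)))    # 8-to-256 case: 256 output lines
```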

Decoding Complexities
Need m decoders.
To speculatively compute branch targets for each of the m instructions:
– need m adders in the decode stage.
– Solution: limit the number of branches decoded per cycle.
If a branch is resolved in c cycles, there can still be c branches in flight; c = 10 is typical.
Baer p. 164

Alleviating Decoding Complexities
Use predecoded bits appended to instructions.
– Predecode on transfers from L2 to the I-cache.
CISC: limit the number of complex instructions decoded in a single cycle.
– Intel P6: 3 decoders, 2 for 1-μop instructions and 1 for complex instructions.
Baer p. 164

Alleviating Decoding Complexities Use an extra decoding stage to steer instructions towards instruction queues. Baer p. 164

Pre-decoded Bits
Append 4 bits to each instruction to
– designate the class (integer, floating point, branch, load/store) and the execution-unit queue.
Partial decode on transfer from L2 to the I-cache.
MIPS R10000. Baer p. 165

Pre-decoded Bits
3 bits appended to each byte:
– indicate how many bytes away the start of the next instruction is
– stored in a predecode cache
– the predecode cache is accessed in parallel with the I-cache.
Advantage: detection of instruction boundaries is done only once (not at every execution), which saves power.
Disadvantage: the effective size of the I-cache is almost doubled.
AMD K7. Baer p. 165
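A sketch of how such boundary bits could be computed (the instruction lengths are made-up inputs; a real x86 predecoder works on raw bytes and also marks prefixes and opcode positions):

```python
# Sketch of K7-style predecode: for each byte of a variable-length
# instruction stream, record how many bytes away the next instruction
# starts. Computed once on the L2-to-I-cache transfer, then reused on
# every fetch instead of re-finding boundaries each time.

def predecode(insn_lengths):
    """insn_lengths: length in bytes of each instruction, in order.
    Returns one small integer per byte: the distance from that byte
    to the start of the next instruction."""
    bits = []
    for length in insn_lengths:
        # byte i of an L-byte instruction is (L - i) bytes from the
        # next instruction's start
        bits.extend(range(length, 0, -1))
    return bits

print(predecode([1, 3, 2]))  # [1, 3, 2, 1, 2, 1]
```

Each per-byte value fits in the 3 bits mentioned above as long as instructions are at most 8 bytes from a boundary, which hints at why storing them nearly doubles the I-cache.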

Instruction Buffer (Queue)
A pipeline stage brings instructions into a buffer.
– Instruction boundaries are determined in the buffer.
– Instructions are steered to either a simple decoder or a complex decoder.
– Can detect opportunities for instruction fusion (e.g., a compare-and-test followed by a branch) and may send the fused instruction to a simple decoder.
Intel P6. Baer p. 165

Impact of Decoding on Superscalar Design
The complexity of the decoder is one of the major limitations to increasing m.
Baer p. 165

Three Approaches to Register Renaming
– Reorder Buffer
– Monolithic Physical Register File
– Architectural Register File with a Physical Extension
Baer p. 165

Implementation of Register Renaming
Where and when to allocate/release physical registers?
What to do on a branch misprediction?
What to do on an exception?
Baer p. 166

Example
i1: R1 ← R2/R3 # division takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9
After renaming:
i1: R32 ← R2/R3
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9
R35 will receive a value before instruction i2 is issued. When/how can R32 be released? As soon as i2 issues. But how does the hardware know that i2 is the last use of R32?
Use a counter?
– rename in a use → count up
– instruction issued → count down
Too expensive!
Baer p. 166

Example
i1: R32 ← R2/R3 # division takes a long time
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9
R35 will receive a value before instruction i2 is issued. But when i4, which renames R1 again, commits, all earlier instructions, including every use of R32, have already committed and therefore issued.
Release R32 when i4 commits!
Baer p. 166

Physical Register Lifecycle
i1: R1 ← R2/R3 # division takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9
(renamed to R32, R33, R34, R35 as before)
Each physical register moves through four states:
– Renaming stage: allocated as a result (Free → Allocated).
– End of execution: result generated (Allocated → Executed).
– Commit: the physical register becomes an architectural register (Executed → Assigned).
– Release: when the next instruction that renames the same architectural register commits, the register returns to the free state.
The original slides animate these states for the example: R32 is allocated when i1 is renamed, marked executed when the division completes, becomes architectural when i1 commits, and is released when i4, the next instruction renaming R1, commits.
Baer p. 167

Register Releasing
Need to know the previous renaming.
– Maintain a “previous” vector for each architectural register.
– When renaming R1 to R35 in i4, record that the previous renaming of R1 was R32.
– When i4 commits, R32 is released and R35 becomes the previous renaming of R1.
i1: R32 ← R2/R3 # division takes a long time
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9
Baer p. 167
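A minimal sketch of this previous-vector scheme (register counts and data structures are illustrative; a real renamer also maps source operands and handles flushes and exceptions):

```python
# Minimal renaming sketch with a "previous mapping" vector: when an
# instruction that renamed architectural register r commits, the
# physical register r previously mapped to is released to the free list.

from collections import deque

class Renamer:
    def __init__(self, num_arch=32, num_phys=40):
        # Architectural R0..R31 initially map to physical P0..P31.
        self.map = {f'R{i}': f'P{i}' for i in range(num_arch)}
        self.free = deque(f'P{i}' for i in range(num_arch, num_phys))

    def rename_dest(self, arch):
        new = self.free.popleft()
        prev = self.map[arch]      # remember the previous mapping
        self.map[arch] = new
        return new, prev           # 'prev' travels with the instruction

    def commit(self, prev):
        # Safe: every user of 'prev' committed before this instruction.
        self.free.append(prev)

r = Renamer()
p_i1, _ = r.rename_dest('R1')      # i1: R1 renamed to P32
p_i4, prev_i4 = r.rename_dest('R1')  # i4: R1 renamed again
assert prev_i4 == p_i1             # i4 carries i1's mapping as "previous"
r.commit(prev_i4)                  # i4 commits: i1's register is released
assert p_i1 in r.free
```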

Extended Register File
(Architectural Register File + Physical Extension to the Register File)
Only the physical (extension) registers can be in the free state.
At commit, the result must be stored into the architectural register mapped to the physical register. Either:
– associatively search the mapping, or
– give each ROB entry a field with the name of the destination architectural register.
Baer p. 168

Repair Mechanism - Misprediction
How to repair the register renaming after a branch misprediction?
– ROB: discarding the ROB entries for mis-speculated instructions invalidates all of their mappings from architectural to physical registers.
– Monolithic: make a copy of the mapping table at each branch prediction; save the copies in a circular queue; on a misprediction, use the saved copy to restore the mapping.
Baer p. 169
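The monolithic checkpoint scheme can be sketched as follows (a plain list stands in for the circular queue, and the map contents are illustrative):

```python
# Sketch of map-table checkpointing for a monolithic register file:
# snapshot the rename map at each branch prediction; on a
# misprediction, discard younger checkpoints and restore the snapshot.

class CheckpointedMap:
    def __init__(self):
        self.map = {'R1': 'P1', 'R2': 'P2'}  # illustrative mappings
        self.checkpoints = []                # stands in for the circular queue

    def on_branch_predicted(self, branch_id):
        self.checkpoints.append((branch_id, dict(self.map)))  # snapshot

    def on_mispredict(self, branch_id):
        # Pop checkpoints younger than the branch, then restore its map.
        while self.checkpoints:
            bid, snapshot = self.checkpoints.pop()
            if bid == branch_id:
                self.map = snapshot
                return
        raise KeyError(branch_id)

m = CheckpointedMap()
m.on_branch_predicted('b1')
m.map['R1'] = 'P7'          # speculative rename past the branch
m.on_mispredict('b1')
assert m.map['R1'] == 'P1'  # mapping restored
```

The snapshot copy is what makes repair fast here; the cost, as the next slide notes, is that no snapshot exists for arbitrary instructions such as the one raising an exception.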

Repair Mechanism - Exceptions
Monolithic
– Mappings are not saved at every instruction, so there may be no saved map that corresponds exactly to the excepting instruction.
– Have to undo the mappings from the last renamed instruction back to the one that caused the exception.
– Undoing the map is costly, but it only occurs when an exception is actually being handled.
Baer p. 169

Comparison: ROB-based
Space wasting
– ROB renaming fields go unused by instructions that do not write registers.
Time wasting
– Two cycles to store a result into the register file (retiring and writing the result can be pipelined).
Power and space wasting
– The ROB needs too many read and write ports.
On the other hand: easy repair, no map saving, compelling simplicity.
Baer p. 169

Examples of the Three Approaches
– ROB-based: Intel P6 (40-μop ROB in Pentium III and Pentium M; more than 80-μop ROB in Intel Core)
– Monolithic: MIPS R10000, Alpha
– Extended: IBM PowerPC
Baer p. 170