Spring 2019 Prof. Eric Rotenberg

ECE 721: Trace Processors
Spring 2019, Prof. Eric Rotenberg

Trace Processor
[Figure: block diagram. Front end: Trace Predictor, Branch Predictor, I-cache, Trace Cache, Pre-rename, Global Rename. Back end: four Processing Elements (PEs) shown, each with a GRF (copy), an LRF, and Function Units.]

Trace Processor
- Distribute the Issue Queue and Execution Lanes among multiple PEs
  - Simplifies select logic (only 16-32 instructions are considered for issue to a small number of lanes)
- Exploit the value hierarchy in traces to simplify:
  - Register file
  - Register rename logic
  - Bypasses and wakeup ports

Value Hierarchy
- Local values: values produced and consumed within a trace
- Global values: values communicated among traces

Live-ins, Live-outs
- Live-in value: global value produced by a previous trace and consumed by this trace
- Live-out value: global value produced by this trace and (possibly) consumed by later traces

Local Only
- Purely local value: a value produced in this trace, consumed in this trace, and then killed in this trace
[Figure: between trace begin and trace end, r5 is produced, consumed, and then killed (kill r5). It is a purely local value, not consumed by later traces.]

Local and Live-out
- Local & live-out: a value produced in this trace, consumed in this trace, but not killed in this trace
[Figure: between trace begin and trace end, r5 is produced and consumed but not killed. It is a local value that may be consumed by later traces.]

Hierarchical Register File
- Single global register file (GRF)
  - Holds all global values
- Multiple local register files (LRFs)
  - One per PE
  - Each holds the local values of the trace allocated to the PE

Reducing Register File Complexity
- GRF is less complex than a monolithic register file
  - Purely local values are not held in the GRF, which reduces the number of registers in the GRF
  - LRFs off-load much of the read and write traffic, so the GRF can have fewer read and write ports

Pre-renaming Traces
- Pre-renaming
  - Check dependencies among instructions in a trace when it is first built
  - Pre-rename local values to registers in the LRF

Reducing Renaming Complexity
- Rename Stage is now the Global Rename Stage
  - Only rename global values (live-ins & live-outs) to the GRF
- Two key simplifications to the Global Rename Stage
  1. No need to check for dependencies within the trace and bypass free list mappings from producers to consumers within the trace. These are local values, which were analyzed and pre-renamed to the LRF when the trace was constructed and filled into the Trace Cache.
  2. Global RMT has fewer read ports (live-ins only) and write ports (live-outs only)

Example: Pre-renaming the Trace

Original trace        Pre-renamed trace (stored in Trace Cache)
add r3, r1, r2        add {--,L0}, r1, r2
add r3, r1, r3        add {--,L1}, r1, L0
add r5, r3, r5        add {--,L2}, L1, r5
add r3, r5, #1        add {--,L3}, L2, #1
add r5, r6, r2        add {r5,L4}, r6, r2
add r3, r5, r3        add {r3,--}, L4, L3

Key for pre-renamed logical destination registers:
  {--,Ly}: local only
  {rx,--}: live-out only
  {rx,Ly}: live-out and local

Global live-in (GLI) registers: r1, r2, r5, r6
Global live-out (GLO) registers: r3, r5
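The pre-renaming example above can be reproduced with a short sketch. This is only an illustration, not the course's implementation: the tuple encoding of instructions, the function name pre_rename, and the policy of handing out L0, L1, ... in program order are assumptions chosen to match the slide.

```python
from typing import Dict, List, Optional, Tuple

# Assumed encoding: (opcode, dest, [sources]); "#n" marks an immediate.
Instr = Tuple[str, str, List[str]]


def pre_rename(trace: List[Instr]):
    """Return (pre-renamed trace, live-in registers, live-out registers)."""
    producer: Dict[str, int] = {}          # logical reg -> latest producer index
    src_producers: List[List[Optional[int]]] = []
    consumed = [False] * len(trace)        # value read by a later instruction in the trace?
    live_ins: List[str] = []

    for i, (_op, dest, srcs) in enumerate(trace):
        row = []
        for s in srcs:
            p = producer.get(s)            # None for immediates and live-ins
            if p is not None:
                consumed[p] = True
            elif not s.startswith("#") and s not in live_ins:
                live_ins.append(s)
            row.append(p)
        src_producers.append(row)
        producer[dest] = i

    # After the scan, producer maps each written register to its LAST writer,
    # and those last writers are the instructions producing live-outs.
    live_out_idx = set(producer.values())

    # Assumption: local registers are allocated in program order.
    local_of: Dict[int, str] = {}
    for i in range(len(trace)):
        if consumed[i]:
            local_of[i] = f"L{len(local_of)}"

    renamed = []
    for i, (op, dest, srcs) in enumerate(trace):
        global_dest = dest if i in live_out_idx else "--"
        local_dest = local_of.get(i, "--")
        new_srcs = [s if p is None else local_of[p]
                    for s, p in zip(srcs, src_producers[i])]
        renamed.append((op, (global_dest, local_dest), new_srcs))
    return renamed, live_ins, sorted(producer)


trace = [
    ("add", "r3", ["r1", "r2"]),
    ("add", "r3", ["r1", "r3"]),
    ("add", "r5", ["r3", "r5"]),
    ("add", "r3", ["r5", "#1"]),
    ("add", "r5", ["r6", "r2"]),
    ("add", "r3", ["r5", "r3"]),
]
renamed, live_ins, live_outs = pre_rename(trace)
for op, (g, l), srcs in renamed:
    print(f"{op} {{{g},{l}}}, {', '.join(srcs)}")
```

Run on the six-instruction trace from the slide, the sketch yields the same destination tags ({--,L0} through {r3,--}) and the same live-in and live-out sets as the example.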

Example (cont.)

Global Rename Map Table, before and after renaming the trace:

Logical reg   Before        After   Role
r1            p29           p29     GLI
r2            p31           p31     GLI
r3            (not shown)   p9      GLO
r4
r5            p17           p11     GLI, GLO
r6            p24           p24     GLI
…
r31

Global Free List: p9, p11, …

Renamed live-ins: GLI(r1) = p29, GLI(r2) = p31, GLI(r5) = p17, GLI(r6) = p24
Renamed live-outs: GLO(r3) = p9, GLO(r5) = p11

Bypass Complexity
- Bypass complexity: forward a value from any execution lane to any other execution lane
- With many lanes:
  - Long wires (must span all the lanes)
  - Wires are heavily loaded (tapped by each lane)
  - many:1 MUX within each lane
- Two conventional options available to a monolithic superscalar processor
  - Increase cycle time to allow for slow bypasses, or
  - Producers and consumers can’t execute in consecutive cycles
- Trace processor exploits the value hierarchy for a better compromise

A Good Compromise w.r.t. Bypasses
- Local bypasses
  - Fast: producer and consumer execute in consecutive cycles
- Global bypasses
  - Slow: several cycles to bypass a value from producer to consumer
- This is a good compromise
  - Fast clock
  - Some values, but not all values, are slow to bypass

Number of Global Bypasses
- Number of global bypasses should be determined empirically and depends on live-out traffic
- Number of global bypasses affects:
  - Number of GRF write ports
  - Number of additional wakeup ports in a PE’s issue queue
- Should be favorable compared to a monolithic superscalar processor
  - A PE’s issue queue has fewer wakeup ports than the issue queue of a monolithic superscalar
  - # wakeup ports = # execution lanes in PE + # global bypasses
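A small worked instance of the wakeup-port relation may help. The configuration below (4 PEs, 4 lanes per PE, 4 global bypasses) is hypothetical, and the monolithic comparison assumes one wakeup port per execution lane, which the slides imply but do not state.

```python
def pe_wakeup_ports(lanes_per_pe: int, global_bypasses: int) -> int:
    """Wakeup ports in one PE's issue queue, per the relation on the slide."""
    return lanes_per_pe + global_bypasses


def monolithic_wakeup_ports(total_lanes: int) -> int:
    """Assumption: a monolithic issue queue needs one wakeup port per lane."""
    return total_lanes


# Hypothetical configuration: 4 PEs x 4 lanes versus one 16-lane monolithic core.
NUM_PES, LANES_PER_PE, GLOBAL_BYPASSES = 4, 4, 4
pe_ports = pe_wakeup_ports(LANES_PER_PE, GLOBAL_BYPASSES)
mono_ports = monolithic_wakeup_ports(NUM_PES * LANES_PER_PE)
print(pe_ports, mono_ports)  # prints "8 16"
```

With these assumed numbers each PE's issue queue needs half as many wakeup ports as the monolithic queue. The advantage shrinks as the number of global bypasses grows, which is why the slide says to size it empirically.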

Processing Element (PE)
[Figure: block diagram. Front end: Trace Predictor, Branch Predictor, I-cache, Trace Cache, Pre-rename, Global Rename. Back end: Processing Elements, each with a Local Register File and Function Units, sharing a single Global Register File.]

Processing Element (PE)
[Figure: same block diagram, with the single Global Register File replaced by a GRF (copy) in each PE alongside its LRF and Function Units.]

Trace-Level Sequencing
1. Trace prediction
2. Trace fetch
3. Trace rename
4. Trace dispatch
5. Trace completion
6. Trace retirement

Trace Prediction
- Trace predictor
  - Predicts the next trace
  - Produces one trace id per cycle
- Trace id uniquely identifies a trace
  - Start PC
  - Bit vector indicating directions (taken/not-taken) of embedded branches
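The trace id described above can be sketched as a small immutable record. The class name TraceId, its field names, and the use of a Python dict as a stand-in for the trace cache are illustrative assumptions, not part of the lecture.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TraceId:
    """Hypothetical trace id: start PC plus directions of embedded branches."""
    start_pc: int
    branch_dirs: Tuple[bool, ...]  # True = taken, listed in program order


# Two traces that start at the same PC but follow different paths
# have different ids, so they occupy different trace cache entries.
a = TraceId(0x400, (True, False))
b = TraceId(0x400, (True, True))

# Stand-in for the trace cache: a dict keyed by trace id.
trace_cache = {a: "trace A", b: "trace B"}
print(trace_cache[TraceId(0x400, (True, False))])  # prints "trace A"
```

Because the record is frozen, it is hashable and compares by value, which is what lets a freshly predicted id find a previously stored trace.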

Trace Predictor
[Figure: a history of trace ids (Tid0, Tid1, ..., Tidn) passes through hash functions to produce the predicted trace id, which is sent to the T$.]

Trace Fetch
- Look up the predicted trace id in the T$
- T$ hit
  - Send pre-renamed trace to Trace Rename Stage
- T$ miss
  - Use predicted trace id to fetch basic blocks from the I$
  - Trace build takes multiple cycles
  - After the trace is built, pre-rename it

Trace Cache Miss
- Can’t send instructions directly from the instruction cache
- Must package instructions into a trace, pre-rename the trace, and send the trace down the pipeline as a single unit
- Why? No dependence checking logic in the rename stage.

Trace Cache Miss: Step 1 and Step 2
[Figure: the Trace Predictor sends a Tid to the Trace Cache, which misses. The flow of instructions into the rename stage stalls while the Tid is used with the I-cache and BTB.]

Trace Cache Miss: Step 3 and Step 4
[Figure: the built trace passes through the pre-rename logic into the Trace Cache. The Trace Predictor supplies the next Tid, the Trace Cache hits, and flow resumes to the Trace Rename Stage.]

Trace Rename
- Steps
  - Rename globals to the GRF
    - Live-ins: get mappings from the Global Rename Map Table
    - Live-outs: pop free registers from the Global Free List, update the Global Rename Map Table
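The rename steps above can be sketched as follows. The function name global_rename and the dict/deque representations of the Global Rename Map Table and Global Free List are assumptions for illustration. The register numbers come from the earlier example, with the map table reduced to the entries that example shows.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def global_rename(live_ins: List[str], live_outs: List[str],
                  rmt: Dict[str, str],
                  free_list: Deque[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Rename one trace's global values; updates rmt and free_list in place."""
    if len(free_list) < len(live_outs):
        raise RuntimeError("not enough free global registers: rename must stall")
    # Read live-in mappings first, so they reflect state BEFORE this trace.
    src_map = {r: rmt[r] for r in live_ins}
    # Each live-out gets a fresh physical register from the free list.
    dst_map = {}
    for r in live_outs:
        p = free_list.popleft()
        rmt[r] = p
        dst_map[r] = p
    return src_map, dst_map


rmt = {"r1": "p29", "r2": "p31", "r5": "p17", "r6": "p24"}
free_list = deque(["p9", "p11"])
src_map, dst_map = global_rename(["r1", "r2", "r5", "r6"], ["r3", "r5"],
                                 rmt, free_list)
print(src_map["r5"], dst_map["r5"])  # prints "p17 p11"
```

Note that r5 is both a live-in and a live-out: the trace reads the old mapping (p17) and later traces see the new one (p11). No intra-trace dependence check appears anywhere, which is the simplification pre-renaming buys.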

Trace Dispatch
- Steps
  - Allocate an entry at the tail of the Trace-Level Active List
    - Entry holds current mappings of all live-outs in the trace
  - Dispatch the trace to a free PE

Trace Completion
- Steps
  - Wait for all instructions in the trace to complete
  - Set “complete” bit in the corresponding entry in the Trace-Level Active List
  - Free the PE for use by another trace

Trace Retirement
- Steps
  - Wait for the entry at the head of the Trace-Level Active List to have its “complete” bit set
  - Commit/free mappings of live-outs
    - Free previous mappings of live-outs
      - Get previous mappings from the Global Architectural Map Table
      - Add them back to the Global Free List
    - Commit current mappings of live-outs
      - Get current mappings from the Trace-Level Active List
      - Write them into the Global Arch. Map Table
  - Advance head pointer of the Trace-Level Active List
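The dispatch, completion, and retirement steps can be sketched together around one queue. The class TraceActiveList, its method names, and the dict-based entries are hypothetical. The prior architectural mapping r3 = p4 is invented for the example because the slides do not show it.

```python
from collections import deque
from typing import Deque, Dict


class TraceActiveList:
    """Sketch of a Trace-Level Active List (head at left, tail at right)."""

    def __init__(self, amt: Dict[str, str], free_list: Deque[str]) -> None:
        self.entries: Deque[dict] = deque()
        self.amt = amt              # Global Architectural Map Table
        self.free_list = free_list  # Global Free List

    def dispatch(self, live_out_map: Dict[str, str]) -> dict:
        """Allocate a tail entry holding the trace's live-out mappings."""
        entry = {"live_outs": dict(live_out_map), "complete": False}
        self.entries.append(entry)
        return entry

    def retire(self) -> bool:
        """Retire the head trace if it is complete; return True on success."""
        if not self.entries or not self.entries[0]["complete"]:
            return False
        entry = self.entries.popleft()  # advance the head pointer
        for reg, new_p in entry["live_outs"].items():
            old_p = self.amt.get(reg)
            if old_p is not None:
                self.free_list.append(old_p)  # free the previous mapping
            self.amt[reg] = new_p             # commit the current mapping
        return True


amt = {"r3": "p4", "r5": "p17"}  # r3 -> p4 is a made-up prior mapping
free_list: Deque[str] = deque()
tal = TraceActiveList(amt, free_list)

entry = tal.dispatch({"r3": "p9", "r5": "p11"})
first_try = tal.retire()    # False: trace has not completed yet
entry["complete"] = True    # trace completion sets the bit
second_try = tal.retire()   # True: mappings committed, old ones freed
print(first_try, second_try, amt, list(free_list))
```

Only live-out mappings appear in an entry, so retirement touches one map-table entry per live-out rather than one per instruction.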

Exceptions & Branch Mispredictions
- Can’t back up to the middle of a trace
  - Values before the exception/branch that were thought to be local may change to global
  - These values must make it into the global register file to be visible to later traces
- Must squash, re-construct, and re-execute the entire trace
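To see why local values can turn global, consider the six-instruction trace from the pre-renaming example and suppose, purely as a hypothetical, that its fourth instruction raises an exception. The sketch below uses an assumed (dest, sources) encoding and a helper named last_writers to find which instruction produces each live-out.

```python
from typing import Dict, List, Tuple

# Assumed encoding: (dest, [sources]) for each instruction of the example trace.
Instr = Tuple[str, List[str]]


def last_writers(trace: List[Instr]) -> Dict[str, int]:
    """Map each destination register to the index of its last writer."""
    # Later writers overwrite earlier ones, leaving the last writer per register.
    return {dest: i for i, (dest, _srcs) in enumerate(trace)}


trace: List[Instr] = [
    ("r3", ["r1", "r2"]),
    ("r3", ["r1", "r3"]),
    ("r5", ["r3", "r5"]),
    ("r3", ["r5", "#1"]),   # hypothetical: this instruction raises an exception
    ("r5", ["r6", "r2"]),
    ("r3", ["r5", "r3"]),
]

full = last_writers(trace)      # live-out producers in the original trace
cut = last_writers(trace[:3])   # trace terminated just before the exception
print(full, cut)
```

In the full trace the live-outs come from instructions 5 (r3) and 4 (r5). In the truncated trace they come from instructions 1 and 2, whose results were purely local (L1 and L2) in the original pre-renamed trace and so were never written to the GRF.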

Handling Exceptions
- Steps
  - Post exception: set “exception” bit in the Trace-Level Active List
  - Wait until the trace reaches the head of the Trace-Level Active List
  - Squash all later traces
  - Squash all instructions in the trace, even those before the instruction that caused the exception

Handling Exceptions (cont.)
- Steps (cont.)
  - Restore the Global Rename Map Table from the Global Arch. Map Table
  - Construct a modified trace
    - Same trace, except terminate it early, just before the instruction that caused the exception
  - Pre-rename this modified trace
  - Re-execute and commit the modified trace
  - Now ready to service the exception

Branch Handling
- Checkpoint the Global Rename Map Table between traces
- If all branches in a trace resolve without mispredictions:
  - Free the corresponding shadow map

Branch Handling (cont.)
- If a branch misprediction is detected within a trace:
  - Squash all later traces
  - Squash all instructions in the trace, even those before the mispredicted branch
  - Restore the Global Rename Map Table from the corresponding shadow map table
    - This restores mappings to what they were before renaming this trace
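The checkpoint and restore steps can be sketched as below. The class GlobalRenameCheckpoints, the use of increasing integer tags to order traces, and the mapping values in the example are assumptions. The sketch covers only the map table and omits free-list recovery, which the slides do not describe.

```python
from typing import Dict


class GlobalRenameCheckpoints:
    """Sketch of per-trace shadow maps for the Global Rename Map Table."""

    def __init__(self, rmt: Dict[str, str]) -> None:
        self.rmt = rmt
        self.shadow: Dict[int, Dict[str, str]] = {}  # trace tag -> saved RMT copy

    def checkpoint(self, tag: int) -> None:
        """Save the RMT as it stands just before renaming trace `tag`."""
        self.shadow[tag] = dict(self.rmt)

    def resolved_ok(self, tag: int) -> None:
        """All branches in trace `tag` resolved correctly: free its shadow map."""
        del self.shadow[tag]

    def mispredict(self, tag: int) -> None:
        """Restore the RMT to its state before trace `tag` was renamed."""
        self.rmt.clear()
        self.rmt.update(self.shadow[tag])
        # Later traces are squashed, so their checkpoints are discarded.
        # Assumption: tags increase in program order.
        for later in [t for t in self.shadow if t > tag]:
            del self.shadow[later]


rmt = {"r3": "p4", "r5": "p17"}   # hypothetical starting mappings
cp = GlobalRenameCheckpoints(rmt)

cp.checkpoint(0)                  # before renaming trace 0
rmt["r3"], rmt["r5"] = "p9", "p11"
cp.checkpoint(1)                  # before renaming trace 1
rmt["r3"] = "p12"

cp.mispredict(0)                  # a branch inside trace 0 was mispredicted
print(rmt, sorted(cp.shadow))
```

The checkpoint for the mispredicted trace is kept in this sketch, since the modified trace is renamed from the same starting state and may itself mispredict again.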

Branch Handling (cont.)
- Construct a modified trace
  - Same start PC, follow a different path after the mispredicted branch
  - Fall back to conventional branch predictor, I$, etc.
- Pre-rename this modified trace
- Re-dispatch the trace
  - Execution may or may not reveal other mispredictions in the trace...