SoC CAD 1 Simultaneous Continual Flow Pipeline Architecture 徐 子 傑 Hsu,Zi Jei Department of Electrical Engineering National Cheng Kung University Tainan,


SoC CAD
Simultaneous Continual Flow Pipeline Architecture
徐子傑 (Hsu, Zi Jei)
Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.

NCKU SoC & ASIC Lab / Hsu, Zi Jei / SoC CAD

INTRODUCTION (1/2)
• Since the introduction of the first industrial out-of-order superscalar processors in the 1990s, instruction buffer and cache sizes have kept increasing with every new generation of out-of-order cores.
• The motivation behind this continuous evolution towards larger buffers and caches has been the performance of single-thread applications.
• Achieving performance this way has come at the expense of area, power, and complexity.
• We show that this is not the most energy-efficient way to achieve performance.
• Instead, sizing the instruction buffers to the minimum necessary for the common case of L1 data cache hits, and using a new latency-tolerant microarchitecture to handle loads that miss the L1 data cache, improves both execution time and energy consumption.

INTRODUCTION (2/2)
• Paper contributions:
  • We present a new reduced-buffer out-of-order core architecture, the Simultaneous Continual Flow Pipeline (S-CFP), that is non-blocking in case of L1 data cache misses.
  • The architecture introduces novel out-of-order execution algorithms that extend Continual Flow Pipelines [25] so that instructions independent of an L1 data cache miss and instructions dependent on it execute simultaneously in the core pipeline.
  • To support the S-CFP core architecture, we present novel algorithms for fast integration of register results and for load-store memory ordering while non-contiguous miss-dependent and miss-independent instructions execute simultaneously in the core pipeline.
  • We show that S-CFP improves execution time and energy consumption on SPEC CPU2000 benchmarks by 10% and 12% respectively, compared to a large superscalar baseline.

Microarchitecture Overview (1/5)
• Figure 1 shows a block diagram of the S-CFP core.
• Unlike previous latency-tolerant out-of-order architectures, the S-CFP core executes cache-miss-dependent and independent instructions concurrently using two different hardware thread contexts.
• In order to support two hardware threads:
  • S-CFP has two register alias tables (RAT) for renaming the independent and the dependent thread instructions.
  • S-CFP also has two retirement register file (RRF) contexts, one for retiring independent instruction results and the other for retiring dependent instruction results.

Microarchitecture Overview (2/5)
Figure 1. S-CFP microarchitecture block diagram

Microarchitecture Overview (3/5)
• In S-CFP, execution initially starts using a hardware thread that we call the independent thread.
• When an L1 data cache load miss occurs, a poison bit is set in the destination register of the load.
• Poison bits propagate through instruction dependences and identify all instructions that depend on the load miss.
• The load miss and its dependents, identified by the poison bits in the reorder buffer (ROB), pseudo-retire in program order and move from the ROB into a dependent slice and data buffer (SDB) outside the pipeline.
• Therefore, miss-dependent instructions do not occupy precious pipeline resources such as reservation stations or ROB entries while waiting for the load miss data.
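The poison-bit mechanism on this slide can be illustrated with a minimal Python sketch. It is a simplified model, not the hardware design: registers are plain names, `Uop` and `split_at_pseudo_retire` are hypothetical names introduced here, and the sketch assumes that the newest writer of a register decides whether that register is poisoned.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Uop:
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()
    l1_miss: bool = False  # True for a load that misses the L1 data cache


def split_at_pseudo_retire(program):
    """Walk uops in program order and return (completed, sdb) name lists."""
    poisoned_regs = set()  # registers whose latest writer depends on a miss
    completed, sdb = [], []
    for uop in program:
        # A uop is poisoned if it is the load miss or reads a poisoned register.
        is_poisoned = uop.l1_miss or any(s in poisoned_regs for s in uop.srcs)
        if uop.dest is not None:
            # Assumption: the newest writer decides the register's poison state.
            if is_poisoned:
                poisoned_regs.add(uop.dest)
            else:
                poisoned_regs.discard(uop.dest)
        (sdb if is_poisoned else completed).append(uop.name)
    return completed, sdb


program = [
    Uop("ld r1", "r1", ("r9",), l1_miss=True),  # load miss
    Uop("add r2", "r2", ("r1", "r3")),          # depends on the miss
    Uop("mul r4", "r4", ("r5", "r6")),          # independent
    Uop("sub r7", "r7", ("r2", "r4")),          # depends transitively
]
completed, sdb = split_at_pseudo_retire(program)
print("completed:", completed)
print("moved to SDB:", sdb)
```

In this example only `mul r4` completes in the independent thread; the load and its two dependents leave the pipeline for the SDB in program order.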

Microarchitecture Overview (4/5)
• Since poisoned instructions are reordered using the ROB before they are written into the SDB, the complexity of physical-to-physical register renaming, deadlock avoidance hardware, and the rename filter required in some previous latency-tolerant architectures [25] is eliminated in S-CFP.
• Since the SDB needs to store any completed non-poisoned source registers with its instructions, S-CFP uses the ROB data array to propagate completed source operands with the poisoned instructions to the SDB.
• A poisoned instruction has at most one completed source, since S-CFP uses RISC-like uops with at most two source registers.

Microarchitecture Overview (5/5)
• When the miss data is fetched into the L1 data cache, the dependent instructions wake up and issue again from the SDB into the pipeline using a hardware thread that we call the dependent thread.
• The dependent thread executes simultaneously with the independent thread until the SDB is drained.
• Since the dependent and independent threads execute simultaneously using different register contexts, it is not necessary to flush the pipeline.
• When all dependent instructions have re-executed and the SDB is drained, the execution results of the dependent and independent threads are integrated with a single-cycle flash copy within the retirement register file (RRF).
• The independent thread continues execution by itself without any interruption or pipeline flush.

Independent Thread Execution and Dependent Thread Construction (1/2)
• The independent hardware thread is the main execution thread in S-CFP.
• It is responsible for instruction fetch and decode, branch prediction, and memory dependence prediction.
• After a cache miss, it also propagates poison bits to identify miss-dependent instructions and remove them from the pipeline.
• The independent thread executes instructions that are independent of L1 data cache misses, and pseudo-retires all instructions, miss-independent as well as miss-dependent.
• The retirement process copies the poison bit of each retired instruction into the retirement register file (RRF).
• Poison bits in the RRF are not sticky. At any time, the poison bit of a register entry in the RRF can be either true or false, depending on whether the last retired writer of this logical register was dependent on or independent of the load miss.

Independent Thread Execution and Dependent Thread Construction (2/2)
• When the SDB is empty and a load miss retires and enters the SDB, the independent thread retirement register file (RRF) contains the precise state of the execution up to the load miss.
• S-CFP saves a checkpoint of this RRF for recovery in case of a subsequent miss-dependent branch misprediction or exception [1].
• In case of an independent mispredicted branch, instructions before the branch pseudo-retire, and the dependents among these instructions enter the SDB.
• Instructions after the branch are flushed before they pseudo-retire, so no bogus mispredicted instructions ever enter the SDB.

Dependent Thread Execution (1/2)
• Dependent thread execution starts when the load miss data fetch completes and the load is woken from the SDB. It continues until the SDB is empty.
• Dependent loads and stores carry unique sequence IDs that were assigned to them when they were originally fetched by the independent thread.
• These sequence IDs allow the load-store ordering hardware to identify the order of dependent and independent loads and stores within the program.
• The SDB can contain at any time multiple load misses and their dependents, stored in program order.
• Load miss wakeup and reissue from the SDB is done in program order, even though load miss wakeups can arrive at the SDB out of order.
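A rough Python sketch of the in-order reissue rule follows. The class and method names (`SliceDataBuffer`, `push`, `wakeup`, `drain_ready`) are hypothetical, and the policy of stopping at the first load miss whose data has not yet arrived is an assumption made for illustration; the slides only state that reissue happens in program order.

```python
from collections import deque


class SliceDataBuffer:
    """Toy SDB: entries are kept and reissued strictly in program order."""

    def __init__(self):
        self._entries = deque()   # (seq_id, is_load_miss), program order
        self._data_ready = set()  # seq IDs of load misses whose data arrived

    def push(self, seq_id, is_load_miss=False):
        self._entries.append((seq_id, is_load_miss))

    def wakeup(self, seq_id):
        # Wakeups may arrive in any order; they are only recorded here.
        self._data_ready.add(seq_id)

    def drain_ready(self):
        """Reissue from the head until a load miss still waiting for data."""
        issued = []
        while self._entries:
            seq_id, is_miss = self._entries[0]
            if is_miss and seq_id not in self._data_ready:
                break  # assumed policy: younger entries wait behind this miss
            self._entries.popleft()
            issued.append(seq_id)
        return issued


sdb = SliceDataBuffer()
sdb.push(10, is_load_miss=True)
sdb.push(11)
sdb.push(20, is_load_miss=True)
sdb.push(21)

sdb.wakeup(20)              # the younger miss returns first
first = sdb.drain_ready()   # nothing can issue yet
sdb.wakeup(10)              # the older miss returns
second = sdb.drain_ready()  # everything issues in program order
print(first, second)
```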

Dependent Thread Execution (2/2)
• The dependent thread execution uses a separate simultaneous multithreading (SMT) hardware thread, with its own register alias table (DEP RAT) and retirement register file (RRF) context.
• It executes simultaneously with the independent thread, with which it shares execution resources such as reservation stations, functional units, and data cache read and write ports.
• If a load miss occurs in the dependent thread during its execution, the load stalls until the data is fetched into the L1 data cache.
• In other words, dependent instructions do not pseudo-retire or enter the SDB more than once.
• If a mispredicted branch or an exception occurs during dependent thread execution, S-CFP flushes the SDB and the pipeline, and rolls back execution to the checkpoint.

Checkpoints and Results Integration (1/3)
Figure 2. RRF cell with checkpoint and result integration support

Checkpoints and Results Integration (2/3)
• Figure 2 shows the S-CFP retirement register file cell with checkpoint flash copy support.
• We use a flash copy of the RRF for creating checkpoints.
• In one cycle, every independent thread RRF bit (leftmost latch) is shifted into a checkpoint latch within the register cell (center latch).
• The register file can be restored from the checkpoint in one cycle by asserting RSTR_CLK.
• At the end of dependent execution, when all instructions in the SDB have re-issued and retired, the RRF has all the live-out registers, some computed by the independent thread and some computed by the dependent thread, as determined by the poison bits in the RRF.

Checkpoints and Results Integration (3/3)
• To integrate these results back into one context, a restore cycle is performed from the dependent thread context into the independent thread context. However, not all registers are copied.
• Figure 2 shows that only the poisoned registers are copied, by using the poison bits to enable the clock of the copy operation.
• A 2-to-1 multiplexer in the cell restores either the checkpoint bit or the dependent bit during an RSTR_CLK cycle.
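The checkpoint and integration behavior described on these slides can be sketched at register granularity in Python. This is a behavioral illustration only: the class `RetirementRegisterFile` and its method names are hypothetical, whole register values stand in for the per-bit latches of Figure 2, and clearing the poison bits on restore and after integration is an assumption.

```python
class RetirementRegisterFile:
    """Behavioral model of an RRF with checkpoint and dependent contexts."""

    def __init__(self, reg_names):
        self.ind = {r: 0 for r in reg_names}  # independent-thread context
        self.dep = {r: 0 for r in reg_names}  # dependent-thread context
        self.chk = {r: 0 for r in reg_names}  # checkpoint latches
        self.poison = {r: False for r in reg_names}

    def checkpoint(self):
        # Flash copy: independent context -> checkpoint latches.
        self.chk = dict(self.ind)

    def restore_checkpoint(self):
        # Rollback after a dependent misprediction or exception.
        self.ind = dict(self.chk)
        self.poison = {r: False for r in self.ind}  # assumption

    def retire_independent(self, reg, value):
        self.ind[reg] = value
        self.poison[reg] = False  # poison bits are not sticky

    def pseudo_retire_poisoned(self, reg):
        self.poison[reg] = True   # value will come from the dependent thread

    def retire_dependent(self, reg, value):
        self.dep[reg] = value

    def integrate(self):
        # Copy only registers whose last independent-thread writer was poisoned.
        for reg, poisoned in self.poison.items():
            if poisoned:
                self.ind[reg] = self.dep[reg]
                self.poison[reg] = False  # assumption


rrf = RetirementRegisterFile(["r1", "r2", "r3"])
rrf.retire_independent("r2", 7)
rrf.checkpoint()                  # load miss enters an empty SDB
rrf.pseudo_retire_poisoned("r1")  # miss-dependent writer of r1
rrf.pseudo_retire_poisoned("r3")  # miss-dependent writer of r3 ...
rrf.retire_independent("r3", 5)   # ... later overwritten independently
rrf.retire_dependent("r1", 42)    # dependent thread re-executes
rrf.retire_dependent("r3", 13)
rrf.integrate()
print(rrf.ind)
```

After integration `r1` takes the dependent thread's value, while `r3` keeps the younger independent value because its poison bit was cleared by the later writer.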

Memory Ordering and Load, Store Execution (1/7)
• Memory ordering:
  • To maintain proper memory ordering of loads and stores from independent and dependent thread execution, we use reduced-size load and store queues (LSQ), a Store Redo Log (SRL) [14], and a store-set memory dependence predictor [8].
  • All stores, dependent and independent, are allocated entries (and IDs) in the SRL in program order by the independent thread.
  • Every load, dependent or independent, carries the SRL ID of the last prior store.
  • The SRL IDs assigned to loads and stores are unique and determine the order of memory instructions.
  • The independent thread performs load memory dependence prediction, with a store-set predictor [8], to determine the store on which a load may depend.
  • The predictor uses SRL IDs in the prediction and writes store poison bits in the SRL, thus allowing propagation of poison bits from stores to predicted-dependent loads.
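As an illustration of the SRL ID scheme, the following Python sketch assigns IDs in program order. The function name `assign_srl_ids` and the tuple encoding of memory operations are hypothetical, and using `None` for a load with no prior store is an assumption of this sketch.

```python
def assign_srl_ids(mem_ops):
    """Assign SRL IDs to memory operations given in program order.

    mem_ops is a list of (kind, name) pairs with kind "ld" or "st".
    A store gets the next SRL entry; a load carries the ID of the
    last prior store (None if no store precedes it).
    """
    next_id = 0
    last_store_id = None
    ids = {}
    for kind, name in mem_ops:
        if kind == "st":
            last_store_id = next_id  # allocate a new SRL entry
            next_id += 1
        elif kind != "ld":
            raise ValueError(f"unknown memory op kind: {kind!r}")
        ids[name] = last_store_id
    return ids


ops = [("ld", "L0"), ("st", "S0"), ("ld", "L1"),
       ("st", "S1"), ("ld", "L2"), ("ld", "L3")]
srl_ids = assign_srl_ids(ops)
print(srl_ids)
```

With these IDs, a load tagged with SRL ID k is ordered after store k and before store k+1, regardless of which thread eventually executes it.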

Memory Ordering and Load, Store Execution (2/7)
• L1 data cache state:
  • In order to support simultaneous execution of dependent and independent loads and stores, a data cache block has two new states: Speculative Independent (Spec_Ind) and Speculative Dependent (Spec_Dep).
  • A block that is not in one of these two states is committed and is in one of the states defined by the cache coherence protocol, e.g. Shared, Exclusive, or Modified in a MESI coherence protocol.

Memory Ordering and Load, Store Execution (3/7)
• Independent thread store execution:
  • Independent stores from the independent thread are written speculatively into the first-level cache after they pseudo-retire, setting the Spec_Ind bit of the written block.
  • If an independent store address matches the address of a Spec_Dep block, it is handled as a cache miss and another cache block in the set is allocated to the independent store.
  • Independent stores are also written into the SRL buffer at the same time they are written into the L1 data cache, as shown in Figure 1.

Memory Ordering and Load, Store Execution (4/7)
• Dependent thread store execution:
  • When dependent stores from the dependent thread execute, they are written into the SRL, but not into the LSQ or data cache.
  • After the dependent stores retire, the dependent thread writes all stores, dependent and independent, from the SRL into the data cache in program order, setting the written cache block to Spec_Dep.
  • Note that independent stores are written twice into the data cache:
    1) speculatively by the independent thread, to forward data to independent loads, and
    2) by the dependent thread, interleaved with dependent stores in program order, to enforce a final, correct order of memory writes.
  • After the dependent thread executes and the SDB and SRL become empty, Spec_Ind blocks in the cache are bulk flushed and Spec_Dep blocks are bulk committed, leaving only ordered store data in the cache.
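The bulk flush and bulk commit at the end of dependent execution can be sketched in Python as a filter over cache blocks. The function name `finish_dependent_execution` and the `(address, state, data)` tuple layout are hypothetical; a real cache would flip state bits in place rather than rebuild a list.

```python
SPEC_IND, SPEC_DEP, COMMITTED = "Spec_Ind", "Spec_Dep", "Committed"


def finish_dependent_execution(blocks):
    """Return the blocks that survive once the SDB and SRL are empty.

    blocks is a list of (address, state, data) tuples.
    Spec_Ind blocks are flushed, Spec_Dep blocks become committed,
    and already committed blocks are left unchanged.
    """
    survivors = []
    for address, state, data in blocks:
        if state == SPEC_IND:
            continue  # bulk flush of speculative independent data
        survivors.append((address, COMMITTED, data))
    return survivors


blocks = [
    (0x100, SPEC_IND, "A-early"),  # written first by the independent thread
    (0x100, SPEC_DEP, "A-final"),  # rewritten in order from the SRL
    (0x140, COMMITTED, "B"),
    (0x180, SPEC_DEP, "C-final"),  # dependent store
]
after = finish_dependent_execution(blocks)
print(after)
```

Because every independent store is replayed from the SRL as a Spec_Dep write, dropping the Spec_Ind copies loses no data in this model.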

Memory Ordering and Load, Store Execution (5/7)
• Independent thread load execution:
  • Independent loads read Spec_Ind or committed blocks in the L1 cache.
  • Independent stores therefore forward data to their descendant independent loads through the data cache, long before they are actually committed.
  • An independent load that hits a Spec_Dep block in the cache cannot be safely disambiguated. It is consequently poisoned and changed to a dependent load, to be re-issued later from the SDB.

Memory Ordering and Load, Store Execution (6/7)
• Dependent thread load execution:
  • Dependent loads are re-issued from the SDB and execute from the data cache after all stores ahead of them in the SRL are written to the cache.
  • Synchronizing the dependent load execution with the stores ahead of them in the SRL is performed using the SRL ID assigned to every load.
  • The load SRL ID identifies the entry in the SRL that was assigned to the last prior store.
  • A dependent load can only read data from a committed block or a Spec_Dep block. If the address of a dependent load matches the address of a Spec_Ind block, it is treated as a miss.
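The load rules for the two threads can be summarized as a small decision function. The Python sketch below is an illustration of the rules stated on these slides; the names `State`, `Outcome`, and `load_outcome` are hypothetical, and the three block states collapse all coherence states into a single `COMMITTED` value.

```python
from enum import Enum


class State(Enum):
    COMMITTED = "committed"  # any regular coherence state (e.g. MESI)
    SPEC_IND = "spec_ind"
    SPEC_DEP = "spec_dep"


class Outcome(Enum):
    HIT = "hit"
    MISS = "miss"
    POISON = "poison"  # independent load becomes a dependent load


def load_outcome(thread, block_state):
    """Classify a load by issuing thread ("ind" or "dep") and block state.

    block_state is the State of the address-matching block, or None
    when no block matches.
    """
    if thread not in ("ind", "dep"):
        raise ValueError(f"unknown thread: {thread!r}")
    if block_state is None:
        return Outcome.MISS
    if thread == "ind":
        # Spec_Dep data cannot be safely disambiguated by independent loads.
        if block_state is State.SPEC_DEP:
            return Outcome.POISON
        return Outcome.HIT
    # Dependent loads must not see speculative independent data.
    if block_state is State.SPEC_IND:
        return Outcome.MISS
    return Outcome.HIT


for thread in ("ind", "dep"):
    for state in State:
        print(thread, state.name, load_outcome(thread, state).name)
```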

Memory Ordering and Load, Store Execution (7/7)
• Resource sharing during simultaneous execution:
  • The strategy we use for resource sharing is the following.
  • When the SDB has instructions to issue, we use a round-robin policy to schedule rename cycles between the dependent thread and the independent thread. This is one typical policy used in simultaneous multithreading architectures.
  • However, in S-CFP we do not partition the reorder buffer between the two threads, and simply allow dependent and independent instructions to be interleaved in the reorder buffer.
  • This complicates the retirement stage to some degree.
  • However, we believe that this policy is implementable, since dependent thread branch mispredictions and exceptions are not taken at instruction granularity.
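A minimal Python sketch of round-robin rename scheduling follows. The function name `rename_owner_per_cycle`, the unit of one SDB "group" per rename cycle, and the choice to give the independent thread the first cycle are all assumptions made for illustration.

```python
def rename_owner_per_cycle(sdb_groups, cycles):
    """Return which thread owns the rename stage in each cycle.

    sdb_groups is the number of rename cycles' worth of instructions
    waiting in the SDB (a hypothetical unit). While the SDB has work,
    rename cycles alternate between the two threads; otherwise the
    independent thread renames every cycle.
    """
    owners = []
    last = "dep"  # assumption: the independent thread goes first
    for _ in range(cycles):
        if sdb_groups > 0 and last == "ind":
            owner = "dep"
            sdb_groups -= 1
        else:
            owner = "ind"
        owners.append(owner)
        last = owner
    return owners


schedule = rename_owner_per_cycle(sdb_groups=2, cycles=6)
print(schedule)
```

Once the SDB drains, the independent thread regains every rename cycle, which matches the statement that it then continues execution by itself.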

SIMULATION METHODOLOGY (1/4)
TABLE I. SIMULATED MACHINE CONFIGURATIONS

SIMULATION METHODOLOGY (2/4)
TABLE II. S-CFP EXECUTION STATISTICS

SIMULATION METHODOLOGY (3/4)
• TABLE I shows the baseline machine configuration.
• TABLE II shows various other S-CFP execution statistics.
• Of particular interest are the statistics for eon and gap, which underperform the baseline by 6% and 7% respectively.
• Both benchmarks have a low cache miss rate, which reduces the potential performance benefit of S-CFP, and many dependent mispredicted branches, which cause frequent and costly pipeline flushes and restarts from checkpoints.

SIMULATION METHODOLOGY (4/4)
• TABLE III shows the relative average energy consumption of various functional blocks, as well as of the total baseline and S-CFP cores, from our SPEC CPU2000 benchmark simulation traces.
• All values in TABLE III are normalized relative to the total energy of the baseline core, including L1 data and instruction caches.
TABLE III. ENERGY CONSUMPTION RELATIVE TO BASELINE CORE

CONCLUSION
• With S-CFP, we have succeeded in downsizing the L1 data cache and instruction buffers in an out-of-order superscalar core to reduce power, while getting better single-thread performance.
• We have achieved this by combining SMT hardware mechanisms with an innovative core design that is simpler, with less overhead, than previous out-of-order CFP cores.
• The S-CFP core provides latency tolerance not only to cache misses that go to DRAM, but also to L1 data cache misses that hit the on-chip cache.
• S-CFP is a promising, high-performance, energy-efficient design option for future multicore processors.