Ryota Shioya, Masahiro Goshima, and Hideki Ando
MICRO-47
Presented by Kihyuk Sung
For Single-Thread Performance in Mobile
  – Out-of-order superscalar processors consume much more energy than in-order processors.
  – Dynamic instruction scheduling
      – Issue Queue
      – Reorder Buffer
      – Load/Store Queue
Proposal: Front-end Execution Architecture (FXA)
  – In-Order Execution Unit (IXU)
  – Out-of-Order Execution Unit (OXU)
  – The IXU and the OXU are placed in series.
The IXU functions as a filter for the OXU.
In-Order Execution Unit (IXU)
  – Checks whether each instruction is ready. Source operands are either:
      – read from the Physical Register File (PRF), or
      – bypassed from the Functional Units (FUs) in the IXU.
  – Depending on whether an instruction is ready, it is processed as follows:
      – A ready instruction is executed and is not dispatched to the Issue Queue (IQ).
      – A not-ready instruction passes through the IXU as a NOP and is then dispatched to the IQ, so the IXU does not stall.
  – The instruction is committed as in a conventional superscalar processor (Reorder Buffer).
Out-of-Order Execution Unit (OXU)
  – Executes instructions in the same way as a conventional superscalar processor.
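The filtering rule above can be sketched as a small Python model. This is only an illustration under strong assumptions, not the authors' implementation: the `Instr` type, the `ixu_filter` function, and the register names are hypothetical, every operation has single-cycle latency, and pipeline stages and FU limits are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str    # label used only for this illustration
    dst: str     # destination register
    srcs: tuple  # source registers


def ixu_filter(program, ready_regs):
    """Walk the program in order and split it into instructions executed
    in the IXU and instructions dispatched to the IQ (for the OXU)."""
    ready = set(ready_regs)
    executed_in_ixu, dispatched_to_iq = [], []
    for ins in program:
        if all(src in ready for src in ins.srcs):
            # Ready: execute in the IXU; the result becomes visible to
            # younger instructions (via bypass or the PRF).
            executed_in_ixu.append(ins.name)
            ready.add(ins.dst)
        else:
            # Not ready: flows through as a NOP and goes to the IQ.
            # Its destination is not available inside the IXU.
            dispatched_to_iq.append(ins.name)
            ready.discard(ins.dst)
    return executed_in_ixu, dispatched_to_iq


program = [
    Instr("I0", dst="r1", srcs=("r0",)),
    Instr("I1", dst="r2", srcs=("r9",)),       # r9 not ready (assumed pending)
    Instr("I2", dst="r3", srcs=("r1",)),
    Instr("I3", dst="r4", srcs=("r2", "r3")),  # needs I1's result
]
executed, dispatched = ixu_filter(program, ready_regs={"r0"})
print(executed)    # ['I0', 'I2']
print(dispatched)  # ['I1', 'I3']
```

Note that the not-ready instructions never block the younger ready ones, which is the "no stall" property of the IXU.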
The IXU cannot execute I3
  – I3 sits at the end of a long, consecutive chain of dependent instructions.
In general, dependent instructions are rarely placed in a long, consecutive chain.
  -> The IXU can execute many instructions.
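Why a long dependent chain defeats the IXU can be shown with a toy model. The sketch assumes that all instructions of a group enter the IXU in the same cycle, that every operation takes one cycle, that operands from outside the group are ready, and that FUs are unlimited. The function name `ixu_stage_for_group` and the four-instruction chain are hypothetical and not necessarily the example from the slide's figure.

```python
def ixu_stage_for_group(deps, num_stages):
    """deps[i] lists the indices of earlier instructions in the same group
    that instruction i depends on. Returns, per instruction, the 1-based IXU
    stage in which it executes, or None if it must go to the OXU."""
    stage = []
    for producers in deps:
        if any(stage[p] is None for p in producers):
            # A producer was not executed in the IXU, so this one is not ready.
            stage.append(None)
            continue
        # A consumer can execute one stage after its latest producer.
        k = 1 + max((stage[p] for p in producers), default=0)
        stage.append(k if k <= num_stages else None)
    return stage


# Chain I0 -> I1 -> I2 -> I3 in a 3-stage IXU.
chain = [[], [0], [1], [2]]
print(ixu_stage_for_group(chain, num_stages=3))  # [1, 2, 3, None]

# Two independent instructions feeding a third one.
print(ixu_stage_for_group([[], [], [0, 1]], num_stages=3))  # [1, 1, 2]
```

In this model only the fourth link of a consecutive chain runs out of stages, so shorter or interleaved dependencies stay inside the IXU.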
Branch
  – The IXU can execute branch instructions and handle mispredictions.
Floating Point
  – The IXU cannot execute FP operations.
  – Long latency -> the pipeline length is prolonged.
Load/Store
  – Use the Load/Store Queue (LSQ).
Bypassing between the IXU and the OXU
  – IXU -> OXU bypassing is not necessary, because of the order in which instructions pass through the two units.
  – OXU -> IXU bypassing is omitted.
      – The performance degradation is not significant.
Optimization of the IXU
  – The latency of the bypass network increases because of the FUs.
      – Decrease the number of FUs in the backward (later) stages: FU [3, 1, 1].
  – Partially omit operand bypassing in the IXU.
      – Omit bypassing between FUs that are more than two stages apart.
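The effect of thinning out FUs in the later stages can be explored with a small model. It is a sketch under assumptions: one group of instructions enters the IXU together, operations take one cycle, older instructions claim FUs first, and bypassing is unrestricted. The function `assign_stages` and the five-instruction group are hypothetical, and the constructed group is deliberately unfavorable to [3, 1, 1].

```python
def assign_stages(deps, fus_per_stage):
    """deps[i] lists earlier instructions in the group that i depends on.
    fus_per_stage[k-1] is the number of FUs in stage k.
    Returns the 1-based stage where each instruction executes, or None if
    it leaves the IXU unexecuted and is dispatched to the IQ."""
    free = list(fus_per_stage)  # remaining FUs per stage for this group
    stage = []
    for producers in deps:  # program order: older instructions pick first
        if any(stage[p] is None for p in producers):
            stage.append(None)
            continue
        earliest = 1 + max((stage[p] for p in producers), default=0)
        chosen = None
        for k in range(earliest, len(free) + 1):
            if free[k - 1] > 0:
                free[k - 1] -= 1
                chosen = k
                break
        stage.append(chosen)
    return stage


# I2 depends on I0, I3 on I1, I4 on I2.
group = [[], [], [0], [1], [2]]
print(assign_stages(group, [3, 3, 3]))  # [1, 1, 2, 2, 3]
print(assign_stages(group, [3, 1, 1]))  # [1, 1, 2, 3, None]
```

With full-width later stages all five instructions execute in the IXU; with [3, 1, 1] one of them falls through to the OXU. How often such contention occurs on real code is what the measured results decide, not this toy.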
Instructions Executed in the IXU
  – Instructions that are already ready when they enter the IXU: very few (5.5%).
  – Instructions that newly become ready in the IXU: 35% (1 stage) to 54% (3 stages, FU [3, 1, 1]).
Performance Improvement
  – Effects of FUs in the IXU
      – 4 stages (conventional superscalar processor) to 7 stages (FXA)
      – FUs: 4 (4-issue OoO superscalar) to 7 (5 in the IXU, 2 in the OXU)
  – Variable branch misprediction penalty: IXU / OXU
The number of FUs is increased.
  – IXU and OXU
  – Static energy consumption: increased.
  – Dynamic energy consumption: increased or equal.
PRF
  – The IXU and the OXU access the PRF simultaneously.
The number of Issue Queue accesses is decreased.
  – Because of the IXU.
  – Reduces energy consumption by 86%.
Evaluated IPC using an in-house cycle-accurate processor simulator.
Ran SPEC CPU.
  – Compiled using gcc with -O3.
Evaluated energy consumption and chip area using the McPAT simulator (parameters: Table 2).
BIG
  – Out-of-order superscalar (ARM Cortex-A57, big core)
  – Baseline
HALF
  – Issue width and IQ capacity are half those of BIG
LITTLE
  – In-order processor (ARM Cortex-A53, LITTLE core)
HALF+FX
  – HALF with IXU (3 stages, FU [3, 1, 1])
BIG+FX
  – BIG with IXU (3 stages, FU [3, 1, 1])
Maximum: 67%, geometric mean: 5.7%
Geometric mean: 7.4%
Geometric mean: 4.5%
Proposed FXA, which has two execution units, the IXU and the OXU.
  – 5.7% higher performance
  – 17% lower energy consumption
  – 25% higher performance/energy ratio