1 Reducing Issue Logic Complexity in Superscalar Microprocessors
Survey Project, CprE 585 – Advanced Computer Architecture
David Lastine, Ganesh Subramanian

2 Introduction
- The ultimate goal of any computer architect: designing a fast machine
- Approaches: increasing the clock rate (helped by VLSI scaling), increasing bus width, increasing pipeline depth, superscalar architectures
- There is a tradeoff between hardware complexity and clock speed: for a given technology, the more complex the hardware, the lower the achievable clock rate

3 A New Paradigm
- Retain the effective functionality of complex superscalar processors while targeting the bottleneck in present-day microprocessors
- Instruction scheduling is the throughput limiter: register renaming, the issue window, and wakeup/select logic must be handled efficiently
- Increase the clock rate by rethinking circuit design methodologies and modifying architectural design strategies
- Having the cake and eating it too? The approaches also aim to reduce power consumption

4 Approaches to Handling Issue Logic Complexity
- Performance = IPC * Clock Frequency (a small numeric sketch follows below)
- Pipelining the scheduling logic reduces IPC; keeping it non-pipelined limits the clock rate
- Architectural solutions surveyed:
- Non-pipelined scheduling with dependence-queue-based issue logic, the complexity-effective design [1]
- Pipelined scheduling with speculative wakeup [2]
- Generic speedup and power conservation using tag elimination [3]
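A minimal numeric sketch of the tradeoff on this slide, using made-up IPC and clock values (not measurements from the surveyed papers), just to show how the product Performance = IPC * Clock Frequency decides the winner:

    # Hypothetical numbers, only to illustrate Performance = IPC * Clock Frequency.
    def performance(ipc, clock_ghz):
        return ipc * clock_ghz  # billions of instructions per second

    non_pipelined = performance(ipc=2.0, clock_ghz=1.0)   # higher IPC, slower clock
    pipelined = performance(ipc=1.8, clock_ghz=1.25)      # lower IPC, faster clock
    print(non_pipelined, pipelined)                       # 2.0 2.25: the faster clock wins here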

5 Baseline Superscalar Model
- The rename and wakeup/select stages of the generic superscalar pipeline are the ones to target
- Consider VLSI effects when deciding which design component to redesign

6 Analyzing Baseline Implementations
- Physical layouts of the microprocessor circuits are optimized for speed: dynamic logic for bottleneck circuits, manual sizing of transistors on the critical path, and logic optimizations such as two-level decomposition
- Components analyzed: register rename logic, wakeup logic / issue window, selection logic, bypass logic

7 Register Rename Logic
- RAM vs. CAM schemes; the focus is on the RAM scheme due to its scalability
- Decreasing feature sizes scale down logic delays but do not correspondingly scale down wire delays
- The delay relation with issue width is quadratic, but effectively linear over the design space considered (see the sketch below)
- Wordline and bitline delays will need to be handled in future technologies
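A sketch of the kind of delay model this analysis uses; the coefficients below are placeholders rather than the fitted values from [1], and only illustrate why a delay that is formally quadratic in issue width can look linear over realistic widths:

    # Placeholder coefficients, chosen only to show the shape of the curve.
    def rename_delay(issue_width, c0=10.0, c1=2.0, c2=0.05):
        return c0 + c1 * issue_width + c2 * issue_width ** 2

    for iw in (2, 4, 8):
        print(iw, rename_delay(iw))  # the quadratic term stays small next to the linear part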

8 Wakeup Logic
- A CAM structure is preferred (a behavioral sketch follows below)
- Tag drive times are quadratic functions of both window size and issue width
- Matching times are quadratic functions of issue width only
- All delays are effectively linear over the design space considered
- Broadcast operation delays will need to be handled in future technologies
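A behavioral sketch (not a circuit) of what the wakeup CAM does: every broadcast result tag is compared against both source tags of every window entry, which is why tag-drive and match delays grow with window size and issue width. The field names here are illustrative:

    from dataclasses import dataclass

    @dataclass
    class WindowEntry:
        src1: int            # physical register tag of the first operand
        src2: int            # physical register tag of the second operand
        ready1: bool = False
        ready2: bool = False

        def wakeup(self, broadcast_tags):
            # CAM behavior: compare every broadcast tag against both source tags.
            for tag in broadcast_tags:
                if tag == self.src1:
                    self.ready1 = True
                if tag == self.src2:
                    self.ready2 = True
            return self.ready1 and self.ready2  # entry now requests issue

    entry = WindowEntry(src1=5, src2=9)
    print(entry.wakeup({5, 9}))  # True: both operands woke up this cycle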

9 Selection Logic
- A tree of arbiters: request signals propagate up from the issue window toward the root, while functional-unit grant signals propagate back down to the window (sketched below)
- A selection policy is needed (oldest first / leftmost first)
- Delays are proportional to the logarithm of the window size
- All delays considered are logic delays
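A behavioral sketch of a single-grant arbiter tree with a leftmost-first policy (assumed here for simplicity); requests are combined on the way toward the root and a single grant is driven back down, so the depth, and hence the delay, is logarithmic in the window size:

    def select_leftmost(requests):
        """Grant exactly one requesting entry, leftmost first.

        Mirrors a balanced arbiter tree: a request summary propagates toward
        the root and the single grant walks back down one path, so the tree
        depth is log2(len(requests)).
        """
        n = len(requests)
        if n == 1:
            return [requests[0]]                       # leaf: grant iff requesting
        left, right = requests[:n // 2], requests[n // 2:]
        left_grant = select_leftmost(left)
        if any(left_grant):                            # left subtree has priority
            return left_grant + [False] * len(right)
        return [False] * len(left) + select_leftmost(right)

    print(select_leftmost([False, True, False, True]))  # [False, True, False, False]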

10 Bypass Logic
- The number of bypass paths depends on pipeline depth (linearly) and issue width (quadratically); a small path-count sketch follows below
- Composed of operand muxes and result-wire buffer drivers
- Delays are quadratically proportional to the length of the result wires, and hence to issue width
- Because these are wire delays, they do not shrink with feature size and become increasingly significant relative to logic delays
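A small sketch of the path-count relation behind this slide; the count 2 * S * IW^2 (S pipestages holding results that may need forwarding, two source operands per consumer) is the usual back-of-the-envelope formula, shown here with illustrative parameters:

    def bypass_paths(issue_width, result_stages):
        # Each of IW results in each of the S relevant stages may feed either of
        # the two operand inputs of each of IW functional units.
        return 2 * result_stages * issue_width ** 2

    print(bypass_paths(4, 1))   # 32
    print(bypass_paths(8, 2))   # 256: quadratic in issue width, linear in depth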

11 Complexity-Effective Microarchitecture Design Premises
- Retain the benefits of complex issue schemes but enable faster clocking
- Design assumption: wakeup + select and data bypassing should not be pipelined; they must remain atomic if dependent instructions are to execute in consecutive cycles

12 Dependence-Based Microarchitecture
- Replace the issue window with FIFOs, each queue holding a chain of dependent instructions
- Steer instructions to the appropriate FIFO in the rename stage using heuristics (a simplified steering sketch follows below)
- 'SRC_FIFO' and reservation tables handle dependencies and wakeup
- IPC drops slightly, but the clock rate increases enough to give a faster implementation overall
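A simplified sketch of the steering idea; the real heuristic in [1] handles more cases, and SRC_FIFO here is just a dictionary mapping a destination register to the FIFO that will produce it. A new instruction is appended behind its producer when that producer sits at the tail of its FIFO, otherwise it goes to an empty FIFO, otherwise steering stalls:

    def steer(instr, fifos, src_fifo):
        """instr = (dest_reg, [src_regs]); returns the chosen FIFO index, or None to stall."""
        # Case 1: some source is produced by the instruction at the tail of a FIFO.
        for src in instr[1]:
            f = src_fifo.get(src)
            if f is not None and fifos[f] and fifos[f][-1][0] == src:
                fifos[f].append(instr)
                src_fifo[instr[0]] = f
                return f
        # Case 2: otherwise use an empty FIFO, if one exists.
        for f, queue in enumerate(fifos):
            if not queue:
                queue.append(instr)
                src_fifo[instr[0]] = f
                return f
        return None  # no suitable FIFO: stall steering

    fifos, src_fifo = [[], []], {}
    steer((1, []), fifos, src_fifo)    # r1 = ...    -> FIFO 0
    steer((2, [1]), fifos, src_fifo)   # r2 = f(r1)  -> behind its producer in FIFO 0
    steer((3, []), fifos, src_fifo)    # independent -> FIFO 1
    print(fifos)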

13 Clustering Dependence-Based Microarchitectures
- Reduce bypass delays by shortening the bypass paths
- Inter-cluster communication must be minimized; it otherwise costs an extra cycle
- Clustered microarchitecture types: single window with execution-driven steering; two windows with dispatch-driven steering (best performing); two windows with random steering

14 Pipelining Dynamic Instruction Scheduling Logic
- Wakeup + select was held atomic in the previous implementation
- Performance can be increased by pipelining it while still letting dependent instructions execute in consecutive cycles
- The wakeup is made speculative by predicting readiness from both parent and grandparent instructions
- Integrated into the Tomasulo approach

15 Wakeup Logic Details
- The tag is broadcast as soon as the instruction begins execution
- The broadcast-to-execution-completion latency is specified along with the broadcast
- The match bit acts as a sticky bit that enables a delay countdown (see the sketch below)
- The prediction need not always be correct, because of unexpected stalls
- The select logic remains as in previous work
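A behavioral sketch of the delayed-readiness mechanism described in [2]; the field and method names below are illustrative. The parent broadcasts its tag when it starts executing, the match sets a sticky bit, and the remaining broadcast-to-completion latency is counted down before the operand is treated as ready:

    class OperandSlot:
        def __init__(self, src_tag):
            self.src_tag = src_tag
            self.matched = False   # sticky match bit
            self.delay = 0         # cycles until the parent's result is available

        def snoop(self, tag, latency):
            # The parent broadcasts its tag when it begins execution, together
            # with its broadcast-to-completion latency.
            if not self.matched and tag == self.src_tag:
                self.matched = True
                self.delay = latency

        def tick(self):
            # One cycle: count down once the match bit is set.
            if self.matched and self.delay > 0:
                self.delay -= 1
            return self.matched and self.delay == 0   # operand ready

    op = OperandSlot(src_tag=7)
    op.snoop(tag=7, latency=2)
    print(op.tick(), op.tick(), op.tick())  # False True True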

16 Pipelining Rename Logic
- A child instruction assumes that its parent will broadcast its tag in the next cycle if the grandparent instruction broadcasts its tag now
- Receiving the grandparent's tag therefore triggers a speculative wakeup, so the child can be considered for selection in the next cycle
- Speculative because the parent's selection for execution is not guaranteed
- Requires modifications to the rename map and the dependency analysis logic

17 Wakeup and Select Logic
- A wakeup request is sent after examining the ready bits associated with the parents' and grandparents' tags
- The field for a multi-cycle parent can be ignored
- In addition to the speculative readiness signaled by the request line, a confirm line is asserted when all parents are actually ready (sketched below)
- A false selection is one that grants a non-confirmed request
- This is a problem only when truly ready instructions end up not being selected
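A sketch of the request/confirm distinction as described on this slide; the signal names are assumptions, not taken verbatim from [2]. An entry requests issue when every operand is ready or speculatively ready through its grandparent, and confirms only when every parent is actually ready; a grant without confirm is the false-selection case:

    def request_and_confirm(parent_ready, grandparent_ready):
        """One boolean per operand in each list.

        request: every operand is ready, or its grandparent has broadcast, so the
                 parent is predicted to broadcast next cycle (speculative readiness).
        confirm: every parent really is ready (non-speculative).
        """
        request = all(p or g for p, g in zip(parent_ready, grandparent_ready))
        confirm = all(parent_ready)
        return request, confirm

    print(request_and_confirm([True, False], [True, True]))  # (True, False): speculative
    print(request_and_confirm([True, True], [True, True]))   # (True, True): safe to select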

18 Implementation & Experimentation Details
- A cycle-accurate, execution-driven simulator for the Alpha ISA was used
- Configurations: Baseline (conventional 2-cycle scheduling pipeline), Budget / Deluxe (speculatively-woken-up scheduling), Ideal (1-cycle scheduling pipeline)
- Factors such as issue width and reservation station depth were considered
- Significant reduction in the critical path with minor IPC impact
- Enables higher clock frequencies, deeper pipelines, and larger instruction windows for better performance

19 Paradigm Shift
- So far we have added hardware to improve performance
- However, the issue window could also be improved by removing hardware

20 Current Situation of Issue Windows
- Content-addressable memory (CAM) latency dominates instruction window latency
- The load capacitance of the CAM is a major limiting factor for speed
- Parasitic capacitance also wastes power
- Issue logic consumes a large share of the power budget: 16% for the Pentium Pro, 18% for the Alpha 21264

21 Unnecessary Circuitry
- Observation: reservation stations compare broadcast tags against both operands, and often this is unnecessary
- Only 25% to 35% of architectural instructions have two register operands
- Simulation of SPEC2K programs shows that only 10% to 20% of instructions need two comparators at runtime

22 Simulation
- Used SimpleScalar
- Instruction window size varied over 16, 64, and 256 entries
- Load/store queue sized at half the window size

23 Removing Extra Comparators
- Specialize the reservation stations: the number of comparators per station varies from 2 down to 0 (an allocation sketch follows below)
- Stall if no station with at least the required number of comparators is available
- Comparators can be removed further by speculating on which operand will be the last to complete
- This requires a predictor and carries a misprediction penalty
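A sketch of the allocation policy this slide describes; the data structures are illustrative, not the mechanism from [3] verbatim. The dispatcher counts how many source operands of the incoming instruction are still outstanding and places it in a free station with at least that many comparators, stalling when none is available:

    def allocate(needed_tags, free_stations):
        """needed_tags: number of not-yet-ready source operands (0, 1, or 2).
        free_stations: dict mapping comparator count -> number of free slots.
        Returns the comparator class used, or None to signal a dispatch stall."""
        for comparators in sorted(free_stations):            # prefer the cheapest station
            if comparators >= needed_tags and free_stations[comparators] > 0:
                free_stations[comparators] -= 1
                return comparators
        return None  # no station with enough comparators: stall dispatch

    free = {0: 4, 1: 8, 2: 4}
    print(allocate(2, free))  # 2: needs both comparators
    print(allocate(0, free))  # 0: both operands already ready
    print(free)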

24 Predictor
- The paper discusses a gshare-style predictor, based on a branch predictor not seen in class [4]
- The idea starts from the observation that two good indexes for selecting binary predictors are the branch address and the global history
- If both are good, XORing them together should produce an index embodying more information than either alone (sketched below)
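A minimal gshare sketch in the spirit of [4]: the branch address and the global history register are XORed to index a table of two-bit saturating counters. The table size and initialization below are arbitrary choices for illustration; the same indexing idea is what the last-operand predictor reuses:

    class Gshare:
        def __init__(self, index_bits=12):
            self.mask = (1 << index_bits) - 1
            self.history = 0                        # global branch history register
            self.table = [1] * (1 << index_bits)    # 2-bit counters, weakly not-taken

        def _index(self, pc):
            # XOR folds the branch address and the global history into one index.
            return (pc ^ self.history) & self.mask

        def predict(self, pc):
            return self.table[self._index(pc)] >= 2     # counter MSB set -> predict taken

        def update(self, pc, taken):
            i = self._index(pc)
            self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
            self.history = ((self.history << 1) | int(taken)) & self.mask

    bp = Gshare()
    print(bp.predict(0x400123))      # False before any training
    bp.update(0x400123, taken=True)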

25 Predictor II
- How gshare performs for various sizes of the prediction table

26 Misprediction
- The Alpha has a scoreboard of valid registers called RDY
- Check whether all operands are available in the register read stage; if not, flush the pipeline in the same fashion as a latency misprediction (see the sketch below)
- RDY must be expanded so that the number of read ports matches the issue width
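A sketch of the recovery check as this slide describes it, with RDY modeled as a simple set of ready registers (an assumption for illustration). At the register read stage each issuing instruction's sources are checked, and any miss triggers the same flush-and-reissue path as a latency misprediction:

    def check_and_flush(issued_group, rdy):
        """issued_group: list of (instr_name, [src_regs]); rdy: set of ready registers.
        Returns the instructions that must be flushed and rescheduled."""
        flush = []
        for name, srcs in issued_group:
            # One RDY read port per source of every issuing instruction, which is
            # why the number of ports must grow with the issue width.
            if not all(s in rdy for s in srcs):
                flush.append(name)
        return flush

    rdy = {1, 2, 3}
    print(check_and_flush([("add", [1, 2]), ("mul", [3, 7])], rdy))  # ['mul']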

27 IPC Losses
- Reservation stations with two comparators can be exhausted, causing stalls for SPEC2K benchmarks such as SWIM
- Adding last-tag prediction improves SWIM performance but causes 1-3% losses for benchmarks such as Crafty and GCC, due to mispredictions

28 Simulation
- Configurations are labeled by the number of two-tag / one-tag / zero-tag reservation stations
- The last-tag predictor is used only in configurations with no two-tag reservation stations

29 Benefits of Comparator Removal
- In most cases the clock rate can be 25-45% faster, since the tag bus no longer has to reach every reservation station and removing comparators removes load capacitance
- The energy saved by removing that capacitance is 30-60%
- Power savings do not track the energy savings one-for-one, because the clock rate can now be increased

30 Simulation results for benefits

31 References
1. Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-Effective Superscalar Processors.
2. J. Stark, M. D. Brown, and Yale N. Patt. On Pipelining Dynamic Instruction Scheduling Logic.
3. Dan Ernst and Todd Austin. Efficient Dynamic Scheduling Through Tag Elimination.
4. Scott McFarling. Combining Branch Predictors.

32 Questions?

