Published by Brandon Spencer. Modified over 8 years ago.
1
The Life of an Instruction in the EV6 Pipeline
Constantinos Kourouyiannis
2
Alpha 21264 (EV6) pipeline
3
Instruction Fetch
- Fetches 4 instructions per cycle
- Techniques for maximum fetch efficiency:
  - Large 64KB, 2-way set-associative instruction cache
  - Line and set prediction to indicate where to fetch the next block from, including which set to use
  - Low mispredict cost for line and set prediction (single-cycle bubble)
- Branch predictor: the prediction scheme dynamically chooses between local and global history
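The idea of choosing dynamically between local and global history can be sketched as a small tournament predictor. This is only an illustration: the table sizes, the 2-bit counters, the indexing, and every name below are assumptions, not the 21264's actual structures.

```python
class TournamentPredictor:
    """Toy chooser between a local- and a global-history predictor.

    Sizes and names are illustrative assumptions, not EV6 parameters.
    """

    def __init__(self, entries=1024, history_bits=10):
        self.entries = entries
        self.mask = (1 << history_bits) - 1
        self.local_history = [0] * entries                 # per-branch outcome history
        self.local_counters = [1] * (1 << history_bits)    # 2-bit counters
        self.global_history = 0                            # outcomes of recent branches
        self.global_counters = [1] * (1 << history_bits)   # 2-bit counters
        self.chooser = [1] * (1 << history_bits)           # >= 2 means "trust global"

    def _indices(self, pc):
        return self.local_history[pc % self.entries], self.global_history

    def predict(self, pc):
        local_idx, global_idx = self._indices(pc)
        local_pred = self.local_counters[local_idx] >= 2
        global_pred = self.global_counters[global_idx] >= 2
        use_global = self.chooser[global_idx] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        local_idx, global_idx = self._indices(pc)
        local_ok = (self.local_counters[local_idx] >= 2) == taken
        global_ok = (self.global_counters[global_idx] >= 2) == taken
        # Train the chooser only when the two predictors disagree.
        if local_ok != global_ok:
            delta = 1 if global_ok else -1
            self.chooser[global_idx] = min(3, max(0, self.chooser[global_idx] + delta))
        # Move both 2-bit counters toward the actual outcome.
        step = 1 if taken else -1
        self.local_counters[local_idx] = min(3, max(0, self.local_counters[local_idx] + step))
        self.global_counters[global_idx] = min(3, max(0, self.global_counters[global_idx] + step))
        # Shift the outcome into both histories.
        bit = 1 if taken else 0
        slot = pc % self.entries
        self.local_history[slot] = ((self.local_history[slot] << 1) | bit) & self.mask
        self.global_history = ((self.global_history << 1) | bit) & self.mask
```

With a single branch both components see the same history; the chooser only matters once several branches interleave and the global history starts to differ from each branch's own history.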
4
Register Renaming
- Assigns a unique storage location to each write reference to a register
- Eliminates WAR and WAW register dependencies while preserving all RAW register dependencies needed for correct computation
- Beyond the 64 architectural registers, 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement, within an 80-instruction in-flight window
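A minimal sketch of how renaming removes WAR and WAW hazards while keeping RAW dependencies. The register counts, the `(dest, src1, src2)` tuple format, and the function name are assumptions chosen for illustration; the sketch also never frees registers, whereas a real machine reclaims them at retirement.

```python
def rename(instructions, num_arch=4, num_phys=8):
    """Rename (dest, src1, src2) architectural registers to physical ones.

    Illustrative sketch: register counts and the tuple format are assumptions.
    """
    mapping = {r: r for r in range(num_arch)}   # architectural -> current physical
    free = list(range(num_arch, num_phys))      # unallocated physical registers
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, so RAW dependencies are preserved.
        p1, p2 = mapping[src1], mapping[src2]
        # Every write gets a fresh location, which removes WAR and WAW hazards.
        # (pop raises IndexError when no register is free; hardware would stall.)
        pd = free.pop(0)
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


# r1 = r2 op r3; r2 = r1 op r1 (RAW on r1, WAR on r2); r1 = r3 op r3 (WAW on r1)
program = [(1, 2, 3), (2, 1, 1), (1, 3, 3)]
renamed_program = rename(program)
```

After renaming, the second instruction still reads the first one's result, but the third instruction writes a different physical register than the first, so it no longer has to wait for it.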
5
Out-of-Order Issue Queues
- Separate integer and floating-point queues
- Each cycle the queues select from pending instructions as they become data-ready, using register scoreboards based on the renamed register numbers
- Scoreboards maintain the status of renamed registers by tracking the progress of single-cycle, multiple-cycle and variable-cycle instructions
- When a functional-unit or load result becomes available, the scoreboard unit notifies the instructions in the queue that require the register value
- These instructions can issue as soon as the bypassed result is available from the functional unit or load
6
Out-of-Order Issue Queues (cont.)
- The 20-entry integer queue can issue 4 instructions per cycle
- The 15-entry floating-point queue can issue 2 instructions per cycle
- Each instruction is statically assigned to 2 of the 4 execution pipes (upper or lower) before it enters the queue
- The issue queue has 2 arbiters, one for the upper pipes and one for the lower pipes, each of which dynamically issues the oldest 2 instructions every cycle
- The queues issue instructions speculatively
- The queue is collapsing: an entry becomes available as soon as its instruction issues or is squashed due to mis-speculation
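The oldest-first selection and the collapsing behaviour can be sketched roughly as follows. The dictionary keys, the `select` function, and the use of a set for ready registers are hypothetical, and the sketch ignores speculation (it does not hold issued instructions in the queue).

```python
def select(queue, ready_regs, per_group=2):
    """Pick the oldest data-ready instructions for each pipe group.

    `queue` is ordered oldest first; each entry is a dict with hypothetical
    keys 'name', 'pipes' ('upper' or 'lower'), and 'srcs' (renamed registers).
    """
    issued = {"upper": [], "lower": []}
    for entry in queue:
        group = issued[entry["pipes"]]
        data_ready = all(src in ready_regs for src in entry["srcs"])
        # Each arbiter takes at most `per_group` instructions, oldest first.
        if data_ready and len(group) < per_group:
            group.append(entry["name"])
    chosen = set(issued["upper"]) | set(issued["lower"])
    # Collapse the queue: issued entries free their slots, the rest keep order.
    remaining = [e for e in queue if e["name"] not in chosen]
    return issued, remaining


queue = [
    {"name": "a", "pipes": "upper", "srcs": [1]},
    {"name": "b", "pipes": "upper", "srcs": [9]},     # source not ready yet
    {"name": "c", "pipes": "upper", "srcs": [2]},
    {"name": "d", "pipes": "upper", "srcs": [1, 2]},  # ready, but arbiter is full
    {"name": "e", "pipes": "lower", "srcs": [3]},
]
issued, remaining = select(queue, ready_regs={1, 2, 3})
```

Here `a` and `c` issue to the upper pipes, `e` to the lower pipes, and `b` and `d` stay in the queue in their original order.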
7
Execution Engine
- All execution units require access to the register file
- 14 ports are needed to support 4 simultaneous instructions in addition to 2 load operations, which makes the register file large
- The 21264 splits the register file into 2 clusters, each holding a duplicate of the 80-entry register file
- 2 pipes access a single register file to form a cluster, and 2 clusters are combined to support 4-way integer instruction execution
- Incremental cost: one additional cycle of latency to broadcast results from each integer cluster to the other cluster (a small cost)
- The integer issue queue dynamically schedules instructions to minimize the 1-cycle cross-cluster communication cost
- 2 FP execution pipes access a single 72-entry register file
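The cross-cluster cost can be illustrated with a small model: a result is usable in its own cluster as soon as it is produced and one cycle later in the other cluster. The function names and the greedy "earliest start" policy are assumptions made for this sketch, not a description of the actual issue logic.

```python
CROSS_CLUSTER_DELAY = 1  # extra cycle to broadcast a result to the other cluster


def operand_ready(produced_cycle, producer_cluster, consumer_cluster):
    """Cycle at which a result is usable in `consumer_cluster`."""
    if producer_cluster == consumer_cluster:
        return produced_cycle
    return produced_cycle + CROSS_CLUSTER_DELAY


def best_cluster(operands, clusters=(0, 1)):
    """Pick the cluster where all operands are available earliest.

    `operands` is a list of (produced_cycle, producer_cluster) pairs.
    Ties go to the first cluster listed.
    """
    def start_cycle(cluster):
        return max(operand_ready(cycle, src, cluster) for cycle, src in operands)

    return min(clusters, key=start_cycle)
```

For example, an instruction whose operands are both produced in cluster 0 at cycle 5 can start at cycle 5 in cluster 0 but only at cycle 6 in cluster 1, so the sketch keeps it in cluster 0.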
8
Execution Engine (cont.)
New functionality not present in prior Alpha microprocessors:
- Fully pipelined integer multiply unit
- Integer population count and leading/trailing zero count unit
- Floating-point square root functional unit
- Instructions to move register values directly between FP and integer registers
9
Memory System
- Supports multiple in-flight memory references and out-of-order operation
- Receives up to 2 memory operations from the integer execution pipes every cycle
- The data cache operates at twice the processor clock frequency
- 3-cycle latency for integer loads and 4-cycle latency for FP loads
10
Store/Load Memory Ordering
- Hazard detection logic recovers from mis-speculation in which a load incorrectly issues before an earlier store to the same address
- After the first mis-speculation of a load, the out-of-order execution core is trained to avoid it on subsequent executions of the same load
- Training sets a bit in a load wait table that is examined at fetch time
- If the bit is set, the 21264 delays the issue point of the load until all prior stores have issued
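A rough sketch of the load wait table idea: one bit per table entry, indexed by the load's address, set after a mis-speculation and consulted before the load is allowed to issue ahead of older stores. The table size, the indexing by PC, and all names are assumptions for illustration.

```python
class LoadWaitTable:
    """One wait bit per entry, indexed by load PC (sizes are assumptions)."""

    def __init__(self, entries=1024):
        self.bits = [False] * entries

    def _index(self, load_pc):
        # Alpha instructions are 4 bytes, so drop the two low PC bits.
        return (load_pc >> 2) % len(self.bits)

    def must_wait(self, load_pc):
        """Consulted at fetch: should this load wait for all prior stores?"""
        return self.bits[self._index(load_pc)]

    def record_violation(self, load_pc):
        """Called when a load is found to have issued before a conflicting store."""
        self.bits[self._index(load_pc)] = True


def can_issue_load(table, load_pc, prior_stores_issued):
    """A marked load may issue only once all prior stores have issued."""
    return prior_stores_issued or not table.must_wait(load_pc)
```

A load that has never mis-speculated issues freely; after one violation the same load is held back until older stores have issued, while other loads are unaffected.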
11
Load Hit/Miss Prediction
- To achieve the 3-cycle integer load hit latency, consumers of integer load data must issue speculatively, before it is known whether the load hit or missed in the on-chip data cache
- A load miss triggers a mini-restart: when consumers speculatively issue 3 cycles after a load that misses, 2 integer issue cycles are squashed, and all instructions that issued during those 2 cycles are pulled back into the issue queue to be re-issued later
- This is less costly than a full pipeline restart, but still expensive for applications with many integer load misses
- The 21264 predicts when loads will miss and does not speculatively issue the consumers of the load in that case
- Effective load latency: 5 cycles for an integer load hit that is incorrectly predicted to miss
12
Load Hit Speculation
[Figure: Pipeline Timing for Integer Load. Cycle numbers 1 to 6, with a "Hit" marker.]
Symbol  Meaning
Q       Issue queue
R       Register file read
E       Execute
D       Dcache access
B       Data bus active
Rows: Integer Load (Q R E D B), Instruction 1 (Q R), Instruction 2 (Q)
13
Load Hit Speculation (cont.)
- There are 2 cycles in which the issue queue may speculatively issue instructions that use load data before Dcache hit information is known
- Any instructions issued in these 2 cycles are kept in the issue queue until the load hit condition is known, even if they do not depend on the load
- If the load hits, the instructions are removed from the queue
- If the load misses, execution of these instructions is aborted and they are allowed to request service again
- In the previous example, instructions 1 and 2 are issued within the speculative window of the load instruction. If the load hits, both instructions are removed from the queue by the start of cycle 7; if it misses, both are aborted from the execution pipelines
14
Load Hit Speculation (cont.)
- If load misses are likely, the 21264 can still benefit from scheduling the instruction stream for Dcache miss latency
- A saturating 4-bit counter is incremented by 1 when a load hits and decremented by 2 when a load misses
- When the upper bit of the counter is 0, the integer load latency is increased to 5 cycles and the speculative window is removed
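The counter rule above is small enough to sketch directly. The class and method names are hypothetical, and the initial counter value (fully saturated, predicting hits) is an assumption; the increment, decrement, width, and upper-bit test follow the slide.

```python
class LoadHitPredictor:
    """4-bit saturating counter: +1 on a load hit, -2 on a load miss."""

    def __init__(self, bits=4):
        self.max_value = (1 << bits) - 1
        self.upper_bit = 1 << (bits - 1)
        self.counter = self.max_value  # assumed starting state: predict hit

    def update(self, hit):
        if hit:
            self.counter = min(self.max_value, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)

    def predicts_hit(self):
        # Loads are predicted to hit while the upper counter bit is set.
        return bool(self.counter & self.upper_bit)

    def integer_load_latency(self):
        # 3 cycles with hit speculation, 5 cycles when the window is removed.
        return 3 if self.predicts_hit() else 5
```

Because a miss costs two steps and a hit earns one, a stream needs roughly twice as many hits as misses to keep the speculative window enabled.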
15
Load Hit Speculation (cont.)
[Figure: Pipeline Timing for Floating Point Load. Cycle numbers 1 to 6, with a "Hit" marker.]
Symbol  Meaning
Q       Issue queue
R       Register file read
E       Execute
D       Dcache access
B       Data bus active
Rows: Load (Q R E D B), Instruction 1 (Q R), Instruction 2 (Q)
16
Load Hit Speculation (cont.)
- The speculative window for FP loads is 1 cycle
- FQ-issued instructions within this window of an FP load that has missed are aborted only if they depend on the load being successful
- In the example, only instruction 1 is issued in the speculative window
- If instruction 1 does not use the data returned by the load, it is removed from the queue at the normal time (cycle 7)
- If it depends on the load data and the load hits, it is removed from the queue one cycle later
- If the load misses, instruction 1 is aborted from the execution pipelines and may request service again in cycle 7
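The difference between the integer and FP abort rules on a load miss can be summarized in a few lines. The function and argument names are hypothetical; the rule itself is taken from the slides (integer: everything issued in the 2-cycle window; FP: only dependents issued in the 1-cycle window).

```python
def aborted_on_load_miss(queue, issued_in_window, depends_on_load):
    """Return True if an instruction is aborted when the load misses.

    `queue` is "integer" or "fp"; the argument names are hypothetical.
    """
    if queue not in ("integer", "fp"):
        raise ValueError("queue must be 'integer' or 'fp'")
    if not issued_in_window:
        return False
    if queue == "integer":
        # Integer queue: everything issued in the 2-cycle window is aborted,
        # whether or not it uses the load data.
        return True
    # FP queue: only instructions that depend on the load are aborted.
    return depends_on_load
```

An independent instruction issued in the window is therefore aborted on the integer side but allowed to complete on the FP side.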
17
Conclusion
- 21264: the fastest microprocessor available at the time of its introduction
- Combines high Alpha clock speeds with many advanced micro-architectural techniques, including out-of-order and speculative execution with many in-flight instructions
- A high-bandwidth memory system quickly delivers data values to the execution core
- Result: robust performance for many applications