Slide 1: Decoupled Pipelines: Rationale, Analysis, and Evaluation
Frederick A. Koopmans, Sanjay J. Patel
Department of Computer Engineering, University of Illinois at Urbana-Champaign
Slide 2: Outline
- Introduction & Motivation
- Background
- DSEP Design
- Average-Case Optimizations
- Experimental Results
Slide 3: Motivation
Why asynchronous?
- No clock skew
- No clock distribution circuitry
- Lower power (potentially)
- Increased modularity
But what about performance? What is the architectural benefit of removing the clock? Decoupled pipelines!
Slide 4: Motivation
Advantages of a decoupled pipeline:
- The pipeline achieves average-case performance
- Rarely taken critical paths no longer affect performance
- New potential for average-case optimizations
Slide 5: Synchronous vs. Decoupled
[Figure: a synchronous pipeline (Stage1 -> Stage2 -> Stage3 passing data through synchronous latches under a shared clock) contrasted with a decoupled pipeline, in which each stage has its own control unit (Control1-Control3) and neighboring stages exchange go/ack events and data through elastic buffers. The synchronizing mechanisms are the asynchronous communication protocol, self-timed logic, and elastic buffers in place of the clock and latches.]
Slide 6: Outline
- Introduction & Motivation
- Background
- DSEP Design
- Average-Case Optimizations
- Experimental Results
Slide 7: Self-Timed Logic
Bounded-delay model:
- Definition: event = signal transition
- A start event is provided when the inputs are available
- A done event is produced when the outputs are stable
- The delay is fixed, based on critical-path analysis
- The computational circuit is unchanged
[Figure: a computational circuit wrapped by a self-timing circuit; the input triggers start, and a matched delay raises done once the output is stable.]
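To make the bounded-delay model concrete, here is a minimal discrete-event sketch in Python; `BoundedDelayStage`, the delay value, and the adder are illustrative assumptions, not from the paper.

```python
# Minimal discrete-event sketch of the bounded-delay self-timing model.
# All names (BoundedDelayStage, etc.) are illustrative, not from the paper.

class BoundedDelayStage:
    def __init__(self, func, delay):
        self.func = func      # the computational circuit, unchanged
        self.delay = delay    # fixed delay from critical-path analysis

    def start(self, now, inputs):
        """Called when inputs are available; returns (done_time, outputs)."""
        outputs = self.func(*inputs)      # compute (instantaneous in the model)
        return now + self.delay, outputs  # 'done' fires after the fixed delay

# Usage: an adder whose critical path was analyzed at 120 time units.
adder = BoundedDelayStage(lambda a, b: a + b, delay=120)
done_at, result = adder.start(now=0, inputs=(3, 4))
print(done_at, result)  # 120 7
```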
Slide 8: Asynchronous Logic Gates
- C-gate (logical AND of events): waits for events to arrive on both inputs
- XOR-gate (logical OR of events): fires when an event arrives on either input
- SEL-gate (logical DEMUX): routes an input event to one of its outputs
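A sketch of the three event gates under transition signaling; the class names and port encoding are illustrative assumptions, not circuit-accurate models.

```python
# Transition-signaling sketch of the three event gates.
# Events are abstract tokens; a True return means "output event fired".

class CGate:
    """Logical AND of events: fires only after BOTH inputs have seen an event."""
    def __init__(self):
        self.pending = [False, False]
    def event(self, port):
        self.pending[port] = True
        if all(self.pending):
            self.pending = [False, False]
            return True   # output event
        return False

class XorGate:
    """Logical OR of events: any input event produces an output event."""
    def event(self, port):
        return True

class SelGate:
    """Logical DEMUX: routes each input event to one of two outputs."""
    def event(self, select):
        return 1 if select else 0   # index of the output that fires

c = CGate()
assert c.event(0) is False   # still waiting on the second input
assert c.event(1) is True    # both arrived: fire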
Slide 9: Asynchronous Communication Protocol
- 2-step, event-triggered, level-insensitive protocol
- Transactions are encoded in go/ack events
- Passes instructions between stages asynchronously
[Figure: a sender stage and a receiver stage connected by go, ack, and data wires; each transaction is one go event answered by one ack event, framing data_1, data_2, ...]
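A minimal sketch of the 2-step go/ack handshake, assuming Python queues can stand in for the go, ack, and data wires; all names are illustrative.

```python
# Sketch of the 2-step, event-triggered go/ack protocol between two stages.
import threading, queue

go, ack, data = queue.Queue(), queue.Queue(), queue.Queue()

def sender(items):
    for item in items:
        data.put(item)   # drive the data wires
        go.put(None)     # go event: data is valid
        ack.get()        # wait for the ack event before the next transaction

def receiver(n, out):
    for _ in range(n):
        go.get()                 # wait for the go event
        out.append(data.get())   # latch the data
        ack.put(None)            # ack event: ready for the next transaction

received = []
t = threading.Thread(target=receiver, args=(3, received))
t.start()
sender(["ld", "add", "st"])
t.join()
print(received)  # ['ld', 'add', 'st']
```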
Slide 10: Outline
- Introduction & Motivation
- Background
- DSEP Design
- Average-Case Optimizations
- Experimental Results
Slide 11: DSEP Microarchitecture
Decoupled, Self-Timed, Elastic Pipeline. At a high level:
- 9-stage dynamic pipeline
- Multiple instruction issue
- Multiple functional units
- Out-of-order execution
- Looks like the Intel P6 microarchitecture, so what's the difference?
[Figure: pipeline stages from the I-Cache: Fetch, Decode, Rename, Read/Reorder, Issue, Execute, Data Read, and Retire (write back, commit, flush).]
Slide 12: DSEP Microarchitecture
Decoupled: each stage controls its own latency.
- The latency is based on the stage's local critical path
- Stage balancing is not important
- Each stage can have several different latencies, selected based on its inputs
- The pipeline is operating at several different speeds simultaneously!
Slide 13: Pipeline Elasticity
Definition: the pipeline's ability to stretch with the latency of its instruction stream.
Global elasticity:
- Provided by the reservation stations and the reorder buffer
- The same for synchronous and asynchronous pipelines
- When Execute stalls, these buffers allow Fetch and Retire to keep operating
[Figure: Fetch -> Execute -> Retire with buffers between the stages.]
Slide 14: Pipeline Elasticity
Local elasticity:
- Needed for a completely decoupled pipeline
- Provided by micropipelines: variable-length queues between stages
- Efficient implementation, little overhead
- They behave like shock absorbers between stages
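A sketch of a micropipeline as a bounded elastic queue between two stages, assuming a simple push/pull interface; the class name and capacity are illustrative.

```python
# Sketch of a micropipeline as a small elastic FIFO between two stages.
from collections import deque

class Micropipeline:
    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity

    def can_accept(self):    # upstream stalls only when the queue is full
        return len(self.buf) < self.capacity

    def push(self, item):
        assert self.can_accept()
        self.buf.append(item)

    def pull(self):          # downstream stalls only when the queue is empty
        return self.buf.popleft() if self.buf else None

# A fast producer absorbs a slow consumer's jitter up to the queue depth:
q = Micropipeline(capacity=2)
q.push("i1"); q.push("i2")
assert not q.can_accept()    # producer stalls here; consumer is unaffected
assert q.pull() == "i1"      # consumer drains in order
```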
Slide 15: Outline
- Introduction & Motivation
- Background
- DSEP Design
- Average-Case Optimizations
- Experimental Results
Slide 16: Analysis
Synchronous processor:
- Each stage runs at the speed of the worst-case stage running its worst-case operation
- Designer's focus: critical paths and stage balancing
DSEP:
- Each stage runs at the speed of its own average operation
- Designer's focus: optimizing the most common operation
This is the fundamental advantage of the decoupled pipeline.
Slide 17: Average-Case Optimizations
Designer's strategy:
- Implement fine-grained latency tuning
- Avoid the latency of untaken paths
Consider a generic stage in which select logic steers each input either through a short operation or a long one, with a MUX merging the results (sketch below). If the short op is much more common, the stage's average latency approaches that of the short path plus the select logic.
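A sketch of the generic average-case stage, with symbolic latencies standing in for circuit delays; all names and numbers are illustrative assumptions.

```python
# Generic average-case stage: take the short path whenever the select
# logic allows it. Latencies are symbolic time units, not measurements.

def average_case_stage(x, is_short, short_op, long_op,
                       select_delay=5, short_delay=20, long_delay=120):
    """Returns (result, latency) for one input."""
    if is_short(x):                                # select logic
        return short_op(x), select_delay + short_delay
    return long_op(x), select_delay + long_delay

# If 90% of inputs take the short path, the average latency is
# 0.9*(5+20) + 0.1*(5+120) = 35, versus a fixed worst case of 125.
```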
Slide 18: Average-Case ALU
- Tune the ALU latency to closely match the input operation
- ALU performance is then proportional to the average op, not the worst case
- The computational circuit is unchanged
[Figure: the ALU's self-timing circuit routes the start event through a SEL gate to a delay matched to the op class (arithmetic, logic, shift, compare); an XOR gate merges the delays into a single done event.]
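A sketch of the average-case ALU, assuming four op classes with hypothetical matched delays; the computation itself is unchanged, only the reported latency varies by class.

```python
# Sketch of the average-case ALU: the self-timing circuit selects a delay
# matched to the operation class. Delays are symbolic and illustrative.

OP_DELAY = {"logic": 20, "shift": 40, "compare": 40, "arith": 80}

def alu(op, a, b):
    """Returns (result, latency): compute as usual, time by op class."""
    result = {
        "logic":   lambda: a & b,
        "shift":   lambda: a << (b & 31),
        "compare": lambda: int(a < b),
        "arith":   lambda: a + b,
    }[op]()
    return result, OP_DELAY[op]

print(alu("logic", 0b1100, 0b1010))  # (8, 20): fast path for logic ops
print(alu("arith", 3, 4))            # (7, 80): slower path for arithmetic
```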
Slide 19: Average-Case Decoder
- Tune the decoder latency to match the input instruction
- Common instructions often have simple encodings
- Prioritize the most frequent instructions
[Figure: the decoder's self-timing circuit selects among per-format delays (Format 1, 2, 3) via a SEL gate, merged by an XOR gate into done.]
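A sketch of the average-case decoder, assuming a hypothetical MIPS-like format split and made-up per-format delays; the paper does not specify this mapping.

```python
# Sketch of the average-case decoder: latency depends on the instruction
# format, with the most frequent, simplest formats decoding fastest.

# Hypothetical split: R and common I formats get the short delays;
# rarer or messier encodings pay the long one.
FORMAT_DELAY = {"R": 50, "I": 80, "other": 120}

def classify(word):
    opcode = (word >> 26) & 0x3F
    if opcode == 0:
        return "R"                              # register format
    if opcode in (0x08, 0x09, 0x23, 0x2B):      # addi, addiu, lw, sw
        return "I"                              # common immediate format
    return "other"

def decode(word):
    fmt = classify(word)
    return fmt, FORMAT_DELAY[fmt]   # (decoded class, self-timed latency)

print(decode(0x00851020))  # R-format add: fast path
```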
Slide 20: Average-Case Fetch Alignment
- Optimize for aligned fetch blocks
- If the fetch block is aligned on a cache line, it can skip the alignment and masking overhead
- The optimization is effective when software/hardware alignment optimizations are effective
[Figure: an "aligned?" check on the address steers the instruction block either straight out or through the align/mask path, with a MUX selecting the result.]
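A sketch of the fast/slow fetch paths, assuming a hypothetical 32-byte line and caller-supplied `read_line` / `align_and_mask` helpers (both illustrative).

```python
# Sketch of average-case fetch alignment: a cache-line-aligned fetch block
# skips the align/mask overhead entirely. Parameters are illustrative.

LINE_BYTES = 32   # hypothetical cache-line size

def fetch_block(addr, read_line, align_and_mask):
    if addr % LINE_BYTES == 0:
        return read_line(addr)                    # fast path: already aligned
    return align_and_mask(read_line(addr), addr)  # slow path: shift and mask
```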
Slide 21: Average-Case Cache Access
- Optimize for consecutive reads to the same cache line
- Subsequent references to the same line skip the cache access entirely
- Effective for small-stride access patterns and tight loops in the I-Cache
- Very little overhead for non-consecutive references
[Figure: the address is checked against the previous line; a MUX selects between the buffered previous line and a fresh cache read.]
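A sketch of the same-line optimization, assuming the cache is a tag-indexed dict; `LastLineBuffer` and the line size are illustrative names, not from the paper.

```python
# Sketch of average-case cache access: remember the last line read and
# reuse it when the next reference hits the same line.

class LastLineBuffer:
    def __init__(self, cache, line_bytes=32):
        self.cache = cache            # backing cache: tag -> line data
        self.line_bytes = line_bytes
        self.last_tag = None
        self.last_line = None

    def read(self, addr):
        tag = addr // self.line_bytes
        if tag == self.last_tag:      # same line: skip the cache access
            return self.last_line
        self.last_tag = tag           # different line: pay the full access
        self.last_line = self.cache[tag]
        return self.last_line

cache = {addr // 32: f"line {addr // 32}" for addr in range(0, 128, 32)}
buf = LastLineBuffer(cache)
print(buf.read(0), buf.read(4))   # second read reuses the buffered line
```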
Slide 22: Average-Case Comparator
- Optimize for the case that a difference exists in the lower 4 bits of the inputs
- The 4-bit comparison is more than 50% faster than the 32-bit comparison
- Very effective for iterative loops
- Can be extended to tag comparisons
[Figure: a 4-bit compare of the low bits runs alongside the full 32-bit compare; a MUX picks the fast result whenever the low bits already decide the answer.]
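A sketch of the optimized equality comparator with symbolic delays; the ">50% faster" figure appears here only as the assumed FAST_DELAY/FULL_DELAY ratio.

```python
# Sketch of the average-case equality comparator: if the low 4 bits differ,
# the inputs cannot be equal, so the fast path decides immediately.

FAST_DELAY, FULL_DELAY = 10, 25   # symbolic: 4-bit compare >50% faster

def equal32(a, b):
    """Returns (a == b, latency in symbolic time units)."""
    if (a ^ b) & 0xF:              # difference in the low 4 bits
        return False, FAST_DELAY   # fast path: definitely not equal
    return a == b, FULL_DELAY      # low bits match: full 32-bit compare

# Loop counters usually differ in their low bits, so the fast path dominates:
print(equal32(100, 101))  # (False, 10)
print(equal32(100, 100))  # (True, 25)
```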
Slide 23: Outline
- Introduction & Motivation
- Background
- DSEP Design
- Average-Case Optimizations
- Experimental Results
Slide 24: Simulation Environment
- VHDL simulator using the Renoir Design Suite
- MIPS I instruction set
- Fetch and Retire bandwidth = 1
- Execute bandwidth ≤ 4
- 4-entry split instruction window
- 64-entry reorder buffer
Benchmarks:
- BS: 50-element bubble sort
- MM: 10x10 integer matrix multiply
Slide 25: Two Pipeline Configurations

Operation              | DSEP Latencies               | Fixed Latencies
-----------------------|------------------------------|----------------
Fetch                  | 100                          | 120
Decode                 | 50/80/120                    | 120
Rename                 | 80/120/150                   | 120
Read                   | 120                          | 120
Execute                | 20/40/80/100/130/150/360/600 | 120/360/600
Retire                 | 5/100/150                    | 120
Caches                 | 100                          | 120
Main Memory            | 960                          | 960
Micropipeline Register | 5                            | 5

"Synchronous" clock period = 120 time units.
Slide 26: DSEP Performance
- Compared the Fixed and DSEP configurations
- DSEP increased performance by 28% for BS and 21% for MM
[Figure: execution-time comparison.]
Slide 27: Micropipeline Performance
Goals:
- Determine the need for local elasticity
- Determine appropriate lengths for the queues
Method:
- Evaluate DSEP configurations of the form AxBxC, where A is the micropipeline depth in Decode, Rename, and Retire; B the depth in Read; and C the depth in Execute
- All configurations include a fixed-length instruction window and reorder buffer
Slide 28: Micropipeline Performance
- Measured percent speedup over the 1x1x1 configuration
- 2x2x1 was best for both benchmarks: 2.4% improvement for BS, 1.7% for MM
- Stalls in Fetch were reduced by 60% with 2x2x1
[Figure: percent speedup for bubble sort and matrix multiply.]
Slide 29: OOO Engine Utilization
- Measured out-of-order engine utilization for the instruction window (IW) and reorder buffer (RB)
- Utilization = average number of instructions in the buffer
- IW utilization up 75%, RB utilization up 40%
[Figure: utilization of the instruction window and the reorder buffer.]
Slide 30: Total Performance
- Compared the Fixed and DSEP configurations
- DSEP 2x2x1 increased performance by 29% for BS and 22% for MM
[Figure: execution-time comparison.]
Slide 31: Conclusions
Decoupled self-timing:
- Average-case optimizations significantly increase performance
- Rarely taken critical paths no longer matter
Elasticity:
- Removes the pipeline jitter caused by decoupled operation
- Increases utilization of existing resources
- Not as important as the average-case optimizations (at least in our experiments)
Slide 32: Questions?