Published by Rafe Marsh. Modified over 9 years ago.
1
Bottleneck Analysis and Alleviation in Pipelined Systems: A Fast Hierarchical Approach
Gennette Gill, Montek Singh
Univ. of North Carolina, Chapel Hill, NC, USA
2
Part of a Larger Design Flow
Big picture: high-level specifications → asynchronous implementations
  Design space exploration (this work) is part of the overall flow
This work:
  Uses various optimizations together in one tool
  Exploits circuit hierarchy to accelerate analysis and optimization
[Figure: design flow from high-level specification to implementation]
3
Our Contribution
Identify bottlenecks in a pipelined system
  Recognize multiple components that limit throughput
  Bottlenecks are represented as a Boolean expression
Classify bottlenecks
  Latency-dependent, cycle-time-dependent, and occupancy-dependent
Choose which transformation(s) apply
  Given a list of possible transforms
  List is open-ended; allows for additions
4
Background: Pipelines and Canopy Graphs
5
Background: Asynchronous Pipelines
Each stage is characterized by three delays:
  Forward latency, L_f: time for data to propagate forward
  Reverse latency, L_r: time for a stage to receive and process an ack (the time for a "hole" to travel backward)
  Cycle time, T = L_f + L_r (typically)
Throughput, tpt = 1 / cycle time
[Figure: abstracted view of the pipeline; each stage is labeled L_f/L_r and consists of a controller and logic, with req and ack signals between stages; annotation: cycle time in an asynchronous pipeline]
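The two formulas on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the 5/1 stage is one of the L_f/L_r labels that appears in the later CRC example.

```python
def cycle_time(forward_latency, reverse_latency):
    """Cycle time of one stage, T = L_f + L_r (the 'typical' case on the slide)."""
    return forward_latency + reverse_latency


def stage_throughput(forward_latency, reverse_latency):
    """Throughput of one stage in isolation, tpt = 1 / T."""
    return 1.0 / cycle_time(forward_latency, reverse_latency)


# A stage labeled 5/1 (L_f = 5, L_r = 1) has T = 6 time units.
slow_stage_cycle = cycle_time(5, 1)
slow_stage_tpt = stage_throughput(5, 1)
```

The time unit is whatever unit the latencies are given in; the slides do not state one.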
6
Background: Pipeline Rings
Throughput of a ring depends on occupancy (number of items)
  For a small number of items: underutilization limits throughput
  For a small number of holes: congestion limits throughput
Throughput is also limited by the slowest stage
Graph is a convex shape: "Canopy Graph"
[Figure: canopy graph of ring throughput vs. ring occupancy (0 to N), with three regions: data-limited, limited by slowest stage, hole-limited]
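The three regions of the canopy graph can be sketched as the minimum of three limits. The formulas below are an assumption, not taken from the slide: they use the common simplified model of a ring of N identical stages, where k items each need N·L_f per lap and N − k holes each need N·L_r per lap. The function name is hypothetical.

```python
def ring_throughput(items, n_stages, lf, lr):
    """Throughput of a homogeneous ring holding `items` data items.

    Assumed model (not stated on the slide):
      data-limited:  k / (N * L_f)        too few items in the ring
      hole-limited:  (N - k) / (N * L_r)  too few holes in the ring
      stage-limited: 1 / (L_f + L_r)      cycle time of a single stage
    """
    data_limited = items / (n_stages * lf)
    hole_limited = (n_stages - items) / (n_stages * lr)
    stage_limited = 1.0 / (lf + lr)
    return max(0.0, min(data_limited, hole_limited, stage_limited))


# Sample the canopy graph of a 10-stage ring whose stages are all 3/1.
canopy = [ring_throughput(k, 10, 3, 1) for k in range(11)]
```

Throughput is zero at both ends (an empty ring and a completely full ring) and rises toward the single-stage limit in between, which gives the canopy shape.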
7
Background: Composition
[Figure: canopy graphs (throughput vs. pipeline occupancy) of pipelines A and B and of the combined pipeline, shown for sequential composition [Lines98] and for parallel composition [Lines98]]
8
Conditionals
Conditional branches: implement if-then-else non-speculatively
  Split sends data along only one path
  A Boolean decision determines the path
  Merge also uses the Boolean; maintains the order of data
Performance depends on:
  Canopy graphs of the then and else branches
  Probability of the Boolean choosing each branch
[Figure: conditional with fork, split, then and else branches, and merge; signals: data in, Boolean, data out]
9
Conditionals
Simplifying assumption (relaxed later)
  Boolean choices are evenly distributed given a probability
  e.g., p0 = 2/3 → 001001001…
Constraints on joint operation:
  Each branch's occupancy (k) ∝ its probability
    Why? Because items must exit in order
  Each branch's throughput ∝ its probability
Throughput of composition:
  Scale each branch's canopy graph by 1/p_i
  Intersect the scaled canopy graphs
[Figure: conditional with fork, split, then and else branches, and merge; signals: data in, Boolean, data out]
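The scale-and-intersect rule can be sketched as follows. The code assumes that "proportional" means branch i holds p_i·K of the K items in the conditional and carries p_i of its throughput, so scaling by 1/p_i applies to both axes and intersection is a pointwise minimum. The function name and the two toy canopy functions are hypothetical, not taken from the slides.

```python
def conditional_throughput(total_occupancy, branches):
    """Throughput of a conditional holding `total_occupancy` items.

    `branches` is a list of (probability, canopy) pairs, where `canopy`
    maps a branch occupancy to the throughput that branch can sustain.
    Branch i holds p_i * K items and sees p_i of the overall throughput,
    so its limit on the whole conditional is canopy(p_i * K) / p_i.
    The overall throughput is the minimum (intersection) over branches.
    """
    return min(
        canopy(p * total_occupancy) / p
        for p, canopy in branches
        if p > 0  # a branch that is never taken imposes no limit
    )


# Toy canopy graphs (rising part only), chosen purely for illustration.
def branch0(k):
    return min(k / 10.0, 0.5)


def branch1(k):
    return min(k / 9.0, 0.25)


# p0 = 2/3 as on the slide, so p1 = 1/3.
tpt = conditional_throughput(6, [(2 / 3, branch0), (1 / 3, branch1)])
```

With these toy numbers, branch 0 is the limiting branch at an occupancy of 6 even though it is the more probable one.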
10
Conditionals: A Simple Example
Example: pipelined implementation of CRC algorithm
[Figure: conditional pipeline with split and merge, branch 0 and branch 1; stage labels 1/1, 5/1, 1/1, 3/1, 2/1, 1/1; branch labels 1/1, 10 stages and 3/1, 9 stages; plots of throughput vs. occupancy and throughput vs. probability for branch 0 and branch 1]
21
Conditionals: Example with Slack Mismatch
Slack mismatch is implicitly handled by the analysis method
[Figure: conditional pipeline with split and merge, branch 0 and branch 1; stage labels 1/1, 5/1, 1/1, 3/1, 2/1; branch labels 10 stages and 3/1, 9 stages; plots of throughput vs. occupancy for branch 0 and branch 1, slack matched vs. slack mismatch]
22
Conditionals: Generalized Choice Model
Extend to a more general choice model:
  Until now: assumed non-clustered decisions
  Now: consider clustering
    Allow arbitrary runs of 0's and 1's for decisions
    Long runs reduce throughput: the other branch is underutilized
Our analysis approach:
  Introduces a "clustering factor" to quantify decision run lengths
    e.g., for random uncorrelated data: average run length of 0's is 1/p1
  Can handle arbitrary amounts of clustering
[Figure: throughput vs. probability of choosing branch 1 and vs. occupancy; curves: non-clustered, random; annotation: acts as bottleneck]
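The 1/p1 figure for random uncorrelated decisions follows from the geometric distribution: a run of 0's has length n with probability p0^(n-1)·p1. The sketch below compares the closed form with a truncated sum of that series; the function names are made up for illustration, and the slides do not say how the clustering factor itself is computed.

```python
def mean_zero_run_length(p1):
    """Closed form: average run length of 0's when each decision is
    independently 1 with probability p1."""
    return 1.0 / p1


def mean_zero_run_length_series(p1, terms=10_000):
    """Same quantity from the definition of the mean:
    sum over n >= 1 of n * P(run length = n), with
    P(run length = n) = p0**(n - 1) * p1 and p0 = 1 - p1."""
    p0 = 1.0 - p1
    return sum(n * p0 ** (n - 1) * p1 for n in range(1, terms + 1))


# With p0 = 2/3 (so p1 = 1/3), random decisions give runs of 0's that
# average 3 long, versus exactly 2 in the evenly distributed 001001... pattern.
closed_form = mean_zero_run_length(1 / 3)
from_series = mean_zero_run_length_series(1 / 3)
```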
23
Pipelined Loops
Analysis approach can handle single-token and multi-token loops
Loop's throughput depends on the number of iterations per item
  Assume given: number of iterations per item, or probability of exiting the ring
Note: previous analysis looked at a different throughput
  #iterations/second, not #completions/second
[Figure: pipelined loop with stages s1 to s8 and an interface stage; loop condition i < N, update i++]
24
Pipelined Loops
Analysis approach for loops:
  Construct canopy graph for the loop body
  Scale down based on the expected number of iterations
[Figure: loop interface and loop body with fork, join, Boolean fork, branch 0, branch 1, and a 5/1 stage; plot of throughput vs. occupancy for the loop body and the overall loop]
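A minimal sketch of the scaling step, under two assumptions that the slides do not spell out: scaling means dividing the loop body's throughput by the iterations per item, and when only an exit probability is given, exit decisions are independent so the expected number of iterations is 1 / (exit probability). Function names are hypothetical.

```python
def expected_iterations(exit_probability):
    """Expected iterations per item, assuming each pass through the loop
    exits independently with the given probability (geometric model)."""
    return 1.0 / exit_probability


def loop_completion_throughput(body_throughput, iterations_per_item):
    """Completions per unit time: every item occupies the loop body for
    `iterations_per_item` passes, so the body's iteration throughput is
    divided by that factor."""
    return body_throughput / iterations_per_item


# A loop body that sustains 0.2 iterations per time unit, with a 25% chance
# of exiting on each pass, completes 0.05 items per time unit.
iters = expected_iterations(0.25)
completions = loop_completion_throughput(0.2, iters)
```

This is the distinction drawn on the previous slide: the loop body's own canopy graph measures iterations per second, while the scaled graph measures completions per second.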
25
Bottleneck Identification
26
System Representations
[Figure: block-level system representation with nodes n0 to n6; tree representing the circuit hierarchy, with par, seq, and leaf nodes; canopy graph (throughput vs. occupancy) with segments labeled top_C, forward_C, reverse_1,C, reverse_0,C]
27
Our Definition of Bottleneck
Set of hierarchical nodes that limit the canopy graph
Expressed as a Boolean combination
  e.g., n0 OR n2 OR n3 OR n5 AND n6
What caused this segment?
  Usually more than one node
  Often several conspire together
[Figure: block-level representation and hierarchy tree (par, seq, leaf) for nodes n0 to n6; canopy graph of throughput vs. occupancy]
28
Find Limiting Segments
Begin with the top segment of the root node
Find which child or children contribute to the segment
If more than one, is it AND or OR blame?
Continue to lower levels of the hierarchy
[Figure: par node n0 with children n1 and n2; canopy graph of throughput vs. occupancy, annotated "What sets this limit?"]
29
Find Limiting Segments: example 1
Scenario: parallel operator limited by slow child
  n1 and n2 both contribute to the bottleneck
  bottleneck(top_n0) = top_n0 OR (bottleneck(top_n1) AND bottleneck(top_n2))
  Next, find which children of n1 and n2 limit throughput
[Figure: par node n0 with children n1 and n2; canopy graphs of n1 and n2, throughput vs. occupancy, annotated "What sets this limit?"]
30
Find Limiting Segments: example 1
Scenario: parallel operator limited by slow child
  n2 causes the bottleneck
  bottleneck(top_n0) = top_n0 OR bottleneck(top_n2)
  Next, find which children of n1 and n2 limit throughput
[Figure: par node n0 with children n1 and n2; canopy graphs of n1 and n2, throughput vs. occupancy]
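The two variants of example 1 can be reproduced with a small recursive sketch. It covers only the top segment of parallel nodes and leaves, and it assumes each node's canopy graph is summarized by its peak throughput; the `Node` class, the `peak` field, and the tie tolerance are hypothetical, not the authors' data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    peak: float = 0.0  # peak throughput of this node's canopy graph (assumed given)
    children: list = field(default_factory=list)  # empty for a leaf


def bottleneck_top(node, tol=1e-9):
    """Boolean blame expression for the top segment of `node`.

    For a parallel node, the children with the lowest peak set the limit.
    If several tie, all of them must improve, so they are AND-ed together.
    The node's own top segment is always OR-ed in, as on the slides.
    """
    own = f"top_{node.name}"
    if not node.children:
        return own
    slowest = min(child.peak for child in node.children)
    limiting = [c for c in node.children if c.peak - slowest <= tol]
    parts = [bottleneck_top(c, tol) for c in limiting]
    if len(parts) == 1:
        return f"{own} OR {parts[0]}"
    # Parenthesize nested OR expressions so the AND stays unambiguous.
    joined = " AND ".join(f"({p})" if " OR " in p else p for p in parts)
    return f"{own} OR ({joined})"


# First variant: n1 and n2 are equally slow, so both are blamed.
both_slow = Node("n0", children=[Node("n1", 0.2), Node("n2", 0.2)])
# Second variant: only n2 is slow.
one_slow = Node("n0", children=[Node("n1", 0.25), Node("n2", 0.2)])

expr_both = bottleneck_top(both_slow)
expr_one = bottleneck_top(one_slow)
```

Sequential nodes and the forward and reverse segments would need their own blame rules, which this sketch does not attempt.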
31
Find Limiting Segments: example 2
Scenario: parallel operator limited by slack mismatch
  Changing n3 or n4 could fix the bottleneck
  bottleneck(top_n2) = top_n2 OR bottleneck(reverse_n3) OR bottleneck(forward_n4)
  To fix, change n2, n3, or n4
[Figure: par node n2 with children n3 and n4; canopy graphs of n3 and n4, throughput vs. occupancy, annotated "What sets this limit?"]
32
Find Limiting Segments: example 3
Scenario: sequential operator limited by forward latency
  Changing n5 or n6 could fix the bottleneck
  bottleneck(forward_n4) = forward_n4 OR bottleneck(forward_n5) OR bottleneck(forward_n6)
  If just one child is slow, only one contributes
[Figure: seq node n4 with children n5 and n6; canopy graphs of n4, n5, and n6, throughput vs. occupancy, annotated "What sets this limit?"]
33
Bottleneck Alleviation
34
Bottleneck Categorization
Categories are based on which canopy graph segment limits throughput
  Type I: Latency Dependent
  Type II: Cycle Time Dependent
  Type III: Occupancy Dependent
[Figure: canopy graph (throughput vs. occupancy) with segments labeled top_C, forward_C, reverse_1,C, reverse_0,C]
35
TRansformations for Increasing the Canopy Graph
A TRIC increases throughput for some occupancies
Idea: collect a bag of TRICs
  Categorize circuit optimizations by bottleneck type
  Use different optimizations in one framework
Effects of a few example TRICs:
  Parallelization. Fixes: Type I
  Buffer Insertion. Fixes: Type III. Causes: Type I
  Stage Splitting. Fixes: Type II. Causes: Types I, III
Suggestions for additional TRICs are needed.
[Figure: three canopy graphs (throughput vs. occupancy), one per example TRIC]
36
Applying TRICs
Tool lists TRICs that alleviate the current bottleneck
Designer chooses one option
Check for next bottleneck as needed

TRIC               Type I   Type II   Type III
Coalescing         ✔        X         X
Parallelization    ✔        -         X
Stage Splitting    X        ✔         ✔
Loop Pipelining    X        ✔         ✔
Duplication        -        ✔         ✔
Loop Unrolling     -        ✔         ✔
Buffer Insertion   X        -         ✔
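The "tool lists TRICs" step amounts to a table lookup. The sketch below transcribes the table and returns the TRICs marked ✔ for a given bottleneck type. Reading ✔ as "alleviates this type" is an assumption, since the slide gives no legend for ✔, X, and -; the names `TRIC_TABLE` and `trics_for` are made up for illustration.

```python
BOTTLENECK_TYPES = ("I", "II", "III")

# One row per TRIC, columns in the order Type I, Type II, Type III.
# "Y" stands for the check mark on the slide; "X" and "-" are copied as-is.
TRIC_TABLE = {
    "Coalescing":       ("Y", "X", "X"),
    "Parallelization":  ("Y", "-", "X"),
    "Stage Splitting":  ("X", "Y", "Y"),
    "Loop Pipelining":  ("X", "Y", "Y"),
    "Duplication":      ("-", "Y", "Y"),
    "Loop Unrolling":   ("-", "Y", "Y"),
    "Buffer Insertion": ("X", "-", "Y"),
}


def trics_for(bottleneck_type):
    """List the TRICs whose column for `bottleneck_type` holds a check mark
    (assumed here to mean the TRIC alleviates that bottleneck type)."""
    column = BOTTLENECK_TYPES.index(bottleneck_type)
    return [name for name, marks in TRIC_TABLE.items() if marks[column] == "Y"]


# Candidate TRICs a designer would be offered for a Type I bottleneck.
type_one_options = trics_for("I")
```

After the designer applies one of the listed options, the bottleneck analysis would be rerun to find the next limiting segment.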
37
Results
Successful with 20% throughput goals on examples
Suggest examples, please.

              Throughput                      Type
Example       orig    goal    final   # iter  I    II   III   TRICs
CRC           286     342     345     4       1    0    3     coalesce; add buffers
Cordic cond   90.9    109     111     2       0    0    2     add buffers
Cordic        83.3    100     101     2       0    1    2     split stages
Diffeq        182     218     267     1       3    0    0     split stages; duplicate
Mult          38.4    46.2    62.5    6       5    0    1     coalesce; add buffers
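The throughput columns can be checked with a short calculation: each final value should reach its goal, and the gain over the original should be at least roughly 20%. The numbers are copied from the table; the variable and function names are made up for illustration, and no throughput unit is assumed because the slide gives none.

```python
# (orig, goal, final) throughput for each example, as listed in the table.
RESULTS = {
    "CRC":         (286.0, 342.0, 345.0),
    "Cordic cond": (90.9, 109.0, 111.0),
    "Cordic":      (83.3, 100.0, 101.0),
    "Diffeq":      (182.0, 218.0, 267.0),
    "Mult":        (38.4, 46.2, 62.5),
}


def gain_percent(orig, final):
    """Throughput improvement of `final` over `orig`, in percent."""
    return 100.0 * (final - orig) / orig


def met_goal(goal, final):
    """True when the final throughput reaches the stated goal."""
    return final >= goal


summary = {
    name: (round(gain_percent(orig, final), 1), met_goal(goal, final))
    for name, (orig, goal, final) in RESULTS.items()
}
```

Every example meets its goal, and Diffeq and Mult end up well beyond the 20% target.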
38
Conclusion & Future Work
This work:
  Employed multiple microarchitectural optimizations in one tool
  User-guided application to a few examples
More is needed:
  Clever ways to automate
  Additions to the bag of TRICs
  More examples and applications