Alexander Smirnov Alexander Taubin.  Determine ◦ max throughput ◦ causes of throughput limit ◦ max achievable throughput ◦ cost of achieving a given.


1 Alexander Smirnov, Alexander Taubin

2
• Determine
  ◦ max throughput
  ◦ causes of throughput limit
  ◦ max achievable throughput
  ◦ cost of achieving a given throughput level
• Data independent token flow
  ◦ No early evaluation
  ◦ DEMUXes send data all ways
• Cells across library/design implement the same handshaking protocol

3
• Previous work
• Cell characterization
• Protocol characterization
• Throughput of asynchronous pipelines (reminder)
• Throughput analysis
• Throughput optimization

4
• Early works on the throughput of async. pipelines: M. Greenstreet, K. Steiglitz; T. Williams; A. Lines
• Time separation of events (TSE) based approaches to throughput analysis: T. Amon, H. Hulgaard, S. Burns, G. Borriello; S. Chakraborty, D. Dill; P. McGee, S. Nowick
• Simulation based approaches: C. Brej; K. Fazel
• Slack matching (throughput optimization) approaches: P. Prakash, A. Martin; P. Beerel, M. Davies, A. Lines, N. Kim

5 Cell characterization example (in Liberty)
• A cell (in ASIC) is a physical implementation of a gate
• Characterization abstracts away the details and specifies the parameters needed at the higher level of the hierarchy
• Cell characterization
  ◦ abstracts away cell implementation details
  ◦ specifies functionality, timing, area, power consumption, etc.
  ◦ is necessary and sufficient for efficient synthesis, optimization and simulation
• De facto standard: Synopsys "Liberty"

6
• Conventional gate:
  ◦ Implements function of input wires
  ◦ Special signals: clock, set, clear, etc.
• Asynchronous stage:
  ◦ Implements function of input channels
  ◦ Special signals: request, acknowledge, data0, data1, reset

7
• Reuse Synopsys Liberty whenever possible
• Use attributes to specify roles of pins in handshaking, channel, etc.
• Specify functionality in terms of channels (abstract out control functionality)
• Use Data → Data timing arcs to specify channel → channel attributes: slack, number of tokens at initialization
* PCHB stage example

8
• Abstract channel: forward/backward control and forward data propagation
• Assumption: handshake protocol is the same across the library/design
[Figure legend: L = Left/Right, F = Forward/Backward, C = Control/Data, E = Evaluation/Reset]

9
• Abstract channel: forward/backward control and forward data propagation
• Assumption: handshake protocol is the same across the library/design
• Use cell characterization to infer handshake protocol
• Abstraction and characterization allow identifying protocol loops in every stage for every pair of channels
[Figure legend: L = Left/Right, F = Forward/Backward, C = Control/Data, E = Evaluation/Reset]

10
• Goal: enumerate all handshake cycles
  ◦ handshake cycles are the same across the design (assumption)
  ◦ for practical protocols a handshake cycle covers ≤ 3 stages
  ◦ enumerate all possible cycles in a full timing graph of a 4-stage FIFO, normalize the cycles and remove duplicates
• Complexity is negligible
* PCHB stage example
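The enumerate, normalize and deduplicate step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' tool: the three-arc `req`/`ack` protocol, the node encoding `(stage, signal)`, and the helper names `build_fifo_graph`, `simple_cycles` and `normalize` are all hypothetical; a real PCHB timing graph has more events per stage.

```python
from collections import defaultdict

def build_fifo_graph(n_stages):
    """Toy timing graph of an n-stage FIFO; the protocol arcs are hypothetical."""
    g = defaultdict(list)
    for i in range(n_stages - 1):
        g[(i, "req")].append((i + 1, "req"))   # forward propagation
        g[(i + 1, "req")].append((i, "ack"))   # next stage acknowledges
        g[(i, "ack")].append((i, "req"))       # stage is re-enabled
    return g

def simple_cycles(graph):
    """Yield every simple cycle once, rooted at its smallest node (DFS)."""
    for start in sorted(graph):
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt == start:
                    yield path
                elif nxt > start and nxt not in path:
                    stack.append((nxt, path + [nxt]))

def normalize(cycle):
    """Shift stage indices so the cycle starts at stage 0, then rotate canonically."""
    base = min(stage for stage, _ in cycle)
    shifted = [(stage - base, sig) for stage, sig in cycle]
    k = shifted.index(min(shifted))
    return tuple(shifted[k:] + shifted[:k])

graph = build_fifo_graph(4)
raw_cycles = list(simple_cycles(graph))
unique_cycles = {normalize(c) for c in raw_cycles}
print(len(raw_cycles), "cycles found,", len(unique_cycles), "after normalization")
```

In this toy graph the same cycle shape appears once per adjacent stage pair, so normalization collapses the copies into a single protocol cycle.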

11
• Asynchronous pipeline throughput is determined by loops
  ◦ Handshaking
  ◦ Algorithmic (rings) and congestion
• Pipeline throughput is known for basic pipeline compositions
• Bottleneck-based approach: pipeline compositions are bottleneck candidates

12
• T. Williams (1990), A. Lines (1995): Throughput T
  ◦ x: token count
  ◦ s: slack
  ◦ d: dynamic slack
  ◦ c: cycle time
• x is invariant for a ring in a pipeline with deterministic (data independent) token flow
[Figure labels: lif, ci]
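The slide's formula did not survive in the transcript. As a hedged illustration, the sketch below uses one common textbook form of the throughput-versus-token-count model for a ring: a token-limited term, a bubble-limited term, and a cap of 1/c. The per-stage forward latency `l_f` and backward latency `l_r` are assumptions introduced here (the slide names only x, s, d and c), and the form may differ from the authors' exact equation.

```python
def ring_throughput(x, s, l_f, l_r, c):
    """Throughput of a ring of s stages holding x tokens (assumed model).

    l_f, l_r: assumed per-stage forward and backward latencies.
    c: local cycle time of a stage.
    """
    if not 0 <= x <= s:
        raise ValueError("token count must satisfy 0 <= x <= s")
    token_limited = x / (s * l_f)          # too few tokens to keep stages busy
    bubble_limited = (s - x) / (s * l_r)   # too few bubbles to let tokens move
    return min(token_limited, bubble_limited, 1.0 / c)

def dynamic_slack_range(s, l_f, l_r, c):
    """Token counts (d_min, d_max) between which throughput stays at 1/c.

    Meaningful only when c >= l_f + l_r, so that d_min <= d_max.
    """
    return s * l_f / c, s * (1.0 - l_r / c)

if __name__ == "__main__":
    for tokens in (1, 2, 4, 6, 8):
        print(tokens, ring_throughput(tokens, s=10, l_f=2, l_r=3, c=8))
    print(dynamic_slack_range(s=10, l_f=2, l_r=3, c=8))
```

Under these assumptions the token count x selects the operating point, and the dynamic slack range marks where the ring runs at its peak.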

13
• For a serial composition of pipelines with throughputs $T_1$, $T_2$, the resulting throughput is $T_{\text{resulting}} = \min\{T_1, T_2\}$
• $T_{\text{resulting}}$ is observed at $d_{\min} \le x \le d_{\max}$
[Figure: pipelines labelled $T_j$, $T_i$, $T_2$, $T_1$]
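A small worked instance of the serial-composition rule: the slowest pipeline sets the throughput of the chain. The function name `serial_throughput` and the numbers are illustrative assumptions.

```python
def serial_throughput(throughputs):
    """Return (name, throughput) of the limiting pipeline in a serial composition.

    throughputs: dict mapping a pipeline name to its standalone throughput.
    """
    limiting = min(throughputs, key=throughputs.get)
    return limiting, throughputs[limiting]

if __name__ == "__main__":
    # Hypothetical throughputs in tokens per time unit.
    print(serial_throughput({"P1": 0.125, "P2": 0.1}))
```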

14
• For a parallel composition of pipelines with throughputs $T_1$, $T_2$, the resulting throughput is [equation missing]
• $T_{\text{resulting}}$ is observed at [expression missing]
[Figure: pipelines labelled $T_2$, $T_1$]

15
• Peak throughput of a pipeline is limited by the slowest component
• To determine the throughput of a pipeline it is sufficient to discover the slowest combination of stages: the throughput bottleneck
• Bottleneck candidates (BCs):
  ◦ Handshake (h/s) cycle
  ◦ Re-converging paths
  ◦ Algorithmic cycle (ring)
• A BC is characterized by its cycle time range

16
• The length of each h/s cycle in the protocol is computed for each window of length 2 ≤ m ≤ 3 (HB stages)
• Handshake cycles are known from protocol analysis
• The minimum and maximum lengths of each cycle i are computed "in place" [expression missing]
• Heuristic: cycles involving multiple branches are not considered
• Complexity: [expression missing], where v_i are primary outputs of a stage's environment (reaction times)
* PCHB stages example

17
• Theorem: if a BC is a bottleneck, reaction times on its borders never exceed those used to compute its cycle time
• It follows from the theorem that a BC can be analyzed in isolation to determine its cycle time
• BCs are sorted with respect to cycle time
• The BC with the highest cycle time is a bottleneck: it defines the throughput of the design
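The sliding-window evaluation and the ranking of bottleneck candidates might look like the sketch below. Everything specific here is an assumption for illustration: the arc names (`fwd`, `ack`, `reset`), the two cycle templates, the delay values, and the choice to rank by worst-case cycle time. The real cycles come from protocol analysis.

```python
# Hypothetical cycle templates: each is a list of (stage offset, arc name).
TEMPLATES = {
    "2-stage": [(0, "fwd"), (1, "fwd"), (1, "ack"), (0, "reset")],
    "3-stage": [(0, "fwd"), (1, "fwd"), (2, "fwd"), (2, "ack"), (1, "reset")],
}

def candidate_ranges(stages, templates):
    """Evaluate each template in every window where it fits.

    stages: list of dicts, arc name -> (min_delay, max_delay).
    Returns tuples (template, first_stage, cycle_min, cycle_max).
    """
    out = []
    for name, arcs in templates.items():
        span = 1 + max(offset for offset, _ in arcs)
        for first in range(len(stages) - span + 1):
            lo = sum(stages[first + off][arc][0] for off, arc in arcs)
            hi = sum(stages[first + off][arc][1] for off, arc in arcs)
            out.append((name, first, lo, hi))
    return out

def bottleneck(candidates):
    """The candidate with the largest worst-case cycle time limits throughput."""
    return max(candidates, key=lambda bc: bc[3])

fast = {"fwd": (2, 3), "ack": (1, 2), "reset": (2, 2)}
slow = {"fwd": (4, 5), "ack": (1, 2), "reset": (3, 4)}
candidates = candidate_ranges([fast, fast, slow, fast], TEMPLATES)
worst = bottleneck(candidates)
print(worst, "throughput bound:", 1.0 / worst[3])
```

Because each candidate is evaluated in isolation, the ranking is a single pass over the list of windows.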

18
• Requires results of handshake cycle analysis
• Identify pairs of re-converging paths and compute their cycle times
• Reduce the number of pairs of re-converging paths:
  ◦ one pair of re-converging paths is identified per fork-join
  ◦ the pipeline is assumed to have deterministic (data independent) token flow
• The number of initial tokens in any two re-converging paths is the same
• The number of BCs can be reduced if optimization is not needed

19
• Heuristics for identifying rings and re-converging paths include:
  ◦ consider two of any set of rings with common arc(s) (longest and shortest)

20
• Throughput of rings and re-converging path pairs is computed using the equations from T. Williams and A. Lines, BUT
  ◦ if a handshake cycle covers re-converging paths (the length of the shorter branch is 0-2 half-buffer stages), the equations from T. Williams and A. Lines do not apply
• The throughput of such a bottleneck candidate is determined by the handshake cycles

21
• Identify handshake bottlenecks (slide window)
• Optimize handshake bottlenecks (if necessary)
• Identify BCs due to algorithmic loops and dynamic slack imbalance
  ◦ CPM, modified to handle loops
  ◦ Trade memory for time: store arrival times, significant predecessors
  ◦ Eliminate unnecessary graph exploration
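The "store arrival times, significant predecessors" idea can be illustrated with a plain critical path method (CPM) pass. This sketch covers only the acyclic case; the slide's modification for loops is not described in the transcript and is not reproduced here (Python's `graphlib` raises `CycleError` on a cyclic graph). The node names, delays and function names are hypothetical.

```python
from graphlib import TopologicalSorter

def critical_path(preds, delay):
    """Longest-path arrival times on a DAG.

    preds: node -> list of predecessor nodes.
    delay: (pred, node) -> arc delay.
    Returns (arrival, best_pred); best_pred holds the significant predecessor.
    """
    arrival, best_pred = {}, {}
    for node in TopologicalSorter(preds).static_order():
        incoming = [(arrival[p] + delay[(p, node)], p)
                    for p in preds.get(node, ())]
        # Nodes without predecessors start at time 0 with no predecessor.
        arrival[node], best_pred[node] = max(
            incoming, key=lambda tp: tp[0], default=(0.0, None))
    return arrival, best_pred

def trace_back(node, best_pred):
    """Recover the critical path to node from the stored predecessors."""
    path = [node]
    while best_pred[path[-1]] is not None:
        path.append(best_pred[path[-1]])
    return path[::-1]

preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
delay = {("a", "b"): 2, ("a", "c"): 5, ("b", "d"): 1, ("c", "d"): 1}
arrival, best_pred = critical_path(preds, delay)
print(arrival["d"], trace_back("d", best_pred))
```

Storing the significant predecessor per node costs memory but lets the critical path be read back without exploring the graph again.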

22
• Predicted throughput variation range (% of the actual simulated throughput)
• Predicted throughput variation depends on:
  ◦ asymmetry in library cells: throughput varies depending on the data (actual throughput variation)
  ◦ uncertainty introduced by heuristics (currently, incomplete synchronization trees introduce height uncertainty)

23
• Throughput estimation is heuristic based, i.e. error is possible
• Shown is the % difference between the actual throughput and the bound of the predicted variation range, weighted by the actual throughput
• In 92.5% of test cases the measured throughput is within the predicted variation range; the maximum error observed is 27%

24
• Alleviate bottlenecks with throughput less than the goal by
  ◦ Handshake pipelining
  ◦ Ring padding, slack matching
• Iteratively
  ◦ insert stages
  ◦ update all BCs
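An iterative insert-and-re-evaluate loop for ring padding could be sketched as below. It relies on an assumed canopy-style ring model (token-limited, bubble-limited and local cycle-time terms) with hypothetical per-stage latencies `l_f` and `l_r`; the function names and the stopping rule are illustrative, not the authors' algorithm.

```python
def ring_throughput(x, s, l_f, l_r, c):
    """Assumed model: min of token-limited, bubble-limited and 1/c terms."""
    return min(x / (s * l_f), (s - x) / (s * l_r), 1.0 / c)

def pad_ring(x, s, l_f, l_r, c, goal, max_extra=64):
    """Return how many stages to insert to reach goal, or None if unreachable.

    Inserting stages adds bubbles, which helps a bubble-limited ring but
    lowers the token-limited term, so padding cannot fix a ring that has
    too few tokens for the goal.
    """
    if 1.0 / c < goal:
        return None                      # limited by the library cells
    for extra in range(max_extra + 1):
        n = s + extra
        if ring_throughput(x, n, l_f, l_r, c) >= goal:
            return extra
        if x / (n * l_f) < goal:
            return None                  # token-limited: more stages only hurt
    return None

if __name__ == "__main__":
    print(pad_ring(x=6, s=8, l_f=2, l_r=3, c=8, goal=0.12))  # bubble-limited ring
    print(pad_ring(x=1, s=8, l_f=2, l_r=3, c=8, goal=0.12))  # too few tokens
```

In a full flow every bottleneck candidate would be re-evaluated after each insertion, since padding one ring can change the operating point of its neighbours.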


26
• The approach allows the throughput to be optimized automatically, up to the level limited by:
  ◦ library cells
  ◦ data deficient (long non-pipelined) rings
• Fully optimized throughput is higher (cycle time smaller) for
  ◦ FIFOs
  ◦ circuits without synchronization trees (fan-out 1)

27
• Developed an asynchronous cell/stage characterization, based on Synopsys Liberty, used for synthesis and throughput analysis/optimization
• Protocol characterization is inferred automatically from cell characterization
• Support for hierarchical designs (with possible loss of precision)
• All bottlenecks are identified
• All bottlenecks except for data deficient rings are automatically alleviated
• Optimization was tested with stage insertion, but other optimizations can be used
• Analysis results are easily adjusted to reflect non-structural changes


29
• Handshake cycles involving branches are currently not considered
• Unless merges/forks are properly characterized, analysis in hierarchical designs is imprecise
• Synchronization trees are currently assumed balanced; for incomplete trees, one sync cell delay is added to the variation range
