Unified Adaptivity Optimization of Clock and Logic Signals
Shiyan Hu and Jiang Hu
Dept. of Electrical and Computer Engineering, Texas A&M University
2 Outline of Post-Silicon Tuning
Introduction and Motivation
Problem Formulation
Algorithms
Experimental Results
Conclusion
3 Pre-Silicon Optimization
Pre-silicon (i.e., design-time) statistical optimization
–Determine the circuit parameters at design time
–Apply the resulting design to all dies
–Problems
  Hard to get an accurate statistical variation model
  Each die has its own specific parameter deviations, so the solution is not necessarily ideal for each die
  Large computation overhead
[Figure: deterministic design vs. statistical design; label: 50ps]
4 Post-Silicon Tuning
After fabrication, tune, e.g., the Vdd or body voltage of gates
–Post-silicon tuning handles each die separately, compensating for the specific parameter deviations of each die
At design time
–What are the tuning ranges of gates?
Tunability/overhead tradeoff
5 Previous Works
Logic signal tuning: body voltage tuning
–Good tunability
–Large overhead: D/A converter and many control signals, applied to a circuit block
Clock signal tuning: tunable clock buffer
–Small tunability
–Small overhead: padding different loads to buffers
6 Logic Tuning and Clock Tuning
[Figure: flip-flops (FF) with combinational logic and a clock tree. Logic tuning: tune the body voltage. Clock tuning: pad different loads to clock buffers.]
7 Example for Unified Adaptivity Optimization
Target clock period = 10, yield target: 99%
It is a zero-skew design with the nominal delays shown
Each combinational path has 10% variation
[Figure: flip-flops (FF) connected by combinational paths annotated with their nominal delays]
8 Worst Delay Due to Variations
10% variation on each combinational logic path
Target clock period: 10
[Figure: flip-flops (FF) with a worst-case path delay of 11]
9 Logic Tuning
10% variation on each combinational logic path
Target clock period: 10
Tune the body voltage of the combinational logic blocks
[Figure: flip-flops (FF) with a worst-case path delay of 11]
10 Clock Tuning
10% variation on each combinational logic path
Target clock period: 10
Skewing cannot make both paths satisfy the timing constraint simultaneously; their delays after skewing are
–11 - skew at right buffer + skew at left buffer
–11 + skew at right buffer - skew at left buffer
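The infeasibility on this slide can be checked by adding the two path constraints. In the sketch below, s_L and s_R are assumed names (not on the slide) for the skews at the left and right clock buffers.

```latex
\begin{align*}
11 - s_R + s_L &\le 10 \\
11 + s_R - s_L &\le 10 \\
\text{sum of both: } 22 &\le 20 \quad \text{(contradiction)}
\end{align*}
```

The skew terms cancel in the sum, so no choice of s_L and s_R satisfies both constraints. At least one of the two paths has to be sped up by logic tuning.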
11 Unified Optimization: Tuning Logic and Clock Signals Simultaneously
10% variation on each combinational logic path
Target clock period: 10
[Figure: flip-flops (FF) with skew = 1 applied at a clock buffer and logic tuning applied to a combinational block]
12 Observation
Logic tuning only
–Wastes area
Clock tuning only
–May not satisfy the yield target
A unified approach can satisfy the yield target with small overhead
13 Limitations of Previous Work
Mostly restricted to continuous adaptivity optimization, even when performing only logic or clock signal tuning
–In practice, options are often discrete
Assumption on the variation distribution
–Limited to Gaussian distributions, which do not always hold in reality
–Without such an assumption, they depend on computationally expensive Monte Carlo simulation
We seek to overcome the above limitations
14 Problem
Given a sequential circuit, perform optimizations such that
–the yield target can be achieved by post-silicon tuning on logic and clock signals
–the overhead is minimized
15 Continuous Problem
[Figure: flip-flops (FF) with continuous body voltage for logic blocks and continuous loads for clock buffers]
16 Continuous Problem Formulation
Minimize overhead
Subject to:
–Long path constraint
–Short path constraint
–Tuning bound at each tunable element
[Figure: two flip-flops (FF) labeled with T12, S1 and S2]
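One common way to write such constraints for a pair of flip-flops is sketched below. This is an assumed textbook form, not the slide's own equations. The assumptions are: S_1 and S_2 are the clock arrival times at the two flip-flops, T_12^max and T_12^min are the longest and shortest path delays between them, P is the clock period, H is a hold margin, and x_k is the tuning range of element k with unit cost c_k and upper bound u_k.

```latex
\begin{align*}
\min \quad & \sum_k c_k x_k \\
\text{s.t.} \quad & S_1 + T_{12}^{\max} - S_2 \le P && \text{(long path)} \\
& S_1 + T_{12}^{\min} - S_2 \ge H && \text{(short path)} \\
& 0 \le x_k \le u_k && \text{(tuning bound)}
\end{align*}
```

In the full problem the tuning ranges x_k would also bound how far each S and T can be adjusted after fabrication. That coupling is omitted from this sketch.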
17 Robust Linear Programming
Linear programming with random variables
Worst-case solution
–All S and T can simultaneously take their worst-case values
Robust solution
–Specify p ≤ total number of random variables
–In the solution, at most p random variables can simultaneously take their worst-case values
–Variations of the other random variables rely on p
–Degree of conservatism is controlled by a single parameter
Constraint violation probability (related to yield) decreases exponentially as p increases
18 Linear Programming With Uncertainty
Some coefficients are random variables
Assume that we have j random variables
19 Soyster’s Worst-Case Solution (I)
a11 is a random variable
Deterministic constraint
Guarantees the worst-case values
20 Soyster’s Worst-Case Solution
21 Robust Solution (I)
22 Robust Solution (II)
Additional variables
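The equations on the robust-solution slides did not survive extraction. As a sketch, assuming the slides follow the standard Bertsimas-Sim robust counterpart, one uncertain constraint row i can be written with the additional variables Z_i, q_ik and y_k as below. The assumptions are: each uncertain coefficient a_ik lies in the interval [ā_ik - â_ik, ā_ik + â_ik], J_i is the set of uncertain coefficients in row i, and the index k stands in for the slide's second subscript.

```latex
\begin{align*}
\sum_{k} \bar{a}_{ik} x_k + p\, Z_i + \sum_{k \in J_i} q_{ik} &\le b_i \\
Z_i + q_{ik} &\ge \hat{a}_{ik}\, y_k, \quad k \in J_i \\
-y_k \le x_k &\le y_k \\
Z_i,\ q_{ik},\ y_k &\ge 0
\end{align*}
```

With p = 0 the term p Z_i vanishes, so Z_i can be made large enough to allow every q_ik = 0 and the row reduces to the nominal constraint. With p equal to the number of uncertain coefficients, the row is as conservative as Soyster's worst-case constraint.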
23 Nominal-Case Design (p = 0)
q_ij = 0
Free to set Z_i
24 Worst-Case Design (p = j)
25 Worst-Case Design (p = j)
26 Worst-Case Design (p = j)
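The effect of p between the nominal and worst-case extremes can be illustrated numerically. The following is a minimal sketch for integer p, assuming interval uncertainty on each coefficient; the function names are hypothetical, and the violation bound is the one commonly quoted for the Bertsimas-Sim model under independent, symmetric, bounded deviations, which may differ from the bound used in this work.

```python
import math


def protection(deviations, x, p):
    """Extra left-hand side when at most p coefficients take worst-case values.

    deviations[k] is the maximum deviation of coefficient k (assumed >= 0).
    The p largest products deviation * |x| are the ones an adversary would pick.
    """
    terms = sorted((d * abs(xk) for d, xk in zip(deviations, x)), reverse=True)
    return sum(terms[:p])


def robust_lhs(nominal, deviations, x, p):
    """Left-hand side of one constraint row protected against p deviations."""
    base = sum(a * xk for a, xk in zip(nominal, x))
    return base + protection(deviations, x, p)


def violation_bound(p, n):
    """Assumed bound exp(-p^2 / (2n)) on the violation probability of a row
    with n uncertain coefficients; it shrinks exponentially as p grows."""
    return math.exp(-p * p / (2.0 * n))


# Two coefficients with nominal values 10 and 9 and 10% deviation each.
nominal = [10.0, 9.0]
deviations = [1.0, 0.9]
x = [1.0, 1.0]
for p in range(3):
    print(p, robust_lhs(nominal, deviations, x, p), violation_bound(p, 2))
```

Here p = 0 reproduces the nominal row and p = 2 reproduces the case where every coefficient is at its worst value.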
27 Discretization
In reality, tuning is allowed only in discrete steps
Rounding from the continuous solution
–Rounding up the continuous solution
  Increases the tuning range, hence more overhead
–Rounding down the continuous solution
  Reduces the tuning range, so the yield target may not be satisfied
–Nearest rounding
  May not satisfy the yield target
  May waste area
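The three rounding options can be sketched as follows. The function name and the step values are hypothetical; values outside the step range are clamped to the nearest end, which is an assumption of this sketch.

```python
import bisect


def round_to_steps(value, steps, mode):
    """Round a continuous tuning range to one of the allowed discrete steps.

    steps must be sorted in ascending order.
    mode is "up", "down" or "nearest".
    """
    i = bisect.bisect_left(steps, value)
    if i < len(steps) and steps[i] == value:
        return value  # already on a step
    lo = steps[max(i - 1, 0)]            # largest step below value (clamped)
    hi = steps[min(i, len(steps) - 1)]   # smallest step above value (clamped)
    if mode == "up":
        return hi
    if mode == "down":
        return lo
    if mode == "nearest":
        return lo if value - lo <= hi - value else hi
    raise ValueError("mode must be 'up', 'down' or 'nearest'")


steps = [0.0, 0.5, 1.0]
for mode in ("up", "down", "nearest"):
    print(mode, round_to_steps(0.3, steps, mode))
```

Rounding 0.3 up gives a larger tuning range (more overhead), while rounding it down gives a smaller one (possible yield loss), which is the tradeoff listed on the slide.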
28 Our Approach
Continuous solution
Clock rounding: rounding by dynamic programming with fast pruning
–Produces a set of solutions with discrete clock buffers
Logic rounding: for each solution, discretize the body voltage for logic gates
29 Clock Rounding
[Figure: each tunable clock buffer is rounded to either a larger or a smaller tuning range]
30 Solution Characterization and Solution Update
Each candidate solution is associated with
–C: cumulative area overhead
–Y: yield estimation
When tunable clock buffer b is being processed,
–C is updated by the overhead of b
–Y is computed by fast yield estimation
31 Fast Pruning
For rounding up, there is no need to estimate the yield
For rounding down, sort solutions by C and perform yield estimation in a binary search fashion
When the solution set size reaches a threshold, keep the few solutions with the smallest C
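The clock rounding steps above can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions, not the authors' implementation: `cost` and `yield_ok` are hypothetical callbacks, buffers not yet processed keep their continuous values during yield checks, and every round-down candidate is checked directly instead of by the binary search over sorted solutions mentioned on the slide.

```python
import bisect


def neighbors(value, steps):
    """Return the (lower, upper) allowed steps around value; steps is sorted."""
    i = bisect.bisect_left(steps, value)
    hi = steps[min(i, len(steps) - 1)]
    lo = hi if hi == value else steps[max(i - 1, 0)]
    return lo, hi


def round_clock_buffers(continuous, steps, cost, yield_ok, max_solutions=4):
    """Round buffer tuning ranges one buffer at a time.

    continuous:  continuous tuning range of each buffer (assumed to meet yield)
    cost(r):     hypothetical area overhead of a buffer with tuning range r
    yield_ok(a): hypothetical fast yield check on a full list of ranges
    Returns a list of (C, assignment) pairs sorted by cumulative overhead C.
    """
    solutions = [(0.0, [])]
    for k, value in enumerate(continuous):
        lo, hi = neighbors(value, steps)
        candidates = []
        for c, chosen in solutions:
            # Rounding up only widens the range, so no yield estimation is run.
            candidates.append((c + cost(hi), chosen + [hi]))
            if lo != hi:
                # Rounding down may break the yield target, so it is checked.
                trial = chosen + [lo] + list(continuous[k + 1:])
                if yield_ok(trial):
                    candidates.append((c + cost(lo), chosen + [lo]))
        # Pruning: keep only the few solutions with the smallest C.
        candidates.sort(key=lambda s: s[0])
        solutions = candidates[:max_solutions]
    return solutions


# Toy usage: overhead equals the range, and yield holds if total range >= 1.0.
result = round_clock_buffers([0.3, 0.7], [0.0, 0.5, 1.0],
                             cost=lambda r: r,
                             yield_ok=lambda a: sum(a) >= 1.0)
print(result[0])
```

Keeping several solutions rather than one matters because logic rounding is later run on each of them, and the cheapest clock solution is not necessarily the cheapest overall.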
32 Logic Rounding
Reducibility-based discretization
–Body voltage tuning range of a block is rounded up if the block
  Is timing critical
  Has few tunable gates
Reducibility cost: total slack × number of gates
33 Batch Optimization
Yield estimation is expensive, so blocks are rounded in batches
Start from a small reducibility threshold
Round up blocks with reducibility cost < threshold and round down the others
If the yield is not satisfied, increase the threshold
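The batch loop can be sketched as below. The `Block` fields, the threshold schedule and the `yield_ok` callback are all hypothetical; the point of the sketch is that the expensive yield estimation runs once per threshold rather than once per block.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence, Tuple


class Block(NamedTuple):
    slack: float   # total timing slack of the block
    gates: int     # number of tunable gates in the block
    up: float      # tuning range if rounded up
    down: float    # tuning range if rounded down


def batch_round_logic(
    blocks: Sequence[Block],
    yield_ok: Callable[[List[float]], bool],
    thresholds: Sequence[float],
) -> Optional[Tuple[float, List[float]]]:
    """Round all blocks at once for each threshold, smallest threshold first.

    Reducibility cost = total slack x number of gates. Blocks with a cost
    below the threshold are rounded up, the others are rounded down.
    Returns (threshold, assignment) for the first threshold meeting yield.
    """
    costs = [b.slack * b.gates for b in blocks]
    for t in thresholds:  # assumed to be in increasing order
        assignment = [b.up if c < t else b.down for b, c in zip(blocks, costs)]
        if yield_ok(assignment):  # one yield estimation per batch
            return t, assignment
    return None


blocks = [Block(slack=1.0, gates=2, up=1.0, down=0.5),
          Block(slack=5.0, gates=10, up=1.0, down=0.5)]
# Toy yield check: the timing-critical first block needs the larger range.
print(batch_round_logic(blocks, lambda a: a[0] >= 1.0, [1.0, 10.0, 100.0]))
```

In the toy run the first threshold rounds every block down and fails, and the second rounds up only the low-slack, low-gate-count block.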
34 Monte Carlo Simulation (Yield Estimation)
35 Latin Hypercube Sampling Based Monte Carlo Simulation
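As an illustration of the idea, the sketch below draws Latin hypercube samples and uses them for a toy timing-yield estimate. The delay model (each path delay varies uniformly within ±variation of its nominal value) and all function names are assumptions made for this example; they are not taken from the slides.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in the unit hypercube.

    Each dimension is split into n_samples equal strata and receives exactly
    one point per stratum; the strata are paired randomly across dimensions.
    """
    columns = []
    for _ in range(n_dims):
        column = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


def estimate_yield(nominal_delays, variation, period, n_samples=1000, seed=0):
    """Fraction of sampled dies whose slowest path meets the clock period.

    Assumed toy model: delay = nominal * (1 + variation * (2u - 1)), u in [0, 1).
    """
    rng = random.Random(seed)
    samples = latin_hypercube(n_samples, len(nominal_delays), rng)
    passed = 0
    for u in samples:
        delays = [d * (1.0 + variation * (2.0 * uk - 1.0))
                  for d, uk in zip(nominal_delays, u)]
        if max(delays) <= period:
            passed += 1
    return passed / n_samples


print(estimate_yield([10.0], 0.1, 10.5))
```

Because every stratum of each variable is sampled exactly once, the estimate typically settles with fewer samples than plain random sampling, which is why it suits a yield check that is called many times.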
36 Experimental Setup
ISCAS’89 benchmark circuits
Pentium IV machine with a 3.0 GHz CPU and 2 GB memory
130nm technology
Timing yield target: 99%
For the continuous solution, compare to logic optimization only and clock optimization only
For discretization, compare to the simple batch and nearest rounding approaches
37 Continuous Solution (Area)
In many cases, optimizing the clock signal alone cannot find feasible solutions that satisfy the yield constraint
38 Continuous Solution (Yield)
39 Continuous Solution (CPU in seconds)
40 Observations in Continuous Solution
Unified optimization often saves >20% area over logic optimization while achieving a higher yield
Clock optimization alone cannot satisfy the yield target for many circuits
The algorithms run fast
41 Discretization (Area)
42 Discretization (Yield)
43 Discretization (CPU in seconds)
44 Observations in Discrete Solutions
Nearest rounding cannot satisfy the yield target (yield could be <90%)
Simple batch is slow, and its solution quality is poor because it is not guided by the continuous solution
Our algorithm runs faster than simple batch and saves >30% area
45 Conclusion
Unified adaptivity optimization on logic and clock signals is advantageous in cost-effectiveness
Provides both continuous and discrete solutions
Uses robust linear programming, which does not depend on the variation distribution
Uses computation acceleration techniques, e.g., accelerated dynamic programming, batch-based optimization, and Latin hypercube sampling based fast simulation
Our algorithm can be used for optimizing logic or clock signals separately while retaining the above advantages