Toward Holistic Modeling, Margining and Tolerance of IC Variability

Andrew B. Kahng UCSD CSE and ECE Departments

IC Variability
- In manufacturing process: FEOL, BEOL
- During operation: voltage, temperature
- Across lifetime: aging, breakdown

Here I use fixed voltage scaling for area, power, and drive current, and show full scaling in the Vdd plot.

Challenge: Value of Technology
- Margin → lost benefits of technology
[Figure: design quality (e.g., frequency) vs. technology generation, with one curve for nominal scaling and one for design with margins; the gap between them is the margin, i.e., the lost benefits]

One of the main challenges in VLSI is preserving the value of technology scaling. As shown in the figure, design quality improves as technology scales. However, the magnitude of manufacturing process variation is also increasing. To cover this variation, we must insert margins into the design to ensure that it works correctly. As a result, we lose part of the benefit of technology scaling.

Solutions: Modeling, Margining, Tolerance
- Holistic mitigation of variability spans models, margins, and tolerance mechanisms
- Signoff criteria, monitors, adaptivity/resilience, approximate computing, …
[Table: solutions (BEOL Corner Optimization, Process-Aware Vdd Scaling, {BTI, EM}-AVS Interactions, Overdrive Signoff, Min Cost of Resilience) marked against the categories Modeling, Margining, and Tolerance; placement of the check marks is not recoverable]

Outline
- Introduction
- Modeling of IC Variability
- Tolerance of IC Variability
- Margining of IC Variability
- Conclusions

BEOL Corner Optimization
- 20nm and below: increased timing variation due to interconnect R, C
- Design closure becomes much more difficult
- Costs of BEOL variations
  - More design effort (e.g., “last month” of manual ECO iteration)
  - Compromised circuit performance at high Vdd
- Recent work: reduce signoff margin by using tightened BEOL corners without sacrificing parametric yield
  - Signoff at conventional BEOL corners is pessimistic for most timing-critical paths
  - We identify paths which can be safely signed off using tightened BEOL corners (TBC)
- Joint work with Sorin Dobre (Qualcomm) and Tuck-Boon Chan

At the 20nm node and below, timing variation due to interconnect R and C is increasing. Timing violations caused by BEOL variation are difficult to fix because the path delay is dominated by wire delay, so increasing cell size does not help. The costs of BEOL variations are significant. Fixing the timing violations takes more design effort, which may delay the product schedule, and when the violations cannot be fixed we have to settle for lower circuit performance. To address this problem, we propose to reduce signoff margin by using tightened BEOL corners without sacrificing parametric yield. We observe that signoff at the conventional BEOL corners is pessimistic for most timing-critical paths. We therefore propose a method to identify paths that can be safely signed off using tightened BEOL corners.

Proposed Timing Signoff Flow
[Figure: two flowcharts.
Conventional signoff: routed design → timing analysis using conventional BEOL corners (CBC) → violation = 0? → if no, ECO using CBC and repeat; if yes, done.
This work: routed design → classify timing-critical paths into GTBC and GCBC → timing analysis using TBC for GTBC and using CBC for GCBC → violation = 0? → if no, ECO using TBC or CBC, respectively, and repeat; if yes, done.]

The figure shows a conventional signoff flow. Given a routed design, we run timing analysis. If there are timing violations, we fix them in the ECO step. In this work, we propose a different approach. Given the routed design, we analyze the timing-critical paths and classify them into two groups, GTBC and GCBC. Paths in GTBC are signed off using the tightened BEOL corners. The remaining paths use the conventional BEOL corners. Since many paths can use TBC, timing signoff becomes much easier.

Conventional BEOL Corners
- Three major variation sources per layer: {ΔW, ΔT, ΔH}
- Conventional BEOL corners (CBC)
  - Homogeneous corners: all variation sources are skewed in the same direction
- BEOL RC variations are modeled in the interconnect technology file (.itf)
[Figure: cross-section of a metal stack (M1 to M3) showing width W, spacing S, thickness T, and dielectric height H per layer, with inter-metal and inter-layer dielectric]
[Table: corners Ytyp, Ycb, Ycw, Yrcb, Yrcw with each of ΔW, ΔT, ΔH set to typical, min, or max; cell values are not recoverable]

There are three major variation sources for each layer: the metal width variation (ΔW), the metal thickness variation (ΔT), and the inter-layer dielectric thickness variation (ΔH). In the conventional BEOL corners (CBC), all variation sources are skewed in the same direction. The table shows examples of conventional corners and the corresponding skewed parameters.

Statistical RC Model
- 3 variation sources in each layer, {ΔW, ΔT, ΔH}
- A 9-layer metal stack has 27 variation sources z1, z2, …, z27
- BEOL layers in the same process module use the same manufacturing equipment and process steps
- zu and zv are correlated if and only if
  - zu and zv are the same type (ΔW, ΔT or ΔH), and
  - zu and zv are in the same process module
- Examples:
  - ΔW in layer M4 has a positive correlation with ΔW in layers M5, M6, and M7
  - But ΔW in layer M4 is not correlated with ΔT in M4

Layer   ΔW    ΔT    ΔH    Process module
M9      z25   z26   z27   #3
M8      z22   z23   z24   #3
M7      z19   z20   z21   #2
M6      z16   z17   z18   #2
M5      z13   z14   z15   #2
M4      z10   z11   z12   #2
M3      z7    z8    z9    #1
M2      z4    z5    z6    #1
M1      z1    z2    z3    #1

Another approach to modeling RC variation is statistical RC extraction. Our model considers three variation sources in each layer, for a total of 27 variation sources in a 9-layer metal stack. We assume that the BEOL layers in the same process module use the same manufacturing equipment and process steps. Two variation sources are correlated if and only if they are of the same type and are in the same process module. For example, ΔW in layer M4 has a positive correlation with ΔW in layers M5, M6, and M7 because they are the same type and in the same process module. But ΔW in layer M4 is not correlated with ΔH in M4.
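The correlation rule above can be sketched as a small routine that builds the 27 x 27 correlation matrix. This is only an illustration: the layer-to-module map (M1 to M3, M4 to M7, M8 to M9) is inferred from the M4 to M7 example, the value γ = 0.5 is the correlation factor used later in the experiments, and the names `LAYER_MODULE` and `correlation_matrix` are hypothetical.

```python
# Variation types per layer, in the order z_{3k-2}, z_{3k-1}, z_{3k} for layer Mk.
VARIATION_TYPES = ("dW", "dT", "dH")

# Assumed layer -> process module map (inferred from the M4..M7 example).
LAYER_MODULE = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2, 8: 3, 9: 3}


def correlation_matrix(gamma=0.5):
    """Return (sources, matrix) for the 27 variation sources.

    Two distinct sources are correlated (value gamma) if and only if they
    have the same type and their layers are in the same process module.
    """
    sources = [(layer, vtype)
               for layer in sorted(LAYER_MODULE)
               for vtype in VARIATION_TYPES]
    n = len(sources)
    matrix = [[0.0] * n for _ in range(n)]
    for u, (layer_u, type_u) in enumerate(sources):
        for v, (layer_v, type_v) in enumerate(sources):
            if u == v:
                matrix[u][v] = 1.0
            elif (type_u == type_v
                  and LAYER_MODULE[layer_u] == LAYER_MODULE[layer_v]):
                matrix[u][v] = gamma
    return sources, matrix


if __name__ == "__main__":
    sources, matrix = correlation_matrix()
    m4_dw = sources.index((4, "dW"))
    print(matrix[m4_dw][sources.index((5, "dW"))])  # same type, same module
    print(matrix[m4_dw][sources.index((4, "dT"))])  # different type
```

With this indexing, source (layer k, type t) corresponds to z at position 3(k-1) + t, matching the z1 to z27 numbering on the slide.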

Pessimism of Conventional BEOL Corners (CBC)
- Assumption: a max (setup) path pj is “safe” when the delay evaluated at a given CBC is at least the nominal delay + 3σj:
  dj(YCBC) ≥ dj(Ytyp) + 3σj
- For a given path, we can compare the statistical delay variation and the delay obtained from a given CBC:
  αj = 3σj / Δdj(YCBC)
  Δdj(YCBC) = dj(YCBC) - dj(Ytyp)
  YCBC ∈ {Ycw, Ycb, Yrcw, Yrcb}
- Small αj → large pessimism of CBC
[Figure: path delay distribution comparing 3σj with dj(YCBC) - dj(Ytyp); a large gap between the two indicates large pessimism]

Based on the extracted delay variation, we assume that a setup path is safe when the delay evaluated at a conventional corner is at least the nominal delay plus the 3-sigma delay obtained from the statistical analysis. For a given path, we define an alpha factor, which is the ratio of the 3-sigma delay of the path to the delta delay at a corner. Alpha is an indicator of pessimism. A small alpha implies that the delta delay at a BEOL corner is much larger than the 3-sigma delay.
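As a worked illustration of the two definitions above, the following sketch computes αj and the safety check for one path. The function names and the numeric delays are hypothetical; only the formulas come from the slide.

```python
def alpha(d_cbc, d_typ, sigma):
    """alpha_j = 3*sigma_j / (d_j(Y_CBC) - d_j(Y_typ))."""
    delta = d_cbc - d_typ
    if delta <= 0:
        raise ValueError("corner delay must exceed typical delay")
    return 3.0 * sigma / delta


def is_safe(d_cbc, d_typ, sigma):
    """A setup path is safe if d_j(Y_CBC) >= d_j(Y_typ) + 3*sigma_j."""
    return d_cbc >= d_typ + 3.0 * sigma


if __name__ == "__main__":
    # Hypothetical path: 1000 ps typical, 1080 ps at the corner, sigma = 10 ps.
    print(alpha(1080.0, 1000.0, 10.0))    # 30 / 80 = 0.375
    print(is_safe(1080.0, 1000.0, 10.0))  # True
```

Note that a path is safe exactly when α ≤ 1, so α = 0.375 here means the corner adds well over twice the delay that the 3σ statistical bound requires.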

Intuition on Delay Variability Across Cw, RCw
- Some paths have α > 1.0 → a CBC can underestimate delay variations
- But these paths often have smaller α values at the other corner (!)
- Paths are more sensitive to R or to C
  - Using RC-worst or C-worst only will underestimate delay variations
  - Need both RC- and C-worst corners to cover process variations
- In the following, α is defined at the dominant corner
[Figure: α plotted against Δdelay at C-worst, [d(Ycw) - d(Ytyp)] / d(Ytyp), and against Δdelay at RC-worst, [d(Yrcw) - d(Ytyp)] / d(Ytyp). Annotations:
  - Dominated by RC-worst: Δdelay at RC-worst > Δdelay at C-worst. The C-worst corner underestimates delay variations, but these paths are dominated by the RC-worst corner.
  - Dominated by C-worst: Δdelay at C-worst > Δdelay at RC-worst.
  - α < 1.0 → delay variations are covered by the RC-worst corner.]

Scaling Factor α and Delay Variation
- Paths with small Δdrcw and Δdcw have large α
- E.g., here we see αj > 0.6 when ((Δdrcw < 3%) AND (Δdcw < 3%))
- Identify paths for tightened BEOL corners based on Δdrcw and Δdcw
[Figure: scatter plot of critical paths; x-axis Δd(Ycw)/d(Ytyp), y-axis Δd(Yrcw)/d(Ytyp), color = α]

In this figure, we plot the alpha values for a set of critical paths. The x-axis is the delta delay of the paths at the C-worst corner, and the y-axis is the delta delay at the RC-worst corner. Each circle in the plot is a timing-critical path, and its color represents the value of alpha. The figure shows that paths with small delta delay at the C-worst and RC-worst corners have a large alpha. For example, alpha is larger than 0.6 when the delta delays at the C-worst and RC-worst corners are both smaller than 3% of the nominal delay. This means that we can identify the paths for tightened corners by checking the delta delay.

Find Paths for Which TBCs Can Be Used
- Paths with small Δdrcw and Δdcw have large α
- E.g., αj > 0.6 when ((Δdrcw < 3%) AND (Δdcw < 3%))
- Identify paths for tightened BEOL corners based on Δdrcw and Δdcw
- Gtbc = set of paths that can be safely signed off using TBC:
  ((path with Δdcw larger than Acw) OR (path with Δdrcw larger than Arcw))
[Figure: scatter plot of critical paths; x-axis Δd(Ycw)/d(Ytyp), y-axis Δd(Yrcw)/d(Ytyp), color = α, with the thresholds Acw and Arcw marked]

Therefore, we propose to select the paths that can use tightened corners by using two threshold values, Acw and Arcw. A path whose delta delay is larger than either threshold can be signed off using the TBC.
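The filtering rule can be sketched as follows. The path tuples are hypothetical; the thresholds Acw = 3.3% and Arcw = 5.0% are the TBC-0.6 values listed in the experiment setup in the backup slides.

```python
def classify_paths(paths, a_cw, a_rcw):
    """Split paths into (G_TBC, G_CBC).

    Each path is (name, dd_cw, dd_rcw), where dd_cw and dd_rcw are the
    delta delays at the C-worst and RC-worst corners in % of typical delay.
    A path goes to G_TBC if either delta delay exceeds its threshold.
    """
    g_tbc, g_cbc = [], []
    for name, dd_cw, dd_rcw in paths:
        if dd_cw > a_cw or dd_rcw > a_rcw:
            g_tbc.append(name)
        else:
            g_cbc.append(name)
    return g_tbc, g_cbc


if __name__ == "__main__":
    # Hypothetical critical paths: (name, delta d at Cw %, delta d at RCw %).
    paths = [("p1", 1.2, 2.0), ("p2", 4.0, 3.1), ("p3", 2.5, 6.4)]
    g_tbc, g_cbc = classify_paths(paths, a_cw=3.3, a_rcw=5.0)
    print(g_tbc)  # ['p2', 'p3']
    print(g_cbc)  # ['p1']
```

Paths left in G_CBC have small delta delays and therefore potentially large α, so they keep the conventional corners.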

Determining α, Arcw and Acw
- Assumption: critical paths in different designs have similar trends
- Extract Arcw and Acw from a set of representative paths
- Plot α vs. Δdelay, find Arcw and Acw for a given α
- Add +1% margin on Arcw and Acw to account for sampling error
- Smaller α → larger thresholds (Arcw and Acw) → fewer paths in GTBC
[Figure: plots against Δd at the RC-worst corner (%) and Δd at the C-worst corner (%), showing the maximum α of the paths and the number of paths in GTBC]

To determine the threshold values Arcw and Acw, we assume that the critical paths in different designs have similar trends. Therefore, we can extract alpha, Arcw, and Acw from a set of representative paths. For example, the figure shows the maximum alpha of the paths in a design and the number of paths in GTBC. Based on the figure, we can select the value of alpha and the corresponding Arcw and Acw. Note that a smaller alpha value leads to larger threshold values. This means that when we use a tighter corner, fewer paths can use the tightened corner. To account for sampling error, we add a 1% margin to Arcw and Acw.
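One conservative way to read the threshold extraction is sketched below. It is an assumption, not the procedure from the talk: each threshold is set to the largest delta delay seen among representative paths whose α exceeds the target, so that every sampled path above a threshold has α no larger than the target, and the "+1% margin" is interpreted as one percentage point added to each threshold. The function name and sample data are hypothetical.

```python
def extract_thresholds(paths, alpha_target, margin=1.0):
    """Return (A_cw, A_rcw) in % of typical delay.

    Each path is (dd_cw, dd_rcw, alpha). Paths with alpha > alpha_target
    must stay below both thresholds, so each threshold is the maximum
    delta delay among those paths, plus a margin for sampling error.
    """
    too_large = [(cw, rcw) for cw, rcw, a in paths if a > alpha_target]
    a_cw = max((cw for cw, _ in too_large), default=0.0) + margin
    a_rcw = max((rcw for _, rcw in too_large), default=0.0) + margin
    return a_cw, a_rcw


if __name__ == "__main__":
    # Hypothetical representative paths: (delta d at Cw %, at RCw %, alpha).
    sample = [(1.0, 2.0, 0.90), (2.5, 2.8, 0.65),
              (4.0, 6.0, 0.55), (6.0, 9.0, 0.40)]
    print(extract_thresholds(sample, alpha_target=0.6))
    print(extract_thresholds(sample, alpha_target=0.5))
```

Under this reading, lowering the target α can only raise the thresholds, which matches the trend stated on the slide.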

Benefits of Tightened BEOL Corners
- Correlation factor, γ = 0.5
- WNS and TNS are reduced by up to 100ps and 53ns
- #Timing violations reduced by 24% to 100%
- TBC-0.6: more benefits
- Tradeoff between reduced margin vs. #paths which use TBC

In this experiment, we apply our filtering method to three testcases. We use a correlation factor of 0.5 for the statistical analysis. The results show that, by using the tightened corners, we can reduce WNS and TNS by up to 100ps and 53ns, respectively. The number of timing violations is also reduced. From the figure, we can see that the tightened corner with an alpha value of 0.6 gives better results than alpha values of 0.5 and 0.7. When the alpha value is too small, not many paths can use the tightened corners. But when the alpha value is too large, the difference between the conventional corner and the tightened corner becomes smaller, so there is little benefit from using the tightened corner.

How to Minimize Cost of Resilience ?
- Additional circuits → area and power penalties
- Recovery from errors → throughput degradation
- Large hold margin → short-path padding cost
- Want benefits (e.g., energy) to maximally outweigh costs

                   Razor          Razor-Lite    TIMBER
Power penalty      30% [Das08]    ~0% [Kim13]   100% [Choudhury09]
Area penalty       182% [Kim13]   33% [Kim13]   255% [Chen13]
#recovery cycles   5 [Wan09]      11 [Kim13]    0 [Choudhury09]

Tradeoff: Resilience Cost vs. Datapath Cost
[Figure: tradeoff between #Razor FFs (resilience cost) and power/area of fanin circuits]
- We seek to minimize total energy via this tradeoff (joint work with Seokhyeong Kang and Jiajia Li; extensions ongoing in collaboration with NXP)

Selective-Endpoint Optimization (SEOpt)
- Optimize fanin cone of an endpoint w/ tighter constraints → allows replacement of Razor FF w/ normal FF
- Pick endpoints based on heuristic sensitivity functions
- Vary #endpoints → compare area/power penalty

Candidate sensitivity functions:
\begin{align*}
SF_1 &= |slack(p)| \\
SF_2 &= |slack(p)| \times num_{cri}(p) \\
SF_3 &= |slack(p)| \times \frac{num_{cri}(p)}{num_{total}(p)} \\
SF_4 &= |slack(p)| \times \sum_{c \in fanin(p)} Pwr(c) \\
SF_5 &= \sum_{c \in fanin(p)} |slack(c)| \times Pwr(c)
\end{align*}
where p is a negative-slack endpoint, c ranges over the cells within the fanin cone of p, and num_cri is the number of negative-slack cells.
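A minimal sketch of how the five candidate sensitivity functions could be evaluated for one endpoint is given below. The `Cell` type, the units, and the example values are hypothetical, and numtotal(p) is assumed to be the total number of cells in the fanin cone, which the slide does not define.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    slack: float  # timing slack of the cell (ns); negative = violating
    power: float  # power of the cell (mW)


def sensitivity_functions(endpoint_slack, fanin):
    """Evaluate SF1..SF5 for a negative-slack endpoint p.

    fanin is the list of cells in the fanin cone of p. num_total is
    assumed to be len(fanin); num_cri counts the negative-slack cells.
    """
    abs_slack = abs(endpoint_slack)
    num_total = len(fanin)
    num_cri = sum(1 for c in fanin if c.slack < 0)
    total_power = sum(c.power for c in fanin)
    return {
        "SF1": abs_slack,
        "SF2": abs_slack * num_cri,
        "SF3": abs_slack * num_cri / num_total if num_total else 0.0,
        "SF4": abs_slack * total_power,
        "SF5": sum(abs(c.slack) * c.power for c in fanin),
    }


if __name__ == "__main__":
    cone = [Cell(-0.10, 2.0), Cell(0.20, 1.0), Cell(-0.05, 4.0)]
    for name, value in sensitivity_functions(-0.10, cone).items():
        print(name, round(value, 4))
```

Endpoints can then be ranked by any one of these values to decide which fanin cones to optimize first.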

Clock Skew Optimization (SkewOpt)
- Increase slacks on timing-critical and/or frequently-exercised paths
  1. Generate sequential graph
  2. Find cycle of paths with minimum total weight → adjust clock latencies → contract the cycle into one vertex
  3. Iterate Step 2 until all endpoints are optimized

Path weight:
\[ W_{pq} = \frac{Slack_{p,q}}{1 + \beta \times TG(p,q)} \]
where Slack_{p,q} is the setup slack of path p-q, β is a weighting factor, and TG(p,q) is the toggle rate of path p-q.
[Figure: clock tree driving FF1, FF2, and FF3, with data paths of weights W12, W23, and W31 forming a cycle; after contraction, W’ = average weight on the cycle]
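The weighting step can be illustrated with a short sketch. It assumes the path weight reads as the setup slack divided by (1 + β × toggle rate), which is how the flattened formula on the slide is most plausibly laid out, and it computes the average weight W’ of a given cycle. The function names and the three-flop example values are hypothetical, and finding the minimum-weight cycle itself is not shown.

```python
def path_weight(slack, toggle_rate, beta):
    """W_pq = Slack_pq / (1 + beta * TG(p, q)) (assumed reading)."""
    return slack / (1.0 + beta * toggle_rate)


def average_cycle_weight(cycle_paths, beta):
    """Average weight W' over the paths of one cycle.

    cycle_paths is a list of (setup_slack, toggle_rate) tuples, one per
    path on the cycle, e.g. FF1->FF2, FF2->FF3, FF3->FF1.
    """
    weights = [path_weight(s, tg, beta) for s, tg in cycle_paths]
    return sum(weights) / len(weights)


if __name__ == "__main__":
    # Hypothetical cycle FF1 -> FF2 -> FF3 -> FF1: (slack in ns, toggle rate).
    cycle = [(0.3, 0.5), (0.1, 0.0), (-0.2, 1.5)]
    print(average_cycle_weight(cycle, beta=2.0))
```

The average is the natural target because clock latency changes cancel around a cycle: slack moved onto one path of the cycle is taken from another, so the total is fixed and the best balanced outcome is the cycle average.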

Overall Optimization Flow
- Iteratively optimize with SEOpt and SkewOpt
[Flowchart: initial placement (all FFs = error-tolerant FFs) → OR-tree insertion → SEOpt (margin insertion on K paths based on sensitivity function; replace error-tolerant FFs w/ normal FFs) → energy < min energy? → save current solution → SkewOpt (activity-aware clock skew optimization) → iterate]

Benefit of Low-Cost Resilience
- Reference flows
  - Pure-margin (PM): conventional method w/ only margin insertion
  - Brute-force (BF): use error-tolerant FFs for timing-critical endpoints
- Proposed method (CO) achieves up to 21% energy reduction compared to reference methods
- Resilience benefits increase with larger process variation
- Small/medium/large margin → 1σ/2σ/3σ for SS corner
- Technology: foundry 28nm
[Figure: results for testcases MUL and EXU under small, medium, and large margin]

Increased Benefit of Resilience with AVS
- Adaptive voltage scaling allows a lower supply voltage for resilient designs, and thus reduced power
- Proposed method trades off timing-error penalty vs. reduced power at a lower supply voltage
- Proposed method achieves an average of 17% energy reduction compared to pure-margin designs → resilience benefits increase in the context of an AVS strategy
- Technology: foundry 28nm
[Figure: minimum achievable energy for testcases MUL and EXU]

Breaking Chicken-Egg Loops → Less Margin
- Example: interaction between reliability margin and AVS designs
- Bias temperature instability (BTI) aging → higher |ΔVth| → lower fmax
- AVS can be used to compensate for performance degradation
[Figure: closed-loop AVS with circuit, on-chip aging monitor, and voltage regulator; plots of circuit performance vs. time with and without AVS relative to the target, and of Vdd vs. time]

Because of BTI aging, the threshold voltage magnitudes of the devices increase, which slows the circuit. To compensate for this performance degradation, we can use adaptive voltage scaling to speed up the circuit. Whenever the circuit performance drops below the target, we increase the supply voltage.

Derated Library Characterization and AVS
- VBTI = voltage for BTI aging estimation
- Vlib = voltage for circuit performance estimation (library characterization)
- VBTI and Vlib are required in signoff
- VBTI and Vlib selection should consider BTI + AVS interaction
- Aging and Vfinal are unknowns before circuit implementation
[Figure: Step 1: VBTI → |ΔVt|; Step 2: |ΔVt| and Vlib → derated library; Step 3: circuit implementation and signoff → circuit → BTI degradation and AVS → Vfinal (?)]

To better understand the effect of VBTI and Vlib, let us look at the derated library characterization flow and adaptive voltage scaling. In the first step of library characterization, we use VBTI to estimate the delta Vth due to aging. Then, we add the delta threshold voltages to the standard cells to characterize a derated library. In this step, Vlib is used as the supply voltage of the standard cells. With the derated library, we can implement and sign off the circuit. During operation, the circuit experiences BTI degradation, and AVS is used to compensate for the aging effect. Therefore, at the end of its lifetime, the circuit operates at a higher voltage level, which we call Vfinal. The figure shows that VBTI and Vlib are required in signoff and should be defined to represent the effect of aging and AVS. However, the circuit aging and the Vfinal that results from AVS are unknown before circuit implementation.

Library Characterization for AVS
- VBTI = voltage for BTI aging estimation
- Vlib = voltage for circuit performance estimation (library characterization)
- VBTI and Vlib are required in signoff
- VBTI and Vlib depend on aging during AVS
- Aging and Vfinal are unknowns before circuit implementation
- No obvious guideline to define VBTI and Vlib → inconsistency among Vfinal, Vlib, VBTI
- What is the design overhead when timing libraries are not properly characterized?
- Can we define BTI- and AVS-aware signoff corners that ensure product goals with small design and lifetime energy overheads?
- Joint work with Wei-Ting Jonas Chan, Tuck-Boon Chan, Siddhartha Nath
[Figure: Step 1: VBTI → |ΔVt|; Step 2: |ΔVt| and Vlib → derated library; Step 3: circuit implementation and signoff → circuit → BTI degradation and AVS → Vfinal (?)]

The figure shows that there is no obvious guideline for defining VBTI and Vlib while Vfinal remains unknown. This can lead to inconsistency among Vfinal, Vlib, and VBTI. In this work, we address two questions. First, what is the design overhead when timing libraries are not properly characterized? Second, what are the guidelines for defining BTI- and AVS-aware signoff corners that guarantee timing correctness with little design overhead?

Power vs. Area Across Different Signoffs
- Pessimistic signoff corner
  - Overestimates aging and/or underestimates circuit performance
  - Large area overhead
- Optimistic signoff corner
  - AVS increases supply voltage aggressively to compensate for aging
  - Large lifetime energy overhead
  - May fail to meet timing if desired supply voltage > Vmax
- “Knee” point for balanced area and power tradeoff

The figure overlays the data points for four designs on the same plot. The circuits signed off using our derated libraries are located at the knee point of the power vs. area tradeoff curve. This shows that these circuits have a good area and power tradeoff, and that a large area or power penalty can be avoided by using our derated libraries. Some corners are pessimistic: they overestimate the aging effect or underestimate circuit performance, so the implemented circuit has a large area overhead. On the other hand, there are optimistic corners that underestimate the aging effect. As a result, AVS needs to increase the supply voltage rapidly to compensate for aging, which causes a large power overhead. In certain cases, a circuit signed off using these libraries may fail to meet timing because the desired supply voltage is larger than the maximum allowed voltage.

Heuristics #1 VBTI = Vlib ≈ Vfinal
- Model BTI degradation with Vfinal throughout lifetime
- Aging under a flat Vfinal ≈ aging under an adaptive Vdd, but slightly pessimistic
- VBTI = Vlib ≈ Vfinal
[Figure: Vdd vs. time, and NBTI and PBTI degradation vs. time, for a flat Vfinal and for an adaptive Vdd]

Since the Vdd of a circuit quickly converges to Vfinal, we propose to model BTI degradation with Vfinal throughout the circuit lifetime. The figure shows that BTI degradation under a flat Vfinal is similar to that under the time-varying Vdd of AVS. Using a flat Vfinal slightly overestimates the aging effect. This small overestimation is acceptable because we want to be conservative during signoff.

Vfinal Estimation
- Problem: Vfinal is not available at early design stage (design has not been implemented)
- Vfinal = supply voltage at end of life (to compensate BTI aging); it depends on:
  - Gates along critical path (?)
  - Timing slack at t = 0 (?)
  - Circuit activity (BTI aging) (✔)
- BTI aging depends on circuit activity
  - Assume DC or AC stress in derated library characterization

However, making the Vfinal approximation is not sufficient, because Vfinal is not available at the early design stage. To estimate Vfinal, we need to understand which factors affect it. Vfinal is the circuit voltage at the end of lifetime. This value is determined by the gates on the critical paths, because each gate responds differently to aging and voltage scaling. Another factor that affects Vfinal is the timing slack of the circuit at the beginning of its lifetime. For example, if the circuit has very large timing slack, AVS may never be triggered during the circuit lifetime. Vfinal is also affected by circuit activity, but this is not a critical issue because we can assume DC or AC stress as the worst-case BTI aging. In summary, we can assume a worst-case circuit activity for Vfinal estimation, but we still need to estimate the impact of gate type and timing slack.

Observation and Heuristic #2
- Observation #2: Vfinal is not sensitive to gate types
- Heuristic #2: use average Vfinal of different gate types
- Vfinal is a function of timing slack
  - Assume timing slack = 0
[Figure: Vfinal vs. timing slack, one line per critical path; the spread between lines is about 10mV]

To analyze the impact of gate type and timing slack, we construct artificial critical paths with different gate types and extract the value of Vfinal for different timing slacks. In this figure, the x-axis is the timing slack of the critical paths, and each line represents a critical path. The figure shows that the Vfinal values of the different paths are very similar; the difference between them is typically less than 10mV. Therefore, we propose our second heuristic: use the average Vfinal of different cells to estimate the Vfinal for library characterization. The figure also shows that as timing slack increases, Vfinal decreases and eventually converges to the supply voltage at the beginning of the circuit lifetime.

Proposed Library Characterization Flow
- Heuristic: obtain Vheur by averaging Vfinal of different cells
- Heuristic: use a “flat” Vheur to estimate BTI degradation
[Flow: obtain Vheur (average of standard cells) → obtain derated library with VBTI = Vlib = Vheur → sign off circuit with derated library]

To solve the problem, we propose two heuristics. Using the first heuristic, we obtain the average Vfinal of critical paths with different cell types, which we call Vheur. Then, we use this Vheur to estimate BTI degradation. In other words, we can now characterize a derated library with both VBTI and Vlib equal to Vheur.

Power vs. Area for All Designs
- 4 designs x {DC, AC} x {derating methods}
- Optimistic signoff corner
  - AVS increases supply voltage aggressively to compensate for aging
  - Consumes more power
  - May fail to meet timing if desired supply voltage > Vmax
- Pessimistic signoff corner
  - Overestimates aging and/or underestimates circuit performance
  - Large area overhead
- “Knee” point for balanced area and power tradeoff
[Figure: power vs. area; legend: circuits signed off using other derated libraries, proposed method]

The figure overlays the data points for four designs on the same plot. The circuits signed off using our derated libraries are located at the knee point of the power vs. area tradeoff curve. This shows that these circuits have a good area and power tradeoff, and that a large area or power penalty can be avoided by using our derated libraries. Some corners are pessimistic: they overestimate the aging effect or underestimate circuit performance, so the implemented circuit has a large area overhead. On the other hand, there are optimistic corners that underestimate the aging effect. As a result, AVS needs to increase the supply voltage rapidly to compensate for aging, which causes a large power overhead. In certain cases, a circuit signed off using these libraries may fail to meet timing because the desired supply voltage is larger than the maximum allowed voltage.

Also: Multi-Mode Signoff Choices Matter !
- Signoff mode = (voltage, frequency) pair
- Multi-mode operation requires multi-mode signoff
  - Example: nominal mode and overdrive mode
- Selection of signoff modes affects area, power
- ASP-DAC 2013: optimization of signoff modes → improve performance, power, or area → reduce overdesign
[Figure: Vdd vs. time, alternating between nominal (NOM, duration tnom) and overdrive (OD, duration tOD) modes]
[Figure: power of circuits w/ different overdrive modes, with fnom = 800MHz and Vnom = 0.8V. Annotations: different overdrive modes → 26% power range; 12%; with fOD fixed, still 14% power range]

Also: Tunable Monitors → Less Margin
- Default config.: low-resistance passgates
  - Guardband for worst case
  - Vmin_est > Vmin_chip (13mV margin)
- Optimized config.: increase % of high-resistance passgates
  - Vmin_est ≈ Vmin_chip
- Aggressive config. → Vmin_est < Vmin_chip → some chips will fail
- Benefits of tunability
  - Compensate for difference between model vs. silicon
  - Recover margin when variation is reduced due to improved process

Our results show that when we use the default configuration to guardband for the worst-case scenario, there is a 13mV Vmin margin. By tuning the passgates to high resistance, we can reduce the margin to 0. When we tune the ROs to allow more aggressive voltage scaling, the estimated Vmin falls below the actual Vmin of some chips, and those chips fail to meet circuit timing.

Conclusions
- Variability severely challenges IC value
  - In manufacturing process, during operation, across lifetime
- Benefit of “next node” is increasingly hard to find
  - Entire node is a “20/20/20” value proposition
  - 5-10% in P/P/A metrics is now substantial at leading edge
- Variability is connected to tapeout, IC properties by models, margins, tolerances used in signoff
- Some takeaways from this talk
  - Substantial benefit from tightening BEOL corners (= signoff)
  - “Minimum cost of resilience” is a rich optimization challenge
  - Chicken-egg loops in signoff definition can be broken
- Holistic approaches will provide “equivalent scaling” that extends the value trajectory of Moore’s Law

Thank You !

Backup

Power Penalty to Fix EM with AVS
- Core power increases due to elevated voltage
- P/G power increases due to both elevated voltage and mesh degradation
- A tradeoff between guardband invested in signoff and power penalty
[Figure: power penalty across implementations, from least invested guardband to highest invested guardband; 14% power penalty]

We also study the power impact of AVS in a similar way. The core powers of the implementations differ because of their different Vfinal values, and the difference can be as large as 14%. We apply Mishra’s model to study EM degradation on the P/G mesh. The power penalty on the P/G mesh is more significant than on the core, possibly because the increased resistance of the mesh adds to the power penalty.

Homogeneous Corners
- (1) Define RC corners of each layer separately
- (2) Use corners from each layer to construct a homogeneous corner for an interconnect stack
- Example: worst-case capacitance corner
[Figure: capacitance distributions (-3σ to 3σ) for layers M1 and M2 of an interconnect stack; the homogeneous Cw corner combines the 3σ points of both layers and lies beyond the 3σ point of the stack distribution; the gap is the pessimism]

To construct a homogeneous corner, we first define the corners for each layer. For example, here we define the worst-case capacitance corners for layers M1 and M2. Then, these corners are used to define the C-worst corner for an interconnect stack with M1 and M2. The figure clearly shows that this approach is pessimistic, because the C-worst corner is far beyond the 3-sigma point of the capacitance distribution.

Homogeneous Corners
- (1) Define RC corners of each layer separately
- (2) Use corners from each layer to construct a homogeneous corner for an interconnect stack
- Example: worst-case capacitance corner
- When variations in different layers are not fully correlated, pessimism of homogeneous corners increases with #layers
[Figure: capacitance distributions (-3σ to 3σ) for layers M1 and M2 of an interconnect stack; the homogeneous Cw corner lies beyond the 3σ point of the stack distribution; the gap is the pessimism]

Moreover, when the variations in different layers are not fully correlated, the pessimism of homogeneous corners increases with the number of layers.
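The growth of pessimism with the number of layers can be checked with a small calculation. The sketch assumes a simplified model that is not stated on the slide: the stack capacitance variation is the sum of per-layer Gaussian contributions with standard deviations σi and a common pairwise correlation ρ. The homogeneous corner then adds the per-layer 3σ values linearly, whereas the 3σ point of the sum is 3 times the square root of the variance of the sum. The function name is hypothetical.

```python
import math


def homogeneous_corner_pessimism(layer_sigmas, rho=0.0):
    """Ratio of the homogeneous-corner shift to the statistical 3-sigma shift.

    Corner shift:      sum_i 3*sigma_i (every layer skewed to its own 3 sigma).
    Statistical shift: 3*sqrt(Var), with
                       Var = sum_i sigma_i^2 + 2*rho*sum_{i<j} sigma_i*sigma_j.
    A ratio of 1.0 means no pessimism.
    """
    corner = sum(3.0 * s for s in layer_sigmas)
    variance = sum(s * s for s in layer_sigmas)
    for i in range(len(layer_sigmas)):
        for j in range(i + 1, len(layer_sigmas)):
            variance += 2.0 * rho * layer_sigmas[i] * layer_sigmas[j]
    return corner / (3.0 * math.sqrt(variance))


if __name__ == "__main__":
    for n in (1, 2, 4, 9):
        ratio = homogeneous_corner_pessimism([1.0] * n, rho=0.0)
        print(n, round(ratio, 3))  # grows as sqrt(n) for independent layers
    # Fully correlated layers: the homogeneous corner is exact.
    print(homogeneous_corner_pessimism([1.0] * 9, rho=1.0))
```

For n independent layers of equal σ the ratio is √n, so under this model a 9-layer stack overstates the 3σ shift by a factor of 3, while with full correlation (ρ = 1) the ratio is 1.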

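Correlation Matrix
- Let Σ be the correlation matrix for variation sources
- Correlation for variation sources with the same variation type and in the same process module: γ = 0.5
- Variation sources in different process modules are independent
[Figure: matrix Σ with rows and columns indexed by (layer, variation type) for layers M1 to M4 and types ΔW, ΔT, ΔH; diagonal entries are 1, and entries for the same type within the same process module are γ]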

Wiring Structure in Timing-Critical Paths (2)
- 92% of paths have < 60% of wirelength on any single layer
- Variations in different layers are not fully correlated
- Averaging uncorrelated variation → smaller RC variation
[Figure: cumulative probability vs. max. wirelength ratio across all layers (%); the curve reaches 0.92 at 60%]

Delay Variation
- Some paths have α > 1.0 → a CBC can underestimate delay variations
- But these paths have larger delays at the other corner
- Paths are more sensitive to R or to C
  - Using RC-worst or C-worst only will underestimate delay variations
  - Need both RC- and C-worst corners to cover process variations
- In the following discussions, α is defined at the dominant corner
[Figure: α plotted against Δdelay at C-worst, [d(Ycw) - d(Ytyp)] / d(Ytyp), and against Δdelay at RC-worst, [d(Yrcw) - d(Ytyp)] / d(Ytyp). Annotations:
  - Dominated by RC-worst: Δdelay at RC-worst > Δdelay at C-worst. The C-worst corner underestimates delay variations, but these paths are dominated by the RC-worst corner.
  - Dominated by C-worst: Δdelay at C-worst > Δdelay at RC-worst.
  - α < 1.0 → delay variations are covered by the RC-worst corner.]

Non-Homogeneous Corner
- Each layer can have different skewed variations
- Non-homogeneous corner: M1 == Cw (3σ), M2 == Ctyp
- Less pessimism with non-homogeneous corners
- Challenge:
  - Many feasible combinations
  - A corner can only cover certain paths
  - How to choose the best combinations?
[Figure: capacitance distributions for layers M1 and M2 of an interconnect stack, with the non-homogeneous corner marked at 3σ]

Opportunities for Tightened BEOL Corners
- CBC can be pessimistic! Most paths have α < 0.5
- Use tightened BEOL corners, e.g., scale BEOL variation in .itf with α = 0.5
- Challenge: how to avoid underestimating delay variation to preserve parametric yield
[Figure: scatter plot of 3σj/d(Ytyp) x 100% vs. Δdj(Yrcw)/dj(Ytyp) x 100%]

Based on the same data, we find that the conventional BEOL corner is pessimistic, because most of the paths have an alpha value smaller than 0.5. Therefore, we propose to use tightened BEOL corners, for example by scaling the variation parameters in the .itf by 0.5. Scaling the corner is simple, but the challenge is how to avoid underestimating delay variation so that parametric yield is preserved.

Wiring Structure in Timing-Critical Paths
- Testcase: 45nm foundry library (wire resistivity scaled by 8X)
- Netlist: NETCARD, 1mm2, 570K standard cell instances, 9 metal layers
- Extract critical paths from different PVT and BEOL corners
- Critical paths are structurally similar
- Wires on critical paths are routed on many layers
- Structure is an outcome of the design flow
[Figure: wirelength ratio (%) per metal layer for the critical paths]

To study the pessimism of the conventional BEOL corners, we implemented a testcase using a 45nm library with 9 metal layers. The results show that the critical paths have similar wiring structure. All the wires are routed on layers 2 to 6, and the wirelength is not dominated by a single layer. This result is expected because the wiring structure is driven by the design flow, e.g., the maximum transition constraint and the preferred routing directions.

Proposed Timing Signoff Flow
Extract RC at the RC-worst, C-worst and typical corners. Calculate Δdelay of critical paths. Put path j in the group Gtbc if Δdelay is larger than a threshold. Fix only the paths in Gtbc using tightened BEOL corners. Since tightened corners have smaller delay variations, timing closure is easier. Flow: routed design → timing analysis at BEOL corners Ytyp, Ycw, Yrcw → split paths into GTBC and GCBC → timing analysis using TBC (for GTBC) and timing analysis using CBC (for GCBC) → violation = 0? → if not, ECO using TBC or ECO using CBC and repeat → done. This figure shows our timing signoff flow. First we implement the design using conventional corners. Given a routed design, we extract RC and calculate the delta delay of the critical paths. Then, we filter the paths based on their delta delay. For the paths in Gtbc, we run timing analysis using the tightened corner. For all other paths, we use the conventional corner. Then, we run timing ECO with the respective corners until there are no timing violations. Since the paths which use the tightened corner have smaller delay variation, timing closure becomes easier.
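The path-filtering step of the flow above can be sketched as follows. This is only an illustration under stated assumptions: the names `PathDelay`, `rel_delta` and `split_paths` are hypothetical, and the rule "a path joins Gtbc if its relative Δdelay exceeds the threshold at either the C-worst or the RC-worst corner" is one plausible reading of "Δdelay is larger than a threshold", not a confirmed detail of the method. The threshold values 4.3% and 7.3% are the Acw and Arcw entries listed for TBC-0.5 in the experiment setup.

```python
from dataclasses import dataclass


@dataclass
class PathDelay:
    name: str
    d_typ: float  # path delay at the typical BEOL corner, Ytyp
    d_cw: float   # path delay at the C-worst corner, Ycw
    d_rcw: float  # path delay at the RC-worst corner, Yrcw


def rel_delta(d_corner, d_typ):
    """Relative delta delay in percent: [d(Y) - d(Ytyp)] / d(Ytyp) x 100."""
    return (d_corner - d_typ) / d_typ * 100.0


def split_paths(paths, a_cw, a_rcw):
    """Split paths into (g_tbc, g_cbc) using thresholds given in percent.

    Assumed rule: a path may be signed off at the tightened corner (Gtbc)
    when its relative delta delay exceeds the threshold at either corner.
    """
    g_tbc, g_cbc = [], []
    for p in paths:
        if rel_delta(p.d_cw, p.d_typ) > a_cw or rel_delta(p.d_rcw, p.d_typ) > a_rcw:
            g_tbc.append(p.name)
        else:
            g_cbc.append(p.name)
    return g_tbc, g_cbc


# Made-up delays in ns, for illustration only.
paths = [
    PathDelay("p0", 1.00, 1.02, 1.10),  # +2% at Ycw, +10% at Yrcw
    PathDelay("p1", 1.00, 1.01, 1.02),  # small delta delay at both corners
]
g_tbc, g_cbc = split_paths(paths, a_cw=4.3, a_rcw=7.3)
print("Gtbc:", g_tbc, "Gcbc:", g_cbc)
```

Paths with small Δdelay at both corners stay in Gcbc, which matches the later observation that such paths tend to have a large α and are not safe to sign off at a tightened corner.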

Experiment Setup
Testcases for validation (45nm library with 8X wire resistivity)
                      LEON3MP   NETCARD   SUPERBLUE12
Clock period (ns)     1.8       2.0       3.1
Gate count            232K      575K      1031K
Utilization (%)       84        79        82
Core area (mm2)       0.45      1.04      1.91
Max. transition (ps)  330
Statistical models: (1) no correlation and (2) same kind of variation sources in the same process module have correlation factor = 0.5
Thresholds (correlation factor = 0.5)
           α     Acw (%)   Arcw (%)
TBC-0.5    0.5   4.3       7.3
TBC-0.6    0.6   3.3       5.0
TBC-0.7    0.7   3.0       3.4
Implement another NETCARD (clock period = 2.3ns) to obtain α, Acw and Arcw
In this experiment setup, we use three testcases and we try two statistical setups. First, we assume there is no correlation among the variation sources, and second, we assume that the same kind of variation sources in the same process module have a correlation factor of 0.5. Also, we extract the threshold values for different α values.

Further Analysis Paths with small Δd(Yrcw) and Δd(Ycw) have large α
A path has small Δdelays → the path is equally sensitive to R and C. Example: dj = dj(Ytyp) ΔdR-M ΔdC-M1 (terms: nominal delay; delay sensitivity to unit change in M1 resistance; delay sensitivity to unit change in M1 capacitance). For a given CBC = Ycw, ΔdR-M1 is small but ΔdC-M1 is large → the delay variations of ΔdR-M1 and ΔdC-M1 cancel out → Δd(Ycw) ≈ 0 < σj. Based on our study, we see the same trend on different critical paths. When a path has a small delta delay, this means that the path is not dominated by resistance or capacitance. For example, in this case, a path is equally sensitive to R and C. For a given BEOL corner, if the resistance becomes smaller, the capacitance becomes larger. Because ΔR and ΔC move in opposite directions, the delay variation is cancelled out at the BEOL corner. As a result, the delay variation becomes very small compared to the delay variation obtained from a statistical analysis. Therefore, the path will have a large α.

Scaling Factor Results
Similar trends in different designs. Large α when Δd(Yrcw)/d(Ytyp) and Δd(Ycw)/d(Ytyp) are small. [Figure panels: LEON3MP, NETCARD, SUPERBLUE12, each with the α > 0.5 region marked]

Benefits of Tightened BEOL Corners (1)
Correlation factor, γ = 0 (variation sources are independent). WNS and TNS are reduced by up to 120ps and 61ns, respectively. The number of timing violations is reduced by 31% to 100%.

Heuristics #1 VBTI = Vlib ≈ Vfinal
Model BTI degradation with Vfinal throughout lifetime. Aging of a flat Vfinal ≈ aging of an adaptive Vdd, but slightly pessimistic. VBTI = Vlib ≈ Vfinal [Figure: Vdd versus time, with NBTI and PBTI degradation] To address the inconsistency between the voltages, we study the trend of BTI degradation. Because of the front-loaded nature of the BTI effect, we found that we can model BTI degradation with a fixed Vdd throughout the lifetime. This figure shows that BTI degradation due to a flat Vfinal is similar to that due to the time-varying Vdd of an AVS system; using a flat Vfinal slightly overestimates the aging effect. The small overestimation is acceptable because we want to be conservative during signoff.

Vfinal Estimation Problem: Vfinal is not available at an early design stage (the design has not been implemented). Vfinal = Vdd at end of life (to compensate BTI aging). Factors: gates along the critical path; timing slack at t = 0. Circuit activity is not an issue, because the BTI effect is not sensitive to circuit activity; a DC or AC stress model is sufficient. However, making the Vfinal approximation alone is insufficient, because Vfinal is not available at the early design stage. To estimate Vfinal, we need to understand the factors that affect it. Vfinal is the circuit voltage at the end of lifetime. This value is determined by the gates on the critical paths, because each gate responds differently to aging and voltage scaling. Another factor that affects Vfinal is the timing slack of the circuit at the beginning of its lifetime. For example, if the circuit has very large timing slack, AVS may not raise the voltage at all during the circuit lifetime. Circuit activity is not a critical issue because the BTI effect is not sensitive to circuit activity. Therefore, it is sufficient to use a DC or AC stress model.

Observation and Heuristic #2
Observation #2: Vfinal is not sensitive to gate types. Heuristic #2: use the average Vfinal of different gate types. Vfinal is a function of timing slack; assume timing slack = 0. [Figure: Vfinal versus timing slack for different critical paths; spread of about 10mV] To analyze the impact of gate type and timing slack, we construct artificial critical paths with different gate types and extract the value of Vfinal for different timing slacks. In this figure, the x-axis is the timing slack of the critical paths and each line represents a critical path. From the figure, we can see that the Vfinal values of different paths are very similar. The difference between them is typically less than 10mV. Therefore, we propose our second heuristic: use the average Vfinal of different cells to estimate the Vfinal for library characterization. The figure also shows that when timing slack increases, Vfinal reduces and eventually converges to the supply voltage at the beginning of circuit lifetime.

Technology and Benchmark Circuits
NANGATE library with 32nm PTM technology. Signoff for setup time violation. Temperature = 125C. Process corner = slow NMOS and PMOS. BTI degradation = {DC, AC}.
Supply voltages
Vmax          1.05V
Vinit         0.90V
Vheur1 (DC)   0.97V
Vheur1 (AC)   0.95V
Vheur2 (DC) Vheur2 (AC) 0.93V
Circuit   Frequency (GHz)
C5315     1.38
c7552     1.25
AES       0.89
MPEG2     1.05
In this experiment, we use the NANGATE library with 32nm PTM technology for the active devices. The circuits are signed off for setup time violation at 125 degrees Celsius and at the slow-slow process corner. We consider DC and AC BTI degradation separately. We use four benchmark circuits, with frequencies ranging from 900MHz to 1.4GHz. We assume the initial voltage of the circuit is 0.9V and the maximum allowed voltage is 1.05V. As expected, the Vheur value for our method is slightly smaller for the AC stress scenario. The value of Vheur is also smaller when we read it at 3% timing slack.

A Reference Signoff Flow
Basic idea: keep a consistent VBTI, VLIB and Vdd throughout circuit lifetime. Signoff flow: estimate aging at each time step; update circuit timing and Vdd; repeat until t = tfinal; modify the circuit and start over if Vfinal > maximum allowed voltage. No overhead in timing analysis, but very slow: Many STA runs and library. Vstep: AVS voltage step. Vfinal: converged voltage. Before I describe the details of our experiment, I want to introduce a reference signoff flow which is used in this work for comparison. The basic idea of the reference signoff flow is to keep a consistent VBTI, VLIB and Vdd throughout circuit lifetime. In this signoff flow, we estimate aging at each time step. Based on the estimated aging, we update Vdd and circuit timing. The circuit is modified and the analysis flow starts over if Vfinal is larger than the maximum allowed voltage. Although this reference signoff flow has no overhead in timing analysis, it is very slow because we need to validate timing at each time step and update the libraries according to the aging and the changes in supply voltage.
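The loop structure of the reference flow can be sketched as below. Everything concrete here is an assumption for illustration: `delay_at` is a hypothetical callable standing in for library re-characterization plus STA at time t and supply Vdd, and `toy_delay` is a made-up model that ignores how the Vdd history affects aging (the real flow keeps VBTI consistent with Vdd at every step).

```python
def reference_signoff(delay_at, time_steps, clock_period, v_init, v_max, v_step):
    """Step through the lifetime, raising Vdd whenever aged timing fails.

    delay_at(t, vdd): hypothetical stand-in for "characterize the aged
    library and run STA" at time t and supply voltage vdd.
    Returns Vfinal, or None if the required voltage exceeds v_max (in the
    reference flow the circuit would then be modified and the flow restarted).
    """
    vdd = v_init
    for t in time_steps:
        while delay_at(t, vdd) > clock_period:
            vdd = round(vdd + v_step, 6)  # one AVS voltage step (Vstep)
            if vdd > v_max:
                return None
    return vdd


def toy_delay(t_years, vdd):
    """Made-up critical-path delay in ns: aging slows it, higher Vdd speeds it up."""
    aging = 1.0 + 0.02 * t_years ** 0.2
    return 1.0 * aging * (0.9 / vdd)


v_final = reference_signoff(toy_delay, range(0, 11), clock_period=1.0,
                            v_init=0.90, v_max=1.05, v_step=0.01)
print("Vfinal =", v_final)
```

Each call to `delay_at` corresponds to an STA run against a freshly characterized library, which is why the reference flow is accurate but slow.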

Experiment Setup Characterize different derated libraries
Evaluate impact of library characterization. Seven setups: 1 : VBTI = Vlib = Vinit → Ignore AVS 2 : Most pessimistic derated library 3 : VBTI = Vlib = Vmax → Extreme corner for AVS 4 : VBTI = Vfinal → Do not overestimate aging but ignores AVS 5 : No derated library (reference) 6 : Proposed method with α=0 7 : Proposed method with α=0.03 Case 1 2 3 4 5 6 7 Vlib(V) Vinit Vmax N/A Vheur1 Vheur2 VBTI (V) Vfinal In our experiment setup, we characterize different derated libraries to evaluate the impact of library characterization. There are seven testcases. In the first testcase, we set both VBTI and Vlib to the initial voltage of the circuit. This setup represents the case where we ignore AVS. In the second testcase, we set Vlib to Vinit and use the maximum voltage to estimate BTI aging. This represents the most pessimistic derated library. In the third testcase, we set both VBTI and Vlib to the maximum voltage. This is an extreme corner for a circuit with AVS. In the fourth testcase, we set VBTI to the Vfinal obtained from the reference flow but keep Vlib the same as the initial voltage of the circuit. This setup represents the case where we do not overestimate the aging but the effect of AVS is ignored. The fifth testcase is the result of the reference flow, and the last two testcases are our proposed method, which uses Vheur to characterize the derated library. In the sixth testcase, we choose the Vheur without timing slack. In the seventh testcase, we choose the Vheur with 3% timing slack to see the impact of using a different timing slack. Hypothesis

“Chicken and Egg” Loop “Chicken and egg” loop in signoff Vfinal
Derated library characterization is related to BTI + AVS. AVS is affected by circuit implementation (timing constraints, critical paths, etc.). The circuit is affected by library characterization. [Figure: loop linking Vfinal, Circuit, Derated Libraries and Vlib, VBTI] From the previous slides, we know that the signoff problem is actually a chicken and egg loop. As you can see, the characterization of a derated library is related to BTI aging and adaptive voltage scaling. At the same time, adaptive voltage scaling is affected by the circuit implementation, such as the timing constraints of the circuit, the critical paths, etc. Unfortunately, the circuit is also affected by the library characterization. So everything is inter-related: the derated library, the circuit, BTI aging and adaptive voltage scaling. To solve such a chicken and egg problem, we need to break the loop.

Bias Temperature Instability (BTI)
[TCAS’14] Bias Temperature Instability (BTI): |ΔVth| increases when device is on (stressed); |ΔVth| is partially recovered when device is off (relaxed). NBTI: PMOS. PBTI: NMOS. [Figure: |Vgs| alternating ON, OFF, ON, OFF over time; device aging (|ΔVth|) accumulates over time] [VattikondaWC06] BTI aging is a physical phenomenon which affects the threshold voltage of a device. We call it NBTI for a PMOS and PBTI for an NMOS. When a device is turned on, BTI aging increases the threshold voltage. When the device is turned off, part of the threshold voltage increment is recovered. Since the threshold voltage cannot be fully recovered, the threshold voltage shift due to BTI aging accumulates over time.

Observation #1 BTI is a “front-loaded” phenomenon
50% of BTI aging happens within the 1st year of circuit lifetime (total lifetime = 10 years) [Chan11]. ≈70% of the Vdd increment occurs in 1 year (remaining 30% over 9 years). Most of the Vdd increment happens in early lifetime. The gap between Vdd and Vfinal reduces rapidly. Based on previous studies, we observe that BTI degradation is “front-loaded”. For example, 50% of BTI degradation happens within the first year of chip lifetime. This means that Vdd increments due to AVS are very frequent initially, and the gap between Vdd and Vfinal of the circuit reduces rapidly over its lifetime.
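A small numeric illustration of why a degradation curve can be this front-loaded, under an assumption that is not stated on the slide: suppose |ΔVth| grows as a power law in stress time, t^n, with a small exponent n. The function name and the two exponent values below are illustrative choices only, not parameters taken from the cited studies.

```python
def fraction_aged(t_years, lifetime_years, n):
    """Fraction of end-of-life |dVth| reached at t_years, assuming |dVth| ~ t**n."""
    return (t_years / lifetime_years) ** n


# Two illustrative exponents (assumptions, not values from the slides).
for n in (1 / 6, 0.3):
    share = fraction_aged(1, 10, n)
    print(f"n = {n:.2f}: {share:.0%} of the 10-year shift is reached after year 1")
```

With any exponent well below 1, most of the shift accumulates early, which is the property Observation #1 relies on.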

Results for DC Scenario
1 : VBTI = Vlib = Vinit → Ignore AVS 2 : Most pessimistic derated library 3 : VBTI = Vlib = Vmax → Extreme corner for AVS 4 : VBTI = Vfinal → Do not overestimate aging but ignores AVS 5 : No derated library (reference) 6 : Proposed method with α=0 7 : Proposed method with α=0.03 [Figure: power and area of each circuit for the seven setups, with the "Good corners" marked] Optimistic signoff corner: AVS increases supply voltage aggressively to compensate aging; consumes more power; may fail to meet timing if desired supply voltage > Vmax. Pessimistic signoff corner: overestimates aging and/or underestimates circuit performance; large area overhead. These figures show the power and area of different circuits implemented and signed off using different derated libraries. We can see that the circuit implemented using derated library number 3 has slightly less area, but its power is approximately 10% larger than with the reference method. This is because the derated library is too optimistic. Therefore, AVS has to increase the supply voltage aggressively to compensate aging. This causes the circuit to consume more power. Moreover, the circuit may fail to meet timing if the desired supply voltage is larger than the maximum voltage. On the other hand, we can see that testcases 1, 2 and 4, which are pessimistic in aging or circuit performance, lead to larger circuit area. By using our heuristics to characterize the derated libraries, the implemented circuits have similar area and power compared to the one obtained from the reference flow. This shows that our method can avoid power or area design overheads.

Problem: Signoff Corner Definition
Timing signoff: ensure circuit meets performance target under PVT variations and aging. Conventional signoff approach: analyze circuit timing at worst-case corners; fix timing violations, re-run timing analysis. With BTI aging and AVS, what is the Vdd of the worst-case corner for timing analysis? With BTI aging and AVS, the worst-case voltage corner is not obvious.
Corner combinations (VBTI for aging estimation, Vlib for circuit performance estimation):
VBTI = Min Vdd, Vlib = Min Vdd: slowest circuit, less aging (?)
VBTI = Max Vdd, Vlib = Min Vdd: slowest circuit, worst-case aging (too pessimistic)
VBTI = Max Vdd, Vlib = Max Vdd: faster circuit, worst-case aging (?)
VBTI = Min Vdd, Vlib = Max Vdd: not applicable (optimistic)
In timing signoff, we want to ensure that a circuit meets its performance target under process, voltage and temperature variations as well as aging. There are two important steps in timing signoff. First, we analyze circuit timing at worst-case corners and check whether there is any timing violation. Second, if there is any violation, we fix it by making changes in the circuit and then re-run timing analysis. The problem is: with BTI aging and AVS, what Vdd should we use in the worst-case corner for timing analysis? This table shows that there are two variables when we define a worst-case corner. Here VBTI is the voltage for aging estimation and Vlib is the voltage for circuit performance estimation. You can treat Vlib as the supply voltage of the standard cells when we characterize a timing library. Now, suppose we use the minimum voltage for both BTI aging and circuit performance estimation. We get the slowest circuit speed, but at the same time we underestimate the aging effect. So, it is not clear whether we should use this combination. Now, suppose we use the maximum Vdd for both BTI aging estimation and circuit performance estimation. The circuit will have the worst-case aging, but the circuit is faster because of the higher Vlib. The worst-case corner is when we use the maximum voltage for aging estimation and the minimum voltage for circuit performance estimation. However, we know that this combination is pessimistic because a circuit will not see different voltages at the same time. By now, I hope it is clear that with BTI aging and AVS, the worst-case voltage corner is not obvious.

AVS Signoff Corner Selection
[Figure: area-power tradeoff of implementations, from "Optimistic about AVS" to "Pessimistic about AVS"] For different signoff corners, a design has different area-power tradeoffs. Take #2 and #3 as examples. #3 is the most optimistic in signoff, so its supply voltage increases by more than in other implementations at runtime; hence, it consumes more power. #2 is the most pessimistic in signoff, so its supply voltage is fixed at runtime, but it consumes more area during implementation.

AVS Impact on EM Lifetime
Assume no EM fix at signoff. BTI degradation is checked at each step and MTTF is updated as MTTF(i) = MTTF(i-1) × (VDD(i-1) / VDD(i))^2. [Figure: MTTF versus Vfinal (V); 30% MTTF penalty for 200mV voltage compensation] We study the impact on EM MTTF degradation using Black’s equation. We assume the current density is proportional to the supply voltage, and MTTF is inversely proportional to the current density or voltage. We assume there is no EM fix for AVS at signoff; the implementation has a 10-year MTTF if the supply voltage is fixed at its nominal voltage. When the voltage increases to compensate BTI degradation, lifetime decreases because of higher EM stress. We update the delay degradation at each time step in the simulation, and increase the supply voltage to meet the timing constraint. We then update the MTTF at each time step until the end of the circuit’s lifetime. We observe up to 30% MTTF difference between the most pessimistic and optimistic signoff corners. The discrepancy arises due to Vfinal. This implies that AVS scheduling is important for EM degradation.
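The per-step MTTF update stated on this slide can be written directly as code. This is a minimal sketch: the function names and the example Vdd schedule are hypothetical, and only the update rule itself and the 10-year nominal MTTF come from the slide.

```python
def update_mttf(mttf_prev, vdd_prev, vdd_now):
    """One step of the slide's rule: MTTF(i) = MTTF(i-1) * (VDD(i-1) / VDD(i))**2."""
    return mttf_prev * (vdd_prev / vdd_now) ** 2


def mttf_after_schedule(mttf_nominal, vdd_schedule):
    """Apply the update along a sequence of supply voltages (one per time step)."""
    mttf = mttf_nominal
    for v_prev, v_now in zip(vdd_schedule, vdd_schedule[1:]):
        mttf = update_mttf(mttf, v_prev, v_now)
    return mttf


# Hypothetical AVS schedule rising from 0.90V to 1.05V; 10-year nominal MTTF.
schedule = [0.90, 0.95, 1.00, 1.05]
mttf_end = mttf_after_schedule(10.0, schedule)
print(f"MTTF at end of schedule: {mttf_end:.2f} years")
```

Taken literally, this update rule telescopes to MTTF_nominal × (V_initial / V_final)^2, so in this sketch the result depends only on the first and last voltages; for the schedule above that is roughly 7.3 years.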

EM Impact on AVS Scheduling
[Figure: MTTF for schedules S1 to S5; 1.2 years MTTF penalty] The simulation results are shown here with five schedules, S1 to S5. Each schedule uses a different voltage step size. When we increase the voltage step size, MTTF changes significantly. With schedule S5, the MTTF penalty can be up to 1.5 years. S5 increases voltage earlier in the lifetime than the other schedules. As a result, it causes worse EM stress on the wires.

“margin stack” for voltage signoff
What is “Signoff”? Foundation of contract between design house and foundry “chip should work”: stack of models, margins, analyses Function, timing, signal integrity, power integrity, … Problem: Margins = pessimism  overdesign, schedule delay Operating voltage Voltage Nominal Vdd Static IR drop Power grid IR gradient Dynamic IR HCI/NBTI “margin stack” for voltage signoff Signoff Vdd

Statistical Timing Analysis (1)
Delay sensitivity of path pj to variation source zv: Δdj,v = [dj(Yv) - dj(Ytyp)] / 3. Assumptions: Δdj,v is linear with respect to variation sources; variation sources are normal distributions. Obtain Δdj,v using 28 runs of RC extraction and static timing analysis (STA). Flow: 28 .itf files (27 variation sources + Ytyp) and routed netlist → RC extraction → STA → Δdj,v. Note: path delay includes gate and wire delays. To extract the impact of BEOL variation, we need to collect path delay sensitivities. The delay sensitivity is defined as the delta delay at a perturbed corner divided by the magnitude of the variation source. In this work, we extract the BEOL variations from the .itf files, and we assume that the variations are normally distributed and that the value in the existing corners is the 3-sigma value. Given a routed netlist, we run 28 RC extractions and STA runs to collect the delay sensitivities of each setup data path to each BEOL variation source.
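The sensitivity definition above amounts to one subtraction and a division by 3 per variation source. A minimal sketch follows, assuming the per-corner path delays have already been collected from the RC extraction and STA runs; the function name, the source labels and the delay values are made up for illustration.

```python
def delay_sensitivities(d_typ, d_at_corner):
    """Per-sigma delay sensitivity of one path to each variation source.

    d_typ: path delay at the typical corner Ytyp.
    d_at_corner: maps variation-source name -> path delay with that source
    perturbed to its 3-sigma corner value (hence the division by 3).
    """
    return {src: (d - d_typ) / 3.0 for src, d in d_at_corner.items()}


# Hypothetical delays in ns for one path and two of the 27 sources.
sens = delay_sensitivities(1.00, {"M1_R": 1.03, "M1_C": 0.97})
print(sens)
```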

Statistical Timing Analysis (2)
Σ is the correlation matrix for variation sources (e.g., 27 x 27). Σ = λλ^T (note: λ is obtained by Cholesky decomposition). Delay sensitivities with correlation: [Δd’j,1 … Δd’j,27] = [Δdj,1 … Δdj,27] · λ. Standard deviation of path delay: σj = ((Δd’j,1)^2 + … + (Δd’j,27)^2)^0.5. Note: we use the delay variation from the statistical analysis as a reference. Given the correlation matrix of variation sources, we obtain λ using Cholesky decomposition. Then, by using λ, we can obtain the delay sensitivities with correlation and calculate the standard deviation of the path delay. Note that this statistical analysis is only used for reference. This kind of analysis cannot be used for design implementation because the RC extractions and timing analyses are very slow.
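The three formulas above can be checked on a tiny example. The sketch below uses a hand-written Cholesky factorization so that it needs only the standard library; the function names and the two-source example are illustrative assumptions, and a real flow would use 27 sources and a numerical library.

```python
import math


def cholesky(sigma):
    """Lower-triangular lam with sigma = lam * lam^T (sigma symmetric positive definite)."""
    n = len(sigma)
    lam = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lam[i][k] * lam[j][k] for k in range(j))
            if i == j:
                lam[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                lam[i][j] = (sigma[i][j] - s) / lam[j][j]
    return lam


def path_sigma(sens, corr):
    """Standard deviation of path delay from per-source sensitivities.

    sens: row vector [dd_1 ... dd_n]; corr: n x n correlation matrix.
    """
    lam = cholesky(corr)
    n = len(sens)
    # Row vector times lam: dd'_k = sum_i sens[i] * lam[i][k]
    d_prime = [sum(sens[i] * lam[i][k] for i in range(n)) for k in range(n)]
    return math.sqrt(sum(x * x for x in d_prime))


sens = [3.0, 4.0]  # made-up sensitivities for two variation sources
print(path_sigma(sens, [[1.0, 0.0], [0.0, 1.0]]))  # independent sources
print(path_sigma(sens, [[1.0, 0.5], [0.5, 1.0]]))  # correlation factor 0.5
```

Because Σ = λλ^T, the result equals the square root of the quadratic form Δd Σ Δd^T, so positive correlation between sources that push delay in the same direction increases σj.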

Resilient Designs Detect and recover from timing errors  Ensure correct operation with dynamic variations (e.g., IR drop, temperature fluctuation, cross-coupling, etc.) Trade off design robustness vs. design quality  E.g., enable margin reduction Improve performance (i.e., timing speculation) Conventional design: Worst-case signoff No Vdd downscaling Resilient design: Typical-case signoff Vdd downscaling  reduced energy 15% reduction

Resilience Cost Reduction Problem
Given: RTL design, throughput requirement and error-tolerant registers. Objective: implement design to minimize energy. Estimation of design energy: Energy = Power / Throughput, with Throughput = (1 - ER) / T + (1 - ER) / (r × T), where ER = error rate [Kahng10], r = #recovery cycles, T = clock period.

Selective-Endpoint Optimization
Optimize fanin cone w/ tighter constraints  Allows replacement of Razor FF w/ normal FF Trade off cost of resilience vs. data path optimization Question: Which endpoint to be optimized?

Process-Aware Vdd Scaling (PVS)
Classes of AVS approaches. Open-Loop AVS: frequency and Vdd LUT, pre-characterized [Martin02]; process-aware AVS with post-silicon characterization [Tschanz03]. Closed-Loop AVS: process and temperature-aware AVS using a generic on-chip monitor [Burd00], a design-dependent monitor (design-dependent replica) [Elgebaly07, Drake08, Chan12], or an in-situ performance monitor that measures actual critical paths [Hartman06, Fick10]. Error Tolerance AVS: error detection and correction system; Vdd scaling until an error occurs [Das06, Tschanz10]. [Figure axis: Power] There are many existing AVS techniques, and they can be broadly classified into three categories: open-loop AVS, closed-loop AVS, and error tolerance AVS. Open-loop AVS typically adjusts the voltage based on a pre-characterized lookup table. To achieve more power reduction, closed-loop AVS utilizes on-chip monitors to measure the performance variation of the chip and adjusts the voltage accordingly. For further power reduction, error tolerance AVS reduces the voltage aggressively until an error occurs. In this work, we focus on closed-loop AVS because it offers better power reduction than open-loop AVS, and it is easier to design than error tolerance AVS.

Challenge: Variability
[Figure: ideal versus non-ideal scaling trends for DENSITY, SUPPLY VOLTAGE, POWER and DRIVE CURRENT. Sources: [CPUDB], [JeongK08], [ITRS]] Here I use fixed voltage scaling for area, power, drive current. And show full scaling in Vdd plot.

Energy Reduction in AVS Context
Adaptive voltage scaling allows lower supply voltage for resilient designs, thus reduced power Proposed method trades off between timing-error penalty vs. reduced power at a lower supply voltage Proposed method achieves an average of 18% energy reduction compared to pure-margin designs  Resilience benefits increase in the context of AVS strategy MUL EXU Minimum achievable energy

Our Concept: Mode Dominance
Design cone (of mode A) is the union of all the feasible operating modes for circuits signed off at mode A. The design cone is determined by the tradeoff between voltage and frequency (mainly threshold voltages). One mode is outside of the design cone of the other → failed design / overdesign. Mode A has positive timing slacks with respect to mode B → mode A dominates mode B. Equivalent dominance: no mode is dominated by the other; modes are in each other's design cone. Multi-mode signoff at modes which do not exhibit equivalent dominance leads to overdesign. Guideline: search for signoff modes within the design cone → reduce overdesign. [Figure: frequency versus voltage with the design cone of mode A; mode C in the negative-slack region (failed design), mode B in the positive-slack region (overdesign)]

Our Method: Global Optimization
Iteratively sample and refine power models Avoid circuit implementation at each mode Small constant # of runs is enough  Scalable Global optimization flow Power estimation of adaptive search Ovals indicate sample points 1st / 2nd: power from power models at first / second iteration real: power from real implemented circuits Design: AES f : 700MHz Sample (SP&R) Construct power models Estimate optimal signoff modes Sample (SP&R) Refine power models Adaptive search

Classes of Closed-Loop AVS
Generic monitor: does not capture design-specific performance variation. Design-dependent replica and in-situ monitor: critical path may be difficult to identify (IP from 3rd party); calibrating monitors at multiple modes/voltages requires long test time. This work: tunable monitor for closed-loop AVS; can be applied as a generic monitor, or tuned to capture design-specific performance. A closed-loop AVS can be implemented with different monitors. For example, we can use a generic monitor such as an inverter-based ring oscillator to measure the performance variation of a chip. But this monitor does not capture the design-specific performance variation. To capture the design-specific performance variation, we may use design-dependent replica circuits or in-situ monitors. However, these monitors usually require information on critical paths, which may be difficult to identify. Also, calibrating monitors at multiple modes and voltages requires long test time. In this work, we propose a method to design a closed-loop AVS monitor which can be used as a generic monitor or tuned for a specific design when silicon samples are available.

Design of RO with Tunable Vmin
Identified two circuit knobs to tune Vmin: series resistance and cell types (INV, NAND, NOR). Proposed circuit: different cell types cover different process corners; tune the series resistance of each stage to high or low with 1-bit control pins. [Figure: RO stage with series passgate, high resistance versus low resistance] Based on the analysis, we have identified that the Vmin of a circuit is mainly affected by series resistance and cell type. To guardband for the worst-case voltage scaling across different process conditions, we use the cells with the largest Vmin in our RO design, so we have ROs with different cell types. We also introduce a configurable series passgate at each stage along the signal path. By tuning all stages to low resistance, we obtain the maximum Vmin, which represents the worst-case voltage scaling condition. To reduce the Vmin, we can tune the passgates to high resistance. In short, when the series passgates have low resistance, the Vmin is higher, and vice versa.

Benefit of Resilience Cost Reduction
Reference flows Pure-margin (PM): conventional methodology w/ only margin insertion Brute-force (BF): insert error-tolerant FFs at timing-critical endpoints Proposed method (CO) achieves up to 20% energy reduction compared to reference methods Resilience benefits increase with safety margin EXU Large margin Medium margin Small margin Large margin Medium margin Small margin MUL Small/medium/large margin  safety margin = 5%/10%/15% of clock period

Increased Benefit of Resilience With AVS
AVS (Adaptive Voltage Scaling) allows lower supply voltage for resilient designs reduced power We trade off between timing-error penalty vs. reduced power at a lower supply voltage Average 18% energy reduction compared to pure-margin designs  Resilience benefits increase in AVS context MUL EXU Minimum achievable energy

Overall Optimization Flow
Iteratively optimize with SEOpt and SkewOpt Initial placement (all FFs = error-tolerant FFs) Margin insertion on K paths based on sensitivity function Replace error-tolerant FFs w/ normal FFs SEOpt Energy < min energy? Save current solution Activity-aware clock skew optimization SkewOpt
