
1 Hardware Reliability Margining for the Dark Silicon Era. Liangzhen Lai and Puneet Gupta, Department of Electrical Engineering, University of California, Los Angeles. puneet@ee.ucla.edu. This work is supported in part by NSF Variability Expedition grant CCF-1029030.

2 Outline: Overview; Accumulation Model and Management Policies; Problem Formulation; Experimental Results; Conclusion.

3 Hardware Reliability Margin. Parametric margin: voltage/frequency or sign-off corners (e.g., BTI, HCI). Physical margin: metal width, layout spacing (e.g., current-dependent minimum metal width for EM). Margins are typically worst-case driven and mostly derived at hardware design time, under uncertainty in workload, circuit operating points, etc.

4 Reliability vs. Operating Points. Most reliability-related phenomena depend heavily on the circuit operating points: voltage, frequency, temperature, etc.

5 Dynamic Range of Operations. Efficiency is needed for the dark silicon era: multi/many-core designs with less powerful cores run at low voltage/current/power, which needs less margin. "Turbo X" modes, such as Turbo Boost (Intel) and Turbo Core (AMD), run under certain conditions at high voltage/current/power, which needs more margin. [Figure: reliability margin across workloads (moderate, parallel-intensive, single-thread), spanning low-stress to high-stress states, with regions labeled "known pessimistic" and "known optimistic".]

6 Dark Silicon Contexts. Pessimism depends on the difference between peak power/temperature and sustainable power/temperature. Quantify silicon "darkness" with the dark ratio: [equation not preserved in transcript]. Power constraint: limit on maximum instantaneous power. Thermal constraint: limit on maximum on-chip temperature.

7 Margining Methodology. Formulate margining as a workload optimization: maximize the reliability degradation while still meeting the power/thermal constraints.


9 Dynamic Reliability Model. Most reliability models are static, derived for constant voltage/current/temperature. Optimization needs a highly dynamic model that can compare different degradation scenarios. [Figure: two voltage-versus-time (v, t) traces over power states P1, P2, P3, compared side by side.]

10 Accumulation Model. Some accumulation models can be derived from the reliability model itself; e.g., EM can be modeled by an effective current density J_eff. Others can be derived by simulation; e.g., worst-case BTI degradation can be derived by simulating different power-state orderings and picking the worst case. Fitting and interpolation can also be used. [Figure: the accumulation model maps the time spent in each power state to the worst-case degradation at the end of lifetime.]

11 Spatial Problem vs. Temporal Problem. With the accumulation model, reliability degradation can be modeled as a temporal distribution problem. The workload and power/thermal constraints are spatial problems. [Figure: a grid of cores in power states P1, P2, P3 next to a voltage-versus-time (v, t) trace over the same states.]

12 System Management Policy. We assume a fair round-robin policy that iterates scheduling priorities among all processor cores; the iteration period can range from hours to days. We assume this policy because it is: Simple: open-loop, and reasonable to assume at hardware design time. Effective: there are sufficient iterations to balance the workload during a typical hardware lifetime of multiple years. Pessimistic: more sophisticated policies are likely to perform better, i.e., the resulting margin is pessimistic.

13 Bridging Spatial and Temporal Problems. The management policy iterates the workload among all cores, so the spatial distribution is equivalent to the temporal distribution. [Figure: a grid of cores in power states P1, P2, P3 (spatial constraints) mapped to a voltage-versus-time (v, t) trace (temporal distribution).]
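The equivalence claimed on this slide can be checked with a small sketch. It assumes an idealized round-robin in which the whole power-state assignment shifts by one core per iteration; the function name and the six-core snapshot are hypothetical, not taken from the presentation.

```python
from collections import Counter

def round_robin_time_shares(assignment, rounds=None):
    """Rotate a spatial power-state assignment across cores.

    Returns, for each core, the fraction of iterations it spent in each
    power state. Assumes an idealized round-robin that shifts the whole
    assignment by one core per iteration (an illustrative assumption).
    """
    n = len(assignment)
    rounds = n if rounds is None else rounds
    per_core = [Counter() for _ in range(n)]
    for step in range(rounds):
        for core in range(n):
            # After `step` shifts, this core holds the state that was
            # originally assigned `step` positions away.
            per_core[core][assignment[(core + step) % n]] += 1
    return [{s: c / rounds for s, c in counts.items()} for counts in per_core]

# Hypothetical snapshot of six cores in power states P1..P3.
snapshot = ["P1", "P3", "P1", "P2", "P1", "P2"]
shares = round_robin_time_shares(snapshot)

# Spatial distribution: fraction of cores in each state at one instant.
spatial = {s: c / len(snapshot) for s, c in Counter(snapshot).items()}
print(spatial)
print(shares[0])
```

After one full rotation, every core has spent the same fraction of time in each state as the fraction of cores occupying that state in the snapshot, which is why a spatial constraint can stand in for a temporal distribution under this policy.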


15 Optimization Under Power Constraints. x is the number of cores in each power state, and is also the input to the accumulation model f(x). P is the power corresponding to each power state. P_max is the power constraint. Formulated as an Integer Linear Programming (ILP) problem.
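The power-constrained search can be illustrated with a small sketch. The slide's objective and constraint are not preserved in the transcript, so the code assumes one plausible form: maximize f(x) subject to the total power of all cores staying within P_max, with the entries of x summing to the core count. It uses brute-force enumeration rather than an ILP solver, and the function name, the "off" state, the linear stand-in for f(x), and all numbers are hypothetical.

```python
from itertools import product

def worst_case_distribution(n_cores, state_power, p_max, degradation):
    """Find the core-count vector x that maximizes degradation(x).

    Assumed constraints (not taken from the slides):
      sum(x) == n_cores and sum(P_i * x_i) <= p_max.
    Brute force is only practical for small core counts.
    """
    best_x, best_f = None, float("-inf")
    for x in product(range(n_cores + 1), repeat=len(state_power)):
        if sum(x) != n_cores:
            continue  # every core must be in exactly one state
        if sum(p * k for p, k in zip(state_power, x)) > p_max:
            continue  # violates the power constraint
        f = degradation(x)
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Hypothetical states: off (0 W), low (1 W), high (3 W).
state_power = (0, 1, 3)
# Hypothetical linear accumulation model: degradation weight per state.
weights = (0, 1, 4)

def linear_f(x):
    return sum(w * k for w, k in zip(weights, x))

best_x, best_f = worst_case_distribution(4, state_power, 6, linear_f)
print(best_x, best_f)
```

With these made-up numbers the worst case puts two cores in the high state and leaves two dark, which mirrors the idea that the power budget caps how many cores can be stressed at once.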

16 Thermal Problem. The thermal limit can be reached in two scenarios: heat up then cool down (left), or constant temperature (right). The constant stress results in worse degradation: higher average temperature and more time in the high power state.

17 Optimization Under Thermal Constraints. S is the time spent in each power state for each core. A is the temperature sensitivity matrix (temperature increase per unit power). T_max is the maximum temperature constraint. T_bak is the background power for each core. Formulated as a Linear Programming (LP) problem.
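A sketch of how the thermal constraint might be evaluated for a candidate S follows. The slide's equations are not preserved, so the code assumes the constant-temperature scenario, computes each core's average power from its time fractions, and applies T = A * P_avg + T_bak with T_bak treated as a per-core background temperature. That reading of T_bak, the function names, and every number are assumptions for illustration only.

```python
def steady_state_temperatures(S, state_power, A, t_background):
    """Estimate per-core temperature for a time-share matrix S.

    S[i][j]  : fraction of time core i spends in power state j
    A[i][k]  : temperature rise at core i per unit power at core k
    Assumed model (not from the slides): T = A * P_avg + T_background.
    """
    # Average power of each core, weighted by its time in each state.
    avg_power = [sum(frac * p for frac, p in zip(row, state_power)) for row in S]
    return [
        tb + sum(a * p for a, p in zip(a_row, avg_power))
        for a_row, tb in zip(A, t_background)
    ]

def meets_thermal_limit(temperatures, t_max):
    """True if every core stays at or below the temperature limit."""
    return all(t <= t_max for t in temperatures)

# Hypothetical two-core example with states off (0 W), low (1 W), high (3 W).
state_power = [0.0, 1.0, 3.0]
A = [[2.0, 0.5],
     [0.5, 2.0]]                 # hypothetical sensitivities, K per W
t_background = [45.0, 45.0]     # hypothetical background temperature
S = [[0.0, 0.0, 1.0],           # core 0 always in the high state
     [1.0, 0.0, 0.0]]           # core 1 always off (dark)

temps = steady_state_temperatures(S, state_power, A, t_background)
print(temps, meets_thermal_limit(temps, 60.0))
```

Because the temperature is linear in the entries of S under this assumption, the constraint is linear, which is consistent with the slide's statement that the problem is an LP.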


19 Experimental Setup. Power model: based on a commercial processor benchmark, using libraries characterized at supply voltages from 0.6 V to 0.9 V. Thermal model: HotSpot simulator, considering 2x2, 4x4, 8x8 and 16x16 cores. BTI: both NBTI and PBTI. EM: metal sized to have the same current density (MTTF).

20 Local Power Network EM Results. [Figure panels: power constraint; thermal constraint. Annotation: 40% reduction.]

21 Signal Wire EM Results. [Figure panels: power constraint; thermal constraint. Annotation: 60% reduction.]

22 BTI Results. [Figure panels: power constraint; thermal constraint. Annotation: 20% reduction.]

23 Conclusion. We propose a hardware reliability margining methodology for chips in the dark silicon era. We formulate the margining problem under power and thermal constraints. Experimental results show that at a 60% dark ratio, our method can achieve a 40%-60% reduction in metal width margin and a 20% reduction in BTI delay margin.

24 Backup slides

25 EM Accumulation Model. Effective current density: [equation not preserved in transcript]. For the local power mesh, J_eff can be calculated from the average power consumed. For signal wires, J_eff is proportional to V * f.
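The signal-wire case can be sketched numerically. The slide's J_eff equation is not preserved, so the code assumes a time-weighted average of V * f over the power states, which gives only a relative figure (the proportionality constant is dropped). The function name and the frequencies are hypothetical; the 0.6 V and 0.9 V endpoints come from the experimental setup.

```python
def relative_signal_jeff(time_fractions, voltages, frequencies):
    """Relative effective current density for a signal wire.

    Assumes J_eff is the time-weighted average of V * f across power
    states (an illustrative assumption; the constant of proportionality
    is omitted, so only ratios between results are meaningful).
    """
    if abs(sum(time_fractions) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(t * v * f for t, v, f in zip(time_fractions, voltages, frequencies))

voltages = [0.6, 0.9]       # supply voltages in volts
frequencies = [1.0, 2.0]    # hypothetical clock frequencies in GHz

# Half the lifetime in each state versus the whole lifetime in the high state.
mixed = relative_signal_jeff([0.5, 0.5], voltages, frequencies)
always_high = relative_signal_jeff([0.0, 1.0], voltages, frequencies)
print(mixed, always_high, mixed / always_high)
```

Under this assumption, a wire sized for a core that cannot stay in the high state all the time sees a lower effective current density than the always-high worst case, which is the source of the metal width margin that can be recovered.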

26 BTI Accumulation Model. Two steps. (1) Identify the worst-case ordering by simulation: the worst BTI degradation happens when power states are applied in increasing order of stress voltage. (2) Fit the accumulation model: first pick a set of power-state distribution samples x, simulate the degradation g(x), then fit an assumed fitting function to the samples. [Fitting function and fitting formulation not preserved in transcript.]

