-1- UCSD VLSI CAD Laboratory and UIUC PASSAT Group
Recovery-Driven Design: A Power Minimization Methodology for Error-Tolerant Processor Modules
Andrew B. Kahng†, Seokhyeong Kang†, Rakesh Kumar‡ and John Sartori‡
† VLSI CAD Laboratory, UCSD
‡ PASSAT Group, UIUC
DAC, June 17, 2010
-2- Outline
Background and Motivation
  – Voltage scaling and error-tolerant design
  – Error-tolerant design vs. recovery-driven design
Recovery-Driven Design
  – Related work
  – Heuristic: power minimization
  – Error rate estimation
Experimental Framework and Results
  – Design methodology
  – Results and analysis
Conclusions and Ongoing Work
-3- Reducing Power with Voltage Scaling
Power is a first-order design constraint
  – Moore’s law implies that the power density of processors continues to escalate
Voltage scaling reduces power but eventually causes massive timing violations
Error resilience allows deeper voltage scaling
[Figure: voltage axis marking the point at which timing errors begin to occur]
-4- Error-Tolerance Mechanisms
Traditional IC design
  – No errors allowed
  – Overclocking and voltage overscaling not enabled
Error-tolerant design
  – Error correction architecture allows timing errors
  – Overclocking and voltage overscaling enabled
Hardware error-tolerance
  – Errors are detected and corrected during runtime
  – Razor (MICRO 2003)
Application-level error-tolerance*
  – Errors are allowed to propagate to software, resulting in reduced performance or output quality
*Hegde et al. “Energy-Efficient Signal Processing via Algorithmic Noise-Tolerance”, ISLPED 1999
-5- Our Work: From Error-Tolerance to Recovery-Driven
Error-tolerant design
  – Design still optimized for correct operation
  – Design methodology based on STA, workload-agnostic
Recovery-driven design
  – Designed “from ground up” for a specific target error rate
  – Design methodology exploits functional information
-6- Recovery-Driven Design
1. OptimizePaths: minimize error rate to extend the range of voltage scaling
2. ReducePower: reduce design power with cell downsizing or Vt swap
[Figure: error rate and power versus decreasing voltage for the traditional and optimized designs; at the target error rate, the operating point (V min, P min) moves to a new operating point (V min, P min)]
How to minimize power in recovery-driven design?
-8- Related Work: Design-Level Optimizations for Error-Tolerant Processors
BlueShift*
  – Increase frequency up to a target error rate
  – Speed up error paths with timing overrides and FBB
Slack Optimizer**
  – Reshape slack into a gradual slope so that the error rate increases gracefully
  – Estimate error rate using switching activity from SAIF
*Greskamp et al. “Blueshift: Designing Processors for Timing Speculation from the Ground up”, HPCA 2009
**Kahng et al. “Slack Redistribution for Graceful Degradation Under Voltage Overscaling”, ASPDAC 2010
-9- Recovery-Driven Design Methodology
Problem: minimize processor power (leakage + dynamic) for a target error rate
Approach: we use slack redistribution and power reduction enabled by accurate error rate estimation
Slack redistribution: reshape path slack based on path activity (toggle rate) to minimize error rate and extend voltage scaling (OptimizePaths and ReducePower heuristics)
Error rate estimation using a simulation dump file (VCD)
-10- Slack Redistribution
Redistribute slack from paths that rarely toggle to paths that frequently toggle
[Figure: slack redistribution; labels: OptimizePaths, ReducePower]
-11- Slack Redistribution Flow
Toggle Information: simulation dump file is loaded
Path Optimization: minimize error rate to extend range of voltage scaling
Power Reduction: downsize cells to obtain additional power savings
Error Rate Estimation: estimate with toggle info and STA results
[Flowchart nodes: Netlist, VCD, Analyze activity, Timing Analysis, OptimizePaths, Compute Error Rate, decision “ER > ER target” (YES / NO), Reduce Voltage, ReducePower, ECO P&R]
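The voltage-scaling part of this flow can be sketched in Python as below. This is one plausible reading of the flowchart, not the authors' implementation: the voltage is lowered step by step, path optimization runs at each step, and scaling stops once the estimated error rate exceeds the target. The names `scale_voltage`, `optimize_paths` and `error_rate_at` are hypothetical placeholders for the STA-driven steps, and the sketch assumes the starting voltage already meets the target. ReducePower and ECO P&R would follow at the returned voltage.

```python
from typing import Callable


def scale_voltage(
    v_start: float,
    v_step: float,
    v_floor: float,
    er_target: float,
    optimize_paths: Callable[[float], None],   # hypothetical: path optimization at voltage v
    error_rate_at: Callable[[float], float],   # hypothetical: estimated error rate at voltage v
) -> float:
    """Return the lowest stepped voltage whose estimated error rate stays within er_target."""
    v_ok = v_start  # assumption: the starting voltage meets the target
    steps = 1
    while True:
        # Recompute from v_start to avoid accumulating floating-point drift.
        v = round(v_start - steps * v_step, 6)
        if v < v_floor:
            break
        optimize_paths(v)                  # try to remove errors at this voltage
        if error_rate_at(v) > er_target:   # target exceeded: stop scaling
            break
        v_ok = v
        steps += 1
    return v_ok


if __name__ == "__main__":
    # Toy error model, made up for illustration only.
    toy_er = lambda v: 0.0 if v >= 0.95 else 0.01
    print(scale_voltage(1.0, 0.05, 0.5, 0.00125, lambda v: None, toy_er))
```

The step size corresponds to the voltage step sizes (0.01V and 0.05V) that the deck evaluates as a design choice.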
-12- Heuristic Details – OptimizePaths
Main idea: increase slack of frequently-exercised paths in order of decreasing toggle rate
Procedure
1. Pick a critical path p with maximum toggle rate
2. Resize cell instance c_i in p
3. If the path slack is not improved, restore the cell change
4. Repeat 2–3 for all cell instances in path p
5. Repeat 1–4 for all critical paths
OptimizePaths → ReducePower → Voltage Scaling
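The procedure above can be sketched as follows. This is a minimal illustration under a toy timing model, not the authors' C++/PrimeTime implementation: each cell has a hypothetical table of delays indexed by size, and path slack is the clock period minus the sum of cell delays. `Path`, `path_slack`, `optimize_paths` and the delay tables are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Path:
    name: str
    toggle_rate: float
    cells: List[str]


def path_slack(path: Path, sizes: Dict[str, int],
               delay: Dict[str, List[float]], period: float) -> float:
    """Toy model: slack = clock period - sum of cell delays at their current sizes."""
    return period - sum(delay[c][sizes[c]] for c in path.cells)


def optimize_paths(paths: List[Path], sizes: Dict[str, int],
                   delay: Dict[str, List[float]], period: float) -> Dict[str, int]:
    """Greedy resizing: visit critical paths by decreasing toggle rate (steps 1 and 5)."""
    for p in sorted(paths, key=lambda q: q.toggle_rate, reverse=True):
        for c in p.cells:                        # step 4: every cell instance in p
            if sizes[c] + 1 >= len(delay[c]):
                continue                         # no larger size available
            before = path_slack(p, sizes, delay, period)
            sizes[c] += 1                        # step 2: resize the cell
            if path_slack(p, sizes, delay, period) <= before:
                sizes[c] -= 1                    # step 3: slack not improved, restore
    return sizes


if __name__ == "__main__":
    delay = {"a": [3.0, 2.0], "b": [2.0, 2.5], "c": [4.0, 3.0]}
    paths = [Path("p1", 0.4, ["a", "b"]), Path("p2", 0.1, ["b", "c"])]
    print(optimize_paths(paths, {"a": 0, "b": 0, "c": 0}, delay, period=5.0))
```

In this toy data, upsizing cell `b` makes it slower, so the change is restored on both paths that contain it.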
-13- Heuristic Details – ReducePower
Main idea: downsize cells on non-critical paths in order of decreasing sensitivity
Sensitivity(c) = (power_c – power_c’) / (slack_c – slack_c’), where c’ is the downsized version of cell c
Procedure
1. Pick a cell c with maximum sensitivity
2. Replace cell c with a smaller, logically equivalent cell
3. Run incremental timing analysis and check the error rate
4. If the error rate increases, restore the cell change
5. Repeat 1–4
OptimizePaths → ReducePower → Voltage Scaling
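A minimal Python sketch of this loop is given below. It is an illustration only: the per-cell power and slack numbers are static toy values, `estimate_error_rate` is a hypothetical callback standing in for incremental timing analysis plus error rate estimation, and treating a downsizing that costs no slack as infinitely attractive is an assumption the slide does not address.

```python
from typing import Callable, Dict, Set


def sensitivity(power_now: float, power_down: float,
                slack_now: float, slack_down: float) -> float:
    """Power saved per unit of slack given up when the cell is downsized."""
    d_slack = slack_now - slack_down
    if d_slack <= 0:
        return float("inf")  # assumption: no slack cost means always worth trying
    return (power_now - power_down) / d_slack


def reduce_power(cells: Dict[str, Dict[str, float]],
                 estimate_error_rate: Callable[[Set[str]], float]) -> Set[str]:
    """Return the set of cells that stay downsized."""
    downsized: Set[str] = set()
    remaining = set(cells)
    er = estimate_error_rate(downsized)
    while remaining:
        # Step 1: pick the untried cell with maximum sensitivity (sorted for determinism).
        c = max(sorted(remaining), key=lambda n: sensitivity(**cells[n]))
        remaining.remove(c)
        downsized.add(c)                       # step 2: downsize
        new_er = estimate_error_rate(downsized)  # step 3: check the error rate
        if new_er > er:
            downsized.remove(c)                # step 4: error rate increased, restore
        else:
            er = new_er
    return downsized


if __name__ == "__main__":
    cells = {
        "u1": dict(power_now=10.0, power_down=6.0, slack_now=0.5, slack_down=0.3),
        "u2": dict(power_now=8.0, power_down=7.0, slack_now=0.4, slack_down=0.1),
        "u3": dict(power_now=5.0, power_down=2.0, slack_now=0.1, slack_down=0.0),
    }
    # Toy model: downsizing u3 introduces timing errors, the others do not.
    print(sorted(reduce_power(cells, lambda s: 0.01 if "u3" in s else 0.0)))
```

In a real flow the sensitivities would be recomputed after each accepted swap, since slack changes propagate to neighbouring cells; the toy values here are fixed.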
-14- Path Extraction for Error Rate Estimation
Instead of simulation, we use toggle information from a value change dump (VCD) file
[Figure: list of toggled nets in each cycle, over time]
-15- Toggle and Error Rate Calculation
20X faster than actual simulation, and accurate
Toggle rate of path p: |χ_toggle(p)| / X_tot
Error rate: (# cycles in which at least one negative-slack path toggles) / X_tot
  p: path
  χ_toggle(p): set of cycles in which p has toggled
  X_tot: total number of cycles
*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.
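The two quantities can be illustrated with a short Python sketch. It assumes the VCD has already been reduced to one set of toggled nets per cycle, that a path counts as toggled in a cycle when all of its nets toggle in that cycle, and that a cycle is erroneous when any negative-slack path toggles in it. All three are simplifying assumptions for the example, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def toggled_cycles(path_nets: List[str], toggles_per_cycle: List[Set[str]]) -> Set[int]:
    """Cycles in which every net of the path toggled (assumed toggle criterion)."""
    need = set(path_nets)
    return {i for i, nets in enumerate(toggles_per_cycle) if need <= nets}


def toggle_rate(path_nets: List[str], toggles_per_cycle: List[Set[str]]) -> float:
    """Fraction of all cycles in which the path toggled."""
    return len(toggled_cycles(path_nets, toggles_per_cycle)) / len(toggles_per_cycle)


def error_rate(paths: Dict[str, List[str]], slack: Dict[str, float],
               toggles_per_cycle: List[Set[str]]) -> float:
    """Fraction of cycles in which at least one negative-slack path toggled."""
    bad: Set[int] = set()
    for name, nets in paths.items():
        if slack[name] < 0:
            bad |= toggled_cycles(nets, toggles_per_cycle)
    return len(bad) / len(toggles_per_cycle)


if __name__ == "__main__":
    toggles = [{"n1", "n2"}, {"n1"}, {"n2", "n3"}, {"n1", "n2", "n3"}]
    paths = {"p1": ["n1", "n2"], "p2": ["n2", "n3"]}
    print(toggle_rate(paths["p1"], toggles))
    print(error_rate(paths, {"p1": -0.1, "p2": -0.2}, toggles))
```

Taking the union of cycles, rather than summing per-path toggle rates, avoids counting a cycle twice when several failing paths toggle together.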
-16- Evaluation of Heuristic Design Choices
Path ordering
  – toggle rate * slack
  – toggle rate
Optimization radius
  – path only
  – fan-in/out network
Starting netlist
  – loosely constrained
  – tightly constrained
Voltage step size
  – 0.01V and 0.05V
-18- Design Methodology
System-level simulation using Simics with real benchmarks
Gate-level simulation (NC-Verilog) to get signal toggle information
Prepare Synopsys Liberty file using Cadence SignalStorm
Implement the heuristics in C++ and use a Tcl socket to communicate with PrimeTime
Perform ECO P&R with the cell swap list
-19- Power Analysis for Real Workloads
Flow stages and tools
  – system-level simulation: Simics + Transplant
  – functional simulation: VCS or NCVerilog
  – design implementation: DC, SOCE
  – memory modeling: MEMGEN, CACTI
  – power analysis: PrimeTime-PX
Inputs and intermediate files: RTL design (OpenSPARC), benchmark binary (bzip, twolf...), input pattern, VCD, netlist, SPEF, Liberty (.lib)
Run system-level simulation with real benchmark binaries and capture the input patterns
Estimate power of memory: MEMGEN, CACTI
Analyze leakage and dynamic power using PT-PX
-20- Testbed
Target design: sub-modules of OpenSPARC T1
Benchmarks: ammp, bzip2, equake, twolf, sort (fast-forward, then capture vectors)
Implementation: TSMC 65GP technology with standard SP&R
Alternative design techniques:
  – SP&R with loose constraints and tight constraints
  – Slack Optimizer (make a “gradual slope”) [ASPDAC2010]
-21- Power Consumption of Each Design Technique
Power savings compared to traditional SP&R design: 25% power savings at 0.125% error rate (average)
Area overhead and power savings (relative to loose SP&R)
                               Tight SP&R   Slack Optimizer   Power Optimizer
Area overhead                  25.9%        3.7%              7.7%
Power savings at 0.125% error  12%          14%               25%
[Figure: LSU_STB_CTL; axis: error rate (%)]
-22- Power Consumption for HW-Based Error Tolerance
Razor architecture was assumed for error detection and correction; the analysis accounts for Razor overhead (area, power) and the power cost of error correction
[Figure: LSU_STB_CTL; marked voltages 0.84V and 0.76V; 21% additional power savings]
-23- Conclusions and Ongoing Work
We propose recovery-driven design, which minimizes power for a target timing error rate
  – Optimize designs with functional information and iterative voltage scaling
  – We also develop a fast and accurate technique for post-layout activity and error rate estimation
We demonstrate significant power benefits: up to 25% power savings compared to traditional P&R at an error rate of 0.125%
Ongoing work
  – Recovery-driven design for different error resilience mechanisms, different sources of variation
  – Design / architecture co-exploration
-24- Thank you
-25- BACKUP
-26- Related Work: BlueShift
BlueShift*: maximize frequency for a given error rate
BlueShift speedup
  – Targets paths with the highest frequency of timing errors
  – FBB (forward body-biasing) and timing override
Limitation
  – Repetitive gate-level simulation is impractical
  – Design overhead of FBB
[Flowchart nodes: Gate-level simulation, Compute error rate, decision “ER < Target” (YES / NO), Speed up paths, Finish]
*Greskamp et al. “Blueshift: Designing processors for timing speculation from the ground up”, HPCA 2009
-27- Exploiting Error Resilience for Multi-core Design
Design of heterogeneously reliable multi-core processor
  – Power-optimized for different mixes of workloads
  – Power-optimized for different reliability targets
  – Individual cores are customized for a specific workload class
-28- Lifetime Energy Minimization
Maximizing energy efficiency of DVFS-based designs
  – Inefficiency is due to a design optimized for a single power / performance point
  – Minimize energy when the processor spends R of its lifetime at high freq. (e.g., talk mode) and (1 – R) of its lifetime at low freq. (e.g., standby mode)
Replication-based methodology: area overhead vs. power tradeoffs
Co-optimization methodology: optimize design with two operating constraints, (freq_hi, V_hi) and (freq_lo, V_lo)
Either methodology can be applied to each sub-module
-29- Sensitivity-Based Optimization Platform
Post-layout stage cell swap
  – Cell sizing + ECO
  – Multi-Vt swap
  – Multi-Lgate swap
Swap cell and check STA with PrimeTime socket interface
Cell swap according to the sensitivity S
  – For leakage optimization, S = Δleakage × slack
  – For timing closure, S = Δslack / (slack – WNS)
MMMC (Multi-Mode Multi-Corner) can be considered with multiple PrimeTime sockets
[Figure: Lgate biasing]
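The two sensitivity functions can be written out as a small Python sketch for ranking candidate swaps. The function names and the candidate tuples are hypothetical, and returning infinity when a cell sits exactly on the worst path (slack equal to WNS, where the slide's formula would divide by zero) is an assumption, not something the slide specifies.

```python
from typing import List, Tuple


def leakage_sensitivity(delta_leakage: float, slack: float) -> float:
    """S = Δleakage × slack: favors large leakage savings on cells with ample slack."""
    return delta_leakage * slack


def timing_sensitivity(delta_slack: float, slack: float, wns: float) -> float:
    """S = Δslack / (slack - WNS): favors slack gains on cells closest to the worst path."""
    gap = slack - wns
    if gap <= 0:
        return float("inf")  # assumption: cell on the worst path gets top priority
    return delta_slack / gap


def rank_for_leakage(candidates: List[Tuple[str, float, float]]) -> List[str]:
    """candidates: (cell name, Δleakage, slack); highest sensitivity first."""
    ordered = sorted(candidates, key=lambda c: leakage_sensitivity(c[1], c[2]), reverse=True)
    return [name for name, _, _ in ordered]


if __name__ == "__main__":
    print(rank_for_leakage([("u1", 2.0, 0.5), ("u2", 4.0, 0.125), ("u3", 1.0, 2.0)]))
    print(timing_sensitivity(0.5, 0.25, -0.25))
```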
-30- Limitations of Traditional CAD Flow
In modern digital design, the vast majority of paths have near-critical slack, forming a “wall of slack” in the slack distribution
Scaling beyond a critical operating point causes massive errors, so power benefits can be limited*
Error rate = (# cycles which have timing error) / (# total cycles)
  – 0.0 % at 1.00V
  – 1.0 % at 0.95V
  – 20.0 % at 0.90V
[Figure: number of paths versus timing slack, with the “wall of slack” near zero slack; error rate versus lower voltage (higher frequency), with the operating point marked]
*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.