-1- UCSD VLSI CAD Laboratory and UIUC PASSAT Group Recovery-Driven Design: A Power Minimization Methodology for Error-Tolerant Processor Modules Andrew.

Presentation transcript:

-1- UCSD VLSI CAD Laboratory and UIUC PASSAT Group
Recovery-Driven Design: A Power Minimization Methodology for Error-Tolerant Processor Modules
Andrew B. Kahng †, Seokhyeong Kang †, Rakesh Kumar ‡ and John Sartori ‡
† VLSI CAD Laboratory, UCSD
‡ PASSAT Group, UIUC
DAC, June 17, 2010

-2- Outline
Background and Motivation
–Voltage scaling and error-tolerant design
–Error-tolerant design vs. recovery-driven design
Recovery-Driven Design
–Related work
–Heuristic: power minimization
–Error rate estimation
Experimental Framework and Results
–Design methodology
–Results and analysis
Conclusions and Ongoing Work

-3- Reducing Power with Voltage Scaling
Power is a first-order design constraint
–Moore’s law implies power density of processors continues to escalate
Voltage scaling reduces power but eventually causes massive timing violations
Error-resilience allows deeper voltage scaling
[Figure: voltage axis, marked at the point where timing errors begin to occur]

-4- Error-Tolerance Mechanisms
Traditional IC design
–No errors allowed
–Overclocking and voltage overscaling not enabled
Error-Tolerant design
–Error correction architecture allows timing errors
–Overclocking and voltage overscaling enabled
Hardware error-tolerance
–Errors are detected and corrected during runtime
–Razor (MICRO 2003)
Application-level error-tolerance*
–Errors are allowed to propagate to software, resulting in reduced performance or output quality
*Hegde et al. “Energy-Efficient Signal Processing via Algorithmic Noise-Tolerance”, ISLPED 1999

-5- Our Work: From Error-Tolerance to Recovery-Driven
Error-Tolerant design
–Design still optimized for correct operation
–Design methodology based on STA, workload-agnostic
Recovery-Driven design
–Designed “from ground up” for specific target error rate
–Design methodology exploits functional information

-6- Recovery-Driven Design
1. Minimize error rate to extend range of voltage scaling (OptimizePaths)
2. Reduce design power with cell downsizing or Vt swap (ReducePower)
[Figure: error rate and power versus voltage for the traditional and optimized designs; at the target error rate, the optimized design moves from the operating point (V min, P min) to a new operating point at lower voltage and power]
How to minimize power in recovery-driven design?

-7- Outline
Background and Motivation
–Voltage scaling and error-tolerant processor
–Error-tolerant design vs. recovery-driven design
Recovery-Driven Design
–Related work
–Heuristic: power minimization
–Error rate estimation
Experimental Framework and Results
–Design methodology
–Results and analysis
Conclusions and Ongoing Work

-8- Related Work: Design-Level Optimizations for Error-Tolerant Processors
BlueShift*
–Increase frequency up to a target error rate
–Speed up error paths with timing overrides and FBB
Slack Optimizer**
–Reshape the slack distribution into a gradual slope so that the error rate increases gracefully
–Estimate error rate using switching activity from SAIF
*Greskamp et al. “Blueshift: Designing Processors for Timing Speculation from the Ground up”, HPCA 2009
**Kahng et al. “Slack Redistribution for Graceful Degradation Under Voltage Overscaling”, ASPDAC 2010

-9- Recovery-Driven Design Methodology
Problem: minimize processor power (leakage + dynamic) for a target error rate
Approach: we use slack redistribution and power reduction enabled by accurate error rate estimation
Slack redistribution: reshape path slack based on path activity (toggle rate) to minimize error rate and extend voltage scaling (OptimizePaths and ReducePower heuristics)
Error rate estimation using a simulation dump file (VCD)

-10- Slack Redistribution
Redistribute slack from paths that rarely toggle to paths that frequently toggle
[Figure: slack redistribution, labeled with the OptimizePaths and ReducePower steps]

-11- Slack Redistribution Flow
Toggle Information: simulation dump file is loaded
Path Optimization: minimize error rate to extend range of voltage scaling
Power Reduction: downsize cells to obtain additional power savings
Error Rate Estimation: estimate with toggle info and STA results
[Flowchart nodes: Netlist, VCD, Analyze activity, Timing Analysis, OptimizePaths, ReducePower, Compute Error Rate, decision “ER > ER target” (YES / NO), Reduce Voltage, ECO P&R]
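The outer loop of this flow can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `optimize_and_estimate` is a hypothetical callback that stands in for running OptimizePaths, ReducePower and error rate estimation at a given voltage, voltages are integer millivolts to avoid floating-point drift, and the stopping rule (keep the lowest voltage whose error rate does not exceed the target) is one reading of the flowchart.

```python
from typing import Callable


def scale_voltage(optimize_and_estimate: Callable[[int], float],
                  v_start_mv: int,
                  v_step_mv: int,
                  er_target: float,
                  v_floor_mv: int = 0) -> int:
    """Lower the voltage step by step while the error rate stays within target.

    optimize_and_estimate(v_mv) is a hypothetical hook: it is assumed to
    optimize the design at v_mv and return the estimated error rate (0..1).
    Returns the lowest accepted voltage in millivolts.
    """
    if v_step_mv <= 0:
        raise ValueError("voltage step must be positive")
    v = v_start_mv
    while v - v_step_mv >= v_floor_mv:
        candidate = v - v_step_mv
        if optimize_and_estimate(candidate) > er_target:
            break          # target exceeded: keep the previous voltage
        v = candidate      # accept the lower voltage and try again
    return v
```

The step size corresponds to the "voltage step size" design choice evaluated later in the deck (0.01V and 0.05V); a smaller step costs more iterations but can stop closer to the target error rate.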

-12- Heuristic Details – OptimizePaths
Main idea: increase slack of frequently-exercised paths in order of decreasing toggle rate
Procedure
1. Pick a critical path p with maximum toggle rate
2. Resize cell instance c_i in p
3. If the path slack is not improved, cell change is restored
4. Repeat 2–3 for all cell instances in path p
5. Repeat 2–4 for all critical paths
Flow: OptimizePaths → ReducePower → Voltage Scaling
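The procedure above can be illustrated with a toy model. This is a sketch under stated assumptions rather than the authors' tool: the delay model, the `SIZES` list, and the `Path` class are hypothetical, "critical" is taken to mean negative slack, each cell is tried at only the next larger size, and paths are assumed not to share cells. A real flow would query a timer such as PrimeTime instead of `path_delay`.

```python
from dataclasses import dataclass
from typing import List

SIZES = [1, 2, 4]  # hypothetical drive strengths available for every cell


@dataclass
class Path:
    name: str
    toggle_rate: float   # fraction of cycles in which the path toggles
    sizes: List[int]     # drive strength of each cell along the path


def path_delay(sizes: List[int]) -> float:
    """Toy delay model: a cell's delay is (1 + load) / size, where load is
    the size of the next cell on the path (0 for the last cell)."""
    loads = sizes[1:] + [0]
    return sum((1 + load) / size for size, load in zip(sizes, loads))


def slack(path: Path, clock_period: float) -> float:
    return clock_period - path_delay(path.sizes)


def optimize_paths(paths: List[Path], clock_period: float) -> List[str]:
    """Upsize cells on critical paths, most frequently toggled path first.
    Returns the names of the paths in the order they were processed."""
    critical = [p for p in paths if slack(p, clock_period) < 0]
    visited = []
    for p in sorted(critical, key=lambda q: q.toggle_rate, reverse=True):
        visited.append(p.name)
        for i in range(len(p.sizes)):
            old = p.sizes[i]
            idx = SIZES.index(old)
            if idx + 1 == len(SIZES):
                continue                      # already the largest size
            before = slack(p, clock_period)
            p.sizes[i] = SIZES[idx + 1]       # step 2: resize the cell
            if slack(p, clock_period) <= before:
                p.sizes[i] = old              # step 3: restore if no gain
    return visited
```

In this toy model, upsizing a cell speeds it up but loads the cell before it, so some resizes do not improve slack and are restored, which is the situation step 3 guards against.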

-13- Heuristic Details – ReducePower
Main idea: downsize cells on non-critical paths in order of decreasing sensitivity
Sensitivity(c) = (power_c - power_c') / (slack_c - slack_c')
Procedure
1. Pick a cell c with maximum sensitivity
2. Downsize cell c with logically equivalent cell
3. Incremental timing analysis and check error rate
4. If error rate is increased, cell change is restored
5. Repeat 1–4
Flow: OptimizePaths → ReducePower → Voltage Scaling
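A toy version of this loop is sketched below. It is an illustration under stated assumptions, not the authors' code: the dictionary layout is hypothetical, `power_saving` stands for power_c - power_c' and `slack_cost` for slack_c - slack_c', each cell is assumed to sit on a single path, sensitivities are computed once rather than updated after every swap, and the error rate is modeled as the fraction of cycles in which some negative-slack path toggles.

```python
def estimate_error_rate(paths, total_cycles):
    """Fraction of cycles in which at least one negative-slack path toggles."""
    failing = set()
    for p in paths.values():
        if p["slack"] < 0:
            failing |= p["toggled_cycles"]
    return len(failing) / total_cycles


def sensitivity(cell):
    # (power_c - power_c') / (slack_c - slack_c')
    return cell["power_saving"] / cell["slack_cost"]


def reduce_power(cells, paths, total_cycles):
    """Try downsizing cells in order of decreasing sensitivity.
    Returns (names of accepted swaps, total power saved)."""
    accepted, saved = [], 0.0
    for cell in sorted(cells, key=sensitivity, reverse=True):
        path = paths[cell["path"]]
        before = estimate_error_rate(paths, total_cycles)
        old_slack = path["slack"]
        path["slack"] = old_slack - cell["slack_cost"]   # downsize the cell
        if estimate_error_rate(paths, total_cycles) > before:
            path["slack"] = old_slack                    # restore the change
        else:
            accepted.append(cell["name"])
            saved += cell["power_saving"]
    return accepted, saved
```

Because acceptance depends on the error rate rather than on slack alone, a cell on a rarely toggled path can be downsized more aggressively than one on a frequently toggled path.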

-14- Path Extraction for Error Rate Estimation
Instead of simulation, we use toggle information from value change dump (VCD) file
[Figure: list of toggled nets in each cycle, over time]

-15- Toggle and Error Rate Calculation
20X faster than actual simulation and accurate
Toggle rate: [equation]
Error rate: [equation]
where p: path; χ_toggle: set of cycles in which p has toggled; X_tot: total number of cycles
*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.
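The two equations did not survive in this transcript, so the sketch below is an assumption built only from the symbols defined on the slide: toggle rate of a path is taken as |χ_toggle(p)| / X_tot, and error rate as the fraction of cycles in which at least one negative-slack path toggles. Treating a path as toggled when all of its nets toggle in the same cycle is also an assumption, and the function names are hypothetical.

```python
def toggled_cycles(path_nets, cycles):
    """Indices of cycles in which every net on the path toggles.
    cycles[i] is the set of nets that toggled in cycle i (e.g. from a VCD)."""
    needed = set(path_nets)
    return {i for i, nets in enumerate(cycles) if needed <= nets}


def toggle_rate(path_nets, cycles):
    """Assumed form: |chi_toggle(p)| / X_tot."""
    return len(toggled_cycles(path_nets, cycles)) / len(cycles)


def error_rate(paths, slacks, cycles):
    """Assumed form: cycles in which any negative-slack path toggles,
    divided by the total number of cycles."""
    bad = set()
    for name, nets in paths.items():
        if slacks[name] < 0:
            bad |= toggled_cycles(nets, cycles)
    return len(bad) / len(cycles)
```

Taking the union of cycles, rather than summing per-path toggle rates, avoids counting a cycle twice when several failing paths toggle together.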

-16- Evaluation of Heuristic Design Choices
Path ordering
–toggle rate * slack
–toggle rate
Optimization radius
–path only
–fan-in/out network
Starting netlist
–loosely constrained
–tightly constrained
Voltage step size
–0.01V and 0.05V

-17- Outline
Background and Motivation
–Voltage scaling and error-tolerant processor
–Error-tolerant design vs. recovery-driven design
Recovery-Driven Design
–Related work
–Heuristic: power minimization
–Error rate estimation
Experimental Framework and Results
–Design methodology
–Results and analysis
Conclusions and Ongoing Work

-18- Design Methodology
System-level simulation using Simics with real benchmarks
Gate-level simulation to get signal toggle information (NC-Verilog)
Prepare Synopsys Liberty file using Cadence SignalStorm
Implement in C++ and use Tcl socket to communicate with PrimeTime
Perform ECO P&R with cell swap list

-19- Power Analysis for Real Workloads
[Flow diagram: system-level simulation (Simics + Transplant); functional simulation (VCS or NC-Verilog); design implementation (DC, SOCE); memory modeling (MEMGEN, CACTI); power analysis (PrimeTime-PX). Inputs and intermediate files: RTL design (OpenSPARC), benchmark binary (bzip, twolf...), input pattern, VCD, netlist, SPEF, Liberty (.lib)]
System-level simulation runs the real benchmark binary, and input patterns are captured
Estimate power of memory: MEMGEN, CACTI
Analyze leakage and dynamic power using PT-PX

-20- Testbed
Target design: sub-modules of OpenSPARC T1
Benchmark: ammp, bzip2, equake, twolf, sort. Fast-forward, capture vectors
Implementation: TSMC 65GP technology with standard SP&R
Alternative design techniques:
–SP&R with loose constraints and tight constraints
–Slack Optimizer (make a “gradual slope”) [ASPDAC2010]

-21- Power Consumption of Each Design Technique
Power savings compared to traditional SP&R design
25% power savings at 0.125% error rate (average)
Area overhead and power savings (from loose SP&R)

                                Tight SP&R   Slack Optimizer   Power Optimizer
Area overhead                   25.9%        3.7%              7.7%
Power savings at 0.125% error   12%          14%               25%

[Figure: power versus error rate (%) for LSU_STB_CTL]

-22- Power Consumption for HW-Based Error Tolerance
Razor architecture was assumed for error detection and correction; the results account for Razor overhead (area, power) and the power cost of error correction
[Figure: LSU_STB_CTL, annotated with 0.84V, 0.76V and 21% additional power savings]

-23- Conclusions and Ongoing Work
We propose recovery-driven design, which minimizes power for a target timing error rate
–Optimize designs with functional information and iterative voltage scaling
–We also develop a fast and accurate technique for post-layout activity and error rate estimation
We demonstrate significant power benefits: up to 25% power savings compared to traditional P&R at an error rate of 0.125%
Ongoing work
–Recovery-driven design for different error resilience mechanisms, different sources of variation
–Design / architecture co-exploration

-24- Thank you

-25- BACKUP

-26- Related Work: BlueShift
BlueShift*: maximize frequency for a given error rate
BlueShift speedup
–Paths with the highest frequency of timing errors
–FBB (forward body-biasing) and timing override
Limitation
–Repetitive gate-level simulation is impractical
–Design overhead of FBB
[Flowchart nodes: Gate-level simulation, Compute error rate, decision “ER < Target” (YES / NO), Speed up paths, Finish]
*Greskamp et al. “Blueshift: Designing processors for timing speculation from the ground up”, HPCA 2009

-27- Exploiting Error Resilience for Multi-core Design
Design of heterogeneously reliable multi-core processor
Power-optimized for different mixes of workloads
Power-optimized for different reliability targets
Individual cores are customized for a specific workload class

-28- Lifetime Energy Minimization
Maximizing energy efficiency of DVFS-based designs
–Inefficiency is due to a design optimized for a single power / performance point
–Minimize energy when the processor spends R of its lifetime at high freq. (e.g., talk mode) and (1 – R) of its lifetime at low freq. (e.g., standby mode)
Replication-based methodology: area overhead vs. power tradeoffs
Co-optimization methodology: optimize design with two operating constraints, (freq_hi, V_hi) and (freq_lo, V_lo)
Both methodologies can be applied alternatively in each sub-module
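The objective on this slide can be made concrete with a small calculation. This is a sketch under assumptions: lifetime energy is taken as lifetime multiplied by the time-weighted average power, so for a fixed lifetime it is enough to compare average power; the function names and the example power numbers are hypothetical and are not results from the deck.

```python
def lifetime_average_power(p_hi: float, p_lo: float, r: float) -> float:
    """Average power when a fraction r of the lifetime is spent at the
    high-frequency point (power p_hi) and 1 - r at the low-frequency
    point (power p_lo)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be between 0 and 1")
    return r * p_hi + (1.0 - r) * p_lo


def pick_design(designs: dict, r: float) -> str:
    """designs maps a design name to (p_hi, p_lo); return the name with
    the lowest lifetime average power for the given r."""
    return min(designs, key=lambda name: lifetime_average_power(*designs[name], r))
```

For example, a design tuned only for the high-frequency point can lose to a co-optimized design once most of the lifetime is spent at the low-frequency point, which is the inefficiency the slide describes.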

-29- Sensitivity-Based Optimization Platform
Post-layout stage cell swap
–Cell sizing + ECO
–Multi-Vt swap
–Multi-Lgate swap (Lgate biasing)
Swap cell and check STA with PrimeTime socket interface
Cell swap according to the sensitivity S
–For leakage optimization, S = Δleakage x slack
–For timing closure, S = Δslack / (slack – WNS)
MMMC (Multi-Mode Multi-Corner) can be considered with multiple PrimeTime sockets
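The two sensitivity functions can be written out directly. This is a minimal sketch: the function names and the candidate dictionary layout are hypothetical, and the handling of slack equal to WNS (where the timing-closure formula divides by zero) is an assumption, since the slide does not say how that case is treated.

```python
import math


def leakage_sensitivity(delta_leakage: float, slack: float) -> float:
    """S = delta_leakage x slack (leakage optimization)."""
    return delta_leakage * slack


def timing_sensitivity(delta_slack: float, slack: float, wns: float) -> float:
    """S = delta_slack / (slack - WNS) (timing closure)."""
    margin = slack - wns
    if margin == 0:
        # Assumption: a swap that helps the worst path gets top priority.
        return math.inf if delta_slack > 0 else 0.0
    return delta_slack / margin


def rank_by_leakage(candidates):
    """Order candidate swaps by decreasing leakage sensitivity."""
    return sorted(
        candidates,
        key=lambda c: leakage_sensitivity(c["delta_leakage"], c["slack"]),
        reverse=True,
    )
```

Both forms favor the same kind of swap from different directions: the leakage form prefers large savings on paths with slack to spare, while the timing form prefers large slack gains on paths closest to the worst negative slack.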

-30- Limitations of Traditional CAD Flow
In modern digital design, the vast majority of paths have near-critical slack: the ‘wall of slack’ distribution
Scaling beyond a critical operating point causes massive errors and power benefits can be limited*
Error rate = (# cycles which have timing error) / (# total cycles)
[Figure: number of paths versus timing slack, with the ‘wall of slack’ near zero slack; error rate versus lower voltage (higher frequency) from the operating point: 0.0 % at 1.00V, 1.0 % at 0.95V, 20.0 % at 0.90V]
*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.