MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei.

Presentation transcript:

MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li Key Laboratory of Computer System and Architecture, ICT (Institute of Computing Technology), CAS, Beijing, P.R. China NVIDIA Corporation, USA

Outline: What is Path-grained Timing Adaptability (PTA)?; Potential of PTA for Efficiency Improvement; How to Exploit PTA; Case Study Results; Conclusions

Impact of DVFS on Path Delay. Suppose scaling the voltage down makes paths P1 and P2 timing-critical. What then? Traditionally, the frequency is scaled down for all stages of the pipeline. Question: can these emerging critical paths be salvaged so that the voltage can be scaled down further? Maybe yes, by fine-grained time stealing!

Timing Imbalance
Generous Flip-flop (GFF): Slack_up > TH, Slack_dn > TH
Backward Adaptable Flip-flop (BAFF): Slack_up > TH, Slack_dn ≤ TH
Forward Adaptable Flip-flop (FAFF): Slack_up ≤ TH, Slack_dn > TH
Unadaptable Flip-flop (UAFF): Slack_up ≤ TH, Slack_dn ≤ TH
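The four flip-flop classes can be expressed as a small decision function. This is only a sketch of the classification as the slide states it: the function name, the enum, and the assumption that both slacks and TH are given in the same (for example, cycle-normalized) units are illustrative and not taken from the presentation.

```python
from enum import Enum


class FlipFlopClass(Enum):
    GFF = "generous"
    BAFF = "backward adaptable"
    FAFF = "forward adaptable"
    UAFF = "unadaptable"


def classify_flip_flop(slack_up: float, slack_dn: float, th: float) -> FlipFlopClass:
    """Classify a flip-flop from its upstream and downstream slack.

    Hypothetical helper: slack_up, slack_dn and th are assumed to share
    the same units (e.g. fractions of the cycle period).
    """
    up_has_margin = slack_up > th
    dn_has_margin = slack_dn > th
    if up_has_margin and dn_has_margin:
        return FlipFlopClass.GFF
    if up_has_margin:
        # Only the upstream side has spare time to lend.
        return FlipFlopClass.BAFF
    if dn_has_margin:
        # Only the downstream side has spare time to lend.
        return FlipFlopClass.FAFF
    return FlipFlopClass.UAFF
```

For instance, with TH = 0.2, a flip-flop with Slack_up = 0.3 and Slack_dn = 0.1 would be classified as a BAFF under these assumptions.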

Intrinsic Timing Imbalance. Case study: the FPU adopted by OpenSPARC T1, which supports all IEEE 754 floating-point data types. Synthesized with Synopsys Design Compiler in UMC 0.18um technology. Cycle period: (1 + 10%) × T_critical. The GFFs, FAFFs, and BAFFs account for a considerable, even dominant, proportion of the flip-flops. Attractive potential!

DVFS Exacerbating Imbalance. Generally, the timing margin of longer paths diminishes much faster than that of shorter ones. Assume that the delay of a path is the sum of the delays of the gates on it. T_G: the gate delay. Delta: the change in gate delay caused by scaling the voltage down. Define: ΔS = |Slack_dn - Slack_up|. Example: a flip-flop sits between a path of n gates and a path of m gates (n > m).
Before voltage scaling down: ΔS_1 = (n - m) × T_G
After voltage scaling down: ΔS_2 = (n - m) × (T_G + Delta)
Hence ΔS_1 < ΔS_2.
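A quick numeric check of the imbalance growth, using the slide's sum-of-gate-delays assumption. The gate counts and delay values below are made up purely for illustration; only the two formulas come from the slide.

```python
def slack_imbalance(n_gates: int, m_gates: int, gate_delay: float) -> float:
    """Slack imbalance |Slack_dn - Slack_up| when each path's delay is
    (number of gates) x (gate delay), as the slide assumes."""
    return abs(n_gates - m_gates) * gate_delay


# Illustrative values only (not from the presentation).
n, m = 10, 6            # gates on the longer and shorter path
t_g = 1.0               # gate delay before voltage scaling (arbitrary units)
delta = 0.25            # extra gate delay after scaling the voltage down

delta_s1 = slack_imbalance(n, m, t_g)           # before scaling
delta_s2 = slack_imbalance(n, m, t_g + delta)   # after scaling

print(delta_s1, delta_s2)  # prints "4.0 5.0"
```

With these numbers the imbalance grows from 4.0 to 5.0 delay units, so there is more slack difference available to redistribute after the voltage is lowered.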

If the Imbalance Is Utilized… Consider the lower bound of the cycle period T. Traditionally: T_1 = n × (T_G + Delta). From MicroFix’s perspective: T_2 = (m + n)/2 × (T_G + Delta) ≤ T_1 - TH. Note: this excludes the UAFFs.
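The two cycle-period bounds can be compared numerically. This is a sketch under the same sum-of-gate-delays assumption; the function names and the sample values are hypothetical, and the TH term of the inequality is not modeled here.

```python
def cycle_bound_traditional(n_gates: int, gate_delay: float) -> float:
    """T_1: the cycle period is set by the longer, n-gate path."""
    return n_gates * gate_delay


def cycle_bound_balanced(n_gates: int, m_gates: int, gate_delay: float) -> float:
    """T_2: the cycle period if the slack of the shorter, m-gate path
    could be fully shared with the longer one."""
    return (m_gates + n_gates) / 2 * gate_delay


# Illustrative values only (not from the presentation).
n, m = 10, 6
scaled_gate_delay = 1.0 + 0.25   # T_G + Delta after voltage scaling

t1 = cycle_bound_traditional(n, scaled_gate_delay)
t2 = cycle_bound_balanced(n, m, scaled_gate_delay)

print(t1, t2)  # prints "12.5 10.0"
```

In this example the balanced bound is 20% shorter than the traditional one; the real gain depends on how unbalanced the two paths are.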

How to Deal with UAFFs? Two-supply-voltage scheme [Usami, JSSC’98] [Ghosh, TCAD’07]. Critical Isolation: the critical paths that result in UAFFs. The supply voltage of the Critical Isolation is more conservative than that of the rest of the circuit. Figure labels: Critical Isolation, powered by the conservative voltage; the exploitable scope of MicroFix, powered by the aggressive voltage.

How to “Fix”? Two-supply-voltage scheme; timing sensors [Yan, DATE’09] [Agarwal, VTS’07]; multiple-phase clocks (generated by a DLL).

Operational Principles: Ensure that the restored margins ‘v’ and ‘f’ can guard safe voltage and frequency tuning.

Experimental Setup. Gate level: study the adaptability and overhead with a synthesized FPU (timing information from PrimeTime). Transistor level: investigate the power-performance tradeoffs with Hspice simulations, using 32nm PTM models dedicated to HP and LP applications, respectively.

Exploring Design Tradeoffs. ‘TH’ plays a critical role in determining the ultimate efficiency. A smaller ‘TH’ gives a smaller Critical Isolation (CI) but less aggressive voltage reduction; a larger ‘TH’ gives a larger CI but more aggressive voltage reduction. Which ‘TH’ is optimal?

Exploring Design Tradeoffs /2 Percentage of Cells in Critical Isolation

Exploring Design Tradeoffs /3 Sensor Area Overhead: the area of a sensor is about 8x that of a pipeline flip-flop (based on the number of transistors) [Yan, DATE09]. Paths in the critical isolation and those with overly large slack (i.e., slack > T × TH + t_margin) do not need to be monitored by sensors.
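The monitoring rule can be sketched as a filter over path endpoints. Only the 8x area ratio and the slack condition come from the slide; the function names, the endpoint tuples, and all numeric values are hypothetical.

```python
# From the slide: one sensor is roughly 8 pipeline flip-flops in area.
SENSOR_AREA_IN_FLIP_FLOPS = 8


def needs_sensor(in_critical_isolation: bool, slack: float,
                 period: float, th: float, t_margin: float) -> bool:
    """Return True if an endpoint should be monitored by a timing sensor.

    Endpoints in the critical isolation, and those whose slack exceeds
    T x TH + t_margin, are left unmonitored.
    """
    if in_critical_isolation:
        return False
    return slack <= period * th + t_margin


# Illustrative endpoints: (in critical isolation?, slack). Made-up values.
endpoints = [(True, 0.05), (False, 0.25), (False, 0.5)]
period, th, t_margin = 1.0, 0.25, 0.0625

num_sensors = sum(
    needs_sensor(in_ci, slack, period, th, t_margin) for in_ci, slack in endpoints
)
sensor_area = num_sensors * SENSOR_AREA_IN_FLIP_FLOPS  # in flip-flop equivalents
```

Here only the second endpoint is monitored, so the sensor area equals that of about eight flip-flops.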

Exploring Design Tradeoffs /4 Sensor Power Overhead: 14% in the most pessimistic case (TH = 0.3, all sensors simultaneously flagging timing errors). However, such a worst-case power overhead is unlikely to occur, for three reasons: 1) sensors do not need to be always on; 2) it is almost impossible for all sensors to flag impending timing errors simultaneously; 3) TH = 0.3 is not actually an optimal configuration. Therefore, the pessimistic power overhead will not offset much of MicroFix’s efficiency gain!

Hspice Simulations. Objective: investigate the detailed delay-power relation of the target pipeline. Ideally, the transistor-level model of the target pipeline would be simulated directly with Hspice; however, that is very labor-intensive and time-consuming, so we took an indirect approach to the Hspice simulations:
P_total(V, F) = P_comb(V, F) + P_ff(V, F)
1/F = T = t_c + t_setup + t_c-to-q
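The decomposition can be written as two small functions. This is a minimal sketch that only composes the terms named on the slide: the component powers and delays are passed in as plain numbers (in the presentation they come from Hspice simulations), and all names and sample values are illustrative.

```python
def cycle_period(t_c: float, t_setup: float, t_c_to_q: float) -> float:
    """T = t_c + t_setup + t_c-to-q (combinational delay plus the
    flip-flop setup and clock-to-Q times)."""
    return t_c + t_setup + t_c_to_q


def total_power(p_comb: float, p_ff: float) -> float:
    """P_total = P_comb + P_ff at one (V, F) operating point."""
    return p_comb + p_ff


# Made-up numbers for one operating point (arbitrary units).
period = cycle_period(t_c=0.75, t_setup=0.125, t_c_to_q=0.125)
frequency = 1.0 / period
power = total_power(p_comb=3.0, p_ff=1.0)
```

Splitting the pipeline this way lets the combinational and sequential parts be characterized separately and then recombined per operating point.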

Combinational Component: ISCAS85 benchmarks (c432, c499, c880, c1355, c1908, c2670), simulated with 32nm PTM models (HP and LP versions). The normalized V-D and V-P relations agree well across all of the simulated benchmarks!

Sequential Component: V-D and V-P relations

Efficiency Comparison: TH = 0.2 is an optimal choice! Efficiency improvement: 35% EDP, 28% PDP
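For reference, the two metrics can be computed as follows. The sketch uses the conventional definitions (PDP = power × delay, EDP = energy × delay = power × delay²), which the slides do not spell out. The operating points are made up and do not reproduce the reported 35% and 28% figures.

```python
def pdp(power: float, delay: float) -> float:
    """Power-delay product: energy per operation."""
    return power * delay


def edp(power: float, delay: float) -> float:
    """Energy-delay product: PDP weighted once more by delay."""
    return power * delay * delay


def improvement(baseline: float, improved: float) -> float:
    """Relative reduction of a lower-is-better metric."""
    return 1.0 - improved / baseline


# Hypothetical operating points (arbitrary units, not from the presentation).
base_power, base_delay = 4.0, 1.0
tuned_power, tuned_delay = 2.0, 1.25

pdp_gain = improvement(pdp(base_power, base_delay), pdp(tuned_power, tuned_delay))
edp_gain = improvement(edp(base_power, base_delay), edp(tuned_power, tuned_delay))
```

Because EDP penalizes delay more heavily than PDP, the same change in operating point generally yields different improvements under the two metrics.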

Conclusion: MicroFix can improve DVFS efficiency by exploiting path-grained adaptability. The timing imbalance threshold, TH, implies a critical design tradeoff. Efficiency improves by up to 35% in EDP for HP applications and by up to 28% in PDP for LP applications, at the expense of only 7% area overhead.

Thanks! Q&A