Process Variation in Near-threshold Wide SIMD Architectures. Sangwon Seo 1, Ronald G. Dreslinski 1, Mark Woh 1, Yongjun Park 1, Chaitali Chakrabarti.

Presentation transcript:

Slide 1: Process Variation in Near-threshold Wide SIMD Architectures
Sangwon Seo (1), Ronald G. Dreslinski (1), Mark Woh (1), Yongjun Park (1), Chaitali Chakrabarti (2), Scott Mahlke (1), David Blaauw (1), Trevor Mudge (1)
(1) University of Michigan, (2) Arizona State University

Slide 2: Near Threshold Computing
- Super threshold
  - High performance
  - High energy consumption
- Near threshold
  - 10x energy reduction
  - 10x performance degradation
- Sub threshold
  - Exponentially decreasing performance
  - Increasing leakage becomes dominant

Slide 3: Near-threshold Computing
- Advantage: high energy efficiency
- Disadvantages
  - Low performance throughput
    - Compensated with a very wide SIMD architecture
  - Sensitive to variations in threshold voltage
    - More critical issues in wide SIMD architectures
    - Increased probability of timing errors
    - Expensive error recovery mechanisms

Slide 4: Near-threshold Computing
- Advantage: high energy efficiency
- Disadvantages
  - Low performance throughput
    - Compensated with a very wide SIMD architecture
  - Sensitive to variations in threshold voltage
    - More critical issues in wide SIMD architectures
    - Increased probability of timing errors
    - Expensive error recovery mechanisms
- How bad is the delay variation in wide SIMD architectures running at near-threshold voltages?
- How can the variation-induced timing errors be mitigated?

Slide 5: Delay Variations in 90nm
[Figure: delay variations in 90nm; annotated values ~2.3x and ~1.6x]
- Uncorrelated variations are averaged out over the chain.

Slide 6: Delay Variations – f(Vdd=0.55V, N)
- A long chain helps, but the effect diminishes as N increases.
- Variations are exacerbated with technology scaling.
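The diminishing benefit of a longer chain can be illustrated with a short derivation. This is a sketch under an assumption the slides do not state: each of the N gate delays is independent and identically distributed with mean \mu and standard deviation \sigma (the symbols are introduced here for illustration only).

```latex
\mu_{\text{chain}} = N\mu, \qquad
\sigma_{\text{chain}} = \sqrt{N}\,\sigma, \qquad
\frac{\sigma_{\text{chain}}}{\mu_{\text{chain}}}
  = \frac{\sqrt{N}\,\sigma}{N\mu}
  = \frac{1}{\sqrt{N}}\cdot\frac{\sigma}{\mu}
```

Under that assumption the relative spread falls only as 1/\sqrt{N}, so doubling the chain from N = 50 to N = 100 shrinks it by a factor of about 0.71 rather than 0.5.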

Slide 7: Delay Variations – f(Vdd, N=50)
- Line edge roughness (LER) causes high variations in advanced technology nodes
  - Strict design rules
  - Metal gates with high-k material, or SOI
  - Advanced lithography

Slide 8: Delay Distribution – 90nm GP
- 1 critical path delay = delay of a chain of 50 FO4 inverters.
- 1-wide system delay = max (delays of 100 critical paths)
- 128-wide system delay = max (delays of wide system)
[Figure annotation: Performance Drop]
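The max-of-paths delay model above can be sketched as a small Monte Carlo experiment. This is only an illustration under stated assumptions: per-gate delays are taken to be independent Gaussians, so a 50-gate chain is sampled directly as a Gaussian with mean N and standard deviation SIGMA * sqrt(N); the value of `SIGMA` is hypothetical; and the 128-wide system is assumed to be the maximum over 128 one-wide systems. None of these numbers come from the slides' 90nm data.

```python
import math
import random
import statistics

GATES_PER_PATH = 50    # from the slide: chain of 50 FO4 inverters
PATHS_PER_LANE = 100   # from the slide: 100 critical paths per 1-wide system
LANES = 128            # assumption: 128 one-wide lanes form the wide system
SIGMA = 0.15           # hypothetical per-gate delay spread (nominal delay = 1)


def path_delay(rng):
    """Delay of one critical path, in units of the nominal gate delay.

    The sum of N independent Gaussian gate delays is itself Gaussian with
    mean N and standard deviation SIGMA * sqrt(N).
    """
    return rng.gauss(GATES_PER_PATH, SIGMA * math.sqrt(GATES_PER_PATH))


def lane_delay(rng):
    """1-wide system delay: the slowest of its critical paths."""
    return max(path_delay(rng) for _ in range(PATHS_PER_LANE))


def simd_delay(rng, lanes=LANES):
    """Wide system delay: the slowest lane sets the clock period."""
    return max(lane_delay(rng) for _ in range(lanes))


def simulate(trials=20, seed=0):
    """Return (mean 1-wide delay, mean wide delay) over `trials` chips."""
    rng = random.Random(seed)
    one_wide = [lane_delay(rng) for _ in range(trials)]
    wide = [simd_delay(rng) for _ in range(trials)]
    return statistics.mean(one_wide), statistics.mean(wide)


if __name__ == "__main__":
    one, wide = simulate()
    print(f"mean 1-wide delay: {one:.2f}, mean 128-wide delay: {wide:.2f}")
```

Because the wide system takes a maximum over many more paths, its mean delay sits further into the tail of the path-delay distribution, which is the performance drop the slide points at.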

Slide 9: Variation Effects on 128-wide SIMD Architecture
- Structural duplication
- Voltage margining
- Frequency margining

Slide 10: Near-threshold Wide SIMD Architecture: Diet SODA
[Seo et al., ISLPED 2010]

Slide 11: Structural Duplication
[Figure: 8-wide + 2-spare system. SIMD function units #0 to #9 connect through a crossbar to datapaths #0 to #7.]
- Increase the number of processing resources.

Slide 12: Structural Duplication
[Figure: 8-wide + 2-spare system. SIMD function units #0 to #9 connect through a crossbar to the datapaths; datapaths #0 to #6 are labeled.]
- Use the spares if required.
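The spare-selection idea in the two slides above can be sketched in a few lines. This is a minimal illustration, not the Diet SODA crossbar configuration logic: the function name `assign_lanes` and the boolean `unit_ok` test results are hypothetical, and the sketch assumes the crossbar can route any function unit to any datapath.

```python
def assign_lanes(unit_ok, width=8):
    """Pick a working function unit for each datapath.

    unit_ok: one boolean per physical function unit (spares included),
             True if the unit meets timing.
    Returns a list of `width` unit indices (entry d is the unit feeding
    datapath d), or None if fewer than `width` units meet timing.
    """
    good = [i for i, ok in enumerate(unit_ok) if ok]
    if len(good) < width:
        return None
    return good[:width]


# 8-wide + 2-spare system in which units #3 and #6 are too slow.
units = [True, True, True, False, True, True, False, True, True, True]
print(assign_lanes(units))  # [0, 1, 2, 4, 5, 7, 8, 9]
```

With two slow units, both spares (#8 and #9) are pulled in; a third slow unit would leave the system unable to fill all eight datapaths.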

Slide 13: Structural Duplication – 90nm GP
- 6 spares are required to match the chip delay of the baseline architecture.
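One way to read "spares required to match the chip delay" is as an order-statistic problem: with s spares, the system can discard its s slowest lanes, so the clock is set by the slowest lane among those kept. The sketch below assumes that reading; the helper names and the lane delays are made up for illustration and do not reproduce the slide's 90nm result.

```python
def delay_with_spares(lane_delays, width):
    """System delay when only the `width` fastest lanes are used."""
    return sorted(lane_delays)[width - 1]


def min_spares(lane_delays, width, target):
    """Smallest number of spares that brings the system delay to `target`.

    lane_delays lists per-lane delays in physical order; with s spares the
    first width + s lanes exist on the chip. Returns None if even using
    every listed lane does not reach the target.
    """
    max_spares = len(lane_delays) - width
    for s in range(max_spares + 1):
        if delay_with_spares(lane_delays[:width + s], width) <= target:
            return s
    return None


# Hypothetical 4-wide system with up to 3 spares, delays in arbitrary units.
delays = [1.0, 1.3, 1.1, 1.4, 1.0, 1.2, 1.05]
print(min_spares(delays, width=4, target=1.1))  # 3
```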

Slide 14: Voltage Margining
- Increase the supply voltage.
[Figure: delay distributions; the 45nm PTM model is used.]

Slide 15: Frequency Margining
- Increase the clock period
  - Applicable for applications with relaxed time constraints
  - For advanced technology nodes, this is impractical
- Caveat
  - Consider its impact on the system
    - SIMD subsystem clock period
    - Memory subsystem clock period

Slide 16: Structural Duplication vs. Voltage Margining

Slide 17: Combination of two schemes – 45nm GP wide 0.6V
Design points shown:
- 26 spares
- 17mV boost
- 5mV + 8 spares
- 10mV + 2 spares

Slide 18: Variation-Aware Diet SODA

Slide 19: Conclusions
- Near-threshold operation of a wide SIMD system can have timing problems due to process variations.
- Variation effects on a 128-wide SIMD architecture are marginal for the 90nm technology node, but could be non-negligible for current and future technology nodes.
- A combination of structural duplication and voltage margining provides a minimal-power-overhead solution to mitigate variation-induced timing problems in wide SIMD architectures.

Slide 20: Questions?
Thank you!

Slide 21: Backup Slides

Slide 22: Local Spares vs. Global Spares
- Local sparing, 1 out of 4 (2 spares)
  + Small overhead
  - Burst errors
- Global sparing (2 spares)
  + Burst errors
  - Large overhead

Slide 23: Local Spares vs. Global Spares
[Figure: comparison of global spares and local spares (1 out of 4)]
- Global sparing is better than local sparing.
- The XRAM crossbar supports global sparing.
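The burst-error difference between the two sparing schemes can be shown with a small check. This is a simplified sketch, assuming one spare per group of four lanes for local sparing and a shared pool for global sparing, and assuming the spares themselves always meet timing; the function names are hypothetical.

```python
def local_sparing_ok(faulty, group_size=4):
    """Local sparing with one spare per group.

    faulty: one boolean per lane, True if the lane fails timing.
    A group can absorb at most one faulty lane.
    """
    for start in range(0, len(faulty), group_size):
        if sum(faulty[start:start + group_size]) > 1:
            return False
    return True


def global_sparing_ok(faulty, spares):
    """Global sparing: any spare can replace any faulty lane."""
    return sum(faulty) <= spares


# 8 lanes, 2 spares in both schemes (local: 1 spare per group of 4).
burst = [False, True, True, False, False, False, False, False]
spread = [False, True, False, False, False, False, True, False]

print(local_sparing_ok(burst), global_sparing_ok(burst, spares=2))    # False True
print(local_sparing_ok(spread), global_sparing_ok(spread, spares=2))  # True True
```

Two adjacent faulty lanes land in the same group and defeat local sparing, while the same spare count handles them when the spares are global.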

Slide 24: Variation-Aware Diet SODA
- Delay variations can be addressed with little area and power overhead.