Department of Electrical and Computer Engineering, University of Wisconsin - Madison: Optimizing Total Power of Many-core Processors Considering Voltage Scaling Limit and Process Variations

Presentation transcript:

Optimizing Total Power of Many-core Processors Considering Voltage Scaling Limit and Process Variations
Jungseob Lee and Nam Sung Kim
Department of Electrical and Computer Engineering, University of Wisconsin - Madison
October 9, 2009

Outline
- Introduction
- Supply Voltage and Power Scaling
  - Supply Voltage Scaling of Many-Core Processors
  - Power Scaling of Many-Core Processors
- Impacts of Within-Die (WID) Spatial Process Variations
  - Global Clocking
  - Frequency-Island Clocking
- Conclusions

Parallel Processing
- More cores improve the throughput of computing systems.
- With all cores running, throughput is limited by power and thermal constraints.
Challenges: how do we
- determine the number of cores for the best performance-power efficiency?
- exploit process variations in multicore processors?
[Figures: serial vs. parallel processing on multicore processors [1]; a GPU with many cores [2]]
[1] Source:
[2] Source: NVIDIA http://

Types of Process Variations
- Process variations divide into within-die (WID) variations and die-to-die (D2D) variations.
- Core-to-core (C2C) frequency and leakage power variations due to spatially correlated WID variations become considerable.
[Figures: wafer-scale variation (courtesy of K. Bowman, Intel); a systematic V_th variation map for a 16-core processor; the corresponding normalized F_max and P_leak map]

Supply Voltage Scaling 1
Supply voltage scaling of many-core processors
- Throughput with a certain number of cores at maximum V_DD (and thus F_max) = throughput with more cores at a lower V_DD (and thus a lower F_max).
- Spending the potential throughput increase of more cores on a lower V_DD can reduce power.
[Figure: 4 cores at the operating frequency set by V_DD vs. 8 cores at the operating frequency set by a voltage V lower than V_DD]

Supply Voltage Scaling 2
Supply voltage scaling of many-core processors
- $M \cdot T_{cycle}(V_{DD}) = M \cdot \left((1-F) + F/N\right) \cdot T_{cycle}(V)$
Symbols:
- $M$: number of operations
- $T_{cycle}$: cycle time of a processor at a given supply voltage
- $V_{DD}$: nominal supply voltage of the base core processor
- $F$: fraction of operations parallelizable without overhead
- $N$: relative number of cores
- $V$: scaled supply voltage of N x more cores
[Figure: panels for PTM 32nm HP and PTM 32nm LP; annotations: "Require higher V_DD due to high V_th" and "> 40% ↓"]
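The iso-throughput condition above fixes the frequency scaling factor the scaled design must reach: dividing both sides by $M \cdot T_{cycle}(V)$ gives $F_{max}(V)/F_{max}(V_{DD}) = (1-F) + F/N$. The sketch below computes that target and then searches for the lowest supply voltage that meets it. The slides obtain the frequency-voltage relation from HSPICE; the alpha-power-law model here, its parameters `V_th` and `alpha`, and all function names are assumptions chosen only for illustration. The 0.7 V floor mirrors the V_DD,min mentioned in the deck.

```python
def required_freq_factor(F, N):
    """Frequency factor F_max(V)/F_max(V_DD) needed for equal throughput."""
    if not 0.0 <= F <= 1.0:
        raise ValueError("F must lie in [0, 1]")
    if N < 1:
        raise ValueError("N must be at least 1")
    return (1.0 - F) + F / N


def alpha_power_freq_factor(V, V_dd=1.0, V_th=0.3, alpha=1.3):
    """Hypothetical alpha-power-law stand-in for the measured f(V)."""
    def g(v):
        return (v - V_th) ** alpha / v
    return g(V) / g(V_dd)


def scaled_voltage(F, N, V_dd=1.0, V_min=0.7, tol=1e-9):
    """Lowest voltage in [V_min, V_dd] that still meets the target frequency."""
    target = required_freq_factor(F, N)
    # Voltage scaling limit: V_min is already fast enough, so stop there.
    if alpha_power_freq_factor(V_min, V_dd) >= target:
        return V_min
    lo, hi = V_min, V_dd  # invariant: f(lo) < target <= f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if alpha_power_freq_factor(mid, V_dd) < target:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, required_freq_factor(0.5, n), scaled_voltage(0.5, n))
```

Under this assumed model, a highly parallel workload quickly hits the voltage floor, which is the situation the deck's "voltage scaling limit" refers to.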

Dynamic Power Analysis 1
Dynamic power scaling
- Dynamic power of a base many-core processor:
  $P_{dyn,base} = C_{eff} \cdot V_{DD}^2 \cdot F_{max}(V_{DD})$
- Dynamic power of N x more cores than the base processor:
  $P_{dyn,N} = \left((1-F)(1+(N-1)K) + F N\right) \cdot C_{eff} \cdot V^2 \cdot F_{max}(V) = k(F,K,N) \cdot f(V) \cdot (V/V_{DD})^2 \cdot P_{dyn,base}$
Symbols:
- $P_{dyn,base}$: dynamic power of a base processor
- $C_{eff}$: effective total switching capacitance
- $V_{DD}$: nominal voltage of the base core
- $F_{max}$: maximum operating frequency of the base processor
- $P_{dyn,N}$: dynamic power of N x more cores
- $K$: fraction of dynamic power consumed by idle cores
- $k(F,K,N)$: $(1-F)(1+(N-1)K) + F N$
- $f(V)$: frequency scaling factor at V; $F_{max}(V)/F_{max}(V_{DD})$
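As a quick numerical check of the dynamic power expression, the following sketch evaluates $k(F,K,N)$ and the ratio $P_{dyn,N}/P_{dyn,base} = k(F,K,N) \cdot f(V) \cdot (V/V_{DD})^2$. The function names and the sample operating point are illustrative assumptions; in the slides, $f(V)$ comes from circuit simulation rather than being supplied by hand.

```python
def k_factor(F, K, N):
    """k(F, K, N) = (1 - F) * (1 + (N - 1) * K) + F * N."""
    return (1.0 - F) * (1.0 + (N - 1) * K) + F * N


def dyn_power_ratio(F, K, N, f_V, V, V_dd):
    """P_dyn,N / P_dyn,base for N x more cores running at voltage V."""
    return k_factor(F, K, N) * f_V * (V / V_dd) ** 2


if __name__ == "__main__":
    # Assumed example: fully parallel work, twice the cores at half the
    # frequency, supply lowered from 1.0 V to 0.8 V.
    print(dyn_power_ratio(F=1.0, K=0.1, N=2, f_V=0.5, V=0.8, V_dd=1.0))
```

With those assumed numbers the two factors of 2 and 0.5 cancel, so the whole saving comes from the quadratic voltage term.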

Dynamic Power Analysis 2
Dynamic power scaling
[Figure: optimal normalized P_dyn vs. relative number of cores; panels for PTM 32nm HP and PTM 32nm LP]
- Dotted lines show the projected power consumption when there is no supply voltage limit.
- V_DD,min = 0.7 V. Less V_DD scaling means less P_dyn reduction.
- HP: 25~55%; LP: 25~54%

Leakage Power Analysis 1
Leakage power scaling
- In nanoscale technology, leakage power is a significant fraction of total power consumption.
- Leakage power of a base many-core processor:
  $P_{leak,base} = I_{leak}(V_{DD}) \cdot V_{DD}$
- Leakage power of N x more cores than the base processor:
  $P_{leak,N} = N \cdot I_{leak}(V) \cdot V = N \cdot l(V) \cdot (V/V_{DD}) \cdot P_{leak,base}$
Symbols:
- $P_{leak,base}$: leakage power of a base core
- $I_{leak}$: total leakage current of the base processor
- $V_{DD}$: nominal voltage of the base core
- $P_{leak,N}$: leakage power of N x more cores
- $l(V)$: leakage scaling factor at V; $I_{leak}(V)/I_{leak}(V_{DD})$

Leakage Power Analysis 2
Leakage power scaling
[Figure: optimal normalized P_leak vs. relative number of cores; panels for PTM 32nm HP and PTM 32nm LP]
- The absolute P_leak of LP is much less than that of HP.
- HP: 54~80%; LP: 33~50%

Total Power Analysis 1
Total power scaling
- The total power of a base many-core processor is the sum of its dynamic and leakage power:
  $P_{tot,base} = P_{dyn,base} + P_{leak,base} = P_{dyn,base} \cdot (1 + LF)$
- The total power of N x more cores than the base processor is likewise the sum of dynamic and leakage power:
  $P_{tot,N} = P_{dyn,N} + P_{leak,N} = P_{dyn,base} \cdot \left\{ k(F,K,N) \cdot f(V) \cdot (V/V_{DD})^2 + N \cdot l(V) \cdot (V/V_{DD}) \cdot LF \right\}$
Symbols:
- $P_{tot,base}$: total power of a base core
- $LF$: ratio between leakage and dynamic power of the base; $P_{leak,base}/P_{dyn,base}$
- $P_{tot,N}$: total power of N x more cores
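Dividing the second expression by the first gives a normalized total power, $P_{tot,N}/P_{tot,base} = \{k \cdot f(V) \cdot (V/V_{DD})^2 + N \cdot l(V) \cdot (V/V_{DD}) \cdot LF\}/(1+LF)$, which can be compared across core counts. The sketch below evaluates that ratio and picks the core count with the lowest value. The operating points in `points` are made-up numbers for illustration, not data from the slides; only LF = 0.4/0.6 is taken from the deck's HP setting, and all function names are hypothetical.

```python
def k_factor(F, K, N):
    """k(F, K, N) = (1 - F) * (1 + (N - 1) * K) + F * N."""
    return (1.0 - F) * (1.0 + (N - 1) * K) + F * N


def total_power_ratio(F, K, N, f_V, l_V, v_ratio, LF):
    """P_tot,N / P_tot,base, with v_ratio = V / V_DD."""
    dyn = k_factor(F, K, N) * f_V * v_ratio ** 2
    leak = N * l_V * v_ratio * LF
    return (dyn + leak) / (1.0 + LF)


def best_core_count(points, F, K, LF):
    """Return the N whose operating point minimizes normalized total power.

    points maps N -> (v_ratio, f_V, l_V).
    """
    return min(
        points,
        key=lambda n: total_power_ratio(F, K, n, points[n][1],
                                        points[n][2], points[n][0], LF),
    )


if __name__ == "__main__":
    LF = 0.4 / 0.6  # HP leakage-to-dynamic ratio quoted in the deck
    # Hypothetical operating points: N -> (V/V_DD, f(V), l(V)).
    points = {1: (1.0, 1.0, 1.0), 2: (0.8, 0.5, 0.5), 4: (0.7, 0.25, 0.4)}
    for n, (v, f, l) in points.items():
        print(n, total_power_ratio(1.0, 0.1, n, f, l, v, LF))
    print("best N:", best_core_count(points, 1.0, 0.1, LF))
```

The leakage term grows linearly with N while the dynamic term stops shrinking once the voltage floor is reached, which is why the minimum sits at a moderate core count in this toy data.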

Total Power Analysis 2
Total power scaling
[Figure: optimal normalized P_tot vs. relative number of cores; panels for PTM 32nm HP (LF = 0.4/0.6) and PTM 32nm LP (LF = 0.2/0.8)]
- More V_DD scaling yields only 17% more P_tot reduction but requires more on-die memory area.
- HP: 36~65%; LP: 26~52%

Impacts of WID Variations - GC
Global Clocking
- Global clocking limits the F_max of a many-core processor to that of its slowest core.
- The previous P_dyn,N equation can still be used to estimate P_dyn,N.
- The estimate of P_leak,N has to account for each core's leakage variation:
  $P_{leak,N} = \sum_{i=1}^{N} l_i(V) \cdot (V/V_{DD}) \cdot P_{leak,base}$
Symbols:
- $l_i(V)$: leakage scaling factor of the i-th core, normalized to $I_{leak}(V_{DD})$
[Figures: a systematic V_th variation map for a 16-core processor; the corresponding map of normalized F_max and P_leak by core ID]
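A minimal sketch of the global-clocking bookkeeping follows, assuming the per-core leakage factors l_i(V) are summed over the N active cores so that the result reduces to N · l(V) · (V/V_DD) when every core is identical. The function names and the sample per-core values are hypothetical; the slides derive the real values from variation maps.

```python
def gc_frequency(core_fmax):
    """Under global clocking, every core runs at the slowest core's F_max."""
    return min(core_fmax)


def gc_leak_ratio(l_factors, v_ratio):
    """P_leak,N / P_leak,base with per-core leakage factors l_i(V).

    Assumes the factors are summed over the active cores.
    v_ratio is V / V_DD.
    """
    return sum(l_factors) * v_ratio


if __name__ == "__main__":
    # Hypothetical normalized per-core values for four active cores.
    fmax = [0.95, 1.00, 1.08, 0.90]
    leak = [0.45, 0.50, 0.62, 0.40]
    print("shared clock:", gc_frequency(fmax))
    print("leakage ratio:", gc_leak_ratio(leak, v_ratio=0.8))
```

In this toy data the fastest core is also the leakiest, so it pays its full leakage cost while being clocked at the slowest core's speed.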

Impacts of WID Variations - GC
Global Clocking
[Figure: average total power of 100 die samples vs. relative number of cores (N); panels for HP with the slowest base core and HP with the fastest base core]
- With the fastest base core, the relative total power reduction is much larger because the fastest base core is not power efficient.
- Slow: 23~54%; Fast: 77~90%

Impacts of WID Variations - FI
Frequency-Island Clocking
- FI clocking is more performance and power efficient than GC because each core can run at its own fastest frequency.
- The GC equation for P_leak,N can still be used to estimate P_leak,N.
- The equation for supply voltage scaling has to be modified as follows, where f_i is the frequency factor of the i-th active core and j is the core that runs the sequential portion:
  $M \cdot T_{cycle,base}(V_{DD}) = M \cdot \left((1-F)/f_j + F/\sum_i f_i\right) \cdot T_{cycle}(V)$
- The estimate of P_dyn,N also has to account for an independent clock frequency per core:
  $P_{dyn,N} = \left((1-F)\left(f_j + \sum_{i \ne j} f_i \cdot K\right) + F \cdot \sum_i f_i\right) \cdot (V/V_{DD})^2 \cdot P_{dyn,base}$
- The fastest of the chosen active cores always offers the optimal total power for processing the totally sequential portion of the workload.
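The per-core terms in the frequency-island equations can be sanity-checked against the uniform case. The sketch below assumes the per-core frequency factors f_i are summed over the active cores (with the idle-power term summed over every core except the one running the sequential portion); that reading is an assumption, chosen because it reduces to $(1-F) + F/N$ and $k(F,K,N)$ when all factors equal 1. Function names and sample values are hypothetical.

```python
def fi_time_factor(F, f, j):
    """Cycle-time multiplier when core j runs the sequential portion."""
    return (1.0 - F) / f[j] + F / sum(f)


def fi_dyn_factor(F, K, f, j):
    """Dynamic power multiplier applied to (V/V_DD)^2 * P_dyn,base."""
    idle = sum(fi for i, fi in enumerate(f) if i != j)
    return (1.0 - F) * (f[j] + K * idle) + F * sum(f)


if __name__ == "__main__":
    F, K = 0.75, 0.1
    uniform = [1.0, 1.0, 1.0, 1.0]
    # Uniform cores recover the global formulas (1-F)+F/N and k(F,K,N).
    print(fi_time_factor(F, uniform, 0), fi_dyn_factor(F, K, uniform, 0))

    # Hypothetical per-core frequency factors with variation.
    f = [0.9, 1.1, 1.0]
    for j in range(len(f)):
        print(j, fi_time_factor(F, f, j), fi_dyn_factor(F, K, f, j))
```

The loop shows only the timing and dynamic power multipliers for each choice of sequential core; it does not by itself reproduce the deck's total-power comparison, which also depends on the voltage each choice permits.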

Impacts of WID Variations - FI
Frequency-Island Clocking
[Figure: average total power of 100 die samples vs. relative number of cores (N); panels for HP with the slowest base core and HP with the fastest base core]
- FI clocking is more power efficient than global clocking (GC), which often wastes the F_max of faster cores.
- On average, FI clocking offers 7% lower total power consumption than GC.
- Slow: 30~58%; Fast: 81~90%

Experimental Methodology
- HSPICE simulations with the 32nm PTM HP and LP models
- Frequency and leakage scaling factors over a V_DD range of 0.55 ~ 1.05 V
  - Complex gates for measuring l(V_DD)
  - A 24 FO4 inverter chain for measuring f(V_DD)
- V_th and L_eff WID spatial and D2D variation maps [3]
  - Variation parameters: WID variation (%), correlation distance coefficient (Φ), D2D variation 5.0%, 1 grid point
[3] Smruti R. Sarangi et al., "VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects", IEEE Transactions on Semiconductor Manufacturing (IEEE TSM), February

Conclusions
- Determined the optimal number of active cores that minimizes the total power consumption of many-core processors.
  - 2x more active cores at a lower voltage offer more than 50% total power reduction at the same throughput as a base core.
- Extended the power analysis to account for WID C2C frequency and leakage variations.
  - Using 2x more active cores at a lower voltage is the optimal choice.
  - FI clocking provides lower power consumption than GC because it can exploit C2C variations. Using the fastest active core for the sequential portion of the application leads to the lowest power consumption.

Backup

Introduction
Process variations
- Manufactured dies exhibit a large spread of transistor delay and leakage power, both across dies and within each die.
- Die-to-die (D2D) variations affect all transistors on a die equally. Within-die (WID) variations induce different characteristics across each die.
- As individual cores become smaller, core-to-core (C2C) frequency and leakage power variations due to spatially correlated WID variations will become considerable.
[Figures: die-to-die variations; spatial within-die variations. Source: Synopsys]

Supply Voltage and Power Scaling 2
Supply voltage scaling of many-core processors
- Throughput with a certain number of cores at maximum V_DD (and thus F_max) = throughput with more cores at a lower V_DD (and thus a lower F_max).
- Spending the potential throughput increase of more cores on a lower V_DD can reduce power.
[Figure: a many-core processor [1] with active and idle cores marked; 1 active core at the operating frequency set by V_DD vs. 8 active cores at the operating frequency set by a voltage V lower than V_DD]