
Presentation transcript:

1 Leveraging the Core-Level Complementary Effects of PVT Variations to Reduce Timing Emergencies in Multi-Core Processors
Guihai Yan (1), Xiaoyao Liang (2), Yinhe Han (1), and Xiaowei Li (1)
(1) Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences (ICT, CAS)
(2) NVIDIA Corporation
Jun. 23, 2010

2 Outline
• Introduction to PVT variations
• Analyzing the “complementary effect”
  – Time domain
  – Frequency domain
• Implementation challenges & solutions
• Experimental results

3 Introduction to variations
• Variation sources
  Process variation
  – Random dopant fluctuation
  – Sub-wavelength lithography
  Voltage variation
  – Parasitic power delivery networks
  – Application variability
  – Inductive noise, IR-drop
  Temperature variation
  – Imbalanced activity
  – Hotspots
• We focus on the primary manifestation: performance variation

4 Process variation
• Sub-wavelength lithography
  – “What you get is not what you want”
  – Systematic
• Random dopant fluctuations
  – Vth variation
  – Random
• Figure: sub-wavelength lithography [Borkar, DAC’09] [Aitken, ATS’07]
• Maximum frequency differs by 20% [Teodorescu, ISCA’08]
• P variation is time-independent: the “DC component”

5 Temperature variation
• Application-specific
• Slow-varying
  – Milliseconds
  – Typical thermal time constant: 2 ms [Donald, ISCA’06]
• Figure: measured Pentium M processor temperatures
• T variation is slow-varying: the “low-frequency components”

6 Voltage variation
• Fast-changing
  – Inductive noise, a.k.a. the L(di/dt) problem
  – IR-drop
• Hierarchical PDN
• Why is it harder to keep a constant voltage level?
  Example
  – Power budget: 100 W
  – Working voltage: 1 V
  – Current: 100 A
  – To keep voltage fluctuation within ±5%, R_PDN < 0.5 mOhm
• V variation is fast-changing: the “high-frequency components”
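The ±5% example works out as follows. This is only a worked version of the slide's arithmetic; it treats the PDN as a purely resistive drop and ignores inductive noise, which is a simplifying assumption.

```latex
\begin{align*}
I &= \frac{P}{V} = \frac{100\,\mathrm{W}}{1\,\mathrm{V}} = 100\,\mathrm{A} \\
\Delta V_{\max} &= 5\% \times 1\,\mathrm{V} = 50\,\mathrm{mV} \\
R_{\mathrm{PDN}} &< \frac{\Delta V_{\max}}{I} = \frac{50\,\mathrm{mV}}{100\,\mathrm{A}} = 0.5\,\mathrm{m\Omega}
\end{align*}
```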

7 Resultant impact of PVT variations
• Figure labels: fast cores vs. slow cores; violent apps. vs. mild apps.; high temp. vs. low temp.
• Result: timing (delay) variation

8 Prior solutions
• Strive to compensate for P, V, and T variation individually
  Mitigate P variation
  – ReCycle [ISCA’06], Body Bias [Micro’07], ReVIVal [ISCA’08], and others
  Stabilize V variation
  – Pipeline damping [ISCA’03], DeCoR [HPCA’08], and others
  Balance T variation
  – Hotspot [ISCA’03], DVFS + Activity Migration [ISCA’03, HPCA’01, TODAES’07], and others
• Other timing-oriented solutions
  – Razor [JSSC’06], EVAL [Micro’08], Tribeca [Micro’09], and others

9 Our perspective
• Focus on the essential timing issue: delay variation
• Process, voltage, and temperature variations do not necessarily aggregate; in some cases they can cancel each other out. Hence, “complementary”
• Design goal: minimize delay variation
• Figure labels: Process, Voltage, Temp., Delay

10 Some terms
• Timing emergency (TE)
• Emergency level (EL)
  – “Density” of TE
  – Define: EL = number of TEs per 100 million cycles
• Violent vs. mild
  Voltage
  – Large fluctuation = violent
  – Small fluctuation = mild
  Temperature
  – “Hot” = violent
  – “Cool” = mild
  Process
  – Slow corner = violent
  – Fast corner = mild
• Figure: delay vs. time with the timing emergency threshold; mild and violent voltage traces
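The EL definition can be sketched in a few lines. The function names and the sample numbers are hypothetical, and counting every over-threshold sample as one TE is an assumption (the slides do not say how consecutive violations are grouped).

```python
def count_timing_emergencies(delays, threshold):
    """Count samples whose delay exceeds the timing-emergency threshold."""
    return sum(1 for d in delays if d > threshold)


def emergency_level(num_emergencies, cycles):
    """EL = number of timing emergencies per 100 million cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return num_emergencies * 100_000_000 / cycles


# Hypothetical trace of normalized delays, with the threshold at 1.0.
trace = [0.92, 0.97, 1.03, 0.99, 1.01, 0.95]
te = count_timing_emergencies(trace, threshold=1.0)

# Hypothetical count: 30 emergencies over 2,000,000 cycles (1 ms at 2 GHz).
el = emergency_level(30, 2_000_000)
```

Normalizing to a fixed cycle count lets ELs measured over different interval lengths be compared directly.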

11 How do PVT variations complement each other?
• Observation in the time domain
  – Core1 (T mild, V mild): large margin, low EL
  – Core2 (T violent, V violent): little margin, high EL
• What if we exchange the threads on Core1 and Core2?
  – Mild + violent: T mild with V violent, T violent with V mild
• Figure: delay vs. time against the threshold for the four T/V combinations, marking emergencies and excessive headroom

12 Frequency domain analysis
• Y(f) = FFT(D(t))
• Sample interval: 5 ns
• Span of analysis: 1 ms
• DC component: “P”
• Low-frequency component: “T”
• High-frequency component: “V”
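A minimal sketch of this decomposition, not the authors' tooling. It uses a small hand-written radix-2 FFT to stay dependency-free, a shortened power-of-two window instead of the full 1 ms span, and a synthetic trace. The 1 MHz boundary between the T and V bands, the root-sum-square "strength" measure, and all names are assumptions made for illustration.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def band_strengths(delay, dt, pt_cutoff_hz):
    """Split a delay trace D(t) into P (DC), T (low band), V (high band)."""
    n = len(delay)
    spectrum = fft(delay)
    df = 1.0 / (n * dt)                      # frequency resolution in Hz
    p = abs(spectrum[0]) / n                 # DC bin equals the mean delay
    t_sq = v_sq = 0.0
    for k in range(1, n // 2 + 1):
        scale = 1.0 if k == n // 2 else 2.0  # one-sided amplitude
        amp = scale * abs(spectrum[k]) / n
        if k * df < pt_cutoff_hz:
            t_sq += amp * amp
        else:
            v_sq += amp * amp
    return {"P": p, "T": math.sqrt(t_sq), "V": math.sqrt(v_sq)}


# Synthetic trace (hypothetical numbers): 5 ns sample interval, 4096 samples.
N, DT = 4096, 5e-9
trace = [1.0
         + 0.05 * math.sin(2 * math.pi * 4 * i / N)     # ~195 kHz, "T"-like
         + 0.02 * math.sin(2 * math.pi * 400 * i / N)   # ~19.5 MHz, "V"-like
         for i in range(N)]
strengths = band_strengths(trace, DT, pt_cutoff_hz=1e6)
```

For this trace the three strengths recover the injected amplitudes (about 1.0, 0.05, and 0.02), which is how the spectrum separates the P, T, and V contributions to delay.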

13 The strength of each component of PVT variations
• Migrate threads = “graft” the V component
• Figure labels: V component, PT

14 Frequency domain analysis (cont.)
• Relative frequency spectrum deviations on a 2 GHz quad-core processor
  – P: 0-100 Hz, T: 100 Hz-1 MHz, V: 1 MHz-250 MHz
• Potential: Core3 and Core4 are mild
• Strategy: exchange threads on Core1 and Core4, and on Core2 and Core3

15 How to exploit such a “complementary effect”?
• Straightforward approach: one sensor per component
  – P component: product test
  – V component: voltage sensor
  – T component: temperature sensor
  – Aging sensor, “xyz” sensor
  – Pros: conceptually simple
  – Cons: slow (V and T sensors are slow); not comprehensive (e.g., what about aging?)
• Our approach: delay sensor-based scheme
  – Delay sensor: V component and (P+T) component
  – Pros: fast; comprehensive (timing)
  – Cons: needs a little trick

16 Implementation (cont.)
• What we know
  – Delay variation, from delay sensors
• What we need to know
  – The strength of the PT and V components
• How to bridge the gap? Three challenges
  – Inferring the PVT components from delay values
  – On-the-fly thread migration decision-making
  – On-the-fly variation prediction

17 Top view of architecture
• Timing Emergency Aware + Thread Migration = TEA-TM

18 Infer PVT components from delay values
• Use the mean delay to infer the PT component (< 1 MHz)
• This simplification greatly facilitates a cost-efficient implementation of TEA-TM
• Then, how about the “V component”?
• Figure labels: mean delay, PT component
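Why a plain mean can stand in for the PT component: averaging over a window that is long compared with the voltage ripple cancels the fast fluctuation and leaves the slow baseline. The sketch below illustrates this on a synthetic window; the window length, ripple frequency, and function name are hypothetical and not taken from the slides.

```python
import math
from statistics import fmean


def pt_component(delay_window):
    """Estimate the slow (P+T) component as the mean delay over a window."""
    return fmean(delay_window)


# Hypothetical window: a 1.05 baseline plus a 20 MHz ripple, sampled every
# 5 ns for 1 microsecond (200 samples, i.e. 20 full ripple periods).
window = [1.05 + 0.03 * math.sin(2 * math.pi * 20e6 * i * 5e-9)
          for i in range(200)]
estimate = pt_component(window)
```

The estimate comes out at the 1.05 baseline because the ripple averages to zero over whole periods; in practice the window would also need to be short compared with temperature changes.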

19 On-the-fly TEA-TM decision making
• Urgent First Policy (UFP)
  – Does NOT rely directly on an accurate V component
• Basic idea: migrate the threads running on the core with the highest EL to the core with the smallest PT component
  – Always right, but may not be optimal
  – EL = PT “+” V
• Refer to our paper for the more sophisticated “DUFP” heuristic
• Figure labels: Core1, Core2, emergency level, PT component, TM
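One UFP step might look like the sketch below. It is an illustration of the basic idea only, not the paper's implementation: the function name and data layout are hypothetical, and having the two cores exchange threads (rather than a one-way move) is an assumption.

```python
def urgent_first_policy(emergency_level, pt_component, assignment):
    """Return a new thread-to-core assignment after one UFP step.

    emergency_level[c] and pt_component[c] are per-core measurements;
    assignment[c] is the thread currently running on core c.
    """
    cores = range(len(assignment))
    src = max(cores, key=lambda c: emergency_level[c])  # most urgent core
    dst = min(cores, key=lambda c: pt_component[c])     # smallest PT component
    new_assignment = list(assignment)
    if src != dst:
        # Assumption: the two cores exchange threads so no core is left idle.
        new_assignment[src], new_assignment[dst] = (
            new_assignment[dst], new_assignment[src])
    return new_assignment


# Hypothetical quad-core snapshot.
el = [5, 120, 30, 10]
pt = [1.02, 1.10, 0.95, 1.00]
result = urgent_first_policy(el, pt, ["A", "B", "C", "D"])
```

Here core 1 has the highest EL and core 2 the smallest PT component, so threads B and C trade places. Only EL and the PT estimate are consulted, matching the point that UFP does not need an accurate V component.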

20 On-the-fly variation prediction
• Objective: reduce the emergency level in the future
• Linear prediction mechanism
  – Inputs: emergency level, PT component
  – Output: EL prediction result
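The slides do not specify the predictor beyond "linear". As one possible reading, the sketch below fits a least-squares line to the recent history of a quantity (EL or the PT component) and extrapolates one interval ahead. The fitting method, history length, and function name are all assumptions.

```python
def predict_next(history):
    """Fit a least-squares line to history[0..n-1] and extrapolate to step n."""
    n = len(history)
    if n == 0:
        raise ValueError("history must not be empty")
    if n == 1:
        return float(history[0])
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    prediction = mean_y + slope * (n - mean_x)
    return max(prediction, 0.0)  # an emergency level cannot be negative


# Hypothetical EL history over the last four migration intervals.
next_el = predict_next([10, 20, 30, 40])
```

A steadily rising history such as this one extrapolates to 50 for the next interval, so the migration decision can act on where the EL is heading, not only where it was.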

21 Experiments
• Methodology: trace-based evaluation
• Modeled processor: quad-core, superscalar, 2 GHz
• PDN: similar to the Intel Xeon 5500 quad-core microprocessor; 130 W (peak 150 W)
• Workload

22 Metrics
• Relative throughput loss
• Relative fairness
[Formulas not captured in the transcript]

23 Impact of TM interval on average EL reduction
• No migration overhead accounted for
  – 1 ms at 2 GHz: migration overhead is negligible
  – 0.3 ms at 2 GHz: migration overhead < 15%
• When the migration penalty is taken into account
  – Figure labels: perf. overhead & EL reduction, overall throughput, minimal TM interval, large migration penalty, large emergency rate

24 Reduction in relative throughput loss
• TM interval: 0.2 ms; accuracy: 90%
• Developing more sophisticated heuristics

25 Fairness improvement
• 80% fairness improvement

26 Conclusion
• Analyzed the complementary effect in both the time and frequency domains
• Presented a delay sensor-based scheme (TEA-TM) to exploit the complementary effect
  – Simple, cost-efficient
• The experimental results show
  – Improved throughput
  – Improved fairness
