Maestro: Orchestrating Lifetime Reliability in Chip Multiprocessors

Presentation transcript:

Maestro: Orchestrating Lifetime Reliability in Chip Multiprocessors. Authors: Shuguang Feng, Shantanu Gupta, Amin Ansari, Scott Mahlke. HiPEAC 2010, January 25-27, 2010.

Motivation: “Designing Reliable Systems from Unreliable Components…” - Shekhar Borkar (Intel) [Borkar, MICRO'05] [Srinivasan, DSN'04]. More failures to come. Failures will be wearout induced.

Approaches to Reliability: prevent faults (proactive) or tolerate faults (reactive: detect, diagnose, repair/reconfigure/recover). Circuit-level: margining, robust cell topologies, high-K dielectrics, passivation. Architecture-level: dynamic thermal management (DTM), Maestro, Reliability Banking, Diva, StageNet, Heat-and-Run, Facelift, Argus. Maestro: avoid/postpone failures using intelligent job scheduling.

Not All Cores are Created Equal: Process variation (PV) will cause “heterogeneity” in future many-core chip multiprocessors (CMPs), which results in variations in performance and reliability characteristics. Dynamic thermal/power budgeting can be suboptimal: temperature is only part of the picture, and the condition of the underlying hardware must be accounted for. Many reliability solutions can benefit from wearout sensor feedback: system reconfiguration, dynamic voltage and frequency scaling (DVFS), and job scheduling.

Maestro. Insight: variations exist in both hardware (PV + wearout) and software (thermal)… embrace it. Solution: monitor wearout with low-level sensors, dynamically profile application thermal characteristics, and formulate wearout-centric schedules to accomplish wear-leveling.

Hot or Cold? Is the traditional view of “hot”/“cold” applications too simplistic? 35% of core failures are caused by failing structures that never experience peak on-chip temperatures. 22% of core failures are caused by modules that do not rank among the top 3 most thermally active.

Implications for Lifetime Reliability: Up to ~6x improvement in mean time to failure (MTTF) is achievable with wearout-centric job scheduling. The disparity is present even in FP vs. FP and INT vs. INT application comparisons. Scheduling must account for both workload variation and process variation.

Maestro System Architecture. Low-level sensors (delay, leakage, temperature): ElastIC [IDT`06]. Reconfigurable hardware: StageNet [MICRO`08], CCA [PACT`08]. WDU [MICRO`07].

Scheduling Policies: Greedy. Minimizes premature core failures by favoring “weaker” cores. Global wear-leveling (across a chip) is achieved by forcing “stronger” cores to execute the less desirable applications. Local wear-leveling (within a core) is achieved by assigning jobs with favorable thermal footprints. [Figure: cores C0…Cn sorted from weak to strong and jobs J0…Jn sorted from light to heavy; the job list is re-sorted for each core in turn (e.g. based on C1, then based on C6) and cores and jobs are paired to form the schedule.]
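The greedy policy described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the per-module health readings, the per-job stress values, the module names (`alu`, `fpu`), and the cost rule (a job's stress on a core's weakest module) are all hypothetical and chosen for the example.

```python
# Hypothetical sensor readings: remaining health per module (lower = more worn).
cores = {
    "C0": {"alu": 0.9, "fpu": 0.8},
    "C1": {"alu": 0.3, "fpu": 0.7},
    "C2": {"alu": 0.6, "fpu": 0.5},
}

# Hypothetical thermal profiles: stress each job places on each module.
jobs = {
    "J0": {"alu": 0.9, "fpu": 0.1},
    "J1": {"alu": 0.2, "fpu": 0.8},
    "J2": {"alu": 0.5, "fpu": 0.4},
}


def weakest_module(health):
    """Return the name of the most worn module of a core."""
    return min(health, key=health.get)


def greedy_schedule(cores, jobs):
    """Serve cores from weakest to strongest; each core takes the remaining
    job that stresses its weakest module the least (assumed cost rule)."""
    order = sorted(cores, key=lambda c: min(cores[c].values()))
    remaining = dict(jobs)
    schedule = {}
    for core in order:
        if not remaining:
            break  # more cores than jobs: leftover cores stay idle
        module = weakest_module(cores[core])
        # "Re-sort" the remaining jobs from this core's point of view.
        best = min(remaining, key=lambda j: remaining[j].get(module, 0.0))
        schedule[core] = best
        del remaining[best]
    return schedule


print(greedy_schedule(cores, jobs))
```

With this data the weakest core, C1, picks first and receives the job that is gentlest on its worn ALU, while the strongest core, C0, is left with whatever job remains. That mirrors the slide's point that stronger cores absorb the less desirable applications.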

Core Failure Distribution: Greedy. How can we improve upon this? Systems are unlikely to be kept in service until every last core on a CMP fails. Leverage knowledge of this “failure threshold”: allow some cores to fail in hopes of prolonging the life of “just enough”.

Scheduling Policies: Adaptive. “Survival-of-the-fittest” approach that maximizes the lifetime of a subset of cores. [Figure: cores C0…Cn sorted from weak to strong and split into a primary set (m cores) and a secondary set (n-m cores), paired with jobs J0…Jn; annotation: “Select from among remaining jobs”.]
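A minimal Python sketch of the primary/secondary split may help make the adaptive policy concrete. Several things here are assumptions rather than details from the slides: the primary set is taken to be the m strongest cores, the cost of a job on a core is taken to be its stress on that core's weakest module, and the secondary cores simply receive the leftover jobs in order. The slides say the minimal Cost(S) is found with a GA; to keep the example short and deterministic, this sketch swaps in an exhaustive search over job permutations, which is only practical for tiny inputs.

```python
from itertools import permutations

# Hypothetical sensor readings and job thermal profiles (illustrative only).
cores = {
    "C0": {"alu": 0.9, "fpu": 0.8},
    "C1": {"alu": 0.3, "fpu": 0.7},
    "C2": {"alu": 0.5, "fpu": 0.6},
}
jobs = {
    "J0": {"alu": 0.9, "fpu": 0.1},
    "J1": {"alu": 0.2, "fpu": 0.8},
    "J2": {"alu": 0.5, "fpu": 0.4},
}


def job_cost(cores, jobs, core, job):
    """Assumed cost: stress the job places on the core's weakest module."""
    weakest = min(cores[core], key=cores[core].get)
    return jobs[job].get(weakest, 0.0)


def adaptive_schedule(cores, jobs, m):
    """Protect the m strongest cores (assumed primary set) with the cheapest
    joint assignment, then hand the leftover jobs to the secondary cores."""
    ranked = sorted(cores, key=lambda c: min(cores[c].values()), reverse=True)
    primary, secondary = ranked[:m], ranked[m:]

    # Exhaustive search stands in for the GA mentioned in the slides.
    best = min(
        permutations(jobs, len(primary)),
        key=lambda perm: sum(
            job_cost(cores, jobs, c, j) for c, j in zip(primary, perm)
        ),
    )
    schedule = dict(zip(primary, best))

    # Secondary cores select from among the remaining jobs.
    leftovers = [j for j in jobs if j not in best]
    schedule.update(zip(secondary, leftovers))
    return schedule


print(adaptive_schedule(cores, jobs, m=2))
```

In contrast to the greedy sketch, the weakest core here (C1) is served last and takes whatever job remains, which reflects the idea of letting some cores wear out so that the chosen subset lasts longer.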

Core Failure Distribution: Adaptive.

Scheduling Details. Greedy (cores, jobs): [equation], where m’ is the weakest module in the core. Adaptive (cores, jobs): [equation], where the minimal Cost(S) is identified with a GA*.

Experimental Methodology: application characterization and lifetime simulation. The simulation framework simulates wearout degradation (integrating existing models from the literature), models a 2-16 core CMP with an Alpha 21364-like architecture, simulates hardware sensor feedback, populates device lifetimes with Varius, tracks system-wide performance and reliability statistics, emulates the OS scheduler, and models different system utilization patterns. Policies evaluated: naïve (baseline), greedy, and adaptive scheduling. Figures of merit: lifetime throughput (LT) and failure distribution.

Lifetime Throughput. Failure threshold: the number of cores that can fail before a CMP is considered unusable. ↑ CMP size → ↑ LT. ↑ Failure threshold → ↓ LT. Failure thresholds of ~N/2 maximize the delta between Greedy and Adaptive.

Sensitivity to System Utilization (16 cores w/ failure threshold of 8): Lifetime throughput is maximized in moderately over-provisioned systems.

CMP Failure Distribution (16 cores w/ failure threshold of 8).

Conclusion: Maestro enables wear-leveling without extensive hardware support*. It capitalizes on the natural heterogeneity in both CMPs (process variation) and their workloads (thermal variation). Two approaches (policies) to wearout-centric scheduling were evaluated, and the results show that policies can be tailored based on user requirements to increase their effectiveness. Maestro can extend the effective service life of a 16-core CMP by 38%, or be reconfigured to perform 180% more useful work before experiencing the first core failure.

Questions? http://cccp.eecs.umich.edu