Qiang XU CUhk REliable computing laboratory (CURE)

Presentation transcript:

Fault-Tolerant Computing – It’s Time to Cross the Layer for Cost-Effectiveness Qiang XU CUhk REliable computing laboratory (CURE) Department of Computer Science & Engineering The Chinese University of Hong Kong

Technology Scaling Continues…
Feature size shrinks to tens of atoms across!
Effects:
- Manufacturing defects
- Process variation
- Transient errors from radiation
- Noise fluctuations
- Fragile devices with shortened lifetimes

Ever-Increasing Defect Density
IBM’s 8-core Cell processor chips: 10-20% yield

Defective Chip Identification
Testing is responsible for ensuring the quality of shipped products.
In the past …
[Figure: occurrence frequency of the GOOD and BAD chip populations, separated by a decision threshold. Redrawn from [O’Neill-itc07].]

Where is the Decision Threshold?
Nowadays …
[Figure: occurrence frequency of the GOOD and BAD chip populations with a decision threshold; labels: process variation, functional/test mode discrepancy, TEST ESCAPE, FALSE REJECT. Redrawn from [O’Neill-itc07].]
Manufacturing test is NOT reliable any more!
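The trade-off between test escapes and false rejects can be illustrated numerically. The following is a minimal sketch, assuming both populations follow normal distributions of some measured test parameter and that chips measuring below the threshold pass; the means, spreads, and the 20% bad-chip fraction are hypothetical values chosen for illustration, not data from the slides.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def threshold_errors(threshold, good=(0.0, 1.0), bad=(4.0, 1.0), bad_fraction=0.2):
    """Return (test_escape, false_reject) as fractions of all chips.

    Assumption: a chip passes when its measurement is below `threshold`.
    `good` and `bad` are hypothetical (mean, std) pairs for the two populations.
    """
    good_mean, good_std = good
    bad_mean, bad_std = bad
    # Bad chips that fall below the threshold are shipped: test escapes.
    test_escape = bad_fraction * normal_cdf(threshold, bad_mean, bad_std)
    # Good chips that fall above the threshold are discarded: false rejects.
    false_reject = (1.0 - bad_fraction) * (1.0 - normal_cdf(threshold, good_mean, good_std))
    return test_escape, false_reject


if __name__ == "__main__":
    # Well-separated populations ("in the past") versus overlapping ones ("nowadays").
    print("separated  :", threshold_errors(2.0, good=(0.0, 0.5), bad=(4.0, 0.5)))
    print("overlapping:", threshold_errors(2.0, good=(0.0, 1.5), bad=(4.0, 1.5)))
```

Under these assumptions, moving the threshold only exchanges one error type for the other once the populations overlap; it cannot remove both.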

Current Solution for Yield Improvement
Yield-driven redundancy:
- Cisco’s 192-core Metro network processor contains 4 spares
- nVidia’s 128-core GeForce 8800 GPU can be degraded to a 96-core version if some cores are faulty
Simple solution, but …
- More and more redundant circuitry is necessary
- Requires precise offline testing
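To see why spare cores help yield, here is a minimal sketch that assumes each core is defect-free independently with the same probability. The per-core yield of 0.99 is a hypothetical value, and reading the Cisco example as "188 of 192 cores must work" is an assumption about the slide, not a stated fact.

```python
from math import comb


def yield_with_spares(total_cores, required_cores, core_yield):
    """Probability that at least `required_cores` of `total_cores` are good.

    Assumption: core defects are independent and identically distributed,
    so the number of good cores follows a binomial distribution.
    """
    return sum(
        comb(total_cores, k) * core_yield**k * (1.0 - core_yield) ** (total_cores - k)
        for k in range(required_cores, total_cores + 1)
    )


if __name__ == "__main__":
    p = 0.99  # hypothetical per-core yield
    print("192 cores, no spares :", yield_with_spares(192, 192, p))
    print("192 cores, 4 spares  :", yield_with_spares(192, 188, p))
    print("128 cores, 96 needed :", yield_with_spares(128, 96, p))
```

The same function also shows the cost side of the slide: as per-core yield drops, the number of spares needed to keep chip yield constant grows.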

Other Reliability Threats
Hard errors (permanent):
- Time-dependent dielectric breakdown (TDDB)
- Electromigration (EM)
- Negative bias temperature instability (NBTI)
- Stress migration (SM)
Soft errors (transient):
- Alpha particles; neutrons
Intermittent faults (burst for a period of time)
Hardware solution: again, more redundant circuitry!

The Impact of Reliability Threats with Scaling
[Figure: failure rate versus time; labels: difficult burn-in, useful life, higher failure rate, faster aging.]

To Keep Scaling …
[Figure: cost per transistor versus year, showing transistor cost, reliability cost, and total cost.]

To Achieve Cost-Effective Scaling
Unlike in the old days, defective/vulnerable ICs will be shipped to customers!
Cross-layer solution as a remedy for resilient system design!

Cross-Layer Reliability
- Tolerate critical defects and soft/hard errors with high failure rates at the hardware level
- Mask non-critical defects and soft/hard errors with low failure rates at the hardware-dependent software level
- Take advantage of error tolerance at the application level
[Figure: layer stack, top to bottom: Applications, Hw.-dependent Sw., Defective/Vulnerable ICs.]

Key Questions in Cross-Layer Reliability
Differentiate the impact of various reliability threats and tackle them at different layers!
@ Circuit level
- Which defects and soft/hard errors are critical enough to require hardware redundancy?
- At which granularity should protection be applied?
- The traditional pass/fail testing methodology no longer stands; what would be the new metrics for testing?
- Online test and diagnosis are increasingly important

Key Questions in Cross-Layer Reliability
Differentiate the impact of various reliability threats and tackle them at different layers!
@ Hardware-dependent software level
- How to model various hardware faults accurately at this level?
- How to allocate workloads intelligently to mitigate such errors?
@ Application level
- How to take application reliability requirements into account?
- Is it possible to generalize such solutions?

Key Questions in Cross-Layer Reliability
Differentiate the impact of various reliability threats and tackle them at different layers!
@ System level: low-cost resilient designs under performance, power, and reliability constraints
- How to monitor the system’s reliability changes?
- How do we evaluate the cross-layer reliability for the entire system?
- Can we separate the layers clearly with only FIT or BER information?

High-Level Lifetime Reliability Modeling and Simulation Framework
SPECIFICATION: functionality, expected service life, power consumption, area constraint, thermal issue, …
IC DESIGN:
- DPM / DTM: DVFS, timeout, thermal throttling, power gating, …
- Redundancy: level, quantity
- Task allocation: round-robin, energy-driven

The Challenge
Wear-out effects of hard errors: reliability at a specific time point depends on
- current reliability-related factors (e.g., temperature)
- aging effects due to past usage
Significant temperature variation
Temperature simulation is time-consuming
[Figure: temperature variation example]
Only short simulation time is affordable!

The Challenge – Simulation Framework
It is not possible to trace temperature and aging-related execution parameters in a fine-grained manner throughout the entire lifetime.
What if we conduct coarse-grained tracing and compute lifetime reliability with the average operational temperature? Ignoring temperature variation results in a lack of accuracy.
How can we achieve efficient yet accurate lifetime reliability simulation with limited fine-grained trace information, when failure mechanisms follow arbitrary failure distributions?

Aging Rate Calculation
The key issue is to compute a time-independent aging rate Ω effectively with limited fine-grained trace information.
Given a general failure distribution R(t), e.g., the Weibull distribution, express it as R(t) = R(Θ·Ω·t); we then have [equation missing from the transcript]
Two steps:
- Derive a closed-form lifetime reliability function with time-varying operational states and temperature
- Extract the time-independent aging rate parameter from this function
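The idea of a time-independent aging rate can be sketched in a few lines. This is an illustration under stated assumptions, not the derivation from the slides: it assumes a Weibull reliability function, a hypothetical Arrhenius-style scale parameter (the prefactor and activation energy are made-up values), and a short temperature trace that repeats periodically over the whole lifetime. The function names are likewise hypothetical.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def scale_parameter(temp_kelvin, prefactor=1.0e-4, activation_energy_ev=0.7):
    """Hypothetical Weibull scale parameter (hours) at a fixed temperature.

    Assumption: Arrhenius-style dependence, so hotter operation means a
    shorter characteristic life.
    """
    return prefactor * math.exp(activation_energy_ev / (BOLTZMANN_EV * temp_kelvin))


def aging_rate(trace):
    """Aging rate (1/hour) from a short trace of (duration, temp_kelvin) pairs.

    Assumption: the trace repeats periodically, so the time-weighted average
    of 1/scale over one period characterises the whole lifetime.
    """
    total = sum(duration for duration, _ in trace)
    return sum(duration / scale_parameter(temp) for duration, temp in trace) / total


def reliability(t_hours, omega, shape=2.0):
    """Weibull reliability written in terms of the aging rate: exp(-(omega * t)^shape)."""
    return math.exp(-((omega * t_hours) ** shape))


if __name__ == "__main__":
    varying = [(0.5, 330.0), (0.5, 370.0)]  # fine-grained trace, mean 350 K
    averaged = [(1.0, 350.0)]               # coarse-grained: average temperature only
    ten_years = 10 * 365 * 24
    print("fine-grained  :", reliability(ten_years, aging_rate(varying)))
    print("average temp. :", reliability(ten_years, aging_rate(averaged)))
```

With these assumed parameters the average-temperature run reports a noticeably higher reliability than the fine-grained one, which is the kind of inaccuracy the previous slide attributes to ignoring temperature variation.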

Lifetime Reliability Simulation Framework – AgeSim
Evaluate lifetime reliability under various usage strategies and workloads:
- DPM / DTM
- Trigger mechanism
- Load-sharing strategy
- Redundancy scheme
Applicable to any failure distribution
Also outputs performance and energy consumption

Asymmetry-Aware Processor Allocation for Chip Multiprocessor
Chip multiprocessors with an increasing number of processor cores
However, technology scaling also results in …
- Defective cores on-chip
- Cores with distinct performance

Asymmetric Chip Multiprocessor
Performance asymmetry:
- Process variation: significant frequency deviation on a chip (up to 40%)
- Dynamic power-performance adaptation
Topology asymmetry:
- Manufacturing defects
- Wearout effect

Hide Hardware Defects @ OS Level
[Figure: applications see a unified topology provided by the OS; the underlying hardware is a chip multiprocessor with fault-free cores, faulty cores, and routers.]

Asymmetry-Aware Processor Allocation
We propose two contiguous processor allocation methodologies with different computing power representations, considering
- Performance, including communication overhead
- Processor allocation time
System Load = Mean Application Service Rate / Mean Application Arrival Rate
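For readers unfamiliar with contiguous allocation, the sketch below shows a generic first-fit search for a rectangular block of usable cores. It is not either of the two methodologies proposed in the talk: it assumes a 2D mesh layout, treats every usable core as equally powerful, and ignores communication overhead. The function name and grid encoding are hypothetical.

```python
def first_fit_submesh(free, width, height):
    """Find the first width x height block of usable cores in a 2D mesh.

    `free[r][c]` is True when the core at row r, column c is fault-free and idle.
    Returns the (row, col) of the block's top-left corner, or None if no block fits.
    """
    rows, cols = len(free), len(free[0])
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            # The block is usable only if every core inside it is free.
            if all(free[r + dr][c + dc] for dr in range(height) for dc in range(width)):
                return (r, c)
    return None


if __name__ == "__main__":
    # 3 x 4 mesh with one faulty core at row 0, column 1.
    mesh = [
        [True, False, True, True],
        [True, True, True, True],
        [True, True, True, True],
    ]
    print(first_fit_submesh(mesh, 2, 2))
    print(first_fit_submesh(mesh, 4, 2))
```

The exhaustive scan makes allocation time grow with mesh size, which hints at why allocation time is listed as a design consideration alongside performance.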

Thank you for your attention!