Hardware Fault Recovery for I/O Intensive Applications


Hardware Fault Recovery for I/O Intensive Applications
Pradeep Ramachandran, Intel Corporation
Siva Kumar Sastry Hari, NVIDIA
Manlap (Alex) Li, Latham and Watkins LLP
Sarita V. Adve, University of Illinois at Urbana-Champaign
*This work was done when Pradeep, Siva, and Alex were at the University of Illinois at Urbana-Champaign

Battling the Dark Side of Moore’s Law
- Hardware will fail in the field for a variety of reasons
- Need in-the-field solutions for detection, diagnosis, and recovery
  - Must incur low cost => traditional redundancy solutions too expensive!
- SWAT: A low-cost solution to handle unreliable HW
  - Key: Handle only HW faults that affect SW, with near-zero impact on fault-free execution
  - Detect faults with near-zero-cost monitors for SW anomalies, smart HW recovery
- This paper: A closer look at fault recovery with low-cost detection
[Figure labels: transient errors (high-energy particles), wear-out failures (devices are weaker), intermittent faults, … and so on]

Components of SWAT
- Detection: Low-cost monitors for anomalous SW behavior
  - E.g., fatal traps from protection violation, div by zero [ASPLOS’08, DSN’08, ASPLOS’12]
- Diagnosis: Identifies faulty core, uarch block [DSN’08, MICRO’09]
- Recovery w/o IO: Leverage existing solutions for core/mem checkpoint, rollback
- Recovery with IO not handled; also ignored by most other prior work
- Intricate relationship between detection and recovery not considered
  - Checkpoint interval = maximum detection latency
[Figure: timeline with labels Fault, Error, Symptom detection, Diagnosis, Recovery, and two checkpoints (Chkpt)]

Contributions of This Paper
- HW technique for fault recovery in the presence of external IOs
  - Existing recovery solutions mostly ignore the “output commit” problem
  - Low overhead to fault-free exec ⇒ detection latency must be <100K instructions; only 80% of faults are detected within 100K instructions with the existing definition!
- New definition of detection latency to be more relevant to recovery
  - Existing definition is conservative ⇒ high overhead to fault-free execution
- Combined evaluation of low-cost fault detection & recovery solution
  - SWAT recovers the system for 94% of injected faults at a checkpoint interval of 100K instr, 0.2% SDC rate

Agenda
- Motivation and Contributions
- Recovery in the presence of external IO
- A new definition of detection latency
- Combined evaluation of detection and recovery
- Conclusions

Output Buffering
- External outputs need to be delayed until guaranteed to be fault-free
  - Once committed, they cannot be rolled back
- Previous solution: Buffer outputs in dedicated SW [Revive I/O]
  - No HW changes; exploits semantics of SW-level output for efficiency
  - Outputs vulnerable as buffering SW runs on faulty HW
- Our solution: Buffer external outputs in dedicated HW
  - SW output maps to multiple dependent HW stores ⇒ potentially high overheads
  - Solution should require no changes to device HW
  - Buffered outputs should not be vulnerable to HW faults

Architecture of HW Output Buffer
- CPU communicates with devices through I/O loads & stores
- HW buffer holds outputs until next checkpoint or IO fence
  - Committed outputs verified fault-free and drained in parallel with regular execution
- Buffered outputs protected through ECC checks
  - Special handling during DMA transfer to device to protect output while draining
- CPU-centric implementation with no changes to IO devices
[Figure: block diagram with CPU, cache ($), memory, memory bus, output buffer, IO bus, and device]

Operations of HW Output Buffer
- Fault-free operation
  - Outputs verified as fault-free at second subsequent checkpoint
  - Verified stores (St 1, St 2 in the figure) drained to devices in the background
- Recovery operation
  - On fault detection, discard unverified buffered stores (St 3 in the figure)
  - Roll back arch state, restore devices
  - Continue execution from the restored checkpoint
- Checkpoint interval determined by maximum detection latency
[Figure: two timelines (fault-free and recovery) showing stores St 1 to St 3 moving between the buffer and the device across checkpoints]
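
The release and discard rules on this slide can be sketched in software. The following is a minimal toy model, not the hardware design from the talk: the class `OutputBuffer`, its method names, and the epoch counter are all assumptions introduced for illustration. It only shows that a store is held until the second checkpoint after it was issued and that unverified stores are dropped on fault detection.

```python
from collections import deque


class OutputBuffer:
    """Toy model of a checkpoint-gated output buffer (illustrative only)."""

    def __init__(self):
        # Each entry is (epoch, store); epoch counts the checkpoints taken
        # before the store was issued.
        self.pending = deque()
        self.epoch = 0
        self.device_log = []  # stores that have reached the device

    def io_store(self, store):
        """Hold a device store instead of sending it immediately."""
        self.pending.append((self.epoch, store))

    def checkpoint(self):
        """Take a checkpoint and drain the stores that are now verified.

        A store issued in epoch e is released at the second checkpoint
        after it, i.e. once the epoch counter reaches e + 2.
        """
        self.epoch += 1
        while self.pending and self.pending[0][0] + 2 <= self.epoch:
            self.device_log.append(self.pending.popleft()[1])

    def rollback(self):
        """On fault detection, drop every store not yet verified."""
        discarded = [store for _, store in self.pending]
        self.pending.clear()
        return discarded


buf = OutputBuffer()
buf.io_store("St 1")
buf.io_store("St 2")
buf.checkpoint()           # first checkpoint after St 1/St 2: nothing released
buf.io_store("St 3")
buf.checkpoint()           # second checkpoint: St 1 and St 2 reach the device
dropped = buf.rollback()   # fault detected: St 3 never reaches the device
```

The model leaves out ECC protection, IO fences, and device restoration, which the slides describe but which need more detail than the transcript provides.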

Measuring Fault-free Overheads
- Buffering outputs imparts overheads to fault-free execution
  - Outputs to clients delayed ⇒ performance overhead
  - HW to store buffered outputs ⇒ area overhead
- Simulated a fault-free client-server system to measure overheads
- Focused on I/O intensive workloads to study fault-free overheads
  - sshd, apache, mysql, squid w/ multiple request and server threads
[Figure: simulated server (CPU, devices, HW output buffer) and simulated client (CPU, devices) connected by a network with latency = 0.1 ms, in the SIMICS full-system simulator]

Performance Overhead from Output Buffering
- Checkpoint interval of 1M inst (~1 ms) => perf impact of 5X! Overhead grows with checkpoint interval!
- Practical checkpoint interval: <100K inst (~100 µs), with perf overhead of <5% on fault-free exec
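
The slide's pairing of 1M instructions with ~1 ms (and 100K with ~100 µs) implies roughly one instruction per nanosecond. As a rough sketch under that assumption, and using the earlier rule that outputs are verified at the second subsequent checkpoint, an output waits between one and two checkpoint intervals before release. The function names and the default rate below are illustrative assumptions, not values taken from the paper.

```python
def interval_seconds(instructions, instr_per_second=1e9):
    """Convert a checkpoint interval from instructions to seconds.

    The default rate of 1e9 instr/s is an assumption inferred from the
    slide's "1M inst (~1ms)".
    """
    return instructions / instr_per_second


def output_delay_bounds(instructions, instr_per_second=1e9):
    """Bounds on the extra delay an output sees before release.

    A store issued just before a checkpoint waits about one interval;
    one issued just after a checkpoint waits about two.
    """
    t = interval_seconds(instructions, instr_per_second)
    return t, 2 * t


for n in (1_000_000, 100_000):
    low, high = output_delay_bounds(n)
    print(f"{n:>9} instr: output delayed by {low * 1e3:.1f} to {high * 1e3:.1f} ms")
```

For comparison, the simulated network latency in the evaluation setup is 0.1 ms, so at a 1M-instruction interval the buffering delay is an order of magnitude larger than the network delay, while at 100K instructions the two are comparable.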

Connecting Detection and Recovery
- Checkpoint interval determined by maximum detection latency
- Recovery results ⇒ checkpoint interval ≤100K instructions
  - <5% performance overhead, <2KB area overhead
- SWAT detection results ⇒ only 80% detected in 100K instructions
- Need to reduce latency to enable practical solution
  - Shortcoming identified only when components combined; ignored in prior work
- Strategy
  - New low-cost HW detector for out-of-bounds accesses (details in paper)
  - Re-look at detection latency for recovery; previous definitions too conservative

A New Definition for Detection Latency
- Traditional definition: Hard latency = arch state corruption to detection
- But do all faults that corrupt arch state make the system unrecoverable?
- Key observation: Software may tolerate some corruptions!
  - E.g., a value a that is used only in the test a>0 changes from 5 to 10
- New definition: Soft latency = SW state corruption to detection
  - Checkpoint interval should be based on the new definition
[Figure: timeline from Fault to Bad arch state to Bad SW state to Detection; hard latency runs from bad arch state to detection, soft latency from bad SW state to detection; each has its own most recent recoverable checkpoint]
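
The a>0 example can be made concrete with a small sketch. The helper names below (`sw_visible`, `classify`) are hypothetical and exist only to illustrate the distinction: a corruption counts as architectural if the stored value changes, and as a software-state corruption only if the program's use of the value changes.

```python
def sw_visible(a):
    """The only use of `a` in this toy program: a sign test."""
    return a > 0


def classify(golden, corrupted, use=sw_visible):
    """Return (arch_corrupted, sw_corrupted) for one faulty value."""
    arch_corrupted = golden != corrupted
    sw_corrupted = use(golden) != use(corrupted)
    return arch_corrupted, sw_corrupted


# 5 -> 10: the register value is wrong, but a > 0 still evaluates to True,
# so the software-visible outcome is unchanged.
print(classify(5, 10))   # (True, False)

# 5 -> -3: the sign test flips, so software state is corrupted as well.
print(classify(5, -3))   # (True, True)
```

In the first case the hard-latency clock has started but the soft-latency clock has not, which is why a checkpoint taken after the corruption can still be a recoverable one.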

Hard Latency vs Soft Latency to Determine Checkpoint Interval
- For any targeted detection rate, detection latency w/ hard latency >> w/ soft latency
  ⇒ Hard latency may result in unnecessarily long checkpoint intervals and high overheads
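
One way to read this slide: for a targeted detection rate, the checkpoint interval must be at least as long as the latency within which that fraction of faults is detected. The sketch below assumes that reading. The function `required_interval` is hypothetical, and the latency lists are made-up numbers chosen only to show the mechanics; they are not measurements from the paper.

```python
import math


def required_interval(latencies, target_rate):
    """Smallest interval (in instructions) covering target_rate of the faults."""
    if not latencies or not 0 < target_rate <= 1:
        raise ValueError("need latencies and a target rate in (0, 1]")
    ordered = sorted(latencies)
    # Number of faults whose latency must fit inside the interval.
    # round() guards against float noise such as 0.7 * 10 = 7.000000000000001.
    k = math.ceil(round(target_rate * len(ordered), 9))
    return ordered[k - 1]


# Synthetic per-fault latencies in instructions (illustrative values only).
hard = [2_000, 40_000, 90_000, 500_000, 3_000_000]
soft = [500, 8_000, 30_000, 70_000, 900_000]

print(required_interval(hard, 0.8))  # 500000
print(required_interval(soft, 0.8))  # 70000
```

With these invented numbers, the same 80% target needs a much shorter interval under soft latency, which is the qualitative effect the slide describes.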

Evaluating SWAT Detection + Recovery with IO Devices
- µarch-level fault injections into simulated server CPU
  - Focused on server workloads due to heavy I/O
- Detection: Simulate faults for 10M instructions with SWAT detectors
- Recovery: Restore system after detection with different checkpoint intervals
  - Roll back CPU & memory, restore devices, replay buffered outputs
[Figure: same simulated client-server setup as before (SIMICS full-system simulator, network latency = 0.1 ms), with the fault injected into the server CPU]

SWAT Detection + Recovery Results
- 94% of faults detected and recovered at a checkpoint interval of 100K instructions
- Only 44 of 18,000 injected faults (0.2%) cause Silent Data Corruptions (SDCs)

Conclusions
- Key challenge: Low-cost solution for reliable exec on unreliable HW
- Emerging low-cost solutions for detection, recovery, diagnosis, like SWAT
- But recovery in the presence of IO ignored ⇒ limited applicability
- This paper presents
  - Low-cost HW solution for recovery with IO; <5% perf, <2KB area overhead
  - New definition of detection latency that reduces overheads to fault-free exec
  - Evaluation of detection + recovery: 94% of faults recovered at the above overheads; only 0.2% of faults cause SDCs
- Ongoing work: Eliminate SDCs by leveraging application properties

BACKUP

I/O Characteristics of Server Workloads

Application   Data Transferred   Average Rate   % I/O wait
apache        38MB               9.5MBps        76.5%
sshd          19MB               2.5MBps        24.3%
squid         20MB               11.6MBps       69.5%
mysql         7.5MB              1.05MBps       71.1%

Importance of I/O for Fault Recovery
- Without device recovery and output buffering, recoverability is reduced by 89%

Measuring Soft Latency vs Hard Latency
- Identifying uarch state corruption is easy ⇒ hard latency easily measurable
- Measuring soft latency ⇒ need to identify when SW state is corrupted
- But identifying SW state corruption is hard!
  - Need to know how the faulty value is used by the application, and if it affects output
- Measure soft latency by rolling back to older checkpoints
  - Only for analysis, not required in reality
[Figure: timeline with Fault, Bad arch state, Bad SW state, Detection; rollback & replay from a recent checkpoint reproduces the symptom, while rollback & replay from an older checkpoint masks the fault effect]
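
The rollback-based measurement can be sketched as a search over checkpoints, newest first. This is an interpretation of the slide, not the authors' tooling: `replay_is_clean` is a hypothetical callback standing in for "roll back to this checkpoint, replay, and check that the symptom does not reappear", and times are in arbitrary instruction counts.

```python
def newest_recoverable_checkpoint(checkpoints, detection_time, replay_is_clean):
    """Return the newest checkpoint from which replay masks the fault effect."""
    candidates = sorted(
        (c for c in checkpoints if c <= detection_time), reverse=True
    )
    for chkpt in candidates:
        if replay_is_clean(chkpt):
            return chkpt
    return None  # no tried checkpoint recovers the system


def soft_latency_bound(checkpoints, detection_time, replay_is_clean):
    """Upper bound on soft latency, at checkpoint granularity."""
    chkpt = newest_recoverable_checkpoint(
        checkpoints, detection_time, replay_is_clean
    )
    if chkpt is None:
        return None
    # SW state was still good at chkpt, so the corruption happened later.
    return detection_time - chkpt


# Toy scenario: SW state becomes corrupted at time 180, detection at 350.
sw_corrupted_at = 180
clean = lambda chkpt: chkpt <= sw_corrupted_at
print(soft_latency_bound([0, 100, 200, 300], 350, clean))  # 250
```

In the toy scenario the true soft latency is 170, and the bound of 250 reflects the spacing of the checkpoints; finer checkpoints would tighten it. As the slide notes, this search is needed only for analysis, not at run time.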