Software Managed Resiliency Siva Hari Lei Chen, Xin Fu, Pradeep Ramachandran, Swarup Sahoo, Rob Smolenski, Sarita Adve Department of Computer Science University.

Slides:



Advertisements
Similar presentations
Abstract HyFS: A Highly Available Distributed File System Jianqiang Luo, Mochan Shrestha, Lihao Xu Department of Computer Science, Wayne State University.
Advertisements

U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science Emery Berger University of Massachusetts Amherst Operating Systems CMPSCI 377 Lecture.
(C) 2005 Daniel SorinDuke Computer Engineering Autonomic Computing via Dynamic Self-Repair Daniel J. Sorin Department of Electrical & Computer Engineering.
Hardware Fault Recovery for I/O Intensive Applications Pradeep Ramachandran, Intel Corporation, Siva Kumar Sastry Hari, NVDIA Manlap (Alex) Li, Latham.
Making Services Fault Tolerant
Distributed components
1 Lecture 26: Storage Systems Topics: Storage Systems (Chapter 6), other innovations Final exam stats:  Highest: 95  Mean: 70, Median: 73  Toughest.
What Great Research ?s Can RAMP Help Answer? What Are RAMP’s Grand Challenges ?
Yuanyuan ZhouUIUC-CS Architectural Support for Software Bug Detection Yuanyuan (YY) Zhou and Josep Torrellas University of Illinois at Urbana-Champaign.
University of Kansas Construction & Integration of Distributed Systems Jerry James Oct. 30, 2000.
1 Software Fault Protection Allen Goldberg Kestrel Technology.
1 FM Overview of Adaptation. 2 FM RAPIDware: Component-Based Design of Adaptive and Dependable Middleware Project Investigators: Philip McKinley, Kurt.
.NET Mobile Application Development Remote Procedure Call.
Concurrent Systems Architecture Group University of California, San Diego and University of Illinois at Urbana-Champaign Morph 9/21/98 Morph: Supporting.
Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation Kypros Constantinides University of Michigan Onur.
MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,
Storage System: RAID Questions answered in this lecture: What is RAID? How does one trade-off between: performance, capacity, and reliability? What is.
0 Deterministic Replay for Real- time Software Systems Alice Lee Safety, Reliability & Quality Assurance Office JSC, NASA Yann-Hang.
Advances in Language Design
Software faults & reliability Presented by: Presented by: Pooja Jain Pooja Jain.
Towards a Hardware-Software Co-Designed Resilient System Man-Lap (Alex) Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou University of.
Software Faults and Fault Injection Models --Raviteja Varanasi.
2. Fault Tolerance. 2 Fault - Error - Failure Fault = physical defect or flow occurring in some component (hardware or software) Error = incorrect behavior.
SWAT: Designing Reisilent Hardware by Treating Software Anomalies Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet.
An Empirical Study of Reported Bugs in Server Software with Implications for Automated Bug Diagnosis Swarup Kumar Sahoo, John Criswell, Vikram Adve Department.
A Framework for Highly-Available Session-Oriented Internet Services Bhaskaran Raman, Prof. Randy H. Katz {bhaskar, The ICEBERG Project.
Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design M. Li, P. Ramachandra, S.K. Sahoo, S.V. Adve, V.S.
Z. Kalbarczyk K. Whisnant, Q. Liu, R.K. Iyer Center for Reliable and High-Performance Computing Coordinated Science Laboratory University of Illinois at.
Eliminating Silent Data Corruptions caused by Soft-Errors Siva Hari, Sarita Adve, Helia Naeimi, Pradeep Ramachandran, University of Illinois at Urbana-Champaign,
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development 3.
ICS-FORTH 25-Nov Infrastructure for Scalable Services Are we Ready Yet? Angelos Bilas Institute of Computer Science (ICS) Foundation.
FireProof. The Challenge Firewall - the challenge Network security devices Critical gateway to your network Constant service The Challenge.
Using Likely Program Invariants to Detect Hardware Errors Swarup Kumar Sahoo, Man-Lap Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou.
SWAT: Designing Resilient Hardware by Treating Software Anomalies Lei Chen, Byn Choi, Xin Fu, Siva Hari, Man-lap (Alex) Li, Pradeep Ramachandran, Swarup.
Application-Aware SoftWare AnomalyTreatment (SWAT) of Hardware Faults Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita.
Relyzer: Exploiting Application-level Fault Equivalence to Analyze Application Resiliency to Transient Faults Siva Hari 1, Sarita Adve 1, Helia Naeimi.
Test Results of the EuroStore Mass Storage System Ingo Augustin CERNIT-PDP/DM Padova.
RELIABILITY ENGINEERING 28 March 2013 William W. McMillan.
CISC Machine Learning for Solving Systems Problems Presented by: Suman Chander B Dept of Computer & Information Sciences University of Delaware Automatic.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Adaptive Online Testing.
StealthTest: Low Overhead Online Software Testing Using Transactional Memory Jayaram Bobba, Weiwei Xiong*, Luke Yen †, Mark D. Hill, and David A. Wood.
FPGA-Based System Design: Chapter 7 Copyright  2004 Prentice Hall PTR Topics n Hardware/software co-design.
SWAT: Designing Resilient Hardware by Treating Software Anomalies Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve,
Preserving Application Reliability on Unreliable Hardware Siva Hari Department of Computer Science University of Illinois at Urbana-Champaign.
HPC HPC-5 Systems Integration High Performance Computing 1 Application Resilience: Making Progress in Spite of Failure Nathan A. DeBardeleben and John.
A Binary Agent Technology for COTS Software Integrity Anant Agarwal Richard Schooler InCert Software.
Low-cost Program-level Detectors for Reducing Silent Data Corruptions Siva Hari †, Sarita Adve †, and Helia Naeimi ‡ † University of Illinois at Urbana-Champaign,
Adding Algorithm Based Fault-Tolerance to BLIS Tyler Smith, Robert van de Geijn, Mikhail Smelyanskiy, Enrique Quintana-Ortí 1.
Pinpoint: Problem Determination in Large, Dynamic Internet Services Mike Chen, Emre Kıcıman, Eugene Fratkin {emrek,
Em Spatiotemporal Database Laboratory Pusan National University File Processing : Database Management System Architecture 2004, Spring Pusan National University.
By Ronnie Julio Mohammad Alsawwaf.  Using more than two computer systems that are linked together  Handles a larger/more variable workload  Provides.
Preserving Application Reliability on Unreliable Hardware Siva Hari Adviser: Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign.
GangES: Gang Error Simulation for Hardware Resiliency Evaluation Siva Hari 1, Radha Venkatagiri 2, Sarita Adve 2, Helia Naeimi 3 1 NVIDIA Research, 2 University.
Cofax Scalability Document Version Scaling Cofax in General The scalability of Cofax is directly related to the system software, hardware and network.
Approximate Computing: (Old) Hype or New Frontier? Sarita Adve University of Illinois, EPFL Acks: Vikram Adve, Siva Hari, Man-Lap Li, Abdulrahman Mahmoud,
Fail-stutter Behavior Characterization of NFS
Presented by: Daniel Taylor
Fault Tolerance Comparison
Lazy Preemption to Enable Path-Based Analysis of Interrupt-Driven Code
MSWAT: Hardware Fault Detection and Diagnosis for Multicore Systems
Lessons from The File Copy Assignment
ATTRACT TWD Symposium, Barcelona, Spain, 1st July 2016
SWAT: Designing Resilient Hardware by Treating Software Anomalies
Fault Tolerance In Operating System
Fault Injection: A Method for Validating Fault-tolerant System
ARMOR-based Hierarchical Fault/Error Management
InCheck: An In-application Recovery Scheme for Soft Errors
Mark McKelvin EE249 Embedded System Design December 03, 2002
Co-designed Virtual Machines for Reliable Computer Systems
Abstractions for Fault Tolerance
Presentation transcript:

Software Managed Resiliency Siva Hari Lei Chen, Xin Fu, Pradeep Ramachandran, Swarup Sahoo, Rob Smolenski, Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign

Motivation Hardware will fail in-the-field due to several reasons  Need in-field detection, diagnosis, recovery, repair Reliability problem pervasive across many markets –Traditional redundancy solutions (e.g., nMR) too expensive  Need low-cost solutions for multiple failure sources  Must incur low area, performance, power overhead Transient errors (High-energy particles ) Wear-out (Devices are weaker) Design Bugs … and so on

Observations Need handle only hardware faults that propagate to software Fault-free case remains common, must be optimized  Watch for software anomalies (symptoms) – Zero to low overhead “always-on” monitors Diagnose cause after symptom detected −May incur high overhead, but rarely invoked  SWAT: SoftWare Anomaly Treatment

SWAT Framework Components Detection: Symptoms of software misbehavior Diagnosis: Rollback/replay on multicore Recovery: Checkpoint/rollback, output buffering Repair/reconfiguration: Redundant, reconfigurable hardware Flexible control through firmware Diagnosis FaultErrorSymptom detected Recovery Repair Checkpoint

SWAT Contributions So Far In-situ diagnosis [DSN’08] Very low-cost detectors, 99+% coverage [ASPLOS’08, DSN’08] Diagnosis FaultErrorSymptom detected Recovery Repair Checkpoint Accurate fault modeling [HPCA’09] Multithreaded workloads [MICRO’09]

Off-core faults Client/server interactions Exploit safe parallel languages –DeNovo + Deterministic Parallel Java for safe parallelism –Will explore for resiliency SWAT on Multicore and Distributed Systems Simulated Network Simics Simulated Server Simulated Client Application OS Hardware Application OS Hardware Fault TATA TBTB TCTC TDTD TDTD TATA TBTB TCTC TCTC TDTD TATA TBTB

Leverage app knowledge and structure for s/w reliability – Can improve both SDCs and recovery E.g., stateless recovery for server threads, transactional semantics for clients, assertions in production code Application-Driven Resiliency Fault-freeFaulty

FPGA prototype with Michigan CrashTest – Realistic fault models – Full multicore implementation with firmware Validation and Prototype