Using Fault Model Enforcement (FME) to Improve Availability EASY ’02 Workshop Kiran Nagaraja, Ricardo Bianchini, Richard Martin, Thu Nguyen Department of Computer Science Rutgers University

Motivation
- Network services are extremely complex
  - Typically many software and hardware components
  - Numerous fault points and types: e.g., nodes, disks, cables, links, switches
- Extremely difficult for services to tolerate all of these faults
  - Hard to reason about all possible faults
  - Difficult to determine the actual fault: many faults exhibit the same runtime symptoms

FME Approach
- Define a reduced abstract fault model
  - Components, faults, symptoms, and component behavior during faults
- Enforce this fault model at run-time
  - If an "unexpected" fault occurs, map it to one that was planned for in the abstract model
  - "If the facts don't fit the theory, change the facts." - Albert Einstein
- Allows the designer to concentrate on tolerating a well-defined set of faults of limited complexity

Our Study
- Estimate the potential impact of FME (we have not yet implemented FME)
- Case study: the PRESS cluster-based web server
  - PRESS has a simple abstract fault model
  - In a companion study, PRESS achieves only around three 9's of availability
- Study the hypothetical improvement if FME were used to enforce PRESS's abstract fault model
  - FME can reduce unavailability by up to 50%

Outline
- FME in more detail
- Evaluation methodology
- PRESS web server
- Availability study
- Related work
- Conclusions
- Future directions

Fault Model Enforcement (FME)
- Enforce a reduced fault model at runtime
  - Allows the service to perform the correct recovery action and regain full functionality
- How to enforce a reduced fault model? Two ideas so far (see the sketch below):
  - Map an unexpected fault to an expected fault
    - E.g., crash a node if the network link connecting it to the switch fails
  - Fail the outer component if a sub-component fails
    - E.g., crash a node if its disk fails
- How is this different from fail-stop? FME allows reasoning about failures at a desired level of abstraction
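To make the mapping idea concrete, here is a minimal Python sketch of an enforcement policy. It is hypothetical, not PRESS's code: the fault names, the ENFORCEMENT table, and the crash_node helper are assumptions for illustration only.

```python
# Hypothetical enforcement policy: map faults that the service's abstract fault
# model does not cover onto the one fault it does cover (node failure).
EXPECTED_FAULTS = {"node_failure", "application_crash"}  # faults the service plans for
ENFORCEMENT = {                                          # unexpected fault -> expected fault
    "link_down":    "node_failure",   # crash the node whose uplink failed
    "disk_failure": "node_failure",   # fail the outer component (node) when a sub-component (disk) fails
}

def crash_node(node):
    """Illustrative placeholder: force the 'node failure' symptom on the given node."""
    print(f"FME: crashing {node} to match the abstract fault model")

def handle_fault(fault, node):
    if fault in EXPECTED_FAULTS:
        return                        # the service already knows how to recover from this
    if ENFORCEMENT.get(fault) == "node_failure":
        crash_node(node)              # turn the unexpected fault into an expected one

handle_fault("link_down", "node3")    # the service then sees an ordinary node failure
```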

Evaluation Methodology
- Goal: evaluate FME's potential impact
- Two-phase methodology
  - Phase I - single-fault injection analysis
    - Define and inject faults on a "live" system
    - Monitor system performance (throughput T) and availability (A) = fraction of successful requests (see the sketch below)
  - Phase II - use an analytical model to determine performability
    - Computes average availability and average throughput
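As a concrete reading of the Phase I metrics, the following sketch computes availability as the fraction of successful client requests and throughput as completed requests per second. The log format and the example numbers are assumptions, not the actual experimental harness.

```python
def phase1_metrics(request_log, run_seconds):
    """request_log: list of (timestamp, succeeded) pairs from one fault-injection run."""
    total = len(request_log)
    ok = sum(1 for _, succeeded in request_log if succeeded)
    availability = ok / total if total else 1.0   # A = fraction of successful requests
    throughput = ok / run_seconds                 # T = completed requests per second
    return availability, throughput

# Example: 9,000 of 10,000 requests succeed during a 100-second run.
A, T = phase1_metrics([(t, t % 10 != 0) for t in range(10_000)], 100)
print(f"A = {A:.2f}, T = {T:.0f} req/s")          # A = 0.90, T = 90 req/s
```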

Case Study: PRESS Web Server
- Cluster-based, locality-conscious web server
  - Serves requests out of a global memory pool
  - Exclusion from the pool means lower performance
- Simple fault model
  - Connection failure / lost heartbeats = node failure (see the detection sketch below)
  - Recovery through the rejoin of a "new" node
- Several versions developed over time (TCP, VIA)
  - Different fault detection mechanisms: heartbeats for TCP, connection breaks for VIA
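A minimal sketch of the heartbeat side of that fault model: any node that misses a few heartbeats is declared failed, regardless of whether the underlying cause was a crash, a hang, or a broken link. The timing constants and names are assumptions, not PRESS's actual implementation.

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats (assumed value)
MISSED_LIMIT = 3           # missed beats before declaring a node failure (assumed value)

last_beat = {}             # node -> time the last heartbeat was received

def on_heartbeat(node):
    last_beat[node] = time.monotonic()

def failed_nodes():
    """Nodes silent for MISSED_LIMIT intervals are treated as node failures;
    the collapsed symptom hides the real cause (crash, hang, or broken link)."""
    now = time.monotonic()
    return [n for n, t in last_beat.items()
            if now - t > MISSED_LIMIT * HEARTBEAT_INTERVAL]
```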

Fault Set
- Fault load: link down, switch down, SCSI timeout, node crash, node freeze, application crash, application hang
- All faults are modeled as fail-stop

PRESS with FME
- Recovery upon a fault model mismatch: restart 0, 1, or all nodes?
- FME approach: reboot the appropriate node(s) after a fault and its recovery have occurred (see the sketch below)
  - Link down - reboot the unreachable node
  - Switch down - reboot all nodes
  - Disk failure - reboot the node with the faulty disk
  - Node or application crash - do nothing
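These per-fault choices amount to a small dispatch table. The sketch below is illustrative only; the fault names and the reboot helper are assumptions, and a real enforcement layer would act on the cluster rather than print.

```python
def reboot(nodes):
    """Illustrative placeholder for rebooting one or more cluster nodes."""
    for n in nodes:
        print(f"rebooting {n}")

def fme_recover(fault, cluster, affected_node=None):
    """Reboot only as much of the cluster as the fault-model mismatch requires."""
    if fault == "link_down":
        reboot([affected_node])    # unreachable node later rejoins as a "new" node
    elif fault == "switch_down":
        reboot(cluster)            # every node lost connectivity
    elif fault == "disk_failure":
        reboot([affected_node])    # only the node with the faulty disk
    elif fault in ("node_crash", "application_crash"):
        pass                       # already matches the abstract fault model: do nothing

fme_recover("switch_down", ["node1", "node2", "node3", "node4"])
```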

Single-Fault Experiments
- Setup: 4-PC cluster running at 90% load
- 3 versions: TCP, TCP-HB, VIA
- Use the results to evaluate the impact of FME

Single Fault - Results
[Result graphs omitted: link failure and application hang]

Modeling – Seven-Stage Model
- Input: measured throughput and availability
- Parameters: MTTF, MTTR, operator on-site time
- Output: average availability & average throughput

Modeling Availability
- Assumptions:
  - The effects of faults are independent
  - Fault arrivals are exponentially distributed
- Overall unavailability = Σ over all fault types T of the unavailability caused by T (see the sketch below)
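Under these assumptions, a simplified steady-state reading of that sum is that each fault type contributes roughly MTTR / (MTTF + MTTR) of downtime, and the contributions add. The sketch below illustrates only this sum, not the paper's seven-stage model (which also accounts for stages of degraded rather than lost service); the MTTF/MTTR values are taken from the Phase II parameter table at the end of the deck.

```python
MINUTE, HOUR = 1, 60
DAY, WEEK, MONTH, YEAR = 24 * 60, 7 * 24 * 60, 30 * 24 * 60, 365 * 24 * 60

# (MTTF, MTTR) per fault type, in minutes.
faults = {
    "link_down":         (6 * MONTH, 3 * MINUTE),
    "switch_down":       (1 * YEAR,  1 * HOUR),
    "scsi_timeout":      (1 * YEAR,  1 * HOUR),
    "node_crash":        (2 * WEEK,  3 * MINUTE),
    "node_freeze":       (2 * WEEK,  3 * MINUTE),
    "application_crash": (2 * MONTH, 3 * MINUTE),
    "application_hang":  (2 * MONTH, 3 * MINUTE),
}

# Overall unavailability = sum over fault types of that type's unavailability,
# assuming the effects of faults are independent.
unavailability = sum(mttr / (mttf + mttr) for mttf, mttr in faults.values())
print(f"unavailability ~ {unavailability:.2e}, availability ~ {1 - unavailability:.5f}")
```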

Modeling Results
- Application fault rate: 1/month
- Time to operator intervention: 5 minutes
- Unavailability of TCP-HB is reduced by ~50%; VIA: ~36% reduction

Modeling Results
- Application fault rate: 1/day (unstable software)
- Time to operator intervention: 5 minutes
- Unavailability of TCP-HB is reduced by more than 50%; VIA: ~13% reduction

Related Work
- Enforcing fail-stop
  - Tandem NonStop - process pairs
  - Robust design with rigorous internal assertions
- Fault detection and fail-over
  - HA-Linux
- Reactive and proactive rejuvenation
  - Recursive restartability (ROC) - Berkeley & Stanford
  - Software rejuvenation - Duke

Conclusion
- FME allows for very simple fault models
- FME can cut unavailability by up to 50%
- The fault detection mechanism is crucial for effectiveness
- Benefits increase with fault coverage

FME - Future Directions
- How extensive should the fault model be? This determines programming complexity and effort
- How do we prevent FME itself from reducing availability, e.g., through bugs in the enforcement?
- When should a symptom be declared a fault?
- FME reduces human intervention, but are humans better at deciding?
  - 8-23% of recovery procedures are botched [Brown 2001]

Thank you.

Communication Architecture
- All operations by the main thread are non-blocking
- Separate send, receive, and multiple disk helper threads
- Filling up of the queues could stall the entire node (see the sketch below)
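A minimal sketch of that structure, assuming a bounded queue between the main thread and one disk helper thread (the names and sizes are illustrative, not PRESS's code): the main thread stays non-blocking by never waiting on a full queue, since blocking there would stall the entire node.

```python
import queue
import threading

disk_queue = queue.Queue(maxsize=128)   # bounded queue feeding a disk helper thread

def disk_helper():
    while True:
        path = disk_queue.get()         # blocking here is fine: only the helper waits
        try:
            with open(path, "rb") as f:
                f.read()                # the blocking disk I/O happens in the helper
        except OSError:
            pass
        disk_queue.task_done()

threading.Thread(target=disk_helper, daemon=True).start()

def main_thread_enqueue(path):
    """The main thread must never block: reject work when the queue is full,
    because waiting here would stall the entire node."""
    try:
        disk_queue.put_nowait(path)
        return True
    except queue.Full:
        return False                    # shed load instead of stalling
```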

Performability
- The model computes 2 metrics: average throughput (AT) and average availability (AA)
- Performability P = Tn x log(AI) / log(AA) (see the sketch below)
  - AI: availability of an ideal system
  - The log-scale ratio gives a linear relationship with unavailability
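Reading the formula above as a ratio of log-availabilities, a hypothetical sketch follows. Treating Tn as the average throughput and AI as the availability of an idealized (more available) system are assumptions from this reading of the slide, not a definitive statement of the metric; for availabilities near 1, log(AI)/log(AA) is approximately the ratio of ideal to actual unavailability, which is what gives the linear relationship with unavailability.

```python
import math

def performability(avg_throughput, avg_availability, ideal_availability):
    """P = Tn * log(AI) / log(AA). Near availability 1, log(AI)/log(AA)
    ~ (ideal unavailability) / (actual unavailability), so higher actual
    unavailability reduces performability roughly linearly."""
    return avg_throughput * math.log(ideal_availability) / math.log(avg_availability)

# Made-up example: 1,000 req/s average throughput, four 9's ideal availability,
# three 9's measured average availability -> roughly 100 req/s of performability.
print(performability(1000, 0.999, 0.9999))
```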

Experiments: Single-Fault Loads
- 4 x 800 MHz PIII PCs, 206 MB RAM, 2 x 10,000 RPM SCSI disks, 1 Gb/s cLAN interconnect (TCP or VIA)
- PRESS: 128 MB file cache, static content
- Clients: constant request rate at ~90% of server capacity
  - Modified sclient [Banga 97]
  - Rutgers trace; file size = average request size

Mendosus – Fault Injection
[Architecture diagram omitted: a central controller, connected over a fast & reliable SAN, drives per-node fault-injection daemons; on each node, a user-level library (Mlib) linked into applications such as PRESS and kernel-level hooks emulate network, SCSI, process, and node/OS faults]

Phase II – Modeling Performability
- 5-minute duration for the operator intervention (E) and restart (F) stages

Fault             | MTTF     | MTTR
------------------|----------|----------
Link down         | 6 months | 3 minutes
Switch down       | 1 year   | 1 hour
SCSI timeout      | 1 year   | 1 hour
Node crash        | 2 weeks  | 3 minutes
Node freeze       | 2 weeks  | 3 minutes
Application crash | 2 months | 3 minutes
Application hang  | 2 months | 3 minutes