Fail-stutter Behavior Characterization of NFS

Presentation transcript:

Fail-stutter Behavior Characterization of NFS
Jichuan Chang
CS736 Final Project, UW-Madison
December 13, 2002

Motivation
- We want systems to be very Fast and Available!
- Hard to achieve for modern computer systems
  - complex interactions among components;
  - can't assume everything is always working perfectly!
- We need a better fault model
  - Simpler than the Byzantine model;
  - Richer than the fail-stop model;
  - Fail-stutter Fault-tolerance [Remzi 01].
[Figure: fault-model diagram. Fail-stop: a fault moves the system straight to Down. Fail-stutter: performance faults and correctness faults separate the states Stable Performance, Low Performance, and Down.]

Fail-stutter Issues
- Separate performance faults from correctness faults
  - What are performance faults? Need a performance specification, but how to get the spec.? How to distinguish "interference" and performance fault?
  - What are correctness faults? Correctness should be defined in an end-to-end manner.
  - How to diagnose both types of faults? Must observe how systems behave!
- Exploit fail-stutter behavior
  - Who should be notified about failures, when and how?
  - System supports - programming tools / runtime support
  - Integration with existing systems - less intrusion

Our Approach
- Case study: NFS fail-stutter characterization
  - Fault-injection (vs. system monitoring)
  - Performance measurement
  - Simple, software-based test-bed
- Interesting observations
  - Different failed parts have different performance impact
  - Different types of clients have different behaviors: Patient (keep retrying) vs. Impatient (try other servers)
  - Transition between performance and correctness faults: can be determined proactively by fault-injection; performance spec. could be application-specific.
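The patient/impatient distinction above can be sketched in a few lines. This is only an illustration, not the NFS client code used in the study: the function names, the use of `IOError` to signal a lost request, and the `retries_per_server` default are all assumptions made for the example.

```python
def patient_request(server, max_attempts=None):
    """Patient client: keep retrying the same server until it answers.

    `server` is any zero-argument callable that returns a reply or raises
    IOError (hypothetical stand-in for a lost request).
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            return server(), attempts
        except IOError:
            continue  # same server, try again
    raise IOError("gave up after %d attempts" % attempts)


def impatient_request(servers, retries_per_server=1):
    """Impatient client: a few tries per server, then move on to the next."""
    attempts = 0
    for server in servers:
        for _ in range(1 + retries_per_server):
            attempts += 1
            try:
                return server(), attempts
            except IOError:
                continue
    raise IOError("all servers failed")
```

A patient client turns a lossy path into longer response times, while an impatient client turns it into errors or failovers sooner, which is one way the same injected fault can show up as a performance fault for one client and a correctness fault for another.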

Experimental Settings
[Figure: test-bed diagram with NFS Client App, Click S/W Router, NFS Server, and Storage System; several components are marked X.]
- Workloads - SpecSFS97, file (micro-benchmark).
- Data to collect - throughput, response time, errors.
- Faulty components - network, server, disk, bus, etc.
- Fault injection - network packet dropping
  - drop k% of Ethernet packets,
  - drop k% of IP packets coming from the server.
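As a rough illustration of "drop k% of packets", the sketch below drops each packet independently with probability k/100. It is a pure-Python simulation, not the Click router configuration used in the test-bed; `inject_drops` and its parameters are hypothetical names introduced for the example.

```python
import random


def inject_drops(packets, drop_percent, rng=None):
    """Return the packets that survive a random drop of `drop_percent` %.

    `rng` may be a seeded random.Random for repeatable experiments.
    """
    if not 0 <= drop_percent <= 100:
        raise ValueError("drop_percent must be between 0 and 100")
    rng = rng or random.Random()
    threshold = drop_percent / 100.0
    # rng.random() is in [0, 1), so 0% keeps everything and 100% keeps nothing.
    return [p for p in packets if rng.random() >= threshold]
```

Passing a seeded generator, for example `inject_drops(packets, 5, random.Random(42))`, makes a run repeatable, which helps when comparing workloads under the same injected loss pattern.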

Results (1) - Patient Client
[Plots omitted; X = Error occurred]
1. Performance degradation scales with drop probability.
2. Ethernet dropping less harmful compared with IP dropping.
3. Performance data less meaningful when error occurs.
4. Different operations switch to correctness faults at different points (e.g. 5%, 15%, 20%). Total execution time can hide such difference.
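One way to read these results is as a classification problem: a run with errors is a correctness fault, a run without errors that misses its performance specification is a performance fault, and the lowest drop rate with errors marks the transition point for an operation. The sketch below assumes a very simple specification (a single response-time bound); the record layout, field names, and threshold are hypothetical and not taken from the study.

```python
from collections import namedtuple

# One measured run of one operation at a given injected drop rate.
Measurement = namedtuple("Measurement", "drop_percent response_time_ms errors")


def classify(m, max_response_time_ms):
    """Label a run against an assumed response-time specification."""
    if m.errors > 0:
        return "correctness fault"   # errors dominate: timing is less meaningful
    if m.response_time_ms > max_response_time_ms:
        return "performance fault"   # correct, but slower than the spec allows
    return "ok"


def first_correctness_fault(measurements):
    """Smallest drop percentage at which any error was observed, or None."""
    failing = [m.drop_percent for m in measurements if m.errors > 0]
    return min(failing) if failing else None
```

Running `first_correctness_fault` per operation, rather than on total execution time, keeps the per-operation transition points visible.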

Results (2) - Impatient Client
SpecSFS97; Retry once!
1. Throughput decreases linearly as the dropping probability increases.
2. Throughput drops manifest under heavy loads.
3. Response time doesn't change as much!
4. Ethernet dropping less harmful.

Summary
- Modern computer system design needs a better fault-tolerance model.
- Using fault-injection to characterize NFS fail-stutter behavior.
- Preliminary observations address some of the fail-stutter issues
  - How to separate different types of faults?
  - Suggest that we can extract performance specification by fault-injection and probing.

Future Work
- Very-short-term
  - More classes of faults
  - More realistic fault injection
- Short-term
  - Separate "interference" and performance fault
  - Extract/refine performance specifications
  - Performance-fault diagnosis
- Long-term
  - Detailed model for a specific workload / system
  - System support for fail-stutter failures