Fail-stutter Behavior Characterization of NFS

Jichuan Chang, CS736 Final Project, UW-Madison, December 13, 2002

Motivation
- We want systems to be very fast and available!
- Hard to achieve for modern computer systems:
  - complex interactions among components;
  - can’t assume everything is always working perfectly!
- We need a better fault model:
  - simpler than the Byzantine model;
  - richer than the fail-stop model;
  - fail-stutter fault tolerance [Remzi 01].
[Figure: fault-model state diagram. Fail-stop: Stable Performance -> (fault) -> Down. Fail-stutter: Stable Performance -> (performance fault) -> Low Performance -> (correctness fault) -> Down.]

Fail-stutter Issues
Separate performance faults from correctness faults
- What are performance faults?
  - Need a performance specification, but how to get the spec?
  - How to distinguish “interference” from a performance fault?
- What are correctness faults?
  - Correctness should be defined in an end-to-end manner.
- How to diagnose both types of faults?
  - Must observe how systems behave!
Exploit fail-stutter behavior
- Who should be notified about failures, when and how?
- System support: programming tools / runtime support
- Integration with existing systems: less intrusion

Our Approach
Case study: NFS fail-stutter characterization
- Fault-injection (vs. system monitoring)
- Performance measurement
- Simple, software-based test-bed
Interesting observations
- Different failed parts have different performance impact
- Different types of clients have different behaviors
  - Patient (keep retrying) vs. impatient (try other servers)
- Transition between performance and correctness faults
  - Can be determined proactively by fault-injection;
  - Performance spec could be application-specific.
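The patient and impatient client behaviors can be sketched in a few lines of Python. This is an illustrative sketch only, not the clients used in the study: `patient_request`, `impatient_request`, and the `send` callables are hypothetical names, each `send` is assumed to return True when the server replies, and the default of two attempts per server (one retry) is an assumption.

```python
def patient_request(send, max_attempts=None):
    """Patient client: keep retrying the same server.

    Returns the number of attempts needed, or None if max_attempts
    (a hypothetical safety cap) is reached without a reply.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        if send():
            return attempts
    return None


def impatient_request(servers, attempts_per_server=2):
    """Impatient client: give up on a server quickly and try the next one.

    `servers` is a list of send callables, one per server.
    Returns (server_index, total_attempts), or None if every server failed.
    """
    total = 0
    for index, send in enumerate(servers):
        for _ in range(attempts_per_server):
            total += 1
            if send():
                return index, total
    return None
```

With a patient client, a lossy path shows up as extra attempts (a performance fault); with an impatient client, the same loss can surface as a failover or an error.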

Experimental Settings
[Figure: test-bed diagram with NFS client (App), Click S/W router, NFS server, and storage system; several components are marked X.]
- Workloads: SpecSFS97, file (micro-benchmark).
- Data to collect: throughput, response time, errors.
- Faulty components: network, server, disk, bus, etc.
- Fault injection: network packet dropping
  - drop k% of Ethernet packets;
  - drop k% of IP packets coming from the server.
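Probabilistic dropping of k% of packets can be sketched as follows. This is a Python simulation under stated assumptions, not the Click configuration used in the test-bed: `make_dropper` and `forward` are hypothetical names, and each packet is assumed to be dropped independently with the same probability.

```python
import random


def make_dropper(drop_percent, seed=0):
    """Return a function that decides, per packet, whether to drop it."""
    if not 0 <= drop_percent <= 100:
        raise ValueError("drop_percent must be between 0 and 100")
    rng = random.Random(seed)          # seeded so runs are repeatable
    threshold = drop_percent / 100.0

    def should_drop():
        # random() is in [0, 1), so 0% never drops and 100% always drops.
        return rng.random() < threshold

    return should_drop


def forward(packets, drop_percent, seed=0):
    """Pass packets through the simulated lossy link; return the survivors."""
    should_drop = make_dropper(drop_percent, seed)
    return [p for p in packets if not should_drop()]
```

For example, `forward(range(1000), 10)` keeps roughly 900 of the 1000 packets; the exact count depends on the seed.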

Results (1) - Patient Client
[Figure: plots omitted; X = error occurred.]
1. Performance degradation scales with drop probability.
2. Ethernet dropping is less harmful than IP dropping.
3. Performance data is less meaningful when errors occur.
4. Different operations switch to correctness faults at different points (e.g. 5%, 15%, 20%). Total execution time can hide such differences.
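One way to locate the point where an operation switches from a performance fault to a correctness fault is to label each measurement in a drop-rate sweep. The sketch below is a hypothetical illustration, not the analysis used in the study: `classify`, `label_sweep`, and `switch_point` are made-up names, `min_throughput` stands in for a performance specification, and treating any error as a correctness fault is an assumption.

```python
def classify(throughput, errors, min_throughput):
    """Label one measurement against a (hypothetical) performance spec."""
    if errors > 0:
        return "correctness fault"   # assumption: any error counts
    if throughput < min_throughput:
        return "performance fault"
    return "ok"


def label_sweep(sweep, min_throughput):
    """sweep: iterable of (drop_percent, throughput, errors) tuples.

    Returns a list of (drop_percent, label), ordered by drop_percent.
    """
    return [
        (drop, classify(throughput, errors, min_throughput))
        for drop, throughput, errors in sorted(sweep)
    ]


def switch_point(labels):
    """Lowest drop_percent labeled as a correctness fault, or None."""
    for drop, label in labels:
        if label == "correctness fault":
            return drop
    return None
```

Running this per operation rather than on total execution time keeps the per-operation switch points visible.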

Results (2) - Impatient Client
[Figure: SpecSFS97 plots omitted; annotation: "Retry once!"]
1. Throughput decreases linearly as the dropping probability increases.
2. Throughput drops manifest under heavy loads.
3. Response time doesn’t change as much!
4. Ethernet dropping is less harmful.
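A claim of linear throughput decrease can be checked with an ordinary least-squares fit of throughput against drop probability. The sketch below is a generic pure-Python fit offered as an illustration, not the analysis used in the study; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared). An r_squared close to 1
    indicates that a straight line describes the data well.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

Here `xs` would be the drop probabilities and `ys` the measured throughputs at each one.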

Summary
- Modern computer system design needs a better fault-tolerance model.
- We use fault-injection to characterize NFS fail-stutter behavior.
- Preliminary observations address some of the fail-stutter issues:
  - how to separate different types of faults;
  - they suggest that we can extract a performance specification by fault-injection and probing.

Future Work
Very-short-term
- More classes of faults
- More realistic fault injection
Short-term
- Separate “interference” and performance fault
- Extract/refine performance specifications
- Performance-fault diagnosis
Long-term
- Detailed model for a specific workload / system
- System support for fail-stutter failures
