1
Global-scale systems that know when they are behaving badly
NSF workshop on grand challenges in distributed systems
Jeff Mogul, HP Labs, Palo Alto
September 2005
2
Distributed systems: broken by definition?
Leslie Lamport said (more or less):
  You know you have a distributed system when the crash of a computer you have never heard of stops you from getting any work done.
A more accurate definition (?):
  You know you have a distributed system when a computer you have never heard of stops you from getting any work done.
Grand challenge: make this definition obsolete
2/4/2019
3
The problem
Real-world enterprise-scale distributed systems almost inevitably misbehave
  Even when no component “fails” per se
  Even without malicious interference
  “Correct” results aren’t always enough
Correct-by-construction does not solve the problem
  It’s too hard (but keep trying, folks!)
  Specifications are never really right
  System-scale “emergent” misbehavior happens
4
Examples
Examples from simple systems:
  Accidental synchronization of routing protocol updates (Floyd and Jacobson, 1993)
  Interaction between TCP’s delayed-ACK and Nagle algorithms causes 200 ms delays
Examples from more complex systems:
  Sprite FS server “recovery storms” (Baker, 1991)
    Clients gang up on server during recovery
  Over-eager load-balancer “failure” timeout
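A minimal sketch of one commonly cited mitigation for the synchronization and recovery-storm examples above: randomizing timers so that independent nodes do not all act at the same instant. The function names, the uniform jitter range, and the parameters are illustrative assumptions, not taken from the cited work.

```python
import random


def jittered_interval(base_seconds, jitter_fraction, rng):
    """Draw a delay uniformly from base*(1 - f) .. base*(1 + f).

    jitter_fraction (f) is a hypothetical tuning knob: 0 means every
    node uses exactly the same delay, which invites synchronization.
    """
    if not 0.0 <= jitter_fraction <= 1.0:
        raise ValueError("jitter_fraction must be in [0, 1]")
    low = base_seconds * (1.0 - jitter_fraction)
    high = base_seconds * (1.0 + jitter_fraction)
    return rng.uniform(low, high)


def reconnect_delays(num_clients, base_seconds, jitter_fraction, seed=0):
    """Delays that num_clients would wait before contacting a recovering server."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [
        jittered_interval(base_seconds, jitter_fraction, rng)
        for _ in range(num_clients)
    ]
```

With `jitter_fraction=0.0` every client retries at the same moment, the "gang up on the server" pattern. A nonzero fraction spreads the retries over an interval, at the cost of some clients waiting longer.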
5
Detect system-wide misbehavior
A prerequisite for diagnosis and repair
Assume system-wide failures will happen
  Minimize undetected misbehavior
Design systems that recognize their own failures
  Continuous self-monitoring designed-in from start
  Not “extra cost”; this is part of the spec
  Synthesize global view from local views, probably
  Online misbehavior-detectors
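One way to picture "synthesize global view from local views" is the sketch below. Each node reports a local measurement that looks healthy on its own, and a detector combines the reports to spot a system-wide problem that no single node can see. The `LocalView` record, its fields, and the shared-capacity check are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LocalView:
    node: str
    requests_per_sec: float  # load this node sends to a shared server
    local_limit: float       # what the node itself considers acceptable


def locally_healthy(view):
    """A node's own check: it only knows about its own load."""
    return view.requests_per_sec <= view.local_limit


def global_misbehavior(views, shared_capacity):
    """Combine local reports; flag overload of the shared resource.

    Returns (is_misbehaving, total_load).
    """
    total = sum(v.requests_per_sec for v in views)
    return total > shared_capacity, total
```

For example, three clients each sending 40 requests per second against a local limit of 50 all pass `locally_healthy`, yet together they exceed a shared capacity of 100. No component has "failed", but the system as a whole is misbehaving.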
6
What would it take?
Several possible approaches (use them all!):
Tools to express and check “expectations”
  Separate from code
  Not necessarily formal or correct themselves
  E.g., “never more than log(N)+2 hops in DHT lookup”
Detectors for generic kinds of misbehavior
  Thrashing, deadlock, oscillation, resource leaks, etc.
  No a priori knowledge of application or implementation
Global-behavior visualization for human operators
  Must balance detail vs. comprehensibility
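The DHT expectation above can be sketched as a small checker that runs over observed lookups, kept separate from the DHT code itself. The slide does not give the base of the logarithm, so base 2 is an assumption here. The function names and the `(lookup_id, hops)` record shape are also hypothetical.

```python
import math


def max_expected_hops(num_nodes):
    """Bound from the expectation: log(N) + 2, assuming log base 2."""
    return math.log2(num_nodes) + 2


def violations(observations, num_nodes):
    """Return the (lookup_id, hops) records that exceed the bound.

    observations: iterable of (lookup_id, hops) pairs gathered from
    instrumentation; how they are collected is outside this sketch.
    """
    bound = max_expected_hops(num_nodes)
    return [(lookup_id, hops) for lookup_id, hops in observations if hops > bound]
```

With 1024 nodes the bound is 12 hops, so a lookup that took 13 hops is reported while one that took 12 is not. The expectation need not be formally correct to be useful. A reported violation is a prompt to investigate, not proof of a bug.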
7
Research issues
How much instrumentation is enough?
  And how to move, process the resulting data
Designing a language to express expectations
  Work in progress: Patrick Reynolds + others
Designing generic detectors for system-wide failure
  Some work from Stanford/Berkeley Pinpoint project
Balancing false alarm rate vs. non-detection
Root-cause inference
  Doesn’t have to be perfect to be useful
Don’t ignore system management in MREFC
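The trade-off between false alarms and non-detection can be made concrete with the sketch below. It assumes a detector that emits a numeric anomaly score and raises an alarm at or above a threshold. The score scale, the labels, and the function name are illustrative assumptions, not part of any system mentioned above.

```python
def detector_rates(scores, labels, threshold):
    """Return (false_alarm_rate, miss_rate) for one threshold.

    scores: anomaly scores, one per observation window.
    labels: True where the window really contained misbehavior.
    An alarm is raised when score >= threshold.
    """
    pairs = list(zip(scores, labels))
    false_alarms = sum(1 for s, bad in pairs if s >= threshold and not bad)
    misses = sum(1 for s, bad in pairs if s < threshold and bad)
    normal_total = sum(1 for _, bad in pairs if not bad)
    bad_total = sum(1 for _, bad in pairs if bad)
    false_alarm_rate = false_alarms / normal_total if normal_total else 0.0
    miss_rate = misses / bad_total if bad_total else 0.0
    return false_alarm_rate, miss_rate
```

On made-up data with scores `[0.1, 0.5, 0.4, 0.9]` and labels `[False, False, True, True]`, a threshold of 0.3 catches every bad window but alarms on half the normal ones. A threshold of 0.6 raises no false alarms but misses half the bad windows. Sweeping the threshold makes the balance visible.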