Global-scale systems that know when they are behaving badly
NSF workshop on grand challenges in distributed systems
Jeff Mogul, HP Labs, Palo Alto
September 2005

Distributed systems: broken by definition?
Leslie Lamport said (more or less):
  You know you have a distributed system when the crash of a computer you have never heard of stops you from getting any work done.
A more accurate definition (?):
  You know you have a distributed system when a computer you have never heard of stops you from getting any work done.
Grand challenge: make this definition obsolete

The problem
Real-world enterprise-scale distributed systems almost inevitably misbehave
  Even when no component “fails” per se
  Even without malicious interference
  “Correct” results aren’t always enough
Correct-by-construction does not solve the problem
  It’s too hard (but keep trying, folks!)
  Specifications are never really right
  System-scale “emergent” misbehavior happens

Examples
Examples from simple systems:
  Accidental synchronization of routing protocol updates (Floyd and Jacobson, 1993)
  Interaction between TCP’s delayed-ACK and Nagle algorithms causes 200 ms delays
Examples from more complex systems:
  Sprite FS server “recovery storms” (Baker, 1991)
    Clients gang up on server during recovery
  Over-eager load-balancer “failure” timeout

Detect system-wide misbehavior
A prerequisite for diagnosis and repair
Assume system-wide failures will happen
  Minimize undetected misbehavior
Design systems that recognize their own failures
  Continuous self-monitoring designed-in from start
  Not “extra cost”; this is part of the spec
  Synthesize global view from local views, probably
  Online misbehavior-detectors

What would it take?
Several possible approaches (use them all!):
  Tools to express and check “expectations”
    Separate from code
    Not necessarily formal or correct themselves
    E.g., “never more than log(N)+2 hops in DHT lookup”
  Detectors for generic kinds of misbehavior
    Thrashing, deadlock, oscillation, resource leaks, etc.
    No a priori knowledge of application or implementation
  Global-behavior visualization for human operators
    Must balance detail vs. comprehensibility
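
As a rough illustration of an "expectation" kept separate from the code it checks, the sketch below tests the DHT hop bound against a trace of lookup records. Everything here is hypothetical: the `Expectation` class, the `check` function, the record fields (`lookup_id`, `hops`, `n_nodes`) and the trace data are invented for the example, and the base-2 logarithm is an assumption, since the slide says only log(N)+2.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Expectation:
    """A named predicate over one observed event (hypothetical structure)."""
    name: str
    holds: Callable[[dict], bool]


def max_dht_hops(n_nodes: int) -> float:
    # Assumption: base-2 logarithm; the slide only says log(N)+2.
    return math.log2(n_nodes) + 2


def check(expectations: Iterable[Expectation], events: Iterable[dict]) -> List[str]:
    """Return one message per (event, expectation) pair that is violated."""
    expectations = list(expectations)
    violations = []
    for event in events:
        for exp in expectations:
            if not exp.holds(event):
                violations.append(f"{exp.name} violated by {event}")
    return violations


# The expectation lives apart from the DHT implementation itself.
hop_bound = Expectation(
    name="dht-lookup-hops",
    holds=lambda e: e["hops"] <= max_dht_hops(e["n_nodes"]),
)

# Hypothetical trace records, e.g. parsed from lookup logs.
trace = [
    {"lookup_id": 1, "hops": 9, "n_nodes": 1024},
    {"lookup_id": 2, "hops": 15, "n_nodes": 1024},
]

violations = check([hop_bound], trace)
for message in violations:
    print(message)
```

With 1024 nodes the assumed bound is 12 hops, so only the second lookup is reported. A checker like this needs no knowledge of how the lookup is implemented, only of what the trace records contain.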

Research issues
How much instrumentation is enough?
  And how to move and process the resulting data
Designing a language to express expectations
  Work in progress: Patrick Reynolds + others
Designing generic detectors for system-wide failure
  Some work from Stanford/Berkeley Pinpoint project
Balancing false alarm rate vs. non-detection
Root-cause inference
  Doesn’t have to be perfect to be useful
Don’t ignore system management in MREFC
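
The trade-off between false alarms and non-detection can be made concrete with a small threshold sweep. This is only a sketch: it assumes a detector that emits a numeric "misbehavior score" per observation window and raises an alarm when the score reaches a threshold. The `rates` function, the scores and the ground-truth labels are all made up for illustration.

```python
from typing import List, Tuple


def rates(scores: List[float], labels: List[bool], threshold: float) -> Tuple[float, float]:
    """Return (false_alarm_rate, miss_rate) for an alarm that fires when score >= threshold.

    labels[i] is True when window i really was misbehaving.
    Assumes at least one misbehaving and one healthy window are present.
    """
    pairs = list(zip(scores, labels))
    false_alarms = sum(1 for s, bad in pairs if s >= threshold and not bad)
    misses = sum(1 for s, bad in pairs if s < threshold and bad)
    n_good = sum(1 for bad in labels if not bad)
    n_bad = sum(1 for bad in labels if bad)
    return false_alarms / n_good, misses / n_bad


# Hypothetical detector scores and ground truth for eight windows.
scores = [0.1, 0.2, 0.4, 0.6, 0.5, 0.7, 0.9, 0.3]
labels = [False, False, False, False, True, True, True, True]

# Raising the threshold trades false alarms for missed detections.
sweep = {t: rates(scores, labels, t) for t in (0.25, 0.45, 0.65)}
for threshold, (false_alarm_rate, miss_rate) in sweep.items():
    print(f"threshold={threshold}: false alarms={false_alarm_rate}, misses={miss_rate}")
```

On this toy data the lowest threshold catches every misbehaving window but flags half of the healthy ones, and the highest threshold does the opposite. No setting removes both kinds of error, which is the balance the slide refers to.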
