Global-scale systems that know when they are behaving badly. NSF workshop on grand challenges in distributed systems. Jeff Mogul, HP.

Presentation transcript:

Global-scale systems that know when they are behaving badly
NSF workshop on grand challenges in distributed systems
Jeff Mogul, HP Labs, Palo Alto
Jeff.Mogul@hp.com
September, 2005

Distributed systems: broken by definition?
- Leslie Lamport said (more or less): You know you have a distributed system when the crash of a computer you have never heard of stops you from getting any work done.
- A more accurate definition (?): You know you have a distributed system when a computer you have never heard of stops you from getting any work done.
- Grand challenge: make this definition obsolete
2/4/2019

The problem
- Real-world enterprise-scale distributed systems almost inevitably misbehave
  - Even when no component “fails” per se
  - Even without malicious interference
  - “Correct” results aren’t always enough
- Correct-by-construction does not solve the problem
  - It’s too hard (but keep trying, folks!)
  - Specifications are never really right
  - System-scale “emergent” misbehavior happens

Examples
- Examples from simple systems:
  - Accidental synchronization of routing protocol updates (Floyd and Jacobson, 1993)
  - Interaction between TCP’s delayed-ACK and Nagle algorithms causes 200 ms delays
- Examples from more complex systems:
  - Sprite FS server “recovery storms” (Baker, 1991)
    - Clients gang up on server during recovery
  - Over-eager load-balancer “failure” timeout

Detect system-wide misbehavior
- A prerequisite for diagnosis and repair
- Assume system-wide failures will happen
  - Minimize undetected misbehavior
- Design systems that recognize their own failures
  - Continuous self-monitoring designed-in from start
  - Not “extra cost”; this is part of the spec
  - Synthesize global view from local views, probably
  - Online misbehavior-detectors
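One way to picture "synthesize global view from local views" is a sketch in which each node reports a few local counters and a detector looks at the aggregate. Everything here is an assumption for illustration: the `LocalView` record, the retry-fraction metric, and both thresholds are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class LocalView:
    """Counters one node reports about itself (hypothetical record)."""
    node: str
    requests: int
    retries: int


def global_retry_fraction(views):
    """Combine all local views into one system-wide retry fraction."""
    total_requests = sum(v.requests for v in views)
    total_retries = sum(v.retries for v in views)
    return total_retries / total_requests if total_requests else 0.0


def detect_misbehavior(views, node_threshold=0.5, global_threshold=0.2):
    """Return (nodes that look failed locally, system-wide alarm flag).

    Both thresholds are illustrative assumptions, not tuned values.
    """
    failed_nodes = [
        v.node for v in views
        if v.requests and v.retries / v.requests > node_threshold
    ]
    system_wide = global_retry_fraction(views) > global_threshold
    return failed_nodes, system_wide


# Synthetic data: no single node crosses its own threshold,
# but the system as a whole is retrying 30% of its requests.
views = [
    LocalView("n1", 100, 30),
    LocalView("n2", 100, 25),
    LocalView("n3", 100, 35),
    LocalView("n4", 100, 30),
]
failed_nodes, system_wide = detect_misbehavior(views)
print(failed_nodes, system_wide)
```

In this synthetic case the per-node check reports nothing while the aggregate check raises an alarm, which is the situation the slide describes: misbehavior even when no component "fails" per se.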

What would it take?
- Several possible approaches (use them all!):
  - Tools to express and check “expectations”
    - Separate from code
    - Not necessarily formal or correct themselves
    - E.g., “never more than log(N)+2 hops in DHT lookup”
  - Detectors for generic kinds of misbehavior
    - Thrashing, deadlock, oscillation, resource leaks, etc.
    - No a priori knowledge of application or implementation
  - Global-behavior visualization for human operators
    - Must balance detail vs. comprehensibility
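The DHT expectation quoted on this slide can be checked against observed lookups with very little machinery. The following is a minimal sketch, not the expectation language the talk refers to: the `LookupTrace` record and the function names are hypothetical, and the logarithm base (2) is an assumption because the slide does not state it.

```python
import math
from dataclasses import dataclass


@dataclass
class LookupTrace:
    """One observed DHT lookup (hypothetical trace record)."""
    key: str
    hops: int


def max_expected_hops(n_nodes):
    """Upper bound from the slide: log(N) + 2 hops.

    Base 2 is an assumption; the slide does not give the base.
    """
    return math.log2(n_nodes) + 2


def check_hop_expectation(traces, n_nodes):
    """Return the traces whose hop count exceeds the expectation."""
    limit = max_expected_hops(n_nodes)
    return [t for t in traces if t.hops > limit]


# Synthetic traces for a 1024-node DHT, where the bound is 12 hops.
traces = [LookupTrace("a", 5), LookupTrace("b", 13), LookupTrace("c", 12)]
violations = check_hop_expectation(traces, n_nodes=1024)
for v in violations:
    print(f"expectation violated: key={v.key} hops={v.hops}")
```

The check lives outside the DHT code and only consumes traces, in the spirit of the slide's "separate from code" point.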

Research issues
- How much instrumentation is enough?
  - And how to move, process the resulting data
- Designing a language to express expectations
  - Work in progress: Patrick Reynolds + others
- Designing generic detectors for system-wide failure
  - Some work from Stanford/Berkeley Pinpoint project
  - Balancing false alarm rate vs. non-detection
- Root-cause inference
  - Doesn’t have to be perfect to be useful
- Don’t ignore system management in MREFC
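The tension between false alarm rate and non-detection can be made concrete with a threshold detector. This is a sketch under stated assumptions: the anomaly scores and ground-truth labels below are synthetic, and `alarm_rates` is a hypothetical helper, not part of Pinpoint or any other system named in the talk.

```python
def alarm_rates(scores, labels, threshold):
    """Return (false_alarm_rate, miss_rate) for a simple threshold detector.

    scores: anomaly score per observation interval (higher = more suspicious)
    labels: True if the system really was misbehaving in that interval
    An alarm is raised whenever score > threshold.
    """
    false_alarms = sum(
        1 for s, bad in zip(scores, labels) if s > threshold and not bad
    )
    misses = sum(
        1 for s, bad in zip(scores, labels) if s <= threshold and bad
    )
    n_good = sum(1 for bad in labels if not bad)
    n_bad = sum(1 for bad in labels if bad)
    false_alarm_rate = false_alarms / n_good if n_good else 0.0
    miss_rate = misses / n_bad if n_bad else 0.0
    return false_alarm_rate, miss_rate


# Synthetic scores and labels, for illustration only.
scores = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
labels = [False, False, False, True, False, True]

for threshold in (0.3, 0.5, 0.8):
    far, miss = alarm_rates(scores, labels, threshold)
    print(f"threshold={threshold}: false alarms={far:.2f}, misses={miss:.2f}")
```

On this data, raising the threshold drives false alarms toward zero but starts missing real misbehavior, so the threshold has to be chosen by weighing the cost of each kind of error.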