1
Emergent (Mis)behavior vs. Complex Software Systems
Jeff Mogul
HP Labs – Palo Alto
April 2006
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
2
Emergent behavior?
Ants are dumb
Anthills are “smart”
The global behavior of the anthill emerges from the local behaviors of the ants
−The individual ants don’t know what the global behavior is supposed to be
3
Opening day on the Millennium Footbridge
Opening day (10 June 2000):
−“unexpected lateral vibrations occurred”
−“a significant number of pedestrians [had] difficulty walking”
−The bridge was closed; the engineers got back to work
They had already done very careful modelling of a novel design
What went wrong?
−People on a swaying surface tend to synchronize their footsteps to the swaying, even if the initial amplitude is small
−Bridge’s natural frequency was close to the frequency of normal footsteps
−This effect was unknown in the engineering literature
Novel bridge design + unusual pedestrian-only load
Once the problem was understood, modelling and retrofit were fairly straightforward
4
Why is that bridge interesting to us?
People have been designing bridges for millennia
−Civil engineering is a well-regulated profession
−Lots of experience with unexpected dynamic failures
−Lots of computer modelling expertise
But the engineers still got it wrong: why?
Answer: emergent misbehavior
−The system’s behavior emerged – it wasn’t easy to predict
  Particularly, not from understanding of individual “parts”
−And the result was unexpected and bad
If these engineers got it wrong, what about us?
−Computer systems are worse than bridges!
5
The importance of emergent misbehavior in computer systems
Much past focus has been on:
  Fault-tolerant systems
  Correctness-by-construction
Both are valuable, but …
1. System-wide failures not always caused by “faults”
2. Modern systems are too complex to understand
3. Performance matters!
All three issues can result from emergent misbehavior
Goals of this talk:
  Illustrate the scope and nature of the problem
  Propose a research agenda
6
What this talk is NOT about
Dealing with malicious behavior
Game theory and incentives for people
Telling anyone that their approach is wrong
−We still need fault tolerance, program verification, correct-by-construction techniques, etc.!
Improving peak (best-case) system performance
This talk is 100% uncontaminated by:
−Implementation or architecture
−Experiments or results
7
Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work
8
Examples of emergent misbehavior
Examples can be found in:
Non-computer technology
−Millennium Footbridge (London); Traffic jams
Computer hardware
−Vibrations in large disk arrays
Networking
−Ethernet capture effect; Router synchronization; BGP Route flap damping; TCP’s Nagle algorithm
Distributed systems and operating systems
−Misconfigured load balancer; Herd behavior; Priority inversion in the Mars Pathfinder
9
Examples of emergent misbehavior
Examples described in this talk:
Non-computer technology
−Millennium Footbridge (London); Traffic jams
Computer hardware
−Vibrations in large disk arrays
Networking
−Ethernet capture effect; Router synchronization; BGP Route flap damping; TCP’s Nagle algorithm
Distributed systems and operating systems
−Misconfigured load balancer; Herd behavior; Priority inversion in the Mars Pathfinder
10
Ethernet Capture Effect: an example scenario
Assume both hosts have full transmit queues
Host A decides to transmit
Host B decides to transmit
−Host A, count = 1, flips “backoff coin” = 0
−Host B, count = 1, flips “backoff coin” = 1
−Host A wins, transmits
Idle
Host A decides to transmit
Host B decides to transmit
−Host A, count = 1, flips “backoff coin” = 0
−Host B, count = 2, flips “backoff coin” = 01
−Host A wins, transmits
… ad infinitum
B’s disadvantage doubles on each round
11
Ethernet Capture Effect (II)
No component here has failed
Problem didn’t show up until chips met the spec
−Older chips were too slow to send back-to-back packets
−The extra delay left B a chance to sneak in
Apparently was not caught in original modelling
Problem doesn’t require large scale to show up
−In fact, adding more hosts tends to blur the picture
Solution involved adding extra delay
−“Don’t send back-to-back if you just won a collision”
−[Ramakrishnan and Yang, 1994]
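The capture scenario can be reproduced with a very small simulation. The sketch below is an illustration, not the model used in the talk or in [Ramakrishnan and Yang, 1994]. It assumes two saturated hosts that collide at the start of every round, a backoff drawn uniformly from a window that doubles with each collision, and a winner whose collision counter resets for its next frame. The function name `simulate_capture` and the two constants are hypothetical; the constants are set to the usual Ethernet limits.

```python
import random

# Assumed limits for truncated binary exponential backoff.
MAX_BACKOFF_EXP = 10   # backoff window stops growing after 10 collisions
MAX_ATTEMPTS = 16      # a frame is discarded after 16 failed attempts


def simulate_capture(rounds=10_000, seed=0):
    """Simulate two saturated hosts contending for the channel.

    Returns (wins per host, fraction of rounds won by the previous winner).
    Simplification: both hosts collide at the start of every round.
    """
    rng = random.Random(seed)
    collisions = {"A": 0, "B": 0}   # collision count for each host's current frame
    wins = {"A": 0, "B": 0}
    repeat_wins = 0
    last_winner = None

    for _ in range(rounds):
        winner = None
        while winner is None:
            draws = {}
            for host in collisions:
                if collisions[host] >= MAX_ATTEMPTS:
                    collisions[host] = 0          # frame dropped, start a new one
                collisions[host] += 1
                window = 2 ** min(collisions[host], MAX_BACKOFF_EXP)
                draws[host] = rng.randrange(window)
            if draws["A"] != draws["B"]:          # equal draws collide again
                winner = min(draws, key=draws.get)
        wins[winner] += 1
        if winner == last_winner:
            repeat_wins += 1
        last_winner = winner
        collisions[winner] = 0   # winner starts its next frame with a fresh counter
    return wins, repeat_wins / rounds
```

Under these assumptions the host that won the previous round wins the next one far more often than half the time, even though both hosts follow the same rule. Calling `simulate_capture()` with different seeds shows long runs by one host followed by long runs by the other.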
12
A misconfigured load balancer
Load balancer with two jobs:
−Spread load between servers
−Detect server failure via timeout
System stops responding reliably
−After working fine for months
−Load balancer repeatedly declares each server dead, in alternation
Diagnosis:
−DBs got slower as they got fuller
−Load balancer timeout was too low
−Slow app servers appeared to have “failed”, causing load balancer to switch back and forth
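A toy model makes the diagnosis concrete. The sketch below is hypothetical throughout: the talk gives no numbers, so the latency formula, the server names, and the functions `probe_latency_ms` and `run_balancer` are assumptions chosen only to show the mechanism. Every app server depends on the same database, so once the database is full enough that responses exceed the timeout, whichever server is probed looks dead.

```python
def probe_latency_ms(db_rows, base_ms=50.0, ms_per_million_rows=40.0):
    """Assumed latency model: response time grows linearly with database size."""
    return base_ms + ms_per_million_rows * db_rows / 1_000_000


def run_balancer(db_rows, timeout_ms, steps=6):
    """Probe the active server each step and fail over if the probe times out.

    Returns a list of (server, status) pairs, one per step.
    """
    servers = ["app1", "app2"]   # hypothetical server names
    active = 0
    history = []
    for _ in range(steps):
        # Both servers share the same slow database, so latency does not
        # depend on which server is active.
        latency = probe_latency_ms(db_rows)
        if latency > timeout_ms:
            history.append((servers[active], "dead"))
            active = 1 - active   # switch to the other server
        else:
            history.append((servers[active], "ok"))
    return history
```

With a small database, `run_balancer(1_000_000, timeout_ms=200)` stays on one server. With a larger one, `run_balancer(5_000_000, timeout_ms=200)` declares each server dead in alternation although neither has failed, which matches the symptom described above.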
13
Herd behavior in a distributed system
Planetary-Scale Event Prop & Routing System
−(a.k.a. PsEPR) [Brett et al., WORLDS 2005]
−Runs on PlanetLab
−Aims for very large scale
Requires clients to be distributed evenly among servers
Clients keep ordered preference lists of servers
−Prefer “nearby” servers (based on all-pairs-ping)
−On server failure:
  Demote failed server
  Try to connect to top server on list
14
PsEPR system structures
[Figure: two panels, “Desirable” and “Undesirable”]
15
Herd behavior in a distributed system: what went wrong with PsEPR
Initially, clients generally balanced among servers
As servers/links failed:
−Same servers tended to look bad to most clients
−So, client preference lists tended to converge
−So, clients tended to connect to a small subset of servers
Clients mostly converged on a few servers:
−These servers became overloaded
−Server-local response-time monitors caused restarts
  Causing further convergence of client preference lists
−Clients all moved to the next server on their list
  At rate governed by server restart times
Fix: adjust ordering by success count + random #
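The fix can be illustrated with a short sketch. It is one possible reading of “success count + random #”, not PsEPR’s actual code: each client scores a server by its observed success count plus a per-client random term, then connects to the top-scoring server. The names `rank_servers`, `top_choices`, and the `jitter` parameter are hypothetical.

```python
import random
from collections import Counter


def rank_servers(success_counts, rng, jitter=0.0):
    """Order servers by success count plus a random term in [0, jitter]."""
    scores = {s: n + rng.uniform(0.0, jitter) for s, n in success_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)


def top_choices(num_clients, success_counts, jitter, seed=0):
    """Count how many clients pick each server as their first choice.

    Assumes every client has observed the same success counts, which is the
    converged state described above.
    """
    rng = random.Random(seed)
    picks = Counter()
    for _ in range(num_clients):
        picks[rank_servers(success_counts, rng, jitter)[0]] += 1
    return picks
```

With `jitter=0.0` and identical observations, every client chooses the same server, which is the herd. With a jitter comparable to the spread in success counts, first choices are spread across several servers while servers with better records are still favored.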
16
Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work
17
One definition of emergent behavior
Emergent behavior is that which cannot be predicted through analysis at any level simpler than that of the system as a whole.
−George Dyson (1998)
Emergent misbehavior is just emergent behavior that we don’t want
18
Distinguishing between emergent and “normal” misbehavior
Misbehavior that is not emergent:
−Single-component bugs that break the whole system
−Inherently inefficient algorithms
−Insufficient resources
−Much work on computer systems reliability
  Focuses on handling faults
  Aims for “correct by construction”
Emergent misbehavior tends to be:
−Global misbehavior arising from “correct” local behaviors
−Related to the composition of independent parts
−Related to delays and to decentralized control
It might not ever be possible to be definitive
19
Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work
20
Outline of a proposed research agenda
1. Create a taxonomy of emergent misbehaviors
  To guide the rest of the agenda
2. Create a taxonomy of frequent causes
  Generalize when possible; tie back to taxonomy #1
3. Develop detection and diagnosis techniques
  Look for distinctive signatures from taxonomies
4. Develop prediction techniques
  For better prediction of performance and failures
5. Develop amelioration techniques
  System design tricks to avoid emergent misbehavior
6. Develop testing techniques
  Strategies for smoking out emergent misbehavior during testing
21
Taxonomy #1: kinds of emergent misbehavior
Thrashing
Unwanted synchronization
Unwanted oscillation or periodicity
Deadlock
Livelock
Phase change
Chaotic behavior
etc.
22
Taxonomy #2: Frequent causes of emergent misbehavior
Unexpected resource sharing
Massive scale
Decentralized control
Lack of composability
Misconfiguration
Unexpected inputs or loads
Communication delay
etc.
23
There’s a lot more work to do!
A little more discussion in the paper …
Hopefully, a few dissertations, from people with more energy than I have.
24
Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work
25
Visions of the future (large-scale and enterprise systems)
Automatic control of data centers and services
−Beyond “lights out” to “minimal human involvement”
−Feedback control of almost everything
Service-oriented computing
−Construction by composition of “services”
−Correctness by construction
−Loose coupling via networks
Declarative approaches
−“Models” for components and their composition
26
Visions of the future: ignoring emergent misbehavior?
Automatic control of data centers and services
−Feedback loops can lead to surprises
  Especially when several loops are working at cross purposes
Service-oriented computing
−Composition of dynamic behaviors could yield surprises
−Loose coupling via networks: adds latency
Declarative approaches
−Rule-based systems are hard to debug
−Less explicit control over dynamics than procedural style?
27
Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work
28
Related work
Lots of related work on good side of emergence
−E.g.: Dyson, Darwin Among the Machines (1998)
Non-computer work on misbehavior:
−Parunak & VanderBok (1997) “Managing emergent behavior in distributed control systems”
Computer systems work on emergent misbehavior:
−Term first(?) used by Ed Nisley (Dr. Dobb’s J., 2004)
−Steven Gribble (HotOS, 2001)
  Making systems more robust in the face of the unexpected
−National Research Council report: A Research Agenda for Networked Systems of Embedded Computers (2001)
29
Summary
We’ve already seen lots of emergent misbehavior
Trends could make things worse in the future
CS research on reliability has focussed on faults
We need to understand emergent misbehavior
We need ways to cope with it
A lot more detail in the paper
30
Advice for OSDI Authors
There will be no extensions to the deadline
Papers that violate the format requirements will be rejected.