Presentation transcript:

Emergent (Mis)behavior vs. Complex Software Systems
Jeff Mogul, HP Labs – Palo Alto
April 2006
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Emergent behavior?
Ants are dumb
Anthills are “smart”
The global behavior of the anthill emerges from the local behaviors of the ants
  − The individual ants don’t know what the global behavior is supposed to be

Opening day on the Millennium Footbridge
Opening day (10 June 2000):
  − “unexpected lateral vibrations occurred”
  − “a significant number of pedestrians [had] difficulty walking”
  − The bridge was closed; the engineers got back to work
They had already done very careful modelling of a novel design
What went wrong?
  − People on a swaying surface tend to synchronize their footsteps to the swaying, even if the initial amplitude is small
  − Bridge’s natural frequency was close to normal footsteps
  − This effect was unknown in engineering literature
    − Novel bridge design + unusual pedestrian-only load
Once the problem was understood, modelling and retrofit were fairly straightforward

Why is that bridge interesting to us?
People have been designing bridges for millennia
  − Civil engineering is a well-regulated profession
  − Lots of experience with unexpected dynamic failures
  − Lots of computer modelling expertise
But the engineers still got it wrong: why?
Answer: emergent misbehavior
  − The system’s behavior emerged – it wasn’t easy to predict
    − Particularly, not from understanding of individual “parts”
  − And the result was unexpected and bad
If these engineers got it wrong, what about us?
  − Computer systems are worse than bridges!

The importance of emergent misbehavior in computer systems
Much past focus has been on:
  − Fault-tolerant systems
  − Correctness-by-construction
Both are valuable, but …
  1. System-wide failures not always caused by “faults”
  2. Modern systems are too complex to understand
  3. Performance matters!
All three issues can result from emergent misbehavior
Goals of this talk:
  − Illustrate the scope and nature of the problem
  − Propose a research agenda

What this talk is NOT about
Dealing with malicious behavior
Game theory and incentives for people
Telling anyone that their approach is wrong
  − We still need fault tolerance, program verification, correct-by-construction techniques, etc.!
Improving peak (best-case) system performance
This talk is 100% uncontaminated by:
  − Implementation or architecture
  − Experiments or results

Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work

Examples of emergent misbehavior
Examples can be found in:
Non-computer technology
  − Millennium Footbridge (London); Traffic jams
Computer hardware
  − Vibrations in large disk arrays
Networking
  − Ethernet capture effect; Router synchronization; BGP Route flap damping; TCP’s Nagle algorithm
Distributed systems and operating systems
  − Misconfigured load balancer; Herd behavior; Priority inversion in the Mars Pathfinder

Examples of emergent misbehavior
Examples described in this talk:
Non-computer technology
  − Millennium Footbridge (London); Traffic jams
Computer hardware
  − Vibrations in large disk arrays
Networking
  − Ethernet capture effect; Router synchronization; BGP Route flap damping; TCP’s Nagle algorithm
Distributed systems and operating systems
  − Misconfigured load balancer; Herd behavior; Priority inversion in the Mars Pathfinder

Ethernet Capture Effect: an example scenario
Assume both hosts have full transmit queues
Host A decides to transmit; Host B decides to transmit
Host A, count = 1, flips “backoff coin” = 0
Host B, count = 1, flips “backoff coin” = 1
Host A wins, transmits
Idle
Host A decides to transmit; Host B decides to transmit
Host A, count = 1, flips “backoff coin” = 0
Host B, count = 2, flips “backoff coin” = 01
Host A wins, transmits
… ad infinitum
B’s disadvantage doubles on each round

Ethernet Capture Effect (II)
No component here has failed
Problem didn’t show up until chips met the spec
  − Older chips were too slow to send back-to-back packets
  − The extra delay left B a chance to sneak in
Apparently was not caught in original modelling
Problem doesn’t require large scale to show up
  − In fact, adding more hosts tends to blur the picture
Solution involved adding extra delay
  − “Don’t send back-to-back if you just won a collision”
  − [Ramakrishnan and Yang, 1994]
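
The capture scenario can be reproduced with a small simulation. The following Python sketch is not from the talk; it is a simplified two-host model of truncated binary exponential backoff in which the loser's leftover backoff time is ignored and both hosts always contend for the next slot. The function name `simulate` and the flag `loser_keeps_count` are hypothetical, and the `loser_keeps_count=False` mode is only a comparison control, not the fix described by Ramakrishnan and Yang.

```python
import random

MAX_EXPONENT = 10   # the backoff window stops growing after 10 collisions
MAX_ATTEMPTS = 16   # a frame is dropped after 16 collisions


def simulate(frames, seed=0, loser_keeps_count=True):
    """Return the fraction of frames won by the host that won the previous frame.

    Simplified model (an assumption, not the full Ethernet MAC): two hosts with
    full transmit queues collide on every frame, and the smaller backoff wins.
    """
    rng = random.Random(seed)
    counts = [0, 0]          # per-host collision count for its current frame
    last_winner = None
    repeats = 0
    for _ in range(frames):
        winner = None
        while winner is None:
            # Both hosts transmit at once and collide.
            for host in (0, 1):
                counts[host] += 1
                if counts[host] > MAX_ATTEMPTS:
                    counts[host] = 1   # frame dropped; next frame's first collision
            # Each host flips its "backoff coin": a slot in [0, 2^count - 1].
            slots = [rng.randrange(2 ** min(c, MAX_EXPONENT)) for c in counts]
            if slots[0] != slots[1]:
                winner = 0 if slots[0] < slots[1] else 1
        counts[winner] = 0             # the winner starts its next frame fresh
        if not loser_keeps_count:
            counts[1 - winner] = 0     # control case: no memory of past losses
        if winner == last_winner:
            repeats += 1
        last_winner = winner
    return repeats / (frames - 1)


if __name__ == "__main__":
    print("loser keeps its count:", simulate(20000, seed=1))
    print("control (both reset): ", simulate(20000, seed=1, loser_keeps_count=False))
```

In this model the previous winner should win again far more often than half the time when the loser keeps its collision count, while the control case stays near one half. Every host follows the specification, yet one host captures the channel.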

A misconfigured load balancer
Load balancer with two jobs:
  − Spread load between servers
  − Detect server failure via timeout
System stops responding reliably
  − After working fine for months
  − Load balancer repeatedly declares each server dead, in alternation
Diagnosis:
  − DBs got slower as they got fuller
  − Load balancer timeout was too low
  − Slow app servers appeared to have “failed”, causing load balancer to switch back and forth
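
A toy model makes the diagnosis concrete. The sketch below is an illustration only: the linear latency model, the growth rate, and every number in it are assumptions chosen for the example, and `simulate_balancer` is a hypothetical name. It shows a fixed timeout that works for months and then starts flapping once the database has grown enough, although no server has actually failed.

```python
def simulate_balancer(days, rows_per_day, timeout_ms,
                      base_ms=20.0, ms_per_million_rows=15.0):
    """Return a list of (day, active_server, timed_out) for a two-server setup.

    Assumed model: response latency grows linearly with database size, and
    both servers share the same database, so both are equally slow.
    """
    active = 0
    history = []
    for day in range(days):
        rows = day * rows_per_day
        latency_ms = base_ms + ms_per_million_rows * rows / 1_000_000
        timed_out = latency_ms > timeout_ms
        if timed_out:
            # The server is slow, not dead, but a timeout cannot tell the
            # difference, so the balancer fails over to the other server.
            active = 1 - active
        history.append((day, active, timed_out))
    return history


if __name__ == "__main__":
    history = simulate_balancer(days=120, rows_per_day=50_000, timeout_ms=100.0)
    first_flap = next(day for day, _, timed_out in history if timed_out)
    print("first spurious failover on day", first_flap)
```

With these assumed numbers the first spurious failover happens on day 107, and from then on the balancer alternates between the two servers on every check.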

Herd behavior in a distributed system
Planetary-Scale Event Prop & Routing System
  − (a.k.a. PsEPR) [Brett et al., WORLDS 2005]
  − Runs on PlanetLab
  − Aims for very large scale
    − Requires clients to be distributed evenly among servers
Clients keep ordered preference lists of servers
  − Prefer “nearby” servers (based on all-pairs-ping)
  − On server failure:
    − Demote failed server
    − Try to connect to top server on list

PsEPR system structures
[Figure: two panels, “Desirable” and “Undesirable”]

Herd behavior in a distributed system: what went wrong with PsEPR
Initially, clients generally balanced among servers
As servers/links failed:
  − Same servers tended to look bad to most clients
  − So, client preference lists tended to converge
  − So, clients tended to connect to a small subset of servers
Clients mostly converged on a few servers:
  − These servers became overloaded
  − Server-local response-time monitors caused restarts
    − Causing further convergence of client preference lists
  − Clients all moved to the next server on their list
    − At rate governed by server restart times
Fix: adjust ordering by success count + random #
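
The slide gives the fix only as “success count + random #”, so the following Python sketch is one possible reading, not the PsEPR implementation. The function names, the server names, the `jitter` parameter, and the sample success counts are all hypothetical. It compares a deterministic ranking, where every client with the same view picks the same server, against a ranking with a random term added to each score.

```python
import random
from collections import Counter


def choose_server(success_counts, rng, jitter):
    """Pick the server with the highest score = success count + random term."""
    def score(server):
        return success_counts[server] + rng.uniform(0.0, jitter)
    return max(success_counts, key=score)


def assign_clients(num_clients, success_counts, jitter, seed=0):
    """Let each client choose independently; return clients per server."""
    rng = random.Random(seed)
    return Counter(choose_server(success_counts, rng, jitter)
                   for _ in range(num_clients))


if __name__ == "__main__":
    # Assumed shared view: after failures, most clients see similar counts.
    view = {"s1": 5, "s2": 5, "s3": 4}
    print("no random term:  ", assign_clients(1000, view, jitter=0.0))
    print("with random term:", assign_clients(1000, view, jitter=2.0))
```

With `jitter=0.0` all 1000 clients land on the same server, which is the herd. With a random term comparable in size to the differences in success counts, clients spread across all three servers while still favoring the ones with better records.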

Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work

One definition of emergent behavior
Emergent behavior is that which cannot be predicted through analysis at any level simpler than that of the system as a whole.
  − George Dyson (1998)
Emergent misbehavior is just emergent behavior that we don’t want

Distinguishing between emergent and “normal” misbehavior
Misbehavior that is not emergent:
  − Single-component bugs that break the whole system
  − Inherently inefficient algorithms
  − Insufficient resources
  − Much work on computer systems reliability
    − Focuses on handling faults
    − Aims for “correct by construction”
Emergent misbehavior tends to be:
  − Global misbehavior arising from “correct” local behaviors
  − Related to the composition of independent parts
  − Related to delays and to decentralized control
It might not ever be possible to be definitive

Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work

Outline of a proposed research agenda
1. Create a taxonomy of emergent misbehaviors
   − To guide the rest of the agenda
2. Create a taxonomy of frequent causes
   − Generalize when possible; tie back to taxonomy #1
3. Develop detection and diagnosis techniques
   − Look for distinctive signatures from taxonomies
4. Develop prediction techniques
   − For better prediction of performance and failures
5. Develop amelioration techniques
   − System design tricks to avoid emergent misbehavior
6. Develop testing techniques
   − Strategies for smoking out emergent misbehavior during testing

Taxonomy #1: kinds of emergent misbehavior
Thrashing
Unwanted synchronization
Unwanted oscillation or periodicity
Deadlock
Livelock
Phase change
Chaotic behavior
etc.

Taxonomy #2: Frequent causes of emergent misbehavior
Unexpected resource sharing
Massive scale
Decentralized control
Lack of composability
Misconfiguration
Unexpected inputs or loads
Communication delay
etc.

There’s a lot more work to do!
A little more discussion in the paper …
Hopefully, a few dissertations, from people with more energy than I have.

Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work

Visions of the future (large-scale and enterprise systems)
Automatic control of data centers and services
  − Beyond “lights out” to “minimal human involvement”
  − Feedback control of almost everything
Service-oriented computing
  − Construction by composition of “services”
  − Correctness by construction
  − Loose coupling via networks
Declarative approaches
  − “Models” for components and their composition

Visions of the future: ignoring emergent misbehavior?
Automatic control of data centers and services
  − Feedback loops can lead to surprises
    − Especially when several loops are working at cross purposes
Service-oriented computing
  − Composition of dynamic behaviors could yield surprises
  − Loose coupling via networks: adds latency
Declarative approaches
  − Rule-based systems are hard to debug
  − Less explicit control over dynamics than procedural style?

Outline
Examples
What is/is not “emergent misbehavior”?
A research agenda
Thoughts about visions of the future
Related work

Related work
Lots of related work on good side of emergence
  − E.g.: Dyson, Darwin Among the Machines (1998)
Non-computer work on misbehavior:
  − Parunak & VanderBok (1997) “Managing emergent behavior in distributed control systems”
Computer systems work on emergent misbehavior:
  − Term first(?) used by Ed Nisley (Dr. Dobb’s J., 2004)
  − Steven Gribble (HotOS, 2001)
    − Making systems more robust in the face of the unexpected
  − National Research Council report: A Research Agenda for Networked Systems of Embedded Computers (2001)

Summary
We’ve already seen lots of emergent misbehavior
Trends could make things worse in the future
CS research on reliability has focussed on faults
We need to understand emergent misbehavior
We need ways to cope with it
A lot more detail in the paper

Advice for OSDI Authors
There will be no extensions to the deadline
Papers that violate the format requirements will be rejected.