Action Breakout Session
Participants: Anil, AP, Nina Bhatti, Charles Berdnall, Joe Hellerstein, Wei Hu, Anthony Joseph, Randy Katz, Li, Machi Mukund Kimmo Raatikanen, Siva
Breakout Goal
- Identify research questions and issues related to adaptive action invocation that enhances the dependability and security of distributed systems
- The customer is the “system administrator,” not the end user
Breakout Process
- Define actions by example
- Discuss cross-layer interaction and coordination
- Distill underlying principles
Key Observations
- Distinguish between control actions (e.g., “slow down”) and data actions (e.g., “drop packets”)
- Distinguish between actions performed internally or locally and actions that affect global behavior
- Control loops operate at multiple levels, regionally and globally
- Performance-related actions are the basic building block
- The control system itself can be the target of an adversarial attack
Working Examples
- Network Storage Service; Media Streaming Service
  - Multiple instances of the service at various places in the network
  - Direct requests to the best available service instance
  - Balance requests among service instances
  - Fall back to an alternative service instance in the face of failure or a DoS attack
  - Coordinate client-side and server-side measurements to reduce load through admission control and content adaptation
  - Distinguish between server overload and network overload
  - For clients “not in the loop” (heterogeneous clients, adversarial clients), proxy the necessary behavior inside the network
- Network Denial of Service
  - Data traffic overload starves control traffic
  - Secondary performance effects: session resets, router CPUs driven to high utilization, etc.
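The request-direction bullets above (direct requests to the best available instance, balance load, fall back on failure, and tell server overload from network overload) can be sketched in Python. This is a minimal illustration under stated assumptions, not the group's design: the `Instance` fields, the `pick_instance` and `diagnose` helpers, and the 0.9 load and 100 ms latency thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical view of one service replica (storage or streaming server)."""
    name: str
    healthy: bool   # False once a failure or DoS attack has been detected
    load: float     # server-side measurement: utilization from 0.0 to 1.0
    rtt_ms: float   # client-side measurement: round-trip time to this replica


def pick_instance(instances: List[Instance], max_load: float = 0.9) -> Optional[Instance]:
    """Direct a request to the best available instance.

    Unhealthy or overloaded replicas are skipped, so requests fall back to the
    remaining ones; among those, the least-loaded replica wins, with latency
    as the tie-breaker. None means no replica can take the request, which the
    caller could treat as an admission-control rejection.
    """
    candidates = [i for i in instances if i.healthy and i.load < max_load]
    if not candidates:
        return None
    return min(candidates, key=lambda i: (i.load, i.rtt_ms))


def diagnose(instance: Instance, max_load: float = 0.9, rtt_limit_ms: float = 100.0) -> str:
    """Combine server- and client-side measurements to locate the overload."""
    if instance.load >= max_load:
        return "server overload"
    if instance.rtt_ms > rtt_limit_ms:
        return "network overload"
    return "ok"


replicas = [
    Instance("east", healthy=False, load=0.2, rtt_ms=10.0),  # hit by a DoS attack
    Instance("west", healthy=True, load=0.5, rtt_ms=40.0),
    Instance("central", healthy=True, load=0.3, rtt_ms=25.0),
]
print(pick_instance(replicas).name)  # central
```

Ranking by load first spreads requests across replicas; a real deployment would presumably refresh `load` and `rtt_ms` continuously rather than read static values.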
Control Theoretic Viewpoint
- Black boxes that are managed by a control system
- Actuation points that can be acted upon to control the system
  - E.g., apply backpressure to clients to slow down the request rate (control); degrade content quality (data)
  - E.g., prioritize/reserve bandwidth for control traffic; policy settings are control actions, enforcement of policy is a data action
- A single control loop vs. multiple independent control loops: which is better?
- Theory provides tools for managing “disturbances”
- Note that the control system can itself be the target of attack
- Hellerstein: an action is a change to a configuration
  - E.g., buffer pool size, weights in a load balancer
  - E.g., uninstall/reinstall software
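To make the control-theoretic framing concrete, the sketch below treats the admitted request rate as an actuation point and the server's queue length as the measured output, applying a proportional correction each interval (backpressure when the queue is above target). Following Hellerstein's framing, the resulting action is recorded as a configuration change together with the old value so it can be undone. The function names, the gain of 0.5, and the rate bounds are assumptions for illustration only.

```python
from typing import Any, Dict, Tuple


def backpressure_step(rate: float, queue_len: float, target_queue: float,
                      gain: float = 0.5, min_rate: float = 1.0,
                      max_rate: float = 1000.0) -> float:
    """One proportional-control step on the admitted request rate.

    A queue above target yields a negative error and lowers the rate
    (backpressure on clients); a queue below target raises it. The result is
    clamped to [min_rate, max_rate].
    """
    error = target_queue - queue_len
    return max(min_rate, min(max_rate, rate + gain * error))


def apply_action(config: Dict[str, Any], key: str, value: Any) -> Tuple[Dict[str, Any], Any]:
    """Model an action as a configuration change.

    Returns the new configuration plus the previous value, so the action can
    be rolled back if it turns out to do harm.
    """
    old_value = config.get(key)
    return {**config, key: value}, old_value


config = {"admit_rate": 100.0, "lb_weights": [1, 1]}
new_rate = backpressure_step(config["admit_rate"], queue_len=150, target_queue=50)
config, previous = apply_action(config, "admit_rate", new_rate)
print(config["admit_rate"], previous)  # 50.0 100.0
```

Returning the old value keeps every action reversible, which fits the containment principle of trying local, undoable changes first.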
General Observations: Causality and Visibility
- Actions can lead to cascaded actions
  - Can interactions/side effects be modeled/made explicit?
  - Action graph model: the probability that a following action will be invoked as the result of a given current action
  - In general, this is difficult to determine in advance
  - Could it be learned through observation and analysis?
- Is it feasible to place action points at every potential bottleneck site?
  - Note that routers are badly designed black boxes: it is difficult and time-consuming to extract their internal state
- Tradeoff between centralized collection of state, which may be “complete” but out of date, and decentralized collection, which may be more timely but globally incomplete
- Principle of containment: first, do no harm; local actions are potentially less disastrous than global actions
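The action graph model, and the question of whether it could be learned by observation, can be sketched by counting which action follows which in logged action sequences and normalizing the counts into transition probabilities. This is a first-order (Markov-style) estimate that treats succession as a proxy for causation, which is a simplifying assumption; the trace format and the action names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def learn_action_graph(traces: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next action | current action) from observed action sequences.

    Each trace is an ordered list of action names taken from operational logs;
    every consecutive pair (current, following) is counted as one observation
    of `current` being followed by `following`.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            counts[current][following] += 1

    graph: Dict[str, Dict[str, float]] = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        graph[current] = {nxt: n / total for nxt, n in followers.items()}
    return graph


traces = [
    ["throttle_clients", "session_reset", "reroute"],
    ["throttle_clients", "session_reset"],
    ["throttle_clients", "reroute"],
]
graph = learn_action_graph(traces)
print(round(graph["throttle_clients"]["session_reset"], 2))  # 0.67
```

Actions that were never followed by anything (here `reroute`) have no outgoing edges, so they do not appear as keys in the learned graph.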
General Observations: Managing Disturbances
- Instabilities arise when delays in taking action are introduced
  - Latencies in response
  - Imperfect knowledge of the state
- Tradeoff between making decisions over longer intervals spanning more state vs. shorter intervals spanning less state
  - Time intervals adapt … kept short to ensure useful work is always being done
  - E.g., disk scheduling in the Storage Server
    - You can only do work you are aware of
    - Keep the queues short to achieve the best performance
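The claim that delays in taking action introduce instability can be illustrated with a deliberately simplified linear model (an assumption, not a model of any particular system): at each step a controller removes a fixed fraction of the error it observes, but its observation is `delay` steps stale. With the same gain, the loop with fresh measurements settles, while the loop with stale ones overshoots and oscillates with growing amplitude. The `simulate` function, the gain of 0.5, and the delay of 3 are hypothetical choices.

```python
from typing import List


def simulate(delay: int, gain: float = 0.5, initial_error: float = 50.0,
             steps: int = 60) -> List[float]:
    """Return the error trajectory e[0], ..., e[steps].

    Linearized dynamics: e[t+1] = e[t] - gain * e[t - delay].
    The controller corrects based on a measurement `delay` steps old. Before
    t = 0 the error is assumed to have been constant at initial_error.
    """
    history = [initial_error] * (delay + 1)  # padding: e[-delay], ..., e[0]
    for _ in range(steps):
        observed = history[-1 - delay]       # stale measurement e[t - delay]
        history.append(history[-1] - gain * observed)
    return history[delay:]                   # drop the padding before t = 0


fresh = simulate(delay=0)
stale = simulate(delay=3)
print(fresh[:4])  # [50.0, 25.0, 12.5, 6.25]
print(stale[:6])  # [50.0, 25.0, 0.0, -25.0, -50.0, -62.5]
```

A standard result for this recurrence is that it is stable only for 0 < gain < 2 cos(delay·π / (2·delay + 1)); that bound is 2 with no delay but roughly 0.445 for a delay of 3, so a gain that is safe with fresh data becomes too aggressive once the data is stale.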
General Observations: Predictive Actions
- Waiting too long to detect a problem limits the ability to respond
- Characterize workload/response changes as a signature of impending system performance failure
- Response to workload changes: “gradual” vs. cliff degradation
  - E.g., as the I/O workload grows, predict increases in response latency
  - E.g., IBM detects changes in the slope of activity to trigger resource allocation to manage flash crowds in web server farms
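The slide does not describe how IBM's system detects slope changes, so the following is only a hypothetical stand-in for the general idea. It fits a least-squares slope to two adjacent windows of an activity metric (e.g., requests per second) and raises an alarm when the recent slope is both steep and several times the earlier one, which would be the cue to allocate resources before hitting the cliff. The window size, ratio, and minimum slope are assumed values.

```python
from typing import List, Optional


def slope(values: List[float]) -> float:
    """Least-squares slope of values against their indices 0..n-1 (n >= 2)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def slope_change_alarm(series: List[float], window: int = 5, ratio: float = 3.0,
                       min_slope: float = 1.0) -> Optional[int]:
    """Return the index of the first sample at which the trend sharply steepens.

    Compares the slope of the most recent `window` samples with the slope of
    the `window` samples before them. None means no alarm was raised.
    """
    for end in range(2 * window, len(series) + 1):
        previous = slope(series[end - 2 * window:end - window])
        recent = slope(series[end - window:end])
        if recent >= min_slope and recent > ratio * previous:
            return end - 1
    return None


# Requests per second: gradual growth, then the onset of a flash crowd.
rps = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 130, 160, 190, 220, 250]
print(slope_change_alarm(rps))  # 10
```

Requiring a minimum slope keeps small fluctuations on a flat baseline from tripping the ratio test, since any positive slope exceeds a multiple of a near-zero one.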
General Observations: Don’t Ignore the Human Decision Maker
- Human operators in the loop
- Research challenge: visualizing the configuration and state of the system for a human decision maker
- Higher-order configuration and administration tools and frameworks