1 A Research Program in Reliable Adaptive Distributed Systems (RADS)
Armando Fox*, Michael Jordan, Randy Katz, George Necula, David Patterson, Ion Stoica, Doug Tygar
University of California, Berkeley
*Stanford University
2 What Are We Trying to Do: New Approach for RADS
Dramatically improve the trustworthiness of networked systems
Observe: design observation points throughout the system
Analyze: infer via statistical learning
  –Respond: detect anomalous behavior vs. baseline
  –Learn: use observations to modify responses to future observations
Act:
  –Reactive: use control points in the system for rapid recovery when something wrong is detected
  –Proactive/protective: act on the system prophylactically to prevent a predicted impending failure
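The observe/analyze/act cycle can be illustrated with a minimal sketch. The slides do not name a specific learning method, so the running mean and standard deviation baseline (a z-score test) used here is an assumption, as are the names `BaselineDetector`, `observe_analyze_act`, and the `recover` callback. Only the reactive path is shown; proactive action would need a predictive model that this sketch does not attempt.

```python
import math


class BaselineDetector:
    """Hypothetical baseline model: running mean/variance with a z-score test."""

    def __init__(self, z_threshold=3.0, min_samples=5):
        self.z_threshold = z_threshold  # assumed cutoff, not from the slides
        self.min_samples = min_samples  # observations needed before judging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's algorithm)

    def is_anomalous(self, x):
        """Respond: compare a new observation against the learned baseline."""
        if self.n < self.min_samples:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0:
            return x != self.mean
        return abs(x - self.mean) / std > self.z_threshold

    def learn(self, x):
        """Learn: fold a normal observation into the baseline."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def observe_analyze_act(samples, detector, recover):
    """Run the loop over a stream of observations; return anomalous indices."""
    anomalies = []
    for t, x in enumerate(samples):      # Observe
        if detector.is_anomalous(x):     # Analyze
            recover(t, x)                # Act (reactive)
            anomalies.append(t)
        else:
            detector.learn(x)
    return anomalies
```

Anomalous observations are deliberately kept out of the baseline here so that one spike does not widen the range treated as normal; that is a design choice of this sketch, not something the slides prescribe.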
3 Today’s Systems are Too Brittle
Fragile, easily broken, yielding poor dependability and security
  –E.g., Amazon: yearly revenue $3.1B, downtime costs $600,000/hr
Why?
  –Existing systems focus on performance, not fast adaptive detection and response to failure and attack
  –Fundamentally incorrect assumptions
    »Humans are perfect
    »Software can be made bug free
    »Maintenance is “free”
People/HW/SW failures are facts, not problems
“If a problem has no solution, it may not be a problem, but a fact--not to be solved, but to be coped with over time” -- Shimon Peres
4 Failures and Attacks Inevitable … so Design for Rapid Adaptation
Rapid application and server recovery, agile network rerouting, proactive protective actions...
  –No distinction between “normal operation” and “recovery”
Elements of our solution
  –Programming paradigms for robust recovery
  –Crash-only software design for rapid server recovery
  –Network protocols designed for observation to allow rapid detection of behavioral violations
  –Instrumentation and online statistical analysis for anomaly detection and failure diagnosis/localization
Adaptation benchmarks to measure progress
  –What you can’t measure, you can’t improve
  –Collect real failure data to drive benchmarks
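One way to picture "no distinction between normal operation and recovery" is a component whose only startup path is its recovery path and which has no clean-shutdown procedure at all. The sketch below is illustrative only: the class `CrashOnlyCounter`, the JSON state file, and the `committed_writes` field are hypothetical, and real crash-only servers involve much more (for example, externalized session state and request retries).

```python
import json
import os
import tempfile


class CrashOnlyCounter:
    """Hypothetical crash-only component: starting up IS recovering."""

    def __init__(self, state_path):
        self.state_path = state_path
        # There is no separate "fresh start" path; every start recovers.
        self.state = self._recover()

    def _recover(self):
        """Load the last durable state, or fall back to an empty one."""
        try:
            with open(self.state_path) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return {"committed_writes": 0}

    def commit_write(self):
        """Persist each update atomically so a crash at any point is safe."""
        new_state = {"committed_writes": self.state["committed_writes"] + 1}
        directory = os.path.dirname(os.path.abspath(self.state_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(new_state, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic rename: readers see either the old state or the new one.
        os.replace(tmp_path, self.state_path)
        self.state = new_state
```

Because there is no `shutdown()` method, stopping the component is the same as crashing it, and a restart simply constructs a new instance against the same state file.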
5 Example: anomaly detection meets crash-only design
Use simple time series analysis on key operating statistics (committed writes, offered load, etc.)
Count relative frequencies of all substrings of length k or shorter, look for discrepancies in relative frequencies across replicas
Works even when period is irregular or not known a priori
If you see anything unusual, coerce to a crash and recover from that; reboot is nearly free, so occasional false positives are OK
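The substring-frequency comparison can be sketched as follows. The slide does not say how the operating statistics are turned into symbols or how "discrepancy" is scored, so several things here are assumptions: each replica's trace is taken to be an already-discretized sequence of symbols, discrepancy is half the L1 distance between relative-frequency tables, and a replica is flagged when its median distance to its peers exceeds an arbitrary `threshold`. All function names are hypothetical.

```python
from collections import Counter
from statistics import median


def substring_frequencies(events, k):
    """Relative frequency of every contiguous substring of length 1..k."""
    counts = Counter()
    for n in range(1, k + 1):
        for i in range(len(events) - n + 1):
            counts[tuple(events[i:i + n])] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()} if total else {}


def discrepancy(freq_a, freq_b):
    """Half the L1 distance between two frequency tables (0 = identical)."""
    keys = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(s, 0.0) - freq_b.get(s, 0.0)) for s in keys)


def find_anomalous_replicas(traces, k=3, threshold=0.3):
    """Flag replicas whose substring frequencies disagree with most peers.

    `traces` maps a replica name to its symbol sequence. The median rule
    and the threshold value are assumptions made for this sketch.
    """
    freqs = {name: substring_frequencies(t, k) for name, t in traces.items()}
    flagged = []
    for name, f in freqs.items():
        peers = [discrepancy(f, g) for other, g in freqs.items() if other != name]
        if peers and median(peers) > threshold:
            flagged.append(name)
    return sorted(flagged)
```

In the spirit of the slide, a flagged replica would simply be crashed and restarted rather than diagnosed in place, which is why a loose threshold with occasional false positives is tolerable.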
6 Security Challenges for RADS
Need new techniques to detect and respond to rapidly-evolving attacks
But these techniques can themselves be used to mount attacks
  –So we must secure the learning process
Rapid secure protocol synthesis tools can be applied to this problem
7 Approach for Success: Interdisciplinary Expertise
Interdisciplinary Team
  –Armando Fox/Dave Patterson: Dependable System Design
  –Randy Katz/Ion Stoica: Network Services/Protocols
  –Michael Jordan: Statistical Learning Theory
  –Ion Stoica/Doug Tygar: Verification of networks and security
  –George Necula: Language/Applications-level mechanisms
Spans algorithm design and system implementations
  –Comprehensive distributed architecture embedding SLT as a primitive building block
  –Embedding observational and inference means at strategic points throughout the distributed system
  –New kinds of statistical inference and verification techniques able to execute on-line and in real-time
8 RADS Conceptual Architecture
[Diagram labels: Commodity Internet & IP networks; Edge Network; Router; PNE; Client; Server; Distributed Middleware; SLT Services; Application-Specific Overlay Network; Operator; User; Prototype Application: Messaging, E-Mail for Operational Systems]
Programming Abstractions for Roll-back (Necula)
Crash-Only Middleware & Servers, System O&C Infrastructure (Fox)
Protocols Enabling Fast Detection & Route Recovery, Network O&C Infrastructure (Katz, Stoica)
Online Statistical Learning Algorithms (Jordan)
Benchmarks, Tools for Human Operators (Patterson)
Pervasive security considerations (Tygar)
Reduction to practice of on-line SLT and observe/analyze/act infrastructure
Reusable embeddable components
9 Vulnerable Messaging Application that Requires Trustworthiness
DETER Testbed for Prototyping
[Diagram labels: Allies Networks; DHS/Federal Network; Coalition Internet; Compromised Network With Embedded Adversaries; Local Police, Fire, State Police; Adversary; Active Adversary; Service Attacks; Exploit; Net Failure; Trust Relations; Incident Reports; Responder Locations; GIS Data; Etc.]
10 Scientific Foundation For “Self-*” Systems
New design principles and tools for systems that continuously adjust their behavior in response to analysis of online observations
New metrics and benchmarks for evaluating self-adapting networked systems
Advances in Statistical Learning Theory to move from offline to online analysis of large-scale distributed systems
11 Measuring Success
Build messaging prototype using RADS design principles and tools
Put realistic performance workload on prototype, embed in DHS DETER testbed
Subject prototype to increasingly aggressive failure and attack workloads
  –E.g., hardware failures, software failures, operator failures, worm attacks, DDoS attacks, …
Measure false positive rates, accuracy rates, time to analyze failures, time to act, performance impact of actions, availability of prototype, performability of prototype, …
Compare results with conventional systems under similar performance, failure, and attack workloads
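A few of the listed measurements have standard definitions that can be computed directly from experiment logs. The sketch below assumes each injected-fault trial is recorded as a pair of booleans (detector raised an alarm, a fault was actually present) and that uptime and downtime are measured in seconds; the function names and the log format are hypothetical, not part of the proposal.

```python
def detection_metrics(predicted, actual):
    """False positive rate and accuracy from per-trial detector outcomes.

    predicted[i] is True if the detector raised an alarm in trial i;
    actual[i] is True if a failure or attack was really present.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    negatives = fp + tn
    total = tp + fp + tn + fn
    return {
        # Share of fault-free trials that still triggered an alarm.
        "false_positive_rate": fp / negatives if negatives else 0.0,
        # Share of all trials the detector classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
    }


def availability(uptime_seconds, downtime_seconds):
    """Fraction of the measurement window during which the service was up."""
    window = uptime_seconds + downtime_seconds
    return uptime_seconds / window if window else 0.0
```

Running the same functions over logs from a conventional system under the same workloads would give the side-by-side comparison the slide calls for.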
12 New Funding Opportunity: NSF CyberTrust Program
From RFP:
People rely on systems based on networked computers
  –Too vulnerable to cyber attacks: inhibit function, corrupt data, or expose private information
Promote vision where networked systems are:
  –More predictable, more accountable, and less vulnerable to attack and abuse;
  –Developed, configured, operated and evaluated by a well-trained and diverse workforce;
  –Used by a public educated in their secure and ethical operation
Example research area: improve trustworthiness of networks; explore evolving nature of security protocols and policies in communications networks
Individual, Team projects and 1-2 Centers
13 CATS: Center for Adaptive Trustworthy Systems
Dramatically improve the trustworthiness of networked systems
New understanding of how to construct such systems
  –Observe-Analyze-Act
  –From responding to known problems to learning new problems
  –From reacting to problems to proactively responding before problems become significant
  –Experimental method of benchmarking, prototyping, and deployment to provide context
Technical Thrusts
  –Statistical Learning Theory
  –Crash-Only Software
  –Behaviorally-Consistent and Secure Protocols
  –Programmable Network Elements
Integration Vehicle
  –Application: Disaster Response Messaging
  –Supported by prototype distributed system architecture
  –Deployment and Evaluation Plan
14 We need your help and support!
Discussion?