Berkeley RAD Lab Technical Vision

Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica
RADS Retreat, June 2005

Outline
Overall Vision
Internet Services Vision (ServRADS)
Network Vision (NetRADS)
Internet Services Network architecture
Principles and Summary

Overarching Mantra
Enable a faster pace of network service innovation through new distributed system architectures that reduce operations cost by 2-3 orders of magnitude
The Challenge:
  Software systems: Too much information => make sense of it through statistical learning & control theory
  Network systems: Too little information => exploit better observation and monitoring in the network infrastructure to drive management processes

In practice this means …
Single person can write, deploy, operate the next-generation IT business (“the Fortune 1 million”)
  Do for Internet apps what Web did for individual publishing
  Gray’s challenge: planetary-scale distributed system operated by a single part-time operator
Goal: programmers focus on functionality; put the *ility in the platform
  Could be built on utility computing, giving access to distributed physical resources
  Integrated approach to network and server/service management
Requires 100x-1000x reduction in total cost of ownership (TCO) from today’s levels

What things are like today
World-scale services created and operated by expert teams
  “Google-sized organization” to create a Google
  Amazon’s book browsing, designed by programmers, is cumbersome
  Browsing for housewares, designed by domain experts on mature infrastructure, more usable
We don’t know what the next “killer app” will be!
  NOW project didn’t predict Internet search as a “killer app” for NOWs
If we succeed, the next killer Internet app will be written, deployed, operated, at Google-like scales, by a single programmer

Focusing on lowering cost of ownership
Standard way to account for “where the money goes” in operating a deployed distributed application
Definition independent of who is operating the app
Operators per byte of storage or per CPU? No, doesn’t scale with technology changes
Operators per end-user served? (This is the figure of merit for e-tailers)
Operators per geographic region served?
Operators per $ spent on capital cost?
Operators per $ of revenue?

Enabling Technologies for Reducing TCO in ServRADS
Past successes
  Microrebooting: fast recovery makes false positives tolerable
  Pinpoint: using statistical learning theory (SLT) to detect and localize fine-grain failures
  Visualization + SLT to help operators & earn their trust
Elements of technical vision
  SLT and machine learning
  Operator-centric visualization
  Control theory
  “Open source” failures database (sanitized, open failures & forensics repository)

Example scenarios
Helping operators make sense of instrumentation
  Using ML techniques to localize failures (P. Bodik, E. Kiciman)
  Using automatically-induced statistical models to identify likely causes of performance problems (S. Zhang, I. Cohen et al.)
  Combining SLT with visualization for cross-checking problem reports and rapidly spotting potential problems visually
Facilitating self-tuning/configuration
  Using control theory to improve performance of a distributed streaming database (W. Xu)
  Service placement in wide-area distributed system (D. Oppenheimer)
Microreboots (G. Candea) and microreplacement (S. Kawamoto) as low-cost prevention/repair strategies
  If false positive cost can be kept low, automate. Otherwise, help operator do her job.
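The "automate if false positives are cheap, otherwise help the operator" rule can be sketched as a small decision function. This is only an illustration: the `Recovery` class, the cost unit (seconds of disruption), the false-positive rate and the budget are all hypothetical values, not figures from the RADS projects.

```python
from dataclasses import dataclass


@dataclass
class Recovery:
    name: str
    disruption_s: float  # hypothetical cost (seconds of disruption) if run needlessly


def choose_response(recovery, false_positive_rate, budget_s):
    """Automate only when the expected cost of acting on a false alarm fits the budget."""
    expected_waste_s = false_positive_rate * recovery.disruption_s
    return "automate" if expected_waste_s <= budget_s else "notify_operator"


# Hypothetical recovery actions: a cheap microreboot and an expensive full restart.
microreboot = Recovery("microreboot", disruption_s=0.5)
full_restart = Recovery("full_restart", disruption_s=60.0)

print(choose_response(microreboot, false_positive_rate=0.3, budget_s=1.0))   # automate
print(choose_response(full_restart, false_positive_rate=0.3, budget_s=1.0))  # notify_operator
```

With the same detector, the cheap action is safe to trigger automatically while the expensive one is routed to the operator, which is the point of making recovery fast.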

Services example: combining viz + SLT

Reduce TCO via Planetary-scale Abstractions
Inspiration: narrowly-focused planetary-scale abstractions whose design & implementation...
  scale well: understand distributed scheduling, locality, symptoms of wide-area failures
  are monitorable and controllable (using SLT & linear control theory)
  retain precisely-quantifiable and “acceptable” semantics under partial-failure conditions
Examples of existing “narrow but powerful” services
  MapReduce in Google understands data locality
    Can easily imagine a “lossy” MapReduce, like online aggregation
  Queues/messaging in Yahoo, Amazon, others
  User information database in Yahoo
  Instrumentation collection & analysis services using Telegraph-CQ
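One way to picture "precisely-quantifiable semantics under partial failure" is an aggregate that returns its answer together with the fraction of partitions that contributed. The sketch below is a toy, not Google's MapReduce or any real API; `lossy_sum` and the use of `None` for a failed partition are assumptions made for illustration.

```python
def lossy_sum(partitions):
    """Sum the values of every partition that responded.

    `partitions` is a list in which each entry is a list of numbers, or
    None when that partition failed (a hypothetical failure encoding).
    Returns (partial_sum, coverage), where coverage is the fraction of
    partitions that contributed to the result.
    """
    if not partitions:
        return 0, 1.0  # nothing to aggregate, so nothing was lost
    responded = [p for p in partitions if p is not None]
    partial_sum = sum(sum(p) for p in responded)
    coverage = len(responded) / len(partitions)
    return partial_sum, coverage


# Three of four partitions answer; the caller sees both the sum and what it covers.
print(lossy_sum([[1, 2], [3], None, [4]]))  # (10, 0.75)
```

The caller can then decide whether 75% coverage is "acceptable" for its purpose instead of the whole computation failing.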

RADS Network Problem
Internet routing has proven to be robust. But …
  Poor visibility: hard to determine health of the network
  Routing policy interactions defeat propagation of useful diagnostic info: difficult to identify root cause problems
  Slow reaction times to connectivity failures; operator intervention (across admin domains) increases cost of ownership
Key observation: network service failures attributed to unexpected traffic surge patterns
Approach: identify and protect “good” traffic during surge
Mechanism deployed in network edge:
  It’s where the servers and clients are located
  Greatest need for lowering management costs
  Administrative scope and responsibility is well-defined

iBoxes: New network element for Observe, Analyze, Act
Enterprise Network Architecture
Inspection-and-Action Boxes:
  Deep multiprotocol packet inspection
  No routing; observation & marking
  Policing points: drop, fence, block

Network-Level Observe-Analyze-Act
Observe
  Packet, path, protocol, service invocation statistical collection and sampling: frequencies, latencies, completion rates
  Construct the collection infrastructure
Analyze
  Determine correlations among observations
  “Normal” model discovery + anomaly detection
  Exploit SLT
Act
  Experiment to test correlations
  Prioritize and throttle
  Mark and annotate
  Control theory?
Distributed analyses and actions
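A minimal sketch of the observe-analyze-act loop, assuming the simplest possible "normal" model (mean and standard deviation of a baseline) and a z-score test for anomalies. The slides do not specify a model; the function names, the threshold of 3 and the sample request rates are all hypothetical.

```python
import statistics


def fit_normal_model(baseline):
    """Analyze: learn a 'normal' model (mean, standard deviation) from observed samples."""
    return statistics.mean(baseline), statistics.stdev(baseline)


def is_anomalous(value, model, z_threshold=3.0):
    """Flag values more than z_threshold standard deviations away from the mean."""
    mean, std = model
    if std == 0:
        return value != mean
    return abs(value - mean) / std > z_threshold


def act(value, model):
    """Act: throttle traffic that looks anomalous, forward everything else."""
    return "throttle" if is_anomalous(value, model) else "forward"


# Observe: hypothetical DNS requests per second collected during normal operation.
baseline = [100, 104, 98, 101, 97, 102, 99, 103]
model = fit_normal_model(baseline)

print(act(101, model))  # forward
print(act(400, model))  # throttle
```

A real deployment would use richer statistical models and more careful actions; the sketch only shows how the three stages hand data to one another.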

Network Layer Mechanism: Annotations
Enhance network visibility: disseminate observations, communicate actions, provide in-band network management actions, iBox-to-iBox communications
iBoxes label packets at annotation layer but do not rewrite packet contents
Annotations stack, and must be removed from packets before delivery to end nodes that are unaware of the annotation layer (A-layer)
[Figure: layer stack. Phy, Link, Network, Annotation, Transport, Session, Presentation, Application]
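The annotation behavior described above (labels stack, packet contents are untouched, labels are stripped for unaware end nodes) can be sketched with a toy packet type. The `Packet` class, the `annotate` and `deliver` helpers and the iBox names are hypothetical; the slides do not define a wire format.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes                                    # never rewritten by iBoxes
    annotations: list = field(default_factory=list)   # stack of (ibox_id, note) labels


def annotate(packet, ibox_id, note):
    """Push a label onto the packet's annotation stack, leaving the payload alone."""
    packet.annotations.append((ibox_id, note))
    return packet


def deliver(packet, receiver_is_annotation_aware):
    """Strip all annotations before handing the packet to an unaware end node."""
    if not receiver_is_annotation_aware:
        packet.annotations.clear()
    return packet


pkt = Packet(payload=b"dns-query")
annotate(pkt, "ibox-internet-edge", "observed: surge")
annotate(pkt, "ibox-server-edge", "action: low priority")
print(pkt.annotations[-1])  # ('ibox-server-edge', 'action: low priority')

delivered = deliver(pkt, receiver_is_annotation_aware=False)
print(delivered.payload, delivered.annotations)  # b'dns-query' []
```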

Scenario: Traffic Surge Inhibiting Network Services
[Figure: enterprise network diagram. Labels: Internet Edge, Distribution Tier, Server Edge, Access Edge; Primary & Secondary DNS Servers, Mail Server, Spam Appliance; node labels II, IS, IA, R, S, E]
DNS Server swamped by excessive request traffic
Observe: DNS time outs, Web access traffic slowed, but also higher than normal mail delivery latency implying busy server edge (correlation between Mail Server and DNS Server utilization?)
Root Cause: High DNS request rates generated by Spam Appliance triggered by mail surge

Scenario: How Diagnosed?
I-S detects high link utilization but abnormally high DNS traffic
Stats from I-I: high mail traffic, low outgoing web traffic, incoming traffic high but link utilization not high
Stats from I-A: lower web traffic, no unusual mail origination
Problem localized to Server edge, but visibility limited: RADS can help

Scenario: Possible Action Responses
Experiment: Redirect local DNS requests to Secondary DNS server: if these complete, can infer the server is the problem, not the network
Throttle: Due to MS-DNS correlation, block/slow traffic at Server Edge: should expect reduced DNS server utilization
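The two reasoning steps in this scenario, correlating mail-server and DNS-server utilization and then interpreting the redirect experiment, can be sketched as follows. The utilization samples are made up, and `pearson` and `interpret_redirect` are hypothetical helper names; the slides do not say which correlation measure would be used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def interpret_redirect(secondary_requests_complete):
    """Experiment: redirect local DNS requests to the secondary DNS server.

    If the redirected requests complete, the primary server rather than the
    network is the likely problem. The slides do not say what to conclude
    otherwise, so that case is reported as inconclusive.
    """
    return "server" if secondary_requests_complete else "inconclusive"


# Hypothetical utilization samples (percent) taken at the same instants.
mail_util = [10, 20, 30, 40, 50]
dns_util = [12, 24, 33, 41, 55]

r = pearson(mail_util, dns_util)
print(round(r, 2))               # 1.0
print(interpret_redirect(True))  # server
```

A correlation this strong is what would justify throttling mail-related traffic at the Server Edge and then checking that DNS server utilization drops.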

Embodying principles in a prototype
Platform architecture and prototype to enable rapid innovation in network services by non-experts
  automatically accommodates scaling, provisioning, failure management
  multi-datacenter (geoplexed)
  observable networks connecting datacenters
  potentially planetary scale
  runs with minimal operator oversight
Prototype keeps various research projects focused on common goal and allows ongoing testing
Participation in standards processes to promote “best practices” in platform as open standards

Reliable Adaptive Distributed Systems
[Figure: layered RADS architecture. Labels: Operator; User; Prototype Applications; Programming Abstractions for roll-back and wide-area distributed computations; Distributed Middleware (client, server); Crash-only services + Observation Infrastructure for system SLT; SLT Services; Application-Specific Overlay Network; Checkable Protocols; Fast Detection & Route Recovery; Observation Infrastructure for network SLT; iBoxes at the Edge Networks; Router; IP Network; Commodity Internet]

Generic iBox Architecture
[Figure: generic iBox architecture. Labels: Input Ports, Output Ports, Buffers, Interconnection Fabric, Classification Processor (CP), Action Processor (AP), “Tag” Mem, Rules & Programs]
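As a rough illustration of how a classification processor and an action processor might cooperate, the sketch below tags each packet and then looks up an action for that tag in a rule table. Everything here is an assumption made for illustration: the port-based classifier, the tag names and the rule table are hypothetical, and only the action vocabulary (mark, drop) comes from the slides.

```python
# Hypothetical classification table: destination port -> traffic tag.
PORT_TAGS = {53: "dns", 25: "mail", 80: "web"}

# Hypothetical "rules & programs" store: tag -> action.
RULES = {"dns": "mark", "mail": "forward", "web": "forward", "unknown": "drop"}


def classify(packet):
    """Classification processor: assign a tag based on the destination port."""
    return PORT_TAGS.get(packet["dst_port"], "unknown")


def act(tag):
    """Action processor: look up the action configured for a tag."""
    return RULES[tag]


def process(packet):
    """Run one packet through classification and then action lookup."""
    tag = classify(packet)
    return tag, act(tag)


print(process({"dst_port": 53}))    # ('dns', 'mark')
print(process({"dst_port": 6667}))  # ('unknown', 'drop')
```

Keeping the tag as the only interface between the two stages mirrors the separate “Tag” memory in the figure: rules can change without touching the classifier.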

Possible architecture of a rack
[Figure: possible rack architecture. Labels: app. server & application (e.g. J2EE); microrecovery actions; high-level effectors; SLT algorithms; control loops; high-level sensor data; externally-induced failures, workload changes, etc.; T-CQ engine; sanitized data; preprocessed data; visualization; syndrome identification; datacenter boundary, with links from and to other datacenters]

ServRADS: Observations & Summary
SLT algorithms make sense of large amounts of data
  Classification, outlier/anomaly detection, clustering, etc.
Viz helps operator use “visual pattern recognition” to quickly spot problems and cross-check SLT models
  Enables operator expertise to be quickly brought to bear
  Builds operators’ trust in statistical/machine learning models
Challenge
  Fundamental challenges associated with applying SLT to problem determination (coming up next session)
  Unifying many techniques into a coherent approach - prototype platform as unifying artifact
Idea: capture best practices in TCO-optimized, planetary-scale abstractions

NetRADS: Observations & Summary
COPS: Paradigm for (more) automatically protecting critical resources when network is under stress
  Checkable protocols: visible semantics
  Observe network behavior: good (easy), bad (hard), suspicious
  Protect services: throttle, redirect
Network management major contributor to TCO
NetRADS built on:
  iBoxes: pervasive infrastructure for observation and action at the network level
  Annotation Layer: for marking, control, inter-iBox communications
Integration with Internet service approach for service/server-level visibility and integrated management
