Berkeley RAD Lab Technical Vision

Berkeley RAD Lab Technical Vision
Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica
RADS Retreat, June 2005

Outline
- Overall Vision
- Internet Services Vision (ServRADS)
- Network Vision (NetRADS)
- Internet Services Network architecture
- Principles and Summary

Overarching Mantra
- Enable a faster pace of network service innovation through new distributed system architectures that reduce operations cost by 2-3 orders of magnitude
- The Challenge:
  - Software systems: too much information => make sense of it through statistical learning & control theory
  - Network systems: too little information => exploit better observation and monitoring in the network infrastructure to drive management processes

In practice this means …
- A single person can write, deploy, and operate the next-generation IT business (“the Fortune 1 million”)
  - Do for Internet apps what the Web did for individual publishing
  - Gray’s challenge: a planetary-scale distributed system operated by a single part-time operator
- Goal: programmers focus on functionality; put the *-ilities in the platform
- Could be built on utility computing, giving access to distributed physical resources
- Integrated approach to network and server/service management
- Requires a 100x-1000x reduction in TCO from today’s levels

What things are like today
- World-scale services are created and operated by expert teams
  - It takes a “Google-sized organization” to create a Google
  - Amazon’s book browsing, designed by programmers, is cumbersome
  - Browsing for housewares, designed by domain experts on mature infrastructure, is more usable
- We don’t know what the next “killer app” will be!
  - The NOW project didn’t predict Internet search as a “killer app” for NOWs
- If we succeed, the next killer Internet app will be written, deployed, and operated, at Google-like scales, by a single programmer

Focusing on lowering cost of ownership
- Standard way to account for “where the money goes” in operating a deployed distributed application
- Definition independent of who is operating the app
- Operators per byte of storage or per CPU? No, doesn’t scale with technology changes
- Operators per end-user served? (This is the figure of merit for e-tailers)
- Operators per geographic region served?
- Operators per $ spent on capital cost?
- Operators per $ of revenue?

Enabling Technologies for Reducing TCO in ServRADS
- Past successes
  - Microrebooting: fast recovery makes false positives tolerable
  - Pinpoint: using SLT (statistical learning theory) to detect and localize fine-grain failures
  - Visualization + SLT to help operators & earn their trust
- Elements of technical vision
  - SLT and machine learning
  - Operator-centric visualization
  - Control theory
  - “Open source” failures database (sanitized, open failures & forensics repository)

Example scenarios
- Helping operators make sense of instrumentation
  - Using ML techniques to localize failures (P. Bodik, E. Kiciman)
  - Using automatically-induced statistical models to identify likely causes of performance problems (S. Zhang, I. Cohen et al.)
  - Combining SLT with visualization for cross-checking problem reports and rapidly spotting potential problems visually
- Facilitating self-tuning/configuration
  - Using control theory to improve performance of a distributed streaming database (W. Xu)
  - Service placement in wide-area distributed system (D. Oppenheimer)
- Microreboots (G. Candea) and microreplacement (S. Kawamoto) as low-cost prevention/repair strategies
- If false positive cost can be kept low, automate. Otherwise, help the operator do her job.
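The last rule on this slide (automate when false positives are cheap, otherwise hand off to the operator) can be read as an expected-cost comparison. The sketch below is one possible reading, not the RADS implementation: the Alarm type, the cost parameters, and the function name are all hypothetical, and it assumes the detector can supply a probability that the fault is real.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    component: str
    p_real: float  # assumed: detector's estimate that the fault is real


def choose_response(alarm, action_cost, outage_cost_per_s, operator_delay_s):
    """Return 'automate' or 'escalate' (hypothetical policy, for illustration).

    Acting automatically always costs `action_cost`, whether or not the
    alarm was a false positive. Waiting for the operator costs nothing if
    the alarm was false, but a real fault keeps hurting users for
    `operator_delay_s` seconds.
    """
    expected_auto = action_cost
    expected_wait = alarm.p_real * outage_cost_per_s * operator_delay_s
    return "automate" if expected_auto <= expected_wait else "escalate"


alarm = Alarm(component="shopping-cart", p_real=0.2)
# A microreboot is cheap, so acting on an 80%-likely false alarm still pays off.
cheap = choose_response(alarm, action_cost=1.0,
                        outage_cost_per_s=0.5, operator_delay_s=300)
# A full restart is expensive, so the same alarm goes to the operator.
costly = choose_response(alarm, action_cost=600.0,
                         outage_cost_per_s=0.5, operator_delay_s=300)
```

Under these made-up numbers the cheap recovery is automated and the expensive one is escalated, which is the point the slide makes about microreboots making false positives tolerable.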

Services example: combining viz + SLT

Reduce TCO via Planetary-scale Abstractions
- Inspiration: narrowly-focused planetary-scale abstractions whose design & implementation...
  - scale well: understand distributed scheduling, locality, symptoms of wide-area failures
  - are monitorable and controllable (using SLT & linear CT)
  - retain precisely-quantifiable and “acceptable” semantics under partial-failure conditions
- Examples of existing “narrow but powerful” services
  - MapReduce in Google understands data locality
    - Can easily imagine a “lossy” MapReduce, like online aggregation
  - Queues/messaging in Yahoo, Amazon, others
  - User information database in Yahoo
  - Instrumentation collection & analysis services using Telegraph-CQ

RADS Network Problem
- Internet routing has proven to be robust
- But …
  - Poor visibility: hard to determine health of the network
  - Routing policy interactions defeat propagation of useful diagnostic info: difficult to identify root cause problems
  - Slow reaction times to connectivity failures; operator intervention (across admin domains) increases cost of ownership
- Key observation: network service failures attributed to unexpected traffic surge patterns
- Approach: identify and protect “good” traffic during surge
- Mechanism deployed in network edge:
  - It’s where the servers and clients are located
  - Greatest need for lowering management costs
  - Administrative scope and responsibility is well-defined

iBoxes: New network element for Observe, Analyze, Act
- Inspection-and-Action Boxes:
  - Deep multiprotocol packet inspection
  - No routing; observation & marking
  - Policing points: drop, fence, block
[Figure: Enterprise Network Architecture]

Network-Level Observe-Analyze-Act
- Observe
  - Packet, path, protocol, service invocation statistical collection and sampling: frequencies, latencies, completion rates
  - Construct the collection infrastructure
- Analyze
  - Determine correlations among observations
  - “Normal” model discovery + anomaly detection
  - Exploit SLT
- Act
  - Experiment to test correlations
  - Prioritize and throttle
  - Mark and annotate
  - Control theory?
- Distributed analyses and actions
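The loop on this slide can be sketched in a few lines. The example below is only an illustration under stated assumptions: it swaps in a plain z-score test as the “normal” model (the slide does not say which SLT technique is used), and the class name, threshold, and latency values are made up.

```python
import statistics


class NormalModel:
    """Learns what 'normal' looks like from samples it has accepted so far."""

    def __init__(self, threshold=3.0, min_samples=5):
        self.samples = []
        self.threshold = threshold      # assumed z-score cutoff
        self.min_samples = min_samples  # observe this many before judging

    def is_anomalous(self, value):
        if len(self.samples) < self.min_samples:
            return False  # still discovering the normal model
        mean = statistics.fmean(self.samples)
        stdev = statistics.pstdev(self.samples)
        if stdev == 0:
            return value != mean
        return abs(value - mean) / stdev > self.threshold

    def learn(self, value):
        self.samples.append(value)


def observe_analyze_act(latencies_ms, model):
    """Observe each sample, analyze it against the model, then act on it."""
    actions = []
    for value in latencies_ms:
        if model.is_anomalous(value):
            actions.append(("throttle", value))  # act: protect the service
        else:
            model.learn(value)                   # only normal traffic trains the model
            actions.append(("forward", value))
    return actions


# Hypothetical request latencies with one surge sample.
actions = observe_analyze_act([10, 11, 9, 10, 12, 10, 95, 11], NormalModel())
```

Anomalous samples are deliberately kept out of the model so that a surge does not redefine what counts as normal.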

Network Layer Mechanism: Annotations
- Enhance network visibility: disseminate observations, communicate actions, provide in-band network management actions, iBox-to-iBox communications
- iBoxes label packets at annotation layer but do not rewrite packet contents
- Annotations stack, must be removed from packets before delivery to A-layer unaware end nodes
[Figure: layer stack. Phy, Link, Network, Annotation, Transport, Session, Presentation, Application]
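A minimal sketch of the annotation behavior described on this slide: iBoxes push labels onto a per-packet stack without touching the payload, and the stack is stripped before the packet reaches an end node that does not understand the annotation layer. The Packet type, the function names, and the annotation strings are hypothetical; the slide does not define a wire format.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    annotations: list = field(default_factory=list)  # stack; last pushed is on top


def annotate(packet, ibox_id, note):
    """An iBox pushes a label; the payload is never rewritten."""
    packet.annotations.append((ibox_id, note))
    return packet


def deliver(packet, receiver_is_annotation_aware):
    """Strip the whole annotation stack for A-layer unaware end nodes."""
    if not receiver_is_annotation_aware:
        packet.annotations.clear()
    return packet


pkt = Packet(payload=b"GET /index.html")
annotate(pkt, "I-S", "dns-surge-observed")
annotate(pkt, "I-I", "rate-limit-mail")
top = pkt.annotations[-1]
delivered = deliver(pkt, receiver_is_annotation_aware=False)
```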

Scenario: Traffic Surge Inhibiting Network Services
[Figure: enterprise network topology showing the Internet Edge, Distribution Tier, Server Edge, and Access Edge, with inspection points I-I, I-S, and I-A, the Primary & Secondary DNS Servers, the Mail Server, and the Spam Appliance]
- DNS Server swamped by excessive request traffic
- Observe: DNS timeouts, Web access traffic slowed, but also higher than normal mail delivery latency, implying busy server edge (correlation between Mail Server and DNS Server utilization?)
- Root cause: high DNS request rates generated by Spam Appliance, triggered by mail surge
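The question this slide raises (is Mail Server utilization correlated with DNS Server utilization?) can be checked with a Pearson correlation over paired utilization samples. This is a sketch only: the slide does not say which statistic would be used, and the utilization figures below are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-minute utilization (%) sampled at the server edge.
mail_util = [10, 20, 30, 40, 50]
dns_util = [12, 22, 29, 41, 52]
r = pearson(mail_util, dns_util)
```

A coefficient near 1 would support the hypothesis that the mail surge is what drives the DNS load; it does not by itself establish the direction of cause, which is why the later slide proposes an experiment.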

Scenario: How Diagnosed?
[Figure: enterprise network topology, repeated from the traffic-surge scenario slide]
- I-S detects high link utilization but abnormally high DNS traffic
- Stats from I-I: high mail traffic, low outgoing web traffic, incoming traffic high but link utilization not high
- Stats from I-A: lower web traffic, no unusual mail origination
- Problem localized to Server Edge, but visibility limited: RADS can help

Scenario: Possible Action Responses
[Figure: enterprise network topology, repeated from the traffic-surge scenario slide]
- Experiment: redirect local DNS requests to the Secondary DNS server; if these complete, can infer the server is the problem, not the network
- Throttle: due to the Mail Server-DNS correlation, block/slow email traffic at the Server Edge; should expect reduced DNS server utilization
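One common way to "slow" a traffic class is a token bucket, shown below as an illustration of the throttle response. The slide does not name a mechanism, so the token bucket is a swapped-in technique, and the class name, rate, and burst size are assumptions. Time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Admit at most `rate_per_s` messages per second, with bursts up to `burst`."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous decision, in seconds

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # forward the mail message
        return False      # hold back or drop it at the Server Edge


# Hypothetical policy: one mail message per second, bursts of two.
bucket = TokenBucket(rate_per_s=1.0, burst=2)
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)]
```

With these settings the first two messages at time zero pass, the third is held back, and one more passes after a second has elapsed.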

Embodying principles in a prototype
- Platform architecture and prototype to enable rapid innovation in network services by non-experts
  - automatically accommodates scaling, provisioning, failure management
  - multi-datacenter (geoplexed)
  - observable networks connecting datacenters
  - potentially planetary scale
  - runs with minimal operator oversight
- Prototype keeps various research projects focused on a common goal and allows ongoing testing
- Participation in standards processes to promote “best practices” in platform as open standards

Reliable Adaptive Distributed Systems
[Figure: RADS conceptual architecture. Labels: Operator; User; Prototype Applications; Programming Abstractions for roll-back and wide-area distributed computations; Distributed Middleware (Client and Server); crash-only services + observation infrastructure for system SLT; SLT Services; Application-Specific Overlay Network; Checkable Protocols; Fast Detection & Route Recovery; observation infrastructure for network SLT; iBoxes; Edge Networks; Internet IP Network; Router; Commodity Internet]

Generic iBox Architecture
[Figure: block diagram. Labels: Input Ports; Output Ports; Interconnection Fabric; Buffers; Classification Processor (CP); Action Processor (AP); “Tag” Mem; Rules & Programs]
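The diagram labels suggest a two-stage pipeline: a classification processor tags each packet according to rules, and an action processor looks up what to do for that tag. The sketch below is a guess at that flow, not the actual iBox design. The rule format, the tag names, and the packet fields are hypothetical; the action names drop, fence, and block come from the earlier iBox slide.

```python
# Hypothetical rules: (predicate over packet fields, tag). First match wins.
RULES = [
    (lambda p: p["proto"] == "udp" and p["dst_port"] == 53, "dns"),
    (lambda p: p["proto"] == "tcp" and p["dst_port"] == 25, "mail"),
]

# Hypothetical tag-to-action table, standing in for the "Tag" memory.
ACTIONS = {"dns": "forward", "mail": "fence"}


def classify(packet, rules):
    """Classification processor: return the tag of the first matching rule."""
    for predicate, tag in rules:
        if predicate(packet):
            return tag
    return "default"


def act(tag, actions):
    """Action processor: look up the action for a tag; forward if unknown."""
    return actions.get(tag, "forward")


def process(packet, rules=RULES, actions=ACTIONS):
    tag = classify(packet, rules)
    return tag, act(tag, actions)


mail_result = process({"proto": "tcp", "dst_port": 25})
web_result = process({"proto": "tcp", "dst_port": 80})
```

Keeping the rules and the action table as data rather than code mirrors the "Rules & Programs" box: behavior can change by loading new tables without touching the forwarding path.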

Possible architecture of a rack
[Figure: datacenter block diagram. Labels: app. server & application, e.g. J2EE; microrecovery actions; high-level effectors; control loops; high-level sensor data; SLT algorithms; T-CQ engine; preprocessed data; sanitized data; visualization; syndrome identification; externally-induced failures, workload changes, etc.; datacenter boundary, with links from and to other datacenters]

ServRADS: Observations & Summary
- SLT algorithms make sense of large amounts of data
  - Classification, outlier/anomaly detection, clustering, etc.
- Viz helps operator use “visual pattern recognition” to quickly spot problems and cross-check SLT models
  - Enables operator expertise to be quickly brought to bear
  - Builds operators’ trust in statistical/machine learning models
- Challenges
  - Fundamental challenges associated with applying SLT to problem determination (coming up next session)
  - Unifying many techniques into a coherent approach: prototype platform as unifying artifact
- Idea: capture best practices in TCO-optimized, planetary-scale abstractions

NetRADS: Observations & Summary
- COPS: paradigm for (more) automatically protecting critical resources when the network is under stress
  - Checkable protocols: visible semantics
  - Observe network behavior: good (easy), bad (hard), suspicious
  - Protect services: throttle, redirect
- Network management is a major contributor to TCO
- NetRADS built on:
  - iBoxes: pervasive infrastructure for observation and action at the network level
  - Annotation Layer: for marking, control, inter-iBox communications
- Integration with Internet service approach for service/server-level visibility and integrated management