Nick Feamster Georgia Tech Joint work with Mukarram bin Tariq, Murtaza Motiwala, Yiyi Huang, Mostafa Ammar, Anukool Lakhina, Jim Xu Detecting and Diagnosing Network Performance Degradations with Statistical Methods
2 Network Performance Problems Frequent and varied –Analysis of performance problems on the NANOG mailing list suggests a reasonably major incident every 2-3 days Causes range from malice, to misconfiguration, to software error, to physical breach
3 Conventional: Domain Knowledge Common approach: Apply domain knowledge to diagnose the cause of a problem or performance degradation Example –Router configuration checker [Figure labels: Distributed router configurations (single AS), Normalized Representation, Correctness Specification, Constraints, Faults] Problem: Must define (and know) problems in advance!
4 Complementary Approach: Statistics Need not have overly complete models of protocol, behavior, network, problems –All effects on application behavior –All causes of network disruption –All failure scenarios… Instead: Statistical model for desired behavior –Based on behavior of protocol –Agnostic to underlying causes –Automatically discover Dependencies Cases that might violate this behavior
5 This Talk: Two Problems Detecting network-wide routing anomalies –Monitoring network-wide routing disruptions –Watch for deviations from Detecting application-level performance degradations –Monitor application performance from array of clients –Place clients into strata and adjust for confounding factors
6 Routing Disruptions: Overview Network routing disruptions are frequent –On Abilene from January 1,2006 to June 30, s, 282 disruptions How to help network operators deal with disruptions quickly? –Massive amounts of data –Lots of noise –Need for fast detection
7 Existing Approaches Many existing tools and data sources –Tivoli Netcool, SNMP, Syslog, IGP, BGP, etc. Possible issues –Noise level –Time to detection Network-wide correlation/analysis –Not just reporting on manually specified traps This talk: Explore complementary data sources –First step: Mining BGP routing data
8 Challenges: Analyzing Routing Data Large volume of data Lack of semantics in a single stream of routing updates Needed: Mining, not simple reporting Idea: Can we improve detection by mining network- wide dependencies across routing streams?
9 Key Idea: Network-Wide Analysis Structure and configuration of the network give rise to dependencies across routers Analysis should be cognizant of these dependencies. Don't treat streams of data independently. Big network events may cause correlated blips.
10 Overview
11 Detection Approach: network-wide, multivariate analysis –Model network-wide dependencies directly from the data –Extract common trends –Look for deviations from those trends High detection rate (for acceptable false positives) –100% of node/link disruptions, 60% of peer disruptions Fast detection –Current time to reporting (in minutes)
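The "extract common trends, look for deviations" idea can be sketched in a few lines. The slides do not name the statistical technique, so this sketch assumes a rank-1 principal-component model (computed by power iteration) as the "common trend"; the update counts, the function names, and the threshold rule are all hypothetical.

```python
import math

def leading_direction(rows, iters=100):
    """Leading principal direction of mean-centered rows (power iteration)."""
    n = len(rows[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        # One multiplication by X^T X (the unnormalized covariance matrix).
        scores = [sum(r[j] * v[j] for j in range(n)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def residual_norms(matrix):
    """Distance of each time bin from the common (rank-1) trend."""
    n = len(matrix[0])
    means = [sum(r[j] for r in matrix) / len(matrix) for j in range(n)]
    centered = [[r[j] - means[j] for j in range(n)] for r in matrix]
    v = leading_direction(centered)
    norms = []
    for r in centered:
        score = sum(r[j] * v[j] for j in range(n))
        resid = [r[j] - score * v[j] for j in range(n)]
        norms.append(math.sqrt(sum(x * x for x in resid)))
    return norms

def detect(matrix, k=3.0):
    """Flag bins whose residual exceeds k times the median residual (assumed rule)."""
    res = residual_norms(matrix)
    median = sorted(res)[len(res) // 2]
    threshold = k * max(median, 1e-9)
    return [t for t, x in enumerate(res) if x > threshold]

# Hypothetical BGP update counts: rows are time bins, columns are three routers
# that normally move together; router 0 deviates in bin 6.
base = [10, 20, 40, 80, 60, 30, 40, 10, 25, 50, 70, 45]
counts = [[b, 2 * b, 1.5 * b] for b in base]
counts[6][0] += 25

print(detect(counts))
```

A single-stream threshold on router 0 would not single out bin 6, since its count (65) is below the busiest bins; the deviation only stands out relative to what the other routers report at the same time.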
12 Identification: Approach Goal: Classify disruptions into four types –Internal node, internal link, peer, external node Approach: Track three features 1. Global iBGP next-hops 2. Local iBGP next-hops 3. Local eBGP next-hops
13 Identification: Results
14 Key Results 90% of local disruptions are visible in BGP –Many disruptions are low volume –Disruption size can vary by several orders of magnitude About 75% involve more than 2 routers –Analyze data across streams –BGP routing data is but one possible input data set Detection –100% of node and link disruptions –60% of peer disruptions Identification –100% of node disruptions –74% of link disruptions –93% of peer disruptions
15 Two Problems Detecting network-wide routing anomalies –Monitoring network-wide routing disruptions –Watch for deviations from Detecting application-level performance degradations –Monitor application performance from array of clients –Place clients into strata and adjust for confounding factors
16 Net Neutrality
17 Example: BitTorrent Blocking
18 Many Forms of Discrimination Throttling/prioritizing based on destination or service –Target domains, applications, or content Discriminatory peering –Resist peering with certain content providers …
19 Problem Statement Identify whether a degradation in service performance is caused by discrimination by an ISP –Quantify the causal effect Existing techniques detect specific ISP methods –TCP RST (Glasnost) –ToS-bit based de-prioritization (NVLens) Goal: Establish a causal relationship in the general case, without assuming anything about the ISP's methods
20 Causality: Analogy from Health Epidemiology: Study causal relationships between risk factors and health outcome NANO (Network Access Neutrality Observatory): Infer causal relationship between ISP and service performance
21 Does Aspirin Make You Healthy? Sample of patients Positive correlation in health and treatment Can we say that Aspirin causes better health? Confounding Variables: correlate with both cause and outcome variables and confuse the causal inference
               Aspirin   No Aspirin
Healthy        40%       15%
Not Healthy    10%       35%
[Figure: Sleep Duration, Diet, Other Drugs, and Gender as confounders of the Aspirin → Health relationship, which is marked "?"]
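The "positive correlation" in the Aspirin table can be made concrete with a small calculation. This sketch assumes the four percentages are joint shares of the whole patient sample (they sum to 100%); the variable names are illustrative.

```python
# Joint shares of the sample, read from the Aspirin table (assumed interpretation).
table = {
    ("aspirin", "healthy"): 0.40,
    ("aspirin", "not_healthy"): 0.10,
    ("no_aspirin", "healthy"): 0.15,
    ("no_aspirin", "not_healthy"): 0.35,
}

def p_healthy(group):
    """Fraction of a treatment group that is healthy."""
    healthy = table[(group, "healthy")]
    total = healthy + table[(group, "not_healthy")]
    return healthy / total

# Association: difference in the healthy fraction between the two groups.
association = p_healthy("aspirin") - p_healthy("no_aspirin")
print(round(p_healthy("aspirin"), 2), round(p_healthy("no_aspirin"), 2), round(association, 2))
```

Under this reading, 80% of the Aspirin group is healthy versus 30% of the other group. That gap is an association only; the confounders on the slide could account for some or all of it.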
22 Does an ISP Degrade Service? Sample of client performances Some correlation in ISP and service performance Can we say that Comcast is discriminating? Many confounding variables can confuse inference.
                           Comcast   No Comcast
BitTorrent Download Time   5 sec     2 sec
[Figure: Client Setup, ToD, and Content Location as confounders of the Comcast → BT Download Time relationship, which is marked "?"]
23 Causation vs. Association Causal Effect = E(Real download time using Comcast) − E(Real download time not using Comcast) = $E(G_1) - E(G_0)$ The first term is performance with the ISP; the second is baseline performance. $G_1$, $G_0$: Ground-truth values for performance (a.k.a. counterfactual values) Problem: No ground-truth values for the same clients. In situ data sets cannot directly estimate causal effect.
24 Causation vs. Association Association = E(Download time using Comcast) − E(Download time not using Comcast) The first term is observed performance with the ISP; the second is observed baseline performance. Observing association in an in situ data set In general, association ≠ causal effect. How to estimate causal effect?
25 Estimating Causal Effect Two common approaches –Random Treatment –Adjusting for Confounding Variables
26 Strawman: Random Treatment Given a population: –Treat subjects with Aspirin randomly, irrespective of their health –Observe new outcome and measure association –For large samples, association converges to causal effect if confounding variables do not change Diet, other drugs, etc. should not change [Figure: Aspirin-treated and untreated subjects, each healthy (H) or not healthy (!H); resulting association = 0.55]
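A small simulation illustrates why random treatment works. Everything here is an assumed toy model, not data from the talk: one binary confounder (say, diet) raises both the chance of taking Aspirin and the chance of being healthy, and the true causal effect of treatment is fixed at 0.2.

```python
import random

TRUE_EFFECT = 0.2  # assumed causal effect of treatment on P(healthy)

def association(samples):
    """E[outcome | treated] - E[outcome | untreated] over (treated, outcome) pairs."""
    treated = [y for x, y in samples if x == 1]
    untreated = [y for x, y in samples if x == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

def simulate(n, randomized, rng):
    """Draw n subjects; the confounder z affects health and, unless randomized, treatment."""
    samples = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        if randomized:
            p_treat = 0.5                    # coin flip, independent of z
        else:
            p_treat = 0.8 if z else 0.2      # self-selected treatment depends on z
        x = 1 if rng.random() < p_treat else 0
        p_healthy = 0.2 + 0.4 * z + TRUE_EFFECT * x
        y = 1 if rng.random() < p_healthy else 0
        samples.append((x, y))
    return samples

observational = association(simulate(100_000, False, random.Random(0)))
randomized = association(simulate(100_000, True, random.Random(1)))
print(f"observational: {observational:.2f}  randomized: {randomized:.2f}  true: {TRUE_EFFECT}")
```

In this model the observational association is inflated well above the true effect (about 0.44 in expectation), while the randomized association lands close to 0.2, because randomization breaks the link between the confounder and the treatment.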
27 Random Treatment of ISPs: Hard! Ask clients to change ISP to an arbitrary one Difficult to achieve on the Internet –Changing ISP is cumbersome for the users –Changing ISP may change other confounding variables, e.g., the ISP network changes.
28 Adjusting for Confounding Variables
[Figure: an in situ data set of treated and baseline subjects, healthy (H) or not healthy (!H), grouped into strata, with the effect measured per stratum]
1. List confounders, e.g., gender
2. Collect a data set
3. Stratify along confounder variable values
4. Measure association
5. If there still is association, then it must be causation
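The stratification steps can be sketched as follows. The records are hypothetical: each is (uses the ISP, application, download time in seconds), with the application as the only confounder, and per-stratum effects are combined by a stratum-size-weighted average, which is one common choice rather than a method stated in the talk.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def naive_association(records):
    """Difference in mean outcome, ignoring the confounder."""
    treated = [y for t, _, y in records if t]
    baseline = [y for t, _, y in records if not t]
    return mean(treated) - mean(baseline)

def adjusted_effect(records):
    """Measure association within each stratum, then average weighted by stratum size."""
    strata = defaultdict(lambda: {True: [], False: []})
    for t, stratum, y in records:
        strata[stratum][t].append(y)
    total, weighted = 0, 0.0
    for groups in strata.values():
        if not groups[True] or not groups[False]:
            continue  # a stratum without both groups gives no comparison
        size = len(groups[True]) + len(groups[False])
        weighted += size * (mean(groups[True]) - mean(groups[False]))
        total += size
    return weighted / total

# Hypothetical data: app2 is slower for everyone, and ISP clients mostly use app2.
records = (
    [(False, "app1", 1.0)] * 8 + [(False, "app2", 3.0)] * 2 +
    [(True, "app1", 1.5)] * 2 + [(True, "app2", 3.5)] * 8
)

print(round(naive_association(records), 2), adjusted_effect(records))
```

Here the naive association is 1.7 seconds, but within each application the ISP adds only 0.5 seconds; the rest of the gap comes from the application mix. Step 5 holds only if the confounder list is complete.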
29 Adjusting for Confounding: ISPs What is the baseline? What are the confounding variables? Is the list of confounders sufficient? How to collect the data? Can we infer more than the effect? –e.g., the discrimination criteria
30 What is the Baseline? Baseline: performance when ISP is not used –We need to use some ISP for comparison –What if the one we use is not neutral? Solutions –Use average performance over all other ISPs –Use a lab model –Use the service provider's model
31 Determine Confounding Variables Client Side –Client setup (Network Setup) –Application (Browser, BT Client, VoIP client) –Resources (Memory, CPU, Utilization) ISP Related –Not all ISPs are equal; e.g., location Temporal –Diurnal cycles, transient failures
32 Data Collection Quantify confounders and effect Identify the treatment variable Unbiased Passive measurements at the client end
33 Inferring the Underlying Cause Label data in two classes: –discriminated (-), non-discriminated (+) Train a decision tree for classification –rules provide hints about the criteria [Figure: decision tree over Domain and Content Size (MB)]
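To show how a learned rule can hint at the discrimination criteria, here is a minimal sketch that fits a one-level decision stump, standing in for a full decision tree. The feature names follow the figure (domain, content size in MB); the records, labels, and function names are hypothetical.

```python
from collections import Counter

def majority_errors(labels):
    """Errors made by predicting the majority label for this group."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def candidate_tests(rows, feature):
    """Midpoint thresholds for numeric features, equality tests for categorical ones."""
    values = sorted({r[feature] for r in rows})
    if all(isinstance(v, (int, float)) for v in values):
        return [(">", (a + b) / 2) for a, b in zip(values, values[1:])]
    return [("==", v) for v in values]

def passes(row, feature, op, value):
    return row[feature] > value if op == ">" else row[feature] == value

def best_split(rows, features, label="label"):
    """Return (errors, feature, op, value) for the single test with fewest errors."""
    best = None
    for f in features:
        for op, v in candidate_tests(rows, f):
            yes = [r[label] for r in rows if passes(r, f, op, v)]
            no = [r[label] for r in rows if not passes(r, f, op, v)]
            errors = majority_errors(yes) + majority_errors(no)
            if best is None or errors < best[0]:
                best = (errors, f, op, v)
    return best

# Hypothetical labeled samples: "-" = discriminated, "+" = non-discriminated.
rows = [
    {"domain": "a.example", "content_size_mb": 1, "label": "+"},
    {"domain": "a.example", "content_size_mb": 50, "label": "-"},
    {"domain": "b.example", "content_size_mb": 2, "label": "+"},
    {"domain": "b.example", "content_size_mb": 80, "label": "-"},
    {"domain": "a.example", "content_size_mb": 5, "label": "+"},
    {"domain": "b.example", "content_size_mb": 20, "label": "-"},
]

print(best_split(rows, ["domain", "content_size_mb"]))
```

On this toy data the best rule is `content_size_mb > 12.5` with zero errors, which would suggest that the (hypothetical) ISP targets large transfers rather than a particular domain.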
34 Preliminary Evaluation: Simulation Clients use two applications, App 1 and App 2, to access services Service 1 is slower using App 2 App is confounding ISP B throttles access to Service 1
Association (Baseline and ISP B row values missing)
               Service 1     Service 2
Association    0.92 (10%)    0.04 (1%)
Causation (Baseline and ISP B row values missing)
               Service 1                    Service 2
               App 1         App 2          App 1        App 2
Causation      2.05 (20%)    5.18 (187%)    0.06 (2%)    0.12 (4%)
35 Conclusion Detecting Routing Disruptions –Ability to detect and identify specific link/node disruptions without using domain-specific rules NANO: Black-box approach to infer and quantify discrimination; generically applicable. –Many open issues Privacy: Can we do local inference? Deployment: PlanetLab, CPR, Real Users How much data?: Depends on variance