Experimental Design for Practical Network Diagnosis

Presentation transcript:

Experimental Design for Practical Network Diagnosis
Yin Zhang, University of Texas at Austin (yzhang@cs.utexas.edu)
Joint work with Han Hee Song and Lili Qiu
MSR EdgeNet Summit, June 2, 2006

Practical Network Diagnosis
Ideal:
- Every network element is self-monitoring, self-reporting, self-..., and there are no silent failures.
- An oracle walks through the haystack of data, accurately pinpoints root causes, and suggests response actions.
Reality:
- Finite resources (CPU, bandwidth, human cycles, ...): cannot afford to instrument/monitor every element.
- Decentralized, autonomous nature of the Internet: infeasible to instrument/monitor every organization.
- Protocol layering minimizes information exposure: difficult to obtain complete information at every layer.
Practical network diagnosis: maximize diagnosis accuracy under the given resource constraints and information availability.

Design of Diagnosis Experiments
Input:
- A candidate set of diagnosis experiments (reflects infrastructure constraints).
- Information availability: existing information already available, and the information provided by each new experiment.
- Resource constraint: e.g., the number of experiments to conduct (per hour), the number of monitors available.
Output: a diagnosis experimental plan
- A subset of experiments to conduct.
- Configuration of various control parameters, e.g., frequency, duration, sampling ratio, ...

Example: Network Benchmarking
- 1000s of virtual networks run over the same physical network.
- Want to summarize the performance of each virtual net, e.g., the traffic-weighted average of individual virtual path performance (loss, delay, jitter, ...).
- A similar problem exists for monitoring per-application/per-customer performance.
- Challenge: cannot afford to monitor all individual virtual paths (the N^2 explosion, times 1000s of virtual nets).
- Solution: monitor a subset of virtual paths and infer the rest.
- Q: which subset of virtual paths to monitor?

Example: Client-based Diagnosis
- Clients probe each other; use tomography/inference to localize the trouble spot, e.g., links/regions with high loss rate, delay jitter, etc.
- Challenge: pair-wise probing is too expensive due to the N^2 explosion.
- Solution: monitor a subset of paths and infer the link performance.
- Q: which subset of paths to probe?
(Slide figure: a client asking "Why is it so slow?" over a topology spanning AOL, C&W, Sprint, UUNet, AT&T, and Qwest.)

More Examples
Wireless sniffer placement:
- Input: a set of candidate locations to place wireless sniffers (not all locations are possible; some people hate to be surrounded by sniffers), the monitoring quality at each candidate location (e.g., probabilities of capturing packets from different APs), the expected workload of different APs, and the locations of existing sniffers.
- Output: K additional locations for placing sniffers.
Cross-layer diagnosis:
- Infer layer-2 properties based on layer-3 performance.
- Which subset of layer-3 paths to probe?

Beyond Networking
- Software debugging: select a given number of tests to maximize the coverage of corner cases.
- Car crash testing: crash a given number of cars to find a maximal number of defects.
- Medicine design: conduct a given number of tests to maximize the chance of finding an effective ingredient.
- Many more ...

Need a Common Solution Framework
Can we have one framework that solves them all, as opposed to ad hoc solutions for individual problems?
Key requirements:
- Scalable: works for large networks (e.g., 10000 nodes).
- Flexible: accommodates different applications.
  - Differentiated design: different quantities have different importance; e.g., a subset of paths belongs to a major customer.
  - Augmented design: conduct additional experiments given existing observations, e.g., after measurement failures.
  - Multi-user design: multiple users are interested in different parts of the network or have different objective functions.

NetQuest
A baby step towards such a framework: "NetQuest: A flexible framework for large-scale network measurement", Han Hee Song, Lili Qiu, and Yin Zhang. ACM SIGMETRICS 2006.
- Achieves scalability and flexibility by combining Bayesian experimental design with statistical inference.
- Developed in the context of end-to-end performance monitoring; can extend to other network monitoring/diagnosis problems.

What We Want
- A function f(x) of link performance x. We use a linear function f(x) = F*x in this talk.
- Ex. 1: average link delay, f(x) = (x1 + ... + x11)/11.
- Ex. 2: end-to-end delays.
- Applies to any additive metric, e.g., log(1 - loss rate).
- Design space: a subset of paths to probe. Given a subset of paths, infer the rest. Which subset of paths should we probe to maximize the information gain?
(Slide figure: an example network with nodes 1-7 and links labeled x1-x11 by their delays.)

Speaker notes: The goal of network measurement is to estimate a function of link performance, denoted f(x). In this talk we consider a linear f(x). For example, in the network above, xi denotes the delay of link i; if we are interested in the average link delay, f(x) is the average of all the xi. Another property that is often interesting is the end-to-end delay on all network paths: the delay on one path is a linear combination of link performance (e.g., the delay between nodes 1 and 2 is x1, and between 1 and 3 is x1 + x2), so f(x) takes the form of a zero/one matrix multiplied by the link delays. f(x) also applies to other additive metrics, such as log(1 - loss rate), when we want to estimate the end-to-end value between every pair of nodes; in that case f(x) = A*x, where A is the routing matrix.
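To make the linear form f(x) = F*x concrete, here is a minimal sketch with a hypothetical 3-link, 2-path toy network (not the network in the slide figure):

```python
import numpy as np

# Hypothetical toy network: 3 links, 2 end-to-end paths.
# Path 1 traverses links 1 and 2; path 2 traverses links 2 and 3.
F = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])   # routing matrix: rows = paths, cols = links
x = np.array([10.0, 5.0, 2.0])    # per-link delays (ms)

print(F @ x)        # end-to-end path delays: [15.  7.]
print(x.mean())     # network-wide average link delay: ~5.67 ms
```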

Problem Formulation
What we can measure: end-to-end (e2e) performance.
- Network performance estimation. Goal: estimate f(x), e.g., the e2e performance on some paths.
- Design of experiments: select a subset of paths S to probe such that we can estimate f(x) based on the observed performance y_S, the corresponding rows A_S, and y_S = A_S x.
- Network inference: given e2e performance, infer link performance; that is, infer x based on y = A x and the values of y and A.

Speaker notes: In practice, due to instrumentation constraints, we can usually measure only e2e performance, so we need network inference to recover link performance from e2e observations; the inference algorithm infers x using y = A*x and the values of y and A. State of the art: pairwise probing. More recently, Chen et al. developed a rank-based approach, based on the fact that only a subset of rows F_s of the routing matrix F is needed to span the row space of F; given measurements for the paths corresponding to the rows in F_s, one can exactly reconstruct the e2e performance on all other paths. Clearly, probing every path is too expensive; to be scalable we can only afford to probe a small subset of paths. We also cannot pick an arbitrary subset, because the accuracy of the inference depends on which paths we choose. Ideally, we want to pick a "best" subset of paths so that we can infer f(x) from the measurement results as accurately as possible. We cast this path selection into a Bayesian framework: associate a utility with each design and pick the design that maximizes the expected utility (typically with a Gaussian prior).

Design of Experiments
State of the art:
- Probe every path (e.g., RON). Not scalable, since the number of paths grows quadratically with the number of nodes.
- Rank-based approach [SIGCOMM 2004]: let A denote the routing matrix; monitor rank(A) linearly independent paths to exactly reconstruct end-to-end path properties (see the sketch after this slide). Still very expensive.
Our approach: select a "best" subset of paths to probe so that we can accurately infer f(x).
How do we quantify the goodness of a subset of paths?
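As a rough illustration of the rank-based idea, here is a sketch of one standard way to pick a maximal set of linearly independent rows (QR with column pivoting); this is an assumption for illustration, not necessarily the SIGCOMM 2004 algorithm itself:

```python
import numpy as np
from scipy.linalg import qr

def independent_rows(A, tol=1e-10):
    """Indices of a maximal set of linearly independent rows of A,
    via QR factorization with column pivoting applied to A^T."""
    _, R, piv = qr(A.T, pivoting=True)
    rank = int(np.sum(np.abs(np.diag(R)) > tol))   # numerical rank
    return piv[:rank]

# Toy routing matrix: row 2 is the sum of rows 0 and 1, so the rank is 3.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 0.0]])
print(independent_rows(A))   # indices of 3 independent rows out of 4
```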

Bayesian Experimental Design
- A good design maximizes the expected utility under the optimal inference algorithm.
- Different utility functions yield different design criteria. Let Sigma_post denote the posterior covariance of x given the measurements in the design, under a Gaussian prior on x.
- Bayesian A-optimality. Goal: minimize the squared error, i.e., minimize tr(F Sigma_post F^T).
- Bayesian D-optimality. Goal: maximize the expected gain in Shannon information, i.e., maximize log det(Sigma_post^{-1}).

Speaker notes: We can formally cast the problem into a Bayesian decision-theoretic framework. Let d denote a measurement design, i.e., the subset of paths we decide to probe, and let i be the inference algorithm we will use to infer f(x) from the e2e measurement results. We associate a utility function U(d, i) with design d and inference i, and assume some prior distribution on x (typically Gaussian). Bayesian experimental design states that a good design is one that maximizes the expected utility under the optimal inference algorithm. Bayesian experimental design has found many applications in scientific research, for example in car crash testing and in medicine design.
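A minimal sketch of how the two criteria could be scored for a candidate path subset, assuming a zero-mean Gaussian prior on x with covariance Sigma_prior and i.i.d. Gaussian measurement noise; these modeling assumptions and names are illustrative, not necessarily the paper's exact formulation:

```python
import numpy as np

def posterior_cov(A_S, Sigma_prior, noise_var=1.0):
    """Posterior covariance of x after observing y_S = A_S x + noise,
    under a zero-mean Gaussian prior with covariance Sigma_prior."""
    precision = np.linalg.inv(Sigma_prior) + A_S.T @ A_S / noise_var
    return np.linalg.inv(precision)

def a_optimality(A_S, F, Sigma_prior, noise_var=1.0):
    """Expected squared error of f(x) = F x (smaller is better)."""
    P = posterior_cov(A_S, Sigma_prior, noise_var)
    return np.trace(F @ P @ F.T)

def d_optimality(A_S, Sigma_prior, noise_var=1.0):
    """Shannon-information-gain criterion (larger is better)."""
    P = posterior_cov(A_S, Sigma_prior, noise_var)
    return -np.linalg.slogdet(P)[1]   # = log det of the posterior precision
```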

Search Algorithm
- Given a design criterion phi, the next step is to find s rows of A that optimize phi.
- This problem is NP-hard.
- We use a sequential search algorithm that greedily selects, at each step, the row yielding the largest improvement in phi.
- Better search algorithms?
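A minimal sketch of the greedy sequential search; the scoring function is passed in (for example, the A-optimality score from the previous sketch), and the names are illustrative:

```python
import numpy as np

def greedy_select(A, score_fn, s):
    """Greedily pick s rows (paths) of A; at each step add the row whose
    inclusion gives the best (lowest) value of score_fn on the chosen set."""
    chosen, remaining = [], list(range(A.shape[0]))
    for _ in range(s):
        best_score, best_row = min(
            (score_fn(A[chosen + [i], :]), i) for i in remaining)
        chosen.append(best_row)
        remaining.remove(best_row)
    return chosen

# Usage with the A-optimality score sketched earlier, e.g.:
# paths = greedy_select(A, lambda A_S: a_optimality(A_S, F, Sigma_prior), s=50)
```

This naive version re-evaluates the criterion from scratch for every candidate row; a practical implementation would update the posterior covariance incrementally as rows are added.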

Flexibility
- Differentiated design: give higher weights to the important rows of matrix F (one possible encoding is sketched after this slide).
- Augmented design: ensure that the newly selected paths, in conjunction with the previously monitored paths, maximize the utility.
- Multi-user design: the new design criterion is a linear combination of the different users' design criteria.

Speaker notes: We then develop a flexible framework on top of Bayesian experimental design that accommodates common requirements in large-scale network performance inference. In particular, we support multi-user measurement design, where users are interested in measuring different parts of the network or have different measurement objectives; we use a design criterion that is a linear combination of all users' criteria. To support augmented design when some previous measurement experiments fail, we select new paths to monitor such that they, together with the previously monitored paths, maximize the utility. We also support differentiated design, achieved by assigning higher weights to the important elements.
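One simple way differentiated design could be encoded, as an assumption for illustration rather than the paper's exact mechanism, is to up-weight the important rows of F before computing the A-optimality score:

```python
import numpy as np

def weighted_F(F, preferred_rows, weight=4.0):
    """Scale the preferred rows of F so that the A-optimality score
    tr(F_w P F_w^T) penalizes errors on those rows more heavily."""
    w = np.ones(F.shape[0])
    w[preferred_rows] = weight
    return w[:, None] * F
```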

Network Inference
- Goal: find x such that y = Ax.
- Main challenge: an under-constrained problem.
- Approaches: L2-norm minimization, L1-norm minimization, maximum entropy estimation.
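A minimal sketch of the L2-norm option: since y_S = A_S x is under-constrained, solve a ridge-regularized least-squares problem (the L1-norm and maximum-entropy variants would swap in different objectives; the regularization constant here is illustrative):

```python
import numpy as np

def infer_links_l2(A_S, y_S, reg=1e-3):
    """Infer link performance x from observed path performance y_S by
    minimizing ||A_S x - y_S||_2^2 + reg * ||x||_2^2; the ridge term
    handles the under-constrained system."""
    n_links = A_S.shape[1]
    lhs = A_S.T @ A_S + reg * np.eye(n_links)
    return np.linalg.solve(lhs, A_S.T @ y_S)

# Example: two monitored paths over three links.
A_S = np.array([[1.0, 1.0, 0.0],
                [0.0, 1.0, 1.0]])
y_S = np.array([15.0, 7.0])
print(infer_links_l2(A_S, y_S))   # approximate per-link delays
```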

Evaluation Methodology
Data sets:
  Data set           | # nodes | # overlay nodes | # paths | # links | Rank
  PlanetLab-RTT      | 2514    | 61              | 3657    | 5467    | 769
  PlanetLab-loss     | 1795    | 60              | 3270    | 4628    | 690
  Brite-n1000-o200   | 1000    | 200             | 39800   | 2883    | 2051
  Brite-n5000-o600   | 5000    | 600             | 359400  | 14698   | 9729
Accuracy metric: normalized mean absolute error (MAE), defined in the notes below.

Speaker notes: We quantify the accuracy of inference using the normalized mean absolute error (MAE), defined as MAE = sum_i |infer_i - actual_i| / sum_i actual_i, where infer_i and actual_i are the inferred and actual end-to-end performance for path i; a lower MAE indicates higher accuracy. We evaluate our approach using two sets of real Internet performance data: NLANR and RON. The NLANR traces contain round-trip time, packet loss, and traceroute measurements between pairs of 140 universities in Oct. 2004, with a probing frequency of once per minute. For diversity, we also use the latency and loss measurements provided by the Resilient Overlay Network (RON) project; the data contain round-trip and loss measurements among 12-15 hosts during March 2001 and May 2001, with 5.6 million samples altogether.
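For concreteness, a minimal sketch of the accuracy metric as defined in the notes above (values are illustrative):

```python
import numpy as np

def normalized_mae(inferred, actual):
    """Normalized mean absolute error:
    sum_i |inferred_i - actual_i| / sum_i actual_i (lower is better)."""
    inferred, actual = np.asarray(inferred), np.asarray(actual)
    return np.sum(np.abs(inferred - actual)) / np.sum(actual)

print(normalized_mae([14.0, 8.0], [15.0, 7.0]))   # -> 0.0909...
```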

Comparison of DOE Algorithms: Estimating Network-Wide Mean RTT
A-optimal yields the lowest error.

Speaker notes: A global view of the performance aggregated over an entire network is useful for a variety of reasons: estimating a typical user experience (as in the Internet End-to-end Performance Monitoring (IEPM) project), detecting anomalies, trouble-shooting, and optimizing performance. With the A-optimal design, the inference error decays very fast: the error is within 15% when monitoring only 2% of the paths. In comparison, the inference error is 50% or more higher than A-optimal when the same number of paths are monitored under the other schemes. To achieve an inference error within 10%, the other schemes require monitoring 60% more paths than A-optimal. Finally, as the number of monitored paths increases, all schemes converge to close to 0 inference error, since there is then sufficient information to reconstruct the global network view; random selection converges more slowly because it does not ensure that the selected paths are linearly independent.

Comparison of DOE Algorithms: Estimating Per-Path RTT
A-optimal yields the lowest error.

Speaker notes: The rank-based approach requires monitoring 769 paths, which is expensive. In comparison, the other approaches provide a smooth tradeoff between inference accuracy and measurement overhead, and among them A-optimal performs the best: to achieve a 10% inference error, the other algorithms need to monitor up to 60% more paths than A-optimal. Note that D-optimal performs significantly worse than A-optimal, and sometimes even worse than the other alternatives; this is likely because the Kullback-Leibler distance tends to under-penalize estimation errors. Finally, the performance benefit of A-optimal is largest when the number of monitored paths is close to one to two thirds of the rank of the routing matrix. When the number of monitored paths is too small, there is not enough information to accurately reconstruct the global view of network performance, regardless of the design scheme; at the other extreme, when the number of monitored paths is close to the rank, any set of independent rows of the routing matrix provides sufficient information for accurate reconstruction of end-to-end path properties.

Differentiated Design: Inference Error on Preferred Paths
Lower error on the paths with higher weights.

Speaker notes: The figure shows the inference error on the preferred paths when we monitor 200 paths (5% of paths) in the PlanetLab-RTT topology and vary the number of preferred paths from 20 to 160. The inference error on the preferred paths decreases with increasing weight; when the weight is 4 or higher, the inference error is close to 0. This is because when the weight is high enough, the performance of many preferred paths is either directly monitored or exactly reconstructed from the monitored paths.

Differentiated Design: Inference Error on the Remaining Paths
Error on the remaining paths increases slightly.

Speaker notes: The figure shows the inference error on the remaining paths when we monitor 200 paths (5% of paths) in the PlanetLab-RTT topology and vary the number of preferred paths from 20 to 160. The inference error on the remaining paths increases slightly, since as we pay more attention to the preferred paths, the remaining paths are monitored less extensively.

Augmented Design
A-optimal is most effective in augmenting an existing design.

Speaker notes: Next we consider augmented design for supporting continuous monitoring. Our evaluation is based on the following scenario: suppose we identify a set of paths to monitor and some of them fail to provide measurement data (e.g., due to software or hardware failures at monitor sites or on their incoming/outgoing links); we then need to identify additional measurements to conduct, given that we have already obtained results from the unfailed paths. In our evaluation, we first use A-optimal to identify 100 paths to monitor in PlanetLab-RTT, then vary the number of failed paths from 10 to 80 and apply different augmented design algorithms to determine the additional paths to monitor. The figure shows the average inference error under the different schemes. A-optimal yields the lowest error, and its inference error stays similar across the range of failed paths, whereas the inference error of the other schemes tends to increase with the number of failed paths. This suggests that A-optimal design is the most effective in augmenting existing designs.

Multi-user Design
A-optimal yields the lowest error.

Speaker notes: Next we compare the performance of the different design schemes, each under its best design mode; they all use joint inference. A-optimal yields the lowest error, and the improvement is most significant when the number of monitored paths is small.

Summary
Our contributions:
- Bring Bayesian experimental design to network measurement and diagnosis.
- Develop a flexible framework to accommodate different design requirements.
- Experimentally show its effectiveness.
Future work:
- Make measurement design fault tolerant.
- Apply our technique to other diagnosis problems.
- Extend our framework to incorporate additional design constraints.

Speaker notes: Last but not least, we are interested in developing applications that make use of our toolkit. Some applications we have in mind include fault diagnosis, anomaly detection, and a knowledge plane for physical, overlay, and virtualized networks.

Thank you!