Toward Optimal Network Fault Correction via End-to-End Inference Patrick P. C. Lee, Vishal Misra, Dan Rubenstein Distributed Network Analysis (DNA) Lab.

Presentation transcript:

Toward Optimal Network Fault Correction via End-to-End Inference Patrick P. C. Lee, Vishal Misra, Dan Rubenstein Distributed Network Analysis (DNA) Lab Columbia University May 9, 2007

Outline: Motivation; Framework for end-to-end inference; Inference algorithm; Performance evaluation; Conclusions.

Motivation: Goal: correct (diagnose and repair) data-path failures in a system where only end-to-end information is available and link-level probing is unreliable. Example: overlays across externally managed nodes. [Figure labels: data stream, server, "OK!", "No data?"]

Problem: What should an administrator do if some paths fail to deliver data? What the administrator knows: some nodes on the faulty paths must have failed. What the administrator doesn’t know: which nodes on the paths failed; how many nodes on the paths failed; the reasons the nodes failed. Solution: check, via a series of sanity tests, the nodes that potentially failed, and repair those that did.

Constraints: Checking and repairing a node incurs a cost, e.g., wages and man-hours of support staff, or the cost of test equipment. Such a cost can vary widely, e.g., service providers may charge different amounts for checking nodes.

Objective: Assume each node i has an a priori known failure probability $p_i$ (the likelihood that node i has failed) and checking cost $c_i$ (the cost needed to perform sanity tests on node i). Objective: minimize the expected total checking cost of correcting (i.e., diagnosing and repairing) all faulty nodes, that is, minimize $\sum_i c_i \Pr(\text{node } i \text{ is actually checked})$ over all sequences of nodes to be checked.

End-to-End Inference: End-to-end inference approach for correcting data-path failures. Input: network topology. Flow: monitor paths; if no bad paths exist, done; if bad paths exist, select the nodes to check, check those nodes, repair the identified bad nodes, and monitor the paths again. Open question: how to select nodes to check?
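The monitor/select/check/repair flow above can be sketched as a small simulation. This is only an illustrative sketch under stated assumptions: the function names (`correct_faults`, `first_suspect`), the toy topology, and the unit costs are hypothetical, and the placeholder policy is not the selection rule proposed in the talk.

```python
def correct_faults(paths, faulty, cost, select_nodes):
    """Simulate the loop until every path is good; return the total checking cost.

    paths        -- list of paths, each a list of node ids (hypothetical input)
    faulty       -- truly faulty nodes; the selection policy never sees this set
    cost         -- dict: node id -> checking cost
    select_nodes -- policy: list of suspect lists (one per bad path) -> nodes to check
    """
    faulty = set(faulty)
    cleared = set()      # nodes already checked (and repaired if they were bad)
    total_cost = 0.0
    while True:
        # Monitor: a path is bad if it contains at least one faulty node.
        bad_paths = [p for p in paths if any(n in faulty for n in p)]
        if not bad_paths:
            return total_cost
        # Nodes lying on a good path are inferred good without being checked.
        good = {n for p in paths if not any(m in faulty for m in p) for n in p}
        suspects = [[n for n in p if n not in good and n not in cleared]
                    for p in bad_paths]
        for node in select_nodes(suspects):
            total_cost += cost[node]   # checking always incurs the cost
            cleared.add(node)
            faulty.discard(node)       # repair the node if it was bad


def first_suspect(suspects):
    """Placeholder policy (an assumption): check the first suspect of the first bad path."""
    return [suspects[0][0]]


paths = [[1, 2, 4], [1, 3, 5]]
unit_cost = {n: 1.0 for n in range(1, 6)}
print(correct_faults(paths, {2, 3}, unit_cost, first_suspect))
```

Swapping `first_suspect` for another policy changes only which nodes are paid for, which is exactly the degree of freedom the objective optimizes over.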

How to Select Nodes to Check? Suppose that we check one node at a time. Most-Likely Fault (MLF) approach: first check the most likely faulty node, i.e., the node with the highest conditional failure probability given that some paths failed to deliver data. Does the MLF approach necessarily minimize the expected total checking cost?

Example – Why Is the MLF Scheme Not Optimal? No, the MLF scheme is not optimal in general. Two data paths are given, and both failed to deliver data. The nodes have different failure probabilities but the same checking cost. The conditional failure probabilities can be determined accordingly. [Table: Node | Conditional failure prob.]

Example – Why Is the MLF Scheme Not Optimal? Findings: Node 3 has the highest conditional failure probability. However, a brute-force search shows that checking node 1 first is optimal, even though all nodes have the same checking cost. Intuition: node 3 affects only one path, but node 1 affects both paths, so we may repair both paths by checking only node 1. [Table: Node | Conditional failure prob.]
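The conditional failure probabilities used by MLF can be computed by enumeration on small instances. The sketch below assumes nodes fail independently with their prior probabilities and that a path is bad exactly when it contains a faulty node. The two-path topology and the probabilities are hypothetical and do not reproduce the slide's table, whose values were lost.

```python
from itertools import product


def conditional_failure_probs(paths, p):
    """Return Pr(node is faulty | every path in `paths` is bad) for each node.

    paths -- list of paths, each a list of node ids, all observed to be bad
    p     -- dict: node id -> prior failure probability (independent failures assumed)
    """
    nodes = sorted(p)
    pr_all_bad = 0.0
    joint = {n: 0.0 for n in nodes}
    # Enumerate every good/bad assignment; only feasible for small examples.
    for state in product([False, True], repeat=len(nodes)):
        bad = dict(zip(nodes, state))
        if not all(any(bad[n] for n in path) for path in paths):
            continue  # this assignment contradicts the observed bad paths
        weight = 1.0
        for n in nodes:
            weight *= p[n] if bad[n] else 1.0 - p[n]
        pr_all_bad += weight
        for n in nodes:
            if bad[n]:
                joint[n] += weight
    return {n: joint[n] / pr_all_bad for n in nodes}


# Hypothetical example: node 1 is shared by both bad paths.
probs = conditional_failure_probs([[1, 2], [1, 3]], {1: 0.1, 2: 0.5, 3: 0.9})
for node, value in probs.items():
    print(node, round(value, 4))
```

With these made-up priors the shared node ends up with the lowest conditional probability, so MLF would check it last, even though it is the only node that can explain both failures on its own.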

Our Contributions: (1) Propose an end-to-end inference approach for correcting all data-path failures. (2) Identify a set of candidate nodes, and prove that one of them must be checked first in order to minimize the expected total checking cost. (3) Show via simulation that our inference approach has a smaller expected cost than the prior MLF-based approaches [Katzela and Schwartz, 1995; Kandula et al., 2005; Steinder and Sethi, 2004].

Topologies: Topologies that we consider: a tree, and multiple trees. We prove optimality results for a tree, and propose heuristics for multiple trees.

Finding Good/Bad Paths: Each data path is either Good, if it has no faulty node and can deliver data, or Bad, if it has at least one faulty node and cannot deliver data. Assumption: each node has the same data-forwarding behavior across all paths on which it lies. This implies that if a node lies on at least one good path, it is a non-faulty (good) node.

Forming a Bad Tree: Monitor data streams from the root node 1 to each of the leaf nodes 6, 7, 8, ... Bad tree: a tree in which every path is a bad path. Keep only bad paths, and remove any nodes that are known to be good. [Figure legend: bad path, good path]
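The pruning step can be sketched as follows. The function name `form_bad_tree`, the path representation (root-to-leaf lists), and the example tree are assumptions made for illustration; only the rule itself (keep bad paths, drop nodes known to be good) comes from the slide.

```python
def form_bad_tree(paths, is_good):
    """Keep only bad paths and drop nodes that lie on at least one good path.

    paths   -- root-to-leaf paths, each a list of node ids
    is_good -- parallel list of booleans: True if that path delivered data
    Returns (bad_paths, good_nodes), with good nodes removed from each bad path.
    """
    good_nodes = set()
    for path, ok in zip(paths, is_good):
        if ok:
            good_nodes.update(path)   # every node on a good path is good
    bad_paths = [[n for n in path if n not in good_nodes]
                 for path, ok in zip(paths, is_good) if not ok]
    return bad_paths, good_nodes


# Hypothetical tree rooted at node 1 with leaves 6, 7 and 8.
paths = [[1, 2, 4, 6], [1, 2, 5, 7], [1, 3, 8]]
bad_paths, good_nodes = form_bad_tree(paths, [False, False, True])
print(bad_paths)    # remaining suspects per bad path
print(good_nodes)   # nodes cleared by the good path
```

In this example the good path to leaf 8 clears the root, so the remaining bad tree is rooted at node 2.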

Inference Algorithm: Our inference algorithm selects which nodes to check. Each node i is associated with a potential function $\Phi(i) = \frac{\Pr(T \mid X_i, A_i)\, p_i}{c_i (1 - p_i)}$, where $p_i$ = failure probability of node i; $c_i$ = checking cost of node i; $\Pr(T \mid X_i, A_i)$ = conditional probability of having a bad tree; $T$ = the event that the tree is a bad tree; $X_i$ = the event that node i is bad; $A_i$ = the event that the ancestors of node i are good. Intuitively, we should first check the node with high $p_i$ and small $c_i$, i.e., the node with high potential.
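For small bad trees the potential can be evaluated by brute force. The sketch below reads the potential as $\Pr(T \mid X_i, A_i)\, p_i / (c_i (1 - p_i))$ and assumes independent node failures; the function name, the path-list representation of the tree, and the example values are hypothetical.

```python
from itertools import product


def potential(i, paths, p, c):
    """Brute-force potential of node i in a bad tree given as root-to-leaf paths.

    Pr(T | X_i, A_i) is computed by fixing node i as bad and its ancestors as
    good, then summing the probability of every assignment of the remaining
    nodes under which all paths are bad. Requires p[i] < 1 and c[i] > 0.
    """
    ancestors = set()
    for path in paths:
        if i in path:
            ancestors.update(path[:path.index(i)])
    all_nodes = {n for path in paths for n in path}
    free = sorted(all_nodes - ancestors - {i})
    pr_bad_tree = 0.0
    for state in product([False, True], repeat=len(free)):
        bad = dict(zip(free, state))
        bad[i] = True                      # event X_i
        for a in ancestors:
            bad[a] = False                 # event A_i
        if all(any(bad[n] for n in path) for path in paths):
            weight = 1.0
            for n in free:
                weight *= p[n] if bad[n] else 1.0 - p[n]
            pr_bad_tree += weight
    return pr_bad_tree * p[i] / (c[i] * (1.0 - p[i]))


# Hypothetical bad tree: root 1 with two leaf children 2 and 3.
paths = [[1, 2], [1, 3]]
p = {1: 0.1, 2: 0.1, 3: 0.1}
c = {1: 1.0, 2: 1.0, 3: 1.0}
for node in (1, 2, 3):
    print(node, potential(node, paths, p, c))
```

The enumeration is exponential in the number of nodes, so it is only meant to make the definition concrete, not to scale.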

Inference Algorithm: Candidate node: on each bad path, one node has the highest potential. We call this node a candidate node. Main theorem: to minimize the expected total checking cost of correcting all faulty nodes for a given bad tree, we must check a candidate node first. [Figure: example of identifying candidate nodes; legend: bad path, candidate node]
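Given potentials for all nodes, identifying the candidate nodes is a per-path maximum. The sketch below assumes the potentials are already available in a dictionary; the function name and the numeric values are hypothetical, and ties are broken in favor of the node nearest the start of the path, which the slides do not specify.

```python
def candidate_nodes(bad_paths, phi):
    """Return the highest-potential node of each bad path, plus the distinct set.

    bad_paths -- list of bad paths, each a list of node ids
    phi       -- dict: node id -> potential (hypothetical values here)
    """
    per_path = [max(path, key=lambda n: phi[n]) for path in bad_paths]
    return per_path, set(per_path)


# Hypothetical potentials for a bad tree with two bad paths sharing node 1.
bad_paths = [[1, 2, 4], [1, 3, 5]]
phi = {1: 0.11, 2: 0.30, 3: 0.05, 4: 0.02, 5: 0.08}
per_path, candidates = candidate_nodes(bad_paths, phi)
print(per_path, candidates)
```

The main theorem then says the first node to check must come from the returned set; it does not say which member of the set to pick.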

Inference Algorithm: For some special cases, we know which candidate node should be checked first to minimize the expected cost. Examples of the special cases: (1) A path: check the node with the highest $\frac{p_i}{c_i (1 - p_i)}$ first. (2) A tree in which nodes have a fixed failure probability and a fixed checking cost: check the root node first.
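For the single-path special case, the rule reduces to ranking nodes by $p_i / (c_i (1 - p_i))$. A minimal sketch follows; the function names and the values are hypothetical, and sorting the whole path by this ratio (rather than only picking the first node) is one natural reading of applying the rule repeatedly, not something the slide states.

```python
def path_priority(p_i, c_i):
    """Ratio p_i / (c_i * (1 - p_i)); requires 0 <= p_i < 1 and c_i > 0."""
    return p_i / (c_i * (1.0 - p_i))


def first_node_on_path(path, p, c):
    """Node to check first on a single bad path under the ratio rule."""
    return max(path, key=lambda n: path_priority(p[n], c[n]))


def checking_order(path, p, c):
    """Assumed extension: order all nodes on the path by decreasing ratio."""
    return sorted(path, key=lambda n: path_priority(p[n], c[n]), reverse=True)


# Hypothetical path: node 3 is unlikely to fail but very cheap to check.
p = {1: 0.1, 2: 0.2, 3: 0.05}
c = {1: 1.0, 2: 4.0, 3: 0.2}
print(first_node_on_path([1, 2, 3], p, c))
print(checking_order([1, 2, 3], p, c))
```

The example shows how a cheap check can outrank a more likely fault, which is the behavior the checking cost in the denominator is meant to capture.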

Inference Algorithm: For general cases, we don’t know which candidate node should be checked first to minimize the expected cost; e.g., it is not necessarily the candidate node with the highest potential. Heuristics: Sequential strategy: check the candidate node with the highest potential. Parallel strategy: simultaneously check multiple candidate nodes that cover all bad paths.

Highlights of Experiments: Setup: use BRITE to create 200 random experimental networks, each with 200 routers; assign each node a failure probability and a checking cost; focus on multi-tree topologies, in which each tree is a shortest-path tree rooted at a randomly selected router. Metric: expected total checking cost to diagnose and repair all faulty nodes. Heuristics compared: candidate-based heuristics, which check the candidate nodes first; MLF-based heuristics, which check the most likely faulty nodes first.

Highlights of Experiments: Random failure probability, fixed checking cost: $p_i \sim U(0, 0.2)$, $c_i = 1$. Result: both heuristics have almost the same expected total checking cost.

Highlights of Experiments: Random failure probability, random checking cost: $p_i \sim U(0, 0.2)$, $c_i \sim U(0, 1)$. Result: checking the candidate nodes first decreases the expected total checking cost by ~10%.

Highlights of Experiments: Fixed failure probability, random checking cost: $p_i = 0.1$, $c_i \sim U(0, 1)$. Result: checking the candidate nodes first decreases the expected total checking cost by ~20%.

Conclusions: Presented optimality results for diagnosing and repairing all data-path failures, with the objective of minimizing the expected total checking cost. Constructed a potential function to identify candidate nodes, one of which must be checked first to minimize the expected total checking cost. Showed via evaluation that checking candidate nodes first can reduce the checking cost by up to 20% compared to checking the most likely faulty nodes first.