Toward Optimal Network Fault Correction via End-to-End Inference
Patrick P. C. Lee, Vishal Misra, Dan Rubenstein
Distributed Network Analysis (DNA) Lab, Columbia University
May 9, 2007

Outline
- Motivation
- Framework for end-to-end inference
- Inference algorithm
- Performance evaluation
- Conclusions

Motivation
Goal: correct (diagnose and repair) data-path failures in a system where only end-to-end information is available and link-level probing is unreliable.
Example: overlays across externally managed nodes.
[Figure: a server sends a data stream; some receivers get the data ("OK!"), others do not ("No data?")]

Problem
What should an administrator do if some paths fail to deliver data?
What the administrator knows:
- some nodes on the faulty paths must have failed
What the administrator doesn't know:
- which nodes on the paths failed
- how many nodes on the paths failed
- why the nodes failed
Solution: check the nodes that may have failed, via a series of sanity tests, and repair those that did.

Constraints
Checking and repairing a node incurs a cost, e.g., wages and man-hours of support staff, or the cost of test equipment.
This cost can vary widely from node to node, e.g., service providers may charge different amounts for checking their nodes.

Objective
Assume each node i has, known a priori:
- failure probability $p_i$: the likelihood that node i has failed
- checking cost $c_i$: the cost needed to perform sanity tests on node i
Objective: minimize the expected total checking cost of correcting (i.e., diagnosing and repairing) all faulty nodes:
minimize $\sum_i c_i \Pr(\text{node } i \text{ is actually checked})$ over all sequences of nodes to be checked

End-to-End Inference
End-to-end inference approach for correcting data-path failures:
1. Input: the network topology.
2. Monitor the paths.
3. Do bad paths exist? If no, done.
4. If yes, select the nodes to check.
5. Check the selected nodes.
6. Repair the identified bad nodes, then monitor the paths again.
Open question: how to select the nodes to check?

How to Select Nodes to Check?
Suppose that we check one node at a time.
Most-Likely Fault (MLF) approach: first check the most likely faulty node, i.e., the node with the highest conditional failure probability given that some paths failed to deliver data.
Does the MLF approach necessarily minimize the expected total checking cost?

Example: Why Is the MLF Scheme Not Optimal?
No, the MLF scheme is not optimal in general.
Two data paths are given, and both failed to deliver data. The nodes have different failure probabilities but the same checking cost. The conditional failure probabilities can be determined accordingly:

Node   Failure prob.   Conditional failure prob.
1      0.45            0.616
2      0.3             0.411
3      0.6             0.663
4      0.5             0.579
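
The conditional failure probabilities in this example can be reproduced by enumerating all node states. The following is a minimal sketch, not the authors' code. It assumes that nodes fail independently and, because the figure is missing from the transcript, it assumes a topology in which nodes 1 and 2 lie on both paths, node 3 lies only on the first path, and node 4 lies only on the second. The function name `conditional_failure_probs` is hypothetical.

```python
from itertools import product

# Prior failure probabilities from the example.
p = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}
# ASSUMED topology (figure not in the transcript): nodes 1 and 2 are shared,
# node 3 is only on the first path, node 4 only on the second.
paths = [(1, 2, 3), (1, 2, 4)]

def conditional_failure_probs(p, paths):
    """Pr(node i is bad | every path is bad), assuming independent failures."""
    nodes = sorted(p)
    pr_all_bad = 0.0
    pr_node_and_all_bad = {i: 0.0 for i in nodes}
    # Enumerate every combination of good/bad node states.
    for states in product([False, True], repeat=len(nodes)):
        bad = dict(zip(nodes, states))
        weight = 1.0
        for i in nodes:
            weight *= p[i] if bad[i] else 1.0 - p[i]
        # Keep only the states in which every path contains a bad node.
        if all(any(bad[i] for i in path) for path in paths):
            pr_all_bad += weight
            for i in nodes:
                if bad[i]:
                    pr_node_and_all_bad[i] += weight
    return {i: pr_node_and_all_bad[i] / pr_all_bad for i in nodes}

cond = conditional_failure_probs(p, paths)
for i, v in cond.items():
    print(i, round(v, 3))
```

Under the assumed topology this gives 0.616, 0.411, 0.663 and 0.579 for nodes 1 to 4, which matches the values on the slide and makes node 3 the MLF choice.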

Example: Why Is the MLF Scheme Not Optimal?
Findings: node 3 has the highest conditional failure probability (0.663). However, a brute-force search shows that checking node 1 first is optimal, even though all nodes have the same checking cost.
Intuition: node 3 affects only one path, but node 1 affects both paths. We may repair both paths by checking only node 1.
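
The brute-force comparison can be sketched as a small search over adaptive checking policies. This is an illustration under stated assumptions, not the authors' procedure: nodes fail independently, a checked node is good afterwards (repaired if it was bad), all paths are re-monitored after every check, and the topology is assumed as before (nodes 1 and 2 on both paths, node 3 on the first, node 4 on the second). The helper names `observe`, `prior`, `check_then_optimal` and `optimal` are hypothetical.

```python
from itertools import combinations

p = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}
cost = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}   # same checking cost for every node
paths = ((1, 2, 3), (1, 2, 4))            # ASSUMED topology

def observe(bad):
    """Which paths are bad, given the set of currently bad nodes."""
    return tuple(any(i in bad for i in path) for path in paths)

def prior(bad):
    """Prior probability of exactly this set of bad nodes."""
    w = 1.0
    for i in p:
        w *= p[i] if i in bad else 1.0 - p[i]
    return w

def check_then_optimal(belief, i):
    """Unnormalized expected cost of checking node i now, then acting optimally.

    belief maps each still-possible set of bad nodes to its probability weight.
    """
    groups = {}
    for bad, w in belief.items():
        after = bad - {i}                  # node i is good after the check
        key = (i in bad, observe(after))   # what the administrator learns
        group = groups.setdefault(key, {})
        group[after] = group.get(after, 0.0) + w
    return cost[i] * sum(belief.values()) + sum(optimal(g) for g in groups.values())

def optimal(belief):
    """Unnormalized minimum expected cost to correct all remaining faults."""
    suspects = set().union(*belief)        # nodes bad in at least one world
    if not suspects:
        return 0.0
    return min(check_then_optimal(belief, i) for i in suspects)

worlds = [frozenset(s) for r in range(len(p) + 1) for s in combinations(p, r)]
belief = {b: prior(b) for b in worlds if all(observe(b))}   # both paths bad
total = sum(belief.values())
first_cost = {i: check_then_optimal(belief, i) / total for i in p}
for i, v in first_cost.items():
    print(i, round(v, 3))
```

Under these assumptions the sketch gives an expected total cost of about 2.817 when node 1 is checked first and about 2.832 when node 3 is checked first, which is consistent with the finding that checking node 1 first is better than following the MLF choice.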

Our Contributions
- Propose an end-to-end inference approach for correcting all data-path failures.
- Identify a set of candidate nodes, and prove that one of them must be checked first in order to minimize the expected total checking cost.
- Show via simulation that our inference approach has a smaller expected cost than the prior MLF-based approaches [Katzela and Schwartz, 1995; Kandula et al., 2005; Steinder and Sethi, 2004].

Topologies
Topologies that we consider: a tree, and multiple trees.
We prove optimality results for a tree, and propose heuristics for multiple trees.

Finding Good/Bad Paths
Each data path is either:
- Good: the data path has no faulty node and can deliver data
- Bad: the data path has at least one faulty node and cannot deliver data
Assumption: each node has the same data-forwarding behavior across all paths on which it lies. This implies that if a node lies on at least one good path, it is a non-faulty (good) node.

Forming a Bad Tree
Monitor data streams from the root node 1 to each of the leaf nodes 6, 7, 8, 9.
Keep only the bad paths, and remove any nodes that are known to be good.
Bad tree: a tree in which every path is a bad path.
[Figure: the monitored tree on nodes 1 to 9 with good and bad paths marked, and the resulting bad tree on nodes 3, 5, 6, 8, 9]
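
The pruning step can be sketched in a few lines. The tree shape and the monitoring result below are assumptions read off the flattened figure (1 -> 2 -> {3, 4}, 3 -> {5, 6}, 4 -> {7}, 5 -> {8, 9}, with only the stream to leaf 7 arriving); the function names `path_to_root` and `bad_tree` are hypothetical.

```python
# ASSUMED topology, child -> parent, with node 1 as the root.
parent = {2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 5, 9: 5}
# ASSUMED monitoring result: True means the stream to that leaf was delivered.
leaf_ok = {6: False, 7: True, 8: False, 9: False}

def path_to_root(leaf):
    """Nodes on the path from the root down to the given leaf."""
    path = [leaf]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def bad_tree(leaf_ok):
    """Return the bad paths with all known-good nodes removed."""
    # Every node on a good path must itself be good.
    known_good = set()
    for leaf, ok in leaf_ok.items():
        if ok:
            known_good.update(path_to_root(leaf))
    # Keep only bad paths, stripped of the known-good nodes.
    return [
        [n for n in path_to_root(leaf) if n not in known_good]
        for leaf, ok in leaf_ok.items()
        if not ok
    ]

print(bad_tree(leaf_ok))  # [[3, 6], [3, 5, 8], [3, 5, 9]]
```

With these inputs, nodes 1, 2, 4 and 7 are removed as known good, and the remaining bad tree consists of nodes 3, 5, 6, 8 and 9.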

Inference Algorithm
Our inference algorithm selects which nodes to check. Each node i is associated with a potential function:
$\Phi(i) = \dfrac{\Pr(T \mid X_i, A_i)\, p_i}{c_i\,(1 - p_i)}$
where
- $p_i$ = failure probability of node i
- $c_i$ = checking cost of node i
- $\Pr(T \mid X_i, A_i)$ = conditional probability of having a bad tree
- $T$ = the event that the tree is a bad tree
- $X_i$ = the event that node i is bad
- $A_i$ = the event that the ancestors of node i are good
Intuitively, we should first check a node with high $p_i$ and small $c_i$, i.e., a node with high potential.

Inference Algorithm
Candidate node: on each bad path, one node has the highest potential. We call this node a candidate node.
Example of identifying candidate nodes (bad tree on nodes 3, 5, 6, 8, 9):

Bad path   Candidate node
3-5-8      5
3-5-9      5
3-6        3

Main theorem: to minimize the expected total checking cost of correcting all faulty nodes for a given bad tree, we must check a candidate node first.
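
A minimal sketch of computing potentials and candidate nodes on the bad tree from the example follows. The slides give no failure probabilities or checking costs for this tree, so the values below are hypothetical and were picked so that the result matches the candidates listed above. The way $\Pr(T \mid X_i, A_i)$ is evaluated (a product over the subtrees hanging off the good ancestors of node i) is a derivation that assumes independent node failures; it is not taken from the slides. All function names are hypothetical.

```python
# Bad tree from the example: 3 -> {5, 6}, 5 -> {8, 9}.
children = {3: [5, 6], 5: [8, 9], 6: [], 8: [], 9: []}
parent = {5: 3, 6: 3, 8: 5, 9: 5}
bad_paths = [(3, 5, 8), (3, 5, 9), (3, 6)]

# HYPOTHETICAL failure probabilities and checking costs (not from the slides).
p = {3: 0.1, 5: 0.5, 6: 0.2, 8: 0.2, 9: 0.2}
c = {3: 1.0, 5: 1.0, 6: 2.0, 8: 1.0, 9: 1.0}

def all_paths_bad(v):
    """Pr(every path from v down to a leaf contains a bad node)."""
    if not children[v]:
        return p[v]
    below = 1.0
    for ch in children[v]:
        below *= all_paths_bad(ch)
    return p[v] + (1.0 - p[v]) * below

def potential(i):
    """Phi(i) = Pr(T | X_i, A_i) * p_i / (c_i * (1 - p_i))."""
    # Paths through i are bad because i is bad. Since the ancestors of i are
    # good, each subtree branching off an ancestor must be bad on its own.
    pr_tree = 1.0
    node = i
    while node in parent:
        for sibling in children[parent[node]]:
            if sibling != node:
                pr_tree *= all_paths_bad(sibling)
        node = parent[node]
    return pr_tree * p[i] / (c[i] * (1.0 - p[i]))

def candidates(bad_paths):
    """Map each bad path to its node with the highest potential."""
    return {tuple(path): max(path, key=potential) for path in bad_paths}

print(candidates(bad_paths))
```

With the hypothetical values above, node 5 is the candidate on paths 3-5-8 and 3-5-9 and node 3 is the candidate on path 3-6. Different values of $p_i$ and $c_i$ would generally produce different candidates.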

Inference Algorithm
For some special cases, we know which candidate node should be checked first to minimize the expected cost. Examples of the special cases:
- A path: check the node with the highest $\dfrac{p_i}{c_i\,(1 - p_i)}$ first.
- A tree in which all nodes have the same (fixed) failure probability and the same (fixed) checking cost: check the root node first.
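
The path case can be checked numerically on a small instance. The sketch below uses hypothetical probabilities and costs and assumes independent failures and re-monitoring of the path after every check, so a node is checked only if the path is still bad when its turn comes. It compares the ordering by $p_i / (c_i (1 - p_i))$ with an exhaustive search over all checking orders.

```python
from itertools import permutations
from math import prod

# HYPOTHETICAL single bad path with three nodes.
p = {1: 0.1, 2: 0.3, 3: 0.2}      # failure probabilities
cost = {1: 1.0, 2: 4.0, 3: 1.5}   # checking costs

def expected_cost(order):
    """Expected total checking cost of a fixed order, given the path is bad."""
    pr_path_bad = 1.0 - prod(1.0 - p[i] for i in order)
    total = 0.0
    for k, node in enumerate(order):
        # The k-th node is checked only if the path is still bad, i.e. at
        # least one of the nodes not yet checked is bad.
        pr_still_bad = 1.0 - prod(1.0 - p[j] for j in order[k:])
        total += cost[node] * pr_still_bad
    return total / pr_path_bad

by_ratio = tuple(
    sorted(p, key=lambda i: p[i] / (cost[i] * (1.0 - p[i])), reverse=True)
)
brute = min(permutations(p), key=expected_cost)
print(by_ratio, brute)
```

For these values both approaches select the order (3, 1, 2). This is one instance, not a proof of the general result.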

Inference Algorithm
For general cases, we don't know which candidate node should be checked first to minimize the expected cost; e.g., it is not necessarily the candidate node with the highest potential.
Heuristics:
- Sequential strategy: check the candidate node with the highest potential.
- Parallel strategy: simultaneously check multiple candidate nodes that together cover all bad paths.

Highlights of Experiments
Setup:
- Use BRITE to create 200 random experimental networks, each of which has 200 routers.
- Assign each node a failure probability and a checking cost.
- Focus on multi-tree topologies, in which each tree is a shortest-path tree rooted at a randomly selected router.
Metric: expected total checking cost to diagnose and repair all faulty nodes.
Heuristics compared:
- Candidate-based heuristics: check the candidate nodes first.
- MLF-based heuristics: check the most likely faulty nodes first.

Highlights of Experiments
Random failure probability, fixed checking cost: $p_i \sim U(0, 0.2)$, $c_i = 1$.
Result: both heuristics have almost the same expected total checking cost.

Highlights of Experiments
Random failure probability, random checking cost: $p_i \sim U(0, 0.2)$, $c_i \sim U(0, 1)$.
Result: checking the candidate nodes first decreases the expected total checking cost by ~10%.

Highlights of Experiments
Fixed failure probability, random checking cost: $p_i = 0.1$, $c_i \sim U(0, 1)$.
Result: checking the candidate nodes first decreases the expected total checking cost by ~20%.

Conclusions
- Presented optimality results for diagnosing and repairing all data-path failures, with the objective of minimizing the expected total checking cost.
- Constructed a potential function to identify candidate nodes, one of which must be checked first to minimize the expected total checking cost.
- Showed via evaluation that checking candidate nodes first can reduce the expected total checking cost by up to 20% compared to checking the most likely faulty nodes first.
