Troubleshooting Mesh Networks Lili Qiu Joint Work with Victor Bahl, Ananth Rao, Lidong Zhou Microsoft Research Mesh Networking Summit 2004
2 Motivation Internet Why is it so slow? Cordless phone interference? Neighbors drop traffic? MAC misbehavior? Too much user traffic? Routing problems? TCP problems? …
3 Research Challenges Just knowing link statistics is insufficient Complicated interactions –Between different network elements –Between different network protocols –Between different faults –Signature-based schemes may not capture all the interactions Need to apply to a wide range of networks Multi-hop wireless networks –Unpredictable physical medium and dynamic topology –Limited resources –Scale to hundreds of nodes
4 Our Approach Framework: online trace-driven simulation -Create a real network inside a simulator -Identify root cause by searching for the faults that reproduce the same faulty symptom Advantages –Applicable to a large class of networks –Capture complicated interactions –Extensible to diagnose new faults –Facilitate what-if analysis
5 Troubleshooting Framework Data Collection Data Cleaning Fault Diagnosis Trace-Driven Simulation Raw Data Root Causes Measured Performance Routes Link Loads Candidate Faults Simulated Performance Root cause analysis module
6 Common Concerns and Our Approaches for Simulation-Based Diagnosis 1. Simulation accuracy - Trace-driven simulation - Remove erroneous data from the trace 2. Too expensive to simulate - Advances in network simulators - Focus on long-term faults - Compression, spatial scoping, adaptive monitoring, multicast 3. Too large fault space - Develop an efficient search heuristic
7 Simulator Accuracy: Good RF
8 Simulator Accuracy: Poor RF # walls, Loss rate, Measured thruput, Simulated thruput: 411% % %
9 Data Gathering What data to collect? –Network topology –Traffic statistics –Physical medium –Link performance Data sources: SNMP, WRAPI, Packet sniffers, NativeWiFi Dealing with Imperfect Data –Neighbor monitoring –Using history information –Find the smallest number of misbehaving nodes to explain inconsistency in traffic reports
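The last bullet above (explain inconsistent traffic reports with as few misbehaving nodes as possible) can be sketched as a small graph problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the report format, the `tolerance` parameter, the function names, and the greedy highest-degree heuristic are all hypothetical choices made for this sketch.

```python
from collections import defaultdict


def inconsistent_pairs(sent_reports, recv_reports, tolerance=0.1):
    """Return node pairs whose two reports about the same link disagree.

    sent_reports[(a, b)]: packets a claims it sent to b (assumed format).
    recv_reports[(a, b)]: packets b claims it received from a (assumed format).
    """
    pairs = set()
    for link in set(sent_reports) | set(recv_reports):
        s = sent_reports.get(link, 0)
        r = recv_reports.get(link, 0)
        # Flag the link when the reports differ by more than the tolerance.
        if abs(s - r) > tolerance * max(s, r, 1):
            pairs.add(tuple(sorted(link)))
    return pairs


def smallest_suspect_set(pairs):
    """Greedy heuristic: repeatedly blame the node involved in the most
    unexplained inconsistencies until every inconsistency is explained."""
    remaining = set(pairs)
    suspects = []
    while remaining:
        degree = defaultdict(int)
        for a, b in remaining:
            degree[a] += 1
            degree[b] += 1
        # sorted() makes tie-breaking deterministic.
        worst = max(sorted(degree), key=lambda n: degree[n])
        suspects.append(worst)
        remaining = {edge for edge in remaining if worst not in edge}
    return suspects


sent = {("A", "B"): 100, ("A", "C"): 100, ("B", "C"): 50}
recv = {("A", "B"): 40, ("A", "C"): 30, ("B", "C"): 50}
suspects = smallest_suspect_set(inconsistent_pairs(sent, recv))
```

Here node A is involved in both inconsistent links, so blaming A alone explains every disagreement. The greedy choice is a heuristic and does not guarantee the true minimum on every graph.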
10 Root Cause Analysis
11 Fault Diagnosis Algorithm
Challenge
–Large fault space: brute-force search is infeasible
1. Initialization: diagnosed fault set F = { }
2. while (diff(MeasuredPerf, SimulatedPerf(F)) > threshold) {
     Foreach f in F
       Adjust f’s magnitude if necessary
       Delete f if its magnitude is too small
     Add a new candidate fault if necessary
     Simulate
   }
3. Report F
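The loop above can be made concrete with a toy example. This is a sketch under stated assumptions rather than the authors' code: `toy_simulate` is a hypothetical stand-in for the trace-driven simulator, faults are modeled only as per-node drop rates, and the fixed `step` used to adjust magnitudes is an illustrative choice.

```python
def diff(measured, simulated):
    """Total absolute gap between measured and simulated performance."""
    return sum(abs(measured[k] - simulated[k]) for k in measured)


def diagnose(measured, simulate, candidates, threshold=0.05,
             step=0.1, min_magnitude=0.01, max_rounds=20):
    """Iteratively search for a fault set F whose simulation matches the
    measurements. F maps a fault to its magnitude."""
    faults = {}
    for _ in range(max_rounds):
        current = diff(measured, simulate(faults))
        if current <= threshold:
            break
        # Adjust each diagnosed fault's magnitude; ties keep the current value.
        for f in list(faults):
            m = faults[f]
            options = (m, round(max(0.0, m - step), 6), round(m + step, 6))
            best = min(options,
                       key=lambda x: diff(measured, simulate({**faults, f: x})))
            if best < min_magnitude:
                del faults[f]          # magnitude too small: drop the fault
            else:
                faults[f] = best
        # Add the new candidate that most reduces the gap, if any does.
        current = diff(measured, simulate(faults))
        unused = [c for c in candidates if c not in faults]
        if unused:
            best_c = min(unused,
                         key=lambda c: diff(measured, simulate({**faults, c: step})))
            if diff(measured, simulate({**faults, best_c: step})) < current:
                faults[best_c] = step
    return faults


NODES = ("n1", "n2", "n3")


def toy_simulate(faults):
    # Hypothetical simulator: delivery ratio is 1 minus the injected drop rate.
    return {n: 1.0 - faults.get(n, 0.0) for n in NODES}


measured = {"n1": 1.0, "n2": 0.6, "n3": 0.8}
diagnosed = diagnose(measured, toy_simulate, NODES)
```

In this toy setting the search converges on drop rates of roughly 0.4 at n2 and 0.2 at n3 and never blames n1. A real simulator would be far more expensive per call, which is why the slide stresses an efficient search heuristic over brute force.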
12 Performance Evaluation Effectiveness of data cleaning –Detect >80% of misbehaving nodes with <15% false positives Effectiveness of fault diagnosis # Faults Coverage 100% 75% 70% 92% 86% False positive %29% Accuracy of detecting combinations of packet dropping, MAC misbehavior, and external noise in 25-node random topology
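For readers unfamiliar with the two metrics, the following sketch shows one way to compute them. The definitions are assumptions made for illustration (coverage as the fraction of injected faults that were diagnosed, false positive as the number of wrongly diagnosed faults divided by the number of injected faults); the slides do not state the exact normalization, and the fault labels are hypothetical.

```python
def coverage_and_false_positive(true_faults, diagnosed_faults):
    """Compute (coverage, false_positive) under the assumed definitions:
    both are normalized by the number of true (injected) faults."""
    true_faults = set(true_faults)
    diagnosed_faults = set(diagnosed_faults)
    if not true_faults:
        raise ValueError("need at least one true fault")
    detected = true_faults & diagnosed_faults   # correctly identified
    spurious = diagnosed_faults - true_faults   # reported but not injected
    return (len(detected) / len(true_faults),
            len(spurious) / len(true_faults))


# Hypothetical fault labels: (fault type, node).
injected = {("drop", "n3"), ("drop", "n7"), ("mac", "n12"), ("noise", "n20")}
reported = {("drop", "n3"), ("drop", "n7"), ("mac", "n12"), ("drop", "n9")}
cov, fp = coverage_and_false_positive(injected, reported)
```

With three of four injected faults found and one spurious report, this yields a coverage of 0.75 and a false positive ratio of 0.25.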
13 Performance Evaluation Test-bed –Implemented the technique in a small multi-hop IEEE 802.11a mesh testbed –Detected network congestion and random packet dropping
14 What-if Analysis
Action                                   Throughput
Reduce 8th flow by half                  1.15
Route 8th flow via the grid boundary     1.22
Increase power from 15 dBm to 20 dBm     0.99
Increase power from 15 dBm to 25 dBm     1.66
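Once each candidate action has been simulated, choosing among them is a simple ranking. The sketch below uses the throughput figures from this slide as given (their unit is not stated in the source); the function and variable names are illustrative assumptions.

```python
# Simulated throughput per candidate action, copied from the slide.
what_if_results = {
    "Reduce 8th flow by half": 1.15,
    "Route 8th flow via the grid boundary": 1.22,
    "Increase power from 15 dBm to 20 dBm": 0.99,
    "Increase power from 15 dBm to 25 dBm": 1.66,
}


def rank_actions(results):
    """Sort actions from highest to lowest simulated throughput."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


ranking = rank_actions(what_if_results)
best_action, best_throughput = ranking[0]
```

Note that the two power increases end up at opposite ends of the ranking, which illustrates why simulating each option is more reliable than reasoning about it in isolation.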
15 Conclusion & Future Work Propose online trace-driven simulation –Diagnose faults –Test alternative network configurations –Our evaluation results show it is promising Future work –Validate it in a larger-scale testbed –Extend it to handle mobility –Apply it to handle other types of faults
16 Thank you!
17 Related Work Protocols for wireless network management –Ad Hoc Network Management Protocol (ANMP) –Guerrilla Management Architecture –Complementary to our work Fault management for wireless infrastructure networks –AirWave, AirDefense, UniCenter, WNMS, IBM WSA, Wibhu SpectraMon … –Different from multihop wireless networks Detect specific faults in multihop wireless networks –Routing misbehavior –MAC misbehavior, …
18 Trace-driven Simulation Fault Injection Route Simulation Traffic Simulation Routing Updates Link Loads Candidate Faults Simulated Performance