1 Software Fault Tolerance (SWFT)
SWFT for Wireless Sensor Networks (Lec 2)
Dependable Embedded Systems & SW Group
www.deeds.informatik.tu-darmstadt.de
Prof. Neeraj Suri, Abdelmajid Khelil
Dept. of Computer Science, TU Darmstadt, Germany
2 Motivation
Last lecture: FT environmental monitoring, e.g. target detection; data fusion.
But the WSN ages! The system is evolvable!
"So who watches the watchmen?"
3 Self-healing
IDEALLY a self-healing network could:
Identify the problem: that a bird landed on a node
Identify a fix: the bird needs to be removed
Fix the problem: actuation to remove the bird
Goal: enable the WSN to self-monitor system health and debug autonomously.
Begin by enabling human debugging, in order to learn which metrics and techniques are useful and thereby enable autonomous system debugging.
4 Debugging Is Hard in WSNs
Wide range of failures: node crashes, sensor failures, code bugs, transient environmental changes to the network.
Bugs are multi-causal, non-repeatable, timing-sensitive, and have ephemeral triggers.
Transient problems are common and not necessarily indicative of failures.
Interactions between sensor hardware, protocols, and environmental characteristics are impossible to predict.
Limited visibility: the system is hard to access physically.
Minimal resources: RAM, communication.
WSN application design is an iterative process between debugging and deployment.
5 Some Debugging Challenges
Minimal resources: cannot remotely log on to nodes; bugs are hard to track down.
Evolvable system/conditions: application behavior changes after deployment; operating conditions (energy, ...) change.
Extracting debugging information: existing fault-tolerance techniques are limited.
Ensuring system health.
6 Scenario
After deploying a sensor network, very little data arrives at the sink. It could be anything!
The sink is receiving fluctuating averages from a region, which could be caused by:
environmental fluctuations
bad sensors
the channel dropping the data
calculation / algorithmic errors
bad nodes
7 Existing Work
Simulators / emulators / visualizers (e.g. EmTOS, EmView, MoteView, TOSSIM): provide real-time information, but do not capture historical context or aid in root-causing a failure.
SNMS: interactive health monitoring; focuses on the infrastructure to deliver metrics, and has a large code size.
Log files contain excessive data, which can obfuscate important events.
[1] focuses on metric collection and not on metric content.
Memento, Sympathy.
8 Sympathy: A Debugging System for Sensor Networks
Nithya Ramanathan, Kevin Chang, Rahul Kapur, Lewis Girod, Eddie Kohler, and Deborah Estrin. SenSys '05
9 Overview
System Model
Approach
Architecture
Evaluation
10 System Model
The network features some regular traffic: sensor nodes are expected to generate traffic of some kind (monitored traffic), e.g. routing updates, time-synchronization beacons, periodic data.
Sympathy suspects a failure when a node generates less monitored traffic than expected.
Sympathy generates additional metrics traffic.
No malicious behavior is assumed.
11 Model for Correct Data Flow
The sink may not receive sufficient traffic from a node for multiple reasons.
To determine where and why traffic is lost, Sympathy outlines high-level requirements for data to flow through the network.
12 Node Not Connected
If the destination node is not connected, it may not receive the packet, and it will not respond.
13 Node Does Not Receive Packet
If the destination node does not receive certain packets (e.g. a query) from the source, it may not transmit the expected traffic.
14 Node Does Not Transmit Traffic
The destination node may receive traffic, but due to a software or hardware failure it may not transmit the expected traffic.
15 Sink Does Not Receive Traffic
The sink may not receive traffic due to collisions or other problems along the route.
16 Tracking Traffic Through the Data Flow Model
(Decision flow: Is the node connected? Should the node transmit traffic? Did the node transmit traffic? Did the sink receive traffic? A "no" on connectivity means Node NOT Connected, e.g. due to a node crash or asymmetric communication.)
17 Design Requirements
A tool for detecting and debugging failures in pre- and post-deployment phases.
Debugging information should provide the most precise and meaningful failure detection:
accuracy
latency
Lowest overhead: transmitted debugging information must be minimized.
18 3 Challenges in WSN Debugging
1. Has a failure happened?
2. Which failure happened?
3. Is the failure important?
Sympathy aids users in finding (detecting and localizing) and fixing failures by attempting to answer these questions.
19 Sympathy Approach
1. The sink collects statistics passively and actively.
2. The sink monitors data flow from nodes / components.
3. It identifies and localizes failures.
4. It highlights failure dependencies and event correlations.
Idea: "There is a direct relationship between the amount of data collected at the sink and the existence of failures in the system."
20 Has a Failure Happened?
The network's purpose is to communicate: if nodes are communicating sufficiently, the network is working. The simplest solution is the best:
"Insufficient" traffic => failure. The application defines "sufficient".
Sympathy detects many different failure types by tracking application end-to-end data flow: channel contention, node crash, asymmetric links, sensor failure, no sensor data.
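The sufficiency check described above can be sketched as a small function. Since the paper leaves "sufficient" to the application, the threshold and per-node packet counts below are hypothetical, chosen only for illustration:

```python
# Sketch of Sympathy's core heuristic: a node whose monitored traffic
# at the sink falls below an application-defined threshold is suspected
# of having failed. Names and thresholds are illustrative, not from the paper.

def suspect_failures(pkts_rx_per_node, expected_pkts, sufficiency=0.5):
    """Return the set of node ids whose received traffic is 'insufficient'.

    pkts_rx_per_node: {node_id: packets received at the sink this epoch}
    expected_pkts:    {node_id: packets the application expects per epoch}
    sufficiency:      fraction of expected traffic below which a node is suspect
    """
    suspects = set()
    for node, expected in expected_pkts.items():
        received = pkts_rx_per_node.get(node, 0)
        if received < sufficiency * expected:
            suspects.add(node)
    return suspects

# Example: node 3 sent nothing, node 7 sent too little.
rx = {1: 10, 7: 2}
expected = {1: 10, 3: 10, 7: 10}
print(sorted(suspect_failures(rx, expected)))  # [3, 7]
```

Note that a node missing entirely from the sink's counts (node 3 here) is handled the same way as one that merely under-reports, which is exactly the "insufficient traffic" view of failure.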
21 Which Failure Happened?
(Decision tree, following the data-flow model:)
Node connected? Valid neighbor table? If not: No Neighbors. Valid route table? If not: No Route. Has anybody heard from the node (sink timestamp / neighbor tables)? If not: Node Crash. Does its awake time increase? If not: Node Reboot.
Should the node transmit?
Did the node transmit? Sufficient #pkts transmitted? If not: Bad Node Transmit. Sufficient #pkts received? If not: Bad Path to Node.
Did the sink receive traffic? Sufficient #pkts received at the sink? If not: Bad Path to Sink.
22 Is the Failure Important?
Analyze failure dependencies to highlight primary failures, based on the reported node topology: "Can failure X be caused by failure Y?"
Deemphasize secondary failures to focus the user's attention.
Sympathy does NOT identify all failures or debug failures down to a line of code.
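The dependency analysis can be sketched as follows. The route representation and the rule "a failure upstream on a node's route to the sink can explain its missing traffic" are assumptions made for illustration, not Sympathy's actual data structures:

```python
# Sketch: deemphasize secondary failures. A node's failure is treated as
# secondary if some other failed node lies on its route to the sink,
# since that upstream failure could explain the missing traffic.
# The routing representation is hypothetical.

def classify_failures(failed_nodes, routes):
    """Split failures into primary and secondary sets.

    failed_nodes: set of node ids flagged as failed
    routes: {node_id: [hop1, hop2, ..., sink]} reported route per node
    """
    primary, secondary = set(), set()
    for node in failed_nodes:
        upstream = set(routes.get(node, []))
        if upstream & (failed_nodes - {node}):
            secondary.add(node)   # explainable by a failure on its path
        else:
            primary.add(node)
    return primary, secondary

# Example: node 5 routes through failed node 2, so 5 is secondary.
failed = {2, 5}
routes = {5: [2, 1, 0], 2: [1, 0]}
print(classify_failures(failed, routes))  # ({2}, {5})
```

This matches the slide's intent: the user is pointed at node 2 first, because fixing it may make node 5's failure disappear.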
23 Failure Localization
Determining why data is missing: physically narrow down the cause, e.g. where the data is lost.
Was the data even sent by the component? Or: where in the transmission path was the data lost?
24 Final Goal
1. Detect failures.
2. Determine what the failure is (root cause).
3. Determine if the failure is important (primary failures, shown in red boxes in the example).
(Example: Bad Node Transmit (ADC failure), Node Crashed, and No Neighbors are primary; Bad Path to Node and Bad Path to Sink, both due to contention at the sink, are secondary.)
25 Sympathy System: Architecture
(Diagram: both the sink and each node run Sympathy alongside the routing layer and the applications; user processes at the sink collect the metrics.)
26 Architecture Definitions
Network: a sink and distributed nodes.
Component: node components and sink components.
Sympathy-sink: communicates with sink components, understands all packet formats sent to the sink, runs on a non-resource-constrained node.
Sympathy-node: runs on each sensor node.
Epoch: the statistics period.
27 Architecture: Overview
On the nodes (Sympathy-node): collect statistics.
At the sink (Sympathy-sink): collect statistics; perform diagnostics; if no/insufficient data, run tests and the failure-localization algorithm; report the results to the user.
28 Sympathy Code on a Sensor Node
Each component is monitored independently and returns generic or application-specific statistics.
(Sympathy-node sits between the components and the routing/MAC layers: a stats recorder and event processor retrieves component statistics into a ring buffer, and the data is returned to the sink.)
29 Metrics
Metrics are collected in 3 ways:
Sympathy code on each sensor node actively reports to the sink (periodically or on demand).
The sink passively snoops its own transmission area.
Sympathy code on the sink extracts sink metrics from the sink application.
3 metric categories:
Connectivity: ROUTING TABLE and NEIGHBOUR LIST from each node, collected either passively or actively.
Flow: PACKETS SENT, PACKETS RECEIVED, #SINK PACKETS TRANSMITTED from each sensor node, and #SINK PACKETS RECEIVED and SINK LAST TIMESTAMP from the sink.
Node: sensor nodes actively report UPTIME, BAD PACKETS RECEIVED, GOOD PACKETS RECEIVED. The sink also maintains BAD and GOOD PACKETS RECEIVED.
All metrics time out after one EPOCH.
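The epoch timeout on metrics can be sketched with a small store. The class name, interface, and time source are assumptions; only the expiry rule (discard anything older than one EPOCH) comes from the slide:

```python
# Sketch: every metric value is stamped with the time it was collected and
# is discarded once it is older than one EPOCH. Interface is illustrative.

EPOCH = 60.0  # seconds; the real period is configured per deployment

class MetricStore:
    def __init__(self, epoch=EPOCH):
        self.epoch = epoch
        self._data = {}   # (node_id, metric_name) -> (value, timestamp)

    def update(self, node, name, value, now):
        self._data[(node, name)] = (value, now)

    def get(self, node, name, now):
        """Return the metric value, or None if missing or expired."""
        entry = self._data.get((node, name))
        if entry is None:
            return None
        value, stamp = entry
        if now - stamp > self.epoch:
            del self._data[(node, name)]   # metric timed out
            return None
        return value

store = MetricStore(epoch=60.0)
store.update(25, "UPTIME", 78, now=0.0)
print(store.get(25, "UPTIME", now=30.0))   # 78 (still fresh)
print(store.get(25, "UPTIME", now=100.0))  # None (older than one epoch)
```

Expiring stale metrics is what lets the diagnostic treat "no statistics this epoch" as evidence of a failure rather than acting on outdated state.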
30 Node Statistics
Passively collected (in the sink's broadcast domain) and actively transmitted by nodes.

Statistic Name   Description
ROUTING TABLE    (Sink, next hop, quality) tuples
NEIGHBOUR LIST   Neighbors and associated ingress/egress
UP TIME          Time the node has been awake
#Statistics tx   #Statistics packets transmitted to the sink
#Pkts routed     #Packets routed by the node
31 Component Statistics
Actively transmitted by a node to the sink, for each instrumented component.

Statistic Name        Description
#Reqs rx              Number of request packets the component received
#Pkts tx              Number of packets the component transmitted
SINK LAST TIMESTAMP   Timestamp of the last data stored by the component
32 Sympathy System (step 2)
(Nodes and sink components collect statistics and forward them to Sympathy at the sink.)
33 Sink Interface
Sympathy passes component-specific statistics using a packet queue.
Components return ASCII translations of statistics / received data for Sympathy to print to the log file.
34 Sympathy System (step 3)
(At the sink: collect statistics; perform diagnostics; if no/insufficient data, run tests and the failure-localization algorithm.)
35 Failure Localization: Decision Tree (Tx: transmit, Rx: receive)
The DIAGNOSTIC splits into a "No Data" branch and an "Insufficient Data" branch:
No Data:
No packet received from the node, and no node has heard it: Node Crashed.
No node has a route to the sink: No Route to Sink.
No node has the sink on its neighbor list.
Otherwise: No Data (cannot localize further).
Insufficient Data:
No statistics received: No stats.
All of the component's data received: NO FAILURE (the component has no data to Tx).
Component did not Rx requests: Node not Rx Reqs.
Component did not Tx responses: Node not Tx Resps.
Component did Tx, but the sink did not Rx the responses: Sink not Rx Resps.
Node's awake time was reset: Node Rebooted.
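The "No Data" branch of the tree can be sketched as nested checks. The flat boolean arguments are a simplification standing in for the tests Sympathy runs over its collected metrics:

```python
# Sketch of the "No Data" branch of the localization tree. Each argument
# stands in for a test over collected metrics (neighbor tables, routes,
# sink timestamps); the boolean interface is a simplification.

def localize_no_data(heard_by_some_node, some_node_has_route,
                     some_node_has_sink_as_neighbor):
    if not heard_by_some_node:
        return "Node Crash"
    if not some_node_has_route:
        return "No Route to Sink"
    if not some_node_has_sink_as_neighbor:
        return "No node has the sink on its neighbor list"
    return "No Data"   # cannot localize further

print(localize_no_data(False, False, False))  # Node Crash
print(localize_no_data(True, False, False))   # No Route to Sink
print(localize_no_data(True, True, True))     # No Data
```

The ordering of the checks matters: a crash subsumes the routing checks, which is why the tree tests spatial evidence (has anyone heard the node?) before topology evidence.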
36 Functional "No Data" Failure Localization

Failure            Description
Node Crash         Node has crashed and not come back
No Route to Sink   No valid route exists from a node to the sink
No Data            No data received from a node, and Sympathy cannot localize the failure
37 Performance "Insufficient Data" Failure Localization

Failure            Description
Node Reboot        Node has rebooted
Congestion         Correlated failures on packet reception
No requests rx     Component is not receiving requests from the sink
No response tx     Component is not transmitting data in response to requests
No response rx     Sink is not receiving data transmitted by a component
No statistics rx   Sink has not received Sympathy statistics on the component
38 Source Localization
Root causes with associated metrics and source. Three localized sources for failures:
the node itself (crash, reboot, local bug, connectivity issue, ...)
the path between node and sink (relay failure, collisions, ...)
the sink
39 Sympathy System (step 4)
(The sink reports the diagnostic results and localized failures to the user.)
40 Informational Log File
Node 25 Time:
Node awake: 78 (mins) Sink awake: 78 (mins)
Route: 25 -> 18 -> 15 -> 12 -> 10 -> 8 -> 6 -> 2
Num neighbors heard this node: 6
Pkt-type    #Rx    Mins-since-last  #Rx-errors  Mins-since-last
1:Beacon    15(2)  0 mins           1(0)        52 mins
3:Route     3(0)   37 mins          0(0)        INF
Symp-stats  12(2)  1 mins
Reported Stats from Components
------------------------------------
**Sympathy: #metrics tx/#stats tx/#metrics expected/#pkts routed: 13(2)/12(2)/13(1)/0(0)
Node-ID  Egress  Ingress
-----------------------------------------------
8        128     71
13       128     121
24       249     254
41 Failure Log File
Node 18
Node awake: 0 (mins) Sink awake: 3 (mins)
Node Failure Category: Node Failed!
TESTS
Received stats from module [FAILED]
Received data this period [FAILED]
Node thinks it is transmitting data [FAILED]
Node has been claimed by other nodes as a neighbor [FAILED]
Sink has heard some packets from node [FAILED]
Received data this period: Num pkts rx: 0(0)
Received stats from module: Num pkts rx: 0(0)
Node's next-hop has no failures
42 Spurious Failures
A spurious failure is an artifact of another failure. Sympathy highlights failure dependencies in order to distinguish spurious failures.
(Example: a crashed node appears to not be sending data, while congestion makes other nodes merely appear to be sending very little data.)
43 Testing Methodology
Application: Sympathy run with the ESS (Extensible Sensing System) application, in simulation, emulation, and deployment.
Traffic conditions: no traffic, application traffic, congestion.
Node failures:
Node reboot: only requires information from the node.
Node crash: requires spatial information from neighboring nodes to diagnose.
A failure is injected in one node per run, for each node.
18-node network, with a maximum of 7 hops to the sink.
44 Evaluation Metrics
Accuracy of failure detection: number of primary failure notifications.
Latency of failure detection/notification: time from when the failure is injected to when Sympathy notifies the user about the failure.
There is a tradeoff between accuracy and latency.
45 Notification Latency
Does Sympathy always detect an injected failure?
Detection = assigning a root cause of node crash and highlighting the failure as primary.
(Plot: notification latency, measured in units of the EPOCH.)
46 Notification Accuracy
47 Memory Footprint
TinyOS, mica2

Binary             RAM      ROM
ESS w/o Sympathy   3089 B   96094 B
ESS w/ Sympathy    3160 B   104802 B
Difference         71 B     8708 B
Sympathy           47 B     1558 B
48 Extensibility
Adding a new metric requires ~5 lines of code on the nodes and ~10 lines of code on the sink.
Extensible to application classes with predictable data flow within the bounds of an epoch: the user specifies the expected amount of data.
Extensible to different routing layers due to the modular design:
the Multihop routing plug-in was 140 lines
the Mintroute routing plug-in was 100 lines
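Why a modular design keeps a new metric down to a handful of lines can be illustrated with a registry sketch. This interface is entirely hypothetical (Sympathy is written in nesC/C on TinyOS, not Python); it only shows the design idea of registering a metric name plus a decoder:

```python
# Hypothetical sink-side metric registry, illustrating how a modular design
# keeps a new metric down to a few lines: register a name plus a function
# that decodes the raw value from a statistics packet.

METRIC_PARSERS = {}

def register_metric(name, parser):
    METRIC_PARSERS[name] = parser

def parse_stats_packet(packet):
    """packet: {metric_name: raw_value}; returns only registered, decoded metrics."""
    return {name: METRIC_PARSERS[name](raw)
            for name, raw in packet.items() if name in METRIC_PARSERS}

# Adding a new metric is one registration line on the sink:
register_metric("UPTIME", int)
register_metric("NEIGHBOUR LIST", lambda raw: raw.split(","))

print(parse_stats_packet({"UPTIME": "78", "NEIGHBOUR LIST": "8,13,24"}))
# {'UPTIME': 78, 'NEIGHBOUR LIST': ['8', '13', '24']}
```

The routing plug-ins mentioned above follow the same pattern: the core stays unchanged, and only a small adapter per routing layer is written.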
49 Conclusion
A deployed system that aids in debugging by detecting and localizing failures.
A small list of statistics that is effective in identifying and localizing failures.
A behavioral model for a certain application class that provides a simple diagnostic to measure system health.
50 Literature
[1] J. Zhao, R. Govindan, D. Estrin, "Computing Aggregates for Monitoring Wireless Sensor Networks", SNPA 2003.
[2] N. Ramanathan, K. Chang, R. Kapur, L. Girod, E. Kohler, D. Estrin, "Sympathy for the Sensor Network Debugger", SenSys 2005.
[3] S. Rost, H. Balakrishnan, "Memento: A Health Monitoring System for Wireless Sensor Networks", SECON 2006.