How Dirty is your Data: The Duality between Detecting Events and Faults
J. Gupchup, A. Terzis, R. Burns, A. Szalay
Department of Computer Science, Johns Hopkins University

Outline
Background
Problem Statement
Experiments
Results
Discussion

Application
Monitoring nesting conditions of Maryland box turtles
Science question: Do nesting conditions determine sex?
Important to correlate observations with environmental events (rain, snow, etc.)

Duality of Faults & Events
Data gathered from sensor networks contain faults
Delivering faulty data consumes resources and pollutes statistics
Hence the need for fault detection techniques
Fault detection methods detect readings that deviate from “normal” or “expected” values
Environmental events:
  – Are scientifically interesting
  – Also deviate from the norm

Research Question(s)
Are “events” misclassified as “faults”?
What metrics could be used to quantify the misclassification?
How does the misclassification vary with:
  – Type of fault
  – Type of fault detection method
  – Type of modality (moisture, temperature)
Is it possible to design a fault detection mechanism that minimizes the misclassification?

Know Thy Faults
Short faults
  – Sudden change in measurement
Noise faults
  – Larger variations in amplitude than expected
  – Little or no variation in amplitude (unresponsive)

Fault Detection Methods
SHORT rule
  – If X_i - X_(i-1) > δ_SHORT, mark the current measurement as faulty (point method)
  – δ_SHORT is established from domain knowledge
NOISE rule
  – Take W successive samples
  – If (σ_W ≤ σ_train - σ_allow) or (σ_W ≥ σ_train + σ_allow), mark all W readings as faulty (block method)
  – σ_train and σ_allow are established from training data
Linear Least-Squares Estimation (LLSE)
  – Estimate the expected value of a sensor's reading from other sensors using LLSE
  – If X_model - X_actual > δ_LLSE for k of the node's neighbors, mark the reading as faulty (point method)
Reference: A. Sharma, L. Golubchik, and R. Govindan, “On the prevalence of sensor faults in real world deployments”, IEEE Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON), 2007
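
The three rules above can be sketched in Python as follows. This is a minimal illustration, not the implementation used by the authors or by Sharma et al. Assumptions: the SHORT rule compares the absolute difference of consecutive readings, the NOISE rule uses non-overlapping windows and the population standard deviation, the LLSE rule fits one line per neighbor and flags a reading when at least k neighbors disagree, and all function and parameter names are hypothetical.

```python
import statistics


def short_rule(readings, delta_short):
    """Point method: flag index i when the jump from i-1 exceeds delta_short.

    Assumption: the absolute difference is used, so the reading that
    follows a spike is flagged as well.
    """
    return [
        i
        for i in range(1, len(readings))
        if abs(readings[i] - readings[i - 1]) > delta_short
    ]


def noise_rule(readings, window, sigma_train, sigma_allow):
    """Block method: flag every reading of a window whose standard deviation
    leaves the band sigma_train +/- sigma_allow.

    Assumption: windows do not overlap; a trailing partial window is skipped.
    """
    flagged = []
    for start in range(0, len(readings) - window + 1, window):
        sigma_w = statistics.pstdev(readings[start:start + window])
        too_quiet = sigma_w <= sigma_train - sigma_allow
        too_noisy = sigma_w >= sigma_train + sigma_allow
        if too_quiet or too_noisy:
            flagged.extend(range(start, start + window))
    return flagged


def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ slope * xs + intercept.

    Assumes xs are not all identical (otherwise the slope is undefined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def llse_rule(actual, neighbor_values, models, delta_llse, k):
    """Point method: return True when at least k neighbor models predict a
    value that differs from the actual reading by more than delta_llse.

    models holds one (slope, intercept) pair per neighbor, e.g. from fit_line.
    """
    disagreeing = sum(
        1
        for value, (slope, intercept) in zip(neighbor_values, models)
        if abs(slope * value + intercept - actual) > delta_llse
    )
    return disagreeing >= k
```

Note that SHORT and LLSE return verdicts for individual readings, while NOISE always flags a whole window, which is why the slides distinguish point methods from block methods.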

Evaluation Metrics
Misclassification error (μ) for point faults:
  – μ = event readings tagged as faults / total event measurements
Misclassification error (μ) for block faults:
  – Total misclassification μ = Σ_i D_i / Σ_i E_i
  – E_i: event period i; D_i: misclassification within event period i
Fault detection evaluation metric:
  – False negative ratio = fraction of faults that were not detected
[Figure: timelines of event periods (E_i) with the misclassified portions (D_i) marked]
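
A minimal sketch of these two metrics, assuming each series is represented as per-sample boolean masks (an assumption, as are the function names). With masks, the point and block definitions reduce to the same ratio: samples inside event periods that were flagged, divided by all samples inside event periods.

```python
def misclassification_error(event_mask, flagged_mask):
    """Fraction of event samples that a detector tagged as faulty (mu)."""
    event_total = sum(1 for e in event_mask if e)
    if event_total == 0:
        return 0.0  # no events, so nothing can be misclassified
    tagged = sum(1 for e, f in zip(event_mask, flagged_mask) if e and f)
    return tagged / event_total


def false_negative_ratio(fault_mask, flagged_mask):
    """Fraction of true (injected) faults that the detector missed."""
    fault_total = sum(1 for t in fault_mask if t)
    if fault_total == 0:
        return 0.0  # no faults to miss
    missed = sum(1 for t, f in zip(fault_mask, flagged_mask) if t and not f)
    return missed / fault_total
```

Both functions return a ratio in [0, 1]; multiply by 100 to compare with percentages.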

Jug Bay Deployment
[Map: turtle nests and weather station. Courtesy: Google Maps]

Dataset
Sensor data:
  – Box temperature and soil moisture
  – 3 motes from Jug Bay (previous slide)
  – 5 months of data (sampled every 10 min)
  – Training data set (1 month), test data set (4 months)
Event ground truth (weather data):
  – Precipitation data collected from a weather station ~700 m away (sampled every 15 min)
  – 21 major events (i.e., rainfall) occurred
  – Total rainfall: 158 hours

Faults Ground Truth
Start with a clean data set
Inject faults to establish ground truth
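
The slides do not say how faults were injected. As one possible illustration, SHORT-type faults could be injected by adding a fixed offset to randomly chosen readings while recording which indices were corrupted. The function name, the fixed offset, and the seeded generator are all assumptions made for this sketch.

```python
import random


def inject_short_faults(readings, n_faults, magnitude, seed=0):
    """Return (corrupted copy, ground-truth mask) with n_faults spikes added.

    Assumption: a SHORT fault is modeled as a single reading shifted by
    `magnitude`; the original series is left untouched.
    """
    rng = random.Random(seed)  # seeded so the injection is repeatable
    indices = rng.sample(range(len(readings)), n_faults)
    corrupted = list(readings)
    truth = [False] * len(readings)
    for i in indices:
        corrupted[i] += magnitude
        truth[i] = True
    return corrupted, truth
```

The returned mask serves as fault ground truth when computing the false negative ratio.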

Methodology
For each fault detection method and each modality:
  – Use the 1st month's data to train
  – Obtain model parameters
  – Evaluate the method on fault-injected test data
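
A sketch of the train/test split and of parameter estimation for the NOISE rule. The 144 samples per day follow from the 10-minute sampling interval; the 30-day month, the function names, and the way σ_train and σ_allow are derived (mean window deviation, and the largest departure from it seen in training) are assumptions, since the slides only say they come from training data.

```python
import statistics


def split_train_test(readings, samples_per_day=144, train_days=30):
    """Split a series into the first month (training) and the rest (test)."""
    split = samples_per_day * train_days
    return readings[:split], readings[split:]


def estimate_noise_parameters(train, window):
    """Estimate (sigma_train, sigma_allow) from non-overlapping windows.

    Assumption: sigma_train is the mean per-window standard deviation and
    sigma_allow is the largest deviation from it observed in training.
    Requires at least one full window of training data.
    """
    sigmas = [
        statistics.pstdev(train[start:start + window])
        for start in range(0, len(train) - window + 1, window)
    ]
    sigma_train = statistics.mean(sigmas)
    sigma_allow = max(abs(s - sigma_train) for s in sigmas)
    return sigma_train, sigma_allow
```

The same split applies to every method and modality, so each detector is trained on data it is never evaluated on.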

Soil Moisture: SHORT Rule
Reducing the number of misclassification errors increases false negatives
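
The trade-off can be reproduced on a toy example by sweeping the SHORT threshold. The series below is hand-made for illustration and is not data from the deployment; the names and the use of an absolute difference are assumptions.

```python
def short_rule(readings, delta_short):
    """Indices whose absolute jump from the previous reading exceeds delta."""
    return {
        i
        for i in range(1, len(readings))
        if abs(readings[i] - readings[i - 1]) > delta_short
    }


def sweep(readings, event_idx, fault_idx, deltas):
    """Return (delta, misclassification error, false negative ratio) rows."""
    rows = []
    for delta in deltas:
        flagged = short_rule(readings, delta)
        mu = len(flagged & event_idx) / len(event_idx)
        fn = len(fault_idx - flagged) / len(fault_idx)
        rows.append((delta, mu, fn))
    return rows


# Toy soil-moisture-like series: rain raises the level at index 2,
# and two spikes (indices 6 and 9) play the role of injected faults.
READINGS = [10.0, 10.0, 16.0, 16.5, 17.0, 17.0, 20.0, 17.0, 17.0, 25.0, 17.0, 17.0]
EVENT_IDX = {2, 3, 4}
FAULT_IDX = {6, 9}

for delta, mu, fn in sweep(READINGS, EVENT_IDX, FAULT_IDX, [1.0, 4.0, 7.0]):
    print(f"delta={delta:.1f}  misclassification={mu:.2f}  false_negatives={fn:.2f}")
```

On this toy series, raising the threshold from 1 to 7 lowers the misclassification error from 1/3 to 0 while the false negative ratio rises from 0 to 0.5.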

Misclassification: LLSE Method
Modality          Misclassification error   False negatives
Box temperature   0.3%                      77.19%
Soil moisture     46.3%                     50.03%
Higher misclassification can occur due to spatial and temporal heterogeneity of the soil

Lessons Learned
There exists a tension between detecting events and faults
Fault detection algorithms need to take this into consideration
  – Events can be misclassified as faults
Need for novel fault detection methods that are robust in the presence of events

Need for Pattern Recognition Techniques

Acknowledgements
Abhishek Sharma, Dept. of Computer Science, University of Southern California
Chris Swarth, Jug Bay Wetlands Sanctuary
Life Under Your Feet team
Marcus Chang, University of Copenhagen
(Courtesy: Andreas Terzis)

Questions?