Model-driven Data Acquisition in Sensor Networks. Amol Deshpande (1,4), Carlos Guestrin (4,2), Sam Madden (4,3), Joe Hellerstein (1,4), Wei Hong (4). Affiliations: 1 UC Berkeley; 2 Carnegie Mellon University; 3 MIT; 4 Intel Research - Berkeley
Sensor networks and distributed systems: a collection of devices that can sense, actuate, and communicate over a wireless network. Available resources: 4 MHz, 8-bit CPU; 40 Kbps wireless; 3 V battery (lasts days or months). Sensors for temperature, humidity, pressure, sound, magnetic fields, acceleration, visible and ultraviolet light, etc. Analogous issues arise in other distributed systems, including streams and the Internet.
Real deployments: Great Duck Island; Redwoods; precision agriculture; fabrication monitoring. (Photo: Leach's Storm Petrel)
Example: Intel Berkeley Lab deployment
Analogy: sensor net as a database. (Diagram: SQL-style query; TinyDB; distribute query; collect query answer or data, every time step.) Declarative interface: sensor nets are not just for PhDs; decreases deployment time. Data aggregation: can reduce communication.
Limitations of existing approach. (Diagram: SQL-style query; TinyDB; distribute query; collect data, every time step; new query.) The process is redone every time the query changes. Query distribution: every node must receive the query. Data collection: every node must wake up at every time step. Data loss is ignored. No quality guarantees. Data inefficient: correlations are ignored.
Sensor net data is correlated: spatial-temporal correlation; inter-attribute correlation. Data is not i.i.d., so we shouldn't ignore missing data. Observing one sensor gives information about other sensors (and future values). Observing one attribute gives information about other attributes.
Model-driven data acquisition: overview. (Diagram: SQL-style query with desired confidence; probabilistic model; data gathering plan; condition on new observations; posterior belief; new query.) Strengths of model-based data acquisition: observe fewer attributes; exploit correlations; reuse information between queries; directly deal with missing data; answer more complex (probabilistic) queries.
Probabilistic models and queries: user's perspective. Query: SELECT nodeId, temp ± 0.5°C, conf(.95) FROM sensors WHERE nodeId in {1..8}. The system selects and observes a subset of nodes. Observed nodes: {3,6,8}. Query result (columns: Node, Temp, Conf.), confidence for nodes 1 to 8: 98%, 95%, 100%, 99%, 95%, 100%, 98%, 100%.
Probabilistic models and queries. Joint distribution P(X_1, …, X_n), learned from historical data. Probabilistic query, example: value of X_2 ± ε with prob. > 1-δ. Prob. below 1-δ? Then observe attributes. Example: observe X_1 = 18 and compute P(X_2 | X_1 = 18); higher prob., could answer query.
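The conditioning step on this slide can be illustrated with a two-sensor Gaussian. This is a minimal sketch, not the system's implementation: the means, variances, covariance, tolerance ε and threshold δ below are hypothetical values chosen for illustration, and the helper names are made up here.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within(eps, var):
    """P(|X - mean| <= eps) for a Gaussian with variance var."""
    return 2.0 * normal_cdf(eps / math.sqrt(var)) - 1.0

def condition_x2_on_x1(mu, cov, x1):
    """Mean and variance of X2 given X1 = x1 for a bivariate Gaussian."""
    mu1, mu2 = mu
    (s11, s12), (_, s22) = cov
    post_mean = mu2 + s12 / s11 * (x1 - mu1)
    post_var = s22 - s12 * s12 / s11
    return post_mean, post_var

# Hypothetical model: two strongly correlated temperature sensors.
mu = (20.0, 21.0)
cov = ((1.0, 0.97), (0.97, 1.0))
eps, delta = 0.5, 0.05

# Confidence in "X2 within +/- eps of its mean" before any observation.
prior_conf = prob_within(eps, cov[1][1])

# Observe X1 = 18 and recompute the confidence from the posterior.
post_mean, post_var = condition_x2_on_x1(mu, cov, 18.0)
post_conf = prob_within(eps, post_var)

print(f"prior confidence: {prior_conf:.3f}")
print(f"posterior: mean {post_mean:.2f}, confidence {post_conf:.3f}")
print("query answerable:", post_conf > 1.0 - delta)
```

With these assumed numbers, the prior confidence falls short of 1-δ, while the posterior after observing X_1 exceeds it, so X_2 never has to be read.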
Dynamic models: filtering. Joint distribution at time t, learned from historical data. Observe attributes (example: observe X_1 = 18) and condition on the observations at time t. Result: fewer observations in future queries. Example: Kalman filter.
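The filtering idea can be sketched with a scalar Kalman filter. This is an illustrative simplification under stated assumptions: a single attribute, a linear transition x_{t+1} = a·x_t + noise, and hypothetical noise variances q and r; the function names are made up for this example.

```python
def kf_predict(mean, var, a, q):
    """Advance the belief one time step under x_{t+1} = a * x_t + noise (variance q)."""
    return a * mean, a * a * var + q

def kf_update(mean, var, z, r):
    """Condition the belief on a noisy reading z with measurement variance r."""
    gain = var / (var + r)
    return mean + gain * (z - mean), (1.0 - gain) * var

# Hypothetical belief about one temperature attribute at time t.
belief = (20.0, 4.0)

# Time step, then an observation of 18 degrees.
belief = kf_predict(*belief, a=1.0, q=0.5)
belief = kf_update(*belief, z=18.0, r=0.5)

# Next time step without observing: uncertainty grows, but from a low base.
unobserved = kf_predict(*belief, a=1.0, q=0.5)

print("after observation:", belief)
print("one step later, no observation:", unobserved)
```

The variance after the unobserved step is still far below the initial variance, which is why an observation made for one query can reduce what later queries need to acquire.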
Supported queries. Value query: X_i ± ε with prob. at least 1-δ. SELECT and range query: X_i ∈ [a,b] with prob. at least 1-δ (which sensors have temperature greater than 25°C?). Aggregation: average ± ε of a subset of attributes with prob. > 1-δ; aggregation and selection can be combined (probability that more than 10 sensors have temperature greater than 25°C?). Queries require the solution of integrals: many queries are computed in closed form; some require numerical integration or sampling.
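For a Gaussian belief, the range and average queries above reduce to evaluations of the normal CDF. The following is a minimal sketch assuming Gaussian marginals; the function names and the example numbers are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def range_prob(mean, var, a, b):
    """P(a <= X <= b) for a Gaussian X."""
    sd = math.sqrt(var)
    return normal_cdf((b - mean) / sd) - normal_cdf((a - mean) / sd)

def range_answer(mean, var, a, b, delta):
    """True/False if the range predicate is decided with prob. >= 1 - delta, else None."""
    p = range_prob(mean, var, a, b)
    if p >= 1.0 - delta:
        return True
    if 1.0 - p >= 1.0 - delta:
        return False
    return None  # not confident either way: more observations are needed

def average_belief(means, cov):
    """Mean and variance of the average of jointly Gaussian attributes."""
    n = len(means)
    avg_mean = sum(means) / n
    avg_var = sum(sum(row) for row in cov) / (n * n)
    return avg_mean, avg_var

# "Is the temperature greater than 25 C?" for two hypothetical beliefs.
print(range_answer(27.0, 1.0, 25.0, math.inf, delta=0.05))
print(range_answer(25.5, 1.0, 25.0, math.inf, delta=0.05))

# Belief about the average of two correlated sensors.
print(average_belief([20.0, 22.0], [[1.0, 0.5], [0.5, 1.0]]))
```

The second range query returns `None`, which corresponds to the case where the system would have to acquire more data before answering.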
Model-driven data acquisition: overview. (Diagram: SQL-style query with desired confidence; probabilistic model; data gathering plan; condition on new observations; posterior belief.) What sensors do we observe? How do we collect observations?
Acquisition costs: attributes have different acquisition costs; exploit correlation through the probabilistic model; must also consider networking cost. (Diagram label: "cheaper?")
Network model and plan format. Assume a known (quasi-static) network topology. Define the traversal using a (1.5-approximate) TSP. C_t(S) is the expected cost of the TSP tour (lossy communication). Cost of collecting a subset S of sensor values: C(S) = C_a(S) + C_t(S), i.e. acquisition cost plus traversal cost. Goal: find a subset S that is sufficient to answer the query at minimum cost C(S).
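The cost model C(S) = C_a(S) + C_t(S) can be sketched as follows. Note the assumptions: the tour here uses a plain nearest-neighbour heuristic as a stand-in, not the 1.5-approximate TSP named on the slide; lossy links are modelled by assuming independent retransmissions until success; and the topology and costs are hypothetical.

```python
def expected_link_cost(tx_cost, delivery_prob):
    """Expected cost of a lossy link, assuming independent retries until success."""
    return tx_cost / delivery_prob

def tour_cost(subset, link_cost, base=0):
    """Cost of a nearest-neighbour tour: base -> every node in subset -> base."""
    remaining = set(subset)
    current, total = base, 0.0
    while remaining:
        # Pick the cheapest next hop; ties are broken by node id.
        nxt = min(remaining, key=lambda j: (link_cost[current][j], j))
        total += link_cost[current][nxt]
        remaining.remove(nxt)
        current = nxt
    return total + link_cost[current][base]

def plan_cost(subset, acq_cost, link_cost, base=0):
    """C(S) = C_a(S) + C_t(S): acquisition cost plus traversal cost."""
    return sum(acq_cost[i] for i in subset) + tour_cost(subset, link_cost, base)

# Hypothetical line topology: node 0 is the base station; entries are expected link costs.
link_cost = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
]
acq_cost = {1: 0.5, 2: 0.5, 3: 0.5}

print(plan_cost({1}, acq_cost, link_cost))
print(plan_cost({1, 3}, acq_cost, link_cost))
print(expected_link_cost(1.0, 0.5))
```

Comparing `plan_cost` for different subsets makes the trade-off visible: a distant node can dominate the cost even when its acquisition cost is the same as a nearby one.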
Choosing observation plan. Is a subset S sufficient? Query: X_i ∈ [a,b] with prob. > 1-δ. If we observe S = s: R_i(s) = max{ P(X_i ∈ [a,b] | s), 1 - P(X_i ∈ [a,b] | s) }. The value of S is unknown before observing, so take the expectation: R_i(S) = ∫ P(s) R_i(s) ds. Optimization problem:
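One way to write the optimization problem, assuming it combines the plan cost C(S) from the network model with the confidence threshold 1-δ of the query (this formulation is a reading of the surrounding slides, not a quotation):

```latex
\begin{align*}
R_i(s) &= \max\bigl\{\, P(X_i \in [a,b] \mid s),\; 1 - P(X_i \in [a,b] \mid s) \,\bigr\} \\
R_i(S) &= \int P(s)\, R_i(s)\, ds \\
\min_{S \subseteq \{1,\dots,n\}} \; & C(S)
\quad \text{subject to} \quad R_i(S) \ge 1 - \delta
\end{align*}
```

Here R_i(s) is the confidence with which the range predicate can be decided after seeing the values s, and R_i(S) is its expectation over the values that S might take.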
BBQ system. (Diagram: SQL-style query with desired confidence; probabilistic model; data gathering plan; condition on new observations; posterior belief.) Queries: value, range, average. Probabilistic model: multivariate Gaussians, learned from historical data. Conditioning: equivalent to Kalman filter; simple matrix operations. Data gathering plan: exhaustive or greedy search; factor-1.5 TSP approximation.
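A greedy search over observation plans can be sketched for a value query on a multivariate Gaussian. This is a simplified illustration, not the BBQ planner: it assumes noise-free observations, uses per-attribute costs only (no traversal cost), and picks the attribute with the best confidence gain per unit cost. For value queries on Gaussians the posterior variance does not depend on the observed value, so the confidence of a plan can be computed before observing anything. The covariance and costs are hypothetical.

```python
import math

def confidence(var, eps):
    """P(|X - mean| <= eps) for a Gaussian with variance var."""
    if var <= 1e-12:
        return 1.0
    return math.erf(eps / math.sqrt(2.0 * var))

def condition_on(cov, j):
    """Posterior covariance after observing attribute j exactly (rank-one update)."""
    if cov[j][j] <= 1e-12:
        return [row[:] for row in cov]  # attribute already determined: nothing to gain
    n = len(cov)
    return [[cov[r][c] - cov[r][j] * cov[j][c] / cov[j][j] for c in range(n)]
            for r in range(n)]

def greedy_plan(cov, target, eps, delta, cost):
    """Add attributes until the target is known to within eps with prob. >= 1 - delta."""
    chosen = []
    candidates = set(range(len(cov)))
    while candidates and confidence(cov[target][target], eps) < 1.0 - delta:
        base = confidence(cov[target][target], eps)

        def gain_per_cost(j):
            after = confidence(condition_on(cov, j)[target][target], eps)
            return (after - base) / cost[j]

        best = max(sorted(candidates), key=gain_per_cost)
        chosen.append(best)
        candidates.remove(best)
        cov = condition_on(cov, best)
    return chosen, confidence(cov[target][target], eps)

# Hypothetical model: attribute 0 is expensive to read; attribute 1 is cheap and
# strongly correlated with it; attribute 2 is cheap but weakly correlated.
cov = [[1.0, 0.97, 0.3],
       [0.97, 1.0, 0.3],
       [0.3, 0.3, 1.0]]
cost = [10.0, 1.0, 1.0]

plan, conf = greedy_plan(cov, target=0, eps=0.5, delta=0.05, cost=cost)
print("observe attributes:", plan, "confidence:", round(conf, 3))
```

With these assumed numbers the search selects the cheap, strongly correlated attribute instead of reading the expensive target directly.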
Experimental results. Redwood trees and Intel Lab datasets. Learned models from data: static model; dynamic model (Kalman filter, time-indexed transition probabilities). Evaluated on a wide range of queries.
Cost versus Confidence level
Obtaining approximate values Query: True temperature value ± epsilon with confidence 95%
Approximate range queries Query: Temperature in [T 1,T 2 ] with confidence 95%
Comparison to other methods
Intel Lab traversals
BBQ system: extensions. More complex queries; other probabilistic models; more advanced planning; outlier detection; dynamic networks; continuous queries; …
Conclusions. Model-driven data acquisition: observe fewer attributes; exploit correlations; reuse information between queries; directly deal with missing data; answer more complex (probabilistic) queries. Basis for future sensor network systems.