Uncertain Data Management for Sensor Networks Amol Deshpande, University of Maryland (joint work w/ Bhargav Kanagal, Prithviraj Sen, Lise Getoor, Sam Madden)

Slides:



Advertisements
Similar presentations
Research Challenges in the CarTel Mobile Sensor System Samuel Madden Associate Professor, MIT.
Advertisements

Dialogue Policy Optimisation
State Estimation and Kalman Filtering CS B659 Spring 2013 Kris Hauser.
1 VLDB 2006, Seoul Mapping a Moving Landscape by Mining Mountains of Logs Automated Generation of a Dependency Model for HUG’s Clinical System Mirko Steinle,
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
Representing and Querying Correlated Tuples in Probabilistic Databases
Cleaning Uncertain Data with Quality Guarantees Reynold Cheng, Jinchuan Chen, Xike Xie 2008 VLDB Presented by SHAO Yufeng.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Online Filtering, Smoothing & Probabilistic Modeling of Streaming Data In short, Applying probabilistic models to Streams Bhargav Kanagal & Amol Deshpande.
Kien A. Hua Division of Computer Science University of Central Florida.
Dynamic Bayesian Networks (DBNs)
An Approach to Evaluate Data Trustworthiness Based on Data Provenance Department of Computer Science Purdue University.
A Generic Framework for Handling Uncertain Data with Local Correlations Xiang Lian and Lei Chen Department of Computer Science and Engineering The Hong.
GS 540 week 6. HMM basics Given a sequence, and state parameters: – Each possible path through the states has a certain probability of emitting the sequence.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
Probabilistic Aggregation in Distributed Networks Ling Huang, Ben Zhao, Anthony Joseph and John Kubiatowicz {hling, ravenben, adj,
Xyleme A Dynamic Warehouse for XML Data of the Web.
Report on Intrusion Detection and Data Fusion By Ganesh Godavari.
Distributed Regression: an Efficient Framework for Modeling Sensor Network Data Carlos Guestrin Peter Bodik Romain Thibaux Mark Paskin Samuel Madden.
Approximate data collection in sensor networks the appeal of probabilistic models David Chu Amol Deshpande Joe Hellerstein Wei Hong ICDE 2006 Atlanta,
Exploiting Correlated Attributes in Acquisitional Query Processing Amol Deshpande University of Maryland Joint work with Carlos Sam
Probabilistic Data Aggregation Ling Huang, Ben Zhao, Anthony Joseph Sahara Retreat January, 2004.
Model-Driven Data Acquisition in Sensor Networks - Amol Deshpande et al., VLDB ‘04 Jisu Oh March 20, 2006 CS 580S Paper Presentation.
Probabilistic Databases Amol Deshpande, University of Maryland.
Model-driven Data Acquisition in Sensor Networks Amol Deshpande 1,4 Carlos Guestrin 4,2 Sam Madden 4,3 Joe Hellerstein 1,4 Wei Hong 4 1 UC Berkeley 2 Carnegie.
Bayesian Filtering for Robot Localization
Efficient Query Evaluation over Temporally Correlated Probabilistic Streams Bhargav Kanagal, Amol Deshpande ΗΥ-562 Advanced Topics on Databases Αλέκα Σεληνιωτάκη.
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
Using Probabilistic Models for Data Management in Acquisitional Environments Sam Madden MIT CSAIL With Amol Deshpande (UMD), Carlos Guestrin (CMU)
Sensor Data Management: Challenges and (some) Solutions Amol Deshpande, University of Maryland.
Implementation Yaodong Bi. Introduction to Implementation Purposes of Implementation – Plan the system integrations required in each iteration – Distribute.
Midterm Review Rao Vemuri 16 Oct Posing a Machine Learning Problem Experience Table – Each row is an instance – Each column is an attribute/feature.
ITEC224 Database Programming
1 A Bayesian Method for Guessing the Extreme Values in a Data Set Mingxi Wu, Chris Jermaine University of Florida September 2007.
Chapter 1 Introduction to Data Mining
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 12: Overview.
Visual Discovery Management: Divide and Conquer Abhishek Mukherji, Professor Elke A. Rundensteiner, Professor Matthew O. Ward XMDVTool, Department of Computer.
Access Path Selection in a Relational Database Management System Selinger et al.
March 6th, 2008Andrew Ofstad ECE 256, Spring 2008 TAG: a Tiny Aggregation Service for Ad-Hoc Sensor Networks Samuel Madden, Michael J. Franklin, Joseph.
Bayesian networks Classification, segmentation, time series prediction and more. Website: Twitter:
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Time Series Data Analysis - I Yaji Sripada. Dept. of Computing Science, University of Aberdeen2 In this lecture you learn What are Time Series? How to.
RELATIONAL FAULT TOLERANT INTERFACE TO HETEROGENEOUS DISTRIBUTED DATABASES Prof. Osama Abulnaja Afraa Khalifah
A Survey Based Seminar: Data Cleaning & Uncertain Data Management Speaker: Shawn Yang Supervisor: Dr. Reynold Cheng Prof. David Cheung
1 Robot Environment Interaction Environment perception provides information about the environment’s state, and it tends to increase the robot’s knowledge.
Report on Intrusion Detection and Data Fusion By Ganesh Godavari.
Sensor Database System Sultan Alhazmi
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
 Three-Schema Architecture Three-Schema Architecture  Internal Level Internal Level  Conceptual Level Conceptual Level  External Level External Level.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
Yanlei Diao, University of Massachusetts Amherst Future Directions in Sensor Data Management Yanlei Diao University of Massachusetts, Amherst.
Dr. Sudharman K. Jayaweera and Amila Kariyapperuma ECE Department University of New Mexico Ankur Sharma Department of ECE Indian Institute of Technology,
Analyzing wireless sensor network data under suppression and failure in transmission Alan E. Gelfand Institute of Statistics and Decision Sciences Duke.
Indexing Correlated Probabilistic Databases Bhargav Kanagal, Amol Deshpande University of Maryland, College Park, USA SIGMOD Presented.
DISTIN: Distributed Inference and Optimization in WSNs A Message-Passing Perspective SCOM Team
DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.
Exact Inference in Bayes Nets. Notation U: set of nodes in a graph X i : random variable associated with node i π i : parents of node i Joint probability:
W. Hong & S. Madden – Implementation and Research Issues in Query Processing for Wireless Sensor Networks, ICDE 2004.
In-Network Query Processing on Heterogeneous Hardware Martin Lukac*†, Harkirat Singh*, Mark Yarvis*, Nithya Ramanathan*† *Intel.
1 Scalable Probabilistic Databases with Factor Graphs and MCMC Michael Wick, Andrew McCallum, and Gerome Miklau VLDB 2010.
A Unified Approach to Ranking in Probabilistic Databases Jian Li, Barna Saha, Amol Deshpande University of Maryland, College Park, USA VLDB
Applying Control Theory to Stream Processing Systems
Distributed database approach,
A paper on Join Synopses for Approximate Query Answering
Probabilistic Data Management
Probabilistic Data Management
Sam Madden MIT CSAIL With Amol Deshpande (UMD), Carlos Guestrin (CMU)
Probabilistic Databases
Presentation transcript:

Uncertain Data Management for Sensor Networks Amol Deshpande, University of Maryland (joint work w/ Bhargav Kanagal, Prithviraj Sen, Lise Getoor, Sam Madden)

Motivation: Sensor Networks Unprecedented, and rapidly increasing, instrumentation of our every-day world Huge data volumes generated continuously that must be processed in real-time Imprecise, unreliable and incomplete data Inherent measurement noises (e.g. GPS) Low success rates (e.g. RFID) Communication link or sensor node failures (e.g. wireless sensor networks) Spatial and temporal biases Typically acquisitional environments Energy-efficiency the primary concern Wireless sensor networks RFID Distributed measurement networks (e.g. GPS) Industrial Monitoring

Motivation: Uncertain Data Similar challenges in other domains Data integration Noisy data sources, automatically derived schema mappings Reputation/trust/staleness issues Information extraction Automatically extracted knowledge from text Social networks, biological networks Noisy, error-prone observations Ubiquitous use of entity resolution, link prediction etc… Need to develop database systems for efficiently representing and managing uncertainty

Example: Wireless Sensor Networks A wireless sensor network deployed to monitor temperature Moteiv Invent: 8Mhz uProc, 250kbps 2.4GHz Transreceiver 10K RAM, 48K program/ 512k data flash Rechargeable Battery (USB) Light, temperature, acceleration, and sound sensors

Example: Wireless Sensor Networks A wireless sensor network deployed to monitor temperature timeidtemp 10am120 10am221.. … 10am729 sensors select time, avg(temp) from sensors epoch 1 hour User 2. High data loss rates  averages of different sets of sensors 1. Spatially biased deployment  these are not true averages {10am, 23.5} {11am, 24} {12pm, 70} 3. Measurement errors propagated to the user

Example: Wireless Sensor Networks A wireless sensor network deployed to monitor temperature timeidtemp 10am120 10am221.. … 10am729 sensors User Impedance mismatch User wants to query the “underlying environment”, and not the sensor readings at selected locations

Example: Inferring High-level Events Inferring “transportation mode”/ “activities” Using easily obtainable sensor data (GPS, RFID proximity data) Can do much if we can infer these automatically Have access to noisy “GPS” data Infer the transportation mode: walking, running, in a car, in a bus home office

Inferring “transportation mode”/ “activities” Using easily obtainable sensor data (GPS, RFID proximity data) Can do much if we can infer these automatically office home Preferred end result: Clean path annotated with transportation mode Example: Inferring High-level Events

Data Processing Step 1 Apply a statistical model to the data Eliminate spatial/temporal biases, handle missing data through extrapolation (e.g. regression, interpolation models) Filter measurement noise (e.g. Kalman Filters) Infer hidden variables, pattern recognition (e.g. HMMs) Fault/anomaly detection Forecasting/prediction (e.g. ARIMA) No support in current database systems ! Regression/interpolation models Temperature monitoring Kalman Filters … GPS Data

Sensor Data Processing: Now Database timeidtemp 10am120 10am221.. … 10am729 Table raw-data Sensor Network 1.Extract all readings into a file 2.Run MATLAB/R/other data processing tools 3.Write output to a file/back to the database 4.Write data processing tools to process/aggregate the output (maybe using DB) 5.Decide new data to acquire User Repeat

Sensor Data Processing: What we want Database timeidtemp 10am120 10am221.. … 10am729 Table raw-data Sensor Network Models to be applied in real-time for data cleaning, forecasting, anomaly/event detection etc… User timeidtemp 10am120 10am221.. … 10am729 Table processed-data Tasks Data Continuous (standing) queries e.g. alert monitoring Results to continuous queries Ad hoc queries (possibly against processed, modeled data)

Challenges Abstractions and language constructs for pushing statistical models into databases Large diversity in the models used in practice Efficiently processing high-rate data streams Querying over probabilistic model outputs Naturally exhibit high degrees of correlations Many different types of uncertainty Model-driven data acquisition Minimize the data acquired to answer a query Need for in-network, distributed processing Global inference needed to achieve consistency

Outline Motivation Statistical modeling of sensor data Abstraction of model-based views Regression-based views Views based on dynamic Bayesian networks Query processing over model outputs Some interesting sensor network problems Model-driven data acquisition Distributed inference in sensor networks

Outline Motivation Statistical modeling of sensor data Abstraction of model-based views Regression-based views Views based on dynamic Bayesian networks Query processing over model outputs Some interesting sensor network problems Model-driven data acquisition Distributed inference in sensor networks Model-based User Views for Sensor Data; A. Deshpande, S. Madden; SIGMOD 2006

Abstraction: Model-based Views An abstraction analogous to traditional database views Provides independence from the messy measurement details acct-nobalancezipcode 101a b User avg-balances select zipcode, avg(balance) from accounts group by zipcode A traditional database view (defined using an SQL query) accounts timeidtemp 10am120 10am221.. … 10am729 temperatures Use Regression to predict missing values and to remove spatial bias A model-based database view (defined using a statistical model ) raw-temp-data User No difference from a user’s perspective

Grid Abstraction timeidtemp 10am120 10am221.. … 10am729 temperatures Use Regression to model temperature as: temp = w 1 + w 2 x + w 3 x 2 + w 4 y + w 5 y 2 A Regression-based View raw-temp-data User x y Continuous Function User x y Consistent uniform view Apply regression; Compute “temp” at grid points

MauveDB System Being written using the Apache Derby Java open source database system codebase Supports the abstraction of Model-based User Views Declarative language constructs for creating such views SQL queries over model-based views Keep the models up-to-date as new data is inserted in database

MauveDB System Architecture Query Processor View Manager Model-based view USER View Maintenance SELECT * FROM regression-view WHERE … EPOCH 30 min User Queries CREATE VIEW regression-view AS … TRAINING DATA … View Creation Sensor Data Streams Sensor Data Streams

MauveDB System Architecture Query Processor View Manager Model-based view USER View Maintenance SELECT * FROM regression-view WHERE … EPOCH 30 min User Queries CREATE VIEW regression-view AS … TRAINING DATA … View Creation Sensor Data Streams Sensor Data Streams CREATE VIEW RegView(time [0::1], x [0:100:10], y[0:100:10], temp) AS FIT temp USING time, x, y BASES 1, x, x 2, y, y 2 FOR EACH time T TRAINING DATA SELECT temp, time, x, y FROM raw-temp-data WHERE raw-temp-data.time = T Details specific to the model being used

Query Processing Key challenge: Integrating in a traditional database system Two operators per view type that support get_next() API ScanView: Returns the contents of the view one-by-one IndexView (condition): Returns tuples that match a condition e.g. return temperature where (x, y) = (10, 20) select * from locations l, reg-view r where (l.x, l.y) = (r.x, r.y) and r.time = “10am” Seqscan(l) Scanview(r) Hash join Plan 1 Seqscan(l) Indexview(r) Index join Plan 2

View Maintenance Strategies Option 1: Compute the view as needed from base data For regression view, scan the tuples and compute the weights Option 2: Keep the view materialized Sometimes too large to be practical E.g. if the grid is very fine May need to be recomputed with every new tuple insertion E.g. a regression view that fits a single function to the entire data Option 3: Lazy materialization/caching Materialize query results as computed Generic options shared between all view types

View Maintenance Strategies Option 4: Maintain an efficient intermediate representation Typically model-specific Regression-based Views Say temp = f(x, y) = w 1 h 1 (x, y) + … + w k h k (x, y) Maintain the weights for f(x, y) and a sufficient statistic Two matrices (O(k 2 ) space) that can be incrementally updated ScanView: Execute f(x, y) on all grid points IndexView: Execute f(x, y) on the specified point InsertTuple: Recompute the coefficients Can be done very efficiently using the sufficient statistic

Thoughts Table functions/User-defined functions Can be used to apply a statistical model to a raw data table Using code written in C or Java etc Must be applied repeatedly as new data items arrive No optimization opportunities Not declarative Complex data analysis tasks May not be doable using our primitives Our focus is on easy application of statistical models to data By a layperson not familiar with Matlab (or other tools)

Outline Motivation Statistical modeling of sensor data Abstraction of model-based views Regression-based views Views based on dynamic Bayesian networks Query processing over model outputs Some interesting sensor network problems Model-driven data acquisition Distributed inference in sensor networks Online filtering, smoothing, and modeling of streaming data; B. Kanagal, A. Deshpande; ICDE 2008

Dynamic Bayesian Networks A class of models that can capture temporal evolution of a complex stochastic process Widely used for many tasks Eliminating measurement noise (Kalman Filters) Anomaly/failure detection Inferring high-level hidden variables (HMMs) e.g. working status of a remote sensor, activity recognition

Inferring “transportation mode”/ “activities” Using easily obtainable sensor data (GPS, RFID proximity data) Can do much if we can infer these automatically office home Preferred end result: Clean path annotated with transportation mode Example: Inferring High-level Events

Dynamic Bayesian Networks Use a “generative model” that describes how the observations were generated Time = t MtMt XtXt OtOt Transportation Mode: Walking, Car, Bus True velocity and location Observed location Need conditional probability distributions that capture the process 1.p(X t | M t ): How (position,velocity) depends on mode 2. p(O t | X t ): The noise model for observations Prior knowledge or learned from data

Dynamic Bayesian Networks Use a “generative model” that describes how the observations were generated Time = t MtMt XtXt OtOt Transportation Mode: Walking, Car, Bus True velocity and location Observed location Need conditional pdfs: 1. p(M t+1 | M t, X t+1 ) 2. p(X t+1 | X t ) Prior knowledge or learned from data Time = t+1 M t+1 X t+1 O t+1

Dynamic Bayesian Networks Inference task: Given a sequence of observations (O t ), find most likely M t ’s that explain it. Alternatively, could provide a probability distribution on the possible M t ’s. Time = t MtMt XtXt OtOt Transportation Mode: Walking, Car, Bus True velocity and location Observed location Time = t+1 M t+1 X t+1 O t+1 Time = t+2 O t+2 M t+2 X t+2

Example DBN-based View Original noisy GPS data TIMEUSERMODE (INFERRED) Location (INFERRED) 5pmJohn Walking: 0.9 Car: 0.1 5pmJane Walking: 0.9 Car : 0.1 5:05pmJohn Walking: 0 Car: 1 TIMEUSERLocation (Observed) 5pmJohn(x1,y1) 5pmJane(x1’,y1’) 5:05pmJohn(x2,y2) User view of the data - Smoothed locations - Inferred variables Can query inferred variables: select count(*) group by mode sliding window 5 min User

Representing DBN-based Views TIMEUSERMODE (INFERRED) Location (INFERRED) 5pmJohn Walking: 0.9 Car: 0.1 5pmJane Walking: 0.9 Car : 0.1 5:05pmJohn Walking: 0 Car: 1 Challenges  Probabilistic attributes  Strong spatial and temporal Correlations Particle-based Representation  Each tuple stored as a set of weighted samples  Naturally ties in with inference  Efficient query processing using existing infrastructure idTIMEUSERMODELOCATIONweight... ……… 15pmJohnW(a1,b1) pmJohnW(a2,b2) pmJohnW(a3,b3) pmJohnC(a4,b4)0.01 ……………… PARTICLE TABLE

Architecture Query Processor View Manager Model-based view USER 1325 View Maintenance Sensor Data Streams Sensor Data Streams CREATE VIEW dpmview DPM hmm.dpm STREAM sensors View Creation Particle Tables TIMEUSERMODE (INFERRED) Location (INFERRED) 5pmJohn Walking: 0.9 Car: 0.1 5pmJane Walking: 0.9 Car : 0.1 5:05pmJohn Walking: 0 Car: 1 idTIM E USERMODELOCATIONweight 15pmJohnW(a1,b1) pmJohnW…0.02 e.g. using particle filters SELECT time, user, location FROM dpmview WHERE mode = “W” WITH CONFIDENCE 0.95 User Queries

USER TIMEUSERMODE (INFERRED) Location (INFERRED) 5pmJohn Walking: 0.9 Car: 0.1 5pmJane Walking: 0.9 Car : 0.1 5:05pmJohn Walking: 0 Car: 1 View Manager Query Processor Model-based view View Maintenance Query Processing idTIM E USERMODELOCATIONweight 15pmJohnW(a1,b1) pmJohnW…0.02 Particle Tables SELECT time, user, SUM(location*weight) FROM particles p1 GROUP BY time HAVING 0.95 < SELECT SUM(weight) FROM particles p2 WHERE p1.time = p2.time AND status = “W” Able to support single-table select, project & aggregate queries Can reason about spatial correlations However, temporal correlations ignored SELECT time, user, location FROM dpmview WHERE mode = “W” WITH CONFIDENCE 0.95 User Queries

Outline Motivation Statistical modeling of sensor data Abstraction of model-based views Regression-based views Views based on dynamic Bayesian networks Query processing over model outputs Some interesting sensor network problems Model-driven data acquisition Distributed inference in sensor networks Representing and Querying Correlated Tuples in Probabilistic Databases; P. Sen, A. Deshpande; ICDE 2007 Efficient Query Evaluation over Temporally Correlated Probabilistic Streams; B. Kanagal, A. Deshpande; ICDE 2009 Shared Correlations in Probabilistic Databases; P.Sen, A. Deshpande, L. Getoor, VLDB 2008

Querying Model Outputs Challenges: The model outputs typically probabilistic Strong spatial and temporal correlations Continuous queries over streaming data Numerous approaches proposed in recent years Typically make strong independence assumptions Limited support for attribute-value uncertainty In spite of that, query evaluation known to be #P-Hard Our goal: Develop a general, uniform framework that… Captures both tuple-existence and attribute-value uncertainties Can reason about correlations in the data Can handle continuous queries over probabilistic streams

Overview of Our Approach Represent the uncertainties and correlations graphically using small functions called factors Concepts borrowed from the graphical models literature TIMEUSERMODE (inferred) LOCATION (inferred) 5pmJohn Walking: 0.9 Car: 0.1 5pmJane Walking: 0.9 Car : 0.1 5:05pmJohn Walking: 0.1 Car: 0.9 5:05pmJane Walking: 0.1 Car: 0.9 ………… TIMEUSERMODE (inferred) LOCATION (inferred) 5pmJohn 5pmJane 5:05p m John 5:05p m Jane ………… f() WW1 WC0 CW0 CC1 M 5pm John M 5pm Jane M 5:05pm John M 5:05pm Jane L 5pm John L 5pm Jane L 5:05pm John L 5:05pm Jane M 5pm John M 5pm Jane

Overview of Our Approach Represent the uncertainties and correlations graphically using small functions called factors Concepts borrowed from the graphical models literature S ABprob s1‘m ’ 10.6 s2‘n’10.5 T CDprob t11‘p’0.4 s1f 1 (s1) s2t1f 2 (s2, t1) s2 and t1 mutually exclusive s1s1 s1s1 s2s2 s2s2 t1 f 1 (s1) f 2 (s2, t1)

Overview of Our Approach During query processing, add new factors corresponding to intermediate tuples Example query: S AB s1‘m ’ 1 s2‘n’1 T CD t11‘p’ s1s1 s1s1 s2s2 s2s2 t1 f 1 (s1) f 2 (s2, t1) π D (S B=C T) ABCD i1‘m’11‘p’ i2‘n’11‘p’ πDπD D r1‘p ’ r1 f OR (i1,i2,r1) i1 f AND (s1,t1,i1) i2 f AND (…)

Overview of Our Approach Query evaluation ≡ Inference !! Can use standard techniques like variable elimination Can exploit the structure in probabilistic databases for scalable inference s1s1 s1s1 s2s2 s2s2 t1 f 1 (s1) f 2 (s2, t1) i1 i2 r1 f OR (i1,i2,r1) f AND (i1,i2,r1) f AND (…) S AB s1‘m ’ 1 s2‘n’1 T CD t11‘p’ ABCD i1‘m’11‘p’ i2‘n’11‘p’ πDπD D r1‘p ’ See Prithvi’s talk for more details

Querying Probabilistic Streams Need to support “continuous” queries over “sliding windows” “alert me when the number of people in a mall exceeds 1000” Must take spatial correlations into account “how many people drove for at least one hour yesterday” Can’t ignore the temporal correlations in the data Observations: Probabilistic streams typically obey “Markovian” property Variables at times “t” and “t+2” are independent given the values of the variables at time “t+1” Although the actual parameters change, the correlation “structure” remains unchanged across time At every instance, we get the same set of input factors with different probability numbers

Querying Probabilistic Streams Brief summary of the key ideas: Extend the query language to support MAP (using Viterby’s algorithm) and ML operations over probabilistic streams Augment the “schema” of the probabilistic streams to include the correlation structure Implement the operators to support the iterator interface Only the parameters are transferred from operator to operator Enables efficient, incremental processing of new inputs Choose query plans that postpone generation of intermediate non-Markovian streams as long as possible

Ongoing and Future Work Developing APIs for adding arbitrary models Minimize the work of the model developer Identify intermediate representations useful across classes of models Designing index structures for querying, updating large collections of uncertain facts Approximate inference techniques for more efficient query processing

Outline Motivation Statistical modeling of sensor data Abstraction of model-based views Regression-based views Views based on dynamic Bayesian networks Query processing over model outputs Some interesting sensor network problems Model-driven data acquisition Distributed inference in sensor networks Model-Driven Data Acquisition in Sensor Networks; A. Deshpande et al., VLDB 2004

Model-based Query Processing Declarative Query Select nodeID, temp ±.1C, conf(.95) Where nodeID in {1..6} Observation Plan {[temp, 1], [voltage, 3], [voltage, 6]} Data 1, temp = 22.73, 3, voltage = , voltage = 2.65 USER SENSOR NETWORK Query Results 1, 22.73, 100% … 6, 22.1, 99% Probabilistic Model Query Processor

Model-based Query Processing Declarative Query Select nodeID, temp ±.1C, conf(.95) Where nodeID in {1..6} Observation Plan {[temp, 1], [voltage, 3], [voltage, 6]} Data 1, temp = 22.73, 3, voltage = , voltage = 2.65 USER SENSOR NETWORK Query Results 1, 22.73, 100% … 6, 22.1, 99% Probabilistic Model Query Processor Advantages: - Exploit correlations for efficient approximate query processing - Handle noise, biases in the data - Predict missing or future values

Model-based Query Processing Declarative Query Select nodeID, temp ±.1C, conf(.95) Where nodeID in {1..6} Observation Plan {[temp, 1], [voltage, 3], [voltage, 6]} Data 1, temp = 22.73, 3, voltage = , voltage = 2.65 USER SENSOR NETWORK Query Results 1, 22.73, 100% … 6, 22.1, 99% Probabilistic Model Query Processor Advantages: - Exploit correlations for efficient approximate query processing - Handle noise, biases in the data - Predict missing or future values Many interesting research challenges: - Finding optimal data collection paths - Different type of queries (max/min, top-k) - Learning, re-training models - Long-term planning, Continuous queries - …

Outline Motivation Statistical modeling of sensor data Abstraction of model-based views Regression-based views Views based on dynamic Bayesian networks Query processing over model outputs Some interesting sensor network problems Model-driven data acquisition Distributed inference in sensor networks

Distributed, In-network Inference Often need to do in-network, distributed inference Target tracking through information fusion Optimal control (for actuation) Distributed sensor calibration (using neighboring sensors) In-network regression or function fitting Need to reconcile information across all sensors

Distributed, In-network Inference Often need to do in-network, distributed inference Target tracking through information fusion Optimal control (for actuation) Distributed sensor calibration (using neighboring sensors) In-network regression or function fitting Obey a common structure: Each sensor has/observes some local information Information across sensors is correlated … must be combined together to form a global picture The global picture (or relevant part thereof) should be sent to each sensor

Distributed, In-network Inference Naïve option: Collect all data at the centralized base station – too expensive Using graphical models Form a junction tree on the nodes directly Use message passing/loopy propagation for globally consistent view

Distributed, In-network Inference Naïve option: Collect all data at the centralized base station – too expensive Using graphical models Form a junction tree on the nodes directly Use message passing/loopy propagation for globally consistent view

Conclusions Increasing number of applications generate and need to process uncertain data Statistical/probabilistic modeling provide an elegant framework to handle such data But little support in current database systems MauveDB Supports the abstraction of Model-based User Views Enables declarative querying over noisy, imprecise data Exploits commonalities to define, to create, and to process queries over such views

Conclusions Prototype implementation Using the Apache Derby open source DBMS Supports Regression-, Interpolation-, and DBN-based views Supports many different view maintenance strategies Probabilistic databases Increasingly important research area Designed a uniform and general framework for representing and querying uncertain data with correlations New inference techniques that exploit the structure in probabilistic databases

Thank you !! Questions ?