Using Probabilistic Models for Data Management in Acquisitional Environments Sam Madden MIT CSAIL With Amol Deshpande (UMD), Carlos Guestrin (CMU)

Presentation transcript:

Using Probabilistic Models for Data Management in Acquisitional Environments Sam Madden MIT CSAIL With Amol Deshpande (UMD), Carlos Guestrin (CMU)

Overview
Querying to monitor distributed systems
– Sensor-actuator networks
– Distributed databases
Issues
– Missing, uncertain data
– High acquisition, querying costs
Probabilistic models provide a framework for dealing with all of these issues
I’m not proposing a complete system!
[Figure labels: Berkeley Mote; Distributed P2P]

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

Not your mother’s DBMS
Data doesn’t exist a priori
– Acquisition in DBMS
Insufficient bandwidth
– Selective observation
Sometimes, desired data is unavailable
– Must be robust to loss
Critical issue: given a limited amount of noisy, lossy data, how can users interpret answers?

Data is correlated
– Temperature and voltage
– Temperature and light
– Temperature and humidity
– Temperature and time of day
– etc.
[Image source: Google.com]

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

Solution: Probabilistic Models
Probability distribution (PDF) to estimate current state
Model captures correlation between variables
Directly answer queries from PDF
Incorporate new observations
– Via probabilistic inference on model
Model the passage of time
– Via transition model (e.g., Kalman filters)
Models learned from historical data
[Figure: transition model between times t0 and t1]
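
The two operations named on this slide, conditioning on a new observation and stepping the model forward in time, can be sketched for the simplest possible case. This is a minimal illustration, assuming a two-variable Gaussian over temperature and voltage and a random-walk transition model; the function names and every number are hypothetical and are not taken from the talk.

```python
import math

def condition_bivariate_gaussian(mu_x, mu_y, var_x, var_y, cov_xy, y_obs):
    """Return (mean, variance) of X given Y = y_obs under a joint Gaussian."""
    gain = cov_xy / var_y                      # how strongly Y informs X
    post_mean = mu_x + gain * (y_obs - mu_y)   # shift the mean toward the evidence
    post_var = var_x - gain * cov_xy           # uncertainty never increases
    return post_mean, post_var

def predict_step(mean, var, drift, process_var):
    """Advance a 1-D Gaussian belief by one time step (Kalman-style prediction).

    Assumes a random walk with constant drift; process_var is the extra
    uncertainty accumulated over the step.
    """
    return mean + drift, var + process_var

# Hypothetical prior: temperature ~ N(20, 4) deg C, voltage ~ N(2.8, 0.01) V,
# covariance 0.18. Observing the (cheap) voltage sharpens the temperature belief.
temp_mean, temp_var = condition_bivariate_gaussian(
    mu_x=20.0, mu_y=2.8, var_x=4.0, var_y=0.01, cov_xy=0.18, y_obs=2.9)

# One time step later the belief is wider again until a new observation arrives.
next_mean, next_var = predict_step(temp_mean, temp_var, drift=0.0, process_var=0.25)
print(round(temp_mean, 3), round(temp_var, 3), round(next_var, 3))
```

The posterior variance depends only on the covariances, not on the observed value, which is what lets a planner judge an observation's usefulness before acquiring it.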

tt “SELECT nodeid,temp FROM sensors CONF.95 TO ±.5°” Architecture: Model-driven Sensornet DBMS Probabilistic Model Query Data gathering plan Condition on new observations New Query posterior belief Advantages vs. “Best-Effort Query-Everything”  Observe fewer attributes  Exploit correlations  Reuse information between queries  Directly deal with missing data  Answer more complex (probabilistic) queries

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

New Types of Queries
Architecture enables efficient execution of many new queries
Approximate queries
– “Tell me the temperature to within ±.5 degrees with 95% confidence.”
Query: SELECT nodeId, temp ± 0.5°C, conf(.95) FROM sensors WHERE nodeId in {1..8}
System selects and observes a subset of the available nodes
– Observed nodes: {3,6,8}
Query result
Node  Temp  Conf
1     …     98%
2     …     95%
3     …     100%
4     …     99%
5     …     95%
6     …     100%
7     …     98%
8     …     100%

Probabilistic Query Optimization Problem
What observations will satisfy confidence bounds at minimum cost?
– Must define cost metric and model (sensornets: metric = power, cost = sensing + comm)
– Decide if a set of observations satisfies bounds
– Choose a search strategy

Choosing observation plan
Query predicate: $P(X_i \in [a,b]) > 1-\delta$
Is a subset $S$ sufficient?
If we observe $S = s$: $R_i(s) = \max\{\,P(X_i \in [a,b] \mid s),\; 1 - P(X_i \in [a,b] \mid s)\,\}$
Value of $S$ is unknown: $R_i(S) = \int P(s)\, R_i(s)\, ds$ (the reward)
Optimization problem
– Pick your favorite search strategy
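
The reward definitions above can be evaluated numerically in a toy setting. The following is a sketch only, assuming a two-variable Gaussian (query variable X, one cheap correlated observation Y), midpoint-rule integration for the expected reward, and plain enumeration as the search strategy. The plan names, costs, and the assumption that reading X directly is noise-free (reward 1.0) are all hypothetical.

```python
import math

def normal_cdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def reward(mean, var, a, b):
    """R(s) = max{P(X in [a,b] | s), 1 - P(X in [a,b] | s)} for a Gaussian belief."""
    p = normal_cdf(b, mean, var) - normal_cdf(a, mean, var)
    return max(p, 1.0 - p)

def expected_reward_observing_y(mu_x, mu_y, var_x, var_y, cov_xy, a, b, steps=2000):
    """R(S) = integral of p(y) R(y) dy, midpoint rule over mu_y +/- 6 std devs."""
    sd_y = math.sqrt(var_y)
    lo, width = mu_y - 6.0 * sd_y, 12.0 * sd_y / steps
    gain = cov_xy / var_y
    post_var = var_x - gain * cov_xy          # same for every observed y
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * width
        post_mean = mu_x + gain * (y - mu_y)  # belief about X if Y = y is seen
        total += normal_pdf(y, mu_y, var_y) * reward(post_mean, post_var, a, b) * width
    return total

def cheapest_sufficient_plan(plans, delta):
    """plans: list of (name, cost, expected_reward); cheapest with reward > 1 - delta."""
    ok = [p for p in plans if p[2] > 1.0 - delta]
    return min(ok, key=lambda p: p[1]) if ok else None

# Hypothetical model: is the temperature X within [18, 22] deg C?
model = dict(mu_x=20.0, mu_y=2.8, var_x=4.0, var_y=0.01, cov_xy=0.18)
r_none = reward(model["mu_x"], model["var_x"], 18.0, 22.0)
r_voltage = expected_reward_observing_y(a=18.0, b=22.0, **model)
plans = [("observe nothing", 0.0, r_none),
         ("observe voltage", 1.0, r_voltage),
         ("observe temperature", 10.0, 1.0)]   # assumes a noise-free direct reading
print(r_none, r_voltage, cheapest_sufficient_plan(plans, delta=0.05))
```

Enumeration is only workable for a handful of candidate plans; with many attributes the set of subsets grows exponentially, which is why the slide leaves the search strategy open.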

More New Queries
Outlier queries
– “Report temperature readings that have a 1% or less chance of occurring.”
Extend architecture with local filters
[Figure labels: User; Central Model; Local Models; Transmit Outliers; Update Models]
Issues:
– Bias
– Inefficiency
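
A local filter of the kind described here might look like the sketch below. It assumes each node holds a one-dimensional Gaussian local model and interprets "1% or less chance of occurring" as a two-sided tail probability of at most 0.01; both choices, and all names and numbers, are illustrative assumptions rather than details from the talk.

```python
import math

def tail_probability(reading, mean, var):
    """Two-sided probability of a value at least this far from the mean."""
    z = abs(reading - mean) / math.sqrt(var)
    return math.erfc(z / math.sqrt(2.0))

def is_outlier(reading, mean, var, threshold=0.01):
    return tail_probability(reading, mean, var) <= threshold

def readings_to_transmit(readings, mean, var, threshold=0.01):
    """Local filter: only readings the local model finds improbable are sent."""
    return [r for r in readings if is_outlier(r, mean, var, threshold)]

# Hypothetical local model: temperature ~ N(20, 4) deg C.
print(readings_to_transmit([19.5, 21.0, 26.0, 24.0, 13.0], mean=20.0, var=4.0))
```

Filtering at the node saves communication, but the central model then sees only improbable values, which is one way the bias listed under "Issues" can arise.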

Even More New Queries
Prediction queries
– “What is the expected temperature at 5PM today, given that it is very humid?”
Influence queries
– “What percentage of network traffic at site A is explained by traffic at sites B and C?”
Queries could not be answered without a model!
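
One possible reading of the influence query is "what fraction of the variance of A is removed by conditioning on B and C". The sketch below computes that fraction under an assumed joint Gaussian over the three traffic volumes; the interpretation, the function name, and the covariance values are assumptions made for illustration.

```python
def fraction_explained(var_a, cov_ab, cov_ac, var_b, var_c, cov_bc):
    """Fraction of Var(A) explained by (B, C) under a joint Gaussian.

    Computes Sigma_ao * inverse(Sigma_oo) * Sigma_oa / Var(A), where o = (B, C),
    using the closed-form inverse of the 2x2 covariance of (B, C).
    """
    det = var_b * var_c - cov_bc ** 2
    if det <= 0.0:
        raise ValueError("covariance of (B, C) must be positive definite")
    explained = (cov_ab ** 2 * var_c
                 - 2.0 * cov_ab * cov_ac * cov_bc
                 + cov_ac ** 2 * var_b) / det
    return explained / var_a

# Hypothetical covariances between traffic at sites A, B and C.
share = fraction_explained(var_a=4.0, cov_ab=2.0, cov_ac=1.0,
                           var_b=4.0, var_c=1.0, cov_bc=0.0)
print(f"{100 * share:.0f}% of the variance at A is explained by B and C")
```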

UI Issues
How to make probability “intuitive”?
How to allow users to express queries?
Issues
– Query Language
– UI
[Figure: Load vs. Time]

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

Applications
Sensor-based Building Monitoring
– Often battery powered
– 100s-1000s of nodes
Example: HVAC Control
– Tolerant of approximate answers
– Reduction in energy significant

App: Distributed System Monitoring
Goal: detect/predict overload, reprovision
Many metrics that may indicate overload
– Disk usage, CPU load, network load, network latency, active queries, etc.
– Cost to observe
Problem: What metrics foreshadow overload?
Soln:
– Train on data labeled w/ overload status
– Choose obs. plan that predicts label

Other Apps
– Stream load shedding
– Sensor network intrusion detection
– Database statistics
See paper!

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

Extension, Not Restriction
[Figure labels: Acquisition Layer + Tabular Data; Model 1; Model 2; System State; Query; Gaussians; Discrete (Histograms); Integration Layer]
Possible to have many views of same data
– Different models
– Base data
Number of architectural challenges

Every rose…
Models can fail to capture details
Models can be wrong
Models can be expensive to build
Models can be expensive to maintain
Paper suggests a number of known techniques from the ML community.

Whither hence?
See the paper for technical details
See other work
– Probabilistic data models
– Outlier and change detection
Generalize these ideas to:
– New models
– Non-numeric types
– New environments, queries
Make some AI and stats friends

Conclusions
Emerging data management opportunities:
– Ad-hoc networks of tiny devices
– Large scale distributed system monitoring
These environments are:
– Acquisitional
– Loss-prone
Probabilistic models are an essential tool
– Tolerate missing data
– Answer sophisticated new queries
– Framework for efficient acquisitional execution

Questions

Example: TinyDB
Declarative queries for sensornets
    SELECT roomNo, AVG(temp)
    FROM sensors
    GROUP BY roomNo
    HAVING MAX(light) > 100 lux
    SAMPLE PERIOD 1 s
Queries flooded, reverse-flood aggregation
Best effort

TinyDB Limitations
Difficult to interpret answers
– Answering nodes can change between samples
Limited queries
– No historical trends
– No future predictions
– No outlier detection
High overhead
– Query flooding, full network traversal
– No information sharing between sample periods

App: Value-Based Load Shedding
User prioritizes some output values over others
– May have to shed load
Issue: what inputs correspond to desired outputs?
– Esp. hard for aggregates, UDFs
Can learn a probabilistic model that gives P(output value | input tuple)
– Requires source tuple references on result tuples
Use this model to decide which tuples to drop
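
The final step, using P(output value | input tuple) to decide which tuples to drop, can be sketched as a ranking problem. This is a minimal illustration in which the learned model is stood in for by a hypothetical lookup table keyed on a tuple field; the tuple layout, the table, and the capacity are invented for the example.

```python
def shed_load(tuples, value_probability, capacity):
    """Keep the `capacity` tuples most likely to produce a prioritized output.

    value_probability(t) plays the role of the learned P(output value | input tuple).
    Survivors are returned in their original arrival order.
    """
    if capacity >= len(tuples):
        return list(tuples)
    ranked = sorted(range(len(tuples)),
                    key=lambda i: value_probability(tuples[i]),
                    reverse=True)
    keep = set(ranked[:capacity])
    return [t for i, t in enumerate(tuples) if i in keep]

# Hypothetical stand-in for a learned model: probability that a tuple from a
# given room contributes to an output value the user cares about.
P_PRIORITY_GIVEN_ROOM = {"server_room": 0.9, "lobby": 0.2, "office": 0.4}

def value_probability(t):
    return P_PRIORITY_GIVEN_ROOM.get(t["room"], 0.0)

stream = [{"room": "lobby", "temp": 21.0},
          {"room": "server_room", "temp": 31.5},
          {"room": "office", "temp": 22.3},
          {"room": "server_room", "temp": 30.9}]
print(shed_load(stream, value_probability, capacity=2))
```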

Coping with Complexity Graphical models

Coping with Mistakes Retraining models

Prior Work
On probabilistic data models / statistics
– Not addressing issue of data acquisition
On using models to decide what to capture
– Tends to focus on performance issues
Our concerns:
– What data to acquire?
– Interpretability given missing data