Sam Madden MIT CSAIL With Amol Deshpande (UMD), Carlos Guestrin (CMU)

Using Probabilistic Models for Data Management in Acquisitional Environments
Sam Madden, MIT CSAIL. With Amol Deshpande (UMD), Carlos Guestrin (CMU)

Overview
Querying to monitor distributed systems: sensor-actuator networks (e.g., Berkeley Motes), distributed databases, distributed P2P systems, web querying.
Issues: missing, uncertain data; high acquisition and querying costs.
I’m not proposing a complete system! Probabilistic models provide a framework for dealing with all of these issues.

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

Not your mother’s DBMS
Data doesn’t exist a priori: acquisition happens in the DBMS.
Insufficient bandwidth: selective observation.
Sometimes, desired data is unavailable: must be robust to loss.
Critical issue: given a limited amount of noisy, lossy data, how can users interpret answers?

Data is correlated
Temperature and voltage, temperature and light, temperature and humidity, temperature and time of day, etc.

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

Solution: Probabilistic Models
Probability distribution (PDF) to estimate current state. Model captures correlation between variables. Directly answer queries from the PDF.
Incorporate new observations via probabilistic inference on the model.
Model the passage of time via a transition model (e.g., Kalman filters).
Models learned from historical data.
[Figure: transition model between time steps t0 and t1]
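
The two model operations named on this slide, incorporating an observation and modeling the passage of time, can be sketched for the simplest case. This is a minimal illustration, assuming a bivariate Gaussian over temperature and voltage and a random-walk transition model; the function names and all numeric values are hypothetical and are not taken from the talk.

```python
def condition_on_observation(mean_x, mean_y, var_x, var_y, cov_xy, y_observed):
    """Posterior mean and variance of X after observing Y (bivariate Gaussian)."""
    gain = cov_xy / var_y                      # how strongly Y informs X
    posterior_mean = mean_x + gain * (y_observed - mean_y)
    posterior_var = var_x - gain * cov_xy      # observing Y never increases variance
    return posterior_mean, posterior_var


def advance_time(mean, var, process_var):
    """Random-walk transition (assumed): mean is unchanged, uncertainty grows."""
    return mean, var + process_var


# Hypothetical prior: temperature ~ N(20, 4), voltage ~ N(2.7, 0.01), covariance 0.18.
temp_mean, temp_var = condition_on_observation(
    20.0, 2.7, 4.0, 0.01, 0.18, y_observed=2.8
)
# One time step passes without a new observation.
temp_mean, temp_var = advance_time(temp_mean, temp_var, process_var=0.5)
print(f"temperature estimate: {temp_mean:.2f} (variance {temp_var:.2f})")
```

Observing the cheap, correlated attribute (voltage) tightens the temperature estimate without sampling temperature itself; the transition step then widens it again until the next observation arrives.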

Architecture: Model-driven Sensornet DBMS
[Diagram labels: Query, New Query, Probabilistic Model, Data gathering plan, Condition on new observations, posterior belief]
Advantages vs. “Best-Effort Query-Everything”: observe fewer attributes; exploit correlations; reuse information between queries; directly deal with missing data; answer more complex (probabilistic) queries.
Example: “SELECT nodeid,temp FROM sensors CONF .95 TO ± .5°”

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

New Types of Queries
Architecture enables efficient execution of many new queries.
Approximate queries: “Tell me the temperature to within ± .5 degrees with 95% confidence.”
Query: SELECT nodeId, temp ± 0.5°C, conf(.95) FROM sensors WHERE nodeId in {1..8}
System selects and observes a subset of the available nodes. Observed nodes: {3,6,8}
Query result:
Node:  1     2     3     4     5     6     7     8
Temp.: 17.3  18.1  17.4  16.1  19.2  21.3  17.5  16.3
Conf.: 98% 95% 100% 99%
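
One way to read the confidence column: if the model's posterior for a node is Gaussian, the confidence that the true value lies within ± ε of the reported mean follows directly from the posterior standard deviation. The sketch below assumes Gaussian posteriors; the helper names and the standard deviations are made up for illustration, and a directly observed node is treated as having zero uncertainty.

```python
import math


def confidence_within(epsilon, std_dev):
    """P(|X - mean| <= epsilon) for a Gaussian with the given standard deviation."""
    if std_dev == 0.0:
        return 1.0  # assumed: a directly observed node is known exactly
    return math.erf(epsilon / (std_dev * math.sqrt(2.0)))


def nodes_needing_observation(posterior_std, epsilon=0.5, confidence=0.95):
    """Node ids whose model-only answer misses the requested confidence."""
    return [
        node
        for node, std_dev in sorted(posterior_std.items())
        if confidence_within(epsilon, std_dev) < confidence
    ]


# Hypothetical posterior standard deviations per node (degrees C).
posterior_std = {1: 0.2, 2: 0.25, 3: 0.0, 4: 0.6}
print(nodes_needing_observation(posterior_std))
```

Nodes returned by `nodes_needing_observation` are candidates for the data gathering plan; the rest can be answered from the model alone.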

Probabilistic Query Optimization Problem
What observations will satisfy confidence bounds at minimum cost?
Must define cost metric and model. Sensornets: metric = power, cost = sensing + comm.
Decide if a set of observations satisfies bounds.
Choose a search strategy.

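Choosing observation plan
Query predicate: $P(X_i \in [a,b]) > 1-\delta$
Is a subset $S$ sufficient?
If we observe $S = s$, the reward is $R_i(s) = \max\{P(X_i \in [a,b] \mid s),\ 1 - P(X_i \in [a,b] \mid s)\}$
The value of $S$ is unknown in advance: $R_i(S) = \int P(s)\, R_i(s)\, ds$
Optimization problem: find the minimum-cost $S$ whose $R_i(S)$ satisfies the confidence bound. Pick your favorite search strategy.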
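
The expected reward can be computed exactly when the variables are discrete, with the integral becoming a sum over observed values. The following sketch does that and then uses exhaustive search as the "favorite search strategy". It is an illustration under stated assumptions, not the authors' planner: the joint distribution, costs, and function names are all hypothetical.

```python
from itertools import combinations


def expected_reward(joint, target, lo, hi, subset):
    """R_i(S) = sum_s P(s) * max{P(X_i in [lo,hi] | s), 1 - P(X_i in [lo,hi] | s)}."""
    groups = {}  # observed values s -> (P(s), P(s and X_i in range))
    for assignment, prob in joint.items():
        s = tuple(assignment[j] for j in subset)
        total, hit = groups.get(s, (0.0, 0.0))
        in_range = lo <= assignment[target] <= hi
        groups[s] = (total + prob, hit + (prob if in_range else 0.0))
    # P(s) * max(p, 1 - p) simplifies to max(hit, total - hit).
    return sum(max(hit, total - hit) for total, hit in groups.values())


def cheapest_plan(joint, target, lo, hi, costs, delta):
    """Exhaustive search for the lowest-cost subset with R_i(S) >= 1 - delta."""
    best = None
    for size in range(len(costs) + 1):
        for subset in combinations(range(len(costs)), size):
            if expected_reward(joint, target, lo, hi, subset) >= 1 - delta:
                cost = sum(costs[j] for j in subset)
                if best is None or cost < best[1]:
                    best = (subset, cost)
    return best


# Hypothetical joint over (temperature, voltage level); voltage is cheap to read.
joint = {(18, 0): 0.45, (18, 1): 0.05, (22, 0): 0.05, (22, 1): 0.45}
costs = [10, 1]  # assumed cost of observing temperature, voltage
print(cheapest_plan(joint, target=0, lo=20, hi=25, costs=costs, delta=0.15))
```

With a loose bound the cheap, correlated attribute suffices; tightening `delta` forces the planner to pay for the direct temperature reading. Exhaustive search is exponential in the number of attributes, so a real planner would substitute a greedy or branch-and-bound strategy.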

More New Queries
Outlier queries: “Report temperature readings that have a 1% or less chance of occurring.”
Extend architecture with local filters.
[Diagram labels: User, Central Model, Local Models, Update Models, Transmit Outliers]
Issues: bias, inefficiency.
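
A local filter of this kind could be as simple as a tail-probability test against the node's local model. The sketch below assumes a Gaussian local model and a two-sided tail; the function names, the example readings, and the choice of a two-sided test are illustrative assumptions rather than details from the talk.

```python
import math


def tail_probability(reading, mean, std_dev):
    """Two-sided probability of a reading at least this far from the mean."""
    z = abs(reading - mean) / std_dev
    return math.erfc(z / math.sqrt(2.0))


def should_transmit(reading, mean, std_dev, threshold=0.01):
    """Transmit only readings the local model considers unlikely (<= threshold)."""
    return tail_probability(reading, mean, std_dev) <= threshold


# Hypothetical local model: temperature ~ N(20, 2^2).
for reading in (21.0, 26.0):
    action = "transmit" if should_transmit(reading, 20.0, 2.0) else "suppress"
    print(f"{reading:.1f} -> {action}")
```

Because suppressed readings never reach the central model, it only ever sees outliers, which is one way the bias issue noted on the slide can arise.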

Even More New Queries
Prediction queries: “What is the expected temperature at 5PM today, given that it is very humid?”
Influence queries: “What percentage of network traffic at site A is explained by traffic at sites B and C?”
Queries could not be answered without a model!

UI Issues
How to make probability “intuitive”? How to allow users to express queries?
Issues: query language, UI.
[Figure: Load vs. Time]

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

Applications
Sensor-based building monitoring: often battery powered; 100s-1000s of nodes.
Example: HVAC control. Tolerant of approximate answers; reduction in energy significant.

App: Distributed System Monitoring
Goal: detect/predict overload, reprovision.
Many metrics may indicate overload: disk usage, CPU load, network load, network latency, active queries, etc. Each has a cost to observe.
Problem: what metrics foreshadow overload?
Solution: train on data labeled with overload status; choose an observation plan that predicts the label.

Other Apps
Stream load shedding; sensor network intrusion detection; database statistics. See paper!

Outline: Motivation; Probabilistic Models; New Queries and UI; Applications; Challenges and Concluding Remarks

Extension, Not Restriction
Possible to have many views of the same data: different models, base data.
[Diagram labels: Query, Query, Integration Layer, Model 1 (Discrete, Histograms), Model 2 (Gaussians), Acquisition Layer + Tabular Data, System State]
Number of architectural challenges.

Every rose…
Models can fail to capture details. Models can be wrong. Models can be expensive to build. Models can be expensive to maintain.
Paper suggests a number of known techniques from the ML community.

Whither hence?
See the paper for technical details.
See other work: probabilistic data models; outlier and change detection.
Generalize these ideas to: new models; non-numeric types; new environments, queries.
Make some AI and stats friends.

Conclusions
Emerging data management opportunities: ad-hoc networks of tiny devices; large scale distributed system monitoring.
These environments are acquisitional and loss-prone.
Probabilistic models are an essential tool: tolerate missing data; answer sophisticated new queries; framework for efficient acquisitional execution.

Questions

App: Value-Based Load Shedding
User prioritizes some output values over others. May have to shed load.
Issue: what inputs correspond to desired outputs? Especially hard for aggregates, UDFs.
Can learn a probabilistic model that gives P(output value | input tuple). Requires source tuple references on result tuples.
Use this model to decide which tuples to drop.
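
The last step, using the learned model to decide which tuples to drop, might look like the sketch below. It assumes the model is available as a function returning P(desired output | input tuple); the `shed_load` helper, the stand-in model, and the tuple layout are hypothetical.

```python
def shed_load(tuples, value_probability, capacity):
    """Keep the `capacity` tuples most likely to yield a desired output."""
    ranked = sorted(tuples, key=value_probability, reverse=True)
    return ranked[:capacity], ranked[capacity:]


def learned_model(input_tuple):
    """Stand-in for a learned P(output value is desired | input tuple)."""
    return 0.9 if input_tuple["temp"] > 25.0 else 0.1


# Hypothetical input stream; the system can only process two tuples.
stream = [
    {"node": 1, "temp": 19.0},
    {"node": 2, "temp": 27.5},
    {"node": 3, "temp": 21.0},
    {"node": 4, "temp": 30.1},
]
kept, dropped = shed_load(stream, learned_model, capacity=2)
print([t["node"] for t in kept], [t["node"] for t in dropped])
```

Ranking by probability is one simple policy; a probabilistic drop (keeping each tuple with probability proportional to its score) would trade some output value for less biased results.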