Published by Elwin Stanley. Modified over 9 years ago.
1
Using Probabilistic Models for Data Management in Acquisitional Environments
Sam Madden, MIT CSAIL
With Amol Deshpande (UMD), Carlos Guestrin (CMU)
2
Overview
Querying to monitor distributed systems
– Sensor-actuator networks
– Distributed databases
Issues
– Missing, uncertain data
– High acquisition, querying costs
Probabilistic models provide a framework for dealing with all of these issues
I’m not proposing a complete system!
[Figures: Berkeley Mote; distributed P2P system]
3
Outline Motivation Probabilistic Models New Queries and UI Applications Challenges and Concluding Remarks
4
Outline Motivation Probabilistic Models New Queries and UI Applications Challenges and Concluding Remarks
5
Not your mother’s DBMS
Data doesn’t exist a priori
– Acquisition in DBMS
Insufficient bandwidth
– Selective observation
Sometimes, desired data is unavailable
– Must be robust to loss
Critical issue: given a limited amount of noisy, lossy data, how can users interpret answers?
6
Data is correlated
Temperature and voltage
Temperature and light
Temperature and humidity
Temperature and time of day
etc.
Source: Google.com
7
Outline Motivation Probabilistic Models New Queries and UI Applications Challenges and Concluding Remarks
8
Solution: Probabilistic Models
Probability distribution (PDF) to estimate current state
Model captures correlation between variables
Directly answer queries from PDF
Incorporate new observations
– Via probabilistic inference on model
Model the passage of time
– Via transition model (e.g., Kalman filters)
Models learned from historical data
[Figure: transition model carrying the distribution from time t0 to time t1]
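The two update steps named on this slide can be sketched in a few lines. This is a minimal illustration, assuming a two-variable Gaussian model (temperature and battery voltage) and a one-dimensional random-walk transition; the function names and all numbers are hypothetical and do not come from the talk.

```python
def condition_bivariate(mean, cov, observed_index, observed_value):
    """Condition a 2-variable Gaussian on one observed variable.

    Returns (posterior_mean, posterior_variance) of the unobserved variable.
    """
    hidden_index = 1 - observed_index
    var_obs = cov[observed_index][observed_index]
    cross = cov[hidden_index][observed_index]
    gain = cross / var_obs
    post_mean = mean[hidden_index] + gain * (observed_value - mean[observed_index])
    post_var = cov[hidden_index][hidden_index] - gain * cross
    return post_mean, post_var


def predict_step(mean, var, drift, process_noise_var):
    """Transition model x[t+1] = x[t] + drift + noise (the Kalman predict step).

    The mean shifts by the drift and the uncertainty grows by the noise variance.
    """
    return mean + drift, var + process_noise_var


# Hypothetical prior: temperature (deg C) and voltage (V), correlation 0.9.
mean = [20.0, 2.8]
cov = [[4.0, 0.18],
       [0.18, 0.01]]

# Observe only the cheap attribute (voltage) and infer temperature from it.
temp_mean, temp_var = condition_bivariate(mean, cov, observed_index=1,
                                          observed_value=2.9)

# Advance the belief one time step without taking a new observation.
next_mean, next_var = predict_step(temp_mean, temp_var, drift=0.0,
                                   process_noise_var=0.25)
```

Observing voltage shrinks the temperature variance from 4.0 to about 0.76, and each unobserved time step widens it again, which is why the model has to decide when a fresh observation is worth its cost.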
9
Architecture: Model-driven Sensornet DBMS
Example query: “SELECT nodeid,temp FROM sensors CONF.95 TO ±.5°”
[Figure: architecture diagram with the labels Query, Probabilistic Model, Data gathering plan, Condition on new observations, posterior belief, New Query]
Advantages vs. “Best-Effort Query-Everything”
– Observe fewer attributes
– Exploit correlations
– Reuse information between queries
– Directly deal with missing data
– Answer more complex (probabilistic) queries
10
Outline Motivation Probabilistic Models New Queries and UI Applications Challenges and Concluding Remarks
11
New Types of Queries
Architecture enables efficient execution of many new queries
Approximate queries
– “Tell me the temperature to within ±.5 degrees with 95% confidence.”
Query: SELECT nodeId, temp ± 0.5°C, conf(.95) FROM sensors WHERE nodeId in {1..8}
System selects and observes a subset of the available nodes
Observed nodes: {3,6,8}
Query result:
Node    1      2      3      4      5      6      7      8
Temp.   17.3   18.1   17.4   16.1   19.2   21.3   17.5   16.3
Conf.   98%    95%    100%   99%    95%    100%   98%    100%
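One way to read the Conf. row: if the model's belief about a node is Gaussian, the confidence that the true value lies within ±ε of the reported mean depends only on the posterior standard deviation. The sketch below assumes that Gaussian belief; the function names and the example beliefs are hypothetical, not values from the talk.

```python
import math


def confidence_within(epsilon, std):
    """P(|X - mean| <= epsilon) for a Gaussian with the given standard deviation."""
    if std == 0.0:
        return 1.0  # a directly observed value is treated as certain
    return math.erf(epsilon / (std * math.sqrt(2.0)))


def answer_approximate_query(beliefs, epsilon=0.5, required=0.95):
    """Return (node_id, mean, confidence, meets_bound) for each node.

    `beliefs` maps node id -> (posterior mean, posterior standard deviation).
    """
    rows = []
    for node_id, (mean, std) in sorted(beliefs.items()):
        conf = confidence_within(epsilon, std)
        rows.append((node_id, mean, conf, conf >= required))
    return rows


# Hypothetical posterior beliefs: node 3 was observed, nodes 4 and 5 were inferred.
beliefs = {3: (17.4, 0.0), 4: (16.1, 0.2), 5: (19.2, 0.4)}
rows = answer_approximate_query(beliefs)
```

A node whose confidence falls below the requested bound (node 5 here) is a candidate for direct observation in the next data gathering plan.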
12
Probabilistic Query Optimization Problem
What observations will satisfy confidence bounds at minimum cost?
– Must define cost metric and model
   Sensornets: metric = power, cost = sensing + comm
– Decide if a set of observations satisfies bounds
– Choose a search strategy
13
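Choosing observation plan
Query predicate: $P(X_i \in [a,b]) > 1 - \delta$
Is a subset $S$ sufficient?
If we observe $S = s$, the reward is
$R_i(s) = \max\{\, P(X_i \in [a,b] \mid s),\; 1 - P(X_i \in [a,b] \mid s) \,\}$
The value of $S$ is unknown before it is observed, so take the expected reward:
$R_i(S) = \int P(s)\, R_i(s)\, ds$
Optimization problem: find the minimum-cost $S$ such that $R_i(S) \ge 1 - \delta$
Pick your favorite search strategy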
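The reward definitions above can be made concrete with a small discrete model, where the integral over s becomes a sum. The sketch below is an illustration under stated assumptions: the joint distribution is a hand-made table, the costs are made up, and exhaustive enumeration stands in for "your favorite search strategy". None of the names or numbers come from the talk.

```python
from collections import defaultdict
from itertools import combinations


def expected_reward(joint, target_index, in_range, subset):
    """R_i(S) = sum over s of P(s) * max(P(hit | s), 1 - P(hit | s)).

    `joint` maps outcome tuples to probabilities; `subset` lists the tuple
    positions that would be observed.
    """
    mass = defaultdict(float)  # P(s)
    hit = defaultdict(float)   # P(X_i in range, s)
    for outcome, p in joint.items():
        s = tuple(outcome[i] for i in subset)
        mass[s] += p
        if in_range(outcome[target_index]):
            hit[s] += p
    # P(s) * max(P(hit|s), 1 - P(hit|s)) equals max(hit[s], mass[s] - hit[s]).
    return sum(max(hit[s], mass[s] - hit[s]) for s in mass)


def cheapest_plan(joint, target_index, in_range, costs, delta):
    """Enumerate every subset and keep the cheapest one meeting the bound."""
    best = None
    candidates = sorted(costs)
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            cost = sum(costs[i] for i in subset)
            if best is not None and cost >= best[1]:
                continue  # cannot beat the plan already found
            if expected_reward(joint, target_index, in_range, subset) >= 1 - delta:
                best = (subset, cost)
    return best


# Hypothetical joint over (temperature, voltage level, light level).
joint = {
    (25, "hi", "bright"): 0.40,
    (25, "lo", "bright"): 0.05,
    (18, "hi", "dim"): 0.05,
    (18, "lo", "dim"): 0.50,
}
costs = {1: 1.0, 2: 5.0}  # voltage is cheap to read, light is expensive


def in_range(temp):
    return 20 <= temp <= 30


loose_plan = cheapest_plan(joint, 0, in_range, costs, delta=0.15)
tight_plan = cheapest_plan(joint, 0, in_range, costs, delta=0.05)
```

With the loose bound the cheap voltage reading is enough (expected reward 0.90); the tighter bound forces the expensive light reading. Enumeration is exponential in the number of attributes, so a real planner would need a heuristic search.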
14
More New Queries
Outlier queries
– “Report temperature readings that have a 1% or less chance of occurring.”
Extend architecture with local filters
[Figure: diagram with the labels User, Central Model, Local Models, Transmit Outliers, Update Models; each node shows a histogram with axis ticks 10, 20, 30]
Issues:
– Bias
– Inefficiency
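A local filter of this kind can be sketched as follows. The sketch assumes each node holds a Gaussian local model and interprets "a 1% or less chance of occurring" as a two-sided tail probability of at most 0.01; both the interpretation and the function names are assumptions made for illustration.

```python
import math


def tail_probability(reading, mean, std):
    """Two-sided probability of a reading at least this far from the mean."""
    z = abs(reading - mean) / std
    return math.erfc(z / math.sqrt(2.0))


def local_filter(readings, mean, std, threshold=0.01):
    """Keep only the readings the local model considers outliers.

    Only these readings would be transmitted to the central model.
    """
    return [r for r in readings if tail_probability(r, mean, std) <= threshold]


# Hypothetical local model: mean 20.0 deg C, standard deviation 1.0.
outliers = local_filter([20.5, 21.0, 23.0, 16.9, 22.5], mean=20.0, std=1.0)
```

Because the central model then sees only the unusual readings, its estimates can drift unless this is corrected for, which is one way to read the "Bias" issue on the slide.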
15
Even More New Queries
Prediction queries
– “What is the expected temperature at 5PM today, given that it is very humid?”
Influence queries
– “What percentage of network traffic at site A is explained by traffic at sites B and C?”
Queries could not be answered without a model!
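One possible reading of the influence query, assuming the three traffic variables are modeled as jointly Gaussian: the share of A's variance removed by conditioning on B and C. The function name and the covariance values below are hypothetical.

```python
def explained_fraction(var_a, cov_a_bc, cov_bc):
    """Fraction of Var(A) explained by (B, C) under a joint Gaussian model.

    Computes cov_a_bc^T * inverse(cov_bc) * cov_a_bc / var_a, with the
    2x2 inverse written out by hand.
    """
    (s_bb, s_bc), (s_cb, s_cc) = cov_bc
    det = s_bb * s_cc - s_bc * s_cb
    if det <= 0:
        raise ValueError("covariance of (B, C) must be positive definite")
    a_b, a_c = cov_a_bc
    explained = (a_b * (s_cc * a_b - s_bc * a_c)
                 + a_c * (s_bb * a_c - s_cb * a_b)) / det
    return explained / var_a


# Hypothetical covariances: Var(A) = 10, Cov(A,B) = 3, Cov(A,C) = 4,
# Var(B) = 2, Var(C) = 4, and B, C uncorrelated.
share = explained_fraction(10.0, [3.0, 4.0], [[2.0, 0.0], [0.0, 4.0]])
```

Here `share` is 0.85, so under this model 85% of the variance in traffic at site A is accounted for by sites B and C.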
16
UI Issues
How to make probability “intuitive”?
How to allow users to express queries?
Issues
– Query Language
– UI
[Figure: Load vs. Time]
17
Outline Motivation Probabilistic Models New Queries and UI Applications Challenges and Concluding Remarks
18
Applications
Sensor-based Building Monitoring
– Often battery powered
– 100s-1000s of nodes
Example: HVAC Control
– Tolerant of approximate answers
– Reduction in energy significant
19
App: Distributed System Monitoring
Goal: detect/predict overload, reprovision
Many metrics that may indicate overload
– Disk usage, CPU load, network load, network latency, active queries, etc.
– Cost to observe
Problem: What metrics foreshadow overload?
Soln:
– Train on data labeled w/ overload status
– Choose obs. plan that predicts label
20
Other Apps
Stream load shedding
Sensor network intrusion detection
Database statistics
See paper!
21
Outline Motivation Probabilistic Models New Queries and UI Applications Challenges and Concluding Remarks
22
Extension, Not Restriction
[Figure: layered architecture diagram with the labels System State, Acquisition Layer, Tabular Data, Model 1 (Gaussians), Model 2 (Discrete, Histograms), Integration Layer, Query]
Possible to have many views of same data
– Different models
– Base data
Number of architectural challenges
23
Every rose…
Models can fail to capture details
Models can be wrong
Models can be expensive to build
Models can be expensive to maintain
Paper suggests a number of known techniques from the ML community.
24
Whither hence?
See the paper for technical details
See other work
– Probabilistic data models
– Outlier and change detection
Generalize these ideas to:
– New models
– Non-numeric types
– New environments, queries
Make some AI and stats friends
25
Conclusions
Emerging data management opportunities:
– Ad-hoc networks of tiny devices
– Large scale distributed system monitoring
These environments are:
– Acquisitional
– Loss-prone
Probabilistic models are an essential tool
– Tolerate missing data
– Answer sophisticated new queries
– Framework for efficient acquisitional execution
26
Questions
27
Example: TinyDB
Declarative queries for sensornets
SELECT roomNo, AVG(temp) FROM sensors GROUP BY roomNo HAVING MAX(light) > 100 lux SAMPLE PERIOD 1 s
Queries flooded, reverse-flood aggregation
Best effort
[Figure: network of nodes numbered 1 to 5]
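To make the query's semantics concrete, here is a small emulation of what one sample period would return if every reading reached the base station. It is not TinyDB code; the function name, the record layout, and the readings are hypothetical.

```python
from collections import defaultdict


def run_epoch(readings, light_threshold=100):
    """Emulate one sample period of the query above.

    Groups readings by roomNo, keeps the rooms whose maximum light exceeds
    the threshold, and reports the average temperature of each kept room.
    """
    groups = defaultdict(list)
    for reading in readings:
        groups[reading["roomNo"]].append(reading)
    result = {}
    for room, rows in groups.items():
        if max(r["light"] for r in rows) > light_threshold:
            result[room] = sum(r["temp"] for r in rows) / len(rows)
    return result


# Hypothetical readings collected during one 1 s sample period.
readings = [
    {"roomNo": 1, "temp": 20.0, "light": 150},
    {"roomNo": 1, "temp": 22.0, "light": 80},
    {"roomNo": 2, "temp": 19.0, "light": 50},
]
epoch_result = run_epoch(readings)
```

Under best-effort delivery, whichever readings are lost simply drop out of `readings`, so the averages can shift between sample periods with no indication of how reliable they are.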
28
TinyDB Limitations
Difficult to interpret answers
– Answering nodes can change between samples
Limited queries
– No historical trends
– No future predictions
– No outlier detection
High overhead
– Query flooding, full network traversal
– No information sharing between sample periods
29
App: Value-Based Load Shedding
User prioritizes some output values over others
– May have to shed load
Issue: what inputs correspond to desired outputs?
– Esp. hard for aggregates, UDFs
Can learn a probabilistic model that gives P(output value | input tuple)
– Requires source tuple references on result tuples
Use this model to decide which tuples to drop
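A heavily simplified sketch of this idea: estimate, from past lineage records, how likely an input tuple with a given key is to contribute to a prioritized output, then keep the most promising tuples when capacity is short. The per-key frequency model, the smoothing, and every name below are assumptions made for illustration, not the method from the paper.

```python
from collections import defaultdict


def learn_contribution_model(history):
    """Estimate P(contributes to a prioritized output | input key).

    `history` is an iterable of (input_key, contributed) pairs taken from
    past result tuples that carried references to their source tuples.
    Add-one smoothing keeps rarely seen keys away from 0 and 1.
    """
    seen = defaultdict(int)
    hits = defaultdict(int)
    for key, contributed in history:
        seen[key] += 1
        hits[key] += int(contributed)
    return {key: (hits[key] + 1) / (seen[key] + 2) for key in seen}


def shed_load(tuples, model, capacity, key_of, default=0.5):
    """Keep the `capacity` tuples most likely to matter and drop the rest."""
    ranked = sorted(tuples, key=lambda t: model.get(key_of(t), default),
                    reverse=True)
    return ranked[:capacity]


# Hypothetical history: roomA tuples fed prioritized outputs, roomB tuples did not.
history = [("roomA", True)] * 3 + [("roomB", False)] * 3
model = learn_contribution_model(history)

incoming = [("roomB", 1), ("roomA", 2), ("roomB", 3), ("roomA", 4)]
kept = shed_load(incoming, model, capacity=2, key_of=lambda t: t[0])
```

The sort is stable, so tuples with equal scores keep their arrival order; unseen keys fall back to the `default` score.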
30
Coping with Complexity Graphical models
31
Coping with Mistakes Retraining models
32
Prior Work
On probabilistic data models / statistics
– Not addressing issue of data acquisition
On using models to decide what to capture
– Tends to focus on performance issues
Our Concerns:
– What data to acquire?
– Interpretability given missing data