Probabilistic Databases
Amol Deshpande, University of Maryland
Overview
- V.S. Subrahmanian: ProbView, PXML, temporal probabilistic databases, probabilistic aggregates
- Lise Getoor: statistical relational learning, probabilistic relational models, entity resolution
- Amol Deshpande: MauveDB (statistical modeling in databases), correlated tuples in probabilistic databases
Overview of Today’s Presentation
- Model-based Views/MauveDB [Amol]
  - Goal: make it easy to continuously apply statistical models to streaming data
  - Current focus is on designing declarative interfaces and efficient maintenance algorithms
  - Less emphasis on the “probabilistic databases” issues
- Statistical Relational Learning [Lise]
- Representing arbitrarily correlated data and processing queries over it [Prithviraj]
Motivation
- Unprecedented, and rapidly increasing, instrumentation of our everyday world
- Huge data volumes are generated continuously and must be processed in real time
- The data is typically imprecise, unreliable, and incomplete
  - Measurement noise, low success rates, failures, etc.
- Example sources: wireless sensor networks, RFID, distributed measurement networks (e.g. GPS), industrial monitoring
Data Processing Step 1
- Process the data using a statistical/probabilistic model
- Regression and interpolation models
  - To eliminate spatial or temporal biases, handle missing data, and make predictions
- Filtering techniques (e.g. Kalman filters), Bayesian networks
  - To eliminate measurement noise, infer hidden variables, etc.
- [Figures: regression/interpolation models, temperature monitoring; Kalman filters, GPS data]
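As an illustration of the filtering step, here is a minimal sketch of a scalar Kalman filter that smooths a noisy stream of readings. It is not taken from the talk: it assumes the hidden value follows a random walk and that noise is Gaussian, and the names `kalman_1d`, `process_var`, and `obs_var`, as well as the sample readings, are hypothetical.

```python
def kalman_1d(observations, process_var, obs_var, init_mean=0.0, init_var=1e6):
    """Scalar Kalman filter for a random-walk state observed with Gaussian noise.

    Returns a list of (posterior mean, posterior variance), one per observation.
    All parameter names are illustrative assumptions, not part of MauveDB.
    """
    mean, var = init_mean, init_var
    estimates = []
    for z in observations:
        # Predict: under a random walk the mean is unchanged, uncertainty grows.
        var = var + process_var
        # Update: weight the new reading by how uncertain the prediction is.
        gain = var / (var + obs_var)
        mean = mean + gain * (z - mean)
        var = (1.0 - gain) * var
        estimates.append((mean, var))
    return estimates


if __name__ == "__main__":
    # Made-up noisy temperature readings.
    readings = [20.3, 19.8, 20.6, 20.1, 19.9]
    for mean, var in kalman_1d(readings, process_var=0.01, obs_var=0.25):
        print(f"estimate={mean:.3f} variance={var:.3f}")
```

The large `init_var` encodes "no prior knowledge", so the first estimate follows the first reading almost exactly; later estimates move less as the variance shrinks.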
A Motivating Example
- Inferring “transportation mode”/“activities” [Henry Kautz et al]
  - Using easily obtainable sensor data, e.g. GPS, RFID proximity data
  - Much becomes possible if these can be inferred automatically
- Have access to noisy “GPS” data
- Infer the transportation mode: walking, running, in a car, in a bus
- [Figure: map with “home” and “office” labels]
Motivating Example (continued)
- Preferred end result: a clean path annotated with transportation mode
- [Figure: map with “home” and “office” labels]
Dynamic Bayesian Network
- Use a “generative model” to describe how the observations were generated
- Variables at time t:
  - M_t: transportation mode (walking, running, car, bus)
  - X_t: true velocity and location
  - O_t: observed location
- Need conditional probability distributions
  - e.g. a distribution on (velocity, location) given the transportation mode
  - From prior knowledge or learned from data
Dynamic Bayesian Network (continued)
- The same variables repeat in the next time slice: M_{t+1}, X_{t+1}, O_{t+1} at time t+1
Dynamic Bayesian Network (continued)
- Given a sequence of observations (O_t), find the most likely M_t’s that explain it
- Alternatively, provide a probability distribution over the possible M_t’s
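Both inference tasks can be sketched on a simplified version of the model. The sketch below collapses the true state X_t away and treats the problem as a plain hidden Markov model whose hidden state is the mode and whose observation is a discretized speed bucket. That simplification, every probability value, and the names `filter_modes` and `most_likely_modes` are assumptions made for illustration; none come from the talk.

```python
MODES = ["walking", "running", "car", "bus"]

# Hypothetical parameters, invented for this example only.
PRIOR = {"walking": 0.4, "running": 0.1, "car": 0.3, "bus": 0.2}
STAY = 0.85  # assumed probability of keeping the same mode between time steps
TRANSITION = {
    a: {b: STAY if a == b else (1.0 - STAY) / (len(MODES) - 1) for b in MODES}
    for a in MODES
}
# P(observed speed bucket | mode); stands in for the X_t and O_t distributions.
EMISSION = {
    "walking": {"slow": 0.80, "medium": 0.19, "fast": 0.01},
    "running": {"slow": 0.20, "medium": 0.75, "fast": 0.05},
    "car": {"slow": 0.10, "medium": 0.20, "fast": 0.70},
    "bus": {"slow": 0.25, "medium": 0.35, "fast": 0.40},
}


def filter_modes(observations):
    """Forward algorithm: P(M_t | O_1..O_t) for every t."""
    beliefs = []
    belief = PRIOR
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict: push the previous belief through the transition model.
            belief = {
                m: sum(belief[p] * TRANSITION[p][m] for p in MODES) for m in MODES
            }
        # Update: weight by the likelihood of the observation, then normalize.
        unnorm = {m: belief[m] * EMISSION[m][obs] for m in MODES}
        total = sum(unnorm.values())
        belief = {m: v / total for m, v in unnorm.items()}
        beliefs.append(belief)
    return beliefs


def most_likely_modes(observations):
    """Viterbi algorithm: the single most probable mode sequence."""
    score = {m: PRIOR[m] * EMISSION[m][observations[0]] for m in MODES}
    back = []  # back[i][m] = best predecessor of mode m at step i + 1
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for m in MODES:
            best = max(MODES, key=lambda p: prev[p] * TRANSITION[p][m])
            pointers[m] = best
            score[m] = prev[best] * TRANSITION[best][m] * EMISSION[m][obs]
        back.append(pointers)
    # Trace the best path backwards from the best final mode.
    path = [max(MODES, key=lambda m: score[m])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    speeds = ["slow", "slow", "slow", "fast", "fast", "fast"]
    print(most_likely_modes(speeds))
    print(filter_modes(speeds)[-1])
```

`most_likely_modes` corresponds to "find the most likely M_t’s", while `filter_modes` returns a distribution per time step. A real system would also need the continuous X_t state, which this sketch deliberately omits.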
Statistical Modeling of Sensor Data
- No support in database systems, so the database ends up being used as a backing store
  - With much replication of functionality
  - Very inefficient, not declarative
- How can we push statistical modeling inside a database system?
Abstraction: Model-based Views
- An abstraction analogous to traditional database views
- Present the output of applying a model as a database view
- The user can query it just like a normal database view
Example DBN View

Original noisy GPS data:
User | Time   | Location
John | 5pm    | (x1, y1)
John | 5:05pm | (x2, y2)

User view of the data (smoothed locations, inferred variables):
User | Time   | Location   | Mode    | prob
John | 5pm    | (x’1, y’1) | Walking | 0.9
John | 5pm    | (x’1, y’1) | Car     | 0.1
John | 5:05pm | (x’2, y’2) | Walking | 0
John | 5:05pm | (x’2, y’2) | Car     | 1

- The user queries the view, e.g. select count(*) group by mode sliding window 5 minutes
- Application of the model/inference is pushed inside the database
- Opens up many optimization opportunities, e.g. inference can be done lazily when queried
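One way to read the count query over this view is as an expected count per mode. The slide does not say which semantics is intended, so this interpretation, the function name `expected_count_by_mode`, and the row layout are assumptions. Because expectation is linear, summing the `prob` column per mode is valid no matter how the tuples are correlated.

```python
from collections import defaultdict

# Rows of the example view as (user, time, mode, prob); locations are omitted.
VIEW = [
    ("John", "5pm", "Walking", 0.9),
    ("John", "5pm", "Car", 0.1),
    ("John", "5:05pm", "Walking", 0.0),
    ("John", "5:05pm", "Car", 1.0),
]


def expected_count_by_mode(rows):
    """Expected number of tuples per mode: the sum of tuple probabilities."""
    counts = defaultdict(float)
    for _user, _time, mode, prob in rows:
        counts[mode] += prob
    return dict(counts)


if __name__ == "__main__":
    print(expected_count_by_mode(VIEW))
```

For the four rows above this gives 0.9 for Walking and 1.1 for Car. The sliding window is left out; it would only restrict which rows are passed in.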
Correlations

User | Time   | Location   | Mode    | prob
John | 5pm    | (x’1, y’1) | Walking | 0.9
John | 5pm    | (x’1, y’1) | Car     | 0.1
John | 5:05pm | (x’2, y’2) | Walking | 0
John | 5:05pm | (x’2, y’2) | Car     | 1

- Strong and complex correlations across tuples
  - Mutual exclusivity
  - Temporal correlations
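To see why mutual exclusivity matters, the sketch below enumerates possible worlds in which each (user, time) pair takes exactly one mode, and compares the result with naively treating every tuple as independent. It assumes the two time steps are independent of each other, which ignores the temporal correlations listed above; that assumption and the names `possible_worlds` and `probability` are illustrative only.

```python
from itertools import product

# One block per (user, time); the modes inside a block are mutually exclusive.
BLOCKS = {
    ("John", "5pm"): {"Walking": 0.9, "Car": 0.1},
    ("John", "5:05pm"): {"Walking": 0.0, "Car": 1.0},
}


def possible_worlds(blocks):
    """Yield (world, probability), picking exactly one mode per block.

    Assumes blocks are independent of each other (no temporal correlation).
    """
    keys = list(blocks)
    for choice in product(*(blocks[k].items() for k in keys)):
        world = {k: mode for k, (mode, _p) in zip(keys, choice)}
        prob = 1.0
        for _mode, p in choice:
            prob *= p
        yield world, prob


def probability(blocks, predicate):
    """Total probability of the worlds that satisfy the predicate."""
    return sum(p for world, p in possible_worlds(blocks) if predicate(world))


if __name__ == "__main__":
    at_5 = ("John", "5pm")
    at_505 = ("John", "5:05pm")
    switched = probability(
        BLOCKS, lambda w: w[at_5] == "Walking" and w[at_505] == "Car"
    )
    print("P(walking at 5pm, then car at 5:05pm):", switched)
    # Treating the two 5pm tuples as independent gives a nonzero probability
    # to John both walking and driving at 5pm, which no possible world allows.
    naive_both = BLOCKS[at_5]["Walking"] * BLOCKS[at_5]["Car"]
    print("Naive P(walking and car at 5pm):", naive_both)
```

Under mutual exclusivity no world contains both 5pm tuples, so that event has probability 0, whereas the independent-tuple calculation assigns it 0.9 × 0.1.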
MauveDB: Status
- Implemented in Apache Derby, an open source Java database system
- Support for regression- and interpolation-based views
  - Neither produces probabilistic data
  - SIGMOD 2006 (w/ Sam Madden)
- Currently building support for views based on dynamic Bayesian networks [Bhargav]
  - Kalman filters, HMMs, etc.
  - Initial focus on the user interfaces and efficient inference
  - Will generate probabilistic data; may not be able to do anything too sophisticated with it
Research Challenges/Future Work
- Generalizing to arbitrary models?
  - Develop APIs for adding arbitrary models
  - Try to minimize the work of the model developer
- Probabilistic databases
  - Uncertain data with complex correlation patterns
  - Query processing, query optimization
- View maintenance in the presence of high-rate measurement streams
Thanks!
Mauve == Model-based User Views