Lineage Processing over Correlated Probabilistic Databases Bhargav Kanagal Amol Deshpande University of Maryland.


Lineage Processing over Correlated Probabilistic Databases Bhargav Kanagal Amol Deshpande University of Maryland

Motivation: Information Extraction/Integration [Gupta & Sarawagi 2006, Jayram et al. 2006]
Structured entities are extracted from text on the internet via information extraction tasks such as address segmentation ("...located at 52 A Goregaon West Mumbai...") and sentiment analysis, populating tables like Location, CarAds, and Reputed. The extraction process introduces correlations among the extracted tuples.

Why Lineage Processing? [Das Sarma et al. 2006]
List all "reputed" car sellers in "Mumbai" who offer Honda cars:

SELECT SellerId
FROM Location, CarAds, Reputed
WHERE reputation = 'good' AND city = 'Mumbai'
AND Location.SellerId = CarAds.SellerId
AND CarAds.SellerId = Reputed.SellerId

The query result carries a boolean lineage formula over the tuple existence variables, and we need to compute the probability of that formula.

Motivation: RFID-based Event Monitoring [RFID Ecosystem UW, Diao et al. 2009, Letchner et al. 2009, KD 2008]
A building is instrumented with RFID readers to track assets/personnel. RFID readings are noisy: they miss readings and add spurious readings, so they are subjected to probabilistic modeling. Probabilities are associated with events, e.g., found(PC, X, 2pm) with prob = 0.9, and there are spatial and temporal correlations.
Example query: was the PC correctly transferred from room A to the conference room?
found(x, PC) ∧ found(z, PC) ∧ [found(y1, PC) ∨ found(y2, PC)]

PrDB System Overview [Kanagal & Deshpande SIGMOD 2009, SDG08]
A relational DBMS storing data tables and uncertainty parameters, with INDSEP indexes, a parser, an INDSEP manager, and a query processor.
Users insert data + correlations:
insert into reputation values ('z1', 219, uncertain('Good 0.5; Bad 0.5'));
insert factor '0 0 1; 1 1 1' in address on 'y1.e', 'y2.e';
and issue SPJ queries, inference queries, and aggregation queries.

Outline
- Motivation & Problem definition [done]
- Background
  – Probabilistic Databases as Junction trees
  – Query processing over Junction trees
  – INDSEP
- Lineage Processing over Junction trees
- Lineage Processing using INDSEP
- Results

Background: ProbDBs as Junction Trees
Each tuple is associated with a boolean random variable: 1 if the tuple exists, 0 otherwise (attribute uncertainty is converted to tuple uncertainty). Correlations among these variables are encoded in a forest of junction trees, a concise encoding of the joint probability distribution. Query evaluation is performed directly over the junction trees.

Background: Junction Trees
Each clique and separator stores a joint pdf (a potential), e.g., cliques p(a, b, c) and p(b, c, d) connected by separator p(b, c). The tree structure reflects the Markov property: given b and c, a is independent of d. The joint distribution is the product of the clique potentials divided by the separator potentials, and marginals such as p(a, d) are computed from it.
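The clique/separator arithmetic above can be sketched with dict-based potentials. This is a toy illustration with invented distributions, not PrDB code: the two cliques p(a,b,c) and p(b,c,d) agree on the separator p(b,c), and the marginal p(a,d) is read off the reconstructed joint.

```python
from itertools import product

def marginalize(pot, vars_, keep):
    """Sum a potential over all variables not in `keep`."""
    idx = [vars_.index(v) for v in keep]
    out = {}
    for assign, p in pot.items():
        key = tuple(assign[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

# Clique p(a,b,c): a correlated with b, b correlated with c (toy numbers).
p_abc = {}
for a, b, c in product([0, 1], repeat=3):
    p_abc[(a, b, c)] = (0.7 if a == b else 0.3) * (0.6 if b == c else 0.4) * 0.5

# Separator p(b,c), obtained by marginalizing the clique.
p_bc = marginalize(p_abc, ['a', 'b', 'c'], ['b', 'c'])

# Clique p(b,c,d): d depends on c, consistent with the same p(b,c).
p_bcd = {(b, c, d): p_bc[(b, c)] * (0.8 if d == c else 0.2)
         for b, c, d in product([0, 1], repeat=3)}

# Junction-tree property: joint = product of cliques / separator.
p_ad = {}
for a, b, c, d in product([0, 1], repeat=4):
    pr = p_abc[(a, b, c)] * p_bcd[(b, c, d)] / p_bc[(b, c)]
    p_ad[(a, d)] = p_ad.get((a, d), 0.0) + pr

print(sum(p_ad.values()))  # ≈ 1.0
```

Note the division by the separator: without it, the shared p(b,c) mass would be counted twice.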

Marginal Computation
To compute a marginal, e.g., p(b, c, n): extract the Steiner tree connecting the cliques that contain the query variables, then send messages toward a chosen pivot node, keeping the query variables and their correlations and eliminating everything else. For ProbDBs with ≈ 1 million tuples, this is not scalable:
(1) The span of the query can be very large: almost the complete database may be accessed even for a 3-variable query.
(2) Searching for cliques is expensive: a linear scan over all the nodes is inefficient.
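A minimal sketch of "send messages toward a pivot" on a chain of cliques a-b, b-c, c-n, again with invented toy numbers rather than the system's actual potentials. Messages are absorbed clique by clique; at each step the separator is divided out and only the query variables {a, n} plus the next separator survive.

```python
from itertools import product

def multiply(v1, p1, v2, p2):
    """Pointwise product of two potentials over the union of their variables."""
    vars_ = v1 + [v for v in v2 if v not in v1]
    out = {}
    for assign in product([0, 1], repeat=len(vars_)):
        env = dict(zip(vars_, assign))
        out[assign] = p1[tuple(env[v] for v in v1)] * p2[tuple(env[v] for v in v2)]
    return vars_, out

def divide_by_sep(vars_, pot, sep_vars, sep):
    """Divide out the separator potential (avoids double counting)."""
    out = {}
    for assign, pr in pot.items():
        env = dict(zip(vars_, assign))
        out[assign] = pr / sep[tuple(env[v] for v in sep_vars)]
    return out

def sum_out(vars_, pot, drop):
    """Eliminate the variables in `drop` by summation."""
    keep = [v for v in vars_ if v not in drop]
    idx = [vars_.index(v) for v in keep]
    out = {}
    for assign, pr in pot.items():
        key = tuple(assign[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return keep, out

# Chain of cliques with separators {b} and {c}; all variables binary.
p_ab = {(a, b): 0.5 * (0.7 if a == b else 0.3) for a in (0, 1) for b in (0, 1)}
p_bc = {(b, c): 0.5 * (0.6 if b == c else 0.4) for b in (0, 1) for c in (0, 1)}
p_cn = {(c, n): 0.5 * (0.8 if c == n else 0.2) for c in (0, 1) for n in (0, 1)}
p_b = {(0,): 0.5, (1,): 0.5}
p_c = {(0,): 0.5, (1,): 0.5}

# Pass messages toward the last clique (the pivot), keeping query vars {a, n}.
query = {'a', 'n'}
vars_, pot = ['a', 'b'], dict(p_ab)
for sep_vars, sep, cl_vars, cl in [(['b'], p_b, ['b', 'c'], p_bc),
                                   (['c'], p_c, ['c', 'n'], p_cn)]:
    vars_, pot = multiply(vars_, pot, cl_vars, cl)
    pot = divide_by_sep(vars_, pot, sep_vars, sep)
    vars_, pot = sum_out(vars_, pot, [v for v in sep_vars if v not in query])

print(vars_, pot[(0, 0)])  # the marginal p(a=0, n=0)
```

The scalability problem on the slide is visible even here: the message path grows with the distance between the query variables in the tree, regardless of how few variables the query mentions.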

Shortcut Potentials
How can we make marginal computation scalable? A shortcut potential for a partition of the junction tree is the joint distribution over the partition's boundary separators, which is exactly what is required to completely shortcut the partition. For example, a query that would traverse a junction tree over the variables {c, f, g, j, k, l, m} at a cost of 100 operations can use the shortcut potential instead and finish in 50 operations. The remaining question: which shortcut potentials should we build?
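The real shortcut potentials are stored joint pdfs over a partition's boundary separators. As a hedged analogy only (not the paper's construction), the reuse is easy to see on a symmetric binary chain, where eliminating a chain segment composes the per-edge correlations multiplicatively, so a precomputed segment summary can stand in for the whole segment:

```python
# Toy chain x0 - x1 - ... - x5; edge i makes x_i and x_{i+1} equal with
# probability 1 - q[i]. For such symmetric chains, eliminating the interior
# of a segment gives P(x_i == x_j) = (1 + prod_k (1 - 2*q[k])) / 2.
q = [0.3, 0.4, 0.2, 0.1, 0.25]

def agree_prob(i, j):
    """Eliminate the chain between x_i and x_j from scratch."""
    r = 1.0
    for k in range(i, j):
        r *= 1.0 - 2.0 * q[k]
    return (1.0 + r) / 2.0

# "Shortcut potential" for the partition covering edges 1..3: just p(x1, x4),
# summarized here by the single agreement probability.
shortcut_14 = agree_prob(1, 4)

# A query spanning the partition reuses the shortcut instead of re-eliminating
# the partition's interior edges.
r_total = (1 - 2 * q[0]) * (2 * shortcut_14 - 1) * (1 - 2 * q[4])
p_ends_agree = (1 + r_total) / 2

print(p_ends_agree, agree_prob(0, 5))  # identical: the shortcut loses nothing
```

In the actual system the boundary distribution is a full pdf, not a single number, but the principle is the same: the shortcut carries everything the rest of the tree needs to know about the partition.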

INDSEP - Overview
Obtained by hierarchical partitioning of the junction tree: a root over index nodes I1, I2, I3, each of which covers leaf partitions P1..P6. Each index node stores:
1. Variables: {a,b,..}, {c,f,..}, {j,n..q}
2. Child separators: p(c), p(j)
3. The tree induced on the children
4. Shortcut potentials of the children: {p(c), p(c,j), p(j)}
Actual construction: [Kanagal & Deshpande SIGMOD 2009]
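A structural sketch of an index node, with field names of my own choosing (the paper's layout differs, and the potentials are elided). The point it illustrates is the second scalability fix: locating a variable's partition becomes an O(depth) descent rather than a linear scan over all cliques.

```python
from dataclasses import dataclass, field

@dataclass
class IndsepNode:
    variables: set                                        # span of this node
    children: list = field(default_factory=list)          # empty => leaf partition
    child_separators: dict = field(default_factory=dict)  # e.g. p(c), p(j)
    shortcuts: dict = field(default_factory=dict)         # e.g. p(c), p(c,j), p(j)

def find_partition(node, var):
    """Descend the index to the leaf partition whose span contains `var`."""
    if not node.children:
        return node
    for child in node.children:
        if var in child.variables:
            return find_partition(child, var)
    return None

# Mirror the slide: three partitions under one root.
p1 = IndsepNode({'a', 'b'})
p2 = IndsepNode({'c', 'f', 'g'})
p3 = IndsepNode({'j', 'n', 'o', 'p', 'q'})
root = IndsepNode(p1.variables | p2.variables | p3.variables, [p1, p2, p3])
print(find_partition(root, 'n').variables)
```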

Computing Marginals using INDSEP [Kanagal & Deshpande SIGMOD 2009]
To compute a marginal, e.g., p(b, c, n), recurse on INDSEP: the root delegates {b, c} and {n} to the relevant children, which return potentials over {b, c}, {c, j}, and {j, n}; these are assembled into an intermediate junction tree from which p(b, c, n) is computed.

Outline
- Motivation & Problem definition [done]
- Background [done]
  – Junction trees & Query processing over junction trees
  – INDSEP
- Lineage Processing over Junction trees
- Lineage Processing using INDSEP
- Results

Lineage Processing
Lineage formulas are typically classified into two types:
- Read-once: (a ∧ b) ∨ (c ∧ d)
- Non-read-once: (a ∧ b) ∨ (b ∧ c) ∨ (c ∧ d)
Lineage processing is #P-complete in general for correlated probabilistic databases, even for read-once lineages (by reduction from #DNF).
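To see what the read-once distinction buys in the easy case: for *independent* variables, a read-once formula's probability can be computed bottom-up in linear time, since each subformula's probability depends on disjoint sets of variables. This sketch (my own toy encoding of formulas as nested tuples) shows that computation; the hardness result above says exactly this shortcut is lost once the variables are correlated.

```python
def prob(node, p):
    """Probability of a read-once formula under *independent* variables.

    Formulas: ('var', name), ('and', child, ...), ('or', child, ...)."""
    kind = node[0]
    if kind == 'var':
        return p[node[1]]
    child_probs = [prob(c, p) for c in node[1:]]
    if kind == 'and':
        out = 1.0
        for cp in child_probs:
            out *= cp           # independent conjuncts: multiply
        return out
    out = 1.0
    for cp in child_probs:
        out *= 1.0 - cp         # independent disjuncts: complement product
    return 1.0 - out

# (a ∧ b) ∨ (c ∧ d), the read-once example from the slide.
formula = ('or', ('and', ('var', 'a'), ('var', 'b')),
                 ('and', ('var', 'c'), ('var', 'd')))
p = {'a': 0.9, 'b': 0.5, 'c': 0.6, 'd': 0.5}
print(prob(formula, p))  # ≈ 0.615
```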

Lineage Processing on Junction Trees
Naive approach for (a ∧ b) ∨ (c ∧ d): evaluate a marginal query over the variables in the formula, then repeatedly multiply in an indicator and eliminate:
p(a, b, c, d) → multiply with p(a ∧ b | a, b) → p(a, b, a ∧ b, c, d) → eliminate a, b → p(a ∧ b, c, d) → multiply/eliminate → p(a ∧ b, c, d, c ∧ d) → p(a ∧ b, c ∧ d) → p((a ∧ b) ∨ (c ∧ d))
Complexity of this process (called simplification):
(1) It is dependent on the size of the intermediate pdf.
(2) Here, the intermediate pdf is at least of size n + 1, where n is the number of terms in the formula.
(3) It is not scalable to large formulae.
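A toy rendition of this simplification loop. For brevity the starting joint is built from independent base variables with assumed marginals; in PrDB it would come out of the junction tree, correlations included, and the multiply/eliminate steps would be identical.

```python
from itertools import product

p = {'a': 0.9, 'b': 0.5, 'c': 0.6, 'd': 0.5}  # assumed marginals (toy)

# Starting joint p(a, b, c, d), keyed by assignment tuples.
joint = {}
for a, b, c, d in product([0, 1], repeat=4):
    pr = 1.0
    for v, val in zip('abcd', (a, b, c, d)):
        pr *= p[v] if val else 1 - p[v]
    joint[(a, b, c, d)] = pr

def introduce_and_eliminate(pot, op, i, j):
    """Multiply in the deterministic p(e | x_i, x_j) for e = x_i op x_j,
    then eliminate x_i and x_j; e is appended as the last variable."""
    out = {}
    for assign, pr in pot.items():
        e = op(assign[i], assign[j])
        rest = tuple(v for k, v in enumerate(assign) if k not in (i, j))
        out[rest + (e,)] = out.get(rest + (e,), 0.0) + pr
    return out

step1 = introduce_and_eliminate(joint, lambda x, y: x & y, 0, 1)  # p(c, d, a∧b)
step2 = introduce_and_eliminate(step1, lambda x, y: x & y, 0, 1)  # p(a∧b, c∧d)
final = introduce_and_eliminate(step2, lambda x, y: x | y, 0, 1)  # p((a∧b)∨(c∧d))
print(final[(1,)])  # ≈ 0.615
```

The scalability issue on the slide is the `joint` table itself: it ranges over every variable in the formula at once before any elimination happens.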

Lineage Processing [Optimization Opportunities] [Kanagal & Deshpande SIGMOD 2010]
1. EAGER: exploit conditional independence and simplify early. For the query (a ∧ b) ∨ (c ∧ d), simplify subformulas near their source cliques while sending messages toward the pivot, carrying, e.g., p(a, c ∧ d) instead of the larger p(a, c, d).

Lineage Processing [Optimization Opportunities] [Kanagal & Deshpande SIGMOD 2010]
2. EAGER+ORDER: distribute simplification into the product. For (c ∧ h) ∨ (m ∧ n) over the factors p(c, f, g), p(f, h), and p(g, m ∧ n), a poor multiplication order produces intermediates such as p(c, f, g, h, m ∧ n) (maximum pdf size 5), while a better order keeps intermediates like p(g, c ∧ h) and p(c ∧ h, m ∧ n) (maximum pdf size 4). How do we compute a good ordering?

Lineage Processing [Pivot Selection]
The choice of pivot also influences the intermediate pdf size: for the formula (b ∧ c) ∨ g, pivoting at (ab) versus (cfg) yields maximum pdf sizes of 4 versus 3. Finding the optimal pivot is cheap: there are only n possible choices, so estimate the pdf size for each pivot location.

Outline
- Motivation & Problem definition [done]
- Background [done]
  – Junction trees & Query processing over junction trees
  – INDSEP
- Lineage Processing over Junction trees [done]
- Lineage Processing using INDSEP
- Results

Lineage Processing using INDSEP
For (b ∧ c) ∨ ((d ∨ e) ∧ (n ∨ o)), recurse on INDSEP as for marginals: the children return partially simplified potentials such as {b ∧ c, d ∨ e, c}, {c, j}, and {j, n ∨ o}, and the recursion bottoms out using EAGER+ORDER. But what is the running time?

Lineage Planning Phase
For (b ∧ c) ∨ ((d ∨ e) ∧ (n ∨ o)), build a query plan and estimate the maximum intermediate pdf size at each node; if a node's estimate exceeds a threshold, approximate to estimate its probability. In addition, the query plan is modified to handle multiple lineages that share variables and to exploit disconnections.

Results: Query Processing Times for Different Heuristics (note: log scale)
Datasets: (1) D1: fully independent; (2) D2: correlated; (3) D3: highly correlated (long chains).
Compared systems: (1) NAIVE; (2) EAGER; (3) EAGER+ORDER.
EAGER+ORDER is much more efficient than the others.

Results (note: log scale)
Query processing time vs. lineage size: performance is highly dependent on the size of the lineage.
Ratio vs. sharing factor: multi-query processing exploits sharing.

Conclusions
Proposed a scalable system for evaluating boolean formula queries over correlated probabilistic databases.
Future work:
- Further develop the approximation approaches
- Envelopes of boolean formulas for upper and lower bounds
Thank you

Lineage Processing (contd.)
To choose a multiplication order, construct a complete graph on the factors to be multiplied, e.g., p(c, f, g), p(f, h), p(g, c ∧ h), with each edge weighted by the amount of simplification possible when its two endpoint factors are multiplied. Then:
1. Pick the heaviest edge
2. Merge/simplify the two nodes together
3. Recompute the new edge weights
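A simplified sketch of this greedy loop, under an assumption of my own: factors are reduced to variable sets, and the "simplification" weight of an edge is the number of variables that appear in no other factor and are not query variables, i.e., could be summed out immediately after the merge (the real system also simplifies subformulas, which this sketch omits). Here 'ch' and 'mn' stand for the subformulas c ∧ h and m ∧ n.

```python
def merge_order(factors, query_vars):
    """Greedily merge factor scopes, always picking the pair whose merge
    lets the most non-query variables be eliminated right away (EAGER)."""
    factors = [set(f) for f in factors]
    order = []
    while len(factors) > 1:
        best, best_gain = None, -1
        for i in range(len(factors)):
            for j in range(i + 1, len(factors)):
                merged = factors[i] | factors[j]
                others = set().union(*(f for k, f in enumerate(factors)
                                       if k not in (i, j)))
                gain = sum(1 for v in merged
                           if v not in others and v not in query_vars)
                if gain > best_gain:
                    best, best_gain = (i, j), gain
        i, j = best
        merged = factors[i] | factors[j]
        others = set().union(*(f for k, f in enumerate(factors)
                               if k not in (i, j)))
        # Sum out the eliminable variables immediately.
        merged = {v for v in merged if v in others or v in query_vars}
        order.append((i, j, best_gain))
        factors = [f for k, f in enumerate(factors) if k not in (i, j)] + [merged]
    return order, factors[0]

# Factor scopes from the EAGER+ORDER slide, query vars {c, h, m∧n}.
order, final = merge_order([['c', 'f', 'g'], ['f', 'h'], ['g', 'mn']],
                           {'c', 'h', 'mn'})
print(order, final)
```

Each merge eliminates a variable (first f, then g), keeping the surviving scopes small, which is exactly the effect the edge weights are meant to capture.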

Lineage Processing via INDSEP [Improvement 1]
Multiple lineage processing: exploit the possibility of sharing. For example, (m ∧ c) ∨ g and (n ∧ c) ∨ g both access {c, g, j} before diverging to {j, m} and {j, n}, so work is shared across multiple levels of the index. The formulas need not even share variables, just paths through the index.

Lineage Processing via INDSEP [Improvement 2]
Extend to a forest of junction trees: real-world data sets may have independences. The index is constructed to minimize disk wastage by combining forests together, so a formula such as (a ∧ o) may involve disconnected variables (a and o are disconnected, as are j and o). We preprocess the formula to keep variables in connected components together.

Lineage Processing via INDSEP [Improvement 3]
What about complexity? It is not evident from the algorithm, so we "predict" how large the intermediate cliques will be: compute the lwidth over the intermediate junction tree (e.g., over {b ∧ c, c}, {c, j}, {j, n}, {d ∨ e}, {o}) and approximate for all portions whose estimate exceeds a threshold, e.g., 10.