Indexing Correlated Probabilistic Databases
Bhargav Kanagal, Amol Deshpande
University of Maryland, College Park, USA
SIGMOD
Presented by JongHeum Yeon
Motivation
Correlated probabilistic data arises in sensor networks, information extraction, data integration, activity recognition, and RFID stream analysis.
Challenges: representing complex correlations and querying over correlated databases, and the rapidly increasing scale of such databases.
Copyright © 2010 by CEBT Center for E-Business Technology
Example: Event Monitoring
RFID event monitoring application; raw RFID data is noisy and incomplete.
Aggregate queries: “how many business meetings occurred over the last week?”
What-if queries: “what is the likelihood that Bob and Mary attended a meeting given that John did not attend?”
Example: Information Extraction
From the sentence “Mr. X left Company A to join Company B”, the facts works(X, B) and works(X, A) cannot occur simultaneously.
Queries: “where does X work?” and “how many employees work in Company B?”
Indirect Correlations among Variables
The key challenge in evaluating queries over large-scale correlated databases: simple queries involving a few tuple or attribute variables may require accessing and manipulating the probability distributions of the entire database.
Even if two variables are not directly correlated with each other, they may be indirectly correlated through a chain of other variables in the database, e.g. the events entered(Bob, lounge, 2pm) and left(John, conf-room, 3pm) are correlated.
A query involving those two variables must therefore process the correlations among many other variables.
Contributions
INDSEP: a novel hierarchical index structure for large correlated probabilistic databases. It builds upon the junction tree framework, designed to answer inference queries over large-scale probabilistic graphical models (PGMs).
Methodology for answering various types of queries efficiently using this data structure:
Extraction queries: extract the correlations over a subset of the variables
Inference (what-if) queries: compute a conditional probability distribution
Aggregate queries: compute the probability distribution over the aggregate value
Preliminaries
Probabilistic Graphical Models (PGMs)
Junction Tree Representation of PGMs
Graphical Models
Compact graphical representation of a joint probability distribution.
Bayesian Network
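A Bayesian network encodes the joint distribution as a product of conditional probabilities, one per node given its parents. A minimal sketch on a hypothetical two-variable toy network (not the paper's example):

```python
# Toy Bayesian network: Rain -> WetGrass (hypothetical probabilities).
# The joint factorizes as p(r, w) = p(r) * p(w | r).
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {
    (True, True): 0.9, (False, True): 0.1,   # keys are (wet, rain)
    (True, False): 0.1, (False, False): 0.9,
}

def p_joint(rain, wet):
    """Joint probability read off the network's factorization."""
    return p_rain[rain] * p_wet_given_rain[(wet, rain)]

# Marginal p(wet = True) by summing out rain:
p_wet = sum(p_joint(r, True) for r in (True, False))
```

This naive summation is exactly the kind of inference the junction tree algorithm performs efficiently at scale.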
Markov Random Fields
Markov Random Fields (cont’d)
Junction Tree Algorithm
Aim: perform exact inference efficiently by transforming the graph into an appropriate data structure, while ensuring the joint probability remains the same and exact marginals can be computed.
Converts a Bayes net into an undirected tree; the joint probability remains unchanged and exact marginals can be computed.
Benefits: uniform treatment of Bayes nets and MRFs; efficient inference is possible on undirected trees.
Junction Tree Algorithm (cont’d)
Junction Tree Algorithm (cont’d)
Simply dropping edge directions gives cliques that are inconsistent with the original graph: node D just lost a parent.
Junction Tree Algorithm (cont’d)
Ensure that a node and its parents are part of the same clique (“marry the parents for a happy family”). Now the graph can be made undirected.
Junction Tree Algorithm (cont’d)
Moralizing a graph: marry all unconnected parents, then drop the edge directions. This ensures the joint probability remains the same.
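The moralization step above can be sketched in a few lines (toy helper, not from the paper):

```python
# Moralization of a directed graph: connect ("marry") every pair of
# parents of each node, then drop edge directions.
from itertools import combinations

def moralize(parents):
    """parents: dict mapping node -> list of its parent nodes.
    Returns the set of undirected edges of the moral graph."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:                         # keep original edges, undirected
            edges.add(frozenset((p, child)))
        for p1, p2 in combinations(ps, 2):   # marry the parents
            edges.add(frozenset((p1, p2)))
    return edges

# Example: A -> D <- B. Moralization adds the undirected edge A - B.
moral = moralize({"D": ["A", "B"]})
```

After this step every node and its parents lie inside a common clique, which is what the clique-tree construction needs.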
Joint Distribution
Probabilistic Graphical Models (PGMs)
Junction Tree Representation of PGMs
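For reference, the joint distribution represented by a junction tree is the standard factorization (a textbook result, not restated explicitly in the slides): the product of clique marginals divided by the product of separator marginals,

```latex
p(X) \;=\; \frac{\prod_{C \in \mathcal{C}} p(X_C)}{\prod_{S \in \mathcal{S}} p(X_S)}
```

where \(\mathcal{C}\) are the cliques and \(\mathcal{S}\) the separators of the tree.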
Query Processing
Extraction queries: return a junction tree that includes all the query variables and all the correlations that exist among them.
A naive algorithm: compute the smallest Steiner tree on the junction tree that connects all the query variables of interest. A Steiner tree can be computed on a tree-structured graph in polynomial time.
e.g. {g, k}, {e, o}
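On a tree, the smallest Steiner tree can be found by repeatedly pruning leaves that are not query variables; what survives is the minimal connecting subtree. A small sketch (hypothetical helper, adjacency as a dict):

```python
# Minimal Steiner tree on a tree: repeatedly prune non-query leaves.
def steiner_subtree(adj, query):
    """adj: dict node -> set of neighbors (must form a tree);
    query: set of nodes to connect. Returns the nodes of the subtree."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    leaves = [u for u, vs in adj.items() if len(vs) <= 1 and u not in query]
    while leaves:
        u = leaves.pop()
        if u not in adj:          # already pruned via a duplicate entry
            continue
        for v in adj.pop(u):
            adj[v].discard(u)
            if len(adj[v]) <= 1 and v not in query:
                leaves.append(v)
    return set(adj)

# Path a - b - c - d with query {a, c}: node d gets pruned.
tree = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
nodes = steiner_subtree(tree, {"a", "c"})
```

Each node is removed at most once, so this runs in time linear in the tree size, matching the polynomial-time claim above.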
Query Processing (cont’d)
Inference queries: Hugin's algorithm, e.g. for {g, k} and {e, o}:
1: m12 = p(d, e)
2: p(a, d, e) = p(a, d) × p(d, e) / p(d); eliminate d (not necessary): m23 = p(a, e)
3: p(a, c, e) = p(a, c) × p(a, e) / p(a); m34 = p(c, e)
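Step 2 above (joining two adjacent cliques through their separator, then eliminating a variable) can be sketched numerically with hypothetical binary potentials chosen to be consistent on the separator:

```python
# Hugin-style join of two adjacent cliques: combine calibrated marginals
# p(a, d) and p(d, e) through the separator p(d), then sum out d.
from itertools import product

vals = (0, 1)  # binary variables; numbers below are hypothetical
p_ad = {(a, d): p for (a, d), p in zip(product(vals, vals),
                                       (0.2, 0.1, 0.3, 0.4))}
p_de = {(d, e): p for (d, e), p in zip(product(vals, vals),
                                       (0.2, 0.3, 0.25, 0.25))}
p_d = {d: sum(p_ad[a, d] for a in vals) for d in vals}  # separator marginal

# p(a, d, e) = p(a, d) * p(d, e) / p(d)
p_ade = {(a, d, e): p_ad[a, d] * p_de[d, e] / p_d[d]
         for a, d, e in product(vals, vals, vals)}
# Eliminate d to get the message m23 = p(a, e):
p_ae = {(a, e): sum(p_ade[a, d, e] for d in vals)
        for a, e in product(vals, vals)}
```

The division by the separator marginal p(d) prevents double-counting the shared variable, which is the core of Hugin's update.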
INDSEP
Each index node in INDSEP corresponds to a connected subtree of the junction tree.
INDSEP (cont’d)
Each node in the data structure stores:
– the set of variables, e.g. I2: {c, f, g, h, i, j, k}
– pointers to the index nodes of its children, and a parent pointer for index traversal
– the set of separator potentials, e.g. I2: {p(c), p(f), p(j)}
– the graph induced on its children
– the set of shortcut potentials corresponding to the children of this node
Shortcut Potentials
The shortcut potential of a node I is the joint distribution p(X, Y) of all the separator variables adjacent to I.
e.g. node I2 stores the shortcut potentials for P3 (p(c, f, j)) and P4 (p(f)).
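A shortcut potential is simply the marginal of a partition's joint distribution onto its boundary separators, precomputed so queries can pass "through" the partition without touching its internal cliques. A minimal sketch with hypothetical variables (a uniform toy joint over {c, f, j, x}, where x is internal to the partition):

```python
# Precomputing a shortcut potential: marginalize the partition's joint
# onto its separator variables, dropping internal ones.
from itertools import product

vals = (0, 1)
# Hypothetical joint over (c, f, j, x); uniform for simplicity.
weights = {assign: 1.0 for assign in product(vals, repeat=4)}
total = sum(weights.values())
p_cfjx = {k: v / total for k, v in weights.items()}

def marginalize(p, keep):
    """Sum out all variable positions not listed in `keep`."""
    out = {}
    for assign, pr in p.items():
        key = tuple(assign[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

# Shortcut potential for the partition: p(c, f, j), internal x summed out.
shortcut = marginalize(p_cfjx, keep=(0, 1, 2))
```

At query time this precomputed table substitutes for the whole chain of clique multiplications inside the partition.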
Index Construction
Objective: partition the junction tree into the fewest subtrees, each fitting within a disk block.
1. Perform a depth-first search on the tree.
2. Iterate through the nodes starting from the lowest level of the tree; each node computes the weight of the subtree below itself.
3. Once the weight of some node exceeds the block size:
– remove the children below this node (children with the highest subtree weight are removed first)
– create a new partition for each of them, subsequently reducing the subtree weight
4. Continue until the root is reached.
Kundu et al. prove that the number of partitions generated by this algorithm is minimum.
S. Kundu and J. Misra. A linear tree partitioning algorithm. SIAM J. Comput.
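The steps above can be sketched as a bottom-up pass in the spirit of the Kundu-Misra partitioning (a simplified sketch with hypothetical inputs, not the paper's implementation):

```python
# Bottom-up tree partitioning: accumulate subtree weights in post-order;
# when a node's subtree exceeds the block size, split off its heaviest
# children as new partitions until the remainder fits.
def partition_tree(children, weight, root, block_size):
    """children: dict node -> list of child nodes (a rooted tree);
    weight: dict node -> weight. Returns the roots of the partitions."""
    partitions = []

    def visit(u):
        kid_weights = [(visit(v), v) for v in children.get(u, [])]
        kid_weights.sort(reverse=True)        # heaviest children first
        w = weight[u] + sum(wv for wv, _ in kid_weights)
        i = 0
        while w > block_size and i < len(kid_weights):
            wv, v = kid_weights[i]
            partitions.append(v)              # v roots a new partition
            w -= wv
            i += 1
        return w                              # weight remaining at u

    visit(root)
    partitions.append(root)
    return partitions

# Chain a - b - c - d, unit weights, block size 2 -> partitions {a,b}, {c,d}.
parts = partition_tree({"a": ["b"], "b": ["c"], "c": ["d"]},
                       {n: 1 for n in "abcd"}, "a", 2)
```

Cutting the heaviest subtrees first is what keeps the number of generated partitions small, matching step 3 above.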
Query Processing
Inference query {e, o}: e ∈ I1, o ∈ I3.
Steiner tree joining them → (a): {e, c} in I1, {c, j} in I2, {j, o} in I3.
Steiner tree joining → (c): {a, c} in P1, {a, e} in P2; {c, j} in I2 → shortcut potential …
Finally, the tree in (d) is constructed.
Experimental Evaluation
Datasets:
– General probabilistic database: 500,000 tuples corresponding to detected events; each random variable is connected to k neighbors (k chosen randomly in [1, 5])
– Markov sequence database: 1 million time slices, corresponding to 3 million nodes in the junction tree
Workloads:
– W1: Shortest-range queries, with a span of about 20% of the junction tree
– W2: Short-range queries, with a span of 40% of the junction tree
– W3: Long-range queries, with a span of 60% of the junction tree
– W4: Longest-range queries, each spanning at least 80% of the tree
Experimental Evaluation (cont’d)
Effectiveness of the Index
Query Processing Performance
Conclusion
An index data structure for correlated probabilistic databases that allows efficient processing of decision-support queries.
Novel shortcut potentials reduce query time by orders of magnitude.
Experimental results demonstrate the benefits of the indexing mechanism for query processing in probabilistic databases.