Indexing Correlated Probabilistic Databases
Bhargav Kanagal, Amol Deshpande
University of Maryland, College Park, USA
SIGMOD 2009
2010. 4. 9, presented by JongHeum Yeon

Motivation
- Correlated probabilistic data
  - Sensor networks
  - Information extraction
  - Data integration
  - Activity recognition
  - RFID stream analysis
- Need to represent complex correlations and to query over correlated databases
- Rapidly increasing scale of such databases

Copyright © 2010 by CEBT, Center for E-Business Technology

Example: Event Monitoring
- RFID event monitoring application
- Raw RFID data is noisy and incomplete
- Aggregate queries
  - "How many business meetings occurred over the last week?"
- What-if queries
  - "What is the likelihood that Bob and Mary attended a meeting given that John did not attend?"

Example: Information Extraction
- Information extraction
  - "Mr. X left Company A to join Company B"
  - works(X, B) and works(X, A) cannot occur simultaneously
- Queries
  - "Where does X work?"
  - "How many employees work in Company B?"

Indirect Correlations among Variables
- The key challenge in evaluating queries over large-scale correlated databases
  - Simple queries involving a few tuple or attribute variables may require accessing and manipulating the probability distributions in the entire database
- Even when two variables are not directly correlated with each other, they may be indirectly correlated through a chain of other variables in the database
  - e.g. the events entered(Bob, lounge, 2pm) and left(John, conf-room, 3pm) are correlated
- A query involving those two variables must process the correlations among many other variables

Contributions
- INDSEP
  - A novel hierarchical index structure for large correlated probabilistic databases
  - Builds upon the junction tree framework, designed to answer inference queries over large-scale probabilistic graphical models (PGMs)
- Methodology for answering various types of queries efficiently using such a data structure
  - Extraction queries: extract the correlations over a subset of the variables
  - Inference (what-if) queries: compute a conditional probability distribution
  - Aggregate queries: compute the probability distribution over the aggregate value

Junction Tree Algorithm
- Aim
  - Perform exact inference efficiently
  - Transform the graph into an appropriate data structure
  - Ensure the joint probability remains the same
  - Ensure exact marginals can be computed
- Converts a Bayes net into an undirected tree
  - Joint probability remains unchanged
  - Exact marginals can be computed
- Benefits
  - Uniform treatment of Bayes nets and MRFs
  - Efficient inference is possible for undirected trees

Junction Tree Algorithm (cont'd)
- Ensure that a node and its parents are part of the same clique
  - "Marry the parents for a happy family"
- Now you can make the graph undirected

Junction Tree Algorithm (cont'd)
- Moralizing a graph
  - Marry all unconnected parents
  - Drop the edge directions
- Ensure the joint probability remains the same
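
The moralization step can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `moralize`, the `parents` input format (each node mapped to a list of its parents), and the toy DAG are all assumptions made for the example.

```python
from itertools import combinations


def moralize(parents):
    """Return the moral graph of a DAG as an undirected adjacency dict.

    `parents` maps each node to a list of its parents (hypothetical format).
    """
    nodes = set(parents)
    for ps in parents.values():
        nodes.update(ps)
    adj = {n: set() for n in nodes}
    for child, ps in parents.items():
        # Drop the edge directions: link child and parent both ways.
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        # Marry the parents: link every pair of parents of the same child.
        for u, v in combinations(ps, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


# Toy DAG (illustrative only): a -> c <- b, c -> d
dag = {"c": ["a", "b"], "d": ["c"]}
moral = moralize(dag)
```

In the toy DAG, `a` and `b` share the child `c`, so the moral graph gains the edge a-b, and the node `c` ends up in the same clique as both of its parents.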

Query Processing
- Extraction queries
  - Return a junction tree that includes all the query variables and all the correlations that exist among them
  - A naive algorithm
    - Compute the smallest Steiner tree on the junction tree that connects all the query variables of interest
    - A Steiner tree can be computed on a tree-structured graph in polynomial time
  - e.g. {g, k}, {e, o}
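
On a tree, the smallest subtree connecting a set of terminal nodes can be found by repeatedly pruning leaves that are not terminals. The sketch below assumes the query variables have already been mapped to the cliques that contain them; the function name `steiner_subtree` and the integer clique ids are hypothetical and do not correspond to the figure in the slides.

```python
def steiner_subtree(adj, terminals):
    """Smallest subtree of a tree that spans all `terminals`.

    `adj` is an undirected adjacency dict; the input is not modified.
    """
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    terminals = set(terminals)
    leaves = [u for u in adj if len(adj[u]) <= 1 and u not in terminals]
    while leaves:
        u = leaves.pop()
        if u not in adj:  # already pruned through another path
            continue
        for v in adj.pop(u):
            adj[v].discard(u)
            # A neighbor that became a non-terminal leaf is pruned next.
            if len(adj[v]) <= 1 and v not in terminals:
                leaves.append(v)
    return adj


# Toy junction tree over clique ids: path 1-2-3-4-5 with a branch 3-6.
tree = {1: {2}, 2: {1, 3}, 3: {2, 4, 6}, 4: {3, 5}, 5: {4}, 6: {3}}
sub = steiner_subtree(tree, {2, 4})
```

Here the cliques 1, 5 and 6 are pruned, leaving the path 2-3-4. Each node is removed at most once, so the pruning takes time linear in the size of the tree.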

Query Processing (cont'd)
- Inference queries
  - Hugin's algorithm
  - e.g. {g, k}, {e, o}
    1. m12 = p(d, e)
    2. p(a, d, e) = p(a, d) * p(d, e) / p(d); eliminate d (no longer needed); m23 = p(a, e)
    3. p(a, c, e) = p(a, c) * p(a, e) / p(a); m34 = p(c, e)
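
Step 2 above can be checked numerically. The sketch below builds a hypothetical binary chain e - d - a (the probability values are made up for illustration), forms the clique potentials p(d, e) and p(a, d) with separator p(d), applies p(a, d, e) = p(a, d) * p(d, e) / p(d), sums out d, and compares the result with a brute-force computation from the full joint.

```python
from itertools import product

vals = (0, 1)

# Hypothetical chain e -> d -> a with made-up conditional tables.
p_e = {0: 0.6, 1: 0.4}
p_d_given_e = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key (d, e)
p_a_given_d = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key (a, d)

# Clique potentials p(d, e) and p(a, d), and the separator potential p(d).
p_de = {(d, e): p_d_given_e[(d, e)] * p_e[e] for d, e in product(vals, vals)}
p_d = {d: sum(p_de[(d, e)] for e in vals) for d in vals}
p_ad = {(a, d): p_a_given_d[(a, d)] * p_d[d] for a, d in product(vals, vals)}

# Multiply the clique potentials, divide by the separator, and sum out d.
p_ae = {
    (a, e): sum(p_ad[(a, d)] * p_de[(d, e)] / p_d[d] for d in vals)
    for a, e in product(vals, vals)
}

# Brute-force reference: marginalize the full joint p(a, d, e) over d.
brute = {
    (a, e): sum(p_e[e] * p_d_given_e[(d, e)] * p_a_given_d[(a, d)] for d in vals)
    for a, e in product(vals, vals)
}
```

The two tables agree up to floating-point rounding, which is what lets the message m23 = p(a, e) be passed on without ever materializing the full joint.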

INDSEP (cont'd)
- Each node in the data structure stores
  - The set of variables, e.g. I2: {c, f, g, h, i, j, k}
  - Pointers to the index nodes of its children, and parent pointers for index traversal
  - The set of separator potentials, e.g. I2: {p(c), p(f), p(j)}
  - The graph induced on its children
  - The set of shortcut potentials corresponding to the children of this node
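
The fields listed above map naturally onto a small record type. The following is a sketch only: the class names `Potential` and `IndexNode`, the field names, and the `find` helper are assumptions, and the variable sets given to the children P3 and P4 are invented for the example (the slides state only the contents of I2). Probability tables are left empty.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Potential:
    scope: Tuple[str, ...]  # variables the table ranges over
    table: Dict[Tuple[int, ...], float] = field(default_factory=dict)


@dataclass(eq=False)
class IndexNode:
    name: str
    variables: Set[str]
    parent: Optional["IndexNode"] = field(default=None, repr=False)
    children: List["IndexNode"] = field(default_factory=list)
    separator_potentials: List[Potential] = field(default_factory=list)
    child_graph: Dict[str, Set[str]] = field(default_factory=dict)
    shortcut_potentials: Dict[str, Potential] = field(default_factory=dict)

    def add_child(self, child: "IndexNode") -> None:
        child.parent = self
        self.children.append(child)

    def find(self, var: str) -> Optional["IndexNode"]:
        """Return the deepest node in this subtree that contains `var`."""
        if var not in self.variables:
            return None
        for child in self.children:
            hit = child.find(var)
            if hit is not None:
                return hit
        return self


# I2 as described on the slide; the split into P3 and P4 is assumed.
i2 = IndexNode("I2", set("cfghijk"))
i2.separator_potentials = [Potential(("c",)), Potential(("f",)), Potential(("j",))]
i2.add_child(IndexNode("P3", {"c", "f", "g", "j"}))
i2.add_child(IndexNode("P4", {"f", "h", "i", "k"}))
i2.child_graph = {"P3": {"P4"}, "P4": {"P3"}}
i2.shortcut_potentials = {"P3": Potential(("c", "f", "j")), "P4": Potential(("f",))}
```

With this layout, `i2.find("g")` descends to P3, and `i2.children[0].parent` leads back up to I2, which covers the two traversal directions the slide mentions.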

Shortcut Potentials
- The shortcut potential of a node I is the joint distribution of all the separator nodes that are adjacent to I
  - p(X, Y)
- e.g. node I2 stores the shortcut potentials for P3, p(c, f, j), and for P4, p(f)
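
Conceptually, a shortcut potential is a marginal: the joint over a partition's variables summed down to its adjacent separator variables. The sketch below shows only that marginalization on an explicit table; the function name `marginalize`, the variables, and the probability values are made up, and a real index would derive the potential from the junction tree rather than from a full joint table.

```python
from collections import defaultdict


def marginalize(joint, variables, keep):
    """Sum a joint table over every variable that is not in `keep`.

    `joint` maps assignment tuples (ordered as `variables`) to probabilities.
    """
    idx = [variables.index(v) for v in keep]
    out = defaultdict(float)
    for assignment, prob in joint.items():
        out[tuple(assignment[i] for i in idx)] += prob
    return dict(out)


# Hypothetical partition over (c, g, f) whose adjacent separators are c and f.
variables = ("c", "g", "f")
joint = {
    (0, 0, 0): 0.125, (0, 0, 1): 0.0625, (0, 1, 0): 0.125, (0, 1, 1): 0.1875,
    (1, 0, 0): 0.0625, (1, 0, 1): 0.1875, (1, 1, 0): 0.125, (1, 1, 1): 0.125,
}
shortcut = marginalize(joint, variables, ("c", "f"))
```

A query that only passes through the partition can use the small table `shortcut` over (c, f) and skip the interior variable g entirely.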

Index Construction
- Objective: find the fewest partitions, each smaller than the disk block size
- Partition the junction tree into subtrees (each smaller than the disk block size)
  1. Perform a depth-first search on the tree
  2. Iterate through the nodes starting from the lowest level of the tree; each node computes the weight of the subtree below itself
  3. Once the weight of some node exceeds the block size
     - Remove children below this node (children with the highest subtree weight are removed first)
     - Create a new partition for each removed child, which reduces the subtree weight
  4. Continue until the root is reached
- Kundu and Misra prove that the number of partitions generated using this algorithm is minimum
  - S. Kundu and J. Misra. A linear tree partitioning algorithm. SIAM J. Comput., 1977.
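
The four steps above can be sketched as a bottom-up pass over a rooted tree. This is an illustrative reading of the slide, not the authors' implementation: the function name `partition_tree`, the `children` and `weight` dictionaries, and the toy tree are assumptions, and a partition is allowed to weigh up to and including `limit`.

```python
def partition_tree(children, weight, root, limit):
    """Cut a rooted tree into connected parts of total weight <= limit."""
    # Step 1: depth-first traversal; reversing it visits children first.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))

    sub = {}   # remaining subtree weight of each node after cuts
    kept = {}  # children still attached to each node
    part_roots = [root]
    for u in reversed(order):  # step 2: lowest levels first
        if weight[u] > limit:
            raise ValueError(f"node {u!r} alone exceeds the limit")
        kids = sorted(children.get(u, []), key=lambda c: sub[c], reverse=True)
        total = weight[u] + sum(sub[c] for c in kids)
        cut = 0
        # Step 3: cut off the heaviest children until the subtree fits.
        while total > limit:
            total -= sub[kids[cut]]
            part_roots.append(kids[cut])
            cut += 1
        kept[u], sub[u] = kids[cut:], total

    # Collect the nodes of each partition from its root.
    parts = []
    for r in part_roots:
        part, stack = [], [r]
        while stack:
            u = stack.pop()
            part.append(u)
            stack.extend(kept[u])
        parts.append(sorted(part))
    return parts


# Toy tree (illustrative weights): r -> a, b and a -> c, d
children = {"r": ["a", "b"], "a": ["c", "d"]}
weight = {"r": 1, "a": 2, "b": 3, "c": 2, "d": 1}
parts = partition_tree(children, weight, "r", limit=4)
```

On the toy tree the total weight is 9 and the limit is 4, so at least three partitions are needed; the sketch produces exactly three, with weights 4, 3 and 2.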

Query Processing
- Inference query: {e, o}
  - e ∈ I1, o ∈ I3
  - Steiner tree joining → (a)
  - {e, c} in I1, {c, j} in I2, {j, o} in I3
  - Steiner tree joining → (c)
  - {a, c} in P1, {a, e} in P2
  - {c, j} in I2 → shortcut potential
  - …
  - Finally, (d) is constructed

Experimental Evaluation
- Datasets
  - General probabilistic database
    - 500,000 tuples corresponding to detected events
    - Each random variable is connected to k neighbors (k chosen randomly in [1, 5])
  - Markov sequence database
    - 1 million time slices, which corresponds to 3 million nodes in the junction tree
- Workloads
  - W1: Shortest-range queries, with a span of about 20% of the junction tree
  - W2: Short-range queries, with a span of 40% of the junction tree
  - W3: Long-range queries, with a span of 60% of the junction tree
  - W4: Longest-range queries, each spanning at least 80% of the tree

Conclusion
- An index data structure for correlated probabilistic databases that allows efficient processing of decision support queries
- Novel shortcut potentials
  - Reduce query time by orders of magnitude
- Experimental results demonstrate the benefits of the indexing mechanisms for query processing in probabilistic databases
