Indexing Correlated Probabilistic Databases. Bhargav Kanagal, Amol Deshpande. University of Maryland, College Park, USA. SIGMOD 2009. Presented by JongHeum Yeon.

Motivation
 Correlated probabilistic data arises in: sensor networks, information extraction, data integration, activity recognition, RFID stream analysis
 Need to represent complex correlations and to query over correlated databases
 Rapidly increasing scale of such databases

Example: Event Monitoring
 RFID event monitoring application
– Raw RFID data is noisy and incomplete
 Aggregate queries: "how many business meetings occurred over the last week?"
 What-if queries: "what is the likelihood that Bob and Mary attended a meeting given that John did not attend?"

Example: Information Extraction
 Information extraction: "Mr. X left Company A to join Company B"
– works(X, B) and works(X, A) cannot occur simultaneously
 Queries: "where does X work?", "how many employees work in Company B?"
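As a concrete (hypothetical) illustration of such a correlation, the mutual exclusion between works(X, A) and works(X, B) can be encoded as one factor over two boolean variables that assigns zero probability to the assignment where both hold; the numbers below are made up:

```python
# A minimal sketch (not from the paper): a factor encoding mutual
# exclusion between works(X, A) and works(X, B).  The joint assignment
# in which both tuples exist gets probability zero.
factor = {
    # (works_XA, works_XB) -> probability
    (False, False): 0.1,
    (False, True):  0.5,
    (True,  False): 0.4,
    (True,  True):  0.0,  # X cannot work at A and B simultaneously
}

# "Where does X work?" -> marginalize: probability that X works at B.
p_works_XB = sum(p for (xa, xb), p in factor.items() if xb)
print(p_works_XB)  # 0.5
```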

Indirect Correlations among Variables
 The key challenge in evaluating queries over large-scale correlated databases: simple queries involving only a few tuple or attribute variables may require accessing and manipulating the probability distributions in the entire database
 Even when two variables are not directly correlated with each other, they may be indirectly correlated through a chain of other variables in the database
– e.g., the events entered(Bob, lounge, 2pm) and left(John, conf-room, 3pm) are correlated
 A query involving those two variables must process the correlations among many other variables

Contributions
 INDSEP: a novel hierarchical index structure for large correlated probabilistic databases
– Builds upon the junction tree framework, designed to answer inference queries over large-scale probabilistic graphical models (PGMs)
 A methodology for answering various types of queries efficiently using this data structure
– Extraction queries: extract the correlations over a subset of the variables
– Inference (what-if) queries: compute a conditional probability distribution
– Aggregate queries: compute the probability distribution over the aggregate value

Preliminaries
 Probabilistic graphical models (PGMs)
 Junction tree representation of PGMs

Graphical Models
 Compact graphical representation of a joint probability distribution
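For instance (a generic textbook example, not a figure from the talk), a directed model with edges a→b, a→c, b→d, and c→d factorizes the joint distribution into small conditionals:

$$p(a, b, c, d) = p(a)\, p(b \mid a)\, p(c \mid a)\, p(d \mid b, c)$$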

Bayesian Network

Markov Random Fields

Markov Random Fields (cont'd)

Junction Tree Algorithm
 Aim
– Perform exact inference efficiently
– Transform the graph into an appropriate data structure
– Ensure the joint probability remains the same
– Ensure exact marginals can be computed
 Converts a Bayes net into an undirected tree
– The joint probability remains unchanged
– Exact marginals can be computed
 Benefits
– Uniform treatment of Bayes nets and MRFs
– Efficient inference is possible on undirected trees

Junction Tree Algorithm (cont'd)

Junction Tree Algorithm (cont'd)
 The cliques of this graph are inconsistent with the original one
– Node D just lost a parent

Junction Tree Algorithm (cont'd)
 Ensure that a node and its parents are part of the same clique
– "Marry the parents for a happy family"
 Now the graph can be made undirected

Junction Tree Algorithm (cont'd)
 Moralizing a graph
– Marry all unconnected parents
– Drop the edge directions
 Ensures the joint probability remains the same
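A minimal sketch of the moralization step, assuming the graph fits in memory and using networkx (the names here are illustrative):

```python
import itertools
import networkx as nx

def moralize(dag: nx.DiGraph) -> nx.Graph:
    """Moralize a DAG: marry all unconnected parents of each node,
    then drop the edge directions."""
    moral = dag.to_undirected()
    for node in dag.nodes:
        # "Marry the parents": connect every pair of parents of `node`.
        for u, v in itertools.combinations(dag.predecessors(node), 2):
            moral.add_edge(u, v)
    return moral

# Example: b and c are both parents of d, so moralization adds edge (b, c).
g = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
print(moralize(g).has_edge("b", "c"))  # True
```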

Joint Distribution
 Probabilistic graphical models (PGMs)
 Junction tree representation of PGMs

Query Processing
 Extraction queries: build a junction tree that includes all the query variables and all the correlations that exist among them
 A naive algorithm: compute the smallest Steiner tree on the junction tree that connects all the query variables of interest (see the sketch below)
– A Steiner tree can be computed on a tree-structured graph in polynomial time
 e.g., {g, k}, {e, o}
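On a tree, the smallest Steiner tree is simply the subtree left after repeatedly pruning unneeded leaves. A minimal sketch, assuming the query cliques (the junction tree nodes containing the query variables) have already been identified:

```python
import networkx as nx

def steiner_subtree(junction_tree: nx.Graph, query_cliques: set) -> nx.Graph:
    """Smallest subtree connecting the query cliques: repeatedly prune
    leaves that are not query cliques until none remain."""
    sub = junction_tree.copy()
    while True:
        prunable = [n for n in sub.nodes
                    if sub.degree(n) == 1 and n not in query_cliques]
        if not prunable:
            return sub
        sub.remove_nodes_from(prunable)
```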

Query Processing (cont'd)
 Inference queries: Hugin's algorithm
 e.g., {g, k}, {e, o}
– Step 1: m12 = p(d, e)
– Step 2: p(a, d, e) = p(a, d) × p(d, e) / p(d); eliminate d (not needed); m23 = p(a, e)
– Step 3: p(a, c, e) = p(a, c) × p(a, e) / p(a); m34 = p(c, e)
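Numerically, each step multiplies a clique potential by an incoming message (a ratio over the separator marginal) and then sums out the unwanted variable. A sketch of step 2 with made-up but calibrated binary potentials:

```python
import numpy as np

# Hypothetical calibrated potentials: clique {a, d}, clique {d, e},
# and their separator {d}.  Axis order: p_ad is [a, d]; p_de is [d, e].
p_ad = np.array([[0.2, 0.1],
                 [0.3, 0.4]])        # p(a, d)
p_de = np.array([[0.25, 0.25],
                 [0.2,  0.3]])       # p(d, e)
p_d = p_ad.sum(axis=0)               # separator marginal p(d)

# Hugin update: p(a, d, e) = p(a, d) * p(d, e) / p(d) ...
p_ade = p_ad[:, :, None] * (p_de / p_d[:, None])[None, :, :]
# ... then eliminate d to get the outgoing message m23 = p(a, e).
p_ae = p_ade.sum(axis=1)
print(p_ae.sum())                    # ~1.0: still a proper joint
```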

INDSEP
 Each index node in INDSEP corresponds to a connected subtree of the junction tree

INDSEP (cont'd)
 Each node in the data structure stores (see the schematic sketch below):
– Its set of variables, e.g., I2: {c, f, g, h, i, j, k}
– Pointers to the index nodes of its children, and a parent pointer, for index traversal
– A set of separator potentials, e.g., I2: {p(c), p(f), p(j)}
– The graph induced on its children
– A set of shortcut potentials corresponding to the children of this node
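A schematic of the per-node contents, illustrating the list above (field names are hypothetical, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class IndexNode:
    """Sketch of one INDSEP index node; field names are hypothetical."""
    variables: set                                   # e.g. {'c','f','g','h','i','j','k'}
    children: list = field(default_factory=list)     # child index nodes / partitions
    parent: "IndexNode | None" = None                # for upward traversal
    separator_potentials: dict = field(default_factory=dict)  # e.g. {'c': p_c, ...}
    child_graph: object = None                       # tree induced on the children
    shortcut_potentials: dict = field(default_factory=dict)   # one per child
```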

Shortcut Potentials
 The shortcut potential of a node I is the joint distribution p(X, Y) of all the separator nodes adjacent to I
 e.g., node I2 stores the shortcut potentials for P3 (p(c, f, j)) and P4 (p(f))

Index Construction
 Objective: find the fewest partitions, each smaller than a disk block
 Partition the junction tree into subtrees (each < disk block size), as sketched after this list:
1. Perform a depth-first search on the tree
2. Iterate through the nodes starting from the lowest level of the tree; each node computes the weight of the subtree below itself
3. Once the weight of some node exceeds the block size:
– Remove the children below this node (children with the highest subtree weight are removed first)
– Create a new partition for each of them, subsequently reducing the subtree weight
4. Continue until the root is reached
 Kundu et al. prove that the number of partitions generated by this algorithm is minimum [S. Kundu and J. Misra. A linear tree partitioning algorithm. SIAM J. Comput., 1977]
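A rough sketch of this bottom-up pass (weights and the block size are abstract units here; a real implementation would measure clique potentials in bytes):

```python
import networkx as nx

def partition_tree(tree: nx.Graph, weight: dict, root, block_size: int) -> list:
    """Cut the tree into subtrees whose total weight fits in a block.
    When a subtree outgrows the block, its heaviest child subtrees are
    split off as new partitions until it fits again."""
    partitions, subtree_wt = [], {}

    def dfs(node, parent):
        subtree_wt[node] = weight[node]
        kids = [c for c in tree.neighbors(node) if c != parent]
        for c in kids:
            dfs(c, node)
            subtree_wt[node] += subtree_wt[c]
        # Cut the heaviest children first until this subtree fits.
        for c in sorted(kids, key=lambda c: subtree_wt[c], reverse=True):
            if subtree_wt[node] <= block_size:
                break
            partitions.append(c)               # c's subtree becomes a partition
            subtree_wt[node] -= subtree_wt[c]

    dfs(root, None)
    partitions.append(root)                    # whatever remains around the root
    return partitions
```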

Query Processing
 Inference query: {e, o}, where e ∈ I1 and o ∈ I3
 Steiner tree joining them → (a): {e, c} in I1, {c, j} in I2, {j, o} in I3
 Steiner tree joining them → (c): {a, c} in P1, {a, e} in P2
 {c, j} in I2 → use its shortcut potential
 … finally, (d) is constructed
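The walkthrough above can be read as a recursion down the index. The sketch below is hypothetical (is_leaf, load_cliques, and the reuse of steiner_subtree from the earlier sketch are illustrative, not the paper's pseudocode):

```python
def extract(node, query_vars: set) -> list:
    """Hypothetical recursive extraction over INDSEP: descend only into
    children that own query variables; children that merely relay
    correlations are replaced by their shortcut potentials."""
    if node.is_leaf():                        # a disk-resident partition
        return node.load_cliques(query_vars)
    relevant = {c for c in node.children if query_vars & c.variables}
    potentials = []
    for child in steiner_subtree(node.child_graph, relevant).nodes:
        if query_vars & child.variables:
            potentials += extract(child, query_vars & child.variables)
        else:
            # On the path but holds no query variable: its shortcut
            # potential stands in for the entire subtree below it.
            potentials.append(node.shortcut_potentials[child])
    return potentials
```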

Experimental Evaluation
 Datasets
– General probabilistic database: 500,000 tuples corresponding to detected events; each random variable is connected to k neighbors, with k chosen randomly from [1, 5]
– Markov sequence database: 1 million time slices, corresponding to 3 million nodes in the junction tree
 Workloads
– W1: shortest-range queries, spanning about 20% of the junction tree
– W2: short-range queries, spanning about 40% of the junction tree
– W3: long-range queries, spanning about 60% of the junction tree
– W4: longest-range queries, each spanning at least 80% of the tree

Experimental Evaluation (cont'd)
 Effectiveness of the index
 Query processing performance

Conclusion
 An index data structure for correlated probabilistic databases that allows efficient processing of decision support queries
 Novel shortcut potentials reduce query time by orders of magnitude
 Experimental results demonstrate the benefits of the indexing mechanisms for query processing in probabilistic databases