Machine and Statistical Learning for Database Querying. Chao Wang, Data Mining Research Lab, Dept. of Computer Science & Engineering, The Ohio State University. Advisor: Prof. Srinivasan Parthasarathy. Supported by NSF Career Award IIS.

Outline: Introduction (selectivity estimation; probabilistic graphical models); Querying transaction database; Probabilistic model-based itemset summarization; Querying XML database; Conclusion.

Introduction

Introduction. Database querying and selectivity estimation: estimating the size of a query's result in a database system, used by the query optimizer to choose an efficient execution plan. We rely on probabilistic graphical models.

Probabilistic Graphical Models. The marriage of graph theory and probability theory. Special cases of the basic algorithms have been discovered in many (dis)guises: statistical physics, hidden Markov models, genetics, statistics, ... Numerous applications: bioinformatics, speech, vision, robotics, optimization, ...

Directed Graphical Models (Bayesian Networks). (Figure: an example DAG over x1, ..., x6.) The joint distribution factorizes according to the graph: p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2, x5).

Undirected Graphical Models (Markov Random Fields, MRFs). (Figure: the corresponding undirected graph over x1, ..., x6.) The joint distribution factorizes over cliques: p(x1, x2, x3, x4, x5, x6) = (1/Z) Φ(x1, x2) Φ(x1, x3) Φ(x2, x4) Φ(x3, x5) Φ(x2, x5, x6).

Inference – Computing Conditional Probabilities. Condition on the observed variables, then marginalize out the remaining ones; the ratio of the two sums gives the desired conditional probability.
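
To make the marginalization concrete, here is a minimal brute-force sketch (my own illustration, not code from the talk) that computes a conditional probability in the six-variable MRF above, with made-up potentials:

import itertools

# Assumed toy potential: favors agreement among its arguments.
def phi(*vals):
    return 2.0 if len(set(vals)) == 1 else 1.0

def joint_unnormalized(x):
    x1, x2, x3, x4, x5, x6 = x
    return (phi(x1, x2) * phi(x1, x3) * phi(x2, x4) *
            phi(x3, x5) * phi(x2, x5, x6))

states = list(itertools.product([0, 1], repeat=6))
# p(x1 = 1 | x6 = 1): sum the joint over consistent assignments and normalize.
num = sum(joint_unnormalized(x) for x in states if x[0] == 1 and x[5] == 1)
den = sum(joint_unnormalized(x) for x in states if x[5] == 1)
print("p(x1=1 | x6=1) =", num / den)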

Querying Transaction Database

Transaction Database. Consists of records of interactions among entities. Two examples: market-basket data, where each basket is a transaction consisting of items, and co-authorship data, where each paper is a transaction consisting of "author" items.

Querying Transaction Database. We rely on frequent itemsets to learn graphical models, and on the resulting model to solve the selectivity estimation problem: given a conjunctive query Q, estimate the size of the answer set, i.e., how many transactions satisfy Q.

Frequent Itemset Mining – Market-Basket Analysis. (Figure: example market baskets over items A, B, C, D.)

Frequent Itemset Mining. Support(I): the number of transactions containing I.

Frequent Itemset Mining Problem. Given D and minsup, find all itemsets I with support(I) ≥ minsup.
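
A minimal brute-force sketch of the mining problem on a toy database (an illustration only; the experiments in the talk use the Apriori algorithm):

from itertools import combinations

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "C", "D"}]  # toy transactions
minsup = 2

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*D))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(set(cand), D)
        if s >= minsup:
            frequent[cand] = s

for itemset, s in sorted(frequent.items()):
    print(itemset, s)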

Using Frequent Itemsets to Learn an MRF. A k-itemset can be viewed as a constraint on the underlying distribution generating the data. Given a set of itemsets, we compute a distribution that satisfies them and has maximum entropy (ME). This maximum entropy distribution is equivalent to an MRF.

An ME Distribution Example. Frequent itemsets: X1, X2, X3, X4, X5, X1X2, X1X3, X2X3, X3X4, X4X5, and X1X2X3. The maximum entropy distribution has a product form with one factor per itemset constraint, where I(.) is an indicator function for the corresponding itemset constraint and the constants u0, u1, ..., u11 are estimated from the data.
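
The product form itself appears only as an image in the original slides; under the usual formulation for these eleven constraints it should read roughly as follows (my reconstruction):

p(x_1,\dots,x_5) \;=\; u_0 \prod_{j=1}^{11} u_j^{\,I_j(x)}

where I_j(x) = 1 exactly when all items of the j-th itemset are present in x, u_0 is the normalization constant, and u_1, ..., u_11 are the per-constraint multipliers.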

An MRF Example. (Figure: the MRF over X1, ..., X5 induced by the itemsets above, with cliques C1, C2, and C3.)

Iterative Scaling Algorithm. Time complexity: with k iterations over m itemset constraints and an average inference time of t per constraint update, the cost is O(k * m * t). Efficient inference is crucial!
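
One possible shape of the fitting loop, as a hedged sketch (the names and the simplified multiplicative update are mine; brute-force enumeration stands in for the inference step):

import itertools

def fit_itemset_mrf(constraints, n_vars, n_iters=200):
    """constraints: dict mapping a tuple of variable indices to its target frequency."""
    u = {c: 1.0 for c in constraints}              # one multiplicative factor per itemset
    states = list(itertools.product([0, 1], repeat=n_vars))

    def unnorm(x):
        p = 1.0
        for c, w in u.items():
            if all(x[i] == 1 for i in c):          # indicator I_c(x)
                p *= w
        return p

    for _ in range(n_iters):
        for c, target in constraints.items():
            Z = sum(unnorm(x) for x in states)     # inference step (brute force here)
            model = sum(unnorm(x) for x in states if all(x[i] == 1 for i in c)) / Z
            if model > 0:
                u[c] *= target / model             # simplified scaling update
    return u

print(fit_itemset_mrf({(0, 1): 0.4, (1, 2): 0.3, (0, 1, 2): 0.25}, n_vars=3))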

Junction Tree Algorithm. An exact inference algorithm whose time complexity is exponential in the treewidth (tw) of the model, where the treewidth is the maximum clique size in the triangulated graph minus one. In real-world models tw is often well above 20, so exact inference is intractable.

Approximate Inference Algorithms. Gibbs sampling: simulate samples from the posterior distribution and average over the samples to estimate marginal probabilities. Mean field: convert the inference problem into an optimization problem and solve the relaxed optimization problem. Loopy belief propagation: apply Pearl's belief propagation directly to loopy graphs; it works quite well in practice. Will the iterative scaling algorithm still converge when coupled with approximate inference algorithms?
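
For illustration, a toy Gibbs sampler for the earlier six-variable example (a sketch under my own assumptions: the potentials are made up and the three-way clique is approximated by two pairwise edges):

import random

edges = [(0, 1), (0, 2), (1, 3), (2, 4), (1, 5), (4, 5)]  # pairwise approximation of the example MRF

def phi(a, b):
    return 2.0 if a == b else 1.0   # assumed agreement-favoring potential

def gibbs_marginals(n_sweeps=20000, burn_in=2000):
    x = [random.randint(0, 1) for _ in range(6)]
    counts = [0] * 6
    for t in range(n_sweeps):
        for i in range(6):
            w = [1.0, 1.0]                       # unnormalized conditional of x_i
            for a, b in edges:
                if i in (a, b):
                    j = b if i == a else a
                    for v in (0, 1):
                        w[v] *= phi(v, x[j])
            x[i] = 0 if random.random() < w[0] / (w[0] + w[1]) else 1
        if t >= burn_in:
            for i in range(6):
                counts[i] += x[i]
    return [c / (n_sweeps - burn_in) for c in counts]

print(gibbs_marginals())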

Graph Partitioning-Based Approximate MRF Learning. Lemma: for all disjoint vertex subsets a, b, and c in an MRF, whenever b and c are separated by a in the graph, the variables associated with b and c are independent given the variables associated with a alone.

Graph Partitioning-Based Approximate MRF Learning. Cluster variables based on graph partitioning; augment each variable-cluster using interaction importance and treewidth; learn an exact local MRF on each variable-cluster and combine all local models to derive an approximate global MRF.

Clustering Variables via k-MinCut: partition the graph into k equal parts, minimizing the number of edges whose incident vertices belong to different partitions; for weighted graphs, minimize the sum of weights of all edges across different partitions.

Accumulative Edge Weighting Scheme. Edge weights should reflect correlation strength. Example itemset supports: X1X2 : 3, X1X3 : 4, X2X3 : 2, X3X4 : 2, X4X5 : 6, plus the 3-itemset X1X2X3. The weight of an edge accumulates the supports of the itemsets that contain both of its endpoints.
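
A small sketch of one way to realize this accumulation (my reading of the scheme, not the authors' code): every itemset adds its support to the weight of each edge between a pair of its items.

from itertools import combinations
from collections import defaultdict

def accumulate_edge_weights(itemset_supports):
    """itemset_supports: dict mapping a frozenset of items to its support count."""
    weights = defaultdict(int)
    for itemset, supp in itemset_supports.items():
        for a, b in combinations(sorted(itemset), 2):
            weights[(a, b)] += supp          # each containing itemset contributes its support
    return dict(weights)

supports = {frozenset({"X1", "X2"}): 3, frozenset({"X1", "X3"}): 4,
            frozenset({"X2", "X3"}): 2, frozenset({"X3", "X4"}): 2,
            frozenset({"X4", "X5"}): 6}
print(accumulate_edge_weights(supports))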

Clustering Variables. The k-MinCut partitioning scheme yields disjoint partitions. However, there still exist edges across different partitions; in other words, different partitions are correlated with each other. So how do we account for the correlations across different partitions?

Interaction Importance and Treewidth Based Variable-Cluster Augmentation. Augmenting a variable-cluster: add back the most significant incident edges to the variable-cluster. Optimization: take model complexity into consideration by keeping track of the treewidth of the augmented variable-clusters, considering 1-hop neighboring nodes first, then 2-hop nodes, and so on.

Treewidth Based Augmentation. (Figure: a variable-cluster, its 1-hop neighboring nodes, and its 2-hop neighboring nodes.)

Interaction Importance and Treewidth Based Variable-Cluster Augmentation

Approximate Global MRFs. For each augmented variable-cluster, collect the related itemsets and learn an exact local MRF. All local MRFs together offer an approximate global MRF.

Learning Algorithm

A Greedy Inference Algorithm. Given the global model consisting of a set of local MRFs, how do we perform inference? Case 1: all query variables are covered by a single local MRF; evaluate the marginal probability directly. Case 2: use a greedy, overlapped decomposition scheme: first pick the local model that has the largest intersection with the current query (i.e., covers the most query variables), then pick the next local model covering the most still-uncovered query variables, and so on.

A Greedy Inference Algorithm – Example. Query Qx = X1 X2 X3 X4 X5; local models M1 over {X1, X2, X3, X6, X7}, M2 over {X3, X4, X6, X8}, and M3 over {X5, X9, X10}.
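
A sketch of the greedy selection step on this example (my own illustration; the combination of the local marginals over the overlapped pieces is omitted):

def greedy_decomposition(query_vars, local_models):
    """local_models: dict mapping a model name to the set of variables it covers."""
    remaining = set(query_vars)
    plan = []
    while remaining:
        name, covered = max(local_models.items(),
                            key=lambda kv: len(kv[1] & remaining))
        gain = covered & remaining
        if not gain:                                    # no model covers what is left
            break
        plan.append((name, covered & set(query_vars)))  # overlapped decomposition
        remaining -= gain
    return plan

models = {"M1": {"X1", "X2", "X3", "X6", "X7"},
          "M2": {"X3", "X4", "X6", "X8"},
          "M3": {"X5", "X9", "X10"}}
print(greedy_decomposition({"X1", "X2", "X3", "X4", "X5"}, models))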

Discussions. The greedy inference scheme is a heuristic. The global model is not globally consistent; however, we expect it to be nearly consistent (Heckerman et al. 2000). A generalized belief propagation-style approach is currently under investigation to enforce local consistency across the local models, thereby offering a globally consistent model.

Experimental Results. C++ implementation; the junction tree algorithm is implemented on top of Intel's open-source Probabilistic Networks Library (C++). The Apriori algorithm is used to collect frequent itemsets, and Metis is used for graph partitioning.

Experimental Setup. Datasets: Microsoft Anonymous Web (|D| = 32711, |I| = 294) and BMS-Webview1 (|D| = 59602, |I| = 497). Query workloads: conjunctive queries, e.g., X1 & ¬X2 & X4. Performance metrics: time (online estimation time and offline learning time) and error (average absolute relative error). Varied parameters: k, the number of clusters; g, the number of vertices used during augmentation; tw, the treewidth threshold used in treewidth-based augmentation.
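
The error metric is presumably the usual average absolute relative error over the query workload W (the precise definition is not spelled out in the transcript):

\text{error} \;=\; \frac{1}{|W|} \sum_{Q \in W} \frac{\lvert \hat{\sigma}(Q) - \sigma(Q) \rvert}{\sigma(Q)}

where σ(Q) is the true selectivity of query Q and σ̂(Q) is the model's estimate.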

Results on the Web Data. A support threshold of 20 results in 9901 frequent itemsets; the treewidth is 28 according to the Maximum Cardinality Search (MCS) ordering heuristic.

Varying k (g = 5). (Figures: estimation accuracy, online time, and offline time.)

Varying g (k = 20). (Figures: estimation accuracy, online time, and offline time.)

Varying tw (k = 25). (Figures: estimation accuracy, online time, and offline time.)

Using Non-Redundant Itemsets. There exist redundancies in a collection of frequent itemsets. Select non-redundant patterns to learn probabilistic models; this is closely related to pattern summarization.

Probabilistic Model-Based Itemset Summarization

Non-Derivable Itemsets. Based on redundancies: how do supports relate, and what information about unknown supports can we derive from known supports? Concise representation: only store non-redundant information.

The Inclusion-Exclusion Principle

Deduction Rules via Inclusion-Exclusion. Let A, B, C, ... be items, and let A' denote the set { transactions t | t contains A }, so that (AB)' = A' ∩ B'. Then supp(AB) = |(AB)'|.

Deduction Rules via Inclusion-Exclusion. The inclusion-exclusion principle gives |A' ∪ B' ∪ C'| = |A'| + |B'| + |C'| - |(AB)'| - |(AC)'| - |(BC)'| + |(ABC)'|. Thus, since |A' ∪ B' ∪ C'| ≤ n, supp(ABC) ≤ s(AB) + s(AC) + s(BC) - s(A) - s(B) - s(C) + n.

Complete Set of Deduction Rules for supp(ABC), by depth:
depth 0: s_ABC ≥ 0
depth 1: s_ABC ≤ s_AB; s_ABC ≤ s_AC; s_ABC ≤ s_BC
depth 2: s_ABC ≥ s_AB + s_AC - s_A; s_ABC ≥ s_AB + s_BC - s_B; s_ABC ≥ s_AC + s_BC - s_C
depth 3: s_ABC ≤ s_AB + s_AC + s_BC - s_A - s_B - s_C + n
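
These rules translate directly into code. A small sketch (my own illustration) that computes the lower and upper bounds on supp(ABC) from the supports of its proper subsets:

def bounds_ABC(s, n):
    """s: dict mapping itemsets such as 'A' or 'AB' to supports; n: number of transactions."""
    lower = max(0,
                s["AB"] + s["AC"] - s["A"],
                s["AB"] + s["BC"] - s["B"],
                s["AC"] + s["BC"] - s["C"])
    upper = min(s["AB"], s["AC"], s["BC"],
                s["AB"] + s["AC"] + s["BC"] - s["A"] - s["B"] - s["C"] + n)
    return lower, upper

supports = {"A": 5, "B": 6, "C": 6, "AB": 4, "AC": 4, "BC": 5}
low, up = bounds_ABC(supports, n=10)
print(low, up)   # ABC is derivable iff the two bounds coincide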

Derivable Itemsets. Given supp(I) for all I ⊂ J, derive a lower bound L and an upper bound U on supp(J). Without counting, supp(J) ∈ [L, U]. J is a derivable itemset (DI) iff L = U: we then know supp(J) exactly without counting!

Derivable Itemsets. If J is a derivable itemset, there is no need to count supp(J) and no need to store supp(J); we can use the deduction rules instead. Concise representation: C = { (J, supp(J)) | J is not derivable from supp(I), I ⊂ J }.

Probabilistic Model Based Itemset Summarization. We can learn the MRF from non-derivable itemsets alone. Lemma: given a transaction dataset D, the MRF M constructed from all of its σ-frequent itemsets is equivalent to M', the MRF constructed from only its σ-frequent non-derivable itemsets. Can we do better, i.e., further compress the patterns?

Probabilistic Model Based Itemset Summarization. Use smaller itemsets to learn an MRF, use this model to infer the supports of larger itemsets, and use those itemsets whose occurrence cannot be explained by the model (within some error threshold) to augment the model.

Itemset Summarization Algorithm
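
The algorithm itself appears only as a figure in the original slides; the loop below is a hedged sketch of the level-wise procedure just described (the function names, the interface, and the relative-error test are my own, with the MRF learning and inference routines abstracted as callables):

def summarize_itemsets(itemsets_by_size, true_support, epsilon, learn_mrf, infer_support):
    """itemsets_by_size: dict mapping size k to the list of frequent k-itemsets."""
    summary = list(itemsets_by_size.get(1, [])) + list(itemsets_by_size.get(2, []))
    model = learn_mrf(summary, true_support)
    for size in sorted(s for s in itemsets_by_size if s > 2):
        unexplained = []
        for itemset in itemsets_by_size[size]:
            estimate = infer_support(model, itemset)
            # keep only the itemsets the current model fails to explain
            if abs(estimate - true_support[itemset]) > epsilon * true_support[itemset]:
                unexplained.append(itemset)
        if unexplained:
            summary.extend(unexplained)
            model = learn_mrf(summary, true_support)   # augment the model
    return summary, model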

Generalized Non-Derivable Itemsets. All the itemsets in the final summary are non-derivable; the idea is to relax the requirement for an itemset to be derivable.

Experimental Results. Setup: datasets: Chess and Mushroom; performance metrics: summarization accuracy (restoration error), summary size, and summarizing time.

Results on the Chess Dataset. (Figures: estimation accuracy, summary size, and summarizing time.) minSup = 2000; of the frequent itemsets, 1276 are non-derivable.

Results on the Chess Dataset. Skewed itemset distribution when varying the error threshold.

Results on the Mushroom Dataset. (Figures: estimation accuracy, summary size, and summarizing time.) minSup = 2031 (25%), yielding 5545 frequent itemsets, of which 534 are non-derivable.

Results on the Mushroom Dataset. Skewed itemset distribution when varying the error threshold.

Result Summary and Discussions. There do exist redundancies in a collection of itemsets, and the probabilistic model-based summarization scheme can effectively eliminate such redundancies. When datasets are dense and largely satisfy the conditional independence assumption, our summarization approach is extremely efficient; when datasets become sparse and do not satisfy the conditional independence assumption, the summarization task becomes more difficult (it needs more time and space). Itemset-based MRF learning and MRF-based itemset summarization are two interacting procedures.

Querying XML Database – Exploiting Independence Structure from Complex Structural Patterns

Querying XML Database. XML is becoming the standard for data exchange, and we need to query both the structure and the text data of XML documents. The XML twig query, a structural query with small branches, is an important query mechanism; optimizing such queries requires estimating the selectivity of the twigs.

Querying XML Database. An XML document example: DBLP.xml (Digital Bibliography & Library Project).

Querying XML Database. A twig example: FOR all books IN document("DBLP.xml") WHERE publisher = "Morgan Kaufmann" RETURN title. (Figure: the corresponding twig pattern with root b and children p and t, where b: book, p: publisher, t: title.)

Querying XML Database. (Figure: matching the twig b(p, t) against the document, where b: book, p: publisher, t: title; selectivity = 2.)

Problem Statement. The goal is to accurately estimate the selectivity of twig queries with limited memory: we need a structure that stores the relevant statistics of the data, and we then estimate selectivity from these statistics.

Our Approach (TreeLattice). Key idea: store the occurrence statistics of small twigs in the summary; the summary is a lattice consisting of small trees, hence the name TreeLattice. Then, based on these statistics, estimate the selectivity of larger twigs.

Challenges. How do we estimate the selectivity of a given twig from the selectivity information of its sub-twigs? How do we decompose a large twig into smaller twigs? What statistics should we store in the lattice summary?

Estimation Procedure. (Figure: a twig T augmented with edge e1 to node x gives T1, and augmented with edge e2 to node y gives T2.) Lemma: if these two tree augmentations are conditionally independent given T, the selectivity of the twig obtained by augmenting T with both e1 and e2 can be computed from the selectivities of T, T1, and T2.
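
The formula itself is missing from the transcript; under the stated conditional independence it should take the standard form below (my reconstruction, with sel(.) denoting selectivity and T ⊕ e denoting the augmentation of T with edge e):

\mathrm{sel}(T \oplus e_1 \oplus e_2) \;=\; \frac{\mathrm{sel}(T_1)\,\mathrm{sel}(T_2)}{\mathrm{sel}(T)}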

Decomposition Strategies. How do we decompose a large twig into smaller sub-twigs? Options: recursive decomposition with or without voting, fixed-sized decomposition, and hybrid decomposition.

Recursive Decomposition. (Figure: an example twig over nodes a through g recursively split into smaller sub-twigs.) Recursively apply the estimation formula. Multiple feasible decompositions may exist; we rely on voting over them to obtain the best estimate we can. Voting is much more accurate than not voting, but it slows down the estimation process.

Fixed-Sized Decomposition. (Figure: the same twig decomposed into fixed-size sub-twigs.) Very fast, but it cannot be applied directly.

Hybrid Decomposition. (Figure: the same twig handled by combining recursive decomposition with voting and fixed-sized decomposition.)

Summary Statistics. What should we store in the lattice summary? Store important statistics and store non-redundant information. How do we achieve this? Store non-derivable patterns only!

Summary Statistics. A twig pattern is δ-derivable if and only if its true selectivity is within an error tolerance of δ of its expected selectivity according to TreeLattice. 0-derivable (δ = 0) patterns are those patterns whose selectivity can be estimated exactly, so pruning 0-derivable patterns incurs no loss of accuracy.

Summary Statistics. Level-wise lattice summary construction: add all twigs of sizes 1 and 2 to the summary (the base), then add larger non-derivable frequent twigs into the summary until the memory budget is depleted.

Experimental Methodology. Datasets: NASA, PSD, IMDB, and XMark. Workloads: 1000 frequent twig queries of sizes between 4 and 9. Error metric: mean absolute relative error.

Accuracy of Estimators (NASA dataset). Recursive decomposition with voting yields the best estimates. The quality of estimation degrades as the twig size increases, due to error propagation.

Varying Summary Size (NASA dataset). The larger the summary, the better the estimates; TreeLattice makes more efficient use of the memory budget.

Estimation Time (NASA dataset). TreeLattice is very fast when processing relatively small twigs, but recursive decomposition with voting slows down considerably as the twig size increases. Overall, fast decomposition is best.

δ-derivable Pruning. The proportion of 0-derivable patterns is very high on NASA, PSD, and XMark: the tree-growing conditional independence assumption holds well there, and TreeLattice works very well. The assumption does not hold as well on IMDB; how can we improve the estimates on IMDB?

δ-derivable Pruning. A larger δ is good for large twigs, at the cost of sacrificing estimation accuracy for small twigs. (Figure: results on IMDB, compared against TreeSketches.)

Discussions. TreeLattice is effective in estimating the selectivity of XML twig queries: it compares favorably with the state-of-the-art approach, the lattice summary construction is fast, and the online estimation is fast.

Conclusion

Conclusion. Conditional independence structure is common in the real world, and graphical models are effective at capturing such structure and at solving the selectivity estimation problem for database querying. Further directions: model structured data (sequences, trees, graphs) using probabilistic models, and model streaming/incremental data.