Dependency-Based Histogram Synopses for High-dimensional Data Amol Deshpande, UC Berkeley Minos Garofalakis, Bell Labs Rajeev Rastogi, Bell Labs.

Presentation transcript:

Dependency-Based Histogram Synopses for High-dimensional Data Amol Deshpande, UC Berkeley Minos Garofalakis, Bell Labs Rajeev Rastogi, Bell Labs

Why Synopses?
  Selectivity estimation for query optimization
  Approximate querying
    Useful when it is not feasible to query the entire database
  Prevalent techniques:
    Histograms, wavelets
      Suffer from the “curse of dimensionality”
    Random sampling
      Very few matches for selections in sparse high-dimensional data

Problem Statement
  Given a “counts table”, find an approximate answer to an aggregate range-sum query
  The counts table can be thought of as a joint probability distribution
  Evaluating an aggregate range-sum query is equivalent to finding a joint probability distribution
  [Figure: example counts table with columns A, B and count]
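As a minimal sketch of the correspondence stated on this slide, the snippet below normalizes a small counts table into a joint distribution and answers a range-sum query either from the counts or from the probabilities. The table contents, the query and all function names are hypothetical illustrations, not taken from the presentation.

```python
from fractions import Fraction

# Hypothetical counts table over two attributes: (A, B) -> count.
COUNTS = {(1, 1): 4, (1, 2): 1, (2, 1): 2, (2, 2): 3}

def joint_distribution(counts):
    """Normalize counts into a joint probability distribution p(A, B)."""
    total = sum(counts.values())
    return {cell: Fraction(c, total) for cell, c in counts.items()}

def in_range(cell, ranges):
    """True if every attribute value lies in its inclusive (lo, hi) range."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(cell, ranges))

def range_sum(counts, ranges):
    """Exact aggregate range sum: total count of the cells inside the ranges."""
    return sum(c for cell, c in counts.items() if in_range(cell, ranges))

def range_probability(p, ranges):
    """Probability mass of the same region under the joint distribution."""
    return sum(q for cell, q in p.items() if in_range(cell, ranges))

TOTAL = sum(COUNTS.values())
P = joint_distribution(COUNTS)
QUERY = [(1, 1), (1, 2)]  # A = 1 and 1 <= B <= 2
```

Multiplying `range_probability(P, QUERY)` by `TOTAL` gives back `range_sum(COUNTS, QUERY)`, which is why an approximation of the distribution also yields an approximate range sum.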

Histograms for High Dimensions
  Assume “attribute independence” and build per-attribute one-dimensional histograms
    Simple to build and maintain
    Highly inaccurate in the presence of correlations
  “Multi-dimensional” histograms [PI’97, MVW’98, GKT’00]
    Expensive to build and maintain
    Large number of buckets required for reasonable accuracy in high dimensions
    Not suitable for queries on lower-dimensional subsets of attributes
  These are the two extremes in terms of the underlying correlations!

Our Approach: Dependency-Based (DB) Histograms
  Build a statistical model on the attributes of the data
  Based on the model, build a set of low-dimensional histograms
  Use this collection of histograms to provide approximate answers

Outline
  Motivation
  Decomposable Models
  Building a collection of histograms
  Query evaluation
  Experimental evaluation

Decomposable Models
  Specify correlations between attributes
  Examples:
    Partial independence: p(salary = s, height = h, weight = w) = p(salary = s) p(height = h, weight = w)
    Conditional independence: p(salary = s, age = a | YPE = y) = p(salary = s | YPE = y) p(age = a | YPE = y)
  Advantages of decomposable models:
    Closed-form estimates for the joint probability exist
    Interpretation in terms of partial and conditional independence statements
    Can be represented as a graph

Decomposable Models - Example
  Interpretation: attributes A and D are conditionally independent given attributes B and C, i.e., p(AD|BC) = p(A|BC) p(D|BC)
  Graphical representation: [Figure: model graph over attributes A, B, C, D]
  Markov property: if T separates U and V, then p(UV|T) = p(U|T) p(V|T)
  Joint probability distribution: p(ABCD) = p(ABC) p(BCD) / p(BC)
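The closed-form estimate on this slide can be checked numerically. The sketch below builds a hypothetical distribution over four binary attributes in which A and D are conditionally independent given B and C, then verifies that the joint is recovered exactly from the marginals on ABC, BCD and BC. All probability values and names are illustrative assumptions.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical ingredients: p(B, C), p(A = 1 | B, C) and p(D = 1 | B, C).
P_BC = {(0, 0): F(1, 2), (0, 1): F(1, 4), (1, 0): F(1, 8), (1, 1): F(1, 8)}
P_A1 = {(0, 0): F(1, 3), (0, 1): F(1, 2), (1, 0): F(3, 4), (1, 1): F(1, 5)}
P_D1 = {(0, 0): F(2, 3), (0, 1): F(1, 6), (1, 0): F(1, 2), (1, 1): F(4, 5)}

def bernoulli(p_one, value):
    """Probability of a binary value when p_one is the probability of 1."""
    return p_one if value == 1 else 1 - p_one

# Joint built so that A and D are conditionally independent given (B, C).
JOINT = {
    (a, b, c, d): P_BC[b, c] * bernoulli(P_A1[b, c], a) * bernoulli(P_D1[b, c], d)
    for a, b, c, d in product((0, 1), repeat=4)
}

def marginal(joint, keep):
    """Sum out every attribute position not listed in keep."""
    out = {}
    for cell, p in joint.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0) + p
    return out

P_ABC = marginal(JOINT, (0, 1, 2))
P_BCD = marginal(JOINT, (1, 2, 3))
P_BC_M = marginal(JOINT, (1, 2))

# Closed-form estimate: p(ABCD) = p(ABC) p(BCD) / p(BC).
RECONSTRUCTED = {
    (a, b, c, d): P_ABC[a, b, c] * P_BCD[b, c, d] / P_BC_M[b, c]
    for a, b, c, d in JOINT
}
```

Because the example distribution satisfies the conditional independence exactly, `RECONSTRUCTED` equals `JOINT`; for real data the same formula only gives an approximation whose quality depends on how well the model fits.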

What to do with the Model?
  Build clique histograms on marginals corresponding to the maximal cliques of the model
  [Figure: model graph over A, B, C, D] p(ABCD) = p(ABC) p(BCD) / p(BC); histograms on: ABC, BCD
  [Figure: model graph over A, B, C, D] p(ABCD) = p(AC) p(BC) p(DC) / p(C)^2; histograms on: AC, BC, DC

Searching for the Best Model
  NP-hard to find the best model
  Heuristic forward selection: start with the full independence assumption and grow the model greedily
  Growing a model:
    Need to stay in the space of decomposable models
    Naïve approach: try every possible extension of the current model
      Works for a small number of attributes
    Developed a more sophisticated algorithm [DGJ01]

Choosing among Models
  Kullback-Leibler information divergence
    A measure of “distance” between two probability distributions
  Choosing among possible extensions
    Maximize the increase in approximation accuracy due to the increase in complexity
    Maximize the ratio of the increase in approximation accuracy to the increase in total state space
  When to stop?
    Limit the maximum dimensionality of a histogram
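A minimal sketch of the two ingredients named on this slide: the Kullback-Leibler divergence between a distribution and its model, and a ratio-based choice among candidate extensions. The example distribution, the candidate tuples and the function names are hypothetical; the actual scoring used in the presentation may differ in detail.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum over x of p(x) * log(p(x) / q(x)), skipping p(x) = 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def independence_model(p):
    """Approximate a two-attribute joint by the product of its marginals."""
    pa, pb = {}, {}
    for (a, b), pab in p.items():
        pa[a] = pa.get(a, 0.0) + pab
        pb[b] = pb.get(b, 0.0) + pab
    return {(a, b): pa[a] * pb[b] for a in pa for b in pb}

def best_extension(candidates):
    """Pick the candidate with the largest accuracy gain per unit of state space.

    Each candidate is (name, decrease in KL divergence, increase in state space).
    """
    return max(candidates, key=lambda c: c[1] / c[2])[0]

# Hypothetical correlated joint distribution over two binary attributes.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Q = independence_model(P)
```

Here `kl_divergence(P, Q)` is positive because the two attributes are correlated, so the independence model loses information; `best_extension` then illustrates trading that gain against the extra state space an extension costs.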

Building Clique Histograms
  MHIST approach [PI97]: partition the space to be covered through recursive splits
  Split Tree representation of MHISTs
  MHIST projection and multiplication: performed directly on the Split Tree representation

Storage Space Allocation
  Minimize total error for a given storage space
  Can be solved in time O(CB^2), where C is the number of histograms and B is the total space available
  More efficient heuristic: greedily allocate additional buckets to the histogram that maximizes the decrease in error per unit space
    Optimal if the individual histogram error functions follow the law of diminishing returns
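The greedy heuristic on this slide can be sketched as follows, under the simplifying assumptions that every bucket costs one unit of space, that each histogram starts with one bucket, and that the error of each histogram is available as a function of its bucket count. The error functions and names below are hypothetical.

```python
import heapq

def greedy_allocate(error_fns, total_buckets):
    """Give each histogram one bucket, then hand out the rest greedily.

    error_fns[i](b) is the (assumed known) error of histogram i with b buckets.
    Each step awards one bucket to the histogram with the largest error decrease.
    """
    if total_buckets < len(error_fns):
        raise ValueError("need at least one bucket per histogram")
    alloc = [1] * len(error_fns)
    # Max-heap on the error decrease of the next bucket (negated for heapq).
    heap = [(-(f(1) - f(2)), i) for i, f in enumerate(error_fns)]
    heapq.heapify(heap)
    for _ in range(total_buckets - len(error_fns)):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        b = alloc[i]
        heapq.heappush(heap, (-(error_fns[i](b) - error_fns[i](b + 1)), i))
    return alloc

# Hypothetical error curves with diminishing returns.
ERROR_FNS = [lambda b: 100 / b, lambda b: 12 / b]
ALLOCATION = greedy_allocate(ERROR_FNS, 6)
```

With these curves the first histogram receives most of the space because each of its early buckets removes far more error; the greedy choice matches the optimum only when every curve has diminishing returns, as the slide notes.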

Query Evaluation
  Compute the joint probability distribution and project
  More efficient evaluation algorithm: use the “Junction Tree” of the model graph
    Nodes: maximal cliques of the model
    Edges: an edge between two nodes A and B only if S = A ∩ B separates A - S and B - S
  Minimize the number of operations for computing any marginal probability distribution
    Operation ordering is related to join order optimization

Junction Trees
  [Figure: model graph over attributes A, B, C, D, E and its junction tree with clique nodes ABC, BCD, CDE; panels: computing p(AD), computing p(AE)]
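To illustrate how marginals such as p(AD) and p(AE) can be computed from clique marginals alone, the sketch below uses the chain of cliques ABC, BCD, CDE named on this slide. The example distribution is hypothetical and constructed to satisfy the model exactly; exact marginal tables stand in for the clique histograms, and attributes are summed out as early as possible instead of materializing the full joint.

```python
from fractions import Fraction as F
from itertools import product

VARS = "ABCDE"
BITS = (0, 1)

def pick(p_one, value):
    """Probability of a binary value when p_one is the probability of 1."""
    return p_one if value == 1 else 1 - p_one

# Hypothetical joint of the form p(ABC) p(D | BC) p(E | CD).
JOINT = {}
for a, b, c, d, e in product(BITS, repeat=5):
    p_abc = F(1 + a + 2 * b + 4 * c, 36)
    p_d = pick(F(1 + b + c, 4), d)
    p_e = pick(F(1 + 2 * c + d, 5), e)
    JOINT[a, b, c, d, e] = p_abc * p_d * p_e

def marginal(joint, names):
    """Brute-force marginal over the named attributes (used for checking)."""
    idx = [VARS.index(n) for n in names]
    out = {}
    for cell, p in joint.items():
        key = tuple(cell[i] for i in idx)
        out[key] = out.get(key, 0) + p
    return out

# Clique and separator marginals: the only inputs to query evaluation.
ABC, BCD, CDE = (marginal(JOINT, n) for n in ("ABC", "BCD", "CDE"))
BC, CD = marginal(JOINT, "BC"), marginal(JOINT, "CD")

def p_ad():
    """p(AD): combine the neighboring cliques ABC and BCD, sum out B and C."""
    return {(a, d): sum(ABC[a, b, c] * BCD[b, c, d] / BC[b, c]
                        for b in BITS for c in BITS)
            for a in BITS for d in BITS}

def p_ae():
    """p(AE): sum out B first, then pass the result on to clique CDE."""
    acd = {(a, c, d): sum(ABC[a, b, c] * BCD[b, c, d] / BC[b, c] for b in BITS)
           for a in BITS for c in BITS for d in BITS}
    return {(a, e): sum(acd[a, c, d] * CDE[c, d, e] / CD[c, d]
                        for c in BITS for d in BITS)
            for a in BITS for e in BITS}
```

Summing out B before touching CDE keeps every intermediate table at three attributes; choosing such an order is the part that resembles join order optimization.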

Experimental Evaluation
  Census data:
    Census-6: citizenship, native country of father, native country of mother, native country of the sample person, occupation code, age
    Census-12: industry code, hours worked, education, state, county, race
  Error metrics:
    Absolute relative error: |correct - approx| / correct
    Multiplicative error: max{correct, approx} / min{correct, approx}
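The two error metrics translate directly into code. This is a small sketch with hypothetical function names; both metrics are undefined when the correct answer (or, for the multiplicative error, the smaller of the two values) is zero, a case the slide does not address.

```python
def absolute_relative_error(correct, approx):
    """|correct - approx| / correct; requires correct != 0."""
    return abs(correct - approx) / correct

def multiplicative_error(correct, approx):
    """max{correct, approx} / min{correct, approx}; requires both > 0."""
    return max(correct, approx) / min(correct, approx)
```

For a correct answer of 100 and an estimate of 80, the absolute relative error is 0.2 and the multiplicative error is 1.25; the multiplicative error is symmetric, so an estimate of 125 against a correct answer of 100 also scores 1.25.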

Methods Compared
  MHIST: multi-dimensional histogram on all the attributes
  IND: per-attribute one-dimensional histograms
  Dependency-based histograms:
    DB1: model selected based on statistical significance
    DB2: model selected with the goal of maximizing the ratio of the increase in approximation accuracy to the increase in total state space

Results on Census-6

Results on Census-12

Summary of Results
  Decomposable models are effective
    Good approximations with models of small complexity
  Better approximate answer quality
    As much as 5 times lower errors
  Storage efficient
    Fairly accurate answers with less than 1% space

Conclusions
  Proposed an approach to building synopses by explicitly identifying and using correlations present in the data
  Developed an efficient forward selection procedure
  Developed efficient algorithms for building and using collections of histograms
  The general methodology presented is applicable to other synopsis methods as well

Future Work
  Maintenance
    Additional problem of maintaining the underlying model
  Error guarantees
  Applicability to other synopsis techniques
  Exploiting a more general class of models for storing clique marginals

References
  [MD88] Muralikrishna and DeWitt; Equi-depth histograms for estimating selectivity factors for multi-dimensional queries; SIGMOD’88
  [PI97] Poosala and Ioannidis; Selectivity Estimation Without the Attribute Value Independence Assumption; VLDB’97
  [BFH75] Bishop, Fienberg and Holland; Discrete Multivariate Analysis; MIT Press, 1975
  [MVW98] Matias, Vitter and Wang; Wavelet-Based Histograms for Selectivity Estimation; SIGMOD’98
  [DGJ01] Deshpande, Garofalakis and Jordan; Efficient Stepwise Selection in Decomposable Models; UAI’01

Census-6: Model Found
  [Figure: model graph over the attributes CTZN, NC-S, NC-M, NC-F, Occ, Age]

Previous Work
  Approximate querying: maintaining a collection of histograms to answer queries
    [PK00] Based on the query workload; uses “Iterative Proportional Fitting” to answer queries
    [PG99] Driven by a pre-specified error bound
  Issues not addressed:
    Selection of histograms
    Answering higher-dimensional queries using lower-dimensional histograms

References
  [PG99] Poosala and Ganti; Fast approximate answers to aggregate queries on a data cube; SSDBM’99
  [PK00] Palpanas and Koudas; Entropy based approximate querying and exploration of datacubes; TR, Univ of Toronto, 2000

What to do with the Model?
  Build clique histograms on the probability marginals corresponding to the maximal cliques of the model
  Example: build histograms on the attribute sets ABC and BCD
  The full probability distribution can be recovered from the clique marginals: p(ABCD) = p(ABC) p(BCD) / p(BC)
  [Figure: model graph over attributes A, B, C, D]

Operations on MHISTs
  Our algorithms perform the required operations directly on the Split Tree representation
  Operations:
    Projection
    Multiplication