Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes
Xiaohui Yu (1), Calisto Zuzarte (2), Ken Sevcik (1); (1) University of Toronto

Presentation transcript:

Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes
Xiaohui Yu (1), Calisto Zuzarte (2), Ken Sevcik (1)
(1) University of Toronto
(2) IBM Toronto Lab

November 3, 2005, CIKM

Distinct value combinations

Country   City       Hotel Name
Germany   Bremen     Hilton
Germany   Bremen     Best Western
Germany   Frankfurt  InterCity
Canada    Toronto    Four Seasons
Canada    Toronto    Intercontinental

• (Country, City) has 3 distinct value combinations: COLSCARD (COlumn Set CARDinality) = 3
• The problem: estimating COLSCARD for a given set of attributes
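To make the definition concrete, here is a minimal Python sketch that counts distinct value combinations in the example table by brute force. The function name `colscard` and the tuple-based row layout are illustrative assumptions, not something taken from the slides.

```python
def colscard(rows, columns):
    """Count the distinct value combinations over the given column positions."""
    return len({tuple(row[c] for c in columns) for row in rows})


# The example table from the slide: (Country, City, Hotel Name).
hotels = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]

country_city = colscard(hotels, [0, 1])     # attribute set (Country, City)
all_columns = colscard(hotels, [0, 1, 2])   # attribute set with Hotel Name added
```

An exact count like this requires a full scan of the table, which is the cost that estimation from existing statistics is meant to avoid.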

Motivation
• Cardinality estimation for query optimization, e.g.,
  - Estimating the size of [expression not preserved in the transcript]
  - Estimating the size of the aggregation
• Approximate query answering, e.g., COUNT queries
• Example query:
    SELECT sales_date, sales_person, SUM(sales_quantity) AS unit_sold
    FROM sales
    GROUP BY sales_date, sales_person

Roadmap
• Related work
• Estimation with known marginal distributions
  - Upper/lower bounds
  - An estimator
• Estimation with histograms
• Experimental results
• Conclusions

Related work
• Previous work has focused on the case of a single attribute: [HÖT88], [HÖT89], [HNSS'95], [HS'98], [CCMN'00]
• A sampling approach is used
  - Estimation through sampling is difficult [CCMN'00]
• No existing statistical information is exploited

Our solution
• Considering multiple attributes
• Utilizing existing statistics on individual attributes
  - Readily available in most database systems
  - Does not require access to the data
• Granularity of statistics
  - Exact marginal frequency distributions
  - Approximate distributions: histograms etc.

Estimation with known marginals
• Number of distinct values in attribute A_i: d_i
• Frequency vector for each attribute A_i

Country   City       Hotel Name
Germany   Bremen     Hilton
Germany   Bremen     Best Western
Germany   Frankfurt  InterCity
Canada    Toronto    Four Seasons
Canada    Toronto    Intercontinental

The naïve estimator
• COLSCARD = number of possible value combinations = $\prod_i d_i$
  - d_i: the number of distinct values in attribute A_i
• Sanity bound: COLSCARD cannot be greater than the table size
• The problem: some value combinations with low occurrence probabilities may not appear in the table!

Upper/Lower bounds
• Trivial bounds
  - Upper bound: $\prod_i d_i$ (the naïve estimator)
  - Lower bound: $\max_i d_i$
• Tighter bounds?
  - In the case of two attributes, tighter bounds are available.
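The trivial bounds can be sketched in a few lines of Python. The lower bound used here, the largest per-attribute distinct count, is an assumption based on the fact that every value of every attribute must occur in at least one combination. The upper bound is the naïve product capped at the table size. The function name `trivial_bounds` is illustrative.

```python
from math import prod


def trivial_bounds(distinct_counts, n_tuples):
    """Return (lower, upper) bounds on COLSCARD from per-attribute distinct counts."""
    # Assumed lower bound: each value of the attribute with the most
    # distinct values has to appear in at least one combination.
    lower = max(distinct_counts)
    # Naive upper bound: all possible combinations, capped at the table size.
    upper = min(prod(distinct_counts), n_tuples)
    return lower, upper


# Two attributes with three distinct values each, in a table of 100 tuples.
bounds = trivial_bounds([3, 3], 100)
```

For two attributes with three distinct values each, this yields the pair 3 and 9 whenever the table holds at least 9 tuples.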

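Tighter bounds
[Figure: example table of size N with two attributes, A1 (values a, b, c) and A2 (values d, e, f), each with a value/frequency table; the value of N and the frequencies are not preserved in the transcript. A surviving label reads [2, 3].]
• Naïve bounds: 3, 9
• Lower bound: [value not preserved]
• Upper bound = 5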

Expected number of combinations
• Assumptions
  1. The data distributions of individual columns are independent
  2. The occurrence of each combination in the table is independent
• Under assumption 1, the frequency of a value combination is the product of the marginal frequencies of its values; these products form the vector f
• Each element of f represents the frequency of a specific value combination
  - An estimate of its probability of occurrence

Estimator
• The probability of the i-th combination not appearing in a particular tuple is $1 - f_i$, where $f_i$ is the i-th element of f
• The probability of the i-th combination not appearing in the table (of size N) is $(1 - f_i)^N$
• The expected number of value combinations is $\sum_i \left[ 1 - (1 - f_i)^N \right]$
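The estimator can be sketched in Python as follows. This is a minimal illustration under the two independence assumptions: each combination's probability is taken to be the product of the relative marginal frequencies of its values, and the expected count sums, over all combinations, the probability of appearing at least once in N tuples. The function name `expected_colscard` and the choice of input (the Country and City marginals read off the hotel table) are assumptions made for this example.

```python
from itertools import product
from math import prod


def expected_colscard(marginals, n_tuples):
    """Expected number of distinct combinations, given relative marginal frequencies.

    marginals: one list of relative frequencies per attribute, each summing to 1.
    n_tuples:  the table size N.
    """
    expected = 0.0
    for combo in product(*marginals):
        p = prod(combo)                           # probability of this combination
        expected += 1.0 - (1.0 - p) ** n_tuples   # P(appears at least once)
    return expected


# Marginals of the hotel table (N = 5 tuples).
country = [3 / 5, 2 / 5]            # Germany, Canada
city = [2 / 5, 1 / 5, 2 / 5]        # Bremen, Frankfurt, Toronto
estimate = expected_colscard([country, city], 5)
```

On this input the estimate falls between the true value of 3 and the capped naïve estimate of 5, because combinations with a low probability contribute much less than 1 to the sum.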

Example revisited
• Estimate the COLSCARD for attribute set (A_1, A_2, A_3), given [inputs not preserved in the transcript]
  - New estimate: 5.94
  - Naïve estimate: 3 × 2 × 2 = 12


Estimation with histograms
• Histograms exist on individual attributes
• Two classes of histograms
  - Partition-based
  - End-biased
• Marginals can be (approximately) reconstructed from histograms
• Optimal histograms in each class?

Optimal histograms
• Minimizing the error incurred by histograms
  - ERR = |EST_hist - EST_exact|
• Partition-based histograms
  - A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used.

Optimal end-biased histograms
• An end-biased histogram with B buckets stores
  - The exact frequencies of B-1 attribute values
  - The average frequency of the remaining values
• Which B-1 values to store exactly?
• Most widely used end-biased histograms store the frequencies of the most frequent values
  - Not always optimal for COLSCARD estimation!
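A minimal Python sketch of the end-biased structure described above: it keeps B-1 frequencies exactly, replaces the rest by their average, and reconstructs an approximate marginal from that. The function names and the decision to store the number of distinct values alongside the average are assumptions made for illustration.

```python
def build_end_biased(freqs, exact_indices):
    """Keep exact frequencies for the chosen indices; average all the others."""
    exact = {i: freqs[i] for i in exact_indices}
    rest = [f for i, f in enumerate(freqs) if i not in exact]
    average = sum(rest) / len(rest) if rest else 0.0
    # Assumption: the number of distinct values is kept so that the
    # marginal can be rebuilt later.
    return exact, average, len(freqs)


def reconstruct(histogram):
    """Rebuild an approximate marginal frequency vector from the histogram."""
    exact, average, n_values = histogram
    return [exact.get(i, average) for i in range(n_values)]


# B = 2 buckets: one exact frequency (the most frequent value) plus the average.
marginal = [0.5, 0.3, 0.1, 0.1]
approx = reconstruct(build_end_biased(marginal, [0]))
```

The reconstructed vector still sums to the same total as the original, since averaging only redistributes the frequencies that are not stored exactly.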

Example
• Attributes (A1, A2), table with N = 10
• Choose 1 frequency to store exactly
[Figure: error versus the index of the frequency stored exactly; the underlying data are not preserved in the transcript.]

Optimal end-biased histograms
• Exhaustive search takes time proportional to the number of ways to choose B-1 of the d_j values, $\binom{d_j}{B-1}$
• We prove that the optimal choice is one of the following
  - Most frequent values
  - Least frequent values
  - A combination of most frequent and least frequent values
• Only need to search both ends
  - Cost is linear in B, independent of d_j!
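The "search both ends" idea can be sketched as follows: instead of trying every subset of B-1 values, only the B candidates that take some number of the most frequent values and fill the remaining slots with the least frequent ones are evaluated. This is an illustrative sketch, not the authors' implementation. It assumes the error is |EST_hist - EST_exact|, that the other attributes use exact marginals, and that the estimate is the expected number of combinations under independence. All function names are hypothetical. Each candidate is evaluated here by brute force, so the sketch only reduces the number of candidates to B; it does not by itself make the total cost independent of d_j.

```python
from itertools import product
from math import prod


def expected_colscard(marginals, n_tuples):
    """Expected number of combinations under the independence assumptions."""
    return sum(1.0 - (1.0 - prod(c)) ** n_tuples for c in product(*marginals))


def approximate(freqs, exact_indices):
    """Keep the chosen frequencies exact; replace the rest by their average."""
    keep = set(exact_indices)
    rest = [f for i, f in enumerate(freqs) if i not in keep]
    avg = sum(rest) / len(rest) if rest else 0.0
    return [f if i in keep else avg for i, f in enumerate(freqs)]


def best_end_biased(freqs, other_marginals, n_tuples, buckets):
    """Try every split of the B-1 exact slots between the two ends."""
    order = sorted(range(len(freqs)), key=lambda i: freqs[i], reverse=True)
    exact_est = expected_colscard([freqs] + other_marginals, n_tuples)
    slots = min(buckets - 1, len(freqs))
    best = None
    for top in range(slots + 1):            # number of most frequent values kept
        bottom = slots - top                # number of least frequent values kept
        chosen = order[:top] + order[len(order) - bottom:]
        approx = approximate(freqs, chosen)
        est = expected_colscard([approx] + other_marginals, n_tuples)
        err = abs(est - exact_est)
        if best is None or err < best[0]:
            best = (err, sorted(chosen))
    return best


# One exact slot (B = 2) for the first attribute; the second attribute is exact.
error, stored = best_end_biased([0.6, 0.2, 0.1, 0.1], [[0.5, 0.5]], 10, 2)
```

With B = 2 only two candidates are compared: storing the most frequent value or storing the least frequent one.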


Experiments: data sets
• Synthetic data
  - Skew: Zipfian parameter z = 0 (uniform) to 4 (highly skewed)
  - Number of tuples: 10K to 1M
• Real data
  - Cover Type: 581,012 tuples, 10 attributes
  - Census Income: 32,561 tuples, 14 attributes
• Error measure: ratio error
  - ERR = max{true/est - 1, est/true - 1}
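The ratio error treats over- and underestimation symmetrically: an estimate that is twice the true value and one that is half of it both give an error of 1. A minimal Python sketch, with an illustrative function name and an added guard for non-positive inputs that is not part of the slide:

```python
def ratio_error(true_value, estimate):
    """ERR = max{true/est - 1, est/true - 1}; equals 0 for a perfect estimate."""
    if true_value <= 0 or estimate <= 0:
        raise ValueError("both values must be positive")
    return max(true_value / estimate - 1.0, estimate / true_value - 1.0)


overestimate = ratio_error(3, 6)    # estimate is twice the true value
underestimate = ratio_error(6, 3)   # estimate is half the true value
```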

Effect of data skew (N = 100K, d_i = 1K)

Effect of number of tuples

Results on real data (panels: 45 pairs, 91 pairs)

Accuracy of end-biased histograms
• Results on the "capital-gain" attribute of the Census Income data set

Conclusions
• Utilizing existing knowledge maintained in database systems
• Proposed upper/lower bounds as well as an estimator
• Considered two cases
  - Exact marginal frequencies
  - Histograms: optimal histograms
• Experimental results show the effectiveness of the proposed method