Overcoming Limitations of Sampling for Aggregation Queries. Surajit Chaudhuri (Microsoft Research), Gautam Das (Microsoft Research), Mayur Datar (Stanford University).

Presentation transcript:

OVERCOMING LIMITATIONS OF SAMPLING FOR AGGREGATION QUERIES
Surajit Chaudhuri (Microsoft Research), Gautam Das (Microsoft Research), Mayur Datar (Stanford University), Rajeev Motwani (Stanford University), Vivek Narasayya (Microsoft Research)
Presented by: Vinit Asher, Deep Pancholi

INTRODUCTION
- Why sampling? Why uniform random sampling?
- Uniform random samples and the factors that introduce significant error:
  - Skewed data, i.e. data characterized by the presence of outlier values
  - Low selectivity of queries (related to aggregate queries)
- How do we overcome these problems?

TECHNIQUES SUGGESTED IN THE PAPER
- For problems due to skew in the data, the paper recommends isolating the values in the dataset that contribute heavily to the sampling error.
- For queries with low selectivity, it exploits workload information to overcome the limitations of uniform sampling.

KEY CONTRIBUTIONS OF THE PAPER
- An experimental evaluation of the proposed techniques, based on an implementation on Microsoft SQL Server.
- A demonstration that combining outlier indexing with weighted sampling significantly reduces error compared to uniform random sampling.

EXAMPLE OF DATA SKEW AND ITS ADVERSE EFFECT
Suppose a table has N = 10,000 tuples, of which 99% have the value 1 and the remaining 1% have the value 1000. The sum over this table is 9,900 + 100,000 = 109,900.
Suppose we take a uniform random sample of 1%, so n = 100, and estimate the sum by scaling the sample sum by N/n = 100.
- If all 100 sampled tuples have the value 1, the estimate is 100 * 100 = 10,000, far below the correct answer.
- If the sample has 99 tuples with value 1 and 1 tuple with value 1000, the estimate is 1099 * 100 = 109,900, which is exact.
- If the sample has 98 tuples with value 1 and 2 tuples with value 1000, the estimate is 2098 * 100 = 209,800.
So drawing the value 1000 more than once gives a huge error, and so does not drawing it at all. However, the probability of drawing the value 1000 exactly once is only about 0.37, which means the probability of an erroneous result is about 0.63.
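The arithmetic in this example can be checked with a short sketch. It assumes the binomial approximation (each sampled tuple is an outlier independently with probability 0.01), which is what yields the slide's figure of about 0.37; the function names are illustrative and not taken from the paper.

```python
from math import comb

N, n = 10_000, 100                       # table size and sample size (1% sample)
TRUE_SUM = 9_900 * 1 + 100 * 1_000       # 99% of tuples are 1, 1% are 1000
SCALE = N // n                           # scale-up factor N/n


def estimate(k):
    """Scaled-up sum when k of the n sampled tuples have the value 1000."""
    return SCALE * ((n - k) * 1 + k * 1_000)


def prob_k_outliers(k, p=0.01):
    """Binomial approximation of the chance that exactly k sampled tuples are outliers."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


if __name__ == "__main__":
    for k in range(3):
        print(k, estimate(k), round(prob_k_outliers(k), 3))
```

Only k = 1 reproduces the true sum; every other outcome is off by at least 99,900.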

EFFECT OF DATA SKEW (CONTINUED)
Similar arguments hold for the average aggregate. Tuples whose contribution to the aggregate deviates strongly from that of the other tuples are known as outliers.

EFFECT OF DATA SKEW (CONTINUED)
Theorem 1: For a relation R with elements $\{y_1, y_2, \dots, y_N\}$, let U be a uniform random sample of size n. Then the actual sum is $Y = \sum_{i=1}^{N} y_i$, and an unbiased estimate of it is $Y_e = (N/n) \sum_{y_i \in U} y_i$, with standard error
$\epsilon = \frac{N S}{\sqrt{n}} \sqrt{1 - \frac{n}{N}}$,
where S is the standard deviation of the values in the relation, defined as
$S = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2}$, with $\bar{Y} = Y/N$.
When n is much smaller than N, the last factor is close to 1 and the error is approximately $N S / \sqrt{n}$.
This shows that if there are outliers in the data, S can be very large. In that case, meeting a given error bound requires a larger sample size n.
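As a sanity check on Theorem 1, the following sketch computes the standard error of the scaled-up sum and compares it with a brute-force enumeration of every possible sample of a tiny relation. It assumes sampling without replacement and the textbook form of the error with the finite-population correction; `sum_standard_error` and `brute_force_error` are hypothetical helper names.

```python
import math
import statistics
from itertools import combinations


def sum_standard_error(values, n):
    """Standard error of (N/n) * sample_sum for a uniform sample of size n."""
    N = len(values)
    S = statistics.stdev(values)  # uses the N-1 denominator
    return N * S / math.sqrt(n) * math.sqrt(1 - n / N)


def brute_force_error(values, n):
    """Mean estimate and root-mean-square error over all samples of size n."""
    N, true_sum = len(values), sum(values)
    estimates = [N / n * sum(s) for s in combinations(values, n)]
    mean_est = sum(estimates) / len(estimates)
    rmse = math.sqrt(sum((e - true_sum) ** 2 for e in estimates) / len(estimates))
    return mean_est, rmse


if __name__ == "__main__":
    tiny = [1, 2, 3, 10]
    print(brute_force_error(tiny, 2), sum_standard_error(tiny, 2))
    skewed = [1] * 9_900 + [1_000] * 100
    print(sum_standard_error(skewed, 100))  # comparable to the true sum of 109,900
```

For the skewed table of the earlier example, the standard error is of the same order as the sum itself, which is why a 1% uniform sample is unreliable there.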

EFFECT OF LOW SELECTIVITY AND SMALL GROUPS
What happens if the selectivity of a query is low? It adversely affects the accuracy of sampling-based estimation.
A selection query partitions the relation into two sub-relations:
- Tuples that satisfy the selection condition
- Tuples that do not satisfy the condition
When we sample uniformly, the number of tuples sampled from the relevant sub-relation is proportional to its size. Hence, if the relevant sub-relation is small, few of the sampled tuples are relevant, which can lead to huge errors.
For uniform random sampling to perform well, the relevant sub-relation must be large, which is not the case in general.
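To make the effect of low selectivity concrete, the sketch below computes how many sampled tuples are expected to be relevant and how likely it is that the sample contains none at all. It assumes each sampled tuple satisfies the predicate independently with probability equal to the selectivity (a binomial approximation); the function names are illustrative.

```python
def expected_relevant(sample_size, selectivity):
    """Expected number of sampled tuples that satisfy the selection condition."""
    return sample_size * selectivity


def prob_no_relevant(sample_size, selectivity):
    """Chance that a uniform sample contains no relevant tuple at all."""
    return (1 - selectivity) ** sample_size


if __name__ == "__main__":
    # A 1,000-tuple sample and a predicate matching 0.1% of the relation.
    print(expected_relevant(1_000, 0.001))
    print(round(prob_no_relevant(1_000, 0.001), 3))
```

With these numbers only one relevant tuple is expected, and more than a third of all samples contain none, so the aggregate cannot be estimated at all.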

HANDLING DATA SKEW: OUTLIER INDEXES
Outliers/deviants in the data -> large variance -> high errors.
Hence we identify the tuples with outlier values and store them in a separate sub-relation. The proposed technique results in a very accurate estimate of the aggregate.
In the outlier indexing method, a given relation R is partitioned into R_O (outliers) and R_NO (non-outliers). The query Q can then be run as the union of two sub-queries, one on R_O and another on R_NO.

HANDLING DATA SKEW: OUTLIER INDEXES (EXAMPLE)
Preprocessing steps:
1. Determine outliers: specify the sub-relation R_O of R to be the set of outliers.
2. Sample non-outliers: select a uniform random sample T of the relation R_NO.
Query processing steps:
1. Aggregate outliers: apply the query to the outliers in R_O.
2. Aggregate non-outliers: apply the query to the sample T and extrapolate to obtain an estimate of the query result for R_NO.
3. Combine aggregates: combine the approximate result for R_NO with the exact result for R_O to obtain an approximate result for R.
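The preprocessing and query processing steps can be sketched in a few lines for a sum query over an in-memory list. This is an illustration under simplifying assumptions, not the paper's SQL Server implementation: the outlier test is passed in as a predicate, and `build_outlier_synopsis` and `approximate_sum` are hypothetical names.

```python
import random


def build_outlier_synopsis(values, is_outlier, sample_size, rng):
    """Preprocessing: split R into outliers R_O and a uniform sample T of R_NO."""
    outliers = [y for y in values if is_outlier(y)]
    non_outliers = [y for y in values if not is_outlier(y)]
    sample = rng.sample(non_outliers, min(sample_size, len(non_outliers)))
    return outliers, sample, len(non_outliers)


def approximate_sum(outliers, sample, n_non_outliers, predicate=lambda y: True):
    """Query processing: exact sum over R_O plus scaled-up sum over T."""
    exact_part = sum(y for y in outliers if predicate(y))
    if not sample:
        return exact_part
    scale = n_non_outliers / len(sample)  # extrapolate from T to R_NO
    estimated_part = scale * sum(y for y in sample if predicate(y))
    return exact_part + estimated_part


if __name__ == "__main__":
    table = [1] * 9_900 + [1_000] * 100
    synopsis = build_outlier_synopsis(table, lambda y: y >= 1_000, 100, random.Random(0))
    print(approximate_sum(*synopsis))
```

On the skewed table from the earlier example, the estimate is exact for any random seed, because all remaining non-outlier values are identical.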

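SELECTION OF OUTLIERS
With outlier indexing, the query error comes solely from the error in estimating the aggregate over the non-outliers. However, there is additional overhead for maintaining and accessing the outlier index.
Theorem 2: Consider a multiset $R = \{y_1, y_2, \dots, y_N\}$ in sorted order. Let $R_O \subseteq R$ be the subset such that $|R_O| \le \tau$ and the standard deviation $S(R \setminus R_O)$ of the remaining values is minimized. Here $\tau$ is the memory allocated for outlier indexing. Then $R_O$ consists of the $\tau'$ smallest and the $\tau - \tau'$ largest values of R, for some $0 \le \tau' \le \tau$.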

ALGORITHM FOR OUTLIER INDEX (R, C, τ)
Algorithm Outlier-Index(R, C, τ):
1. Let the values in column C of relation R be sorted.
2. For i = 1 to τ+1, compute $E(i) = S(\{y_i, y_{i+1}, \dots, y_{N-\tau+i-1}\})$.
3. Let i' be the value of i for which E(i) is minimum, and let τ' = i' - 1.
4. The outlier index is the set of tuples that correspond to the values $\{y_j \mid 1 \le j \le \tau'\} \cup \{y_j \mid N + \tau' + 1 - \tau \le j \le N\}$, i.e. the τ' lowest values and the τ - τ' highest values.
Using Theorem 1, it is possible to give a (probabilistic) standard error guarantee for the estimated answer.
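The algorithm amounts to sliding a window of N - τ consecutive sorted values and keeping the window with the smallest spread. The sketch below is one possible rendering: it works on a plain list of column values rather than a relation, uses prefix sums so that each window costs constant time, and minimizes the population variance (which has the same minimizer as the standard deviation). The function name `outlier_index` is illustrative.

```python
import math


def outlier_index(values, tau):
    """Return the tau values whose removal minimizes the spread of the rest."""
    ys = sorted(values)
    N = len(ys)
    if not 0 <= tau < N:
        raise ValueError("tau must satisfy 0 <= tau < len(values)")
    keep = N - tau  # size of the non-outlier window

    # Prefix sums of y and y^2 give each window's variance in O(1).
    # Note: this form can lose precision for very large values.
    p1, p2 = [0.0], [0.0]
    for y in ys:
        p1.append(p1[-1] + y)
        p2.append(p2[-1] + y * y)

    best_start, best_var = 0, math.inf
    for start in range(tau + 1):  # start equals tau' (0-based form of i' - 1)
        s1 = p1[start + keep] - p1[start]
        s2 = p2[start + keep] - p2[start]
        var = max(s2 / keep - (s1 / keep) ** 2, 0.0)
        if var < best_var:
            best_start, best_var = start, var

    # The tau' lowest values plus the tau - tau' highest values.
    return ys[:best_start] + ys[best_start + keep:]


if __name__ == "__main__":
    print(outlier_index([-500, 1, 2, 3, 4, 900], 2))
```

After the initial sort, the scan is linear in N, so the sort dominates the cost.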

STORAGE ALLOCATION FOR OUTLIER INDEXING
Given sufficient space to store m tuples, how do we allocate storage between the sample and the outlier index so as to get minimum error?
Suppose S(t) denotes the standard deviation of the non-outliers for an optimal outlier index of size t. From Theorem 1, the error is proportional to $S(t) / \sqrt{m - t}$, where t tuples are in the outlier index and m - t tuples are in the sample.
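One simple way to explore this trade-off is to evaluate the error proxy S(t) / sqrt(m - t) for every feasible t and pick the minimum. The sketch below does this by brute force on a small in-memory list; it is not the paper's procedure for choosing t, and `best_non_outlier_std` and `best_allocation` are hypothetical names.

```python
import math
import statistics


def best_non_outlier_std(sorted_values, t):
    """S(t): smallest standard deviation left after removing t values from the ends."""
    keep = len(sorted_values) - t
    return min(statistics.stdev(sorted_values[i:i + keep]) for i in range(t + 1))


def best_allocation(values, m):
    """Return (t, proxy) minimizing S(t) / sqrt(m - t) over outlier-index sizes t."""
    ys = sorted(values)
    best_t, best_err = 0, math.inf
    for t in range(m):  # keep at least one tuple of storage for the sample
        if len(ys) - t < 2:  # stdev needs at least two remaining values
            break
        err = best_non_outlier_std(ys, t) / math.sqrt(m - t)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err


if __name__ == "__main__":
    data = list(range(1, 21)) + [10_000, 20_000]
    print(best_allocation(data, 6))
```

In this toy data set the two extreme values are worth indexing, but spending further storage on the outlier index only shrinks the sample without reducing the spread much.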

EXTENSION OF OUTLIER INDEXING TO OTHER AGGREGATES
- For the count aggregate, outlier indexing is not beneficial, since there is no variance among the data values.
- For the avg (average) aggregate, the query is estimated during query processing as sum/count.
- Outlier indexing is also not beneficial for aggregates that depend on the rank of tuples rather than their actual values (such as min, max or median).

FINAL NOTES ON OUTLIER INDEXING
- The outlier indexing technique works best for aggregations on a single table that do not involve foreign-key joins.
- It is beneficial for sum and average aggregations, but not for count or for rank-order based aggregates such as min, max or median.
- As future work, the authors are investigating whether to create separate outlier indexes for frequently occurring functions, based on workload information.

HANDLING LOW SELECTIVITY AND SMALL GROUPS
In this case we want to use weighted sampling. In other words, we want to sample more from subsets of the data that are small in size but important (i.e. heavily used).
We select a representative workload (i.e. a set of queries) and tune the sample so that queries posed to the database can be answered more accurately.
The technique described in the paper applies to precomputed samples only. However, research is being done on applying this solution to online sampling.

EXPLOITING WORKLOAD INFORMATION
- Workload collection: obtaining a workload of representative queries that are posed to the database. Tools such as the profiler component of Microsoft SQL Server allow logging of the queries posed to the database.
- Tracing query patterns: analyzing the workload to obtain parsed information.
- Tracing tuple usage: recording how many times a tuple was accessed, how many times it satisfied the query condition, how many times it did not, etc.
- Weighted sampling: sampling that takes the weights of the tuples into account.

WEIGHTED SAMPLING: DEEPER INSIGHT
- Weighted sampling is similar to the stratified sampling covered in class.
- Every tuple is assigned a weight based on the traced tuple usage.
- Whenever we draw a sample, tuple i is included in the sample with probability $p_i = n \cdot w_i'$, where $w_i'$ is the weight of tuple i normalized so that the weights sum to 1.
- As discussed in class, the inverse of $p_i$ is the multiplication factor. Each aggregate computed over the tuple is multiplied by this factor to answer the query.
- In a uniform random sample every tuple has the same probability $p_i = n/N$, so the multiplication factor is N/n.
- This method works well only if the workload is a good representation of the queries that will actually be posed in the future, and if the access pattern of the queries is local in nature.
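The inclusion probabilities and multiplication factors can be illustrated with a small sketch. It assumes each tuple is included independently with probability p_i (so the realized sample size only equals n on average) and caps p_i at 1; both are simplifications chosen for this illustration, and the function names are hypothetical.

```python
import random


def inclusion_probabilities(weights, n):
    """p_i = n * w_i / sum(w), capped at 1 so it remains a probability."""
    total = sum(weights)
    return [min(1.0, n * w / total) for w in weights]


def weighted_sample(values, weights, n, rng):
    """Keep each tuple with probability p_i, storing its multiplication factor 1/p_i."""
    probs = inclusion_probabilities(weights, n)
    return [(y, 1.0 / p) for y, p in zip(values, probs) if p > 0 and rng.random() < p]


def estimate_sum(sample, predicate=lambda y: True):
    """Scale every qualifying sampled value by its multiplication factor."""
    return sum(y * factor for y, factor in sample if predicate(y))


if __name__ == "__main__":
    values = [10, 20, 30, 4000]
    usage = [1, 1, 1, 9]  # the last tuple is hit by most workload queries
    sample = weighted_sample(values, usage, 2, random.Random(1))
    print(sample, estimate_sum(sample))
```

With equal weights this reduces to uniform sampling, where every factor is N/n; heavily used tuples get a higher inclusion probability and therefore a smaller factor.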

IMPLEMENTATION AND EXPERIMENTAL SETUP
Databases: The experiments use the TPC-R benchmark database. The data generation program was modified to produce varying degrees of skew; the modified program generates data following a Zipfian distribution.
Parameters:
- Skew of the data (z): 1, 1.5, 2, 2.5 and 3
- Sampling fraction (f): 1% to 100%
- Storage for the outlier index: 1%, 5%, 10% and 20%

IMPLEMENTATION AND EXPERIMENTAL SETUP (2)
The comparisons covered three options:
1) Uniform sampling
2) Weighted sampling
3) Weighted sampling + outlier indexing
The results shown are for the case in which the storage for the outlier index was only 10% of the size of the data set.

EXPERIMENTAL RESULTS
(Three slides of result charts; the figures are not reproduced in this transcript.)

CONCLUSION
- Skew in the data can lead to considerably large errors. Outlier indexing addresses this problem at a small additional overhead.
- The difficulty lies in creating a single outlier index that works for any query on the database.
- Low selectivity of queries is addressed by workload-based weighted sampling.
Some important references: - for more info on TPC-R standard database.

DIRECTION OF FUTURE WORK
- Building a single outlier index for different aggregates and aggregate expressions
- Tuning the selection of the outlier index using workload information
- Extending the techniques discussed to a wider class of join queries