Overcoming Limitations of Sampling for Aggregation Queries

Overcoming Limitations of Sampling for Aggregation Queries
Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Mayur Datar, Stanford University Rajiv Motwani, Stanford University Vivek Narasayya, Microsoft Research Presented by Ranjan Dash

Outline Abstract Introduction Related Work Study of Limitations
Handling Data skew Handling low selectivity & small groups Implementation and experimental result Conclusion and future work 9/20/2018 Ranjan Dash

Abstract Approximate aggregation query processing using sampling
Uniform sampling performs poorly when - Distribution of aggregated attribute is skewed Queries with low selectivity Solution proposed by the paper - Outlier indexing Weighted sampling based on workload information A Combined approach significantly reduces the approximation error. 9/20/2018 Ranjan Dash

Introduction Approximate query processing used extensively by OLAP and Data mining tools. Use of uniform random sampling is error prone because of - Presence of data skew – presence of outlier values that are significantly different from the rest in terms of their contribution to the aggregate Low selectivity – Very few or no tuples may satisfy the query predicate. Extrapolating from such a small set of tuples led to error. Two techniques to overcome these limitations of URS. Isolate values in the data that could contribute heavily to the error in sampling – Outlier-indexing Exploit the workload information for queries with low selectivity. Single table queries involving selection and group by queries with sum aggregate – can be extended to other aggregation functions and foreign key joins. Combination of outlier-indexing and weighted sampling based on workload info results in significant error reduction. 9/20/2018 Ranjan Dash

Related Work Related research in approximately answering aggregation queries – Use of online sampling technique for data with little or no variability – [12] Use of pre-computed samples or synopsis in answering aggregation queries – [2] Techniques for fast incremental maintenance of summery statistics to provide approximate query answer – [9] Proposal of a weighted sampling scheme – [1] Use of weighted sampling to continuously tuning representative sample of data Concept of detecting and removing deviants in time series data – [14] 9/20/2018 Ranjan Dash

Limitations of uniform random sampling – an example
Adverse impact of data skew on aggregation queries – Relation R with 10,000 tuples – aggregate on column ‘C’ 99%(9900) tuples 1 – 9900 1%(100) tuples 1000 – 100, 000 1% URS of R – Sample size = 100 result = sum of sample x scale factor(100) Quite likely sample would not include any tuple of value 1000 – estimated result = 100x100 = 10, 000 2 or more tuples of value 1000 gets included in sample – estimated result > 209, 800 Reasonable result only when we get exactly one tuple of value 1000 per sample – This event has probability 0.37. With probability of 0.63, we would get a large error in the estimate. 109, 900 Estimated result >> true sum 109, 900 9/20/2018 Ranjan Dash

Limitations of uniform random sampling – skewed data
Contribution of outliers to the error in estimating sum via uniform sampling - Relation R of size N having values {y1, y2, ……yN}. Let U be a uniform random sample of yi’s of size n. Then Unbiased estimator Actual sum Standard error Where S is the standard deviation of the values in the relation defined as If there are outliers in the data then S could be very large. For a given error bound we need to increase the sample size n. Theorem 1 Ye = (N/n) yi 9/20/2018 Ranjan Dash

Limitations of uniform random sampling – low selectivity
Selection query partitions relation into 2 sub-relation – Tuples that satisfy the condition – relevant sub-relation Tuples that do not satisfy the condition Number of tuples sampled from relevant sub-relation is proportional to its size. If this relevant sample size is small due to low selectivity of the query, it may lead to larger error Group by queries partition the relation into many sub-relations Success of uniform sampling depends on the size of the relevant sub-relation 9/20/2018 Ranjan Dash

Handling data skew – outlier indexes
Large variance in aggregate column due to presence of outliers or deviants lead to large error. Natural idea is to deal with it separately – outlier indexing Example: selection query (Q) with sum aggregate Separate the outliers into a separate sub-relation (RO) Apply the query to this sub-relation of outliers and get the true result of this query. Pick a uniform random sample from the non-outlier part of the relation (RNO) and estimate the approximate true result using theorem 1. Combine the above two to get overall estimate of query’s true result. 9/20/2018 Ranjan Dash

Handling data skew – outlier indexes – an example
Aggregate Query Q - Select sum (sales) from lineitem Preprocessing 1. Partition the relation into RO (outliers) and RNO (non-outliers) 2. Ro is lineitem_outlier a view 3. select a uniform random sample T from relation RNO and materialize it in a table called lineitem_samp Query processing 1. Aggregate outliers – sum(sales) from lineitem_outlier 2. Aggregate non-outliers – apply the query to sample T and extrapolate – sum(sales) for lineitem_samp x scale factor 3. Combine aggregates – approximate result from RNO + exact result for RO = approximate result for R 9/20/2018 Ranjan Dash

Handling data skew – outlier indexes – selection of outliers
Query error is solely due to error in estimating the aggregate from non-outliers Additional overhead of maintaining and accessing the outlier index – Ro contains at most τtuples Paper proposes an algorithm based on a theorem to decide the optimal size of outlier-index An optimal outlier-index Ro(R,C,τ) is defined as a sub-relation Ro c R such that |Ro| ≤ τ E(R\ Ro) = min R’C R ,|R’|≤ τ{E(R\R’)} Defines outlier index as an optimal sub-relation Ro that leads to the minimum possible sampling error, subject to the constraint that Ro has at most τ tuples in R Definition 1 9/20/2018 Ranjan Dash

Theorem 2 Consider a multiset R = {y1, y2,…yN} in sorted order. Ro C R such that |Ro| ≤ τ S(R\ Ro) = min R’C R ,|R’|≤ τ{S(R\R’)} There exists some 0 ≤ τ’ ≤ τ such that Ro = {yi|1 ≤ i≤ τ’ }U{yi|(N+ τ’ +1- τ) ≤i ≤N} States that the subset that minimizes the standard deviation over the remaining set consists of the leftmost τ’ elements and the right most τ- τ’ elements from the multiset R when elements are arranged in sorted order. 9/20/2018 Ranjan Dash

Alogirithm Outlier-Index(R,C, τ): Let the values in col C be sorted. For i = 1 to τ+1, compute E(i) = S({yi, yi+1,…yN- τ+i-1}) Let i’ be the value of i where E(i) is minimum. Outlier-index is the tuples that correspond to the set of values {yj|1 ≤ j≤ τ’ }U{yj|(N+ τ’ +1- τ) ≤j ≤N} where τ’ = i’ - 1 9/20/2018 Ranjan Dash

Handling data skew – outlier indexes – storage allocation
Given sufficient space to store m tuples, how do we allocate storage between samples and outlier-index inorder to minimize error? S(τ) is the SD in non-outliers for an optimal outlier-index of τ. If we allocate space in m such that τ tuples in the outlier and m- τ in the sample. Then error will be proportional to S(τ)/√(m- τ). Then the optimal allocation will be the value of τ for which S(τ)/√(m- τ) will be minimum. 9/20/2018 Ranjan Dash

Use of outlier-indexing – extension to other aggregates
Count aggregate: outlier-indexing not beneficial as there is no variance among the data values Avg aggregate: same as sum aggregate Extensible to class of aggregates that satisfy certain algebraic property Real valued function aggregate: like sum(price*quantity) Rank order dependent aggregates: like min, max etc – outlier not useful Extension to foreign-key join Outlier index and samples are computed over the relation having the aggregation column and then joined with other relations at query processing time. 9/20/2018 Ranjan Dash

Handling low selectivity and small groups
Problem of low selectivity queries and small groups in group-by queries. Use weighted sampling instead of uniform sampling with the help of workload information Sample more from the subsets that are small in size but are important having high usage Usage of a database is typically characterized by considerable locality in the access pattern i.e queries against the database access certain parts more than others. Tune the sample to a representative workload This technique uses pre-computed samples 9/20/2018 Ranjan Dash

Handling low selectivity and small groups – Exploit workload info
Workload collection – collect workload consisting of representative queries from query profilers and query logs Trace query patterns – set of selection condition Trace tuple usage – usage of specific tuples like frequency of access to each tuple tuple ti has weight wi if the tuple ti is required to answer wi of the queries in the workload. Weighted sampling – perform sampling taking into account weights of the tuples The weights are normalized 9/20/2018 Ranjan Dash

Handling low selectivity and small groups – Exploit workload info – an example
Use of weighted sample to answer aggregation query Let the weight of tuple ti be wi Normalized weight be wi’ = wi/ Σwi probability of acceptance of this tuple in the sample pi = n.wi’ where n is the sample size How to answer aggregation queries approximately? For each tuple included in the sample, store the corresponding pi. Each aggregate computed over this tuple gets multiplied by the inverse of pi (multiplication factor). This works well if Access pattern is local i.e. most queries access a small part of relation Workload is representative of actual queries 9/20/2018 Ranjan Dash

Implementation Outlier-indexing
Table: lineitem, Aggregation column: I_extendedprice CREATE VIEW I-extendedprice-otl-idx AS SELECT * from lineitem WHERE (I-extendedprice ) OR (I-extendedprice ) predicate is determined using the algorithm and storage allocated for the outlier-index. Module that automatically rewrites a, query to use the outlier-index and the sample rather than the fact table. Uniform and weighted sampling. Modify the execution tree generated by the SQL Server optimizer For uniform sampling accepts tuples with the specified probability (i.e., the sampling fraction) and stores the accepted tuples in a table. For weighted sampling, the probability of accepting a tuple is proportional to the weight associated with the tuple. Calculated exact weights for each tuple for a given workload and store in an additional column in fact table (lineitem) . Converted a SELECT query in the workload into the corresponding UPDATE statement that increments the weight of all the tuples satisfying the selection predicates. Automatically substitute the sample table for the fact table of an incoming query. 9/20/2018 Ranjan Dash

Implementation & Experimental Result
Goal is to compare performance of uniform sampling, weighted sampling and weighted sampling + outlier-indexing Platform. Microsoft SQL Server Dell Precision 610 system with a Pentium I11 Xeon 450 Mhz processor with 128 MB RAM and an external 23GB hard drive. Databases. TPC-R benchmark that supports data generation from a uniform distribution. Modified the TPC-R data generation program to generate data with varying degree of skew – using varying Zipfian parameter (z) The ratio of the maximum value to the minimum value of the aggregation column varied between 76 and 106. Workloads. generated using a random query generation program. The program generates queries with foreign key joins between tables, aggregations on the fact table (lineitem) grouping selection. The sum aggregation function is used. 9/20/2018 Ranjan Dash

Parameters (1) skew of the data (z): over 1, 1.5, 2, 2.5, and 3 (2) the sampling fraction (f) - from 1% to loo% (3) the storage for the outlier-index - 1%, 5%, I0%, and 20%. Error metric For each query compute the relative error by dividing the difference between the approximate estimate for the aggregate (sum) and the accurate value of the aggregate by the latter. For queries with a group-by clause: Consider a query that has k groups in the answer obtained from actually executing the query. Build a k dimensional vector where the ith dimension contains the relative error in the aggregate expression for that ith group. The error is the mean of all the points in the vector. Thus, the average relative error over all groups is reported. The error metric for a workload is average error over all queries in the workload. 9/20/2018 Ranjan Dash

Uniform sampling weighted sampling (WSAMP) weighted sampling + outlier-indexing (WSAMP+OTLIDX). 9/20/2018 Ranjan Dash

Experimental Result - Varying the data skew
Vary the data skew, while keeping the sampling fraction constant (5%). Weighted sampling does consistently better than uniform sampling across all data skews Weighted sampling + outlier-indexing performs significantly better than weighted sampling alone. 9/20/2018 Ranjan Dash

Experimental Result - Varying the sampling fraction
varied the sampling fraction with fixed data skew (z=2). weighted sampling gives lower error than uniform sampling across all sampling fractions. greatest benefit occurs at low sampling fractions (e.g., 1%). use of outlier-indexing in addition to weighted sampling improves accuracy significantly. 9/20/2018 Ranjan Dash

Experimental Result - Varying the selectivity of queries
vary selectivity of queries between 1 % and 100% with fixed data skew (z=2) and the sampling fraction (f=l%). weighted sampling better than uniform sampling at very low selectivity. when the workload references large portions of the data (e.g., at 100% selectivity) uniform and weighted sampling are not significantly different. Weighted sampling + outlier-indexing performs well across different selectivities. Its relative improvement over weighted sampling is smaller at lower selectivities. 9/20/2018 Ranjan Dash

Conclusion and Open Problems
Explored problems encountered when using uniform sampling as a means for approximate query answering. Observations: Skew in the aggregation attribute can lead to large errors - proposed outlier-indexing to improve the accuracy. Demonstrated that this technique improves accuracy significantly Problem of low selectivity of queries - Outlined approaches based on workload information. Combination of outlier-indexing and weighted sampling based on workload information has proved to be a significant step forward. Future works: Investigating the problem of building a single outlier-index for different aggregates and aggregate expressions. Tuning the selection of the outlier-index using the workload information is another interesting issue. Investigating extensions of this techniques to a wider class of join queries. 9/20/2018 Ranjan Dash

References [l] Acharya S., Gibbons P., and Poosala V. Congressional samples for approximate answering of group-by queries. In Proceedings of the ACM SIGMOD Conference, , 2000. [2] Acharya S., Gibbons P., Poosala V., and Ramaswamy S. Join synopses for approximate query answering. In Proceedings of the ACM SIGMOD Conference, 1999. [3] Bamett V. and Lewis T. Outliers in Statistical Data. John Wiley, 3rd edition, 1994. [4] Chatfield C. The Analysis of Time Series. Chapman and Hall, 1984. [5] Chaudhuri S., Motwani R., and Narasayya V. On random sampling over joins. In Proceedings of the ACM SIGMOD Conference, 1998. [6] Chaudhuri S. and Narasayya V. Program for TPC-D data generation with skew. ftp.research.microsoft.comlpub/users/viveknar/tpcdskew. [7] Cochran W. G. Sampling Techniques. John Wiley & Sons, New York, third edition, 1977. [8] Ganti V., Lee M. L., and Ramakrishnan R. ICICLES: selftuning samples for approximate query answering. In Proceedings of 26th lntemational Conference Very Large Data Bases, 2000. [9] Gibbons P., and Matias Y. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD Conference, , 1998. [IO] Haas P. and Hellerstein J. Ripple joins for online aggregation. In Proceedings of the ACM SIGMOD Conference, , 1999. [11] Hawkins D. Identification of Outliers. Chapman and Hall, london, 1980. 9/20/2018 Ranjan Dash

References [I2] Hellerstein J., Haas P., and Wang H. Online aggregation. In Proceedings of the ACM SIGMOD Conference, 1997. [I3] Hou W., Ozsoyoglu G., and Dogdu E. Error-constrained COUNT query evaluation in relational databases. In Proceedings of the ACM SIGMOD Conference, , 1991. [I4] Jagdish H., Koudas N. and Muthukrishnan S. Mining deviants in times series database. In Proceedings of 25th International Conference Very Large Data Bases, , 1999. [15] Knorr E. and Ng R. Algorithms for mining distance-based outliers in large datasets. In Proceedings of 24th International Conference Very Large Data Bases, , 1998. [I6] Ramaswamy S., Rastogi R., and Shim K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM SIGMOD Conference, , 2000. [I7] Manku G., Rajagopalan S., and Lindsay B. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Proceedings of the ACM SIGMOD Conference, , 1999. [ 18] Olken E Random sampling from databases - bibliography. olken/mendel/sampling/bibliography.html. [19] Seshadri P. and Swami A. Generalized partial indexes. In Proceedings of the I Ith International Conference on Data Engineering, , 1995. [20] Zipf G. E. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Inc, 1949. 9/20/2018 Ranjan Dash

Questions? 9/20/2018 Ranjan Dash

Overcoming Limitations of Sampling for Aggregation Queries

Similar presentations

Presentation on theme: "Overcoming Limitations of Sampling for Aggregation Queries"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Overcoming Limitations of Sampling for Aggregation Queries

Similar presentations

Presentation on theme: "Overcoming Limitations of Sampling for Aggregation Queries"— Presentation transcript:

Similar presentations

About project

Feedback