A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented By: Vivek Tanneeru.


A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented By: Vivek Tanneeru Venkata Dinesh Jammula

Outline
1. Introduction
2. Objective
3. Drawbacks of Previous Work
4. Related Work
5. Architecture for Approximate Query Processing
6. Classical Sampling Techniques
7. Special Case of a Fixed Workload
8. Lifting Workload to Query Distributions
9. Rationale for Stratified Sampling
10. Solution for Single-Table Selection Queries with Aggregation
11. Extensions for General Workloads
12. Comparisons
13. Experimental Results
14. Summary
15. References

1. Introduction Decision support applications such as OLAP and data mining analyze large databases. Answering aggregate queries approximately, yet accurately and efficiently, benefits the scalability of these applications. Workload information can guide the selection of samples of the data.

2. Objective Pre-compute a sample, posing sample selection as an optimization problem that minimizes the error in estimating aggregates. Implemented on Microsoft SQL Server 2000, as an effective solution that can be deployed in a commercial DBMS.

3. Drawbacks of Previous Work Lack of rigorous problem formulations leads to solutions that are difficult to evaluate theoretically. Previous work does not deal with uncertainty in the expected workload, and it ignores the variance in the data distribution of aggregated columns.

4. Related Work
Weighted Sampling
Outlier Indexing
Congressional Sampling
On-the-fly Sampling
Histograms

5. Architecture for Approximate Query Processing Preliminaries: Consider queries with selections, foreign-key joins, and GROUP BY, containing aggregation functions such as COUNT, SUM, and AVG. Assume a pre-designated amount of storage space is available for selecting samples from the database. Sample selection can be randomized or deterministic.

Architecture

Error Metrics
If the correct answer for query Q is y and the approximate answer is y′:
– Relative error: E(Q) = |y − y′| / y
– Squared error: SE(Q) = (|y − y′| / y)²
If the correct answer for the i-th group is y_i and the approximate answer is y_i′, the squared error in answering a GROUP BY query Q with g groups is:
– SE(Q) = (1/g) Σ_i ((y_i − y_i′) / y_i)²
Given a probability distribution p_w of queries:
– Mean squared error: MSE(p_w) = Σ_Q p_w(Q) · SE(Q), where p_w(Q) is the probability of query Q
– Root mean squared error (L2): RMSE(p_w) = √MSE(p_w)
Other error metrics:
– L1 metric: the expected relative error over all queries in the workload
– L∞ metric: the maximum error over all queries
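As a quick sketch (the function names and workload representation are ours, not the paper's), these metrics can be written directly in Python, assuming a workload given as (probability, correct answer, approximate answer) triples:

```python
# Illustrative sketch only; names are ours, not the paper's.
def relative_error(y, y_approx):
    # E(Q) = |y - y'| / y
    return abs(y - y_approx) / y

def squared_error(y, y_approx):
    # SE(Q) = (|y - y'| / y)^2
    return relative_error(y, y_approx) ** 2

def groupby_squared_error(ys, ys_approx):
    # SE(Q) = (1/g) * sum_i ((y_i - y_i') / y_i)^2 over g groups
    g = len(ys)
    return sum(((y - ya) / y) ** 2 for y, ya in zip(ys, ys_approx)) / g

def mse(workload):
    # MSE(p_w) = sum_Q p_w(Q) * SE(Q); workload = [(p, y, y'), ...]
    return sum(p * squared_error(y, ya) for p, y, ya in workload)

def rmse(workload):
    # RMSE(p_w) = sqrt(MSE(p_w))
    return mse(workload) ** 0.5
```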

6. Classical Sampling Techniques Uniform Sampling: LEMMA 1. (a) μ is an unbiased estimator for y, namely E[μ] = y; (b) μ·n is an unbiased estimator for Y, namely E[μ·n] = Y; (c) the variance (i.e., squared error) in estimating y is E[(μ − y)²] = S²/k; (d) the variance in estimating Y is E[(μ·n − Y)²] = n²S²/k; and (e) the relative squared error in estimating Y is E[((μ·n − Y)/Y)²] = n²S²/(Y²k).
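A minimal illustration of Lemma 1 (the function name and setup are our assumptions, not the paper's code): scaling the mean of a uniform random sample by n gives an unbiased estimate of the population sum Y.

```python
import random

# Sketch of Lemma 1's estimator (names ours).
def uniform_sum_estimate(population, k, rng):
    """Estimate the population sum Y from a uniform sample of size k."""
    sample = rng.sample(population, k)
    mu = sum(sample) / k            # sample mean: unbiased estimator of y
    return mu * len(population)     # mu * n: unbiased estimator of Y
```

Averaged over many repetitions, the estimate approaches the true sum Y, illustrating E[μ·n] = Y; any single estimate, however, has variance n²S²/k.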

Classical Sampling Techniques Stratified Sampling: LEMMA 2. (a) μ is an unbiased estimator for y, namely E[μ] = y; (b) μ·n is an unbiased estimator for Y, namely E[μ·n] = Y; (c) the variance in estimating y is E[(μ − y)²] = (1/n²) Σ_j n_j² S_j² / k_j; (d) the variance in estimating Y is E[(μ·n − Y)²] = Σ_j n_j² S_j² / k_j; and (e) the relative squared error in estimating Y is E[((μ·n − Y)/Y)²] = (1/Y²) Σ_j n_j² S_j² / k_j.
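The stratified estimator of Lemma 2 can be sketched as follows (names ours): sample k_j records uniformly from each stratum and combine the per-stratum sample means, weighted by the stratum sizes n_j.

```python
import random

# Sketch of Lemma 2's estimator for the population sum Y (names ours).
def stratified_sum_estimate(strata, ks, rng):
    total = 0.0
    for stratum, k_j in zip(strata, ks):
        sample = rng.sample(stratum, k_j)
        total += (sum(sample) / k_j) * len(stratum)  # n_j * sample mean of stratum j
    return total
```

When every stratum is internally constant (S_j = 0), the variance Σ_j n_j² S_j² / k_j vanishes and the estimate is exact no matter which records are sampled.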

Classical Sampling Techniques Neyman Allocation: LEMMA 3. Given a population R = {y_1, …, y_n}, k, and r, the optimal way to form r strata and allocate k samples among them is to first sort R and select strata boundaries so that Σ_j n_j S_j is minimized, and then, for the j-th stratum, to set the number of samples k_j as k_j = k · (n_j S_j / Σ_i n_i S_i).
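The allocation rule of Lemma 3 is straightforward to compute; a hedged sketch (the function name is ours), allocating k samples proportionally to n_j · S_j:

```python
# Sketch of the Neyman allocation rule (name ours): given stratum sizes n_j
# and standard deviations S_j, allocate k samples proportionally to n_j * S_j.
def neyman_allocation(sizes, stddevs, k):
    weights = [n_j * s_j for n_j, s_j in zip(sizes, stddevs)]
    total = sum(weights)
    return [k * w / total for w in weights]
```

The result is fractional; rounding it to integer sample counts is one of the pragmatic issues addressed later.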

Classical Sampling Techniques
Multivariate Stratified Sampling
Weighted Sampling
Error Estimation and Confidence Intervals

7. Special Case: Fixed Workload Problem: FIXEDSAMP. Input: R, W, k. Output: a sample of k records (with appropriate additional columns) such that MSE(W) is minimized.

Fundamental Regions Fundamental regions: for a given relation R and workload W, consider partitioning the records in R into the minimum number of regions R_1, R_2, …, R_r such that for any region R_j, each query in W selects either all records in R_j or none of them.
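Assuming queries are modeled as predicates over records (our representation, not the paper's), a fundamental region is simply a group of records that share the same selection signature across the workload:

```python
# Our representation: each query is a predicate over records. Records selected
# by exactly the same subset of queries form one fundamental region.
def fundamental_regions(records, queries):
    regions = {}
    for rec in records:
        signature = tuple(bool(q(rec)) for q in queries)  # which queries select rec
        regions.setdefault(signature, []).append(rec)
    return list(regions.values())
```

For example, two overlapping selection predicates over the integers 1..10 induce four regions, one per combination of predicates satisfied.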

Solution for FIXEDSAMP
Step 1. Identify fundamental regions. Case A: r ≤ k. Case B: r > k.
Step 2. Pick sample records.
Step 3. Assign values to additional columns.

8. Lifting Workload to Query Distributions Goal: be resilient when an incoming query is "similar" but not identical to queries in the workload. p_w: the lifted workload, a probability distribution over queries. p_w(Q′) is related to the amount of similarity of Q′ to the workload. We are not concerned with syntactic similarity of query expressions.

Lifted Workload (cont.) Two parameters δ (½ ≤ δ ≤ 1) and γ (0 ≤ γ ≤ ½) define the degree to which the workload "influences" the query distribution. For any given record inside (resp. outside) R_Q, the parameter δ (resp. γ) represents the probability that an incoming query will select this record. p_{Q}(R′) is the probability of occurrence of any query that selects exactly the set of records R′.

Lifted Workload (cont.) n_1, n_2, n_3, and n_4 are the counts of records in the four regions induced by R_Q and R′. When n_2 or n_4 is large (large overlap), p_{Q}(R′) is high; when n_1 or n_3 is large (small overlap), p_{Q}(R′) is low. We elaborate on this issue by analyzing the effects of four boundary settings of these parameters:
1. δ → 1 and γ → 0 implies that incoming queries are identical to workload queries.
2. δ → 1 and γ → ½ implies that incoming queries are supersets of workload queries.
3. δ → ½ and γ → 0 implies that incoming queries are subsets of workload queries.
4. δ → ½ and γ → ½ implies that incoming queries are unrestricted.
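Under the independence model just described, the probability of a query selecting exactly R′ factors over the four record classes. The mapping of n_1…n_4 below is our reading of the slide, not stated explicitly on it: n_2 = records in both R_Q and R′, n_1 = in R_Q only, n_3 = in R′ only, n_4 = in neither.

```python
# Sketch of p_{Q}(R') under the independence model (mapping of n1..n4 is our
# assumption): a record inside R_Q is selected with probability delta, a
# record outside R_Q with probability gamma, independently per record.
def lifted_probability(n1, n2, n3, n4, delta, gamma):
    return (delta ** n2) * ((1 - delta) ** n1) * (gamma ** n3) * ((1 - gamma) ** n4)
```

In the boundary setting δ → 1, γ → 0, only the query identical to Q (n_1 = n_3 = 0) has non-zero probability, matching case 1 above.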

9. Rationale for Stratified Sampling Consider a population, i.e., a set of numbers R = {y_1, …, y_n}. Let the average be y, the sum be Y, and the variance be S². Suppose we uniformly sample k numbers and let the mean of the sample be μ. The quantity μ is an unbiased estimator for y, i.e., E[μ] = y, and the variance (i.e., squared error) in estimating y is E[(μ − y)²] = S²/k.

Stratified Sampling (cont.) Example: over a table R with columns ProductID and Revenue, consider query Q1: SELECT COUNT(*) FROM R WHERE ProductID IN (3, 4). Its population is POP_Q1 = {0, 0, 1, 1} (one 0/1 value per record, indicating whether the query selects it). Thus, a stratified sampling scheme partitions R into r strata containing n_1, …, n_r records (where Σ n_j = n), with k_1, …, k_r records uniformly sampled from each stratum (where Σ k_j = k).

10. Solution for Single-Table Selection Queries with Aggregation
Stratification: (a) how many strata r to partition relation R into; (b) which records of R belong to each stratum.
Allocation: how to divide k (the number of records available for the sample) into integers k_1, …, k_r across the r strata such that Σ k_j = k.
Sampling: uniformly sample k_j records from stratum R_j to form the final sample of k records.

Solution for COUNT Aggregate Stratification: from Lemma 1. Lemma 1: for a workload W consisting of COUNT queries, the fundamental regions represent an optimal stratification. Allocation: we want to minimize the error over queries in p_w. k_1, …, k_r are unknown variables such that Σ k_j = k. From Equation (2) on an earlier slide, MSE(p_w) can be expressed as a weighted sum of the MSE of each query in the workload. Lemma 2: MSE(p_w) = Σ_i w_i MSE(p_{Q_i}).

Allocation (cont.) For any Q ∈ W, we express MSE(p_{Q}) as a function of the k_j's. Lemma 3 gives, for a COUNT query Q in W, a closed-form expression ApproxMSE(p_{Q}) such that MSE(p_{Q}) ≈ ApproxMSE(p_{Q}).

Outline of Proof
Since we have an (approximate) formula for MSE(p_{Q}), we can express MSE(p_w) as a function of the variables k_j.
Corollary 1: MSE(p_w) = Σ_j (α_j / k_j), where each α_j is a function of n_1, …, n_r, δ, and γ. α_j captures the "importance" of a region; it is positively correlated with n_j as well as with the frequency of queries in the workload that access R_j.
Now we can minimize MSE(p_w). Lemma 4: Σ_j (α_j / k_j) is minimized subject to Σ_j k_j = k when k_j = k · (√α_j / Σ_i √α_i).
This provides a closed-form and computationally inexpensive solution to the allocation problem, since α_j depends only on δ, γ, and the number of tuples in each fundamental region.
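Lemma 4's closed form is easy to evaluate; a minimal sketch (function name is ours), taking k_j proportional to √α_j:

```python
from math import sqrt

# Sketch of Lemma 4 (name ours): minimize sum_j alpha_j / k_j subject to
# sum_j k_j = k by allocating k proportionally to sqrt(alpha_j).
def optimal_allocation(alphas, k):
    roots = [sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]
```

For example, with α = [1, 4] and k = 30 the rule gives [10, 20], whose objective 1/10 + 4/20 = 0.3 beats the equal split's 1/15 + 4/15 ≈ 0.33.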

Solution for SUM Aggregate Stratification (bucketing technique): we further divide fundamental regions with large variance into a set of finer regions, each of which has significantly lower internal variance, and treat each finer region as a stratum. The optimal Neyman allocation is then applied over the resulting h·r finer strata. It would be good to have a large h, but h is set to 6 in practice.
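The exact bucket-boundary rule is not on the slide; as an illustrative stand-in (names ours), one can sort a region's aggregate values and cut them into h contiguous, near-equal buckets, so that each bucket spans a narrow value range and has low internal variance:

```python
# Illustrative stand-in for the bucketing step (names ours; the paper's exact
# splitting rule is not reproduced on the slide): sort a region's aggregate
# values and cut them into h contiguous, near-equal buckets.
def bucketize(region_values, h=6):
    ordered = sorted(region_values)
    n = len(ordered)
    buckets = []
    for i in range(h):
        lo, hi = i * n // h, (i + 1) * n // h
        if lo < hi:                 # skip empty slices when n < h
            buckets.append(ordered[lo:hi])
    return buckets
```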

Solution for SUM Aggregate (cont.) Allocation: as for COUNT, we express an optimization problem, now with h·r unknowns k_1, …, k_{h·r}. Unlike COUNT, the specific values of the aggregate column in each region (as well as the variance of values in each region) influence MSE(p_{Q}). Let y_j (resp. Y_j) be the average (resp. sum) of the aggregate-column values of all records in region R_j. Since the variance within each region is small, each value within the region can be approximated as simply y_j, which lets us express MSE(p_{Q}) as a function of the k_j's for a SUM query Q in W.

Pragmatic Issues
Identifying Fundamental Regions
Handling Large Numbers of Fundamental Regions
Obtaining Integer Solutions
Obtaining an Unbiased Estimator

Putting It All Together

11. Extensions
GROUP BY
JOIN
Other Extensions

12. Comparisons
Weighted Sampling: records that are accessed more frequently have a greater chance of being included in the sample; assumes a fixed workload.
Outlier Indexing: outliers form their own stratum, which is sampled in its entirety; assumes a fixed workload.

Comparisons (cont.) Congressional Sampling: allocation of samples between two strata, chosen so as to minimize the MSE.

13. Experimental Results
Previous approaches compared:
USAMP – uniform random sampling
WSAMP – weighted sampling
OTLIDX – outlier indexing combined with weighted sampling
CONG – congressional sampling

Experimental Setup
Databases: the popular TPC-R benchmark.
Workloads: several workloads generated over the TPC-R schema using an automatic query generation program.
Parameters varied: skew of the data; sampling fraction between 0.1% and 10%; workload size (number of queries).
Error metric: the average error over all queries in the workload.

Training Set vs. Test Set The basic idea is to split the available workload into two sets: the training workload and the test workload. Training set: the workload used to determine the sample. Test set: the workload used to estimate the error.

Results: Quality vs. Sampling Fraction

Results: Quality vs. Sampling Fraction (cont.)

Quality vs Overlap between Training Set and Test Set

Quality vs Data Skew

Quality vs Data Skew (cont.)

14. Summary A comprehensive solution to the problem of identifying samples for approximately answering aggregation queries, together with its implementation on a database system. With a novel technique for lifting a workload, the solution is robust enough to work well even for workloads that are similar, but not identical, to the given workload. It handles data variance, heterogeneous mixes of queries, GROUP BY, and foreign-key joins.

15. References
Surajit Chaudhuri, Gautam Das, Vivek Narasayya: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. SIGMOD Conference 2001.
Surajit Chaudhuri, Gautam Das, Vivek Narasayya: Optimized Stratified Sampling for Approximate Query Processing. ACM Transactions on Database Systems (TODS), 32(2): 9 (2007).

Thank You. Questions? Presented by: Vivek Tanneeru, Venkata Jammula.