CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,


CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES  Swarup Acharya, Phillip Gibbons, Viswanath Poosala (Information Sciences Research Center, Bell Labs, New Jersey)  Divya Rao

Outline  Introduction  Background  Aqua System  Problem Formulation  Solutions  Query Rewriting strategies  Experiment  Conclusion

Introduction  Group-by queries: the most important class of queries in decision support systems.  Congressional samples: a hybrid union of uniform and biased samples.  The goal is to propose techniques for obtaining fast, highly accurate answers to group-by queries.

Background  Uniform random sampling is not effective for group-by queries.  Ex: A group-by query on the US Census database to determine the per-capita income of every state.  Group sizes differ hugely: California is about 70 times more populous than Wyoming.  The accuracy of a group's answer depends heavily on the number of sample tuples that belong to that group, so groups with few tuples get much less accurate answers than the larger ones.
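To make the skew concrete, here is a minimal sketch of how many sample tuples each group can expect under uniform sampling. The group names, sizes, and sample size are hypothetical and chosen only to reproduce a 70:1 ratio; they are not figures from the census data.

```python
def expected_uniform_counts(group_sizes, sample_size):
    """Expected number of sampled tuples per group under uniform sampling.

    Each tuple is equally likely to be picked, so a group's expected share
    of the sample equals its share of the table.
    """
    total = sum(group_sizes.values())
    return {g: sample_size * n / total for g, n in group_sizes.items()}


# Hypothetical table: one group is 70 times larger than the other.
group_sizes = {"large_state": 700_000, "small_state": 10_000}
counts = expected_uniform_counts(group_sizes, sample_size=7_100)

for group, count in counts.items():
    print(f"{group}: about {count:.0f} sample tuples")
```

With a 1% sample the small group is left with roughly 100 tuples while the large one has about 7,000, which is why the small group's estimate is so much noisier.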

Background  Uniform random sampling is appropriate only when the utility of the data to the user mirrors the data distribution  Multi-table query: different data may have equal representation while their utility to the user is skewed  Ex: Data warehouses, where the usefulness of data degrades with time  The sample should therefore draw more tuples from the recent data, which a uniform random sample over the entire warehouse cannot achieve.

Biased Sampling  Use precomputed biased samples to address the shortcomings of uniform (unbiased) sampling  Advantages of using precomputed biased samples:  Queries can be answered without accessing the original data at query run time  Storing the sampled tuples in disk blocks avoids the overhead of random scanning  Disadvantage: a precomputed sample must be committed to before the query is seen, so it is not suitable for user-controlled progressive refinement.

Aqua System  Aqua is an efficient decision support system providing approximate answers to queries

Aqua System(Contd.)  Aqua is a middleware tool that can sit atop any DBMS managing a data warehouse  Aqua maintains statistical summaries of the data, called synopses, and uses them to answer queries  The Aqua system provides probabilistic error/confidence bounds on the answer

Aqua System(Contd.)

Problem Formulation  The main aim is to provide accurate answers to group-by queries in an approximate query answering system  Let ci and ci' be the exact and approximate aggregate values in the group gi. The error ε is the percentage relative error in the estimation of ci: ε = (ci – ci') / ci × 100
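The error measure above can be sketched as a small helper. The function name and the sample values are illustrative; taking the magnitude of the difference is an assumption made here so that over- and under-estimates are treated alike.

```python
def percent_relative_error(exact, approx):
    """Percentage relative error of an approximate aggregate value.

    Assumption: the magnitude is reported, so the sign of the
    difference (ci - ci') is dropped.
    """
    if exact == 0:
        raise ValueError("relative error is undefined when the exact value is 0")
    return abs(exact - approx) * 100 / abs(exact)


# Hypothetical exact and approximate per-group aggregates.
exact_values = {"g1": 200.0, "g2": 50.0}
approx_values = {"g1": 190.0, "g2": 60.0}

errors = {
    g: percent_relative_error(exact_values[g], approx_values[g])
    for g in exact_values
}
print(errors)
```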

Solutions  Theorem: Divide the sample space X equally among the groups and take a uniform random sample within each group.  Map this theorem to various classes of group-by queries with arbitrary mixes of groupings.  Ex: US Congress: House and Senate

House  The House has representatives from each state in proportion to the state's population  Applying the theorem to the House we have,  For an aggregate operation, the quality of approximate answers increases with the query selectivity  Answers to queries with the same aggregate and equal selectivities will typically have similar quality guarantees.
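A minimal sketch of a House-style allocation follows, assuming it simply splits the sample space X among groups in proportion to their sizes, as the analogy with House seats suggests. The function name and group sizes are hypothetical.

```python
def house_allocation(group_sizes, sample_space):
    """Split the sample space among groups in proportion to group size."""
    total = sum(group_sizes.values())
    return {g: sample_space * n / total for g, n in group_sizes.items()}


# Hypothetical groups: one large, one small.
group_sizes = {"large": 900, "small": 100}
house = house_allocation(group_sizes, sample_space=100)
print(house)
```

This matches what a uniform random sample over the whole table gives in expectation, so the small group ends up with only a tenth of the space.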

Senate  The Senate has an equal number of representatives from each state  Applying the theorem to the Senate we have,  Each group in the sample will have at least as many sample points as any other group in the entire sample
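A Senate-style allocation can be sketched in the same way, assuming every group receives an equal share of the sample space X regardless of its size. The function name and group sizes are hypothetical, and the sketch ignores the case where a group holds fewer tuples than its share.

```python
def senate_allocation(group_sizes, sample_space):
    """Give every group the same share of the sample space."""
    share = sample_space / len(group_sizes)
    return {g: share for g in group_sizes}


# Hypothetical groups: one large, one small.
group_sizes = {"large": 900, "small": 100}
senate = senate_allocation(group_sizes, sample_space=100)
print(senate)
```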

Problems with House and Senate  Using samples from the House would result in very few sample points for the smaller groups  The Senate allocates fewer tuples to the larger groups than the House does.  Hence another technique, called "Congress": collect both the House and the Senate samples

Basic Congress  Apply the theorem to a mix of aggregate queries with group-bys on a set of attributes and queries with no group-bys at all.  Collecting both the House and the Senate samples would double the sample space  Basic Congress reduces this factor of 2
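One way to combine the two allocations without doubling the space is sketched below. It assumes that each group takes the larger of its House and Senate shares and that the result is then scaled down uniformly to fit the sample space X; this rule is not spelled out on the slide and is stated here as an assumption, with hypothetical names and sizes.

```python
def basic_congress_allocation(group_sizes, sample_space):
    """Per group, take max(House share, Senate share), then rescale to X.

    Assumption: the max-then-rescale rule is illustrative and is not
    given on the slide itself.
    """
    total = sum(group_sizes.values())
    senate_share = sample_space / len(group_sizes)
    raw = {
        g: max(sample_space * n / total, senate_share)
        for g, n in group_sizes.items()
    }
    scale = sample_space / sum(raw.values())
    return {g: share * scale for g, share in raw.items()}


# Hypothetical groups: one large, one small.
group_sizes = {"large": 900, "small": 100}
congress = basic_congress_allocation(group_sizes, sample_space=100)
print(congress)
```

In this example the small group gets far more space than under the House, and the large group gets more than under the Senate, while the total stays at X.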

Congress For the sample space X, the final sample size allocated to each group is given by, Where the expected sample space allocated to g is

Query rewriting  Scaling up the aggregate expressions  Deriving error bounds on the estimate  Generating unbiased answers using the tuples in the biased sample  The scale factor of a tuple is the inverse of its sampling rate
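The scaling step can be illustrated with a small sketch. It assumes a group's sampling rate is the number of sampled tuples divided by the group's size; the function names and values are hypothetical.

```python
def scale_factor(group_size, sampled_count):
    """Inverse of the sampling rate (sampled_count / group_size)."""
    return group_size / sampled_count


def estimate_sum(sample_values, group_size):
    """Scale the sum over the sampled tuples up to the whole group."""
    return scale_factor(group_size, len(sample_values)) * sum(sample_values)


# Hypothetical group of 300 tuples from which 3 were sampled.
sample_values = [5.0, 7.0, 6.0]
estimate = estimate_sum(sample_values, group_size=300)
print(estimate)
```

Because groups in a biased sample are sampled at different rates, each group needs its own scale factor; a single global factor would over-weight the groups that were sampled more heavily.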

Rewriting Strategies  The key step in scaling is to efficiently associate each tuple with its corresponding scale factor  a) Store the scale factor with each tuple: i) Integrated rewriting, ii) Nested-integrated rewriting  b) Use a separate table to store the scale factors: iii) Normalized rewriting, iv) Key-normalized rewriting
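The difference between the strategies can be sketched in plain Python rather than SQL. The sketch assumes every tuple in a group shares one scale factor; the row layouts, names, and values are hypothetical, and key-normalized rewriting is omitted.

```python
from collections import defaultdict

# a) Scale factor stored with each tuple: (group, value, scale_factor).
integrated_rows = [("a", 5.0, 100.0), ("a", 7.0, 100.0), ("b", 3.0, 2.0)]

# b) Scale factors kept in a separate table keyed by group.
normalized_rows = [("a", 5.0), ("a", 7.0), ("b", 3.0)]
scale_table = {"a": 100.0, "b": 2.0}


def integrated_sum(rows):
    """Multiply every tuple by its stored scale factor."""
    totals = defaultdict(float)
    for group, value, sf in rows:
        totals[group] += value * sf
    return dict(totals)


def nested_integrated_sum(rows):
    """Aggregate per group first, then multiply once per group."""
    inner, sf_of = defaultdict(float), {}
    for group, value, sf in rows:
        inner[group] += value
        sf_of[group] = sf
    return {g: inner[g] * sf_of[g] for g in inner}


def normalized_sum(rows, table):
    """Look up each tuple's scale factor in the separate table (a join)."""
    totals = defaultdict(float)
    for group, value in rows:
        totals[group] += value * table[group]
    return dict(totals)


print(integrated_sum(integrated_rows))
print(nested_integrated_sum(integrated_rows))
print(normalized_sum(normalized_rows, scale_table))
```

All three produce the same estimates; they differ in where the scale factor lives and in how many multiplications or lookups the query needs.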

Experiments  Experimental testbed: Aqua system with Oracle v7 as the back-end DBMS

Experimental parameters:
Parameter                 Range of values    Default value
Table size (T)            100k-6M tuples     1M
Sample percentage (SP)    1%-75%             7%
Num. groups               10-200k            1000
Group-size skew (z)

Experiment(Contd.)  Study to identify a scheme that can provide consistently good performance

Performance of various allocation strategies  Performance on different query sets:  Queries with no group-bys: House performs well  Queries with three group-bys: Senate has low errors  Queries with two group-bys: both Senate and House perform poorly  Congress performs best or close to best for queries of all types; the other techniques perform well only in a limited part of the spectrum

Performance at different sample sizes: the errors in Congress drop as the sample space increases

Performance of group count:  Integrated and Nested-integrated perform better than Normalized and Key-normalized due to the absence of a join operation  Nested-integrated performs better than Integrated due to significantly fewer multiplications.

Conclusions  Demonstrated that uniform samples are not enough to accurately answer all group-by queries  Proposed new techniques based on biased sampling  Introduced congressional samples and validated the sampling strategies experimentally, both for the accuracy of their estimates for group-by queries and for their execution efficiency  All the techniques have been incorporated into the Aqua system.

Questions??