CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES Swaroop Acharya,Philip B Gibbons, VishwanathPoosala By Agasthya Padisala Anusha Reddy.

Slides:



Advertisements
Similar presentations
Raghavendra Madala. Introduction Icicles Icicle Maintenance Icicle-Based Estimators Quality Guarantee Performance Evaluation Conclusion 2 ICICLES: Self-tuning.
Advertisements

Introduction Simple Random Sampling Stratified Random Sampling
Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Kaushik Chakrabarti(Univ Of Illinois) Minos Garofalakis(Bell Labs) Rajeev Rastogi(Bell Labs) Kyuseok Shim(KAIST and AITrc) Presented at 26 th VLDB Conference,
CS4432: Database Systems II
A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHARI RAJEEV MOTWANI VIVEK NARASAYYA PRESENTED BY, JEEVAN KUMAR GOGINENI SARANYA GOTTIPATI.
Fast Algorithms For Hierarchical Range Histogram Constructions
General Statistics Ch En 475 Unit Operations. Quantifying variables (i.e. answering a question with a number) 1. Directly measure the variable. - referred.
Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes Xiaohui Yu 1, Calisto Zuzarte 2, Ken Sevcik 1 1 University of Toronto.
Brian Babcock Surajit Chaudhuri Gautam Das at the 2003 ACM SIGMOD International Conference By Shashank Kamble Gnanoba.
Harikrishnan Karunakaran Sulabha Balan CSE  Introduction  Icicles  Icicle Maintenance  Icicle-Based Estimators  Quality & Performance  Conclusion.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
Population Sampling in Research PE 357. Participants? The research question will dictate the type of participants selected for the study Also need to.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries By : Surajid Chaudhuri Gautam Das Vivek Narasayya Presented by :Sayed.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
Chapter 7 Introduction to Sampling Distributions
Chapter 7 Sampling Distributions
Deterministic Wavelet Thresholding for Maximum-Error Metrics Minos Garofalakis Bell Laboratories Lucent Technologies 600 Mountain Avenue Murray Hill, NJ.
Resampling techniques Why resampling? Jacknife Cross-validation Bootstrap Examples of application of bootstrap.
Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz.
1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.
The Excel NORMDIST Function Computes the cumulative probability to the value X Business Statistics: A First Course, 5e © 2009 Prentice-Hall, Inc
Sampling Designs Avery and Burkhart, Chapter 3 Source: J. Hollenbeck.
STRATIFIED SAMPLING DEFINITION Strata: groups of members that share common characteristics Stratified sampling: the population is divided into subpopulations.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Determining Sample Size
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
Estimation Using a Single Sample
Chap 20-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 20 Sampling: Additional Topics in Sampling Statistics for Business.
Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Ronda Hilton.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy.
Copyright 2010, The World Bank Group. All Rights Reserved. Managing processes Core business of the NSO Part 2 Strengthening Statistics Produced in Collaboration.
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
© 2013 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin Chapter 7 Sampling Distributions.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.
Stratified K-means Clustering Over A Deep Web Data Source Tantan Liu, Gagan Agrawal Dept. of Computer Science & Engineering Ohio State University Aug.
Understanding Sampling
New Sampling-Based Estimators for OLAP Queries Ruoming Jin, Kent State University Leo Glimcher, The Ohio State University Chris Jermaine, University of.
Join Synopses for Approximate Query Answering Swarup Acharya, Philip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy By Vladimir Gamaley.
Load Shedding Techniques for Data Stream Systems Brian Babcock Mayur Datar Rajeev Motwani Stanford University.
Basic Business Statistics, 10e © 2006 Prentice-Hall, Inc.. Chap 7-1 Chapter 7 Sampling Distributions Basic Business Statistics.
Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha ( ) Deepak Anand ( ) By:
Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Z. Joseph, CSE-UT Arlington.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
Presented By Anirban Maiti Chandrashekar Vijayarenu
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented By: Vivek Tanneeru.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 7-1 Chapter 7 Sampling and Sampling Distributions Basic Business Statistics 11 th Edition.
1 Mean Analysis. 2 Introduction l If we use sample mean (the mean of the sample) to approximate the population mean (the mean of the population), errors.
Basic Business Statistics
HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries Xiaohui Yu University of Toronto Joint work with Nick Koudas.
1 A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Proceedings of the.
Chapter 7 Introduction to Sampling Distributions Business Statistics: QMIS 220, by Dr. M. Zainal.
University of Texas at Arlington Presented By Srikanth Vadada Fall CSE rd Sep 2010 Dynamic Sample Selection for Approximate Query Processing.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan Shruti P. Gopinath CSE 6339.
Chapter 6 Inferences Based on a Single Sample: Estimation with Confidence Intervals Slides for Optional Sections Section 7.5 Finite Population Correction.
Parallel Databases.
A paper on Join Synopses for Approximate Query Answering
Anthony Okorodudu CSE ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan.
Associative Query Answering via Query Feature Similarity
ICICLES: Self-tuning Samples for Approximate Query Answering
Load Shedding Techniques for Data Stream Systems
AQUA: Approximate Query Answering
Stratified Sampling for Data Mining on the Deep Web
Presented by: Mariam John CSE /14/2006
Presentation transcript:

CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES Swaroop Acharya,Philip B Gibbons, VishwanathPoosala By Agasthya Padisala Anusha Reddy Rachapalli Muni

OUTLINE: 1.INTRODUCTION 2.AQUA SYSTEM 3.REQUIREMENTS ON GROUP BY ANSWERS 4.SOLUTIONS 5.REWRITING 6.EXPERIMENTS 7.EXTENSIONS 8.RELATED WORK 9.CONCLUSION

INTRODUCTION:  LIMITATIONS OF UNIFORM SAMPLING:  Not suitable for group by query.  For Example, Group by query on the U.S.census database could be used to determine the per capita income per state.  There can be large discrepancy in size of groups. The size of one state can be more than the size of other state, e.g., the state of California has nearly 70 times the population of Wyoming.  Uniform random sample of the relation will contain disproportionately fewer tuples from the smaller groups (states), which leads to poor accuracy for reliable answers of a group.

 BIASED SAMPLING FOR GROUP BYQUERIES:  In order to get an unbiased answer for groupby queries we use biased sample.  Briefly, our techniques involve taking group-sizes into consideration while sampling, in order to provide highly-accurate answers  The techniques in this paper are tailored to precomputedor materialized samples, such as used in Aqua.

AQUA SYSTEM :  Aqua maintains smaller-sized statistical summaries of the data called synopses, and uses them to answer queries.  A key feature of Aqua is that the system provides probabilistic error/confidence bounds on the answer, based on the Hoeffding and Chebyshev formulas.

REQUIREMENTS ON GROUP BY ANSWERS  The user has two requirements on the approximate answer to a group-by query. 1. The approximate answer should contain all the groups that appear in the exact answer. 2. The estimated answer for every group should be close to the exact answer for that group.

SOLUTIONS:  Congressional samples are hybrid union of uniform and biased samples.  The strategy adopted is to divide the available sample space X equally among the g groups, and take a uniform random sample within each group.

 Consider US Congress which is hybrid of House and Senate.  House has representative from each state in proportion to its population. So, it represents a uniform random sampling for entire relation.  Senate has equal number of representative from each state. So, it represents a sample having an equal number of tuples for each group.

 BASIC CONGRESS: Let X be the sample size then, the final sample size allocated to group g is given by :

 CONGRESS: Let X be the sample size then, the final sample size allocated to group g is given by :

REWRITING :  Query rewriting involves two key steps: a) scaling up the aggregate expressions and b) deriving error bounds on the estimate.  For each tuple, let its scale factor Scale Factor be the inverse sampling rate for its strata.  There are two approaches to doing this: a) store the Scale Factor(SF) with each tuple in sample relation- Integrated b) use a separate table to store the Scale Factors for the groups- Normalized, Key-normalized, Nested-integrated

EXPERIMENTS:  Testbed: On Aqua with Oracle(v7) as the backend DBMS  Accuracy of Sample allocation strategies: Performance of different query sets (queries with no Group-bys,Three Group-bys,Two Group-bys) are given below:

 Effect of Sample Size:  Error is inversely proportional to sample size. Congress –error drops rapidly with increasing sample size and provide high accuracy even for arbitrary group-bys

Performance of Rewriting strategies:

EXTENSIONS:  Generalization to multiple criteria. The congressional samples framework can be even extended to support grouping attributes with weight vectors.  Generalization to Other Queries This can be achieved by replacing the values in grouping column.

RELATED WORK :  Online Aggregation scheme  Histograms  Wavelets  Stratified Sampling

CONCLUSION :  congressional samples will minimize errors over queries on a set of possible group-by relations.  New strategies were validated to produce accurate estimates to group-by queries and has good execution efficiency.

REFERENCES  &coll=ACM&dl=ACM&CFID = &CFTOKEN= &coll=ACM&dl=ACM&CFID = &CFTOKEN=  mary?doi= mary?doi=

QUESTIONS???