Congressional Samples for Approximate Answering of Group-By Queries. Swarup Acharya, Phillip B. Gibbons, Viswanath. (February 14, 2006, CS6392 - DB Exploration)

Presentation transcript:

Slide 1: Congressional Samples for Approximate Answering of Group-By Queries
Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala
Presented by: Daniel Kuang
February 14, 2006, CS6392 - DB Exploration

Slide 2: Outline
Problems with Group-By queries
Congressional sampling
Rewriting
Performance
Conclusion

Slide 3: Problems with Group-By Queries
Decision support queries routinely segment the data into groups. For example, a group-by query on the U.S. census database could be used to determine the per capita income per state.
However, group sizes can differ hugely; e.g., the state of California has nearly 70 times the population of Wyoming.
As a result, a uniform random sample of the relation contains very few tuples from the smaller groups, and answers for those groups are inaccurate because accuracy depends heavily on the number of sample tuples that belong to the group.
For a uniform sample, the standard error is inversely proportional to √n, where n is the size of the uniform random sample (for a per-group answer, the number of sample tuples in that group).
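As a rough worked example of the √n relationship, assume (this is not stated on the slide) that every group has the same per-tuple standard deviation σ and that a uniform sample of size X gives group g an expected n_g = X · N_g / N sample tuples, where N_g is the group size and N the relation size. These symbols are introduced here for illustration only.

```latex
\[
\mathrm{SE}_g \approx \frac{\sigma}{\sqrt{n_g}},
\qquad
n_g \approx X \cdot \frac{N_g}{N}
\]
\[
\frac{\mathrm{SE}_{\text{small}}}{\mathrm{SE}_{\text{large}}}
  \approx \sqrt{\frac{N_{\text{large}}}{N_{\text{small}}}}
  = \sqrt{70} \approx 8.4
\]
```

Under these assumptions, a group that is 70 times smaller gets an estimate whose standard error is roughly 8.4 times larger.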

Slide 4: Solution (Congressional Sampling)
Consider the U.S. Congress, which is a hybrid of the House and the Senate.
The House has representatives from each state in proportion to its population.
The Senate has an equal number of representatives from each state.
Apply the House and Senate scenario to the groups of a relation:
House sample: a uniform random sample of the relation, so each group is represented in proportion to its size.
Senate sample: an equal number of tuples sampled from each group.
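The contrast between the two allocations can be sketched in a few lines of Python. The function names and the group sizes below are hypothetical and chosen only to make the skew visible; the House numbers are expected counts under uniform sampling, not an actual random draw.

```python
def house_allocation(group_sizes, sample_size):
    """Proportional shares: the expected per-group counts of a uniform sample."""
    total = sum(group_sizes.values())
    return {g: sample_size * n / total for g, n in group_sizes.items()}


def senate_allocation(group_sizes, sample_size):
    """Equal shares: every group gets sample_size / (number of groups)."""
    share = sample_size / len(group_sizes)
    return {g: share for g in group_sizes}


# Hypothetical, heavily skewed group sizes (not taken from the slides).
group_sizes = {"large": 7000, "medium": 2900, "small": 100}

house = house_allocation(group_sizes, sample_size=100)
senate = senate_allocation(group_sizes, sample_size=100)

for g in group_sizes:
    print(f"{g:>6}: house={house[g]:6.2f}  senate={senate[g]:6.2f}")
```

With these sizes the House gives the small group a single expected tuple, while the Senate gives it the same share as the large group, which illustrates why a hybrid of the two is attractive.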

Slide 5: Solution (Congressional Sampling)
Consider a relation R with two grouping attributes, A and B.
Number of tuples in the groups: (a1, b1): 3000, (a1, b2): 3000, (a1, b3): 1500, (a2, b3)
Basic Congress (sample size = 100)
Table columns: A | B | House S_g,∅ | Senate S_g,AB | Basic Congress before scaling | Basic Congress
Table rows: (a1, b1), (a1, b2), (a1, b3), (a2, b3)

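Slide 6: Solution (Congressional Sampling)
Table columns: A | B | House S_g,∅ | Senate S_g,AB | Basic Congress before scaling | Basic Congress | S_g,A | S_g,B | Congress before scaling | Congress
Table rows: (a1, b1), (a1, b2), (a1, b3), (a2, b3)
S_g,A entries are written as shares of 50, e.g. 20 (of 50); S_g,B entries are written as shares of 33.3, e.g. 12.5 (of 33.3).
For (a2, b3): Congress before scaling = 50, Congress = 35.3.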
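The allocation behind this table can be sketched as follows. For each subset T of the grouping attributes, the sample is split equally among the groups that T induces (50 each for the two values of A, 33.3 each for the three values of B), and each share is then divided among the finer (A, B) groups in proportion to their sizes. Each (A, B) group takes the maximum over all T, and the results are scaled back to the sample size. This is a minimal sketch, not the presenter's code: the function name is made up, and the size of group (a2, b3) is missing from the slide, so 2500 is an assumption, chosen because it reproduces the entries 12.5 (of 33.3), 50, and 35.3.

```python
from collections import defaultdict
from itertools import combinations


def congress_allocation(group_sizes, sample_size, groupings):
    """Scaled max-over-groupings allocation.

    group_sizes: dict mapping a tuple of attribute values to a tuple count.
    groupings:   iterable of index tuples, each a subset T of the attributes.
    """
    best = {g: 0.0 for g in group_sizes}
    for T in groupings:
        # Size of each coarser group obtained by projecting onto T.
        coarse = defaultdict(int)
        for g, n in group_sizes.items():
            coarse[tuple(g[i] for i in T)] += n
        share = sample_size / len(coarse)  # equal share per coarse group
        for g, n in group_sizes.items():
            h = tuple(g[i] for i in T)
            # Split the coarse group's share proportionally to group size.
            best[g] = max(best[g], share * n / coarse[h])
    total = sum(best.values())
    return {g: sample_size * v / total for g, v in best.items()}


sizes = {
    ("a1", "b1"): 3000,
    ("a1", "b2"): 3000,
    ("a1", "b3"): 1500,
    ("a2", "b3"): 2500,  # ASSUMPTION: this count is missing from the slide
}

# Basic Congress uses only House (T = {}) and Senate (T = {A, B}).
basic = congress_allocation(sizes, 100, [(), (0, 1)])

# Congress uses every subset of the grouping attributes.
all_subsets = [c for r in range(3) for c in combinations(range(2), r)]
congress = congress_allocation(sizes, 100, all_subsets)

for g in sizes:
    print(g, round(basic[g], 1), round(congress[g], 1))
```

Under the assumed size, the (a2, b3) group receives about 35.3 sample tuples under Congress, which agrees with the value in the table.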

Slide 7: Congressional Sampling
Basic Congress: sample size allocated to each group
Congress: sample size allocated to each group
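The formulas themselves are not present in the transcript. One formulation consistent with the column labels S_g,∅, S_g,AB, S_g,A, and S_g,B used in the example is given below. The symbols are assumptions introduced here: X is the sample size, G the set of grouping attributes, m_T the number of non-empty groups under a subset T of G, n_g the number of tuples in group g, and n_h the number of tuples in the group h under T that contains g.

```latex
\[
S_{g,T} = \frac{X}{m_T}\cdot\frac{n_g}{n_h},
\qquad
S_{g,\emptyset} = X\,\frac{n_g}{|R|}\ \text{(House)},
\qquad
S_{g,G} = \frac{X}{m_G}\ \text{(Senate)}
\]
\[
\mathrm{BasicCongress}(g)
  = X\cdot\frac{\max\left(S_{g,\emptyset},\, S_{g,G}\right)}
               {\sum_{j}\max\left(S_{j,\emptyset},\, S_{j,G}\right)}
\]
\[
\mathrm{Congress}(g)
  = X\cdot\frac{\max_{T\subseteq G} S_{g,T}}
               {\sum_{j}\max_{T\subseteq G} S_{j,T}}
\]
```

The denominators rescale the per-group maxima so that the allocations sum to X.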

Slide 8: Rewriting
Query rewriting involves two key steps: a) scaling up the aggregate expressions and b) deriving error bounds on the estimate.
Let the ScaleFactor (SF) of a tuple be the inverse of the sampling rate of its stratum.
There are two ways to associate each tuple with its ScaleFactor: a) store the ScaleFactor with each tuple in the sample relation, or b) use a separate table to store the ScaleFactors for the groups.
Relation Rel with two example tuples (key column K; grouping columns A, B, C; aggregate column Q):
K   A   B   C   Q
k1  a1  b1  c1  q1
k2  a1  b1  c2  q2
Example query: Select A, B, sum(Q) From Rel Group by A, B
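The two storage options can be illustrated with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table names SampRelSF, SampRel, and AuxRel, the numeric Q values, and the scale factors are all hypothetical, and the rewritten queries are illustrative rather than copied from the slides. The two example tuples fall in the same (A, B) group but in different strata, so they carry different scale factors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option (a): the ScaleFactor is stored with each sample tuple.
cur.execute("CREATE TABLE SampRelSF (K TEXT, A TEXT, B TEXT, C TEXT, Q REAL, SF REAL)")
cur.executemany(
    "INSERT INTO SampRelSF VALUES (?, ?, ?, ?, ?, ?)",
    [("k1", "a1", "b1", "c1", 10.0, 4.0),   # hypothetical values
     ("k2", "a1", "b1", "c2", 20.0, 2.0)],
)
with_sf_column = cur.execute(
    "SELECT A, B, SUM(Q * SF) FROM SampRelSF GROUP BY A, B"
).fetchall()

# Option (b): the ScaleFactors live in a separate table keyed by stratum.
cur.execute("CREATE TABLE SampRel (K TEXT, A TEXT, B TEXT, C TEXT, Q REAL)")
cur.execute("CREATE TABLE AuxRel (A TEXT, B TEXT, C TEXT, SF REAL)")
cur.executemany(
    "INSERT INTO SampRel VALUES (?, ?, ?, ?, ?)",
    [("k1", "a1", "b1", "c1", 10.0),
     ("k2", "a1", "b1", "c2", 20.0)],
)
cur.executemany(
    "INSERT INTO AuxRel VALUES (?, ?, ?, ?)",
    [("a1", "b1", "c1", 4.0),
     ("a1", "b1", "c2", 2.0)],
)
with_aux_table = cur.execute(
    """
    SELECT s.A, s.B, SUM(s.Q * x.SF)
    FROM SampRel AS s
    JOIN AuxRel AS x ON s.A = x.A AND s.B = x.B AND s.C = x.C
    GROUP BY s.A, s.B
    """
).fetchall()

print(with_sf_column)
print(with_aux_table)
conn.close()
```

Both queries scale each sampled Q by its stratum's ScaleFactor before summing, so they return the same estimate. Option (a) avoids a join at query time, while option (b) avoids repeating the scale factor on every sample tuple.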

Slide 9: Rewriting (Integrated Rewriting)

Slide 10: Normalized Rewriting

Slide 11: Key-normalized Rewriting

Slide 12: Nested-integrated Rewriting
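The query shown on this slide is not preserved in the transcript. One plausible reading of the name, offered here as an assumption, is a nested query over a sample relation that stores SF with each tuple: an inner query sums Q per (A, B, SF), and an outer query multiplies each partial sum by its scale factor once, instead of multiplying every tuple. The sketch below uses hypothetical table contents and checks that the nested form agrees with the flat SUM(Q * SF) form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sample relation with the ScaleFactor stored per tuple.
cur.execute("CREATE TABLE SampRel (K TEXT, A TEXT, B TEXT, C TEXT, Q REAL, SF REAL)")
cur.executemany(
    "INSERT INTO SampRel VALUES (?, ?, ?, ?, ?, ?)",
    [("k1", "a1", "b1", "c1", 10.0, 4.0),
     ("k2", "a1", "b1", "c2", 20.0, 2.0),
     ("k3", "a1", "b1", "c1", 5.0, 4.0)],
)

# Flat form: one multiplication per sample tuple.
flat = cur.execute(
    "SELECT A, B, SUM(Q * SF) FROM SampRel GROUP BY A, B"
).fetchall()

# Nested form (assumed shape): one multiplication per (A, B, SF) partial sum.
nested = cur.execute(
    """
    SELECT A, B, SUM(SQ * SF)
    FROM (SELECT A, B, SF, SUM(Q) AS SQ
          FROM SampRel
          GROUP BY A, B, SF) AS partial
    GROUP BY A, B
    """
).fetchall()

print(flat)
print(nested)
conn.close()
```

The two forms are algebraically equivalent because all tuples within one (A, B, SF) partial group share the same scale factor.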

Slide 13: Performance
Three queries
Grouping on returnflag, linestatus, shipdate
Skewed group sizes, z = 1.5
Sample percentage at 7%

Slide 14: Performance

Slide 15: Performance

Slide 16: Performance

Slide 17: Performance
Times taken for different sample percentages
Actual query time = 40 sec

Slide 18: Conclusions
Congressional samples are effective for queries with arbitrary group-bys (including none).