February 14, 2006, CS6392 - DB Exploration. Congressional Samples for Approximate Answering of Group-By Queries. Swarup Acharya, Phillip B. Gibbons, Viswanath.

Presentation transcript:

Slide 1: Congressional Samples for Approximate Answering of Group-By Queries
Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala
Presented by: Muhammed Z. Miah
February 14, 2006, CS DB Exploration

Slide 2: Introduction
Limitations of uniform sampling:
- Presence of skewed data in aggregate values
- Effect of low selectivity in selection queries
- Presence of small groups in group-by queries
Biased sampling for group-by queries:
- (Precomputed) biased sampling: a hybrid union of biased and uniform sampling

Slide 3: Aqua System (Architecture)

Slide 4: Problems with Group-By Queries
Decision support queries routinely segment the data into groups. For example, a group-by query on the U.S. census database could be used to determine the per capita income per state.
However, group sizes can differ hugely; for example, the state of California has nearly 70 times the population of Wyoming. As a result, a uniform random sample of the relation contains far fewer tuples from the smaller groups. Answers for those groups are then less accurate, because accuracy depends heavily on the number of sample tuples that belong to the group.
For a uniform sample, the standard error is inversely proportional to √n, where n is the size of the uniform random sample.
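To make the 1/√n effect concrete, here is a minimal sketch. The group sizes and the sample budget are hypothetical numbers chosen only to reproduce the roughly 70:1 ratio mentioned on the slide, and `relative_standard_error` is an illustrative helper that reports the error up to a constant factor, not a full variance estimate.

```python
import math

def expected_group_sample_sizes(group_sizes, sample_size):
    """Expected number of tuples per group in a uniform random sample of the whole relation."""
    total = sum(group_sizes.values())
    return {g: sample_size * n / total for g, n in group_sizes.items()}

def relative_standard_error(n_sample):
    """Standard error up to a constant factor: proportional to 1/sqrt(n)."""
    return 1.0 / math.sqrt(n_sample)

# Hypothetical group sizes with a 70:1 ratio (assumption, not census data).
group_sizes = {"California": 700_000, "Wyoming": 10_000}
expected = expected_group_sample_sizes(group_sizes, sample_size=7_100)

for state, n in expected.items():
    print(state, round(n), round(relative_standard_error(n), 4))
```

With these numbers the small group receives about 100 of the 7,100 sample tuples, so its relative standard error is about √70 (roughly 8.4) times that of the large group.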

Slide 5: Solution (Congressional Sampling)
Congressional samples are a hybrid union of uniform and biased samples.
The analogy is the U.S. Congress, which is a hybrid of the House and the Senate. The House has representatives from each state in proportion to its population. The Senate has an equal number of representatives from each state. The same two schemes are applied to the groups of a relation:
- House sample: a uniform random sample of the relation, so each group is represented in proportion to its size.
- Senate sample: an equal number of tuples from each group, that is, the available sample space X is divided equally among the groups and a uniform random sample is taken within each group.
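The two allocation schemes can be sketched as follows. This is an illustration under stated assumptions: the function names and group counts are hypothetical, the House allocation is read as proportional to group size, and the Senate allocation ignores the case where a group has fewer tuples than its equal share.

```python
def house_allocation(group_sizes, X):
    """House: sample space X split in proportion to group size (uniform sample of the relation)."""
    total = sum(group_sizes.values())
    return {g: X * n / total for g, n in group_sizes.items()}

def senate_allocation(group_sizes, X):
    """Senate: sample space X split equally among the non-empty groups."""
    share = X / len(group_sizes)
    return {g: share for g in group_sizes}

# Hypothetical group sizes (assumption for illustration only).
group_sizes = {"A": 9000, "B": 900, "C": 100}
house = house_allocation(group_sizes, X=300)
senate = senate_allocation(group_sizes, X=300)

print(house)   # large groups dominate
print(senate)  # every group gets the same share
```

House favors queries without a group-by, where large groups matter most; Senate favors the smallest groups. Congress is meant to sit between the two.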

Slide 6: Solution (Congressional Sampling)
Define a strategy S1 as follows: divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
Congressional approach: consider the entire set of possible group-by queries over a relation R. Let $\mathcal{G}$ be the set of non-empty groups under the grouping G. The grouping G partitions the relation R according to the cross-product of all the grouping attributes; this is the finest possible partitioning for group-bys on R. Any group h on any other grouping $T \subseteq G$ is the union of one or more groups g from $\mathcal{G}$.
Constructing Congress:
1. Apply S1 on each $T \subseteq G$.
2. Let $\mathcal{T}$ be the set of non-empty groups under the grouping T, and let $m_T = |\mathcal{T}|$ be the number of such groups.
3. By S1, each of the non-empty groups in $\mathcal{T}$ should get a uniform random sample of $X / m_T$ tuples from the group.

Slide 7: Solution (Congressional Sampling)
Constructing Congress (continued):
4. Thus, for each subgroup g in $\mathcal{G}$ of a group h in $\mathcal{T}$, the expected space allocated to g is simply
   $$S_{g,T} = \frac{X}{m_T} \cdot \frac{n_g}{n_h},$$
   where $n_g$ and $n_h$ are the number of tuples in g and h respectively.
5. Then, for each group g, take the maximum over all T of $S_{g,T}$ as the sample size for g, and scale it down to limit the space used to X. The final formula is:
   $$\mathrm{SampleSize}(g) = X \cdot \frac{\max_{T \subseteq G} S_{g,T}}{\sum_{j \in \mathcal{G}} \max_{T \subseteq G} S_{j,T}}$$
6. For each group g in $\mathcal{G}$, select a uniform random sample of size SampleSize(g).
Thus we have a stratified, biased sample in which each group at the finest partitioning is its own stratum. Congress therefore essentially guarantees that both large and small groups in all groupings will have a reasonable number of samples.
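The construction steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the expected allocation for a finest-level group g inside a group h of grouping T is S(g,T) = (X / m_T) * (n_g / n_h), it enumerates every subset T of the grouping attributes (including the empty set, which corresponds to no group-by), and the function name and tuple counts are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def congress_allocation(finest_counts, num_attrs, X):
    """finest_counts maps a tuple of grouping-attribute values to its tuple count n_g."""
    best = {g: 0.0 for g in finest_counts}
    for k in range(num_attrs + 1):
        for T in combinations(range(num_attrs), k):
            # n_h: size of each group h under grouping T (projection of g onto T).
            n_h = defaultdict(int)
            for g, n_g in finest_counts.items():
                n_h[tuple(g[i] for i in T)] += n_g
            m_T = len(n_h)  # number of non-empty groups under T
            for g, n_g in finest_counts.items():
                h = tuple(g[i] for i in T)
                s_gT = (X / m_T) * n_g / n_h[h]
                best[g] = max(best[g], s_gT)  # maximum over all T
    scale = X / sum(best.values())  # scale down so the total space is X
    return {g: scale * s for g, s in best.items()}

# Hypothetical relation with two grouping attributes A and B.
finest_counts = {
    ("a1", "b1"): 3000,
    ("a1", "b2"): 3000,
    ("a1", "b3"): 1500,
    ("a2", "b3"): 2500,
}
allocation = congress_allocation(finest_counts, num_attrs=2, X=100)
for g, size in allocation.items():
    print(g, round(size, 2))
```

For these counts the allocation comes out at roughly 23.5, 23.5, 17.6, and 35.3 tuples, which sits between the proportional (30, 30, 15, 25) and equal (25 each) splits. The enumeration is exponential in the number of grouping attributes, so the sketch is only meant for small examples.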

Slide 8: Rewriting
Query rewriting involves two key steps: (a) scaling up the aggregate expressions and (b) deriving error bounds on the estimate.
For each tuple, let its scale factor, ScaleFactor, be the inverse sampling rate for its stratum. All the sample tuples belonging to a group have the same ScaleFactor. The key step in scaling is therefore to associate each tuple efficiently with its corresponding ScaleFactor. There are two approaches to doing this:
(a) Store the ScaleFactor (SF) with each tuple in the sample relation: Integrated.
(b) Use a separate table to store the ScaleFactors for the groups: Normalized, Key-normalized, Nested-integrated.
Each approach has its pros and cons.
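A rough sketch of the scaling step, covering only the Integrated and Normalized layouts. The row layout, function names, and sample values are assumptions made for illustration; error bounds and the other rewriting variants are not shown.

```python
from collections import defaultdict

def estimate_integrated(sample_rows):
    """Integrated layout: each sample row is (group, value, scale_factor)."""
    sums, counts = defaultdict(float), defaultdict(float)
    for group, value, sf in sample_rows:
        sums[group] += value * sf   # scaled-up SUM
        counts[group] += sf         # scaled-up COUNT
    return {g: {"sum": sums[g], "count": counts[g], "avg": sums[g] / counts[g]}
            for g in sums}

def estimate_normalized(sample_rows, scale_factors):
    """Normalized layout: rows are (group, value); ScaleFactors live in a separate table."""
    joined = [(g, v, scale_factors[g]) for g, v in sample_rows]  # join on the group key
    return estimate_integrated(joined)

# Hypothetical sample: CA sampled at rate 1/100 (SF = 100), WY at rate 1/10 (SF = 10).
integrated_sample = [("CA", 10.0, 100.0), ("CA", 20.0, 100.0), ("CA", 30.0, 100.0),
                     ("WY", 5.0, 10.0), ("WY", 15.0, 10.0)]
normalized_sample = [(g, v) for g, v, _ in integrated_sample]
sf_table = {"CA": 100.0, "WY": 10.0}

integrated_result = estimate_integrated(integrated_sample)
normalized_result = estimate_normalized(normalized_sample, sf_table)
print(integrated_result)
```

Both layouts give the same estimates; the trade-off is storage and update cost (one SF per tuple) against the cost of the extra join at query time.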

Slide 9: Computation and Maintenance
One-pass algorithm [AGP99b]
[AGP99b] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. Technical report, Bell Laboratories, Murray Hill, New Jersey, November 1999.

Slide 10: Experiments
Testbed: Aqua, with Oracle (v7)
Accuracy of sample allocation strategies
Performance for different query sets:
- Queries with no group-bys, three group-bys, and two group-bys
Effect of sample size:
- Error drops as more space is allocated to store the samples
- Congress: error drops rapidly with increasing sample size, and accuracy is high even for arbitrary group-bys
Performance of rewriting strategies

Slide 11: Extensions
- Generalization to multiple criteria
- Generalization to other queries

Slide 12: Related Work
- Online aggregation
- Histograms
- Wavelets
- Biased sampling (stratified sampling)

Slide 13: Conclusions
- Congressional samples are effective for group-by queries with arbitrary group-bys (including none).
- The new strategies were validated experimentally, both for their ability to produce accurate estimates for group-by queries and for their execution efficiency.

Slide 14: Thank You
Happy Valentine's Day