End-biased Samples for Join Cardinality Estimation Cristian Estan, Jeffrey F. Naughton Computer Sciences Department University of Wisconsin-Madison.

Problem description
- Estimating join size
  - Not restricted to key-foreign key joins
  - Based on summaries of the two tables computed separately
- Two main contributions of this paper
  - Proposing a new type of summary based on a special type of sampling
  - Extensive experimental comparison of many types of summaries

We can get more accurate estimates!
- [AGMS99] showed that on certain data sets
  - All summaries give inaccurate estimates
  - Estimates based on random sampling are within a constant factor of the bound
- We show that
  - On other data sets, our estimates are significantly more accurate than those with random sampling
  - No known summary gives estimates more accurate than all others for every data set

Overview
- End-biased samples
- Theoretical comparison against other sampling-based methods
- Experimental comparison against sketches and histograms

Building the end-biased samples
- If the frequency of every value is known for both tables → exact join size
- We keep a sample of this data
  - Sampling probability proportional to frequency [DLT01]
  - Sampling decisions correlated by using a shared hash function [F90],[DG00],[EKMV04]
- Example with sampling threshold T=4
  - Frequency of values of join attribute in table A: (c,10) p=1, (g,1) p=0.25, (m,2) p=0.5, (s,5) p=1, (t,1) p=0.25
  - Frequency of values of join attribute in table B: (d,1) (g,1) (m,1) (r,7) (z,1), with p=1 for r and p=0.25 for the others
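The sampling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names `shared_hash` and `end_biased_sample`, the SHA-256 based hash, and the seed string are all assumptions made for the example. A value is kept with probability min(1, frequency/T), and because both tables use the same hash, the sampling decisions for a value are correlated across tables.

```python
import hashlib


def shared_hash(value, seed="demo-seed"):
    """Map a value to a pseudo-random number in [0, 1).

    Hypothetical helper: every table must use the same function and seed
    so that sampling decisions for the same value are correlated.
    """
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    # Keep 53 bits so the result is exactly representable and below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def end_biased_sample(freqs, threshold):
    """Keep (value, frequency) with probability min(1, frequency / threshold)."""
    sample = {}
    for value, freq in freqs.items():
        # Values with freq >= threshold always pass, since freq / threshold >= 1.
        if shared_hash(value) < freq / threshold:
            sample[value] = freq
    return sample


# Frequencies from the slide's example, with sampling threshold T = 4.
table_a = {"c": 10, "g": 1, "m": 2, "s": 5, "t": 1}
table_b = {"d": 1, "g": 1, "m": 1, "r": 7, "z": 1}
sample_a = end_biased_sample(table_a, 4)
sample_b = end_biased_sample(table_b, 4)
```

Under these assumptions, the frequent values c, s and r are always retained, and a value such as g, which has the same frequency in both tables, is either in both samples or in neither.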

Estimating join size
- Let a_v be the frequency of value v in table A, b_v its frequency in table B, and p_v the probability that v is selected into both samples
- Sum the contributions a_v b_v / p_v of the values present in both samples to estimate the join size
  - If a_v ≥ T_a and b_v ≥ T_b, p_v = 1
  - If a_v ≥ T_a and b_v < T_b, p_v = b_v / T_b
  - If a_v < T_a and b_v ≥ T_b, p_v = a_v / T_a
  - If a_v < T_a and b_v < T_b, p_v = min(a_v / T_a, b_v / T_b)
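The four cases for p_v collapse into a single expression, min(1, a_v/T_a, b_v/T_b), which the following Python sketch uses. The function names and the dictionary representation of a sample (value mapped to its frequency) are assumptions for illustration, and T_a and T_b are taken to be the sampling thresholds of the two tables.

```python
def selection_probability(a_v, b_v, t_a, t_b):
    """Probability that a value lands in both correlated samples.

    Equivalent to the four cases on the slide: each ratio is capped at 1,
    and the smaller of the two decides.
    """
    return min(1.0, a_v / t_a, b_v / t_b)


def estimate_join_size(sample_a, sample_b, t_a, t_b):
    """Sum a_v * b_v / p_v over the values present in both samples."""
    total = 0.0
    for value, a_v in sample_a.items():
        b_v = sample_b.get(value)
        if b_v is None:
            continue  # value missing from one sample contributes nothing
        total += a_v * b_v / selection_probability(a_v, b_v, t_a, t_b)
    return total


# Hypothetical samples: only "m" appears in both.
estimate = estimate_join_size({"c": 10, "m": 2}, {"m": 1, "r": 7}, 4, 4)
```

In this example m has p_v = min(1, 2/4, 1/4) = 0.25, so its true contribution of 2 is scaled up to 8 to compensate for the values that were not sampled.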

Why correlate the samples?
- Example: two tables, each with 1000 values appearing once; 50 values are common to both tables
- We sample with probability 1/10
- Sample size ~ 100 for each table

Comparison             | Correlated    | Uncorrelated
p_v                    | 1/10          | 1/100
Common values sampled  | ~ 4, 5 or 6   | ~ 0 or 1
Join size estimate     | 40, 50 or 60  | 0 or 100
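A small Monte Carlo simulation makes the effect of correlation concrete. This is a sketch under stated assumptions: the function name `simulate`, the trial count, and the seed are arbitrary choices, and only the 50 common values are simulated because values present in a single table never contribute to the estimate. With correlated sampling a common value is in both samples with probability 1/10; with independent sampling the probability drops to 1/100, so each sampled value must be scaled by 100 instead of 10.

```python
import random
import statistics


def simulate(trials=2000, common=50, p=0.1, seed=0):
    """Return join size estimates for correlated and uncorrelated sampling."""
    rng = random.Random(seed)
    correlated, uncorrelated = [], []
    for _ in range(trials):
        hits_corr = 0
        hits_unc = 0
        for _ in range(common):
            # Correlated: one shared draw decides membership in both samples.
            if rng.random() < p:
                hits_corr += 1
            # Uncorrelated: each table makes its own independent draw.
            in_a = rng.random() < p
            in_b = rng.random() < p
            if in_a and in_b:
                hits_unc += 1
        # Each sampled common value contributes 1 * 1 / p_v to the estimate.
        correlated.append(hits_corr / p)
        uncorrelated.append(hits_unc / (p * p))
    return correlated, uncorrelated


corr, unc = simulate()
print("correlated:   mean", statistics.mean(corr), "stdev", statistics.stdev(corr))
print("uncorrelated: mean", statistics.mean(unc), "stdev", statistics.stdev(unc))
```

Both estimators should average close to the true join size of 50, but the uncorrelated one should show a noticeably larger spread, since its estimates can only be multiples of 100.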

Comparison of sampling methods
Accuracy of estimates of join size, by type of values dominating the join:
- Frequent in both relations: Good (random sampling), Very good (counting samples), Perfect (end-biased samples)
- Frequent in one relation: Bad
- Infrequent in both relations: Bad, Good

Overview
- End-biased samples
- Theoretical comparison against other sampling-based methods
- Experimental comparison against sketches and histograms

Experimental methodology
- Randomly generated tables with ~ 1,000,000 tuples
- Explored multiple configurations
  - Varied the “peakedness” of the distribution
  - Varied memory budget from 204 to 659,456 words
  - Varied the amount of correlation between tables
    - Uncorrelated: tables generated independently
    - Positively correlated: frequent values likely to be the same in both tables
    - Negatively correlated: frequent values unlikely to be the same in the two tables
- 1,000 runs for each configuration

Summaries compared
- End-biased samples
- End-biased equi-depth histograms [PC84]
- Sketches [AGMS99],[DGGR02],[GGR04]
- Concise samples [GM98]
- Counting samples [GM98]

Comparison with histograms

Comparison with sketches

Memory comparison

Qualitative comparison
- Columns: Advantage, Sketches, End-biased samples
- Streaming updates: X
- Simple configuration: X
- Selection on join attribute: X

Conclusions
- End-biased samples and sketches are the best summaries for the join size estimation problem addressed in this paper
- End-biased samples are compelling if
  - Selections on the join attribute are required
  - Summaries must be very concise
  - The frequencies of join attributes in the two tables are strongly correlated

Questions? Thank you! Scripts and results for experiments available at

Estimating the join size

Related work – sampling methods
- [GM98] concise samples, counting samples
- [DLT01] smart sampling
- [F90],[EKMV04] using a hash function to select values used as summary of data

Related work – join size estimation
- Histograms
- Multidimensional histograms
- [GG02],[GK04] Wavelets
- [AGMS99],[DGGR02],[GGR04] Sketches

Variance of join size estimate No slide, point to the paper.