HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries
Xiaohui Yu, University of Toronto
Joint work with Nick Koudas (University of Toronto) and Calisto Zuzarte (IBM Toronto Lab)

Outline
- Background: Motivation, Related Work
- HASE: Estimator, Algorithms, Bounds, Experiments
- Conclusions

Query Optimization
- Execution plans differ in cost, and the difference can be huge (1 second vs. 1 hour). Which plan to choose?
- Query optimization: estimate the costs of the different plans and choose the plan with the least cost.
- Cost estimation depends on many factors: run-time environment, data properties, ...

Selectivity
- An important factor in costing is selectivity: the fraction of records satisfying a predicate (s). E.g., if 100 out of 10,000 records have salary > 3000, then s = 100/10000 = 0.01.
- Selectivity can make a big difference. (Figure: cost vs. selectivity s for two plans. Plan 1, table scan: cost = const1, independent of s. Plan 2, index scan: cost = s * const2. The slide marks s = 0.01, where the index scan is cheaper.)
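
To make the cost crossover concrete, here is a minimal Python sketch of the plan choice implied by the two cost models on this slide. The constant values are invented for illustration; the slide only names const1 and const2.

```python
# Hypothetical cost constants for illustration; the slide gives only the
# names const1 (table scan) and const2 (index scan), not their values.
TABLE_SCAN_COST = 1000.0   # const1: flat cost, independent of selectivity
INDEX_SCAN_UNIT = 50000.0  # const2: per-selectivity cost factor

def cheaper_plan(s: float) -> str:
    """Pick the cheaper plan for a given selectivity s."""
    table_scan = TABLE_SCAN_COST      # cost = const1
    index_scan = s * INDEX_SCAN_UNIT  # cost = s * const2
    return "index scan" if index_scan < table_scan else "table scan"

print(cheaper_plan(0.01))  # low selectivity -> index scan
print(cheaper_plan(0.5))   # high selectivity -> table scan
```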

Related Work
- Two streams of work: synopsis-based and sampling-based.
- Synopses capture the characteristics of the data; they are built off-line and used on-line. E.g., histograms.

Histograms
(Figure: a histogram over Salary for a table of 5000 records, with the buckets above the predicate boundary highlighted in red; the individual bucket counts were lost in transcription.)
Q: What is the selectivity of salary > 3000?
A: Number of records in the red buckets / total number of records. Estimate = (sum of the red bucket counts) / 5000.
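
A small self-contained sketch of histogram-based estimation, using synthetic salaries and an equi-width histogram; the slide's actual bucket boundaries and counts did not survive transcription.

```python
import random

# Build a simple equi-width histogram over Salary and use it to estimate
# the selectivity of salary > 3000. Data and bucket width are illustrative.
random.seed(0)
salaries = [random.randint(1000, 6000) for _ in range(5000)]

BUCKET_WIDTH = 1000
histogram = {}
for s in salaries:
    lo = (s // BUCKET_WIDTH) * BUCKET_WIDTH
    histogram[lo] = histogram.get(lo, 0) + 1

# Estimate: sum the counts of buckets starting at or above the threshold.
# Only the single value 3000 is misclassified here, a negligible error.
threshold = 3000
estimate = sum(cnt for lo, cnt in histogram.items() if lo >= threshold)
print("selectivity estimate:", estimate / len(salaries))
```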

Synopses: pros and cons
- Pros: built offline and used many times, so the overhead at selectivity-estimation time is minimal.
- Cons: it is difficult to capture all useful information in a limited space; in particular, correlation between attributes is hard to represent.

Sampling
- Number of records in the table: 10,000. Sample size: 100.
- Number of sampled records having age > 50 and salary > 5500: 12.
- Selectivity estimate = 12/100 = 0.12. True selectivity = 0.09.
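
The same estimate can be reproduced end to end in a few lines. The table below is synthetic, so the counts will differ from the slide's 12 and 0.09.

```python
import random

# Draw a simple random sample of 100 records from a 10,000-record table
# and count how many satisfy age > 50 AND salary > 5500.
random.seed(1)
N, n = 10_000, 100
table = [(random.randint(20, 70), random.randint(1000, 8000))
         for _ in range(N)]

sample = random.sample(table, n)
matches = sum(1 for age, salary in sample if age > 50 and salary > 5500)
print("estimate:", matches / n)

true_s = sum(1 for age, salary in table if age > 50 and salary > 5500) / N
print("true selectivity:", true_s)
```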

Sampling: pros and cons
- The good: the sample carries correlation information across attributes.
- The bad: cost. Accurate results require a large portion of the data to be accessed, and random access is much slower than sequential access.

Summary

                       Synopses       Sampling
  Runtime cost         low            high
  Correlation info     not captured   captured

Can we take the best of both worlds? Capture correlation and reduce the sampling rate.

Outline
- Background: Motivation, Related Work
- Our approach (HASE): Estimator, Algorithms, Bounds, Experiments
- Conclusions

HASE
- A hybrid approach to selectivity estimation.
- Goal: consistent utilization of both sources of information (synopses and samples).
- Benefits: (1) correlation is captured, via sampling; (2) the sample size can be significantly smaller, thanks to the histograms.

Problem setting
- Data: a table of size N.
- Query: a conjunction of predicates, Q = P1 ^ P2 ^ P3 ^ ..., e.g., (age > 50) ^ (salary > 5500) ^ (hire_date > "...").
- Available information: the selectivities of the individual predicates, obtained from synopses (e.g., s1 = 0.1, s2 = 0.2, s3 = 0.05), and a sample S of n records. The inclusion probability of record j is pi_j; for simple random sampling (SRS), pi_j = n/N.
- Goal: estimate the selectivity s of the query Q.
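
A sketch of this setting in Python, with a synthetic table and hypothetical predicate thresholds standing in for the slide's P1, P2, P3.

```python
import random

# Problem setting: a table of N records, a conjunctive query
# Q = P1 ^ P2 ^ P3, and a simple random sample in which every record
# has inclusion probability pi_j = n / N. All data is synthetic.
random.seed(2)
N, n = 10_000, 100
table = [{"age": random.randint(20, 70),
          "salary": random.randint(1000, 8000),
          "hire_year": random.randint(1990, 2005)} for _ in range(N)]

predicates = [
    lambda r: r["age"] > 50,
    lambda r: r["salary"] > 5500,
    lambda r: r["hire_year"] > 2000,  # stand-in for the hire_date predicate
]

sample = random.sample(table, n)
pi = n / N  # inclusion probability of every record under SRS
print("pi_j =", pi)
print("sample records satisfying Q:",
      sum(1 for r in sample if all(p(r) for p in predicates)))
```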

Example
- Table R with 10,000 records; query Q = P1 ^ P2 on two attributes.
- Suppose 500 records satisfy both predicates. True selectivity s = 500/10000 = 0.05.

Histogram-based estimate
- Based on the histograms, s1 = 0.6 and s2 = 0.3.
- Assuming independence between attributes, the selectivity estimate is s1 * s2 = 0.6 * 0.3 = 0.18.
- Relative error = |0.18 - 0.05| / 0.05 = 260%.

Sampling-based estimate
- Sample weight of record j: d_j = 1/pi_j.
- Indicator variable I_j(Q) = 1 if record j satisfies Q, and 0 otherwise.
- Selectivity estimate (Horvitz-Thompson estimator): s_hat = (1/N) * sum over j in S of d_j * I_j(Q).
- Take an SRS of size 100, so d_j = 10000/100 = 100. If 9 sampled records satisfy Q, the estimate is 9 * 100 / 10000 = 0.09.
- Relative error = |0.09 - 0.05| / 0.05 = 80%.
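
A minimal implementation of the Horvitz-Thompson estimate under SRS, on synthetic data; under SRS every d_j equals N/n, so the weighted sum reduces to the sample match count scaled by N/n and divided by N.

```python
import random

# Horvitz-Thompson estimator: weight each sampled record by d_j = 1/pi_j,
# sum the weights of records satisfying Q, and divide by the table size N.
random.seed(3)
N, n = 10_000, 100
table = [(random.randint(20, 70), random.randint(1000, 8000))
         for _ in range(N)]

sample = random.sample(table, n)
d = N / n  # d_j = 1 / pi_j = N / n for every j under SRS

ht_estimate = sum(d for age, salary in sample
                  if age > 50 and salary > 5500) / N
print("HT selectivity estimate:", ht_estimate)
```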

A new estimator
- Known selectivities through histograms: s1, s2, ...
- Replace the original weights d_j with new weights w_j that (1) reproduce the known selectivities of the individual predicates and (2) stay as close to d_j as possible.
- The result is the calibration estimator: s_hat = (1/N) * sum over j in S of w_j * I_j(Q).

Consistency with known selectivities
(Table: observed frequencies from the sample, a 2x2 layout over P1 = true/false and P2 = true/false; the cell counts were lost in transcription.)
- 100 sample records from 10,000 records in the table, so d_j = 100.
- The weighted frequencies should reproduce the known selectivity s1 = 0.6.

Calibration estimator
- Why do we want w_j to be as close to d_j as possible?
- The weights d_j have the property of producing unbiased estimates; keeping w_j close to d_j keeps the calibrated estimator nearly unbiased.

Constrained optimization problem
- Distance function D(x), where x = w_j / d_j.
- Minimize, with respect to the w_j, sum over j in S of d_j * D(w_j / d_j) (keep the new weights as close to d_j as possible),
- subject to: sum over j in S of w_j * I_j(P_i) = N * s_i for each predicate P_i (reproduce the known selectivities), where I_j(P_i) = 1 if record j satisfies P_i and 0 otherwise.

An algorithm based on Newton's method
- Apply the method of Lagrange multipliers: minimizing the Lagrangian with respect to the w_j yields weights of the form w_j = d_j * F(lambda . x_j), where x_j is the vector of predicate indicators for record j, lambda is the vector of Lagrange multipliers, and F is determined by the distance function D.
- The multipliers lambda can be solved for using Newton's method via an iterative procedure; the weights w_j then follow.
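
For the linear (chi-square) distance function, Newton's method converges in a single step and the multipliers have a closed form. The sketch below implements that special case; it follows the standard calibration derivation, not necessarily the paper's exact procedure.

```python
import numpy as np

# Calibration with the linear distance D(x) = (x - 1)^2 / 2, for which
# one Newton step is exact:
#   w_j = d_j * (1 + x_j . lam),  lam = T^{-1} (t - sum_j d_j x_j),
#   T   = sum_j d_j x_j x_j^T,
# where x_j is the vector of predicate indicators for sampled record j
# and the targets t_i = N * s_i come from the histograms.

def calibrate_linear(x, d, t):
    """x: (n, k) indicator matrix; d: (n,) design weights; t: (k,) targets."""
    T = (x * d[:, None]).T @ x             # k x k weighted Gram matrix
    lam = np.linalg.solve(T, t - x.T @ d)  # Lagrange multipliers
    return d * (1.0 + x @ lam)             # calibrated weights w_j

rng = np.random.default_rng(4)
N, n = 10_000, 100
x = (rng.random((n, 2)) < [0.6, 0.3]).astype(float)  # P1, P2 indicators
d = np.full(n, N / n)                                # d_j = N / n
t = N * np.array([0.6, 0.3])                         # known selectivities

w = calibrate_linear(x, d, t)
print("calibrated totals:", x.T @ w)  # matches t, reproducing s1 and s2
```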

An alternative algorithm
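
The body of this slide did not survive transcription. A common alternative in the calibration literature to solving the full Newton system is iterative proportional fitting (raking), which enforces one constraint at a time with multiplicative updates; whether this matches the talk's algorithm is an assumption.

```python
import numpy as np

# Iterative proportional fitting (raking): repeatedly rescale the weights
# so that each constraint sum_j w_j x_ij = t_i holds, cycling until all
# constraints are (approximately) satisfied. Multiplicative updates keep
# w_j > 0 throughout. A sketch, not necessarily the talk's algorithm.

def rake(x, d, t, iters=50, tol=1e-10):
    w = d.astype(float).copy()
    for _ in range(iters):
        for i in range(x.shape[1]):
            mask = x[:, i] > 0
            current = w[mask].sum()
            if current > 0:
                w[mask] *= t[i] / current  # enforce constraint i exactly
        if np.max(np.abs(x.T @ w - t)) < tol:
            break
    return w

rng = np.random.default_rng(5)
N, n = 10_000, 100
x = (rng.random((n, 2)) < [0.6, 0.3]).astype(float)
d = np.full(n, N / n)
t = N * np.array([0.6, 0.3])
w = rake(x, d, t)
print("constraint totals:", x.T @ w)  # approximately equal to t
```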

Example
(Table: observed frequencies from the sample over P1 = true/false and P2 = true/false, with the calibrated weights applied; the cell values were lost in transcription.)

Distance measures
- Requirements on the distance function: (1) D is positive and strictly convex; (2) D(1) = D'(1) = 0; (3) D''(1) = 1.
- Linear function: only one iteration required, so it is fast; but w_j < 0 is possible, which can produce negative estimates.
- Multiplicative function: converges after a few iterations (typically two); w_j > 0 always.
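
The explicit formulas for the two distance functions are not in the surviving slide text. The sketch below uses the standard chi-square and entropy (raking) forms from the calibration literature, an assumption, and numerically checks the three requirements listed above.

```python
import numpy as np

# Two candidate distance functions and the weight-update form
# F(u) = (D')^{-1}(u) each induces in w_j = d_j * F(x_j . lam).

def D_linear(x):  # chi-square distance: one Newton step, w_j < 0 possible
    return 0.5 * (x - 1.0) ** 2

def F_linear(u):
    return 1.0 + u  # inverse of D'_linear(x) = x - 1

def D_mult(x):  # multiplicative (entropy) distance: w_j > 0 always
    return x * np.log(x) - x + 1.0

def F_mult(u):
    return np.exp(u)  # inverse of D'_mult(x) = log(x)

for D, name in [(D_linear, "linear"), (D_mult, "multiplicative")]:
    # Check the slide's requirements: D(1) = D'(1) = 0, D''(1) = 1,
    # using central finite differences around x = 1.
    eps = 1e-6
    d1 = (D(1 + eps) - D(1 - eps)) / (2 * eps)
    d2 = (D(1 + eps) - 2 * D(1.0) + D(1 - eps)) / eps ** 2
    print(f"{name}: D(1)={D(1.0):.1e}  D'(1)={d1:.1e}  D''(1)={d2:.3f}")
```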

Error bounds  Probabilistic bounds = Pr ( both j and l are in the sample )

Experiments
- Synthetic data: skew via a Zipfian distribution (z = 0, 1, 2, 3); correlation via the correlation coefficient between attributes, in [0, 1].
- Real data: Census-Income data from the UCI KDD Archive (population surveys by the US Census Bureau; about 200,000 records, 40 attributes).
- Queries: range queries (attribute <= constant) and equality queries (attribute = constant).

Effect of correlation (results figure)

Effect of data skew (results figure)

Effect of sample rate (results figure)

Effect of number of attributes (results figure)

Conclusions
- HASE combines synopsis-based and sampling-based selectivity estimation.
- Contributions: the calibrated estimator, algorithms for computing it, probabilistic bounds on errors, and experimental results.
- Benefits: (1) correlation is captured, via sampling; (2) the sample size can be significantly smaller, thanks to the histograms.