
New Sampling-Based Estimators for OLAP Queries
Ruoming Jin, Kent State University; Leo Glimcher, The Ohio State University; Chris Jermaine, University of Florida; Gagan Agrawal, The Ohio State University

Approximate Query Processing
AQP is an active area of data management research
The goal is to provide accurate estimates of query answers without accessing the entire database
Especially useful and important for data warehousing and OLAP
Consider a system with a total of 10,000 disks, each holding 200 GB (2 PB in total)
– Takes 1 hour to scan
– Answering a single, simple aggregate query may need an hour: unacceptable to analysts/end users
If each disk costs $1000 per year to maintain, one simple query can cost
– about $1,142 = 10,000 × $1000 / (365 × 24)
– a prohibitive cost

OLAP Queries
Querying large relational tables composed of:
– Dimensional attributes: (mostly) categorical data, e.g., Sex, Country, State, City, Product Code, Department, Color, …
– Measure attributes: numerical data, e.g., Salary, Sales, Price, Number of Complaints, …
Aggregate queries
Most AQP techniques are tailored to numerical data
– Wavelets, kernels, histograms
– Problematic for categorical data and high dimensionality
Random sampling
– Well studied in statistical theory
– Can handle high-dimensional categorical data
– Provides estimates of the query results as well as of the estimates' accuracy

Confidence Interval
The measure of accuracy.
COMPLAINTS(PROF, SEMESTER, NUM_COMPLAINTS)
SELECT SUM(NUM_COMPLAINTS) FROM COMPLAINTS WHERE PROF = 'Smith' AND SEMESTER = 'Fa03'
A confidence bound:
– With a probability of 0.95, Prof. Smith received 27 to 29 complaints in the Fall of 2003
– Accuracy level: 0.95; interval width: 29 − 27 = 2

How to estimate the confidence interval?
Uniform sampling
– Central limit theorem (CLT)
– Delta methods
Assuming the distribution of an estimator ŷ of an aggregate query result y is approximately normal with mean E(ŷ) and variance V(ŷ) for a large sample, an approximate 95% confidence interval for the estimator is given by
[ŷ − 1.96 SE(ŷ), ŷ + 1.96 SE(ŷ)]
where 1.96 is the 97.5th percentile of the standard normal distribution, and SE(ŷ) is the standard error (the square root of the variance V(ŷ)).
Accuracy level: 95%; interval width: 2 × 1.96 SE(ŷ) = 3.92 SE(ŷ)
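As a minimal sketch of this CLT-based interval (the function name, the with-replacement variance formula, and the sample values below are illustrative, not from the talk): each sampled row contributes its per-row value to the SUM, the sample mean is scaled up by the table size N, and the standard error follows from the sample variance.

```python
import math

def clt_sum_ci(sample, N, z=1.96):
    """CLT-based confidence interval for a SUM over a table of N rows,
    estimated from a uniform random sample of n per-row contributions."""
    n = len(sample)
    mean = sum(sample) / n
    # unbiased sample variance of the per-row contributions
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    est = N * mean                 # y_hat: sample mean scaled up to the table
    se = N * math.sqrt(var / n)    # SE(y_hat)
    return est, (est - z * se, est + z * se)

# Illustrative sample of 8 per-row contributions from a 16-row table
# (rows not matching the predicate contribute 0)
est, (lo, hi) = clt_sum_ci([21, 7, 8, 0, 0, 0, 0, 0], N=16)
```

The resulting interval width is exactly 3.92 SE(ŷ), as stated on the slide.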

How to (cont'd)
Unequal probability sampling
– Stratified sampling
– Separate samples for each measure (numerical) attribute
Re-sampling
– Bootstrapping
– Computationally intensive
Distribution-free
– Chebyshev and Hoeffding bounds
– Loose bounds
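The bootstrapping bullet can be sketched as follows (function name, B, seed, and sample values are my own, illustrative choices): resample the observed sample with replacement many times, re-compute the scaled-up SUM estimate on each resample, and take the standard deviation of those estimates as the standard error.

```python
import random

def bootstrap_se(sample, N, B=2000, seed=0):
    """Bootstrap standard error of the scaled-up SUM estimator
    y_hat = N * mean(sample): resample B times with replacement,
    re-estimate, and take the std of the B estimates."""
    rng = random.Random(seed)
    n = len(sample)
    ests = [N * sum(rng.choices(sample, k=n)) / n for _ in range(B)]
    mean = sum(ests) / B
    return (sum((e - mean) ** 2 for e in ests) / (B - 1)) ** 0.5

# Same illustrative sample as before: 8 per-row contributions, 16-row table
se = bootstrap_se([21, 7, 8, 0, 0, 0, 0, 0], N=16)
```

This is what "computationally intensive" refers to: B full re-estimations instead of one closed-form variance.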

Problem studied in this presentation
How to provide an accurate confidence interval together with an estimate?
– Boosting the accuracy level
– Reducing the interval width
Key idea: ensemble estimates
– Find multiple (unbiased) estimators for each OLAP query
– Linearly combine the individual estimators and derive the optimal coefficients to minimize the global variance
– Handle the correlation among the individual estimators

Example
A database describing student complaints:

Prof.   Semester   Complaints     Prof.   Semester   Complaints
Adams   Fa 02      3              Smith   Su 01      7
Jones   Fa 02      2              Smith   Sp 01      8
Adams   Sp 02      9              Adams   Fa 00      4
Jones   Sp 02      2              Smith   Fa 00      33
Smith   Sp 02      21             Smith   Su 00      16
Smith   Fa 01      36             Adams   Su 00      3
Jones   Su 01      1              Jones   Su 00      0
Adams   Su 01      2              Jones   Sp 99      1

Example
We sample the database… (same COMPLAINTS table as above; the original slide highlights the 8 sampled rows)

Example
And ask: How many complaints for Smith? (Same COMPLAINTS table as above, with the sampled rows highlighted.)
Est: (21 + 7 + 8)/8 × 16 = 72; Answer: 121

Why So Bad?
We missed two important records. (Same COMPLAINTS table as above; the original slide marks the two missed records: "Oops!")

How do we know something went wrong?
What if we know the total number of complaints in the entire table: SUM(NUM_COMPLAINTS)?
Compare it with the estimated total complaints for the entire table:
– Est: (sum of the 8 sampled values)/8 × 16 = 46/8 × 16 = 92; Answer: 148
One of the key ideas in the APA approach: pre-aggregation of low-dimensional aggregates
– 0-dimensional fact: SUM(NUM_COMPLAINTS) = 148
– 1-dimensional fact, for example, on SEMESTER: SELECT SUM(NUM_COMPLAINTS) FROM COMPLAINTS GROUP BY SEMESTER
– Or higher, depending on the cost of such pre-aggregation
In our example, we assume only the 0-dimensional fact is known!

How can we pull ourselves out?
APA uses Maximum Likelihood Estimation (MLE):
– Break the data space into 2^m quadrants based on the relational selection predicates
– Compute an aggregate estimate for each quadrant
– Characterize the error of the estimates using a normal PDF (justification: CLT)
– Pretend the estimates are independent
– Adjust the means to maximize the likelihood, subject to the known facts about the data
Shown to be very accurate on various datasets, significantly better than plain sampling and stratified sampling
In our example, the new estimate is … (the answer was 121; the original estimate was 72)
However, we lose analytic guarantees on accuracy!

Let us go back to plain sampling
For the query: How many complaints for Smith? (Same COMPLAINTS table as above.)
Est: (21 + 7 + 8)/8 × 16 = 72 (Answer: 121); the standard error (SE) is …
The 95% confidence interval: [ŷ − 1.96 SE(ŷ), ŷ + 1.96 SE(ŷ)]

New Estimator: The Negative One
To answer the query "How many complaints for Smith?" (Answer: 121), we first ask: How many complaints NOT for Smith? (Same COMPLAINTS table as above.)
Est: (sum of the sampled non-Smith values)/8 × 16 = 10/8 × 16 = 20
The negative estimator: 148 − 20 = 128; Standard Error (SE) = 13.4

How two is always better than one: The Ensemble Estimator
Linearly combine the direct (positive) estimator and the negative estimator:
– Est_new = α Est_direct + (1 − α) Est_negative  (0 ≤ α ≤ 1)
Note that since both the direct and the negative estimators are unbiased, the ensemble estimator is also unbiased.
Choose the parameter α to minimize the variance of the ensemble estimator
– The ensemble estimator is always at least as accurate as either individual estimator
If the individual estimators are independent, the optimal value of the parameter α is V(Est_negative)/(V(Est_direct) + V(Est_negative))
In our example: α = 0.0373, Est_new = 125.95, Standard Error (SE) = 13.1
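The combination above can be sketched as follows. The slide does not report SE(Est_direct); the value 68 below is an assumption chosen to be consistent with the reported α ≈ 0.0373 and SE(Est_negative) = 13.4, so the numbers are illustrative rather than taken from the talk.

```python
def combine_two(est_d, se_d, est_n, se_n):
    """Minimum-variance convex combination of two independent unbiased
    estimators; each weight is proportional to the inverse of that
    estimator's variance."""
    var_d, var_n = se_d ** 2, se_n ** 2
    alpha = var_n / (var_d + var_n)              # weight on the direct estimator
    est = alpha * est_d + (1 - alpha) * est_n
    se = (alpha ** 2 * var_d + (1 - alpha) ** 2 * var_n) ** 0.5
    return alpha, est, se

# est_direct = 72 and est_negative = 128 from the example; se_d = 68 is assumed
alpha, est, se = combine_two(72, 68, 128, 13.4)
```

With these inputs the result lands near the slide's α ≈ 0.0373, Est_new ≈ 126, SE ≈ 13.1.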

What if we have higher-dimensional facts?
Imagine we have the relational table EMPLOYEE(NAME, SEX, DEPARTMENT, JOB_TYPE, SALARY)
Query: SELECT SUM(SALARY) FROM EMPLOYEE WHERE SEX='M' AND DEPARTMENT='ACCOUNT' AND JOB_TYPE='SUPERVISOR'
Pre-aggregation: 1-dimensional facts

More negative estimators
SELECT SUM(SALARY) FROM EMPLOYEE WHERE SEX='M' AND DEPARTMENT='ACCOUNT' AND JOB_TYPE='SUPERVISOR'
[Venn diagram over the three selection predicates b1 (SEX), b2 (DEPARTMENT), b3 (JOB_TYPE). The query region is b1 ∧ b2 ∧ b3; the highlighted region is ¬b1 ∧ ¬b2 ∧ ¬b3, i.e., SEX ≠ 'M' AND DEPARTMENT ≠ 'ACCOUNT' AND JOB_TYPE ≠ 'SUPERVISOR'.]

More negative estimators
[Venn diagram as on the previous slide, for the same query; a further region, another of the 2^3 combinations of b1, b2, b3 and their negations (the negation bars were lost in transcription), is highlighted, giving an additional negative estimator.]

More negative estimators (cont'd)
[Venn diagram as above; yet another combination of b1, b2, b3 and their negations is highlighted, giving another negative estimator.]

More negative estimators (cont'd)
[Venn diagram as above; one more combination of b1, b2, b3 and their negations is highlighted, giving one more negative estimator.]

Combining Positive and Negative Estimators in APA1+
We will have multiple negative estimators:
– Est_new = α0 Est_direct + α1 Est_negative1 + α2 Est_negative2 + …
– 0 ≤ αi ≤ 1, α0 + α1 + α2 + … = 1
Decompose the negative estimators into their cell representations
– Each cell in the cube corresponds to a direct estimate
– The variance of each cell can be estimated
We can use Lagrange multipliers to optimize all the parameters αi
– We assume the direct estimates for the cells are independent
– This procedure usually involves solving a linear system
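A sketch of this optimization (the function name and covariance values are illustrative, not from the talk): minimizing the variance of the weighted sum subject to the weights summing to 1 via Lagrange multipliers gives the closed form w = C⁻¹1 / (1ᵀC⁻¹1), where C is the covariance matrix of the individual estimators, which reduces to inverse-variance weighting when they are independent.

```python
import numpy as np

def optimal_weights(cov):
    """Weights minimizing Var(sum_i w_i * est_i) subject to sum_i w_i = 1,
    where cov is the covariance matrix of the individual estimators.
    The Lagrange conditions reduce to a single linear system."""
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)   # C^{-1} 1
    return w / w.sum()               # normalize so the weights sum to 1

# Independent case: diagonal covariance, so weights ~ 1 / variance
# (illustrative variances for a direct and two negative estimators)
w = optimal_weights(np.diag([4624.0, 179.56, 400.0]))
```

This matches the slide's remark that the procedure usually involves a linear solver; with correlated estimators one simply passes the full (non-diagonal) covariance matrix.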

Actually, the estimators are correlated
Fortunately, we are able to capture this correlation analytically
If each individual estimator is approximately normal and they are independent, the combined estimator is also approximately normal
However, the correlation effect results in a slightly different distribution
– Analytically, it is very close to a spherically symmetric distribution, of which the normal distribution is a special case
– Empirically, it shows a strong tendency toward normality
We use the normal distribution to derive the confidence interval

Empirical Distribution of the Ensemble Estimators
[Figure: empirical distribution of APA0+; empirical distribution of APA1+]

Experimental Evaluation
Four datasets:
– Forest Cover data (from the UCI KDD archive)
– River Flow data
– William Shakespeare data
– Image Feature Vector data
Approximation techniques:
– Simple random sampling
– Stratified sampling
– APA0+
– APA1+
Queries: 2000 queries for each dataset

Measuring the estimated confidence intervals
We generate 95% confidence intervals with every estimation technique for each query.
Accuracy level:
– What are the real chances that the correct answers actually fall in the confidence intervals?
Interval width:
– How tight are the bounds of the confidence intervals?

How good are the new estimators?
Accuracy of the confidence intervals (expected: 95%)
– APA1+ averages around 90%, which is 23.2% higher than simple random sampling (the next best alternative in terms of accuracy)
– The accuracies of APA0+, random sampling, and stratified sampling are comparable, all less than 70% on average
Confidence interval width
– The width of the confidence interval produced by APA1+ is only half that of random sampling
– Compared with stratified sampling, APA1+ is at least 20% smaller
– The width of the confidence interval produced by APA0+ is around 15% smaller than random sampling's

Discussion
Overall, the new estimators work quite well!
– It's very simple!
– Significantly better than random sampling
– Significantly better than stratified sampling
– APA1+ is the only estimator whose confidence interval is close to the theoretically expected accuracy, and with a much smaller width!
– Suitable for both categorical and numerical data
– APA0+ and APA1+ are unaffected by high dimensionality!
Future work:
– How to apply this idea to more complicated aggregation functions?

Thanks!!

Roadmap
– Approximate Query Processing and Confidence Intervals
– Motivating Example
– Generalization and Handling Correlation
– Experimental Results
– Conclusions
Inspired by Chris's original APA approach (how to find multiple estimators) and by ensemble classifiers in statistical learning