New Sampling-Based Estimators for OLAP Queries Ruoming Jin, Kent State University Leo Glimcher, The Ohio State University Chris Jermaine, University of.

New Sampling-Based Estimators for OLAP Queries Ruoming Jin, Kent State University Leo Glimcher, The Ohio State University Chris Jermaine, University of Florida Gagan Agrawal, The Ohio State University

Approximate Query Processing AQP is an active area of DM research The goal is to provide accurate estimation of queries without access the entire databases Especially useful and important for data warehouse and OLAP Consider you have a total of 10,000 disks, each with 200GB (2PB) Takes 1 hour to scan Answering a single, simple aggregate query may need an hour –Unacceptable to analysts/end-users If each disk cost $1000 year to maintain One simple query can cost –$1572=10,000  $1000/ (365  24) –inhibitive cost

OLAP Queries Querying the Large Relational Tables composed of Dimensional Attributes –Categorical data (Most) –Sex, Country, State, City, Product Code, Department, Color, … Measure Attributes – Numerical data –Salary, Sales, Price, Number of Complaints, … Aggregate Queries Most AQP tailored to numerical data Wavelets, kernels, histograms Problematic for categorical data and high-dimensionality Random Sampling Well studied in statistical theory Can handle high-dimension category data Provide estimates of the query results as well as the estimate accuracy

Confidence Interval The measure for accuracy COMPLAINTS(PROF, SEMESTER, NUM_COMPLAINTS) SELECT SUM (NUM_COMPLAINTS) FROM COMPLAINS WHERE PROF = ‘Smith’ AND SEMESTER = ‘Fa03’ A Confidence Bound: –With a probability of.95, Prof. Smith received 27 to 29 complaints in the Fall of 2003 Accuracy levelInterval width =2

How to estimate the confidence interval? Uniform Sampling Central limit theorem (CLT) Delta Methods Assuming the distribution of an estimator ŷ of an aggregate query result y is approximately normal with mean E(ŷ), and variance V(ŷ) for a large sample, an approximate 95% confidence interval for the estimator is given by [ŷ-1.96SE(ŷ), ŷ+1.96SE(ŷ)] where 1.96 is the 0.975 th percentile of the standard normal distribution, and SE(ŷ) is the standard error (the square root of the variance V(ŷ) ). Accuracy levelInterval width = 3.92SE(ŷ)

How to (cont’d) Unequal Probability Sampling Stratified Sampling Separate Samples for Each Measure (Numerical) Attribute Re-Sampling Bootstrapping Computational Intensive Distribution-free –Chebyshev and Hoeffding bound –Loose bound

Problem studied in this presentation How to provide an accurate confidence interval together with an estimation? Boosting the accuracy level Reducing the interval width Key idea: Ensemble Estimates Find multiple (unbiased) estimators for each OLAP query Linearly combine the individual estimators and derive the optimal coefficients to reduce the global variance Handle the correlation among the individual estimators

Example Database describing student complaints Prof.SemesterComplaintsProf.SemesterComplaints AdamsFa 023SmithSu 017 JonesFa 022SmithSp 018 AdamsSp 029AdamsFa 004 JonesSp 022SmithFa 0033 SmithSp 0221SmithSu 0016 SmithFa 0136AdamsSu 003 JonesSu 011JonesSu 000 AdamsSu 012JonesSp 991

Example We sample the database… Prof.SemesterComplaintsProf.SemesterComplaints AdamsFa 023SmithSu 017 JonesFa 022SmithSp 018 AdamsSp 029AdamsFa 004 JonesSp 022SmithFa 0033 SmithSp 0221SmithSu 0016 SmithFa 0136AdamsSu 003 JonesSu 011JonesSu 000 AdamsSu 012JonesSp 991

Example And ask: How many complaints for Smith? Prof.SemesterComplaintsProf.SemesterComplaints AdamsFa 023SmithSu 017 JonesFa 022SmithSp 018 AdamsSp 029AdamsFa 004 JonesSp 022SmithFa 0033 SmithSp 0221SmithSu 0016 SmithFa 0136AdamsSu 003 JonesSu 011JonesSu 000 AdamsSu 012JonesSp 991 Est: (21+7+8)/8×16=72; Answer: 121

Why So Bad? We missed two important records Prof.SemesterComplaintsProf.SemesterComplaints AdamsFa 023SmithSu 017 JonesFa 022SmithSp 018 AdamsSp 029AdamsFa 004 JonesSp 022SmithFa 0033 SmithSp 0221SmithSu 0016 SmithFa 0136AdamsSu 003 JonesSu 011JonesSu 000 AdamsSu 012JonesSp 991 Oops!

How we know something went wrong? What if we know the total complaints of the entire table: SUM(NUM_COMPLAINTS) Compare with the estimated total complaints of the entire table Est: (2+21+1+7+8+4+3+0)/8 × 16 = 92, Answer: 148 One of the key ideas in the APA approach Pre-aggregation of the low-dimensional aggregates 0-dimensional fact: SUM(NUM_COMPLAINTS) =148 1-dimensional fact, for example, on SEMESTER SELECT SUM(NUM_COMPLAINTS) FROM COMPLAINTS GROUP-BY SEMESTER Or higher, depending on the cost of such pre-aggregation In our example, assuming only the 0-dimensional fact is know!

How we can pull ourselves out? APA use Maximal Likelihood Estimation (MLE) Break data space based on relational selection predicates; 2 m Quadrants Compute aggregate for each quadrant Characterize the error of the estimates using normal PDF (justification: CLT) Pretend estimates are independent Adjust the means to max likelihood Subject to known facts about the data Shows to be very accurate in various datasets, significantly better than plain sampling and stratified sampling In our example, the New Estimation is 136.3 (answer was 121, the original estimation is 72) However, loss of analytic guarantees on accuracy!

Let us go back to the plain sampling For the query: How many complaints for Smith? Prof.SemesterComplaintsProf.SemesterComplaints AdamsFa 023SmithSu 017 JonesFa 022SmithSp 018 AdamsSp 029AdamsFa 004 JonesSp 022SmithFa 0033 SmithSp 0221SmithSu 0016 SmithFa 0136AdamsSu 003 JonesSu 011JonesSu 000 AdamsSu 012JonesSp 991 Est: (21+7+8)/8×16=72 (Answer: 121); The standard error (SE) is 68. 2 [ŷ-1.96SE(ŷ), ŷ+1.96SE(ŷ)]

New Estimator: The Negative One To answer the query: How many complaints for Smith? (Answer:121) We first ask: How many complaints NOT for Smith? Prof.SemesterComplaintsProf.SemesterComplaints AdamsFa 023SmithSu 017 JonesFa 022SmithSp 018 AdamsSp 029AdamsFa 004 JonesSp 022SmithFa 0033 SmithSp 0221SmithSu 0016 SmithFa 0136AdamsSu 003 JonesSu 011JonesSu 000 AdamsSu 012JonesSp 991 Est: (2+1+4+3+0)/8×16=20, The Negative Estimator: 148-20=128, Standard Error (SE) = 13.4

How two is always better than one: The Ensemble Estimator Linearly combining the direct (positive) estimator and the negative estimator Est new = α Est direct + (1- α ) Est negative (  0  α  1) Note since both the direct estimators and negative estimators are unbiased estimators, the ensemble estimator is also unbiased. Choose the parameter α to minimize the variance the ensemble estimator The ensemble estimator always is always more accurate If the individual estimators are independent, the optimal value of the parameter α is V(Est direct )/(V(Est direct )+V(Est negative )) In our example, α=0.0373, Est new =125.95, Standard Error (SE) = 13.1

What if we have higher-dimensional facts? Image we have the relational table EMPLOYEE(NAME, SEX,DEPARTMENT,JOB_TYPE, SALARY) Query: SELECT SUM (SALARY) FROM EMPLOYEE WHERE SEX=‘M’ AND DEPARTMENT=‘ACCOUNT’ AND JOB_TYPE=‘SUPERVISOR’ Pre-Aggregation 1-dimesional facts

More negative estimators SELECT SUM (SALARY) FROM EMPLOYEE WHERE SEX=‘M’ AND DEPARTMENT=‘ACCOUNT’ AND JOB_TYPE=‘SUPERVISOR’ b1 b2 b3 SEX JOB_TYPE DEPARTMENT b1^b2^b3 b1^b2^b3, or SEX  ‘M’ AND DEPARTMENT  ‘ACCOUNT’ AND JOB_TYPE  ‘SUPERVISOR’

More negative estimators SELECT SUM (SALARY) FROM EMPLOYEE WHERE SEX=‘M’ AND DEPARTMENT=‘ACCOUNT’ AND JOB_TYPE=‘SUPERVISOR’ b1 b2 b3 SEX JOB_TYPE DEPARTMENT b1^b2^b3 b1^b2^b3, or SEX  ‘M’ AND DEPARTMENT  ‘ACCOUNT’ AND JOB_TYPE  ‘SUPERVISOR’ b1^b2^b3

More negative estimators (cont’d) SELECT SUM (SALARY) FROM EMPLOYEE WHERE SEX=‘M’ AND DEPARTMENT=‘ACCOUNT’ AND JOB_TYPE=‘SUPERVISOR’ b1 b2 b3 SEX JOB_TYPE DEPARTMENT b1^b2^b3 b1^b2^b3, or SEX  ‘M’ AND DEPARTMENT  ‘ACCOUNT’ AND JOB_TYPE  ‘SUPERVISOR’ b1^b2^b3

Combining Positive and Negative Estimators in APA1+ We will have multiple negative estimators Est new = α0 Est direct + α1 Est negative1 + α2 Est negative2 +… 0  αi  1, α0+ α1+ α2+… = 1 Decompose the negative estimators into the cell representations Each cell in the cube correspond to a direct estimation The variance of the cell can be estimated We can use Lagrange multipliers to optimize all the parameters (αi) We assume the direct estimations for each cell is independent This procedure usually involve a linear solver for a linear equation

Actually, the estimators are correlated Fortunately, we are able to capture such correlation analytically If each individual estimator is approximately normal, and they are independent, the combined estimator is also approximately normal However, the correction effect results in a slightly different distribution Analytically very close to the spherically symmetric distribution, of which normal distribution is a special case. Empirically, it shows strong tendency to normal We use normal distribution to derive the confidence interval

Empirical Distribution of the Ensemble Estimators Empirical distribution of APA0+ Empirical distribution of APA1+

Experimental Evaluation Four datasets Forest Cover data (from UCI KDD archive) River Flow data William Shakespeare data Image Feature vector Approximation techniques Simple Random Sampling Stratified Sampling APA0+ APA1+ Queries 2000 queries for each dataset

Measure the estimated confidence interval We generate 95% confidence intervals of all estimation techniques for each query Accuracy level What are the real chances the correct answers actually fall in the confidence intervals? Interval width How tight are the bounds of the confidence intervals?

How good are the new estimators? Accuracy of the confidence intervals (Expected: 95%) APA1+ average around 90%, which was 23.2% higher than simple random sampling (the next best alternative in terms of accuracy) –The accuracy of APA0+, random sampling, and stratified sample are comparable, all less than 70% in average Confidence interval width The width of the confidence interval produced by APA1+ is only 1/2 the size of one from random sampling Compared with stratified sampling, APA1+ is at least 20% smaller The width of the confidence interval produced by APA0+ is around 15% smaller than random sampling

Discussion Overall, the new estimators work pretty well! It’s very simple! Significantly better than the random sampling Significantly better than the stratified sampling APA1+ is the only estimator which provides the confidence interval close to the theoretically expected accuracy and with much smaller width! Suitable for both categorical, numerical data APA0+, and APA1 unaffected by high dimensions! Future work How to apply this idea to work with more complicated aggregation functions?

Thanks!!

Roadmap Approximate Query Processing and Confidence Interval Motivating Example Generalization and Handling Correlation Experimental Results Conclusions Inspired by Chris’s original APA approach (how to find multiple estimators) Ensemble Classifiers in Statistical Learning

New Sampling-Based Estimators for OLAP Queries Ruoming Jin, Kent State University Leo Glimcher, The Ohio State University Chris Jermaine, University of.

Similar presentations

Presentation on theme: "New Sampling-Based Estimators for OLAP Queries Ruoming Jin, Kent State University Leo Glimcher, The Ohio State University Chris Jermaine, University of."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

New Sampling-Based Estimators for OLAP Queries Ruoming Jin, Kent State University Leo Glimcher, The Ohio State University Chris Jermaine, University of.

Similar presentations

Presentation on theme: "New Sampling-Based Estimators for OLAP Queries Ruoming Jin, Kent State University Leo Glimcher, The Ohio State University Chris Jermaine, University of."— Presentation transcript:

Similar presentations

About project

Feedback