Reading Report 6 Yin Chen 5 Mar 2004

Reference: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. Surajit Chaudhuri, Gautam Das, Vivek Narasayya, Microsoft Research. http://portal.acm.org/citation.cfm?id=375694&dl=ACM&coll=portal

Problem Overview

Decision support applications such as On-Line Analytical Processing (OLAP) and data mining, which analyze LARGE databases, have become popular, but they are expensive and resource intensive. This work answers queries from precomputed samples of the data instead of the complete data, giving approximate answers efficiently.

3 drawbacks of previous works:
- Lack of a formal problem formulation, which makes them difficult to evaluate theoretically
- Do NOT deal with uncertainty in the incoming queries
- Ignore the variance in the data distribution of the aggregated column(s)

Related Work

Some works are based on randomized techniques, assume a fixed workload, and do NOT cope with uncertainty:
- Each record is tagged with a frequency. An expected number of k records are selected in the sample, where the probability of selecting a record t with frequency ft is k*(ft/Σufu). Records that are accessed more frequently have a greater chance of being included in the sample (sketched in the code below).
- BUT this can give poor quality. E.g., consider a set of queries in which a few queries reference large partitions and most queries reference very small partitions. Under the weighted sampling scheme, most sample records will come from the large partitions. Thus, with high probability, there will be no records selected from many of the small partitions, causing large error.
- An IMPROVEMENT: collect the outliers of the data (i.e., the records that contribute to high variance) into a separate index, while the remaining data is sampled using a weighted sampling technique. Queries are answered by running them against both the outlier index and the weighted sample, and an estimated answer is composed from both results.

Some other works use on-the-fly sampling, but that can be expensive at query time.
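The weighted scheme above can be sketched in a few lines. This is a minimal illustration, not code from the paper; deciding each record's inclusion with an independent coin flip (capped at probability 1) is an assumption of the sketch.

```python
import random

def weighted_sample(records, freqs, k):
    """Weighted sampling sketch: record t (with workload frequency f_t) is
    included with probability k * f_t / sum_u f_u, so frequently accessed
    records are more likely to enter the sample. The expected sample size
    is roughly k; per-record probabilities are capped at 1."""
    total = sum(freqs)
    return [t for t, f in zip(records, freqs)
            if random.random() < min(1.0, k * f / total)]
```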

Architecture

Workload:
- A workload W is specified as a set of pairs of queries and their corresponding weights, i.e., W = {<Q1,w1>, …, <Qq,wq>}.
- Weight wi indicates the importance of query Qi in the workload. Without loss of generality, assume the weights are normalized, i.e., Σi wi = 1.

Architecture:
- Inputs: a database and a workload W.
- 2 components:
  - An offline component for selecting a sample.
  - An online component that rewrites an incoming query to use the sample to answer the query approximately, and reports the answer with an estimate of the error in the answer.

ScaleFactor:
- Each record in the sample contains an additional column, ScaleFactor. The value of the aggregated column of each record in the sample is first scaled up by multiplying with the ScaleFactor, and then aggregated (see the sketch below).
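A minimal sketch of how the online component uses ScaleFactor, assuming sample records are Python dicts and the query's selection condition is a callable (both representations are assumptions of this illustration):

```python
def approx_count(sample, predicate):
    """COUNT: sum the ScaleFactor of every sample record the query selects."""
    return sum(rec["ScaleFactor"] for rec in sample if predicate(rec))

def approx_sum(sample, predicate, agg_col):
    """SUM(agg_col): scale each selected record's aggregate value up by its
    ScaleFactor, then add the scaled values."""
    return sum(rec[agg_col] * rec["ScaleFactor"]
               for rec in sample if predicate(rec))
```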

Architecture (Cont.)

Error metrics are used to determine the quality of an approximate answer to an aggregation query. Suppose the correct answer for a query Q is y while the approximate answer is y'.
- Relative error: E(Q) = |y - y'| / y
- Squared error: SE(Q) = (|y - y'| / y)²
- Squared error in answering a GROUP BY query Q with g groups, where the correct answer for the ith group is yi while the approximate answer is yi': SE(Q) = (1/g) Σi ((yi - yi') / yi)²
- Given a probability distribution pw of queries, the mean squared error for the distribution is MSE(pw) = ΣQ pw(Q) * SE(Q), where pw(Q) is the probability of query Q.
- Root mean squared error: RMSE(pw) = sqrt(MSE(pw))

Other error metrics:
- L1 metric: the mean error over all queries in the workload
- L∞ metric: the maximum error over all queries
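These metrics translate directly into code. A small sketch (how queries and answers are represented is hypothetical):

```python
import math

def squared_error(y, y_approx):
    """SE(Q) = (|y - y'| / y)^2 for a single-aggregate query."""
    return ((y - y_approx) / y) ** 2

def groupby_squared_error(ys, ys_approx):
    """SE(Q) = (1/g) * sum_i ((y_i - y_i') / y_i)^2 for a GROUP BY query."""
    g = len(ys)
    return sum(((y - ya) / y) ** 2 for y, ya in zip(ys, ys_approx)) / g

def rmse(errors_and_probs):
    """RMSE(p_w) = sqrt(sum_Q p_w(Q) * SE(Q)), given (SE, probability) pairs."""
    return math.sqrt(sum(p * se for se, p in errors_and_probs))
```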

The Special Case of a Fixed Workload

Overview:
- This section provides a solution for the special case of a fixed workload, i.e., when the incoming queries are identical to the given workload.
- It uses an effective deterministic scheme rather than the conventional randomized schemes.

Problem Formulation:
- Problem: FIXEDSAMP
- Input: R, W, k
- Output: A sample of k records (with appropriate additional columns) such that MSE(W) is minimized, where MSE(W) = MSE(pw) with a query Q having probability of occurrence 1 if Q ∈ W and 0 otherwise.

Fundamental Regions:
- For a given relation R and workload W, consider partitioning the records in R into a minimum number of regions F = {R1, R2, …, Rr} such that for any region Rj, each query in W selects either all records in Rj or none. These regions are the fundamental regions of R induced by W.
- E.g., consider a relation R (with aggregate column C) containing nine records (with C values 10, 20, …, 90). Let W consist of two queries, Q1 (which selects records with C values between 10 and 50) and Q2 (which selects records with C values between 40 and 70). These two queries induce a partition of R into four fundamental regions, R1, R2, R3, R4 (computed in the sketch below).
- In general, the total number of fundamental regions r depends on R and W and is upper-bounded by min(2^|W|, n), where n is the number of records in R.
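Fundamental regions can be computed by grouping records on their query-membership signature: records with the same signature are selected by exactly the same queries. A sketch using the nine-record example above:

```python
from collections import defaultdict

records = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # aggregate column C
queries = [lambda c: 10 <= c <= 50,              # Q1
           lambda c: 40 <= c <= 70]              # Q2

regions = defaultdict(list)
for c in records:
    # The signature records which workload queries select this record.
    signature = tuple(q(c) for q in queries)
    regions[signature].append(c)

# Four fundamental regions: {10,20,30}, {40,50}, {60,70}, {80,90}
for sig, recs in regions.items():
    print(sig, recs)
```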

The Special Case of a Fixed Workload (Cont.)

Solution for FIXEDSAMP: a deterministic algorithm called FIXED.

Step 1 (Identify Fundamental Regions): Let r be the number of fundamental regions.

Case A: r ≤ k (the selected sample can answer the workload queries WITHOUT any error)
- Step 2A (Pick Sample Records): Pick exactly one record from each fundamental region.
- Step 3A (Assign Values to Additional Columns): The idea is that each sample record can be used to "summarize" all records from the corresponding fundamental region, WITHOUT incurring any error.
  - For a workload consisting of ONLY COUNT queries, store the count of records in that fundamental region in a SINGLE additional column (called RegionCount).
  - For a workload consisting of ONLY SUM queries, store the sum of the aggregate column values for records in that fundamental region in the SINGLE additional column (called AggSum).
  - For a workload containing a MIX of COUNT and SUM queries, BOTH the RegionCount and AggSum columns are needed; together they can also answer AVG queries.

Case B: r > k (select a sample that tries to minimize the errors in queries)
- Step 2B (Pick Sample Records): Sort all r regions by their importance, select the TOP k, and then pick one record from each of the selected regions (see the sketch below). The importance of region Rj is defined as fj * nj², where fj is the sum of the weights of all queries in W that select the region and nj is the number of records in the region. fj measures the weights of queries that are affected by Rj, while nj² measures the effect on the (squared) error of not including Rj.
- Step 3B (Assign Values to Additional Columns): Note that the extra column values of a sample record are NOT required to characterize the corresponding fundamental region; all that matters is that they contain values that minimize the error for the workload. To assign values to the RegionCount and AggSum columns of the k selected sample records, express MSE(W) as a quadratic function of 2*k unknowns, {RC1,…,RCk} and {AS1,…,ASk}, partially differentiate with respect to each variable, and set each result to zero. This gives 2*k simultaneous (sparse) linear equations, which can be solved using an iterative technique.
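Step 2B reduces to a ranking by fj * nj². A sketch (the dict-based region and weight representations are assumptions of this illustration; FIXED picks one record from each chosen region, here simply the first):

```python
def pick_top_k_regions(regions, f, k):
    """Case B of FIXED, step 2B: rank fundamental regions by importance
    f_j * n_j**2 and keep one representative record from each of the top k.
    `regions[j]` is the list of records in region j; `f[j]` is the summed
    weight of workload queries that select region j."""
    ranked = sorted(regions, key=lambda j: f[j] * len(regions[j]) ** 2,
                    reverse=True)
    return {j: regions[j][0] for j in ranked[:k]}  # one record per region
```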

The Non-Special Case with a Lifted Workload

Overview:
- This section provides a solution for the non-special case, in which incoming queries are "similar" but NOT identical to the workload queries.
- The focus is on SINGLE-TABLE selection queries with aggregation containing either the SUM or COUNT aggregate. A workload W consists of exactly ONE query Q on relation R.

Problem Formulation:
- Problem: SAMP
- Input: R, pw (a probability distribution function specified by W), and k
- Output: A sample of k records (with the appropriate additional column(s)) such that MSE(pw) is minimized.

Lifted workload:
- For a given W, define a lifted workload pw, i.e., a probability distribution of incoming queries. Intuitively, for any query Q' (not necessarily in W), pw(Q') should reflect the similarity (dissimilarity) of Q' to the workload: high if Q' is similar to queries in the workload, and low otherwise. Two queries Q' and Q are similar if the sets of records they select have significant overlap.
- The objective is to define the distribution p{Q}. Since for the purposes of lifting we are only concerned with the set of records a query selects and NOT the query itself, p{Q} maps subsets of R to probabilities instead of mapping queries to probabilities.

The Non-Special Case with a Lifted Workload

Lifted workload (Cont.):
- Assume two parameters δ (½ ≤ δ ≤ 1) and γ (0 ≤ γ ≤ ½). These parameters define the degree to which the workload "influences" the query distribution. For any given record inside (resp. outside) RQ, the parameter δ (resp. γ) represents the probability that an incoming query will select this record.
- For all R' ⊆ R, p{Q}(R') is the probability of occurrence of any query that selects exactly the set of records R', where RQ is the set of records selected by Q and n1, n2, n3, n4 are the counts of records in the four regions induced by RQ and R' (see the sketch below).
- When n2 or n4 is large (i.e., the overlap is large), p{Q}(R') is high, i.e., queries that largely select RQ are likely to occur. When n1 or n3 is large (i.e., the overlap is small), p{Q}(R') is low, i.e., queries that select little of RQ are unlikely to occur.
- Setting the parameters δ and γ:
  - δ → 1 and γ → 0: incoming queries are identical to workload queries.
  - δ → 1 and γ → ½: incoming queries are supersets of workload queries.
  - δ → ½ and γ → 0: incoming queries are subsets of workload queries.
  - δ → ½ and γ → ½: incoming queries are unrestricted.
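If, as the per-record probabilities δ and γ suggest, records are assumed to be selected independently, p{Q}(R') factorizes over the four regions. This factorized form is inferred from the text above (the slides' own formula is not reproduced in the transcript), so treat it as a sketch of the model:

```python
def p_Q(R, R_Q, R_prime, delta, gamma):
    """Probability that an incoming query selects exactly R_prime, assuming
    each record inside R_Q is selected with probability delta and each record
    outside R_Q with probability gamma, independently. Inputs are sets."""
    n1 = len(R_Q - R_prime)        # inside R_Q but not selected
    n2 = len(R_Q & R_prime)        # the overlap with R_Q
    n3 = len(R_prime - R_Q)        # selected but outside R_Q
    n4 = len(R - (R_Q | R_prime))  # outside both
    return delta**n2 * (1 - delta)**n1 * gamma**n3 * (1 - gamma)**n4
```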

The Non-Special Case with a Lifted Workload

Stratified sampling:
- Stratified sampling is a well-known generalization of uniform sampling in which a population is partitioned into multiple strata and samples are selected uniformly from each stratum, with "important" strata contributing relatively more samples.
- Define the population of a query Q (denoted POPQ) on a relation R as a set of size |R| that contains, for each record, the value of the aggregated column if the record is selected by Q, or 0 otherwise.
- A stratified sampling scheme partitions R into r strata containing n1,…,nr records (where Σnj = n), with k1,…,kr records uniformly sampled from each stratum (where Σkj = k). The scheme also associates a ScaleFactor with each record in the sample.
- Queries are answered by executing them on the sample instead of R. For a COUNT query, the ScaleFactor entries of the selected records are summed, while for a SUM(y) query the expression y*ScaleFactor is summed. If we also wish to return an error guarantee with each query, then in addition to ScaleFactor we have to keep track of each nj and kj individually for each stratum.

Solution for STRAT: 3 steps (the sampling step is sketched below):
- Step one (stratification step): determine (a) how many strata r to partition relation R into, and (b) the records of R that belong to each stratum. At the end of this step, we have r strata R1,…,Rr containing n1,…,nr records such that Σnj = n.
- Step two (allocation step): determine how to divide k (the number of records available for the sample) into integers k1,…,kr across the r strata such that Σkj = k.
- Step three (sampling step): uniformly sample kj records from stratum Rj to form the final sample of k records.
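A sketch of the sampling step, attaching ScaleFactor = nj / kj so that per-stratum uniform samples scale back up to the full relation. That formula is the standard unbiased choice for uniform per-stratum sampling and is an assumption of this sketch rather than a formula quoted from the slides:

```python
import random

def strat_sample(strata, allocation):
    """Sampling step of STRAT: draw k_j records uniformly from each stratum
    R_j and attach ScaleFactor = n_j / k_j to every sampled record.
    `strata` is a list of record lists; `allocation` holds the k_j's."""
    sample = []
    for stratum, k_j in zip(strata, allocation):
        for rec in random.sample(stratum, k_j):
            sample.append((rec, len(stratum) / k_j))  # (record, ScaleFactor)
    return sample
```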

The Non-Special Case with a Lifted Workload

Solution for STRAT (Cont.): the COUNT Aggregate

Stratification Step:
- Lemma 1: Consider a relation R with n records and a workload W of COUNT queries. In the limit as n tends to infinity, the fundamental regions F = {R1, R2, …, Rr} represent an optimal stratification.

Allocation Step:
- (1) Express MSE(pw) as a weighted sum of the MSE of each query in the workload. Lemma 2: MSE(pw) = Σi wi MSE(p{Qi}).
- (2) For any Q ∈ W, express MSE(p{Q}) as a function of the kj's. Lemma 3 defines a closed-form approximation ApproxMSE(p{Q}) whose ratio to MSE(p{Q}) tends to 1 as n grows (the formula appears only as an image in the original slides and is not reproduced in this transcript). Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pw) as a function of the variables kj.
- Corollary 1: MSE(pw) = Σj (αj / kj), where each αj is a function of n1,…,nr, δ, and γ. αj captures the "importance" of a region; it is positively correlated with nj as well as with the frequency of queries in the workload that access Rj.
- (3) Minimize MSE(pw). Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k when kj = k * (sqrt(αj) / Σi sqrt(αi)). Lemma 4 provides a closed-form, computationally inexpensive solution to the allocation problem, since αj depends only on δ, γ, and the number of tuples in each fundamental region (see the sketch below).
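Lemma 4's closed form is cheap to compute. A sketch (a real implementation would round the fractional kj's to integers summing to k):

```python
import math

def allocate(alphas, k):
    """Allocation step (Lemma 4): k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)
    minimizes sum_j alpha_j / k_j subject to sum_j k_j = k."""
    root_sum = sum(math.sqrt(a) for a in alphas)
    return [k * math.sqrt(a) / root_sum for a in alphas]

# e.g., allocate([4.0, 1.0], 30) -> [20.0, 10.0]
```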

The Non-Special Case with a Lifted Workload

Solution for STRAT (Cont.): the SUM Aggregate

Stratification Step:
- Since each stratum may have LARGE internal variance in the values of the aggregate column, we CANNOT use the same stratification as in the COUNT case (i.e., strata = fundamental regions).
- Instead, divide fundamental regions with large variance into a set of finer regions, each of which has significantly lower internal variance, and treat these finer regions as the strata. Within a new stratum, the aggregate column values of the records are CLOSE to one another.
- Borrowing from the statistics literature an approximation of the optimal Neyman allocation technique for minimizing variance, each fundamental region is divided into h finer regions, generating a total of h*r finer regions, which become the strata (h was set to 6; a simple stand-in for this split is sketched below).

Allocation Step:
- (1) As with COUNT, this is expressed as an optimization problem, now with h*r unknowns k1,…,kh*r.
- (2) Unlike COUNT, the specific values of the aggregate column in each region influence MSE(p{Q}). Let yj (resp. Yj) be the average (resp. sum) of the aggregate column values of all records in region Rj. Since the variance within each region is SMALL (due to stratification), each value within the region can be approximated by yj; this lets MSE(p{Q}) be expressed as a function of the kj's for a SUM query Q in W (the formula appears only as an image in the original slides). As with COUNT, MSE(pw) for SUM is functionally of the form Σj (αj / kj), where αj depends on the parameters n1,…,nh*r, δ, and γ (see Corollary 1).
- (3) The minimization step is the same as in Lemma 4.
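A simple stand-in for the finer-region split: sort each fundamental region by the aggregate column and cut it into at most h nearly equal chunks, so values within a chunk are close together. This equi-depth split is an assumption of the sketch, not the paper's exact approximation of Neyman allocation:

```python
import math

def split_region(region_values, h=6):
    """Split one fundamental region into at most h finer regions of nearly
    equal size after sorting, so each chunk has low internal variance."""
    values = sorted(region_values)
    size = max(1, math.ceil(len(values) / h))
    return [values[i:i + size] for i in range(0, len(values), size)]
```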