A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries
Written by Surajit Chaudhuri, Gautam Das, Vivek Narasayya (Microsoft Research, Washington)
Presented by Melissa J. Fernandes

The goal of the papers we have studied so far is to answer aggregation queries approximately, accurately, and efficiently. Main idea of this paper: tailor the choice of samples to be robust for workloads that are similar, but not necessarily identical, to the given workload W.

Two ways of analyzing large databases for data mining:
1) OLAP: Online Analytical Processing is an approach to quickly answer multi-dimensional analytical queries using OLAP cubes. An OLAP cube is a data structure that allows fast analysis of data, e.g., product data by city, by type, or by product type. Drawbacks: expensive and resource intensive. (Wikipedia)

2) Pre-computed samples: this method gives approximate answers very efficiently, but the answers may have larger errors, because finding samples that capture the large variance in the data in advance is almost impossible.
3) Using workloads. What is a workload? A set of Transact-SQL statements that execute against a database or databases that have to be tuned (MSDN). Tuned: optimized for database performance.

How is the sample set generated from the workload? In practice, queries tend to fall into particular patterns (e.g., queries about the state of Texas). The queries in the workload are run on the entire database R, and a column is added to the records stating whether each record was selected by the queries (tagging). A record selected by many queries has a higher probability of being in the sample. In this way the workload is used to generate the sample set; a minimal sketch of the idea follows.
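The following is a minimal Python sketch of this tagging idea, not the paper's exact procedure: each record is weighted by how many workload queries select it, and the sample is drawn with probability proportional to that weight. The records and predicates are hypothetical, and sampling here is with replacement for simplicity.

```python
import random

def weighted_sample(records, workload_predicates, k, seed=0):
    """Pick k records with probability proportional to how many
    workload queries select each record (sampling with replacement)."""
    rng = random.Random(seed)
    # Tagging step: count, for every record, how many queries select it.
    weights = [sum(pred(rec) for pred in workload_predicates) for rec in records]
    # Give never-selected records a small weight so they can still appear.
    weights = [w if w > 0 else 0.01 for w in weights]
    return rng.choices(records, weights=weights, k=k)

# Hypothetical data: a workload focused on Texas and on large sales.
records = [{"state": "TX", "sales": 10}, {"state": "TX", "sales": 25},
           {"state": "WA", "sales": 40}, {"state": "NY", "sales": 15}]
workload = [lambda r: r["state"] == "TX", lambda r: r["sales"] > 20]
print(weighted_sample(records, workload, k=2))
```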

ICICLES: a new class of samples that tune themselves to a dynamic workload.
Outlier indexing: identify the tuples with outlier values and store them in a separate relation, the outlier table T1; the remaining tuples (with no outlier values) form table T2. Then: 1) run the query on T1; 2) run it on a uniform random sample (URSAMP) of T2; 3) estimate the true result on T2; 4) combine the two results using the methods in the paper.

Drawbacks of previous studies:
1) Although they have intuitive appeal, they lack a rigorous problem formulation.
2) They do not deal with uncertainty in workloads.
3) They ignore the variance in the data distribution of the aggregate column.

Architecture for approximate query processing: an offline component selects samples from the relation R; at query time, a) the queries are rewritten to use the sample to answer the query approximately, and b) the answers are reported along with error estimates.

The offline component selects samples from R. 1) Each sampled record has a ScaleFactor column (stored with the record or in a separate relation). 2) The value of the aggregate column of each record is scaled up by multiplying it by the scale factor and then aggregated, as sketched below.
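A minimal sketch of how a ScaleFactor column is used when aggregating over a sample; the sample rows and factors below are hypothetical.

```python
# Hypothetical sample rows; ScaleFactor says how many records of R each row stands for.
sample = [
    {"sales": 10, "scale_factor": 3.0},
    {"sales": 40, "scale_factor": 2.0},
    {"sales": 70, "scale_factor": 2.0},
]

def approx_sum(rows, col):
    # Scale each aggregate value up by its factor, then aggregate.
    return sum(r[col] * r["scale_factor"] for r in rows)

def approx_count(rows):
    return sum(r["scale_factor"] for r in rows)

print(approx_sum(sample, "sales"))                          # estimated SUM(sales) = 250
print(approx_count(sample))                                 # estimated COUNT(*)  = 7
print(approx_sum(sample, "sales") / approx_count(sample))   # estimated AVG(sales)
```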

Error metrics (y = correct answer, y' = approximate answer):
Relative error: E(Q) = |y - y'| / |y|
Squared relative error: SE(Q) = (y - y')^2 / y^2
For a GROUP BY query with g groups, where y_i is the correct answer for the i-th group: SE(Q) = (1/g) * Σ_i (y_i - y_i')^2 / y_i^2
A GROUP BY query with g groups is treated as g SELECT queries with weight 1/g each.
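A small sketch of these error metrics, using hypothetical correct/approximate answer pairs:

```python
def relative_error(y, y_approx):
    return abs(y - y_approx) / abs(y)

def squared_relative_error(y, y_approx):
    return (y - y_approx) ** 2 / y ** 2

def group_by_se(groups):
    """SE(Q) for a GROUP BY query: average squared relative error over its g groups."""
    g = len(groups)
    return sum(squared_relative_error(y, ya) for y, ya in groups) / g

print(relative_error(100, 90))                       # 0.1
print(squared_relative_error(100, 90))               # 0.01
print(group_by_se([(100, 90), (50, 55), (20, 20)]))  # mean of 0.01, 0.01, 0.0
```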

p_W = probability distribution over queries; p_W(Q) = probability that query Q is posed. Mean squared error: MSE(p_W) = Σ_Q p_W(Q) * SE(Q). Root mean squared error: RMSE(p_W) = sqrt(MSE(p_W)).
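A small sketch of the workload-level error, with hypothetical query probabilities and per-query squared errors:

```python
import math

def mse(lifted_workload):
    """lifted_workload: list of (p_W(Q), SE(Q)) pairs."""
    return sum(p * se for p, se in lifted_workload)

def rmse(lifted_workload):
    return math.sqrt(mse(lifted_workload))

queries = [(0.5, 0.04), (0.3, 0.01), (0.2, 0.09)]   # hypothetical probabilities and errors
print(mse(queries))    # 0.041
print(rmse(queries))   # ~0.2025
```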

Fundamental regions. R = relation, W = workload, n = number of records in R. R is divided into the minimum number of regions R_1, R_2, ..., R_r such that each query in W selects either all of the records in R_j or none of them. The upper bound on the number of regions is min(2^|W|, n). Example: Q1 selects values between 10 and 50 and Q2 selects values between 40 and 70; together they induce the regions [10, 40), [40, 50], (50, 70], plus the records selected by neither query. A sketch of computing these regions follows.
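A minimal sketch of how fundamental regions can be computed: group records by the exact set of workload queries that select them. The values and predicates mirror the Q1/Q2 example above and are otherwise hypothetical.

```python
from collections import defaultdict

def fundamental_regions(records, predicates):
    """Group records by the exact set of queries that select them."""
    regions = defaultdict(list)
    for rec in records:
        signature = tuple(bool(pred(rec)) for pred in predicates)
        regions[signature].append(rec)
    return regions   # at most min(2**len(predicates), len(records)) non-empty regions

values = [12, 35, 42, 48, 55, 65, 90]
q1 = lambda v: 10 <= v <= 50    # Q1: values between 10 and 50
q2 = lambda v: 40 <= v <= 70    # Q2: values between 40 and 70
for signature, recs in fundamental_regions(values, [q1, q2]).items():
    print(signature, recs)
# (True, False)  [12, 35]    selected only by Q1
# (True, True)   [42, 48]    selected by both
# (False, True)  [55, 65]    selected only by Q2
# (False, False) [90]        selected by neither
```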

Fixed workload problem statement (FIXEDSAMP). Input: R, W, k. Output: a sample of k records such that MSE(W) is minimized. Solution (3 steps): 1) identify the fundamental regions; 2) pick exactly one record from each important fundamental region; 3) assign appropriate values to the additional columns of the sample records.

Step 1 : Number of fundamental regions (r) is induced by the Workload W. Case 1 : (r k) ____________________________ Case 1: (r<=k) Step 2 : Pick one sample from each fundamental region (10,40,60,80) Step 3: Column RegionCount and AggSum are updated Region Count = { 3,2,2,2,) AggSum = {60,90,130,170) We can ans COUNT, SUM and AVG queries.

Case 2 (r > k). Step 2: 1) sort all regions by importance, where importance_j = f_j * n_j^2; f_j is the sum of the weights of all queries in W that select region R_j (it measures how strongly the queries are affected by R_j), and n_j is the number of records in R_j (n_j^2 measures the effect on the error of not including R_j). 2) Pick the top k regions (see the sketch below). Step 3: assign values to the additional columns (RegionCount, AggSum). This is not the same as in Case 1: with k records there are 2k unknowns ({RC_1, ..., RC_k} and {AS_1, ..., AS_k}); MSE(W) is a quadratic expression in them, and differentiating it yields a system of linear equations. Disadvantage: if the incoming queries are not identical to the workload, the errors are unpredictable.
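A minimal sketch of the importance ranking in Step 2, on hypothetical region statistics:

```python
def pick_top_k_regions(regions, k):
    """regions: list of dicts with f (workload weight) and n (record count)."""
    ranked = sorted(regions, key=lambda r: r["f"] * r["n"] ** 2, reverse=True)
    return ranked[:k]

regions = [
    {"id": 1, "f": 0.6, "n": 100},   # importance 0.6 * 100^2 = 6000
    {"id": 2, "f": 0.3, "n": 500},   # importance 75000
    {"id": 3, "f": 0.1, "n": 20},    # importance 40
    {"id": 4, "f": 0.9, "n": 5},     # importance 22.5
]
print([r["id"] for r in pick_top_k_regions(regions, k=2)])   # [2, 1]
```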

Lifting workloads. Q and Q' are similar when the sets of records they select significantly overlap. R = relation, Q = query, R_Q = the records selected by Q. p_{Q}(R') denotes the probability of occurrence of any query that selects the set of records R'.

The parameters δ (1/2 <= δ <= 1) and γ (0 <= γ <= 1/2) define the degree to which W influences the query distribution: δ is the probability that an incoming query selects a record inside R_Q, and γ is the probability that it selects a record outside R_Q. n_1, n_2, n_3, and n_4 are the counts of records in the four regions induced by R_Q and R'. When n_2 or n_4 is large (large overlap), p_{Q}(R') is high; when n_1 or n_3 is large (small overlap), p_{Q}(R') is low.

δ applies to records inside R_Q that are selected by R'; γ applies to records outside R_Q that are selected by R'.
δ -> 1, γ -> 0: R' is identical to R_Q
δ -> 1, γ -> 1/2: R' is a superset of R_Q
δ -> 1/2, γ -> 0: R' is a subset of R_Q
δ -> 1/2, γ -> 1/2: R' is unrestricted

STRATIFIED SAMPLING PROBLEM (SAMP). Input: R, p_W, k (p_W is the probability distribution over queries specified by lifting W). Output: a sample of k records such that MSE(p_W) is minimized.

Solution STRAT for single-table selection queries with aggregation. Three steps:
1. Stratification: (a) how many strata to partition R into? (b) which records of R go into each stratum?
2. Allocation: determine the number of samples assigned to each stratum.
3. Sampling: draw the allocated number of records from each stratum.

COUNT aggregation. 1) Stratification: the number of strata equals the number of fundamental regions (Lemma 1). 2) Allocation: we want to minimize the error over the queries in p_W. k_1, ..., k_r are unknown variables such that Σ_j k_j = k. From the equation on an earlier slide, MSE(p_W) can be expressed as a weighted sum of the MSE of each query in the workload. Lemma 2: MSE(p_W) = Σ_i w_i * MSE(p_{Q_i}).

Lemma 3: for a COUNT query Q in W, ApproxMSE(p_{Q}) is given by a closed-form expression (shown as a formula on the slide) whose terms are: the expected squared error in estimating the count of R_Q ∪ R_j; the expected squared error in estimating the count of R_Q ∪ (R \ R_Q); and the expected relative squared error in estimating the count of R_Q.

Since we have an (approximate) formula for MSE(p_{Q}), we can express MSE(p_W) as a function of the variables k_j. Corollary 1: MSE(p_W) = Σ_j (α_j / k_j), where each α_j is a function of n_1, ..., n_r, δ, and γ. α_j captures the "importance" of a region; it is positively correlated with n_j as well as with the frequency of queries in the workload that access R_j. Now we can minimize MSE(p_W). Lemma 4: Σ_j (α_j / k_j) is minimized subject to Σ_j k_j = k when k_j = k * sqrt(α_j) / Σ_i sqrt(α_i). This provides a closed-form, computationally inexpensive solution to the allocation problem, since α_j depends only on δ, γ, and the number of tuples in each fundamental region. A sketch of this allocation follows.
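A minimal sketch of the Lemma 4 allocation, with hypothetical α_j values:

```python
import math

def allocate(alphas, k):
    """Lemma 4: k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

alphas = [4.0, 1.0, 9.0, 16.0]        # hypothetical per-stratum alpha_j values
print(allocate(alphas, k=100))        # [20.0, 10.0, 30.0, 40.0]
# The fractional k_j are later rounded ("Obtaining Integer Solutions" under Pragmatic Issues).
```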

Solution for the SUM aggregate. 1) Stratification: we cannot use the same stratification as for COUNT; instead we use a bucketing technique: fundamental regions with large variance in the aggregate column are divided into finer regions with significantly lower internal variance, and each finer region is treated as a stratum. A sketch of the bucketing idea follows.
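A minimal sketch of the bucketing idea; the equi-depth split over sorted values is an assumption for illustration, not necessarily the paper's exact bucketing rule.

```python
def bucketize(values, h):
    """Split one region's aggregate-column values into h finer buckets by value order."""
    ordered = sorted(values)
    size = -(-len(ordered) // h)    # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

region = [1, 2, 3, 2, 1, 100, 120, 95, 110, 105]   # hypothetical skewed aggregate values
for bucket in bucketize(region, h=2):
    print(bucket)
# [1, 1, 2, 2, 3] and [95, 100, 105, 110, 120]: each bucket has far lower
# internal variance than the original region, so approximating by the bucket mean is safer.
```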

2) Allocation: as with COUNT, we set up an optimization problem, now with h*r unknowns k_1, ..., k_{h*r}. Unlike COUNT, the specific values of the aggregate column in each region (as well as the variance of the values in each region) influence MSE(p_{Q}). Let y_j (Y_j) be the average (sum) of the aggregate column values of all records in region R_j. Since the variance within each finer region is small, we approximate each value in the region by y_j.

MSE(p_{Q}) can thus be expressed as a function of the k_j's for a SUM query Q in W. As with COUNT, MSE(p_W) for SUM has the form Σ_j (α_j / k_j), where α_j depends on the same parameters n_1, ..., n_{h*r}, δ, and γ (Corollary 1).

Pragmatic issues:
- Identifying fundamental regions
- Handling a large number of fundamental regions
- Obtaining integer solutions
- Obtaining an unbiased estimator

Extensions for more general workloads.
GROUP BY queries: Q partitions R into g groups. Lifting model: replace Q with g separate selection queries (a small sketch of this splitting appears after this slide). Tagging step: append the pair (c, v), where c is the column id used in the GROUP BY and v is the value of the GROUP BY column in record t.
Join queries: star queries contain one source relation and a set of dimension relations connected to it via foreign-key joins, with selections and GROUP BY over the source and dimension relations and aggregation over columns of the source relation. Approach 1: identify samples only over the source relation (since the source relation is large). Approach 2: identify samples over the source relation and precompute its join with all dimension relations.
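A minimal sketch of replacing a GROUP BY query by g weighted selection queries, one per group with weight 1/g; the records and grouping column are hypothetical.

```python
from collections import defaultdict

def split_group_by(records, group_col, query_weight=1.0):
    """Replace a GROUP BY query by one selection query per group, each weighted 1/g."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_col]].append(rec)
    g = len(groups)
    return [(query_weight / g, lambda r, v=value: r[group_col] == v) for value in groups]

records = [{"state": "TX"}, {"state": "TX"}, {"state": "WA"}, {"state": "NY"}]
for weight, predicate in split_group_by(records, "state"):
    print(weight, [r for r in records if predicate(r)])
# each of the 3 groups becomes a selection query with weight 1/3
```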

Experimental results: methods compared.
This paper:
FIXED – solution to FIXEDSAMP, for fixed workloads of identical queries
STRAT – solution to SAMP, for workloads with single-table selection queries with aggregation
Previous work:
USAMP – uniform random sampling
WSAMP – weighted sampling
OTLIDX – outlier indexing combined with weighted sampling
CONG – congressional sampling

Conclusion: The solutions FIXED and STRAT handle the problems of data variance, heterogeneous mixes of queries, GROUP BY and foreign-key joins.