Scalable Approximate Query Processing through Scalable Error Estimation
Kai Zeng, UCLA
Advisor: Carlo Zaniolo

Why Approximate Query Processing?
AQP is critical for massive data
– Ever-growing size of big data
– Need for timely and cost-effective analysis
– Widely applied: RDBMSs (e.g., online aggregation), MapReduce systems (e.g., BlinkDB), data stream systems (e.g., load shedding)

Sampling & Quality Assessment
Sampling: widely used in AQP
Error estimation: fundamental in AQP
– Analytic error estimation
– Bootstrap
[Figure: a sample (6, 2, 7, 8, 5, 1, 3, 4, 9, 10) is drawn from the massive data; running AVG on the sample gives the approximate mean 5.5.]
Need to assess the quality! What is the error of this approximate mean?

Analytic Error Estimation
Use closed-form formulas
– Pro: very fast
– Con: restricted to simple aggregates
[Figure: the query AVG runs on the sample (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) drawn from the massive data and returns the approximate mean 5.5; the number of tuples and the variance are collected, and the Central Limit Theorem gives the error.]
What if I want to estimate the error of:
1. Complex SQL queries
2. Data mining tasks
3. …
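The closed-form estimate for AVG can be sketched in a few lines. This is only an illustration, assuming a uniform random sample and a 95% normal interval (z = 1.96); the function name `clt_confidence_interval` is made up for this note and does not come from the slides.

```python
import math
import statistics

def clt_confidence_interval(sample, z=1.96):
    """Approximate CI for the mean from the tuple count and the sample variance.

    Assumes a uniform random sample; z=1.96 corresponds to ~95% under the CLT.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return mean, (mean - z * std_err, mean + z * std_err)

# The sample shown on the slide.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean, (low, high) = clt_confidence_interval(sample)
print(mean, low, high)
```

Only the count, the mean and the variance are needed, which is why this approach is fast and also why it does not carry over to arbitrary queries.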

Bootstrap [Efron 1979]
Resample with replacement from the sample (each resample has the same size as the sample)
Run the query on the resample
Repeat many times, typically 100s or even 1000s of times
[Figure: from the sample (6, 2, 7, 8, 5, 1, 3, 4, 9, 10), resamples such as (2, 10, 10, 5, 9, 2, 5, 10, 8, 10), (8, 1, 2, 1, 1, 9, 7, 4, 10, 1), (9, 10, 2, 10, 7, 1, 3, 6, 10, 10), …… are drawn; the query AVG runs on each resample and the resulting means are collected.]

Compute the error from the empirical distribution of all the query results
[Figure: empirical distribution of the query results with the central 95% marked.]
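The bootstrap procedure described above can be sketched as follows. This is a plain percentile bootstrap, written under the assumption that the query is an ordinary Python function over a list; `bootstrap_interval` and its parameters are illustrative names, not part of any system mentioned in the talk.

```python
import random
import statistics

def bootstrap_interval(sample, query, trials=1000, confidence=0.95, seed=0):
    """Percentile interval from the empirical distribution of query results."""
    rng = random.Random(seed)
    n = len(sample)
    # Each trial: resample with replacement (same size), then run the query.
    results = sorted(query(rng.choices(sample, k=n)) for _ in range(trials))
    alpha = (1 - confidence) / 2
    low = results[int(alpha * trials)]
    high = results[min(trials - 1, int((1 - alpha) * trials))]
    return low, high

sample = [6, 2, 7, 8, 5, 1, 3, 4, 9, 10]
low, high = bootstrap_interval(sample, statistics.fmean)
print(low, high)
```

Because `query` is treated as a black box, the same function works for a median, a UDF or any other statistic, at the cost of running the query once per trial.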

Notes on Bootstrap
Bootstrap treats the query Q as a black box
Can handle (almost) arbitrarily complex queries, including UDFs!
Embarrassingly parallel
Computationally demanding
Uses too many resources

Error Estimation
Analytic error estimation
– Fast but limited to simple aggregates
Bootstrap (Monte Carlo simulation)
– Expensive but general
Fast and general?

How To Make Bootstrap Faster
Optimize the Monte Carlo simulation process
– EARL system [VLDB12][ICDE13]
Bypass the Monte Carlo simulation process
– Analytical Bootstrap Method (ABM) [SIGMOD14]

EARLY ACCURATE RESULT LIBRARY (EARL PROJECT)

Motivation
Existing systems (e.g., Hadoop) use batch processing
– High latency
– Waste of resources
Goals: a general driver that can
– Return approximate results
– With accuracy guarantees
– For a wide range of tasks

Incremental Computation
A small sample → a larger sample → ……
Use bootstrap to test accuracy
Time efficient: enables early returns
Resource efficient: does not waste resources
[Figure: a sample is drawn from the massive data and bootstrapped; if the result is not accurate enough, the sample is enlarged and bootstrapped again, and so on.]
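The sample, bootstrap, enlarge loop can be sketched as below. This is a rough illustration of the idea only, not EARL's implementation: it reruns a full bootstrap at every step (none of the optimizations from the talk), doubles the sample each round, and uses the relative standard error as the accuracy test. All names and default values are assumptions made for this note.

```python
import random
import statistics

def bootstrap_rel_error(sample, query, trials, rng):
    """Relative standard error of query(sample), estimated by plain bootstrap."""
    n = len(sample)
    results = [query(rng.choices(sample, k=n)) for _ in range(trials)]
    estimate = query(sample)
    if estimate == 0:
        return float("inf")
    return statistics.stdev(results) / abs(estimate)

def incremental_estimate(data, query, max_rel_error=0.05, start=100,
                         trials=200, seed=0):
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)  # any prefix of a shuffled list is a uniform sample
    size = min(start, len(shuffled))
    while True:
        sample = shuffled[:size]
        rel_error = bootstrap_rel_error(sample, query, trials, rng)
        # Return early once accurate enough, or when all data has been used.
        if rel_error <= max_rel_error or size == len(shuffled):
            return query(sample), rel_error, size
        size = min(2 * size, len(shuffled))  # enlarge the sample and retry

data_rng = random.Random(42)
data = [data_rng.uniform(0, 100) for _ in range(20000)]
estimate, rel_error, used = incremental_estimate(data, statistics.fmean)
print(estimate, rel_error, used)
```

On this synthetic data the loop stops after reading only a small fraction of the 20,000 values, which is the early-return behaviour the slide describes.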

Basic Ideas: Optimization
Intra-iteration optimization
– We have to repeat the same computation on all resamples
– Much of the data is shared!
– Compute the shared part once
[Figure: resamples split into a shared part and a non-shared part.]

Basic Ideas: Optimization
Inter-iteration optimization
– Reuse the old computation
– Cannot simply merge, because of randomness
– Keep a small sample in memory for adjustment
[Figure: the adjustment is small.]

ANALYTICAL BOOTSTRAP

Analytical Bootstrap
A single-round evaluation = 100s/1000s of bootstrap trials!
[Figure label: # of times a tuple will be drawn in a bootstrap trial.]

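Bootstrap Resamples As Multiset DB
Bootstrap generates multiset relations
– Tuples annotated with multiplicities
– Query processing manipulates these multiplicities

Sample:
ID  Product  Qty
1   A        2
2   B        3
3   A        2
4   A        4

Resamples, one per line (each tuple as ID, Product, Qty):
(1, A, 2), (2, B, 3), (2, B, 3), (4, A, 4)
(1, A, 2), (2, B, 3), (4, A, 4), (4, A, 4)
(1, A, 2), (3, A, 2), (3, A, 2), (4, A, 4)
……

The last resample as a multiset relation (# = multiplicity):
ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1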
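To make the multiset view concrete, the sketch below draws one resample and stores it as a multiplicity per original tuple instead of as a list of copies. The helper names `resample_multiplicities` and `expand` are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter

# The four-tuple sample from the slide: (ID, Product, Qty).
sample = [(1, "A", 2), (2, "B", 3), (3, "A", 2), (4, "A", 4)]

def resample_multiplicities(n, rng):
    """Draw n tuple positions with replacement; return how often each was drawn."""
    drawn = Counter(rng.randrange(n) for _ in range(n))
    return [drawn[i] for i in range(n)]

def expand(sample, multiplicities):
    """Turn the annotated relation back into an explicit resample."""
    return [row for row, m in zip(sample, multiplicities) for _ in range(m)]

rng = random.Random(0)
multiplicities = resample_multiplicities(len(sample), rng)
resample = expand(sample, multiplicities)
print(multiplicities, resample)

# The annotated relation from the slide (# = 1, 0, 2, 1) expands to its resample.
print(expand(sample, [1, 0, 2, 1]))
```

The multiplicities always sum to the sample size, so the annotated relation carries exactly the same information as the explicit resample.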

Querying Multiset DB: Projection
Projection takes the sum of multiplicities

Example query: how many products are ordered by small-quantity orders?
    SELECT Product, SUM(Qty)
    FROM Orders
    WHERE Qty < (SELECT SUM(Qty) / 4 FROM Orders)
    GROUP BY Product

Input:
ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1

Output:
Product  Qty  #
A        2    3
B        3    0
A        4    1

Querying Multiset DB: Aggregate
Aggregate takes the weighted sum of multiplicities

Input:
ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1

Output (2·1 + 3·0 + 2·2 + 4·1 = 10):
SUM(Qty)  #
10        1

Querying Multiset DB: Join
Join takes the product of multiplicities

Inputs:
Product  Qty  #
A        2    3
B        3    0
A        4    1

SUM(Qty)  #
10        1

Output:
Product  Qty  SUM(Qty)  #
A        2    10        3
B        3    10        0
A        4    10        1

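Querying Multiset DB: Selection
Selection takes the product of multiplicities

Input:
Product  Qty  SUM(Qty)  #
A        2    10        3
B        3    10        0
A        4    10        1

Output (only tuples with Qty < SUM(Qty) / 4 keep their multiplicity):
Product  Qty  SUM(Qty)  #
A        2    10        3
B        3    10        0
A        4    10        0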
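The four operators can be chained to evaluate the example query on one annotated resample. The sketch below is only an illustration of the multiplicity rules stated on the slides, using plain Python lists of (tuple, multiplicity) pairs; the function names are invented here and do not describe how ABM is implemented.

```python
from collections import defaultdict

# One resample of Orders as a multiset relation: ((ID, Product, Qty), #).
orders = [((1, "A", 2), 1), ((2, "B", 3), 0), ((3, "A", 2), 2), ((4, "A", 4), 1)]

def project(rows, keep):
    """Projection: tuples that collapse together add up their multiplicities."""
    out = defaultdict(int)
    for tup, mult in rows:
        out[tuple(tup[i] for i in keep)] += mult
    return list(out.items())

def weighted_sum(rows, col):
    """Aggregate: SUM weights each value by its multiplicity."""
    return sum(tup[col] * mult for tup, mult in rows)

def join_scalar(rows, value, value_mult=1):
    """Join with a one-row relation: multiplicities are multiplied."""
    return [(tup + (value,), mult * value_mult) for tup, mult in rows]

def select(rows, predicate):
    """Selection: multiply the multiplicity by 1 (kept) or 0 (filtered out)."""
    return [(tup, mult * int(predicate(tup))) for tup, mult in rows]

product_qty = project(orders, keep=(1, 2))            # (Product, Qty), #
total = weighted_sum(orders, col=2)                   # SUM(Qty) over the resample
joined = join_scalar(product_qty, total)              # (Product, Qty, SUM(Qty)), #
selected = select(joined, lambda t: t[1] < t[2] / 4)  # Qty < SUM(Qty) / 4

# GROUP BY Product, SUM(Qty): again a multiplicity-weighted sum per group.
result = defaultdict(int)
for (product, qty, _), mult in selected:
    if mult > 0:
        result[product] += qty * mult
print(total, dict(result))
```

The intermediate relations match the tables on the slides, and the final answer for this resample is a single group, product A with SUM(Qty) = 6.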

Bootstrap Resamples As Multiset DB

Probabilistic Multiset DB (PMDB)
Similar to tossing coins

Three resamples as multiset relations:
ID  Product  Qty  # (resample 1)  # (resample 2)  # (resample 3)
1   A        2    2               1               1
2   B        3    1               1               0
3   A        2    1               0               2
4   A        4    0               2               1

[Table: the same relation (ID, Product, Qty, #) as a single PMDB; the entries of its # column are not recoverable.]
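The formula on this slide did not survive extraction. As a reminder of the standard bootstrap fact behind the coin-tossing analogy (stated here in notation chosen for this note, not necessarily the notation used in the talk): when a resample of size n is drawn with replacement from a sample of n tuples, the multiplicity m_i of tuple i is the number of successes in n independent draws that each pick tuple i with probability 1/n.

```latex
m_i \sim \mathrm{Binomial}\!\left(n, \tfrac{1}{n}\right), \qquad
\Pr[m_i = k] = \binom{n}{k} \left(\tfrac{1}{n}\right)^{k} \left(1 - \tfrac{1}{n}\right)^{n-k}, \qquad
\mathbb{E}[m_i] = 1, \qquad
\mathrm{Var}[m_i] = 1 - \tfrac{1}{n}, \qquad
\sum_{i=1}^{n} m_i = n .
```

For the four-tuple example, n = 4, so a given tuple is absent from a resample with probability (3/4)^4 ≈ 0.316. The last identity shows that the multiplicities are jointly multinomial rather than fully independent.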

Querying PMDB

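Querying PMDB: Projection
Projection takes the convolution sum of multiplicities
[Tables: the Orders relation (ID, Product, Qty, #) and its projection onto (Product, Qty, #) with rows (A, 2), (B, 3), (A, 4); the # entries are not recoverable.]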
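A convolution sum can be illustrated with two small probability mass functions. The sketch assumes the two multiplicities are independent random variables, which is a simplification made for this note (multiplicities in an exact bootstrap resample are jointly multinomial), and it uses a fair-coin distribution purely as a toy input rather than the distributions from the talk.

```python
def convolve(pmf_a, pmf_b):
    """PMF of X + Y for independent X, Y; pmf[k] is the probability of value k."""
    out = [0.0] * (len(pmf_a) + len(pmf_b) - 1)
    for i, pa in enumerate(pmf_a):
        for j, pb in enumerate(pmf_b):
            out[i + j] += pa * pb
    return out

# Toy multiplicities: each tuple appears 0 or 1 times with probability 1/2.
coin = [0.5, 0.5]
merged = convolve(coin, coin)  # multiplicity of the tuple produced by projection
print(merged)
```

When a projection merges two tuples into one, the merged tuple's multiplicity is the sum of the two inputs, so under the independence assumption its distribution is the convolution of theirs.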

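From Theory To Practice
Annotated random variables
– Marginal distribution
Numeric form!
[Tables: the Orders relation (ID, Product, Qty) with each tuple's # shown first as a random variable and then in numeric form; the numeric entries are not recoverable.]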

Querying PMDB: an Example
[Tables: the Orders relation and its projection onto (Product, Qty), each shown with random-variable annotations and in numeric form; the numeric entries are not recoverable.]

Querying PMDB: an Example
[Tables: the Orders relation (ID, Product, Qty, #) and its projection onto (Product, Qty, #); the # entries are not recoverable.]

Querying PMDB in Numeric Form
ABM is correct for queries with eligible plans
A large subset of queries can be evaluated by ABM in DBPTIME
Eligible plans can be tested at compile time
[Figure label: Functional Dependency Rules.]

Coverage of Various Techniques

Technique                  TPC-H   Conviva Log
Analytic error estimation  9/22    36.9%
ABM DBPTIME eligible       15/22   81.0%
ABM eligible               19/22   98.6%
ABM                        19/22   99.1%
Bootstrap                  19/22   99.1%

Over 6660 queries

EXPERIMENTAL EVALUATION

Experimental Setting
Synthetic and real-life datasets and queries:
– TPC-H: 100 GB
– Skewed-TPC-H: 1 GB
– Customer: 52 GB
Compare relative error
– Of: mean, standard deviation, quantile, KS-distance, confidence interval, existence probability
– Between: Analytical Bootstrap Method (ABM), bootstrap (BS), ground truth (GT)

Accuracy of ABM
Comparing the distributions given by ABM & bootstrap on quantiles & existence probability (1% sample)
ABM models bootstrap accurately

Accuracy of ABM
Comparing user-defined measures given by ABM & bootstrap to ground truth (1% sample)
ABM is consistent with bootstrap

Accuracy of ABM
Comparing predictions given by ABM & bootstrap when varying the number of bootstrap trials (TPC-H 1%)
Bootstrap converges to ABM

Time Performance of ABM
Comparing time performance of ABM & bootstrap variants (TPC-H 10%)
– Bootstrap: original bootstrap
– BLB-10: Bag of Little Bootstraps using 10 machines
– ODM: On-Demand Materialization
ABM is 3-4 orders of magnitude faster than sequential/parallel bootstrap variants

Time Performance of ABM
Comparing time performance of ABM & various techniques (TPC-H 10%)
– Exact: run the query on the original data
– Sample: run the query on the sample
– CLT: analytic error estimation using the Central Limit Theorem
ABM introduces little overhead

Conclusion & Future Work
Bootstrap is critical for scalable AQP
ABM provides an analytical model for bootstrap and achieves significant speed-up
ABM+EARL: a bootstrap-based system that can automatically choose/combine error estimation methods
Integrating ABM into Hive/Shark