BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)

Motivation: support interactive SQL-like aggregate queries over massive sets of data.

Feature: most queries compute a global summary of the whole table.
blinkdb> SELECT AVG(jobtime) FROM very_big_log
Supported aggregates: AVG, COUNT, SUM, STDEV, PERCENTILE, etc.

Feature: WHERE and GROUP BY semantics restrict the query to a limited subset of the table.
blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop'
FILTERS, GROUP BY clauses

Query Execution on Samples [figure: scanning 100 TB on 1,000 machines takes ½-1 hour from hard disks and 1-5 minutes from memory; can a sample bring this down to 1 second?]

Query Execution on Samples: what is the average buffering ratio in the table?

ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  …
10  Berkeley  …
11  NYC       …
12  Berkeley  0.10

Query Execution on Samples: what is the average buffering ratio in the table? (full table as above)

Uniform Sample (sampling rate 1/4):
ID  City      Buff Ratio
2   NYC       0.13
6   Berkeley  0.25
8   NYC       0.19

Query Execution on Samples: what is the average buffering ratio in the table? Averaging the three rows of the 1/4-rate uniform sample gives an approximate answer of 0.19, reported with an error bar.

Query Execution on Samples: what is the average buffering ratio in the table? (full table as above)

Uniform Sample (sampling rate 1/2):
ID  City      Buff Ratio
2   NYC       0.13
3   Berkeley  0.25
5   NYC       0.19
6   Berkeley  0.09
8   NYC       0.18
12  Berkeley  0.49

Estimate: 0.22 +/- 0.05
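The estimate and its error bar can be computed in closed form from the sample alone. A minimal Python sketch, using the 1/2-rate sample rows as transcribed above; the normal-approximation 95% interval with a 1.96 multiplier is an assumption for illustration, so the resulting width need not match the slide's figure:

```python
import math

# Buffering ratios from the 1/2-rate uniform sample above.
sample = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]

n = len(sample)
mean = sum(sample) / n                               # point estimate for AVG
var = sum((x - mean) ** 2 for x in sample) / (n - 1) # sample variance
stderr = math.sqrt(var / n)                          # error shrinks as 1/sqrt(n)
ci95 = 1.96 * stderr                                 # normal-approximation error bar
print(f"{mean:.2f} +/- {ci95:.2f}")
```

Doubling the sampling rate from 1/4 to 1/2 roughly halves the variance, which is the speed/accuracy knob the next slides discuss.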

Speed/Accuracy Trade-off [figure: error vs. execution time (sample size); executing on the entire dataset takes ~30 minutes, while interactive queries target ~2 seconds]

Speed/Accuracy Trade-off [figure as above, with a floor on the error marking pre-existing noise in the data]

Sampling Vs. No Sampling [figure: query response time (seconds) vs. fraction of full data; roughly 10x faster when response time is dominated by I/O]

Sampling Vs. No Sampling [figure as above, with error bars of 0.02%, 0.07%, 1.1%, 3.4%, and 11% at the different fractions of the full data]

What is BlinkDB? A framework built on Shark and Spark that:
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime

BlinkDB outline: Background, System Overview, Sample Creation, BlinkDB Runtime, Implementation & Evaluation

Background: one common assumption is that future queries will be similar to historical queries, though the meaning of "similarity" can differ. This choice of model of past workloads is one of the key differences between BlinkDB and prior work.

Workload Taxonomy

System Overview: BlinkDB extends the Apache Hive framework by adding two major components: (1) an offline sampling module that creates and maintains samples over time, and (2) a run-time sample selection module that creates an Error-Latency Profile (ELP) for queries.

Supported queries: standard SQL aggregate queries involving COUNT, AVG, SUM and QUANTILE. Queries involving these operations can be annotated with either an error bound or a time constraint. Nested or join queries are not yet supported, but this is not a fundamental hindrance.

It would also be straightforward to extend BlinkDB to deal with foreign-key joins between two sampled tables (or a self-join on one sampled table) where both tables have a stratified sample on the set of columns used for joins.

Sample Creation: why are stratified samples useful? Samples carry storage costs, so we can only build a limited number of them.

Stratified Samples: when is a uniform sample not enough? A uniform sample may not contain any members of a rare subset at all, leading to a missing row in the final output of the query.
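This failure mode is easy to quantify. The Python sketch below, with hypothetical table sizes not taken from the paper, computes the exact probability that a 1% uniform sample misses a rare group entirely, which is the motivation for stratified samples:

```python
import math

# Hypothetical table: 10,000 common-group rows and only 8 rare-group rows.
N_common, N_rare, n = 10_000, 8, 100
N = N_common + N_rare

# Probability that a uniform sample of n rows misses the rare group:
# all n sampled rows must come from the common group.
p_miss = math.comb(N_common, n) / math.comb(N, n)
print(f"P(rare group missing from uniform sample) = {p_miss:.2f}")

# A stratified sample on the grouping column keeps min(cap, group size)
# rows from every group, so the rare group is always represented.
```

Even though the sample is 1% of the table, the rare group is absent from it over 90% of the time, so any GROUP BY on that column would silently drop a row from the answer.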

Stratified Samples for a single query

This problem has been studied before [16]. Briefly, since error decreases at a decreasing rate as sample size increases, the optimal choice assigns an equal sample size to each group. In addition, the assignment of sample sizes is deterministic.
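One way to realize this deterministic, equal-per-group assignment under a total row budget is a water-filling computation: raise a common per-group cap K until the budget is exhausted, with small groups kept in full. A sketch; the function name and budget semantics are illustrative, not BlinkDB's actual interface:

```python
def allocate(group_sizes, budget):
    """Find the largest per-group cap K with sum(min(K, size)) <= budget,
    then give each group min(K, size) sample rows."""
    lo, hi = 0, max(group_sizes)
    while lo < hi:                      # binary search on the cap K
        mid = (lo + hi + 1) // 2
        if sum(min(mid, s) for s in group_sizes) <= budget:
            lo = mid                    # cap fits within the budget
        else:
            hi = mid - 1
    return [min(lo, s) for s in group_sizes]
```

For example, `allocate([100, 10, 5], 45)` gives the two small groups all their rows and spends the remaining budget on the large group, yielding `[30, 10, 5]`.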

[16] S. Lohr. Sampling: Design and Analysis. Thomson.

Optimizing a set of stratified samples for all queries sharing a QCS: the per-query sample size n will change from query to query.

Column selection optimization: in practice, we set M = K = 100,000.

A sample can also be useful by partially covering q_j.

The size of this optimization problem increases exponentially with the number of columns in T, which looks worrying. However, it can be solved in practice by applying some simple optimizations, such as considering only column sets that actually occurred in past queries, or eliminating column sets that are unrealistically large.

BlinkDB Runtime

The prediction is mainly based on two observations: (1) for all standard SQL aggregates, the variance is proportional to 1/n, and thus the standard deviation (the statistical error) is proportional to 1/√n; (2) BlinkDB predicts the latency for a given n by assuming that latency scales linearly with input size, as is commonly observed for I/O-bound queries in parallel distributed execution environments.
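These two rules combine into a simple sample-size picker: scale the error as 1/√n and the latency linearly in n. A sketch with illustrative function names and constants (not BlinkDB's actual API), where the error on a small pilot sample is extrapolated to the size needed for a target error bound:

```python
import math

def pick_sample_size(pilot_error, pilot_n, target_error):
    # Error is proportional to 1/sqrt(n): cutting the error by a
    # factor f requires f**2 times as many rows.
    factor = pilot_error / target_error
    return math.ceil(pilot_n * factor ** 2)

def predict_latency(n_rows, rows_per_second):
    # Latency assumed linear in input size (I/O-bound query).
    return n_rows / rows_per_second

# Halving the error from 0.10 to 0.05 needs 4x the pilot's 1,000 rows.
n = pick_sample_size(0.10, 1_000, 0.05)
latency = predict_latency(n, rows_per_second=2_000)
```

The ELP is exactly this pair of curves, built per query, so the runtime can pick the smallest sample meeting an error bound, or the largest sample fitting a time bound.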

Bias correction: use a stratified sample to simulate a uniform sample by tracking the sampling rate of every group and reweighting accordingly.
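A weighted average over a stratified sample illustrates the idea: divide each row's contribution by its group's sampling rate, so groups kept at a higher rate do not bias the answer. The rows and rates below are made up for illustration:

```python
# (city, buffering ratio, sampling rate of that row's group)
sample = [
    ("NYC",      0.13, 0.25),   # NYC rows kept at rate 1/4
    ("NYC",      0.19, 0.25),
    ("Berkeley", 0.09, 1.00),   # rare group kept in full
]

# Each sampled row stands in for 1/rate rows of the original table.
total = sum(value / rate for _, value, rate in sample)
count = sum(1 / rate for _, _, rate in sample)
unbiased_avg = total / count

naive_avg = sum(value for _, value, _ in sample) / len(sample)
# The naive average over-weights the fully kept Berkeley group,
# dragging the estimate toward its low buffering ratio.
```

Here the reweighted estimate (~0.152) is noticeably higher than the naive one (~0.137), because the single Berkeley row should count for far less than one third of the table.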

Implementation:
- enables queries with response time and error bounds
- creates or updates the set of uniform and multi-dimensional samples
- re-writes the query and iteratively assigns it an appropriately sized uniform or stratified sample
- modifies all pre-existing aggregation functions with statistical closed forms to return error bars and confidence intervals in addition to their results

Sample refresh: running multiple queries on an unchanged, biased sample produces correlated inaccuracies, so the answers will not converge. BlinkDB therefore periodically (typically daily) re-samples from the original data to avoid correlation among the answers to queries that use the same sample.


Time cost for sample creation: uniform samples are generally created in a few hundred seconds. Creating stratified samples on a set of columns takes anywhere between 5 and 30 minutes, depending on the number of unique values to stratify on, which determines the number of reducers and the amount of data shuffled.

Evaluation: workloads and sample storage cost; the QCS choices vary with the storage budget.

Response time improvement by sample

Error by different samples

Error Convergence

Time and error bound

Scaling Up: highly selective queries, which operate on only a small fraction of the input data via one or more highly selective WHERE clauses, versus queries intended to crunch huge amounts of data. [figure: average among rows with x = 2 vs. average among all the data]

Conclusion: BlinkDB is a parallel, sampling-based approximate query engine that provides support for ad-hoc queries with error and response time constraints. Two key ideas: (i) a multi-dimensional sampling strategy that builds and maintains a variety of samples; (ii) a run-time dynamic sample selection strategy that uses parts of a sample to estimate query selectivity and chooses the best samples for satisfying query constraints. BlinkDB can answer a range of queries within 2 seconds on 17 TB of data with 90-98% accuracy.