BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Authored by Sameer Agarwal et al. Presented by Atul Sandur.

Presentation transcript:


Motivation
- Traditional SQL queries
- Can we support interactive SQL-like aggregate queries over massive datasets?

Motivation
- 100 TB on 1000 machines: ½ - 1 hour, 1 - 5 minutes, 1 second?
- Query execution on samples of data

Query Execution on Samples
What is the average latency in the table?
Full table:
ID  City  Latency
1   NYC   30
2   NYC   38
3   SLC   34
4   LA    36
5   SLC   37
6   SF    28
7   NYC   32
8   NYC   38
9   LA    36
10  SF    35
11  NYC   38
12  LA    34
Uniform sample:
ID  City  Latency  Sampling Rate
2   NYC   38       1/2
3   SLC   34       1/2
5   SLC   37       1/2
7   NYC   32       1/2
8   NYC   38       1/2
12  LA    34       1/2
Results:
- Full data: 34.67
- Rate ¼: ± 2.18
- Rate ½: 35.5 ± 1.02
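
The rate-½ figure on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes the "±" value is the standard error of the mean, s / sqrt(n), which is consistent with the 1.02 shown; the function name `mean_with_stderr` is illustrative and not part of BlinkDB.

```python
import math
import statistics

# Latency column of the 12-row table on the slide, keyed by row ID.
latency = {1: 30, 2: 38, 3: 34, 4: 36, 5: 37, 6: 28,
           7: 32, 8: 38, 9: 36, 10: 35, 11: 38, 12: 34}

# Row IDs in the rate-1/2 uniform sample shown on the slide.
sample_ids = [2, 3, 5, 7, 8, 12]


def mean_with_stderr(values):
    """Return (sample mean, standard error of the mean)."""
    values = list(values)
    mean = statistics.fmean(values)
    # Assumption: the slide's "+/-" is s / sqrt(n), with s the sample std dev.
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


full_mean = statistics.fmean(list(latency.values()))
est, err = mean_with_stderr(latency[i] for i in sample_ids)
print(f"full data: {full_mean:.2f}")
print(f"rate 1/2:  {est:.1f} +/- {err:.2f}")
```

Halving the data moves the answer by less than one unit here, and the error bar tells the user how far to trust it.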

What is BlinkDB?
A framework built on Apache Hive that:
- Creates and maintains a variety of uniform and stratified samples from the underlying data (offline)
- Returns fast, approximate answers by executing queries on samples of data selected dynamically (online)
- Is compatible and integrated with Apache Hive, and supports Hive's SQL-style query structure

Design considerations
- Query column set (QCS): the columns that appear in a query's filtering or GROUP BY clauses; expected to be stable over time
- Targets workloads with predictable QCSs
- Enables pre-computing samples that generalize to future workloads

Queries
- Supports COUNT, AVG, SUM and QUANTILE
- Relies on closed-form error estimation for these aggregates
- Queries can be annotated with an error bound or a time constraint
- BlinkDB selects the appropriate sample type and size

High-level architecture
- Table -> Sample creation module -> samples (Uniform, Stratified on C1, Stratified on C2)
- Query with error/latency bound -> Query plan -> Sample selection module -> Updated query plan -> Execute

Sample creation (uniform)
1. FILTER rand() < 1/3
2. Add per-row weights
3. (Optional) ORDER BY rand()
Input table:
ID  City  Latency
1   NYC   30
2   NYC   38
3   SLC   34
4   LA    36
5   SLC   37
6   SF    28
7   NYC   32
8   NYC   38
9   LA    36
10  SF    35
11  NYC   38
12  LA    34
Uniform sample (U):
ID  City  Latency  Weight
2   NYC   38       1/3
6   SF    28       1/3
8   NYC   38       1/3
12  LA    34       1/3
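
The three steps above can be sketched in plain Python. This is an illustration under stated assumptions, not BlinkDB's Hive implementation: `uniform_sample` and `estimate_sum` are hypothetical names, and the weight stored per row is the sampling rate, as in the slide's Weight column.

```python
import random

# (ID, City, Latency) rows from the slide's example table.
TABLE = [
    (1, "NYC", 30), (2, "NYC", 38), (3, "SLC", 34), (4, "LA", 36),
    (5, "SLC", 37), (6, "SF", 28), (7, "NYC", 32), (8, "NYC", 38),
    (9, "LA", 36), (10, "SF", 35), (11, "NYC", 38), (12, "LA", 34),
]


def uniform_sample(rows, rate, seed=None, shuffle=False):
    """Keep each row independently with probability `rate`.

    Returns (row, weight) pairs, where weight is the sampling rate.
    """
    rng = random.Random(seed)
    # Steps 1 and 2: FILTER rand() < rate, then attach the per-row weight.
    sample = [(row, rate) for row in rows if rng.random() < rate]
    if shuffle:
        rng.shuffle(sample)  # Step 3: optional ORDER BY rand()
    return sample


def estimate_sum(sample):
    """Estimate the full-table SUM(Latency) by scaling each row by 1 / weight."""
    return sum(row[2] / weight for row, weight in sample)


sample = uniform_sample(TABLE, 1 / 3, seed=7, shuffle=True)
print(len(sample), "rows sampled; estimated SUM(Latency):", estimate_sum(sample))
```

Storing the weight next to each row lets aggregates such as SUM and COUNT be rescaled at query time without knowing how the sample was built.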

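Sample creation (stratified)
Diagram operators: SPLIT, GROUP and JOIN over intermediate tables S1 and S2.
Input table:
ID  City  Latency
1   NYC   34
2   NYC   32
3   SF    36
4   NYC   28
5   NYC   37
6   SF    33
7   NYC   31
8   NYC   30
9   SF    32
10  SF    34
11  NYC   35
12  SF    36
Per-group counts and sampling ratios:
City  Count  Ratio
NYC   7      2/7
SF    5      2/5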

Sample creation (stratified)
Resulting stratified sample (diagram labels: S1, S2, U):
ID  City  Latency  Weight
2   NYC   32       2/7
8   NYC   30       2/7
6   SF    33       2/5
12  SF    36       2/5

Sample creation (stratified)
- Stratified sample size per group
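
The stratified example on the preceding slides (7 NYC rows and 5 SF rows, two rows kept per city) can be sketched as follows. This is a minimal illustration that assumes a per-group cap, here called `cap`, and draws exactly min(cap, group size) rows per group rather than filtering with rand(); the function name is hypothetical.

```python
import random
from collections import defaultdict
from fractions import Fraction

# (ID, City, Latency) rows from the stratified-sampling slide.
TABLE = [
    (1, "NYC", 34), (2, "NYC", 32), (3, "SF", 36), (4, "NYC", 28),
    (5, "NYC", 37), (6, "SF", 33), (7, "NYC", 31), (8, "NYC", 30),
    (9, "SF", 32), (10, "SF", 34), (11, "NYC", 35), (12, "SF", 36),
]


def stratified_sample(rows, key_index, cap, seed=None):
    """Sample at most `cap` rows per distinct value of column `key_index`.

    Returns (row, weight) pairs with weight = rows kept / rows in group,
    which matches the Ratio column on the slide (2/7 for NYC, 2/5 for SF).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:  # GROUP rows by the stratification column
        groups[row[key_index]].append(row)

    sample = []
    for members in groups.values():
        k = min(cap, len(members))
        weight = Fraction(k, len(members))
        sample.extend((row, weight) for row in rng.sample(members, k))
    return sample


for row, weight in stratified_sample(TABLE, key_index=1, cap=2, seed=1):
    print(row, weight)
```

Capping each group keeps rare groups fully represented while large groups are thinned, which is what makes GROUP BY answers on rare values usable.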

Sample creation for multiple queries
- Multiple queries can share a QCS but need different values of n (the number of rows required to satisfy the query)
- The sample depends on n (set by the error/time bound) and on the selectivity of the query
- Requires maintaining one sample per family of stratified samples S_n
Figure: sample for multiple queries with a shared QCS

Sample creation (optimization)
- Multi-dimensional stratified samples
- Objective function: weighted sum of coverage of the QCSs of historical queries
- Constraints: storage cost for the samples; sample's coverage probability for the query QCS
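
The shape of this optimization can be shown with a toy brute-force search. It is only a sketch under strong simplifying assumptions: coverage is reduced to 0 or 1 (a query counts as covered when some chosen sample's column set contains its QCS), candidate costs and query weights are made up, and `choose_samples` is a hypothetical name. The slide does not state how the real problem is solved.

```python
from itertools import combinations


def choose_samples(candidates, queries, budget):
    """Pick the candidate samples that maximize weighted query coverage.

    candidates: dict mapping a column set (frozenset) to its storage cost.
    queries:    list of (qcs, weight) pairs from the historical workload.
    budget:     total storage allowed for the chosen samples.
    """
    best_score, best_choice = 0, ()
    names = list(candidates)
    for r in range(len(names) + 1):
        for choice in combinations(names, r):
            if sum(candidates[c] for c in choice) > budget:
                continue  # violates the storage constraint
            # Simplified 0/1 coverage: QCS is a subset of a chosen column set.
            score = sum(w for qcs, w in queries
                        if any(qcs <= cols for cols in choice))
            if score > best_score:
                best_score, best_choice = score, choice
    return set(best_choice), best_score


CITY, OS = frozenset({"City"}), frozenset({"OS"})
BOTH = frozenset({"City", "OS"})
candidates = {CITY: 40, OS: 30, BOTH: 90}      # made-up storage costs
queries = [(CITY, 5), (OS, 3), (BOTH, 2)]      # made-up query weights

print(choose_samples(candidates, queries, budget=80))
print(choose_samples(candidates, queries, budget=100))
```

With a budget of 80 the two single-column samples win; raising it to 100 makes the two-column sample affordable and it covers every query. Exhaustive search is exponential in the number of candidates, so it only serves to illustrate the objective and constraint.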

Sample selection (runtime)
Selecting the sample type:
- If the query's column set is a subset of a stratified sample's QCS, select that sample
- Otherwise, run the query across all samples and pick the ones with high selectivity
Selecting the sample size:
- Build an Error-Latency Profile (ELP) by running the query on smaller samples
- Project the profile to larger sample sizes
- Error profile: estimate query selectivity, sample variance and input data distribution, then use a standard closed-form statistical error estimate
- Latency profile: assumes latency scales linearly with input size
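
The projection step can be sketched with two small functions. This is an illustration, not BlinkDB's code: it assumes the standard error shrinks as 1 / sqrt(n), the usual closed-form behavior for sample means, and that latency grows linearly with rows read, as stated on the slide. The function names and the pilot measurements are hypothetical.

```python
import math


def rows_for_error(pilot_rows, pilot_error, target_error):
    """Smallest sample size whose projected error meets `target_error`.

    Assumes error is proportional to 1 / sqrt(n), so shrinking the error
    by a factor f needs f**2 times as many rows.
    """
    return math.ceil(pilot_rows * (pilot_error / target_error) ** 2)


def rows_for_latency(pilot_rows, pilot_latency, time_budget):
    """Largest sample size whose projected latency fits `time_budget`.

    Assumes latency scales linearly with the number of rows scanned.
    """
    return math.floor(pilot_rows * time_budget / pilot_latency)


# Hypothetical pilot run: 1000 rows gave error 4.0 in 0.5 seconds.
print(rows_for_error(1000, 4.0, 1.0))    # rows needed for error <= 1.0
print(rows_for_latency(1000, 0.5, 2.0))  # rows affordable within 2 seconds
```

An error-bounded query picks the smallest sample that reaches the first number; a time-bounded query picks the largest sample that stays under the second.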

Evaluation
- Figures: BlinkDB vs. no sampling; Conviva error comparison; TPC-H error comparison
- Expected error minimized

Evaluation
- Figures: response time bounds; relative error bounds; scaleup
- Smaller sample sizes
- Low communication cost

Conclusion
- Sampling-based approximate query engine that supports query error and response time constraints
- Uses multi-dimensional stratified sampling with a runtime sample selection strategy
- Can answer queries within 2 seconds on up to 17 TB of data with 90-98% accuracy

Thoughts
- Introduces novel concepts, grounded in statistics and sampling theory, to build upon
- Can be integrated into existing query processing frameworks such as Hive and Shark
- Follow-up work could support more generic aggregates and UDFs
- Potentially crucial aspects not properly addressed: M and K values are fixed, the optimization space could be huge (heuristics unclear), sample replacement period, etc.
- What if the ELP (Error-Latency Profile) estimates are not accurate? And do we verify error estimates and query feasibility?

Thank you!

Extra slides

Speed/Accuracy Trade-off
- Lets users explore the speed-accuracy trade-off curve for performance
- Real-time analysis
- Data already carries pre-existing noise from collection

Apache Hive
- Built on top of Hadoop to query and manage large datasets
- Imposes structure on a variety of data formats
- SQL-like query language that can be extended with UDFs
- Runs batch jobs over large datasets with scalability, extensibility, fault tolerance and loose coupling with input formats

BlinkDB Architecture
Components shown in the diagram:
- Command-line Shell, Thrift/JDBC
- Driver: SQL Parser, Query Optimizer, Physical Plan, Execution
- SerDes, UDFs
- Meta store
- Hadoop/Spark/Presto
- Hadoop Storage (e.g., HDFS, HBase, Presto)

Error Estimation
Closed-form aggregate functions:
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
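
As one concrete instance, a closed-form estimate for COUNT with a filter can be sketched as below. The sketch assumes a uniform random sample, treats each sampled row as an independent match/no-match trial, and ignores any finite-population correction; `estimate_count` and the numbers in the example are hypothetical.

```python
import math


def estimate_count(total_rows, sample_size, matches):
    """Estimate how many of `total_rows` satisfy a predicate.

    `matches` is the number of sampled rows (out of `sample_size`) that
    satisfy it. Returns (estimate, standard error), using the normal
    approximation to the sample proportion from the Central Limit Theorem.
    """
    q = matches / sample_size  # observed selectivity in the sample
    estimate = total_rows * q
    stderr = total_rows * math.sqrt(q * (1 - q) / sample_size)
    return estimate, stderr


# Hypothetical numbers: 20 of 100 sampled rows match, out of 1000 rows total.
est, se = estimate_count(total_rows=1000, sample_size=100, matches=20)
print(f"COUNT ~ {est:.0f} +/- {se:.0f}")
```

Because the error term has a closed form, it can be evaluated from the sample alone, without re-running the query or resampling.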
