

Approximate Queries on Very Large Data
Sameer Agarwal, UC Berkeley
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica

Our Goal
Support interactive SQL-like aggregate queries over massive sets of data.

Aggregates (AVG, COUNT, SUM, STDEV, PERCENTILE, etc.):
  blinkdb> SELECT AVG(jobtime) FROM very_big_log

Filters and GROUP BY clauses:
  blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop'

Joins, nested queries, etc.:
  blinkdb> SELECT AVG(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'

ML primitives and user-defined functions:
  blinkdb> SELECT my_function(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'

Query Execution on Samples
100 TB on 1,000 machines:
  Hard disks: ½ to 1 hour
  Memory: 1 to 5 minutes
  ?: 1 second

Query Execution on Samples
What is the average Salary of all the people in the table?

ID  City      Salary
1   NYC       50,000
2   NYC       62,492
3   Berkeley  78,212
4   NYC       120,242
5   NYC       98,341
6   Berkeley  75,453
7   NYC       60,000
8   NYC       72,492
9   Berkeley  88,212
10  Berkeley  92,242
11  NYC       70,000
12  Berkeley  102,492

Exact answer: $80,848

Uniform sample, sampling rate 1/4:

ID  City      Salary
2   NYC       62,492
6   Berkeley  75,453
8   NYC       72,492

Estimate: $70,145 +/- 10,815

Uniform sample, sampling rate 1/2:

ID  City      Salary
2   NYC       62,492
3   Berkeley  78,212
7   NYC       60,000
6   Berkeley  75,453
8   NYC       72,492
12  Berkeley  102,492

Estimate: $75,190 +/- 5,895
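The arithmetic on this slide can be checked with a short Python sketch. It recomputes the exact average and the two sample averages from the salaries listed on the slide, and reports how far each estimate lands from the exact answer. The slide does not say how its +/- values were derived, so the sketch does not try to reproduce them. The variable names are illustrative.

```python
import statistics

# Salary column of the 12-row table, in ID order (values copied from the slide).
all_salaries = [50_000, 62_492, 78_212, 120_242, 98_341, 75_453,
                60_000, 72_492, 88_212, 92_242, 70_000, 102_492]

# Salaries listed in the two uniform samples on the slide.
quarter_sample = [62_492, 75_453, 72_492]
half_sample = [62_492, 78_212, 60_000, 75_453, 72_492, 102_492]

exact = statistics.mean(all_salaries)
quarter_mean = statistics.mean(quarter_sample)
half_mean = statistics.mean(half_sample)

# Actual deviation of each estimate from the exact answer.
quarter_error = abs(exact - quarter_mean)
half_error = abs(exact - half_mean)

print(f"exact average:      {exact:,.2f}")
print(f"1/4 sample average: {quarter_mean:,.2f} (off by {quarter_error:,.2f})")
print(f"1/2 sample average: {half_mean:,.2f} (off by {half_error:,.2f})")
```

In this particular draw the larger sample lands closer to the exact answer, which is the trade-off the next slides build on.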

Speed/Accuracy Trade-off
[Figure: error plotted against execution time. Marked points: interactive queries (5 sec) and time to execute on the entire dataset (30 mins). Also labeled: pre-existing noise.]

What is BlinkDB?
A data analysis (warehouse) system that …
- builds on Shark and Spark
- returns fast, approximate answers with error bars by executing queries on small samples of data
- is compatible with Apache Hive (storage, serdes, UDFs, types, metadata) and supports Hive's SQL-like query structure with minor modifications

Sampling Vs. No Sampling
[Figure: query response time (seconds) plotted against fraction of full data. Annotation: 10x, as response time is dominated by I/O. Error bars shown: (0.02%), (0.07%), (1.1%), (3.4%), (11%).]

Hive Architecture
[Diagram components: Command-line Shell, Thrift/JDBC; Driver with SQL Parser, Query Optimizer, Physical Plan, Execution; SerDes, UDFs; Meta store; MapReduce; Hadoop Storage (e.g., HDFS, HBase)]

Shark Architecture
[Diagram components: Command-line Shell, Thrift/JDBC; Driver with SQL Parser, Query Optimizer, Physical Plan, Execution; SerDes, UDFs; Meta store; Spark; Hadoop Storage (e.g., HDFS, HBase)]

BlinkDB Architecture
[Diagram components: Command-line Shell, Thrift/JDBC; Driver with SQL Parser, Query Optimizer, Physical Plan, Execution; SerDes, UDFs; Meta store; Spark; Hadoop Storage (e.g., HDFS, HBase)]

BlinkDB alpha
1. Released and available at [link not preserved]
2. Allows you to create random and stratified samples on native tables and materialized views
3. Adds approximate aggregate functions with statistical closed forms to HiveQL: approx_avg(), approx_sum(), approx_count(), etc.
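To make the difference between a random and a stratified sample concrete, here is a minimal Python sketch of one common stratified scheme: keep at most a fixed number of rows per distinct value of a column, so rare groups are not lost. The slide does not describe how BlinkDB builds its stratified samples, so `stratified_sample`, its `cap` parameter, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, cap, seed=None):
    """Keep at most `cap` rows per distinct key, chosen uniformly within each group.

    Each kept row is paired with its group's effective sampling rate so that
    later aggregates can be re-weighted.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        chosen = members if len(members) <= cap else rng.sample(members, cap)
        rate = len(chosen) / len(members)
        sample.extend((row, rate) for row in chosen)
    return sample

# Toy data (hypothetical): one large group and one rare group.
rows = [("NYC", 50_000 + i) for i in range(1_000)]
rows += [("Berkeley", 70_000 + i) for i in range(5)]

sample = stratified_sample(rows, key=lambda row: row[0], cap=50, seed=7)
per_city = Counter(row[0] for row, _rate in sample)
print(per_city)
```

A 1% uniform sample of the same rows would be expected to contain about 10 NYC rows and very likely no Berkeley rows at all, which is why a per-group cap helps GROUP BY queries on skewed data.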

Example: Preparing the Data

Referencing an external table logs in BlinkDB:
  blinkdb> create external table logs (dt string, event string, bytes int) row format delimited fields terminated by ' ' location '/tmp/logs';

Create a 1% random sample logs_sample from logs:
  blinkdb> create table logs_sample as select * from logs samplewith 0.01;

Supports all Shark primitives for caching samples in memory:
  blinkdb> create table logs_sample_cached as select * from logs_sample;
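As a rough picture of what a 1% random sample means, the Python sketch below keeps each row independently with probability 0.01 (Bernoulli sampling). Whether `samplewith` works this way or draws a fixed-size sample is not stated on the slide, so treat this as an assumption. The function name and the toy rows are hypothetical.

```python
import random

def bernoulli_sample(rows, rate, seed=None):
    """Keep each row independently with probability `rate`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]

# Toy stand-in for the logs table: (dt, event, bytes) tuples.
logs = [("2013-08-29", "foo" if i % 3 == 0 else "bar", i) for i in range(100_000)]

logs_sample = bernoulli_sample(logs, rate=0.01, seed=42)
print(f"kept {len(logs_sample)} of {len(logs)} rows")
```

With this scheme the sample size is itself random and only close to 1% of the table, which is one reason a system needs to know the actual sample and dataset sizes when scaling results.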

Example: Analyzing the Data

Giving BlinkDB information about the size of the sample you wish to operate on and the size of the original dataset:
  blinkdb> set blinkdb.sample.size=32810
  blinkdb> set blinkdb.dataset.size=

Prefixing approx_ to an aggregate operator tells BlinkDB to return an approximate answer:
  blinkdb> select approx_count(1) from logs_sample_cached where event = "foo";
  ... +/- ... (99% Confidence)

Returns an approximate answer with an error bar and confidence interval.
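One textbook closed form for an approximate count shows why both sizes are needed: scale the fraction of matching sample rows up to the dataset size, and derive the error bar from the binomial standard error. The Python sketch below uses that formula. It is not taken from BlinkDB's source, it ignores the finite-population correction, and the dataset size and match count are made-up values (the slide's dataset size is not preserved). Only the sample size 32810 comes from the slide.

```python
import math
from statistics import NormalDist

def approx_count(matches_in_sample, sample_size, dataset_size, confidence=0.99):
    """Scale a sample count to the full dataset and attach a normal-approximation error bar."""
    p = matches_in_sample / sample_size            # matching fraction in the sample
    z = NormalDist().inv_cdf(0.5 + confidence / 2) # two-sided critical value
    std_error = math.sqrt(p * (1 - p) / sample_size)
    estimate = p * dataset_size
    margin = z * std_error * dataset_size
    return estimate, margin

estimate, margin = approx_count(
    matches_in_sample=1_200,   # hypothetical: sample rows with event = "foo"
    sample_size=32_810,        # from the slide
    dataset_size=3_281_000,    # hypothetical
)
print(f"{estimate:,.0f} +/- {margin:,.0f} (99% Confidence)")
```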

Example: There's more!

The sample operator can be anywhere in the query graph:
  blinkdb> create table logs_sample as select * from [any subquery] samplewith 0.01;

Retains remaining Hive Query Structure:
  blinkdb> select approx_count(1) from logs_sample_cached where event = "foo" GROUP BY dt ORDER BY dt;
  ... +/- ... (99% Confidence)

Note: The output is a String

Feature Roadmap
1. Integrating BlinkDB with Shark as an experimental feature (coming soon!)
2. Automatic Sample Management
3. More Hive Aggregates, UDAF Support
4. Runtime Correctness Tests

Automatic Sample Management
Goal: The API should abstract the details of creating, deleting and managing samples from the user.

Time bound:
  SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS
  ... ± ...
  SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS
  ... ± 4.96

Error bound:
  SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%

Automatic Sample Management
[Diagram components: TABLE (original data), Sampling Module, In-Memory Samples, On-Disk Samples, Sample Selection, Shark]

1. Offline sampling: creates an optimal set of samples on native tables and materialized views based on query history and workload characteristics.
2. Sample placement: samples are striped over 100s or 1,000s of machines, both on disk and in memory.
3. Online sample selection: for a HiveQL/SQL query such as SELECT foo (*) FROM TABLE WITHIN 2, picks the best sample(s) based on query latency and accuracy requirements and produces a new query plan.
4. Execution: parallel query execution on multiple samples striped across multiple machines. The result comes with error bars and confidence intervals: Result ± 5.56 (95% confidence).
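The online selection step can be pictured with a small Python sketch. Given several pre-built samples of different sizes, it picks the largest one that can be scanned within a time budget, or the smallest one expected to meet an error bound. The slides do not spell out BlinkDB's actual selection logic, so everything here is an assumption: the function names, the single scan-rate model, the normal-approximation sizing rule n >= (z * sigma / error)^2, and the reading of ERROR as an absolute error on an average.

```python
import math
from statistics import NormalDist

def rows_needed(std_dev, max_error, confidence):
    """Rows required so that z * std_dev / sqrt(n) <= max_error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * std_dev / max_error) ** 2)

def pick_by_error(sample_sizes, std_dev, max_error, confidence):
    """Smallest available sample expected to satisfy the error bound, or None."""
    needed = rows_needed(std_dev, max_error, confidence)
    big_enough = [n for n in sample_sizes if n >= needed]
    return min(big_enough) if big_enough else None

def pick_by_time(sample_sizes, rows_per_second, time_budget_seconds):
    """Largest available sample that can be scanned within the time budget, or None."""
    budget_rows = rows_per_second * time_budget_seconds
    fitting = [n for n in sample_sizes if n <= budget_rows]
    return max(fitting) if fitting else None

# Hypothetical pre-built samples (row counts) and workload figures.
available = [10_000, 100_000, 1_000_000]

print(pick_by_error(available, std_dev=10.0, max_error=0.1, confidence=0.95))
print(pick_by_time(available, rows_per_second=40_000, time_budget_seconds=2))
```

The two policies pull in opposite directions: a time bound favors smaller samples, an error bound favors larger ones, which is why keeping a range of sample sizes on disk and in memory is useful.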

More Aggregates / UDAF Support
1. Using the bootstrap to estimate error
   - A Bootstrap Operator resamples Sample A into A1, A2, …, An.
   - Placement of the Bootstrap Operator in the query graph is critical to performance.
   - However, the bootstrap can fail.
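For aggregates without a closed-form error, the bootstrap idea can be sketched in a few lines of Python: resample the sample with replacement many times, recompute the aggregate on each resample, and read an interval off the spread of the results. This is the generic percentile bootstrap, not BlinkDB's operator. The function name, the resample count, and the synthetic data are assumptions for illustration.

```python
import random
import statistics

def bootstrap_interval(sample, aggregate, resamples=200, confidence=0.95, seed=None):
    """Percentile-bootstrap interval for an arbitrary aggregate function.

    Returns (point estimate, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(sample)
    # Each resample plays the role of A1, A2, ..., An on the slide.
    estimates = sorted(
        aggregate(rng.choices(sample, k=n)) for _ in range(resamples)
    )
    lower_index = int((1 - confidence) / 2 * resamples)
    upper_index = resamples - 1 - lower_index
    return aggregate(sample), estimates[lower_index], estimates[upper_index]

# Synthetic, skewed "job time" data (hypothetical).
data_rng = random.Random(0)
job_times = [data_rng.expovariate(1 / 30) for _ in range(500)]

point, low, high = bootstrap_interval(job_times, statistics.median, seed=1)
print(f"median = {point:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

Because every resample re-runs the aggregate, the cost grows with the number of resamples, which is consistent with the slide's point that where the operator sits in the query plan matters for performance.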

Runtime Correctness Tests
1. Given a query, how do you know if it can be approximated at runtime?
   - Depends on the query, data distribution, and sample size
2. Need for runtime diagnosis tests
   - Check whether error improves as sample size increases
   - 30,000 extremely small query tasks
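The "does error improve as sample size increases" check can be illustrated with a simplified Python sketch: estimate the bootstrap standard error on nested subsamples of growing size and verify that it shrinks. This is a toy version of the idea only. It does not model the many small query tasks mentioned on the slide, and the function names, subsample sizes, and synthetic data are assumptions.

```python
import random
import statistics

def bootstrap_std_error(sample, aggregate, resamples, rng):
    """Standard deviation of the aggregate across bootstrap resamples."""
    n = len(sample)
    estimates = [aggregate(rng.choices(sample, k=n)) for _ in range(resamples)]
    return statistics.stdev(estimates)

def error_shrinks(sample, aggregate, sizes, resamples=100, seed=None):
    """Return (passed, errors): whether the estimated error falls at each larger size.

    Assumes `sample` is already in random order, so a prefix is itself a sample.
    """
    rng = random.Random(seed)
    errors = [
        bootstrap_std_error(sample[:n], aggregate, resamples, rng) for n in sizes
    ]
    passed = all(later < earlier for earlier, later in zip(errors, errors[1:]))
    return passed, errors

# Synthetic, well-behaved data (hypothetical): the mean should pass the check.
data_rng = random.Random(3)
values = [data_rng.gauss(100, 15) for _ in range(4_000)]

passed, errors = error_shrinks(values, statistics.mean, sizes=[100, 1_000, 4_000], seed=5)
print(passed, [round(e, 3) for e in errors])
```

A query whose estimated error does not shrink as the subsample grows is a candidate for falling back to a larger sample or to exact execution.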

Getting Started
1. BlinkDB alpha released and available at [link not preserved]
2. Takes just 5-10 minutes to run it locally or to spin up an EC2 cluster
3. Hands-on exercises today at the AMPCamp
4. Designed to be a drop-in tool like Shark

Summary
1. Approximate querying is an important means of achieving interactivity in processing large datasets
2. BlinkDB
   - builds on Shark and Spark
   - returns approximate answers with error bars by executing queries on small samples of data
   - supports existing Hive queries with minor modifications
3. For more information, please check out our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2014 (http://bit.ly/blinkdb-2) papers

Thanks!