Published by Augustine Andrews. Modified over 9 years ago.
1
Approximate Queries on Very Large Data
UC Berkeley
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
2
Our Goal
Support interactive SQL-like aggregate queries over massive sets of data
3
Our Goal
Support interactive SQL-like aggregate queries over massive sets of data
blinkdb> SELECT AVG(jobtime) FROM very_big_log
AVG, COUNT, SUM, STDEV, PERCENTILE etc.
4
Our Goal
Support interactive SQL-like aggregate queries over massive sets of data
blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop'
FILTERS, GROUP BY clauses
5
Our Goal
Support interactive SQL-like aggregate queries over massive sets of data
blinkdb> SELECT AVG(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'
JOINS, Nested Queries etc.
6
Our Goal
Support interactive SQL-like aggregate queries over massive sets of data
blinkdb> SELECT my_function(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'
ML Primitives, User Defined Functions
7
100 TB on 1000 machines
Hard Disks: ½ - 1 Hour
Memory: 1 - 5 Minutes
?: 1 second
Query Execution on Samples
8
Query Execution on Samples

ID  City      Salary
1   NYC        50,000
2   NYC        62,492
3   Berkeley   78,212
4   NYC       120,242
5   NYC        98,341
6   Berkeley   75,453
7   NYC        60,000
8   NYC        72,492
9   Berkeley   88,212
10  Berkeley   92,242
11  NYC        70,000
12  Berkeley  102,492

What is the average Salary of all the people in the table?
$80,848
9
Query Execution on Samples
What is the average Salary of all the people in the table?

Uniform Sample (drawn from the 12-row table)
ID  City      Salary  Sampling Rate
2   NYC       62,492  1/4
6   Berkeley  75,453  1/4
8   NYC       72,492  1/4

Sample estimate: $70,145
Exact answer: $80,848
10
Query Execution on Samples
What is the average Salary of all the people in the table?

Uniform Sample (drawn from the 12-row table)
ID  City      Salary  Sampling Rate
2   NYC       62,492  1/4
6   Berkeley  75,453  1/4
8   NYC       72,492  1/4

Sample estimate: $70,145 +/- 10,815
Exact answer: $80,848
11
Query Execution on Samples
What is the average Salary of all the people in the table?

Uniform Sample (drawn from the 12-row table)
ID  City      Salary   Sampling Rate
2   NYC        62,492  1/2
3   Berkeley   78,212  1/2
7   NYC        60,000  1/2
6   Berkeley   75,453  1/2
8   NYC        72,492  1/2
12  Berkeley  102,492  1/2

Sample estimate (rate 1/2): $75,190 +/- 5,895
Sample estimate (rate 1/4): $70,145 +/- 10,815
Exact answer: $80,848
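The point estimates on these slides can be checked with a few lines of Python. This is a minimal sketch that only recomputes the averages from the salaries shown. The slides do not say how the +/- figures were derived, so the sketch reports the actual gap to the exact answer instead of trying to reproduce them.

```python
# Salaries from the 12-row example table (IDs 1 through 12).
salaries = [50000, 62492, 78212, 120242, 98341, 75453,
            60000, 72492, 88212, 92242, 70000, 102492]

# Salaries listed in the two uniform samples on the slides.
quarter_sample = [62492, 75453, 72492]
half_sample = [62492, 78212, 60000, 75453, 72492, 102492]


def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


exact = mean(salaries)
for label, sample in [("1/4", quarter_sample), ("1/2", half_sample)]:
    estimate = mean(sample)
    # Distance between the sample estimate and the exact full-table answer.
    gap = abs(estimate - exact)
    print(f"rate {label}: estimate={estimate:.2f} exact={exact:.2f} gap={gap:.2f}")
```

For this particular table the larger sample lands closer to the exact average, which is the trade-off the next slides describe.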
12
Speed/Accuracy Trade-off
[Figure: Error plotted against Execution Time. Marked points: Interactive Queries (5 sec) and Time to Execute on Entire Dataset (30 mins).]
13
Speed/Accuracy Trade-off
[Figure: Error plotted against Execution Time. Marked points: Interactive Queries (5 sec) and Time to Execute on Entire Dataset (30 mins). Additional annotation: Pre-Existing Noise.]
14
What is BlinkDB?
A data analysis (warehouse) system that ...
- builds on Shark and Spark
- returns fast, approximate answers with error bars by executing queries on small samples of data
- is compatible with Apache Hive (storage, serdes, UDFs, types, metadata) and supports Hive's SQL-like query structure with minor modifications
15
Sampling Vs. No Sampling
[Figure: bar chart of Query Response Time (Seconds) against Fraction of full data (1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5). Bar labels, from full data to the smallest sample: 1020, 103, 18, 13, 10, 8. Callout: 10x as response time is dominated by I/O.]
16
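Sampling Vs. No Sampling
[Figure: bar chart of Query Response Time (Seconds) against Fraction of full data (1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5). Bar labels, from full data to the smallest sample: 1020, 103, 18, 13, 10, 8. Error Bars for the five sampled fractions: (0.02%), (0.07%), (1.1%), (3.4%), (11%).]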
17
Hive Architecture
[Diagram components]
Command-line Shell, Thrift/JDBC
Driver: SQL Parser, Query Optimizer, Physical Plan, Execution; SerDes, UDFs
Meta store
MapReduce
Hadoop Storage (e.g., HDFS, HBase)
18
Shark Architecture
[Diagram components]
Command-line Shell, Thrift/JDBC
Driver: SQL Parser, Query Optimizer, Physical Plan, Execution; SerDes, UDFs
Meta store
Spark
Hadoop Storage (e.g., HDFS, HBase)
19
BlinkDB Architecture
[Diagram components]
Command-line Shell, Thrift/JDBC
Driver: SQL Parser, Query Optimizer, Physical Plan, Execution; SerDes, UDFs
Meta store
Spark
Hadoop Storage (e.g., HDFS, HBase)
20
BlinkDB alpha-0.1.0
1. Released and available at http://blinkdb.org
2. Allows you to create random and stratified samples on native tables and materialized views
3. Adds approximate aggregate functions with statistical closed forms to HiveQL: approx_avg(), approx_sum(), approx_count() etc.
21
Example: Preparing the Data blinkdb>
22
Example: Preparing the Data
blinkdb> create external table logs (dt string, event string, bytes int) row format delimited fields terminated by ' ' location '/tmp/logs';
Referencing an external table logs in BlinkDB
23
Example: Preparing the Data
blinkdb> create external table logs (dt string, event string, bytes int) row format delimited fields terminated by ' ' location '/tmp/logs';
blinkdb> create table logs_sample as select * from logs samplewith 0.01;
Create a 1% random sample logs_sample from logs
24
Example: Preparing the Data
blinkdb> create external table logs (dt string, event string, bytes int) row format delimited fields terminated by ' ' location '/tmp/logs';
blinkdb> create table logs_sample as select * from logs samplewith 0.01;
blinkdb> create table logs_sample_cached as select * from logs_sample;
Supports all Shark primitives for caching samples in memory
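To make the `samplewith 0.01` step concrete, here is a minimal Python sketch of drawing a 1% random sample. The slides do not describe how BlinkDB implements `samplewith`; the sketch assumes simple Bernoulli sampling (each row kept independently with the given probability). The function name `sample_with` and the synthetic `logs` rows are hypothetical.

```python
import random


def sample_with(rows, rate, seed=None):
    """Keep each row independently with probability `rate` (Bernoulli sampling)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]


# Hypothetical stand-in for the `logs` table: (dt, event, bytes) tuples.
logs = [("2013-08-29", "foo" if i % 3 == 0 else "bar", i) for i in range(100_000)]

logs_sample = sample_with(logs, 0.01, seed=42)
# The sample size is random: close to 1% of the rows, but rarely exactly 1%.
print(len(logs_sample), "rows sampled out of", len(logs))
```

Because the realized sample size varies from run to run, the analysis step needs to be told the actual sample and dataset sizes, which is what the `set blinkdb.sample.size` and `set blinkdb.dataset.size` commands on the following slides provide.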
25
Example: Analyzing the Data
blinkdb> set blinkdb.sample.size=32810
blinkdb> set blinkdb.dataset.size=3198910
Giving BlinkDB information about the size of the sample you wish to operate on and the size of the original dataset
26
Example: Analyzing the Data
blinkdb> set blinkdb.sample.size=32810
blinkdb> set blinkdb.dataset.size=3198910
blinkdb> select approx_count(1) from logs_sample_cached where event = "foo";
Prefixing approx_ to an aggregate operator tells BlinkDB to return an approximate answer
27
Example: Analyzing the Data
blinkdb> set blinkdb.sample.size=32810
blinkdb> set blinkdb.dataset.size=3198910
blinkdb> select approx_count(1) from logs_sample_cached where event = "foo";
12810132 +/- 3423 (99% Confidence)
Returns an approximate answer with an error bar and confidence interval
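One way to see why the sample size and dataset size are needed is to scale a count by hand. The sketch below is an assumption about the kind of closed form involved, not BlinkDB's actual code: it scales the matching fraction up to the full dataset and attaches a normal-approximation error bar. The function is a hypothetical stand-in for the operator of the same name, and the `matches_in_sample` value is made up for illustration.

```python
import math


def approx_count(matches_in_sample, sample_size, dataset_size, z=2.576):
    """Scale a sample count to the full dataset with a normal-approximation error bar.

    z=2.576 corresponds to a two-sided 99% confidence level.
    Returns (estimate, half_width).
    """
    p = matches_in_sample / sample_size      # fraction of sampled rows that match
    estimate = p * dataset_size              # scaled-up count
    # Binomial standard error of p, scaled to the dataset size.
    std_err = dataset_size * math.sqrt(p * (1 - p) / sample_size)
    return estimate, z * std_err


estimate, half_width = approx_count(
    matches_in_sample=1_000,   # hypothetical: sampled rows with event = "foo"
    sample_size=32_810,        # blinkdb.sample.size from the slide
    dataset_size=3_198_910,    # blinkdb.dataset.size from the slide
)
print(f"{estimate:.0f} +/- {half_width:.0f} (99% Confidence)")
```

The sketch ignores the finite-population correction, which would narrow the interval slightly when the sample is a large fraction of the dataset.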
28
Example: There's more!
blinkdb> create table logs_sample as select * from [any subquery] samplewith 0.01;
The sample operator can be anywhere in the query graph
29
Example: There's more!
blinkdb> create table logs_sample as select * from [any subquery] samplewith 0.01;
blinkdb> select approx_count(1) from logs_sample_cached where event = "foo" GROUP BY dt ORDER BY dt;
Retains remaining Hive Query Structure
30
Example: There's more!
blinkdb> create table logs_sample as select * from [any subquery] samplewith 0.01;
blinkdb> select approx_count(1) from logs_sample_cached where event = "foo" GROUP BY dt ORDER BY dt;
12810132 +/- 3423 (99% Confidence)
Note: The output is a String
31
Feature Roadmap
1. Integrating BlinkDB with Shark as an experimental feature (coming soon!)
2. Automatic Sample Management
3. More Hive Aggregates, UDAF Support
4. Runtime Correctness Tests
32
Automatic Sample Management
Goal: The API should abstract the details of creating, deleting and managing samples from the user
SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS
234.23 ± 15.32
33
Automatic Sample Management
Goal: The API should abstract the details of creating, deleting and managing samples from the user
SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS
234.23 ± 15.32 (earlier result, WITHIN 1 SECONDS)
239.46 ± 4.96
34
Automatic Sample Management
Goal: The API should abstract the details of creating, deleting and managing samples from the user
SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%
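An error-bounded query such as `ERROR 0.1 CONFIDENCE 95.0%` implies a minimum sample size. The sketch below shows the textbook calculation for an average, assuming (this is not stated on the slide) that `ERROR 0.1` means a 10% relative error and that a normal approximation applies. The function name and the pilot statistics are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(std_dev, mean, relative_error, confidence):
    """Rows needed so the confidence-interval half-width of AVG stays within
    relative_error * |mean|, under a normal approximation."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    target_half_width = relative_error * abs(mean)
    return math.ceil((z * std_dev / target_half_width) ** 2)


# Hypothetical pilot statistics for sessionTime; not taken from the slides.
rows_needed = required_sample_size(
    std_dev=120.0, mean=234.23, relative_error=0.1, confidence=0.95
)
print("rows needed:", rows_needed)
```

Halving the allowed error roughly quadruples the required rows, which is why tighter error bounds call for larger samples.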
35
Automatic Sample Management
[Diagram: TABLE, Sampling Module, Original Data]
Offline-sampling: Creates an optimal set of samples on native tables and materialized views based on query history and workload characteristics
36
Automatic Sample Management
[Diagram: TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data]
Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in-memory.
37
Automatic Sample Management
[Diagram: HiveQL/SQL Query, Query Plan, Sample Selection, TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data]
SELECT foo (*) FROM TABLE WITHIN 2
38
Automatic Sample Management
[Diagram: HiveQL/SQL Query, Query Plan, Sample Selection, TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data]
SELECT foo (*) FROM TABLE WITHIN 2
Online sample selection to pick best sample(s) based on query latency and accuracy requirements
39
Automatic Sample Management
[Diagram: HiveQL/SQL Query, Sample Selection, New Query Plan, Shark, TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data, Error Bars & Confidence Intervals]
SELECT foo (*) FROM TABLE WITHIN 2
Result: 182.23 ± 5.56 (95% confidence)
Parallel query execution on multiple samples striped across multiple machines.
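Online sample selection under a time bound can be sketched as picking the largest sample whose estimated scan time fits the `WITHIN` budget, on the reasoning that a larger sample generally gives a tighter error bar. Everything in this Python sketch is hypothetical: the `Sample` record, the throughput constants, and the catalog are illustrative assumptions, not BlinkDB's actual selection logic.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    name: str
    rows: int
    in_memory: bool


# Assumed scan throughputs (rows per second); purely illustrative numbers.
ROWS_PER_SEC_MEMORY = 50_000_000
ROWS_PER_SEC_DISK = 5_000_000


def estimated_seconds(sample: Sample) -> float:
    """Rough scan-time estimate from the sample size and where it is stored."""
    rate = ROWS_PER_SEC_MEMORY if sample.in_memory else ROWS_PER_SEC_DISK
    return sample.rows / rate


def pick_sample(samples: Sequence[Sample], within_seconds: float) -> Optional[Sample]:
    """Return the largest sample that fits the time bound, or None if none fits."""
    feasible = [s for s in samples if estimated_seconds(s) <= within_seconds]
    return max(feasible, key=lambda s: s.rows, default=None)


catalog = [
    Sample("table_1pct_mem", 10_000_000, in_memory=True),
    Sample("table_10pct_mem", 100_000_000, in_memory=True),
    Sample("table_50pct_disk", 500_000_000, in_memory=False),
]
chosen = pick_sample(catalog, within_seconds=2)
print(chosen.name if chosen else "no sample fits the time bound")
```

A real selector would also weigh the accuracy requirement and the query's filters; this sketch only covers the latency side.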
40
More Aggregates/UDAFs Support
1. Using Bootstrap to estimate error
[Diagram: Sample A]
41
More Aggregates/UDAFs Support
1. Using Bootstrap to estimate error
[Diagram: Sample A, Bootstrap Operator, resamples A1, A2, ..., An]
42
More Aggregates/UDAFs Support
1. Using Bootstrap to estimate error
[Diagram: Sample A, resamples A1, A2, ..., An]
Placement of the Bootstrap Operator in the query graph is critical to performance
43
More Aggregates/UDAFs Support
1. Using Bootstrap to estimate error
[Diagram: Sample A, resamples A1, A2, ..., An]
However, the bootstrap can fail
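The bootstrap idea on these slides can be sketched in a few lines of Python: draw resamples A1 ... An from Sample A with replacement, evaluate the aggregate on each, and read an interval off the spread of the results. This is a generic percentile bootstrap under stated assumptions, not BlinkDB's operator; the function name and parameters are hypothetical.

```python
import random
import statistics


def bootstrap_interval(sample, statistic, resamples=1000, confidence=0.95, seed=None):
    """Percentile-bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(sample)
    # Each resample has the same size as the sample and is drawn with replacement.
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(resamples)
    )
    lower_idx = int((1 - confidence) / 2 * resamples)
    upper_idx = min(resamples - 1, int((1 + confidence) / 2 * resamples))
    return estimates[lower_idx], estimates[upper_idx]


# Salaries from the 1/2-rate sample shown earlier in the deck.
sample_a = [62492, 78212, 60000, 75453, 72492, 102492]
low, high = bootstrap_interval(sample_a, statistics.median, seed=7)
print(f"median is between {low} and {high} (95% bootstrap interval)")
```

Because the statistic is passed in as a function, the same routine works for aggregates that have no closed-form error estimate, such as user-defined functions. It is also known to be unreliable for extreme statistics such as MAX, which is one way the bootstrap can fail.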
44
Runtime Correctness Tests
1. Given a query, how do you know if it can be approximated at runtime?
   - Depends on the query, data distribution, and sample size
2. Need for runtime diagnosis tests
   - Check whether error improves as sample size increases
   - 30,000 extremely small query tasks
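The check "error improves as sample size increases" can be illustrated with a simplified Python sketch: estimate the error on subsamples of growing size and verify that it shrinks at every step. This is only an illustration of the idea, not the diagnostic BlinkDB runs; the function names, subsample sizes, and resample counts are hypothetical, and the many small query tasks mentioned on the slide are not modeled.

```python
import random
import statistics


def bootstrap_error(values, statistic, resamples, rng):
    """Bootstrap estimate of the standard deviation of `statistic` on `values`."""
    n = len(values)
    estimates = [statistic(rng.choices(values, k=n)) for _ in range(resamples)]
    return statistics.pstdev(estimates)


def error_shrinks_with_size(sample, statistic, sizes=(100, 400, 1600),
                            resamples=200, seed=0):
    """Return (ok, errors): ok is True if the estimated error decreases
    each time the subsample size grows."""
    rng = random.Random(seed)
    errors = [
        bootstrap_error(rng.sample(sample, size), statistic, resamples, rng)
        for size in sizes
    ]
    ok = all(later < earlier for earlier, later in zip(errors, errors[1:]))
    return ok, errors


# Synthetic, well-behaved data: the mean should pass the check.
data_rng = random.Random(1)
data = [data_rng.gauss(100, 15) for _ in range(5000)]
ok, errors = error_shrinks_with_size(data, statistics.fmean)
print("approximable:", ok, "errors:", [round(e, 3) for e in errors])
```

If the estimated error does not shrink as the subsamples grow, that is a signal the query should not be answered approximately on the available samples.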
45
Getting Started
1. BlinkDB alpha-0.1.0 released and available at http://blinkdb.org
2. Takes just 5-10 minutes to run it locally or to spin up an EC2 cluster
3. Hands-on Exercises today at the AMPCamp
4. Designed to be a drop-in tool like Shark
46
Summary
1. Approximate querying is an important means to achieve interactivity in processing large datasets
2. BlinkDB...
   - builds on Shark and Spark
   - returns approximate answers with error bars by executing queries on small samples of data
   - supports existing Hive queries with minor modifications
3. For more information, please check out our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2014 (http://bit.ly/blinkdb-2) papers
Thanks!