Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee
Bolin Ding, Silu Huang*, Surajit Chaudhuri, Kaushik Chakrabarti, Chi Wang
{bolind, surajitc, kaushik, chiw}@microsoft.com, shuang86@illinois.edu
Microsoft Research / University of Illinois
* Work done while at Microsoft

Outline: Introduction | Overview of Our Approach | Main Technical Results | Related Work | Experiments | Conclusion

Outline: Introduction | Overview of Our Approach | Main Technical Results | Related Work | Experiments | Conclusion

Motivation
Interactive BI queries:
- Aggregate queries with complicated and possibly selective predicates
- Large data size, but responses expected in seconds
An answer with small error is fine:
- The general trend is enough for decision making
- Error exists in visualization anyway
So an error guarantee is important

Problem Definition
Aggregate queries over a single table T, with group-by dimensions and a measure attribute; extensible to SPJ queries (with foreign-key joins between fact and dimension tables):

SELECT A_1, A_2, ..., A_d, Agg(B) FROM T WHERE F GROUP BY A_1, A_2, ..., A_d

- Aggregate function Agg: COUNT, SUM, AVG, COUNT DISTINCT
- Filter predicate F: AND/OR of atomic predicates, where each atomic predicate is an equality or range predicate
Answer and normalize:
- Exact answer (normalized over the G groups): f = [f_1, f_2, ..., f_G]
- Approximate answer: f̂ = [f̂_1, f̂_2, ..., f̂_G]
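To make the "normalized answer" concrete, here is a minimal sketch (my own toy example, not from the slides; the table and group values are made up) of turning an exact group-by COUNT into the distribution vector f:

```python
from collections import Counter

# Toy rows of T that satisfy the predicate F; the city values are hypothetical.
rows = ["Seattle", "Seattle", "NYC", "NYC", "NYC", "NYC", "NYC", "NYC", "NYC", "NYC"]

counts = Counter(rows)                          # per-group COUNT(*)
total = sum(counts.values())                    # number of rows satisfying F
f = {g: c / total for g, c in counts.items()}   # normalized answer distribution
print(f)  # {'Seattle': 0.2, 'NYC': 0.8}, i.e. f = [f_1, ..., f_G]
```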

Problem Definition: Error Guarantee
Measuring error over the whole answer distribution:
‖f̂ − f‖_2 = sqrt( Σ_g ( f̂_g − f_g )^2 )
- A guarantee on the whole distribution (in the same spirit as information gain in decision trees, which is also defined on a distribution)
- Semantics of the guarantee: viewing f = [f_1, f_2] and f̂ = [f̂_1, f̂_2] as pie charts, the error is the total difference of the sector areas
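As a quick worked example (the numbers are my own, not from the slides), the error above can be computed directly from its definition:

```python
import math

f     = [0.20, 0.80]   # exact normalized answer (hypothetical)
f_hat = [0.23, 0.77]   # approximate answer obtained from a sample (hypothetical)

# ||f_hat - f||_2 = sqrt( sum over groups of (f_hat_g - f_g)^2 )
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_hat, f)))
print(round(err, 4))   # 0.0424, which would satisfy an error bound eps = 0.05
```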

Problem Definition: Goal of Our System
Given an error bound ε (e.g., 5%):
- Return an answer with ‖f̂ − f‖_2 ≤ ε for any query
- Pre-built sample and index: query-independent, with size sublinear or linear in the original data size
- Sublinear processing time
- A constant number of random I/Os (if needed)

Outline: Introduction | Overview of Our Approach | Main Technical Results | Related Work | Experiments | Conclusion

Overview of Our Approach (system architecture)
Goal: an ε-approximate answer f̂ with ‖f̂ − f‖_2 ≤ ε, where N is the size of the fact table and ε the error bound.
Online, the Query Processor answers queries using structures built offline (via MetaData Dictionary, Columnizer, and DataConnector modules) over the data source, e.g., a SQL DB:
- Sample: in-memory random samples, each of size at most O(√N / ε^2)
- LF index (on-disk): scans at most O(√N) rows for each query
- MA index (on-disk): issues O(min(√N, 1/ε^2)) random accesses for each query

Outline: Introduction | Overview of Our Approach | Main Technical Results | Related Work | Experiments | Conclusion

Main Technical Results
The results below cover the architecture component by component:
- Sample: in-memory random samples, each of size at most O(√N / ε^2)
- LF index (on-disk): scans at most O(√N) rows for each query
- MA index (on-disk): issues O(min(√N, 1/ε^2)) random accesses for each query
(N: size of the fact table; ε: error bound; goal: ‖f̂ − f‖_2 ≤ ε)

Sampling for COUNT Aggregates
Query: SELECT A_1, A_2, ..., A_d, COUNT(*) FROM T WHERE F GROUP BY A_1, A_2, ..., A_d
Simple uniform sampling:
- (offline) Pick a uniform random sample S of size O(1/ε^2) from table T
- (online) Run the query on S to get f̂
Lemma I (corollary of [Li et al., PODS 2015]): if the query has NO predicate, then w.h.p. ‖f̂ − f‖_2 ≤ ε
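A minimal sketch of this uniform-sampling estimator (my own illustration; the table, column name, and sample-size constant are hypothetical, and a real system would draw the sample offline rather than per query):

```python
import random
from collections import Counter

def approx_group_count(rows, group_key, eps=0.05, seed=0):
    """Approximate the normalized GROUP BY COUNT(*) answer from a uniform sample.

    rows: list of dicts representing table T; group_key: the group-by column.
    Sample size is O(1/eps^2); the constant factor is taken to be 1 here.
    """
    random.seed(seed)
    n = min(len(rows), int(1 / eps ** 2))
    sample = random.sample(rows, n)                      # offline: uniform sample S
    counts = Counter(r[group_key] for r in sample)       # online: run the query on S
    return {g: c / n for g, c in counts.items()}         # normalized answer f_hat

# Hypothetical usage on a toy table:
T = [{"city": "Seattle"}] * 2000 + [{"city": "NYC"}] * 8000
print(approx_group_count(T, "city"))  # roughly {'Seattle': 0.2, 'NYC': 0.8}
```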

Sampling for COUNT Aggregates with Predicates
Query: SELECT A_1, A_2, ..., A_d, COUNT(*) FROM T WHERE F GROUP BY A_1, A_2, ..., A_d
Uniform sampling for queries with predicates:
- (offline) Pick a uniform random sample S of size O(√N / ε^2) from table T, where N = |T|
- (online) Run the query on S to get f̂
Lemma II (our new result): let s = N_F / N be the selectivity of predicate F; if s ≥ 1/√N, then w.h.p. ‖f̂ − f‖_2 ≤ ε
- For SUM with the same sample, ‖f̂ − f‖_2 ≤ Δ^1.5 · ε, where Δ is the range of the measure

Sampling for SUM Aggregates with Predicates
Query: SELECT A_1, A_2, ..., A_d, SUM(B) FROM T WHERE F GROUP BY A_1, A_2, ..., A_d
Weighted sampling and a Horvitz-Thompson-like (HT-like) estimator, where SUM(B) is the SUM aggregate on measure attribute B:
- (offline) Pick a random sample S of size O(√N / ε^2) from table T, weighted on B (rows are drawn with probability proportional to their B value)
- (online) Run the query on S with a COUNT aggregate in place of SUM
Lemma III (our new result): if s ≥ 1/√N, then w.h.p. ‖f̂ − f‖_2 ≤ ε
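A minimal sketch of this measure-weighted sampling (my own illustration; it ignores the predicate F for simplicity, and the table, column names, and sample-size constant are hypothetical). Because rows are drawn with probability proportional to B, the fraction of sampled rows in each group already estimates SUM_g(B) / SUM_total(B), so a plain COUNT on the sample suffices:

```python
import random
from collections import Counter

def approx_group_sum(rows, group_key, measure, eps=0.05, seed=0):
    """Approximate the normalized GROUP BY SUM(B) answer via B-weighted sampling."""
    random.seed(seed)
    n = int(1 / eps ** 2)                                 # sample size, constant = 1
    weights = [r[measure] for r in rows]
    sample = random.choices(rows, weights=weights, k=n)   # weighted, with replacement
    counts = Counter(r[group_key] for r in sample)        # COUNT on the sample
    return {g: c / n for g, c in counts.items()}          # estimates normalized SUM(B)

# Hypothetical usage: 'city' groups, 'rev' as the measure B.
T = [{"city": "Seattle", "rev": 10}] * 1000 + [{"city": "NYC", "rev": 30}] * 1000
print(approx_group_sum(T, "city", "rev"))  # roughly {'Seattle': 0.25, 'NYC': 0.75}
```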

When Sampling is Not Sufficient
For low predicate selectivity:
- Table T has N rows; the global sample S has n rows
- Given query Q, the number of rows satisfying F is N_F < √N
- So the sample has NOT collected enough rows satisfying F: n_F < 1/ε^2

When Sampling is Not Sufficient: Solution
Given query Q with N_F < √N rows satisfying F:
- Since N_F is small, we can afford to access ALL the rows satisfying F, with the help of the LF index (a simplified sketch follows below)
- But what if there are √N such rows and they are not stored sequentially?
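The actual LF (low-frequency) index is more involved; purely as an assumption-laden sketch of the idea "touch only the few rows that satisfy a very selective F", one can picture per-column value-to-row-id lists that are intersected at query time (the column names, predicate format, and function names below are hypothetical):

```python
from collections import Counter, defaultdict

def build_value_index(rows, columns):
    """Simplified stand-in for an LF-style index: column -> value -> row ids."""
    index = {c: defaultdict(list) for c in columns}
    for rid, r in enumerate(rows):
        for c in columns:
            index[c][r[c]].append(rid)
    return index

def exact_count_via_index(rows, index, predicate, group_key):
    """Answer GROUP BY COUNT(*) exactly, touching only rows that match a
    conjunctive equality predicate such as {"country": "US", "os": "Win"}."""
    matching = None
    for col, val in predicate.items():
        ids = set(index[col].get(val, []))
        matching = ids if matching is None else matching & ids
    counts = Counter(rows[rid][group_key] for rid in (matching or set()))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}
```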

When Sampling is Not Sufficient: Random Access via the MA Index
Given query Q with N_F < √N rows satisfying F, and those rows not stored sequentially:
- Use a per-column non-clustered index (the MA index) to locate the rows satisfying F
- Instead of fetching all of them, get a (1/ε^2)-sample of their row addresses and issue random accesses only for those

Augment the Index with Approximate Measures
Query: SELECT A_1, A_2, ..., A_d, SUM(B) FROM T WHERE F (very selective) GROUP BY A_1, A_2, ..., A_d
Approximate weighted sampling, where SUM(B) is the SUM aggregate on measure B:
- (offline) Attach to the index an approximate measure B' such that 1 ≤ B'/B ≤ 2
- (online) Look up a random sample S of size O(1/ε^2), weighted on B'
- (online) Issue the random I/Os and run the query on S with an HT-like estimator; no predicate is needed at this point, since the index lookup already restricts S to rows satisfying F
Lemma VI (our new result): w.h.p., ‖f̂ − f‖_2 ≤ ε
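To make the HT-style correction concrete, here is a rough sketch (my own construction, not the paper's implementation; `postings` and `fetch_row` are hypothetical stand-ins for an index entry and the random-I/O row lookup): row addresses are sampled with probability proportional to the stored approximate measure B', and each fetched row is reweighted by B / B' to remove the bias before normalizing per group:

```python
import random
from collections import defaultdict

def approx_sum_via_index(postings, fetch_row, group_key, measure, eps=0.05, seed=0):
    """Sketch of SUM(B) estimation for a very selective predicate.

    postings: list of (row_id, approx_measure) pairs stored in the index for the
    rows satisfying F, where 1 <= approx_measure / true_measure <= 2.
    fetch_row: callable simulating one random I/O, returning the full row by id.
    """
    random.seed(seed)
    k = int(1 / eps ** 2)
    ids = [rid for rid, _ in postings]
    w = [m for _, m in postings]
    picked = random.choices(ids, weights=w, k=k)   # sample addresses weighted on B'
    approx = dict(postings)

    per_group = defaultdict(float)
    for rid in picked:
        row = fetch_row(rid)                       # one random I/O per sampled row
        # HT-style correction: weight each row by true B over approximate B'.
        per_group[row[group_key]] += row[measure] / approx[rid]
    total = sum(per_group.values())
    return {g: v / total for g, v in per_group.items()}
```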

Outline: Introduction | Overview of Our Approach | Main Technical Results | Related Work | Experiments | Conclusion

Related Work
- OLAP (data cubes) and columnstore indexes
- Online aggregation
- Offline sampling (e.g., BlinkDB)
- Deterministic approximate query processing (DAQ)

Outline: Introduction | Overview of Our Approach | Main Technical Results | Related Work | Experiments | Conclusion

Experiments: TPC-H
- TPC-H schema (eight tables, 300M rows in LINEITEM)
- Queries with 1-4 group-by dimensions and 0-4 predicate dimensions
- 100x speedup for more than 80% of the queries
- The actual error is "always" smaller than the requested ε (consistent with the theoretical results)

Experiments: Real Workload
- A real enterprise log table with 30 dimensions and 1B+ rows; queries with 1-4 group-by dimensions and 0-4 predicate dimensions
- Methods compared: SPS (our method), DBX (a commercial RDBMS with a columnstore), BLK (BlinkDB), SMG (small-group sampling)
- The sample part scales sublinearly, while the index part tends to be constant
- Adjusting ε gives a better accuracy-time tradeoff

Outline: Introduction | Overview of Our Approach | Main Technical Results | Related Work | Experiments | Conclusion

Conclusion Time flies! Let’s process everything faster!

When Sampling is Sufficient
For high predicate selectivity:
- Table T has N rows; the global sample S has O(√N / ε^2) rows
- Given query Q, the number of rows satisfying F is N_F ≥ √N
- So the sample has collected enough rows satisfying F: n_F ≥ 1/ε^2

When Sampling is Sufficient: Answering from the Sample
Given query Q with N_F ≥ √N rows satisfying F (so the global sample S of O(√N / ε^2) rows contains n_F ≥ 1/ε^2 of them):
- Run Q on S to obtain an ε-approximation
- Lemma II (COUNT) and Lemma III (SUM) give the guarantee; a sketch of the overall sample-vs-seek decision follows below
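Putting the two cases together, a minimal sketch of the per-query sample-vs-seek decision (my own paraphrase of the slides; the function name and the unit constant in the threshold are illustrative):

```python
def choose_path(n_f_in_sample, eps):
    """Decide between 'sample' and 'seek' for one query.

    n_f_in_sample: number of rows in the global sample S that satisfy the
    predicate F. If S already holds >= 1/eps^2 such rows, Lemmas II/III give
    the epsilon-guarantee directly; otherwise fall back to the LF/MA indexes.
    """
    if n_f_in_sample >= 1 / eps ** 2:
        return "sample"   # run the query on S (Lemma II for COUNT, Lemma III for SUM)
    return "seek"         # scan via the LF index or issue random I/Os via the MA index

# Hypothetical usage: with eps = 0.05 the threshold is 400 matching sample rows.
print(choose_path(1200, 0.05))  # 'sample'
print(choose_path(37, 0.05))    # 'seek'
```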