BlinkDB.

Slides:

Advertisements

Similar presentations

Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research.

Advertisements

OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.

1 Multi-way Algorithm for Cube Computation CPS Notes 8.

Dimensional Modeling Business Intelligence Solutions.

Data Warehousing - 3 ISYS 650. Snowflake Schema one or more dimension tables do not join directly to the fact table but must join through other dimension.

1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.

CS346: Advanced Databases

1 Basic concepts of On-Line Analytical processing DT211 /4.

8/20/ Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously. Defined in many different ways, but.

1 CUBE: A Relational Aggregate Operator Generalizing Group By By Ata İsmet Özçelik.

CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.

1 Cube Computation and Indexes for Data Warehouses CPS Notes 7.

1 Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan Ken :: Yiu Man Lung Implementing Data Cubes Efficiently.

OnLine Analytical Processing (OLAP)

Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.

1 Data Warehouses BUAD/American University Data Warehouses.

October 28, Data Warehouse Architecture Data Sources Operational DBs other sources Analysis Query Reports Data mining Front-End Tools OLAP Engine.

Designing Aggregations. Performance Fundamentals - Aggregations Pre-calculated summaries of data Intersections of levels from each dimension Tradeoff.

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.

1 Chapter 10 Joins and Subqueries. 2 Joins & Subqueries Joins – Methods to combine data from multiple tables – Optimizer information can be limited based.

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)

Presented By Anirban Maiti Chandrashekar Vijayarenu

Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle.

Chapter 4 Logical & Physical Database Design

Data warehousing for Profit Analysis By A Sai Krishna Geethika Lokanadham Mithun Rajanna KV Kumar.

1 Introduction to Database Systems, CS420 SQL Views and Indexes.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.

Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.

Dense-Region Based Compact Data Cube

Just Enough Database Theory for Power Pivot / Power BI

Short History of Data Storage

Table General Guidelines for Better System Performance

CPS216: Data-intensive Computing Systems

Wander Join: Online Aggregation via Random Walks

Introduction to SQL Server Analysis Services

A B D C G5b Date 1Qtr 2Qtr 3Qtr 4Qtr TV Product PC

Remember the Sales Data Cube? Each cell contains a sales measurement, e.g., the number of sales (may contain many other measurements of product-date-country.

Parallel Databases.

So, what was this course about?

On-Line Analytic Processing

Physical Database Design and Performance

A paper on Join Synopses for Approximate Query Answering

Chapter 13 The Data Warehouse

Ripple Joins for Online Aggregation

Data storage is growing Future Prediction through historical data

Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang

Fast Approximate Query Answering over Sensor Data with Deterministic Error Guarantees Chunbin Lin Joint with Etienne Boursier, Jacque Brito, Yannis Katsis,

Data Warehouse.

Database Performance Tuning and Query Optimization

Introduction to Query Optimization

Blazing-Fast Performance:

Chapter 15 QUERY EXECUTION.

COSC 6340 Projects & Homeworks Spring 2002

Physical Database Design

University of Houston-Clear Lake Kaiser Permanente San Jose

External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.

File Organizations and Indexing

Selected Topics: External Sorting, Join Algorithms, …

Table General Guidelines for Better System Performance

DATA CUBES E0 261 Jayant Haritsa Computer Science and Automation

Implementation of Relational Operations

Chapter 11 Database Performance Tuning and Query Optimization

Applying Data Warehouse Techniques

OLAP Defined by E F Codd “relational databases were never intended to provide the very powerful functions for data synthesis, analysis and consolidation.

Data Pre-processing Lecture Notes for Chapter 2

Analytics, BI & Data Integration

Slides based on those originally by : Parminder Jeet Kaur

Applying Data Warehouse Techniques

CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #03 Row/Column Stores, Heap Files, Buffer Manager, Catalogs Instructor: Chen Li.

Presentation transcript:

BlinkDB

BlinkDB: My main takeaways IDEA 1: Organize sampling around query column sets (a) A small set may “cover” all queries From workloads (b) MILP formulation which picks these column sets IDEA 2: Determine on the fly, the qcs that gives the best “bang for the buck” Cute idea of selectivity – number of rows selected divided by the number of rows read Both of these are nice, novel ideas.

Where could the BlinkDB approach fail?

Where could the BlinkDB approach fail? QCSes are not stable # of rare subgroups are high, dimensionality bad For example, if three groups have dimensionality 10000 each, the stratified sample of the cross-product could be GIGANTIC Aqua would also fail

What are the drawbacks of the BlinkDB system? Let’s talk about Optimization Techniques Query Class Repeated Queries

What are the drawbacks of the BlinkDB system: Optimization Techniques?

What are the drawbacks of the BlinkDB system: Optimization Techniques? Not clear how to tune various parameters: K, M whether the MILP is indeed optimal if techniques will apply to other workloads or case studies (two datasets is very limited) Could have done bootstrapping

What are the drawbacks of the BlinkDB system: Query Classes?

What are the drawbacks of the BlinkDB system: Query Classes? Does not handle joins/nesting Very simple queries

Other Drawbacks A QCS either receives K or none at all. Is this wise? Should depend on variance, distribution If all values of cost of sale is 1 for this product, why keep a sample of K, 1 would suffice. If variance is small, a smaller sample may suffice Paper claims partial covering is OK if need be. Is that true? Partial covering may lead to a biased sample May completely miss some groups

Aqua vs. BlinkDB Very similar ideas for offline precomputed samples Is query agnostic, will take full collection of group-by columns to construct stratified sample May be too much Broader class of queries General enough to apply to joins (foreign key)

Alternatives for speeding up latency Offline Approximate Query Processing Better or more hardware Leveraging main memory (e.g., spark) Parallelism (e.g., dremel) Data Cube Materialization (NEXT) Online Approximate Query Processing (NEXT)

OLAP From the 90s Analytics: Business Intelligence Without aggregates, the underlying data can be represented as a snowflake schema The key idea of a data cube to materialize and store aggregates for this snowflake schema

Two representations Normalized Representation Fact tables and dimension tables Denormalized representation One row per transaction

A Sample Data Cube Date Product Country Total annual sales of TV in U.S.A. Date Product Country sum TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr U.S.A Canada Mexico All this data is completely precomputed before any queries arrive!

Data Cube Materialization: Idea from the 90s Sales volume as a function of product, month, and region Country Also called Data Warehousing Product Month

The curse of dimensionality That said, not clear if any of the queries in their set are greater than 7-8 dimensions

Online AQP: Aggregation: Another Idea from the 90s No upfront computation or storage. Sample from each group in a random fashion If you have indexes, can even do stratified sampling Display results + place control in hands of the user Why is this bad??

Why is OLA bad? Random I/O sucks! Remember: Disk seek 10,000,000 ns Read 1 MB sequentially from disk 30,000,000 ns Thus, 1 random I/O seek = reading 1/3 MB sequentially.

Why is OLA bad? Not the case with memory, however!! OLA may still be important That said, memory is FAST, so unless you’re dealing with large datasets (unlikely to keep in memory due to $$$), scanning entire data may still be pretty fast

When is ONAQP/OFAQP/materialization better?

ONAQP, OFAQP, materialization OFAQP, materialization both better when you have space (disk/memory) and a high predictive power Materialization better when (small) fixed set of queries or report generation OFAQP better for ad-hoc queries, ad-hoc measures Both OFAQP and materialization are bad for time-varying data! ONAQP is better only if done in memory, but only LARGE memory – if small memory, not much performance difference between scanning entire thing vs. seeking around