Workload-Aware Aggregate Maintenance in Columnar In-Memory Databases Stephan Müller, Lars Butzmann, Stefan Klauck, Hasso Plattner 2013 IEEE International.

Presentation transcript:

Workload-Aware Aggregate Maintenance in Columnar In-Memory Databases
Stephan Müller, Lars Butzmann, Stefan Klauck, Hasso Plattner
2013 IEEE International Conference on Big Data
Presented 01 May 2014, SNU IDB Lab., Namyoon Kim

2 / 28 Outline
- Introduction
- Related Work
- Workloads
- Aggregate Maintenance Strategies
- Switching Between Aggregate Maintenance Strategies
- Benchmarks
- Conclusion

3 / 28 Introduction
OLTP/OLAP
- Transactional and analytical queries have traditionally been associated with separate applications
- However, this is no longer the case
ATP (available-to-promise)
- OLTP: product stock movements
- OLAP: aggregate over product movements to determine delivery dates for customers
Financial Accounting
- OLTP: document creation
- OLAP: profit and loss statements

4 / 28 Execution Speedup
Materialized View
- Database view whose tuples are persisted in the database
Materialized Aggregate
- Materialized view whose creation query contains aggregations
Columnar In-Memory Database
- IMDBs such as SAP HANA, Hyrise, or HyPer are separated into a read-optimized main storage and a write-optimized delta storage
- All data changes of a table are propagated to the delta storage
- Periodically, the main storage is combined with the delta storage (merge operation)

5 / 28 Merge Update
Merge update
- Novel view maintenance strategy for IMDBs with a main-delta architecture
- The materialized aggregate table only contains data from main storage
- Query results are produced by aggregating the delta on the fly and combining the result with the materialized aggregate table
- Outperforms other view maintenance strategies for workloads with high insert ratios
- However, it is not the ideal choice for the full range of insert ratios
Goals
- Propose and evaluate an adaptive, workload-aware materialized aggregate engine
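
The merge update idea can be illustrated with a minimal Python sketch. It assumes a SUM aggregate over `(key, amount)` rows; the class name `MergeUpdateAggregate` and its methods are hypothetical and are not taken from the paper or from any IMDB. The step that refreshes the aggregate during `merge()` is an assumption that follows from the requirement that the materialized aggregate always reflects main storage.

```python
class MergeUpdateAggregate:
    """Toy SUM aggregate kept over main storage only (merge update sketch)."""

    def __init__(self, main_rows):
        self.delta = []          # write-optimized delta: rows inserted since the last merge
        self.materialized = {}   # materialized aggregate: key -> SUM(amount) over main
        for key, amount in main_rows:
            self.materialized[key] = self.materialized.get(key, 0) + amount

    def insert(self, key, amount):
        # Inserts only touch the delta; no aggregate maintenance happens here.
        self.delta.append((key, amount))

    def select(self, key):
        # Aggregate the delta on the fly and combine it with the stored value.
        delta_sum = sum(a for k, a in self.delta if k == key)
        return self.materialized.get(key, 0) + delta_sum

    def merge(self):
        # Merge operation: delta rows become part of main, so the
        # materialized aggregate is brought up to date and the delta is emptied.
        for key, amount in self.delta:
            self.materialized[key] = self.materialized.get(key, 0) + amount
        self.delta = []
```

In this sketch an insert is a plain append, which is why the strategy suits insert-heavy workloads, while every select pays for a scan of the delta.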

6 / 28 Related Work
- Overview and related issues of materialized views [1]
- Database vendors on the problem of materialized view maintenance [2], [3]
- Materialized view research in data warehousing environments [4], [5], [6], [7]
  - Different from this scenario: maintenance downtimes are acceptable there
- Importance of automated physical database design [8]
  - Index and materialized view selection based on changing workloads
- Extended definition of workload [9]
  - Not only ratios of query types in a workload, but also their sequence

7 / 28 Workloads
Workload
- A database's workload is characterized by its queries
Queries
- Single inserts changing the base table
- Selects querying single aggregate values
- A workload can be described by its insert ratio and select ratio
  - Insert ratio: number of insert queries in relation to the total number of queries
  - Select ratio: 1 - insert ratio
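
As a small worked example of these definitions, the following Python sketch computes both ratios for a list of queries. The function name `workload_ratios` and the string labels `"insert"` and `"select"` are assumptions made for illustration.

```python
def workload_ratios(queries):
    """Return (insert_ratio, select_ratio) for a sequence of query labels.

    Each element of `queries` is assumed to be either "insert" or "select".
    """
    queries = list(queries)
    if not queries:
        raise ValueError("workload must contain at least one query")
    inserts = sum(1 for q in queries if q == "insert")
    insert_ratio = inserts / len(queries)
    return insert_ratio, 1.0 - insert_ratio
```

For one insert followed by three selects this gives an insert ratio of 0.25 and a select ratio of 0.75.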

8 / 28 Evaluation Patterns

9 / 28 Aggregate Maintenance Strategies
Cost functions
- Required time to access the aggregate
- Required time to maintain the aggregate

10 / 28 Break Even Point
- Strategies compared: smart lazy incremental update (SLIU) and merge update (MU)
- The break even point is the workload characteristic at which the best-performing strategy changes

11 / 28 Smart Lazy Incremental Update
- For read-intensive workloads
- Maintenance is done when reading the materialized aggregate
- After processing a select, the requested aggregate is up to date
Aggregate maintenance
- A dictionary structure stores the changes caused by inserts since the last maintenance point
- Multiple changes for the same aggregated value are combined into one value to increase performance
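
A minimal Python sketch of this behaviour follows, assuming a SUM aggregate and a single grouping key. The class name `SmartLazyIncrementalAggregate` and the attribute `pending` (standing in for the dictionary structure) are hypothetical. For brevity, the sketch maintains only the requested key on a select, not a bulk of relevant values.

```python
class SmartLazyIncrementalAggregate:
    """Toy SUM aggregate maintained lazily, at read time (SLIU sketch)."""

    def __init__(self, base_rows):
        self.materialized = {}   # key -> SUM(amount) as of the last maintenance
        for key, amount in base_rows:
            self.materialized[key] = self.materialized.get(key, 0) + amount
        self.pending = {}        # dictionary structure: key -> combined pending change

    def insert(self, key, amount):
        # Changes for the same aggregated value are combined into one entry.
        self.pending[key] = self.pending.get(key, 0) + amount

    def select(self, key):
        # Maintenance happens on read: apply the pending change for this key,
        # so the requested aggregate is up to date after the select.
        if key in self.pending:
            self.materialized[key] = self.materialized.get(key, 0) + self.pending.pop(key)
        return self.materialized.get(key, 0)
```

Two inserts for the same key between selects result in one dictionary entry and one maintenance step, which is the combining effect described on the slide.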

12 / 28 SLIU Cost (1)
- $T_{select}$: average time for a single read of an aggregate
  - Multiplied by the select ratio ($R_{select}$) to weight the cost, since this cost is not incurred for inserts
- $T_{dict} + T_{maintenance}$: cost of a single maintenance activity
  - Increases with an increasing insert ratio ($R_{insert}$), since each insert requires a maintenance activity with a corresponding aggregate request
Optimization
- The maintenance cost can be optimized; two scenarios are distinguished:
  1. When $R_{insert} \le 0.5$, $R_{insert} \times (T_{dict} + T_{maintenance})$ is linear
  2. When $R_{insert} > 0.5$, $R_{insert} \times (T_{dict} + T_{maintenance})$ is smaller because of:
     - The possibility of combining multiple values with the same grouping attributes in the dictionary structure
     - Bulk maintenance, where all relevant values from the dictionary structure are processed together

13 / 28 SLIU Cost (2)
- Cost for a single query
- Optimization function
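
The formulas on this slide did not survive in the transcript. The expression below is a reconstruction of the unoptimized per-query cost from the term definitions on the preceding slide, and should be read as an assumption, not as the authors' exact formula. The label $C_{SLIU}$ is introduced here for convenience. The optimization function for $R_{insert} > 0.5$ cannot be recovered from the transcript and is not reconstructed.

```latex
\[
C_{SLIU} = R_{select} \times T_{select} + R_{insert} \times (T_{dict} + T_{maintenance}),
\qquad R_{select} = 1 - R_{insert}
\]
```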

14 / 28 Algorithm (SLIU)
Setup
- A dictionary structure is required to store the inserts that occur between two select queries
Tear down
- The values from the dictionary structure have to be included in the materialized aggregate

15 / 28 Merge Update in Action

16 / 28 Merge Update Cost
- MU only incurs costs when an aggregate is requested
- The cost is higher than that of SLIU because of the delta storage access
- $T_{delta}$: cost for aggregating on the delta
- $T_{union}$: cost to combine the results of the aggregate read ($T_{select}$) and the delta aggregation ($T_{delta}$)
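
The cost terms for both strategies can be compared in a short Python sketch. The MU cost used here, the select ratio multiplied by the sum of $T_{select}$, $T_{delta}$ and $T_{union}$, is an assumption derived from the statement that MU only incurs costs on selects. The SLIU cost is the unoptimized linear form built from the terms on the SLIU cost slides. The names `Timings`, `cost_sliu`, `cost_mu`, and `break_even_insert_ratio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timings:
    """Average times per activity, in arbitrary but consistent units."""
    t_select: float
    t_dict: float
    t_maintenance: float
    t_delta: float
    t_union: float


def cost_sliu(r_insert, t):
    # Unoptimized SLIU cost per query: reads plus per-insert maintenance.
    r_select = 1.0 - r_insert
    return r_select * t.t_select + r_insert * (t.t_dict + t.t_maintenance)


def cost_mu(r_insert, t):
    # Assumed MU cost per query: only selects cost anything, but each
    # select also aggregates the delta and unions the two results.
    r_select = 1.0 - r_insert
    return r_select * (t.t_select + t.t_delta + t.t_union)


def break_even_insert_ratio(t):
    # Solve cost_sliu(r) == cost_mu(r) for r under the two models above.
    mu_extra = t.t_delta + t.t_union
    return mu_extra / (mu_extra + t.t_dict + t.t_maintenance)
```

With the made-up timings `Timings(1, 1, 3, 2, 2)`, the two models meet at an insert ratio of 0.5: below it SLIU is cheaper, above it MU is cheaper. Real break even points depend on measured timings.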

17 / 28 MU Setup and Tear Down
Setup
- After switching, the materialized aggregate table contains the records of both main and delta
- Values from the delta have to be subtracted from the materialized aggregate so that it only contains main storage records
- Alternatively, a merge can be performed to transfer the delta into main storage
Tear down
- Values from the delta have to be included in the materialized aggregate
- The delta values are aggregated and the result is used to update the materialized main aggregate
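
The setup and tear down steps can be sketched in Python as two functions over a SUM aggregate stored as a dictionary. The function names `mu_setup` and `mu_tear_down` and the `(key, amount)` row layout are assumptions for illustration. The alternative setup path (running a merge) is not shown.

```python
def mu_setup(materialized, delta_rows):
    """Switch to MU: subtract delta values so the aggregate covers main only."""
    main_only = dict(materialized)
    for key, amount in delta_rows:
        main_only[key] = main_only.get(key, 0) - amount
    return main_only


def mu_tear_down(materialized_main, delta_rows):
    """Leave MU: aggregate the delta once, then fold it into the aggregate."""
    delta_aggregate = {}
    for key, amount in delta_rows:
        delta_aggregate[key] = delta_aggregate.get(key, 0) + amount
    full = dict(materialized_main)
    for key, total in delta_aggregate.items():
        full[key] = full.get(key, 0) + total
    return full
```

In this sketch the two functions are inverses of each other for an unchanged delta, so a setup followed immediately by a tear down restores the original aggregate.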

18 / 28 Algorithm (MU)

19 / 28 Switching Strategies
- The main influence factor is $R_{insert}$
How to determine the current insert ratio?
- Track the last n queries
- Size of the delta storage
No switching
- Does not switch between different view maintenance strategies; baseline for the benchmark
Switching
- Each time the system determines the current insert ratio, it chooses the optimal strategy as soon as possible
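
The "track the last n queries" option can be sketched in Python with a fixed-size window. The class name `StrategySwitcher`, the default window size, and the rule "MU above the break even point, SLIU otherwise" are assumptions. Setup and tear down costs of a switch are ignored here, in line with the simple switching algorithm described on the slide.

```python
from collections import deque


class StrategySwitcher:
    """Choose a maintenance strategy from the insert ratio of recent queries."""

    def __init__(self, break_even, window=100):
        self.break_even = break_even          # insert ratio where the best strategy changes
        self.recent = deque(maxlen=window)    # True for insert, False for select
        self.strategy = "SLIU"

    def observe(self, query_type):
        # Record the query; the deque drops the oldest entry once it is full.
        self.recent.append(query_type == "insert")
        r_insert = sum(self.recent) / len(self.recent)
        # Switch immediately to whichever strategy is better for this ratio.
        self.strategy = "MU" if r_insert > self.break_even else "SLIU"
        return self.strategy
```

A small window reacts quickly but may switch back and forth near the break even point; a larger window smooths the estimate at the price of slower reaction.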

20 / 28 Test Setup - Architecture
- Uses SanssouciDB

21 / 28 Test Setup - Data
- 1M-record base table
- Aggregates are maintained incrementally
- 4,000-record materialized aggregate (i.e., date-product combinations)
- Selects querying aggregates filtered by product
- Inserts with about 1,000 different date-product combinations

22 / 28 Test Setup – Workload and Hardware
Workload
- 20k queries
- 200 phases of constant insert ratios
- Between consecutive phases, the insert ratio can stay constant or increase/decrease by 10%
Hardware
- 8 × Intel Xeon E5450, 3 GHz, 12 MB cache
- 64 GB main memory
Benchmark
- Every benchmark is run at least three times
- The result is the median of the three runs
Switching vs. no switching
- No switching is run twice: once using MU, once using SLIU
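
A workload with this shape can be generated with the Python sketch below. It makes several assumptions: 100 queries per phase (20k queries over 200 phases), a step of 0.1 in the insert ratio, a random starting ratio, and clamping to a configurable range. The function names `generate_phases` and `generate_queries` are hypothetical, and this is not the authors' benchmark driver.

```python
import random


def generate_phases(n_phases=200, low=0.0, high=1.0, seed=0):
    """Random walk of insert ratios in steps of 0.1, clamped to [low, high]."""
    rng = random.Random(seed)
    lo, hi = round(low * 10), round(high * 10)   # work in integer tenths
    level = rng.randint(lo, hi)
    ratios = []
    for _ in range(n_phases):
        ratios.append(level / 10)
        # Stay constant, or move one step up or down, without leaving the range.
        level = min(hi, max(lo, level + rng.choice((-1, 0, 1))))
    return ratios


def generate_queries(ratios, queries_per_phase=100, seed=0):
    """Expand each phase into a shuffled list of "insert"/"select" labels."""
    rng = random.Random(seed)
    queries = []
    for ratio in ratios:
        n_insert = round(ratio * queries_per_phase)
        phase = ["insert"] * n_insert + ["select"] * (queries_per_phase - n_insert)
        rng.shuffle(phase)
        queries.extend(phase)
    return queries
```

Calling `generate_phases(low=0.2, high=0.6)` restricts the walk to an interval around a presumed break even point, similar in spirit to the restricted random workloads evaluated later in the deck.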

23 / 28 Evaluation Patterns Revisited

24 / 28 Basic Workload Patterns

25 / 28 Random Workloads - Ranges
- [0, 1] (a–c): covers the largest possible interval
  - Switching improvement should be greatest
- [0.2, 0.6] (d–f): covers the area near the break even point
  - Switching improvement should be lower
- [0, 0.5] (g–i): interval beneficial for SLIU
- [0.3, 0.8] (j–l): interval beneficial for MU

26 / 28 Random Workloads - Results

27 / 28 Conclusion
Contributions
- Motivated the importance of materialized view maintenance in columnar IMDBs with mixed database workloads
- Proposed an algorithm to select the optimal view maintenance strategy
  - Based on the ratio between reads of the materialized view and inserts to the base table affecting the view
Future Work
- Extend the simple switching algorithm to evaluate workload history and switching cost
- Use machine learning to predict future workload changes

28 / 28 References
[1] A. Gupta and I. S. Mumick. Maintenance of materialized views: Problems, techniques, and applications. IEEE Data Eng. Bull.
[2] R. G. Bello, K. Dias, A. Downing, J. J. F. Jr., J. L. Finnerty, W. D. Norcott, H. Sun, A. Witkowski, and M. Ziauddin. Materialized views in Oracle. In VLDB, pages 659–664.
[3] J. Zhou, P.-A. Larson, and H. G. Elmongui. Lazy maintenance of materialized views. In VLDB, pages 231–242.
[4] Y. Zhuge, H. García-Molina, J. Hammer, and J. Widom. View maintenance in a warehousing environment. In SIGMOD, pages 316–327.
[5] D. Agrawal, A. El Abbadi, A. Singh, and T. Yurek. Efficient view maintenance at data warehouses. In SIGMOD.
[6] H. Jain and A. Gosain. A comprehensive study of view maintenance approaches in data warehousing evolution. SIGSOFT Softw. Eng. Notes.
[7] I. S. Mumick, D. Quass, and B. S. Mumick. Maintenance of data cubes and summary tables in a warehouse. In SIGMOD.
[8] S. Chaudhuri and V. Narasayya. Self-tuning database systems: a decade of progress. In VLDB.
[9] S. Agrawal, E. Chu, and V. Narasayya. Automatic physical design tuning: Workload as a sequence. In SIGMOD, 2006.