Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance TILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL.

Presentation transcript:

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance
TILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL
MIDDLEWARE SYSTEMS RESEARCH GROUP, MSRG.ORG
ICPE 2013, PRAGUE, 24/04/2013

Real Life Data is Distributed Uniformly… Well, Not Really
• Customer zip codes are typically clustered around metropolitan areas
• Seasonal items (lawn mowers, snow shovels, …) are sold mostly during specific periods
• US retail sales:
  ◦ Peak during the holiday season
  ◦ December sales are 2x January sales
Source: US Census Data
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

Student Seminar Signup Distribution

How Can Skew Affect Database Systems?
• Data placement
  ◦ Partitioning
  ◦ Indexing
• Data structures
  ◦ Tree balance
  ◦ Bucket fill ratio
  ◦ Histograms
• Optimizer: finding the optimal query plan
  ◦ Index vs. non-index driven plans
  ◦ Hash join vs. merge join
  ◦ Hash group by vs. sort group by

Agenda
• Data Skew in Current Benchmarks
• Star Schema Benchmark (SSB)
• Parallel Data Generation Framework (PDGF)
• Introducing Skew in SSB

Data Skew in Benchmarks
• TPC-D ( ): only uniform data
  ◦ SIGMOD: “Successor of TPC-D should include data skew”
  ◦ No effect until …
• TPC-DS (released 2012)
  ◦ Contains comparability zones
  ◦ Not fully utilized
• TPC-D/H variations
  ◦ Chaudhuri and Narasayya: Zipfian distribution on all columns
  ◦ Crolotte and Ghazal: comparability zones
• Still lots of open potential

Star Schema Benchmark I
• Star schema version of TPC-H
  ◦ Merged Order and Lineitem
  ◦ Date dimension
  ◦ Dropped Partsupp
  ◦ Selectivity hierarchies
    ◦ C_City → C_Nation → C_Region
    ◦ …

Star Schema Benchmark II
• Completely new set of queries
• 4 flights of 3-4 queries
  ◦ Designed for functional coverage and selectivity coverage
  ◦ Drill down in dimension hierarchies
  ◦ Predefined selectivity
Drilldown from Q1.1 to Q1.2:
Q1.1
    select sum(lo_extendedprice*lo_discount) as revenue
    from lineorder, date
    where lo_orderdate = d_datekey
      and d_year = 1993
      and lo_discount between 1 and 3
      and lo_quantity < 25;
Q1.2
    select sum(lo_extendedprice*lo_discount) as revenue
    from lineorder, date
    where lo_orderdate = d_datekey
      and d_yearmonthnum =
      and lo_discount between 1 and 3
      and lo_quantity between 26 and 35;

Parallel Data Generation Framework
• Generic data generation framework
• Relational model
  ◦ Schema specified in configuration file
  ◦ Post-processing stage for alternative representations
• Repeatable computation
  ◦ Based on XORSHIFT random number generators
  ◦ Hierarchical seeding strategy
References:
Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12.
Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11.
Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11.
Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10.
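
The repeatable-computation idea can be illustrated with a minimal sketch. It uses Marsaglia's 64-bit xorshift step (shift triple 13, 7, 17) and derives one seed per table, column, and row, so any field value can be recomputed independently of all others. The function names, the mixing constant, and the seed-derivation rule are assumptions made for illustration; they are not taken from PDGF itself.

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # arbitrary odd mixing constant (assumption)


def xorshift64(state: int) -> int:
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def derive_seed(parent_seed: int, child_index: int) -> int:
    """Hypothetical derivation: mix the child index into the parent seed."""
    mixed = (parent_seed ^ ((child_index + 1) * GOLDEN)) & MASK64
    if mixed == 0:
        mixed = GOLDEN  # xorshift must never be started from state 0
    return xorshift64(mixed)


def field_value(project_seed: int, table_id: int, column_id: int,
                row_id: int, modulus: int) -> int:
    """Recompute one field value from its position in the seed hierarchy."""
    table_seed = derive_seed(project_seed, table_id)
    column_seed = derive_seed(table_seed, column_id)
    row_seed = derive_seed(column_seed, row_id)
    return xorshift64(row_seed) % modulus


if __name__ == "__main__":
    # The same coordinates always give the same value, in any order
    # and on any worker, which is what makes parallel generation repeatable.
    print(field_value(1234, table_id=0, column_id=2, row_id=10, modulus=50))
    print(field_value(1234, table_id=0, column_id=2, row_id=10, modulus=50))
```

Because each value depends only on its coordinates and the project seed, rows can be generated in parallel or regenerated on demand without storing generator state.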

Configuring PDGF
Schema configuration
• Relational model
  ◦ Tables, fields
• Properties
  ◦ Table size, characters, …
• Generators
  ◦ Simple generators
  ◦ Metagenerators
• Update definition
  ◦ Insert, update, delete
  ◦ Generated as change data capture
Configuration excerpt: ${S} <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true"> 0 true 9 Supplier [..]
Diagram labels: PDGF, XML, DB

Opportunities to Inject Data Skew in
• Foreign key relations
  ◦ E.g., L_PARTKEY
• One fact table measures
  ◦ E.g., L_Quantity
• Single dimension hierarchy
  ◦ E.g., P_Brand → P_Category → P_Mfgr
• Multiple dimension hierarchies
  ◦ E.g., City → Nation in Supplier and Customer
Experimental methodology
  ◦ One experiment series for each of the above
  ◦ Comparison to original SSB
  ◦ Comparison of index-forced, non-index, and automatic optimizer mode
  ◦ SSB scale factor 100 (100 GB), x86 server

Skew in Foreign Key Relations
• Very realistic
• Easy to implement in PDGF
  ◦ Just add a distribution to the reference
• But!
  ◦ Dimension attributes uniformly distributed
  ◦ Dimension keys uncorrelated to dimension attributes
  ◦ → Very limited effect on selectivity
• Focus on attributes in selectivity predicates
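
A small sketch can make the "limited effect" argument concrete. It assigns a dimension attribute (here a hypothetical region with 5 values) to 2000 dimension keys at random, then compares the selectivity of a predicate on that attribute when fact rows reference the keys uniformly versus with a linear skew. The key count, the attribute, the seed, and the linear skew are all assumptions for illustration, and the sketch presumes that no single key dominates the fact table.

```python
import random


def predicate_selectivity(key_weights, key_to_region, region):
    """Fraction of fact rows whose referenced dimension row has this region.

    key_weights[k] is the relative frequency with which fact rows
    reference dimension key k.
    """
    total = sum(key_weights)
    hit = sum(w for w, r in zip(key_weights, key_to_region) if r == region)
    return hit / total


NUM_KEYS = 2000   # hypothetical dimension size
NUM_REGIONS = 5   # hypothetical attribute cardinality

# Attribute values are uniform and independent of the key value.
rng = random.Random(42)
key_to_region = [rng.randrange(NUM_REGIONS) for _ in range(NUM_KEYS)]

uniform_weights = [1.0] * NUM_KEYS
skewed_weights = [float(k + 1) for k in range(NUM_KEYS)]  # linear skew on keys

uniform_sel = predicate_selectivity(uniform_weights, key_to_region, 0)
skewed_sel = predicate_selectivity(skewed_weights, key_to_region, 0)

if __name__ == "__main__":
    print(f"uniform foreign keys: {uniform_sel:.4f}")
    print(f"skewed foreign keys:  {skewed_sel:.4f}")
```

Both selectivities stay close to 1/5 because the skew is spread across keys that carry all attribute values alike, which is why the slides move on to skewing the attributes used in predicates.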

Skew in Fact Table Measure – Lo_Quantity
• Lo_Quantity distribution
  ◦ Values range between 0 and 50
  ◦ Originally uniform distribution with:
    ◦ P(X=x)=0.02
    ◦ Coefficient of variation of
  ◦ Proposed skewed distribution with:
• Query 1.1
  ◦ lo_quantity < x, x ∈ [2, 51]
• Results
  ◦ Switches too early to non-index plan
  ◦ Switches too late to non-index plan
  ◦ Optimizer agnostic to distribution
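
The effect of a distribution-agnostic estimate can be sketched as follows. The code compares the true selectivity of `lo_quantity < x` under a uniform and a skewed distribution. It assumes 50 distinct values 1..50 (consistent with P(X=x)=0.02), and the geometric decay used for the skewed case is a hypothetical stand-in, since the proposed distribution is not recoverable from the slide.

```python
QUANTITIES = range(1, 51)  # assumption: 50 distinct values, 1..50
DECAY = 0.9                # hypothetical skew parameter


def selectivity(pmf, x):
    """Fraction of rows that satisfy lo_quantity < x."""
    return sum(p for value, p in pmf.items() if value < x)


# Original SSB: every quantity is equally likely.
uniform_pmf = {q: 0.02 for q in QUANTITIES}

# Hypothetical skew: small quantities are much more frequent.
raw = {q: DECAY ** (q - 1) for q in QUANTITIES}
norm = sum(raw.values())
skewed_pmf = {q: w / norm for q, w in raw.items()}

if __name__ == "__main__":
    for x in (5, 25, 45):
        est = selectivity(uniform_pmf, x)   # what a uniform assumption predicts
        act = selectivity(skewed_pmf, x)    # what the skewed data delivers
        print(f"x={x:2d}  uniform estimate={est:.3f}  skewed actual={act:.3f}")
```

An optimizer that estimates with the uniform column while the data follows the skewed one will pick its index versus non-index crossover at the wrong value of x.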

Skew in Single Dimension Hierarchy - Part
• P_Category distribution
  ◦ Uniform P(X=x)=0.04
  ◦ Skewed P(X=x)=
  ◦ Probabilities explicitly defined
• Query 2.1
  ◦ Restrictions on two dimensions
• Results uniform case
  ◦ Index driven superior
  ◦ Optimizer chooses non-index driven
• Results skewed case
  ◦ Switches too early to non-index plan
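
Drawing values from explicitly defined probabilities can be sketched with an inverse-CDF lookup. The 25 categories match the uniform case P(X=x)=0.04; the concrete skewed probabilities below (five frequent categories at 0.1 each, twenty rare ones at 0.025 each) are invented for illustration and are not the values used in the study.

```python
import bisect
import itertools
import random

# Hypothetical explicit probabilities for 25 categories (sum to 1).
CATEGORY_PROBS = [0.1] * 5 + [0.025] * 20


def make_sampler(probabilities):
    """Return a function mapping u in [0, 1) to a category index."""
    cdf = list(itertools.accumulate(probabilities))

    def sample(u):
        # First index whose cumulative probability exceeds u;
        # the min() guards against floating-point rounding in the last bucket.
        return min(bisect.bisect_right(cdf, u), len(cdf) - 1)

    return sample


sample_category = make_sampler(CATEGORY_PROBS)

if __name__ == "__main__":
    rng = random.Random(7)
    draws = [sample_category(rng.random()) for _ in range(10_000)]
    frequent = sum(1 for d in draws if d < 5) / len(draws)
    print(f"share of draws in the five frequent categories: {frequent:.3f}")
```

Feeding the sampler a repeatable random number, rather than calling a global generator, keeps the skewed column as reproducible as the uniform one.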

Skew in Multiple Dimension Hierarchies – S_City & C_City
• Skewed S_City & C_City
  ◦ Probabilities exponentially distributed
• Query 3.3
  ◦ Restrictions on 3 dimensions
  ◦ Variation on Supplier and Customer city
• Results uniform and skewed cases
  ◦ Automatic plan performs best
  ◦ Crossover between automatic uniform and skewed too late
Figure panels: Join Cardinality, Elapsed Time
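
How exponentially distributed city probabilities change join cardinality can be sketched as below. The code builds a normalized exponential probability mass function over cities and estimates the fraction of fact rows that pass a restriction on both customer and supplier city, treating the two dimensions as independent. The city count of 250, the rate parameter, and the independence assumption are illustrative choices, not values from the study.

```python
import math

NUM_CITIES = 250  # assumption: 25 nations with 10 cities each
RATE = 0.05       # hypothetical decay rate of the exponential skew


def exponential_pmf(n, rate):
    """Probabilities proportional to exp(-rate * i), normalized to sum to 1."""
    weights = [math.exp(-rate * i) for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def join_fraction(customer_pmf, supplier_pmf, cities):
    """Fraction of fact rows whose customer AND supplier city are in `cities`.

    Assumes customer and supplier cities are chosen independently.
    """
    p_customer = sum(customer_pmf[c] for c in cities)
    p_supplier = sum(supplier_pmf[c] for c in cities)
    return p_customer * p_supplier


uniform = [1.0 / NUM_CITIES] * NUM_CITIES
skewed = exponential_pmf(NUM_CITIES, RATE)

if __name__ == "__main__":
    for cities in ((0, 4), (120, 124), (248, 249)):
        u = join_fraction(uniform, uniform, cities)
        s = join_fraction(skewed, skewed, cities)
        print(f"cities {cities}: uniform={u:.2e}  skewed={s:.2e}")
```

Depending on which cities the query names, the same predicate selects far more or far fewer rows than under uniform data, which is the range over which the plan crossover is observed.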

Conclusion & Future Work
• PDGF implementation of SSB
• Introduction of skew in SSB
• Extensive performance analysis
  ◦ Several interesting optimizer effects
  ◦ Performance impact of skew
Future Work
• Further analysis on impact of skew
• Skew in query generation
• Complete suite for testing skew effects

Thanks
Questions?
Download and try PDGF: (scripts used in the study available on website above)

Back-up Slides

Configuring PDGF Generation
Generation configuration
• Defines the output
  ◦ Scheduling
  ◦ Data format
  ◦ Sorting
  ◦ File name and location
• Post processing
  ◦ Filtering of values
  ◦ Merging of tables
  ◦ Splitting of tables
  ◦ Templates (e.g. XML / queries)
Template excerpt:
[..]
<!--
    // Generated field values: year, discount, quantity
    int y = (fields[0].getPlainValue()).intValue();
    int d = (fields[1].getPlainValue()).intValue();
    int q = (fields[2].getPlainValue()).intValue();
    String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR;
    // Emit query Q1.1 with the generated values substituted
    buffer.append("-- Q1.1" + n);
    buffer.append("select sum(lo_extendedprice *");
    buffer.append(" lo_discount) as revenue" + n);
    buffer.append(" from lineorder, date" + n);
    buffer.append(" where lo_orderdate = d_datekey" + n);
    buffer.append(" and d_year = " + y + n);
    buffer.append(" and lo_discount between " + (d - 1));
    buffer.append(" and " + (d + 1) + n);
    buffer.append(" and lo_quantity < " + q + ";" + n);
-->